Natural Language Processing - Lecture 6: Viterbi Tagging


Natural Language Processing - Lecture 6
Viterbi Tagging
Syntax
Ido Dagan
Department of Computer Science
Bar-Ilan University
88-680
Stochastic POS Tagging
• POS tagging:
For a given sentence W = w1…wn
Find the matching POS tags T = t1…tn
• In a statistical framework:
T' = arg max_T P(T|W)
T* = arg max_{t1..n} P(t1..n | w1..n)

Bayes' rule:
   = arg max_{t1..n} P(w1..n | t1..n) P(t1..n) / P(w1..n)

The denominator doesn't depend on the tags:
   = arg max_{t1..n} P(w1..n | t1..n) P(t1..n)

Words are independent of each other:
   = arg max_{t1..n} [ Π_{i=1..n} P(wi | t1..n) ] · P(t1..n)

A word's identity depends only on its own tag:
   = arg max_{t1..n} [ Π_{i=1..n} P(wi | ti) ] · P(t1..n)

Chain rule:
   = arg max_{t1..n} [ Π_{i=1..n} P(wi | ti) ] · P(t1) P(t2|t1) P(t3|t1,t2) … P(tn|t1..n-1)

Markovian assumption (bigram):
   = arg max_{t1..n} [ Π_{i=1..n} P(wi | ti) ] · P(t1) P(t2|t1) P(t3|t2) … P(tn|tn-1)

   = arg max_{t1..n} Π_{i=1..n} P(wi | ti) P(ti | ti-1)

Notation: P(t1) = P(t1 | t0)
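To make the last expression concrete, here is a minimal Python sketch (my own illustration, not from the lecture; the probability tables are toy values) that scores a single candidate tag sequence under the bigram model:

import math

def sequence_log_prob(words, tags, emission, transition):
    # log of prod_i P(w_i | t_i) * P(t_i | t_{i-1}), with t0 = "<s>"
    logp = 0.0
    prev = "<s>"
    for w, t in zip(words, tags):
        logp += math.log(emission[t][w]) + math.log(transition[prev][t])
        prev = t
    return logp

# Toy probability tables (made-up values, for illustration only):
emission = {"PRP": {"I": 0.5}, "VBP": {"prefer": 0.1}}
transition = {"<s>": {"PRP": 0.4}, "PRP": {"VBP": 0.3}}
print(sequence_log_prob(["I", "prefer"], ["PRP", "VBP"], emission, transition))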
The Markovian assumptions
• Limited Horizon
– P(Xi+1 = tk |X1,…,Xi) = P(Xi+1 = tk | Xi)
• Time invariant
– P(Xi+1 = tk | Xi) = P(Xj+1 = tk | Xj)
Maximum Likelihood Estimation
• To estimate P(wi|ti) and P(ti|ti-1) we can use the maximum likelihood estimates (a small sketch follows below):
– P(wi|ti) = c(wi,ti) / c(ti)
– P(ti|ti-1) = c(ti-1,ti) / c(ti-1)
• Note the estimate for i=1: it uses the start symbol t0, since P(t1) = P(t1|t0)
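A minimal Python sketch of these counts (my own illustration; it assumes a toy corpus given as a list of (word, tag) sentences and uses "<s>" as the start symbol t0):

from collections import defaultdict

def mle_estimates(tagged_sentences):
    # Counts for P(w|t) = c(w,t)/c(t) and P(t|t_prev) = c(t_prev,t)/c(t_prev),
    # using "<s>" as the start symbol t0.
    c_tag = defaultdict(int)
    c_word_tag = defaultdict(int)
    c_bigram = defaultdict(int)
    for sent in tagged_sentences:
        prev = "<s>"
        c_tag[prev] += 1
        for word, tag in sent:
            c_tag[tag] += 1
            c_word_tag[(word, tag)] += 1
            c_bigram[(prev, tag)] += 1
            prev = tag
    emission = {(w, t): c / c_tag[t] for (w, t), c in c_word_tag.items()}
    transition = {(tp, t): c / c_tag[tp] for (tp, t), c in c_bigram.items()}
    return emission, transition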
Unknown Words
• Many words will not appear in the training corpus.
• Unknown words are a major problem for taggers (!)
• Solutions:
– Incorporate morphological analysis
– Treat words that appear only once in the training data as UNKNOWN (a small sketch follows below)
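A hedged sketch of the hapax-to-UNKNOWN idea (illustrative Python, not the lecture's code; the token name <UNK> is my choice):

from collections import Counter

def replace_hapax(tagged_sentences, unk="<UNK>"):
    # Words seen only once in training are mapped to a single UNKNOWN token.
    counts = Counter(w for sent in tagged_sentences for w, _ in sent)
    return [[(w if counts[w] > 1 else unk, t) for w, t in sent]
            for sent in tagged_sentences]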
“Add-1” / “Add-Constant” Smoothing

pMLE(x) = c(x) / N

c(x) - the count of event x (e.g. a word occurrence)
N - the total count over all x ∈ X (e.g. the corpus length)
⇒ pMLE(x) = 0 for many low-probability events (sparseness)

Smoothing - discounting and redistribution (probabilities still sum to 1):

pS(x) = (c(x) + λ) / (N + λ|X|)

λ = 1: Laplace smoothing, assuming a uniform prior.
For natural language events, usually λ < 1.
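A small Python sketch of the smoothed estimate (illustrative; it assumes the counts c(x), N and |X| are already available):

def add_lambda(count, total, vocab_size, lam=1.0):
    # p_S(x) = (c(x) + lambda) / (N + lambda * |X|)
    return (count + lam) / (total + lam * vocab_size)

# An unseen event (c(x) = 0) still receives a small probability:
print(add_lambda(count=0, total=1000, vocab_size=50000, lam=0.1))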
Smoothing for Tagging
• For P(ti|ti-1)
– Include one (shared) smoothed estimate per conditioning tag t, covering all t* with count(t,t*) = 0
• Optionally, also for P(wi|ti)
Viterbi
• Finding the most probable tag sequence can be done with the Viterbi algorithm.
• No need to calculate every single possible tag sequence (!)
HMMs
• Assume a state machine with
– Nodes that correspond to tags
– A start and an end state (<s>, </s>)
– Arcs corresponding to the transition probabilities - P(ti|ti-1)
– A set of observation likelihoods for each state - P(wi|ti)
[Figure: an example HMM state diagram whose states are tags (AT, NN, NNS, VB, VBZ, RB), with transition probabilities on the arcs (e.g. 0.6 and 0.4) and per-state emission probabilities, e.g. AT: P(the)=0.4, P(a)=0.3, P(an)=0.2, …; VBZ: P(likes)=0.3, P(flies)=0.1, …, P(eats)=0.5; VB: P(like)=0.2, P(fly)=0.3, …, P(eat)=0.36.]
HMMs
• An HMM is similar to an automaton augmented with probabilities.
• Note that the states in an HMM do not correspond to the input symbols.
• The input (output) symbols don't uniquely determine the next state.
HMM definition
• HMM = (S, K, A, B)
– Set of states S = {s1,…,sn}
– Output alphabet K = {k1,…,km}
– State transition probabilities A = {aij}, i,j ∈ S
– Symbol emission probabilities B = b(i,k), i ∈ S, k ∈ K
– Start and end states (non-emitting)
• Alternatively: initial state probabilities
• Note: for a given i, Σj aij = 1 and Σk b(i,k) = 1 (a small sketch follows below)
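A minimal Python sketch of this parameterization (dictionaries for A and B; the toy states, words and probabilities are my own illustration):

# A toy HMM in the (S, K, A, B) parameterization (values are illustrative).
S = ["<s>", "AT", "NN", "</s>"]
K = ["the", "dog"]
A = {"<s>": {"AT": 1.0}, "AT": {"NN": 1.0}, "NN": {"</s>": 1.0}}  # a_ij
B = {"AT": {"the": 1.0}, "NN": {"dog": 1.0}}                      # b(i, k)

# Sanity check of the note above: each row of A and B sums to 1.
assert all(abs(sum(row.values()) - 1.0) < 1e-9 for row in A.values())
assert all(abs(sum(row.values()) - 1.0) < 1e-9 for row in B.values())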
Why Hidden?
• Because we only observe the output, the underlying states are hidden
• Decoding:
The problem of part-of-speech tagging can be viewed as a decoding problem: given an observation sequence W = w1,…,wn, find a state sequence T = t1,…,tn that best explains the observation.
[Figure: a worked Viterbi trellis for a three-word input w1 w2 w3 over tags t1, t2, t3 (with start state t0), computed with negative log probabilities. Two tables accompany the trellis: -log m(t | t') for the tag transitions and -log m(w | t) for the word emissions; each trellis node records the best accumulated score for tag sequences ending in that tag at that position.]
Viterbi Algorithm
1. D(0, START) = 0
2. For each tag t != START do: D(0, t) = -∞
3. For i ← 1 to N do:
   a. For each tag tj do:
      D(i, tj) ← maxk [ D(i-1, tk) + lm(tj | tk) ] + lm(wi | tj)
      Record best(i, j) = k which yielded the max
4. log P(W, T*) = maxj D(N, tj)
5. Reconstruct the best path from the maximizing j backwards
Where lm(.) = log m(.), and D(i, tj) is the maximal joint (log) probability of the state and word sequences up to position i, ending at tj.
Complexity: O(Nt² · N), where Nt is the number of tags and N the sentence length.
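A compact Python sketch of this procedure, following the D and best tables above (my own illustrative code; lm_trans and lm_emit are assumed to hold precomputed log probabilities):

import math

def viterbi(words, tags, lm_trans, lm_emit, start="<s>"):
    # lm_trans[(t_prev, t)] and lm_emit[(w, t)] hold log probabilities.
    # D[i][t] is the best log score of a tag sequence for words[:i] ending in t.
    D = [{start: 0.0}]
    best = [{}]
    for i, w in enumerate(words, start=1):
        D.append({})
        best.append({})
        for t in tags:
            score, k = max(
                (D[i - 1][tk] + lm_trans.get((tk, t), -math.inf), tk)
                for tk in D[i - 1]
            )
            D[i][t] = score + lm_emit.get((w, t), -math.inf)
            best[i][t] = k
    # Reconstruct the best path backwards from the highest-scoring final tag.
    t = max(D[-1], key=D[-1].get)
    path = [t]
    for i in range(len(words), 1, -1):
        t = best[i][t]
        path.append(t)
    return list(reversed(path)), max(D[-1].values())

Each column of D is filled from the previous column only, which is what keeps the cost at O(Nt² · N) instead of enumerating every possible tag sequence.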
A*, N-best decoding
• Sometimes one wants not just the best state sequence for a given input but rather the top n best sequences, e.g. as input for a different model.
• A* / stack decoding is an alternative to Viterbi.
Up from bigrams
• The POS tagging model we described used a history of just the previous tag:
P(ti|t1,…,ti-1) ≈ P(ti|ti-1)
i.e. a first-order Markovian assumption
• In this case each state in the HMM corresponds to a POS tag
• One can build an HMM for POS trigrams: P(ti|t1,…,ti-1) ≈ P(ti|ti-2,ti-1)
POS Trigram HMM Model
• More accurate (also empirically) than a bigram model:
– "He clearly marked" (marked = past-tense verb)
– "is clearly marked" (marked = past participle)
• Sparseness problem – smoothing, back-off
• In such a model the HMM states do NOT correspond to single POS tags (see the sketch below).
• Why not 4-grams?
– Too many states, not enough data!
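One common way to realize such a model (a sketch under my own assumptions, not necessarily the lecture's construction) is to let each HMM state be the pair of the two previous tags, so the bigram Viterbi machinery above still applies:

from itertools import product

tags = ["AT", "NN", "VB"]
# A trigram-model state is the pair (t_{i-2}, t_{i-1}); with |T| tags this
# gives |T|^2 states, and the transition (a, b) -> (b, c) carries P(c | a, b).
states = list(product(tags, repeat=2))

def valid_transition(prev_state, next_state):
    # (a, b) -> (b', c) is only allowed when b == b'.
    return prev_state[1] == next_state[0]

print(len(states), valid_transition(("AT", "NN"), ("NN", "VB")))  # 9 True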
Supervised/Unsupervised
• Is HMM-based tagging a supervised algorithm?
– Yes, because we need a tagged corpus to estimate the transition and emission probabilities (!)
• What do we do if we don't have an annotated corpus but
– have a dictionary, or
– have an annotated corpus from a different domain and an un-annotated corpus in the desired domain?
Baum-Welch Algorithm
• Also known as the Forward-Backward algorithm
• An EM algorithm for HMMs
• Maximization by iterative hill climbing
• The algorithm iteratively improves the model parameters based on unannotated training data.
Baum-Welch Algorithm…
• Start off with parameters based on the dictionary (a small sketch follows below):
– P(w|t) = 1 if t is a possible tag for w
– P(w|t) = 0 otherwise
– Uniform distribution on state transitions
• This is enough to bootstrap from.
• Could also be used to tune a system to a new domain.
• But the best results, and the common practice, come from supervised estimation.
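A minimal sketch of this dictionary-based initialization (illustrative Python; normalizing each emission row to a proper distribution is my own addition, the slide only marks the possible tags as nonzero):

def init_emissions(dictionary, vocab):
    # dictionary maps each word to the set of its possible tags.
    tags = {t for ts in dictionary.values() for t in ts}
    emissions = {}
    for t in tags:
        allowed = [w for w in vocab if t in dictionary.get(w, ())]
        # Nonzero only for words the dictionary allows for tag t; spread
        # uniformly so each emission row is a proper distribution.
        emissions[t] = {w: (1.0 / len(allowed) if w in allowed else 0.0)
                        for w in vocab}
    return emissions

emis = init_emissions({"fly": {"NN", "VB"}, "the": {"AT"}}, ["fly", "the"])
print(emis["AT"]["the"], emis["AT"]["fly"])  # 1.0 0.0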
Completely unsupervised?
• What if there is no dictionary and no annotated corpus?
→ Clustering – but the induced clusters don't correspond to linguistic POS categories
Natural Language Processing - Lecture 7
Partial Parsing
Oren Glickman
Department of Computer Science
Bar-Ilan University
Syntax
• The study of grammatical relations between words and other units within the sentence.
(The Concise Oxford Dictionary of Linguistics)
• The way in which linguistic elements (as words) are put together to form constituents (as phrases or clauses).
(Merriam-Webster Dictionary)
Brackets
• “I prefer a morning flight”
• [S [NP [Pro I]] [VP [V prefer] [NP [Det a] [Nom [N morning] [N flight]]]]]
Parse Tree
The parse tree for “I prefer a morning flight”:
(S
  (NP (Pronoun I))
  (VP (Verb prefer)
      (NP (Det a)
          (Nom (Noun morning)
               (Noun flight)))))
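As a small illustration of how the bracket notation above maps to a tree structure, here is a hedged Python sketch (my own code, not course material) that parses such a bracketed string into nested (label, children) pairs:

def parse_brackets(s):
    # Turn "[S [NP [Pro I]] ...]" into a nested (label, children) structure.
    tokens = s.replace("[", " [ ").replace("]", " ] ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1                        # consume "["
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != "]":
            if tokens[pos] == "[":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                        # consume "]"
        return (label, children)

    return node()

tree = parse_brackets("[S [NP [Pro I]] [VP [V prefer] "
                      "[NP [Det a] [Nom [N morning] [N flight]]]]]")
print(tree[0], [child[0] for child in tree[1]])  # S ['NP', 'VP']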
Parsing
• The problem of mapping from a string of words to its parse tree is called parsing.
Generative Grammar
• A set of rules which indicate precisely what can and cannot be a sentence in a language.
• A grammar which precisely specifies the membership of the set of all the grammatical sentences in the language in question, and therefore excludes all the ungrammatical sentences.
Formal Languages
• A natural language, viewed as a formal language, is the set of all its grammatical sentences.
• Are natural languages regular?
English is not a regular language!
• a^n b^n is not regular
• Look at the following English sentences:
– John and Mary like to eat and sleep, respectively.
– John, Mary, and Sue like to eat, sleep, and dance, respectively.
– John, Mary, Sue, and Bob like to eat, sleep, dance, and cook, respectively.
(matching n subjects with n verbs is an a^n b^n-style dependency)
• “Anti-missile missile” (the construction can be nested arbitrarily deep)
Constituents
• Certain groupings of words behave as constituents.
• Constituents are able to occur in various sentence positions (here, "the thin boy"):
– I saw the thin boy
– I saw him talking with the thin boy
– The thin boy lives across the street
The Noun Phrase (NP)
• Examples:
– He
– Ariel Sharon
– The prime minister
– The minister of defense during the war in Lebanon
• They can all appear in a similar context:
___ was born in Kfar-Malal
Prepositional Phrases
• Examples:
– the man in the white suit
– Come and look at my paintings
– Are you fond of animals?
– Put that thing on the floor
Verb Phrases
• Examples:
– He went
– He was trying to keep his temper.
– She quickly showed me the way to hide.