N-Gram Model Formulas
• Word sequences
  $w_1^n = w_1\, w_2 \ldots w_n$
• Chain rule of probability
  $P(w_1^n) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1^2) \ldots P(w_n \mid w_1^{n-1}) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})$
• Bigram approximation
  $P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})$
• N-gram approximation
  $P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-N+1}^{k-1})$
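To make the bigram approximation concrete, here is a minimal Python sketch of scoring a sentence as a product of bigram probabilities. The function name sentence_prob and the bigram_prob lookup are hypothetical; such a lookup can be built from counts as described in the next slide.

```python
# Minimal sketch of the bigram approximation: P(w_1^n) ~= prod_k P(w_k | w_{k-1}).
# bigram_prob is a hypothetical function returning P(w | prev).

def sentence_prob(words, bigram_prob, start="<s>"):
    """Approximate P(w_1..w_n) as a product of bigram probabilities."""
    prob = 1.0
    prev = start
    for w in words:
        prob *= bigram_prob(prev, w)  # P(w | prev)
        prev = w
    return prob
```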
Estimating Probabilities
• N-gram conditional probabilities can be estimated
from raw text based on the relative frequency of
word sequences.
Bigram:
  $P(w_n \mid w_{n-1}) = \dfrac{C(w_{n-1}\, w_n)}{C(w_{n-1})}$
N-gram:
  $P(w_n \mid w_{n-N+1}^{n-1}) = \dfrac{C(w_{n-N+1}^{n-1}\, w_n)}{C(w_{n-N+1}^{n-1})}$
• To have a consistent probabilistic model, append a
unique start (<s>) and end (</s>) symbol to every
sentence and treat these as additional words.
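The relative-frequency estimate maps directly onto counting code. Below is a minimal Python sketch, assuming sentences are given as lists of tokens; the function name train_bigram and the returned closure are illustrative, not from the slides.

```python
from collections import Counter

def train_bigram(sentences):
    """Relative-frequency (MLE) bigram estimates with <s> and </s> markers."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:                       # each sentence is a list of tokens
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens[:-1])             # C(w_{n-1}): context counts
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return lambda prev, w: bigrams[(prev, w)] / unigrams[prev]  # P(w_n | w_{n-1})

# Example:
# p = train_bigram([["the", "dog", "barks"], ["the", "cat", "sleeps"]])
# p("the", "dog")  -> 0.5
```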
Perplexity
• Measure of how well a model “fits” the test data.
• Uses the probability that the model assigns to the
test corpus.
• Normalizes for the number of words in the test
corpus and takes the inverse.
$PP(W) = \sqrt[N]{\dfrac{1}{P(w_1 w_2 \ldots w_N)}}$
• Measures the weighted average branching factor
in predicting the next word (lower is better).
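A small Python sketch of the perplexity computation, done in log space to avoid underflow. It reuses the hypothetical bigram_prob lookup from the earlier sketch; counting each </s> (but not <s>) toward N is one common convention and is an assumption here.

```python
import math

def perplexity(test_sentences, bigram_prob):
    """PP(W) = P(w_1..w_N)^(-1/N), computed in log space for numerical stability."""
    log_prob, n_words = 0.0, 0
    for sent in test_sentences:
        prev = "<s>"
        for w in sent + ["</s>"]:                # </s> counted as a word (assumption)
            log_prob += math.log(bigram_prob(prev, w))
            prev = w
            n_words += 1
    return math.exp(-log_prob / n_words)
```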
Laplace (Add-One) Smoothing
• “Hallucinate” additional training data in which each
possible N-gram occurs exactly once and adjust
estimates accordingly.
Bigram:
  $P(w_n \mid w_{n-1}) = \dfrac{C(w_{n-1}\, w_n) + 1}{C(w_{n-1}) + V}$
N-gram:
  $P(w_n \mid w_{n-N+1}^{n-1}) = \dfrac{C(w_{n-N+1}^{n-1}\, w_n) + 1}{C(w_{n-N+1}^{n-1}) + V}$
where V is the total number of possible (N−1)-grams (i.e. the vocabulary size for a bigram model).
• Tends to reassign too much probability mass to unseen events, so it can be adjusted to add 0 < δ < 1 instead of 1 (normalized by δV instead of V).
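A minimal sketch of the add-one smoothed bigram estimate, reusing the bigram and unigram counters from the earlier training sketch; the function name laplace_bigram_prob is illustrative.

```python
def laplace_bigram_prob(bigrams, unigrams, vocab_size):
    """Add-one smoothed bigram estimate: (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + V)."""
    def prob(prev, w):
        return (bigrams[(prev, w)] + 1) / (unigrams[prev] + vocab_size)
    return prob
```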
Interpolation
• Linearly combine estimates of N-gram
models of increasing order.
Interpolated Trigram Model:
  $\hat{P}(w_n \mid w_{n-2}, w_{n-1}) = \lambda_1 P(w_n \mid w_{n-2}, w_{n-1}) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_3 P(w_n)$
Where: $\sum_i \lambda_i = 1$
• Learn proper values for the $\lambda_i$ by training to (approximately) maximize the likelihood of an independent development (a.k.a. tuning) corpus.
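A minimal Python sketch of the interpolated model, assuming separately trained trigram, bigram, and unigram probability functions. The fixed lambda values here are placeholders; in practice they are tuned on the development corpus.

```python
def interpolated_trigram_prob(p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    """Linear interpolation of trigram, bigram, and unigram estimates.
    The lambdas must sum to 1; the values here are illustrative, not tuned."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9
    def prob(w_prev2, w_prev1, w):
        return (l1 * p_tri(w_prev2, w_prev1, w)
                + l2 * p_bi(w_prev1, w)
                + l3 * p_uni(w))
    return prob
```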
Formal Definition of an HMM
• A set of N +2 states S={s0,s1,s2, … sN, sF}
– Distinguished start state: s0
– Distinguished final state: sF
• A set of M possible observations V={v1,v2…vM}
• A state transition probability distribution A = {a_ij}
  $a_{ij} = P(q_{t+1} = s_j \mid q_t = s_i)$   $1 \le i, j \le N$, and $i = 0$, $j = F$
  $\sum_{j=1}^{N} a_{ij} + a_{iF} = 1$,   $0 \le i \le N$
• Observation probability distribution for each state j, B = {b_j(k)}
  $b_j(k) = P(v_k \text{ at } t \mid q_t = s_j)$   $1 \le j \le N$, $1 \le k \le M$
• Total parameter set λ = {A, B}
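One possible concrete layout for λ = {A, B} is sketched below in Python. The indexing convention (0 for the start state s0, N+1 for the final state sF) is an assumption of this sketch, not part of the definition above.

```python
import numpy as np

# Assumed layout: index 0 = start state s0, indices 1..N = real states, index N+1 = sF.
N, M = 3, 5                          # N hidden states, M observation symbols
A = np.zeros((N + 2, N + 2))         # A[i, j] = a_ij = P(q_{t+1}=s_j | q_t=s_i)
B = np.zeros((N + 2, M))             # B[j, k] = b_j(k); only rows 1..N are used
# Constraint: for every i in 0..N, A[i, 1:N+1].sum() + A[i, N+1] == 1
```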
Forward Probabilities
• Let $\alpha_t(j)$ be the probability of being in state j after seeing the first t observations (by summing over all initial paths leading to j).
  $\alpha_t(j) = P(o_1, o_2, \ldots, o_t, q_t = s_j \mid \lambda)$
Computing the Forward Probabilities
• Initialization
  $\alpha_1(j) = a_{0j}\, b_j(o_1)$   $1 \le j \le N$
• Recursion
  $\alpha_t(j) = \left[ \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij} \right] b_j(o_t)$   $1 \le j \le N,\; 1 < t \le T$
• Termination
  $P(O \mid \lambda) = \alpha_{T+1}(s_F) = \sum_{i=1}^{N} \alpha_T(i)\, a_{iF}$
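Under the array layout assumed in the earlier HMM sketch, the three steps above translate roughly into the following Python; the function name forward and the observation encoding (a list of column indices into B) are assumptions of this sketch.

```python
import numpy as np

def forward(A, B, obs, N):
    """Forward algorithm for P(O | lambda); state 0 = s0, states 1..N, state N+1 = sF.
    obs is a list of observation indices into the columns of B."""
    T = len(obs)
    alpha = np.zeros((T + 1, N + 2))          # alpha[t, j] for t = 1..T
    for j in range(1, N + 1):                 # initialization: alpha_1(j) = a_0j * b_j(o_1)
        alpha[1, j] = A[0, j] * B[j, obs[0]]
    for t in range(2, T + 1):                 # recursion: sum over predecessor states
        for j in range(1, N + 1):
            alpha[t, j] = sum(alpha[t - 1, i] * A[i, j] for i in range(1, N + 1)) * B[j, obs[t - 1]]
    # termination: P(O | lambda) = sum_i alpha_T(i) * a_iF
    return sum(alpha[T, i] * A[i, N + 1] for i in range(1, N + 1))
```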
Viterbi Scores
• Recursively compute the probability of the most
likely subsequence of states that accounts for the
first t observations and ends in state sj.
$v_t(j) = \max_{q_0, q_1, \ldots, q_{t-1}} P(q_0, q_1, \ldots, q_{t-1}, o_1, \ldots, o_{t-1}, q_t = s_j \mid \lambda)$
• Also record “backpointers” that subsequently allow backtracing the most probable state sequence.
  $bt_t(j)$ stores the state at time t−1 that maximizes the probability that the system was in state $s_j$ at time t (given the observed sequence).
Computing the Viterbi Scores
• Initialization
  $v_1(j) = a_{0j}\, b_j(o_1)$   $1 \le j \le N$
• Recursion
  $v_t(j) = \max_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t)$   $1 \le j \le N,\; 1 < t \le T$
• Termination
  $P^* = v_{T+1}(s_F) = \max_{i=1}^{N} v_T(i)\, a_{iF}$
Analogous to Forward algorithm except take max instead of sum
Computing the Viterbi Backpointers
• Initialization
  $bt_1(j) = s_0$   $1 \le j \le N$
• Recursion
  $bt_t(j) = \operatorname{argmax}_{i=1}^{N}\, v_{t-1}(i)\, a_{ij}\, b_j(o_t)$   $1 \le j \le N,\; 1 < t \le T$
• Termination
  $q_T^* = bt_{T+1}(s_F) = \operatorname{argmax}_{i=1}^{N}\, v_T(i)\, a_{iF}$
Final state in the most probable state sequence. Follow
backpointers to initial state to construct full sequence.
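A possible Python sketch combining the Viterbi scores and backpointers, under the same assumed state/observation layout as the forward sketch; the function name viterbi and the return format are illustrative.

```python
import numpy as np

def viterbi(A, B, obs, N):
    """Viterbi decoding; returns (P*, most probable state sequence q_1..q_T)."""
    T = len(obs)
    v = np.zeros((T + 1, N + 2))
    bt = np.zeros((T + 1, N + 2), dtype=int)
    for j in range(1, N + 1):                         # initialization
        v[1, j] = A[0, j] * B[j, obs[0]]
        bt[1, j] = 0                                  # backpointer to start state s0
    for t in range(2, T + 1):                         # recursion: max instead of sum
        for j in range(1, N + 1):
            scores = [v[t - 1, i] * A[i, j] * B[j, obs[t - 1]] for i in range(1, N + 1)]
            best = int(np.argmax(scores))
            v[t, j] = scores[best]
            bt[t, j] = best + 1                       # argmax is over i = 1..N
    finals = [v[T, i] * A[i, N + 1] for i in range(1, N + 1)]   # termination
    q_T = int(np.argmax(finals)) + 1
    best_prob = finals[q_T - 1]
    path = [q_T]                                      # follow backpointers back to t = 1
    for t in range(T, 1, -1):
        path.append(int(bt[t, path[-1]]))
    return best_prob, list(reversed(path))
```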
Supervised Parameter Estimation
• Estimate state transition probabilities based on tag
bigram and unigram statistics in the labeled data.
$a_{ij} = \dfrac{C(q_t = s_i,\, q_{t+1} = s_j)}{C(q_t = s_i)}$
• Estimate the observation probabilities based on
tag/word co-occurrence statistics in the labeled data.
$b_j(k) = \dfrac{C(q_i = s_j,\, o_i = v_k)}{C(q_i = s_j)}$
• Use appropriate smoothing if training data is sparse.
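A minimal Python sketch of the counting behind these estimates, assuming the labeled data is given as lists of (word, tag) pairs; the function name estimate_hmm and the returned lookups are illustrative, and smoothing is left out.

```python
from collections import Counter

def estimate_hmm(tagged_sentences):
    """Relative-frequency estimates of a_ij and b_j(k) from labeled (word, tag) data."""
    trans, tag_counts, emit = Counter(), Counter(), Counter()
    for sent in tagged_sentences:                     # sent = [(word, tag), ...]
        tags = ["<s>"] + [t for _, t in sent] + ["</s>"]
        trans.update(zip(tags[:-1], tags[1:]))        # C(q_t = s_i, q_{t+1} = s_j)
        tag_counts.update(tags[:-1])                  # C(q_t = s_i)
        emit.update((t, w) for w, t in sent)          # C(q_i = s_j, o_i = v_k)
    a = lambda si, sj: trans[(si, sj)] / tag_counts[si]
    b = lambda sj, w: emit[(sj, w)] / tag_counts[sj]
    return a, b
```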
Context Free Grammars (CFG)
• N, a set of non-terminal symbols (or variables)
• Σ, a set of terminal symbols (disjoint from N)
• R, a set of productions or rules of the form A → β, where A is a non-terminal and β is a string of symbols from (Σ ∪ N)*
• S, a designated non-terminal called the start
symbol
Estimating Production Probabilities
• Set of production rules can be taken directly
from the set of rewrites in the treebank.
• Parameters can be directly estimated from
frequency counts in the treebank.
$P(\alpha \to \beta \mid \alpha) = \dfrac{\text{count}(\alpha \to \beta)}{\sum_{\gamma} \text{count}(\alpha \to \gamma)} = \dfrac{\text{count}(\alpha \to \beta)}{\text{count}(\alpha)}$
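A minimal Python sketch of this estimate, assuming the treebank rewrites have already been read off as (lhs, rhs) pairs; the function name estimate_pcfg and the rule encoding are illustrative.

```python
from collections import Counter

def estimate_pcfg(treebank_rules):
    """MLE estimates P(alpha -> beta | alpha) = count(alpha -> beta) / count(alpha)."""
    rule_counts = Counter(treebank_rules)                      # count(alpha -> beta)
    lhs_counts = Counter(lhs for lhs, _ in treebank_rules)     # count(alpha)
    return {rule: c / lhs_counts[rule[0]] for rule, c in rule_counts.items()}

# Example (hypothetical rules read from trees):
# rules = [("S", ("NP", "VP")), ("NP", ("DT", "NN")), ("NP", ("NNP",))]
# estimate_pcfg(rules)[("NP", ("DT", "NN"))]  -> 0.5
```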