N-Gram Model Formulas
• Word sequences
  $w_1^n = w_1 \ldots w_n$
• Chain rule of probability
  $P(w_1^n) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1^2) \ldots P(w_n \mid w_1^{n-1}) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})$
• Bigram approximation
  $P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})$
• N-gram approximation
  $P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-N+1}^{k-1})$
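The bigram approximation is easy to demonstrate in code. Below is a minimal Python sketch that multiplies bigram conditionals over a padded sentence; the probability table and the sentence are invented purely for illustration.

```python
# Sketch of the bigram approximation: P(w_1^n) ~= prod_k P(w_k | w_{k-1}).
# The probability table and the sentence are invented for illustration.
bigram_prob = {
    ("<s>", "i"): 0.25,
    ("i", "want"): 0.33,
    ("want", "food"): 0.05,
    ("food", "</s>"): 0.68,
}

def bigram_sentence_prob(words):
    """Multiply conditional bigram probabilities over the padded sentence."""
    padded = ["<s>"] + words + ["</s>"]
    prob = 1.0
    for prev, cur in zip(padded, padded[1:]):
        prob *= bigram_prob.get((prev, cur), 0.0)  # unseen bigrams get probability 0 here
    return prob

print(bigram_sentence_prob(["i", "want", "food"]))  # 0.25 * 0.33 * 0.05 * 0.68 ≈ 0.0028
```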
Estimating Probabilities
• N-gram conditional probabilities can be estimated
from raw text based on the relative frequency of
word sequences.
Bigram:
  $P(w_n \mid w_{n-1}) = \dfrac{C(w_{n-1} w_n)}{C(w_{n-1})}$
N-gram:
  $P(w_n \mid w_{n-N+1}^{n-1}) = \dfrac{C(w_{n-N+1}^{n-1} w_n)}{C(w_{n-N+1}^{n-1})}$
• To have a consistent probabilistic model, append a
unique start (<s>) and end (</s>) symbol to every
sentence and treat these as additional words.
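As a concrete illustration of these relative-frequency estimates, here is a small Python sketch that counts unigrams and bigrams over a padded toy corpus; the two sentences are made up for the example.

```python
from collections import Counter

# Relative-frequency (MLE) bigram estimates from a toy corpus,
# with <s> and </s> appended to every sentence as described above.
corpus = [
    ["i", "want", "chinese", "food"],
    ["i", "want", "to", "eat"],
]

unigram_counts, bigram_counts = Counter(), Counter()
for sent in corpus:
    padded = ["<s>"] + sent + ["</s>"]
    unigram_counts.update(padded)
    bigram_counts.update(zip(padded, padded[1:]))

def bigram_mle(prev, word):
    """P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_mle("i", "want"))   # C(i want) / C(i) = 2 / 2 = 1.0
print(bigram_mle("want", "to"))  # C(want to) / C(want) = 1 / 2 = 0.5
```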
Perplexity
• Measure of how well a model “fits” the test data.
• Uses the probability that the model assigns to the
test corpus.
• Normalizes for the number of words in the test
corpus and takes the inverse.
$PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \sqrt[N]{\dfrac{1}{P(w_1 w_2 \ldots w_N)}}$
• Measures the weighted average branching factor
in predicting the next word (lower is better).
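A short sketch of the computation, done in log space to avoid underflow; the per-word probabilities below are invented for illustration.

```python
import math

# PP(W) = P(w_1 ... w_N) ** (-1 / N), computed in log space to avoid underflow.
word_probs = [0.25, 0.33, 0.05, 0.68]           # model probability of each test word (invented)
N = len(word_probs)
log_prob = sum(math.log(p) for p in word_probs)
perplexity = math.exp(-log_prob / N)
print(perplexity)                                # ≈ 4.3 (lower is better)
```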
Laplace (Add-One) Smoothing
• “Hallucinate” additional training data in which each
possible N-gram occurs exactly once and adjust
estimates accordingly.
Bigram:
  $P(w_n \mid w_{n-1}) = \dfrac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}$
N-gram:
  $P(w_n \mid w_{n-N+1}^{n-1}) = \dfrac{C(w_{n-N+1}^{n-1} w_n) + 1}{C(w_{n-N+1}^{n-1}) + V}$
where V is the total number of possible (N−1)-grams
(i.e. the vocabulary size for a bigram model).
• Tends to reassign too much mass to unseen events, so it can be adjusted to add $0 < \delta < 1$ (normalized by $\delta V$ instead of $V$).
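A minimal sketch of the add-one estimate on invented toy counts; the vocabulary size V is likewise made up for the example.

```python
from collections import Counter

# Add-one (Laplace) smoothed bigram estimate on invented toy counts:
# P(w_n | w_{n-1}) = (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + V)
bigram_counts = Counter({("i", "want"): 2, ("want", "to"): 1})
unigram_counts = Counter({"i": 2, "want": 2, "to": 1})
V = 7  # vocabulary size (made up for this example)

def bigram_laplace(prev, word):
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

print(bigram_laplace("want", "to"))     # (1 + 1) / (2 + 7) ≈ 0.22
print(bigram_laplace("want", "pizza"))  # unseen bigram still gets (0 + 1) / (2 + 7) ≈ 0.11
```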
Interpolation
• Linearly combine estimates of N-gram
models of increasing order.
Interpolated Trigram Model:
  $\hat{P}(w_n \mid w_{n-2}, w_{n-1}) = \lambda_1 P(w_n \mid w_{n-2}, w_{n-1}) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_3 P(w_n)$
Where:
  $1 = \sum_i \lambda_i$
• Learn proper values for the $\lambda_i$ by training to
(approximately) maximize the likelihood of
an independent development (a.k.a. tuning)
corpus.
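The interpolation itself is just a weighted sum, as the sketch below shows; the component probabilities and λ weights are invented, and the λ's must sum to one.

```python
# Linear interpolation of trigram, bigram, and unigram estimates.
# All probabilities and lambda weights below are invented for illustration.
lambda1, lambda2, lambda3 = 0.6, 0.3, 0.1       # must sum to 1
p_trigram, p_bigram, p_unigram = 0.002, 0.01, 0.0001

p_hat = lambda1 * p_trigram + lambda2 * p_bigram + lambda3 * p_unigram
print(p_hat)  # 0.6*0.002 + 0.3*0.01 + 0.1*0.0001 = 0.00421
```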
Formal Definition of an HMM
• A set of N +2 states S={s0,s1,s2, … sN, sF}
– Distinguished start state: s0
– Distinguished final state: sF
• A set of M possible observations V={v1,v2…vM}
• A state transition probability distribution $A = \{a_{ij}\}$
  $a_{ij} = P(q_{t+1} = s_j \mid q_t = s_i) \qquad 1 \le i, j \le N \text{ and } i = 0,\ j = F$
  $\sum_{j=1}^{N} a_{ij} + a_{iF} = 1 \qquad 0 \le i \le N$
• Observation probability distribution for each state j, $B = \{b_j(k)\}$
  $b_j(k) = P(v_k \text{ at } t \mid q_t = s_j) \qquad 1 \le j \le N,\ 1 \le k \le M$
• Total parameter set $\lambda = \{A, B\}$
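One way to hold the parameter set λ = {A, B} in code is as arrays. The sketch below is only an illustration with invented sizes and values: A gets a row for transitions out of the start state s0 and an extra column for the a_iF entries, and B has one row per state.

```python
import numpy as np

# Storing lambda = {A, B} as arrays (sizes and values invented).
N, M = 2, 3                     # N hidden states, M observation symbols
# Rows of A: from s0, s1, ..., sN; columns: to s1, ..., sN, sF.
A = np.array([
    [0.6, 0.4, 0.0],            # transitions out of the start state s0
    [0.3, 0.5, 0.2],            # transitions out of s1 (last entry is a_1F)
    [0.1, 0.6, 0.3],            # transitions out of s2 (last entry is a_2F)
])
# Rows of B: states s1, ..., sN; columns: symbols v1, ..., vM.
B = np.array([
    [0.5, 0.4, 0.1],            # b_1(k)
    [0.1, 0.3, 0.6],            # b_2(k)
])
# Each row is a probability distribution, so it sums to 1.
assert np.allclose(A.sum(axis=1), 1.0) and np.allclose(B.sum(axis=1), 1.0)
```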
Forward Probabilities
• Let $\alpha_t(j)$ be the probability of being in state j after seeing the first t observations (by summing over all initial paths leading to j).
  $\alpha_t(j) = P(o_1, o_2, \ldots, o_t,\ q_t = s_j \mid \lambda)$
Computing the Forward Probabilities
• Initialization
  $\alpha_1(j) = a_{0j}\, b_j(o_1) \qquad 1 \le j \le N$
• Recursion
  $\alpha_t(j) = \left( \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij} \right) b_j(o_t) \qquad 1 \le j \le N,\ 1 < t \le T$
• Termination
  $P(O \mid \lambda) = \alpha_{T+1}(s_F) = \sum_{i=1}^{N} \alpha_T(i)\, a_{iF}$
Viterbi Scores
• Recursively compute the probability of the most
likely subsequence of states that accounts for the
first t observations and ends in state sj.
$v_t(j) = \max_{q_0, q_1, \ldots, q_{t-1}} P(q_0, q_1, \ldots, q_{t-1},\ o_1, \ldots, o_t,\ q_t = s_j \mid \lambda)$
• Also record “backpointers” that subsequently allow backtracing the most probable state sequence.
  – $bt_t(j)$ stores the state at time $t-1$ that maximizes the probability that the system was in state $s_j$ at time $t$ (given the observed sequence).
Computing the Viterbi Scores
• Initialization
  $v_1(j) = a_{0j}\, b_j(o_1) \qquad 1 \le j \le N$
• Recursion
  $v_t(j) = \max_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t) \qquad 1 \le j \le N,\ 1 < t \le T$
• Termination
  $P^{*} = v_{T+1}(s_F) = \max_{i=1}^{N} v_T(i)\, a_{iF}$
Analogous to the Forward algorithm, except taking the max instead of the sum.
Computing the Viterbi Backpointers
• Initialization
  $bt_1(j) = s_0 \qquad 1 \le j \le N$
• Recursion
  $bt_t(j) = \operatorname*{argmax}_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t) \qquad 1 \le j \le N,\ 1 < t \le T$
• Termination
  $q_T^{*} = bt_{T+1}(s_F) = \operatorname*{argmax}_{i=1}^{N} v_T(i)\, a_{iF}$
This gives the final state in the most probable state sequence; follow the backpointers back to the initial state to construct the full sequence.
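Putting the scores and backpointers together gives a small Viterbi decoder. The sketch below mirrors the Forward sketch above, again with invented parameters and observations, and backtraces the most probable state sequence at the end.

```python
import numpy as np

# Viterbi scores and backpointers on the same invented parameters as the
# Forward sketch; max replaces sum, and backpointers recover the best path.
A = np.array([[0.6, 0.4, 0.0],
              [0.3, 0.5, 0.2],
              [0.1, 0.6, 0.3]])
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])
obs = [0, 2, 1]
N, T = B.shape[0], len(obs)

v = np.zeros((T, N))
bt = np.zeros((T, N), dtype=int)
v[0] = A[0, :N] * B[:, obs[0]]                    # v_1(j) = a_0j * b_j(o_1)
for t in range(1, T):
    for j in range(N):
        scores = v[t - 1] * A[1:, j]              # v_{t-1}(i) * a_ij
        bt[t, j] = int(np.argmax(scores))         # backpointer to best predecessor
        v[t, j] = scores[bt[t, j]] * B[j, obs[t]]
last = int(np.argmax(v[T - 1] * A[1:, N]))        # q_T*: best final state
path = [last]
for t in range(T - 1, 0, -1):                     # follow backpointers backwards
    path.append(bt[t, path[-1]])
path.reverse()
print([f"s{i + 1}" for i in path])                # most probable state sequence
```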
Supervised Parameter Estimation
• Estimate state transition probabilities based on tag
bigram and unigram statistics in the labeled data.
$a_{ij} = \dfrac{C(q_t = s_i,\ q_{t+1} = s_j)}{C(q_t = s_i)}$
• Estimate the observation probabilities based on
tag/word co-occurrence statistics in the labeled data.
$b_j(k) = \dfrac{C(q_i = s_j,\ o_i = v_k)}{C(q_i = s_j)}$
• Use appropriate smoothing if training data is sparse.
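As a concrete illustration of these relative-frequency estimates, here is a small sketch over a tiny invented tagged corpus (no smoothing applied).

```python
from collections import Counter

# Relative-frequency estimates of a_ij and b_j(k) from a tiny invented
# tagged corpus (no smoothing applied).
tagged = [[("the", "DET"), ("dog", "N"), ("barks", "V")],
          [("the", "DET"), ("cat", "N"), ("sleeps", "V")]]

trans_counts, tag_counts, emit_counts = Counter(), Counter(), Counter()
for sent in tagged:
    tags = [tag for _, tag in sent]
    trans_counts.update(zip(tags, tags[1:]))               # C(q_t = s_i, q_{t+1} = s_j)
    tag_counts.update(tags)                                # C(q_t = s_i)
    emit_counts.update((tag, word) for word, tag in sent)  # C(q_i = s_j, o_i = v_k)

def trans_prob(si, sj):
    return trans_counts[(si, sj)] / tag_counts[si]

def emit_prob(sj, vk):
    return emit_counts[(sj, vk)] / tag_counts[sj]

print(trans_prob("DET", "N"))  # 2 / 2 = 1.0
print(emit_prob("N", "dog"))   # 1 / 2 = 0.5
```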
Context Free Grammars (CFG)
• N a set of non-terminal symbols (or variables)
• $\Sigma$ a set of terminal symbols (disjoint from N)
• R a set of productions or rules of the form $A \rightarrow \beta$, where A is a non-terminal and $\beta$ is a string of symbols from $(\Sigma \cup N)^{*}$
• S, a designated non-terminal called the start
symbol
Estimating Production Probabilities
• Set of production rules can be taken directly
from the set of rewrites in the treebank.
• Parameters can be directly estimated from
frequency counts in the treebank.
$P(\alpha \rightarrow \beta \mid \alpha) = \dfrac{\text{count}(\alpha \rightarrow \beta)}{\sum_{\gamma} \text{count}(\alpha \rightarrow \gamma)} = \dfrac{\text{count}(\alpha \rightarrow \beta)}{\text{count}(\alpha)}$
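As a concrete illustration, the sketch below estimates rule probabilities by relative frequency from a list of treebank rewrites; the rule occurrences are invented.

```python
from collections import Counter

# Relative-frequency estimate of production probabilities from treebank
# rewrites; the rule occurrences below are invented for illustration.
rewrites = [("NP", ("DT", "NN")), ("NP", ("DT", "NN")), ("NP", ("PRP",)),
            ("VP", ("V", "NP")), ("VP", ("V",))]

rule_counts = Counter(rewrites)                     # count(alpha -> beta)
lhs_counts = Counter(lhs for lhs, _ in rewrites)    # count(alpha)

def rule_prob(lhs, rhs):
    """P(alpha -> beta | alpha) = count(alpha -> beta) / count(alpha)."""
    return rule_counts[(lhs, rhs)] / lhs_counts[lhs]

print(rule_prob("NP", ("DT", "NN")))  # 2 / 3
print(rule_prob("VP", ("V", "NP")))   # 1 / 2
```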