Corpora and Statistical Methods Lecture 15

Corpora and Statistical Methods
Albert Gatt
Acknowledgement
 Some of the examples in this lecture are taken from a tutorial on HMMs by Wolfgang Maass
Talking about the weather
 Suppose we want to predict tomorrow’s weather. The
possible predictions are:
 sunny
 foggy
 rainy
 We might decide to predict tomorrow’s outcome based on
earlier weather
 if it’s been sunny all week, it’s likelier to be sunny tomorrow than if it
had been rainy all week
 how far back do we want to go to predict tomorrow’s weather?
Statistical weather model
 Notation:
 S: the state space, a set of possible values for the weather: {sunny, foggy,
rainy}
 (each state is identifiable by an integer i)
 X: a sequence of random variables, each taking a value from S
 these model weather over a sequence of days
 t is an integer standing for time
 (X1, X2, X3, ... XT) models the value of a series of random variables
 each takes a value from S with a certain probability P(X=si)
 the entire sequence tells us the weather over T days
Statistical weather model
 If we want to predict the weather for day t+1, our model might
look like this:
P( X t 1  sk | X1...X t )
 E.g. P(weather tomorrow = sunny), conditional on the weather
in the past t days.
 Problem: the larger t gets, the more calculations we have to
make.
Markov Properties I: Limited horizon
 The probability that we’re in state s_i at time t+1 only depends on where we were at time t:

P(X_{t+1} = s_i | X_1 ... X_t) = P(X_{t+1} = s_i | X_t)

 Given this assumption, the probability of any sequence is just:

P(X_1, ..., X_T) = ∏_{i=1}^{T} P(X_i | X_{i-1})
Markov Properties II: Time invariance
 The probability of being in state si given the previous state does
not change over time:
P( X t 1  si | X t )  P( X 2  si | X 1 )
Concrete instantiation
                     Day t+1
Day t        sunny    rainy    foggy
sunny        0.8      0.05     0.15
rainy        0.2      0.6      0.2
foggy        0.2      0.3      0.5

This is essentially a transition matrix, which gives us
probabilities of going from one state to the other.
We can denote state transition probabilities as a_{ij} (the probability
of going from state i to state j).
Graphical view
 Components of the model:
1. states (s)
2. transitions
3. transition probabilities
4. initial probability distribution for states
 Essentially, a non-deterministic finite state automaton.
Example continued
 If the weather today (X_t) is sunny, what’s the probability that
tomorrow (X_{t+1}) is sunny and the day after (X_{t+2}) is rainy?

P(X_{t+1} = s, X_{t+2} = r | X_t = s)
  = P(X_{t+2} = r | X_{t+1} = s, X_t = s) · P(X_{t+1} = s | X_t = s)
  = P(X_{t+2} = r | X_{t+1} = s) · P(X_{t+1} = s | X_t = s)    [Markov assumption]
  = 0.05 · 0.8
  = 0.04
Formal definition
 A Markov Model is a triple (S, Π, A) where:
 S is the set of states
 Π are the probabilities of being initially in some state
 A are the transition probabilities
Part 2
Hidden Markov Models
A slight variation on the example
 You’re locked in a room with no windows
 You can’t observe the weather directly
 You only observe whether the guy who brings you food is carrying an umbrella
or not
 Need a model telling you the probability of seeing the umbrella, given the weather
 distinction between observations and their underlying emitting state.
 Define:
 Ot as an observation at time t
 K = {+umbrella, -umbrella} as the possible outputs
 We’re interested in P(Ot=k|Xt=si)
 i.e. p. of a given observation at t given that the underlying weather state at t is si
Symbol emission probabilities
weather      Probability of umbrella
sunny        0.1
rainy        0.8
foggy        0.3
This is the hidden model, telling us the probability that
Ot = k given that Xt = si
We assume that each underlying state Xt = si emits an
observation with a given probability.
Using the hidden model
 Model gives: P(O_t = k | X_t = s_i)
 Then, by Bayes’ Rule we can compute: P(X_t = s_i | O_t = k)

P(X_t = s_i | O_t = k) = P(O_t = k | X_t = s_i) · P(X_t = s_i) / P(O_t = k)
 Generalises easily to an entire sequence
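
A minimal sketch of this Bayes-rule inversion in Python, using the umbrella emission table above and assuming, purely for illustration, a uniform prior P(X_t = s_i) over the three weather states:

```python
# Emission probabilities P(umbrella | weather) from the table above.
p_umbrella = {"sunny": 0.1, "rainy": 0.8, "foggy": 0.3}

# Prior P(X_t = s_i): a uniform prior is assumed here purely for illustration.
prior = {s: 1.0 / 3 for s in p_umbrella}

# Bayes' rule: P(weather | umbrella) = P(umbrella | weather) * P(weather) / P(umbrella)
evidence = sum(p_umbrella[s] * prior[s] for s in prior)          # P(O_t = umbrella)
posterior = {s: p_umbrella[s] * prior[s] / evidence for s in prior}

print(posterior)  # rainy comes out as the most probable state given an umbrella
```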
HMM in graphics
 Circles indicate states
 Arrows indicate probabilistic dependencies between states
HMM in graphics
 Green nodes are hidden states
 Each hidden state depends only on the
previous state (Markov assumption)
Why HMMs?
 HMMs are a way of thinking of underlying events
probabilistically generating surface events.
 Example: Parts of speech
 a POS is a class or set of words
 we can think of language as an underlying Markov Chain of
parts of speech from which actual words are generated
(“emitted”)
 So what are our hidden states here, and what are the
observations?
HMMs in POS Tagging
[Figure: hidden POS states DET, ADJ, N, V]
 Hidden layer (constructed through
training)
 Models the sequence of POSs in the
training corpus
HMMs in POS Tagging
[Figure: hidden POS states DET, ADJ, N, V emitting the words “the tall lady is”]
 Observations are words.
 They are “emitted” by their corresponding
hidden state.
 The state depends on its previous state.
Why HMMs
 There are efficient algorithms to train HMMs using
Expectation Maximisation
 General idea:
 training data is assumed to have been generated by some HMM
(parameters unknown)
 try and learn the unknown parameters in the data
 Similar idea is used in finding the parameters of some n-gram
models, especially those that use interpolation.
Part 3
Formalisation of a Hidden Markov model
Crucial ingredients (familiar)
 Underlying states: S = {s_1, ..., s_N}
 Output alphabet (observations): K = {k_1, ..., k_M}
 State transition probabilities:
A = {a_{ij}}, i, j ∈ S
 State sequence: X = (X_1, ..., X_{T+1})
+ a function mapping each X_t to a state s
 Output sequence: O = (O_1, ..., O_T)
 where each o_t ∈ K
Crucial ingredients (additional)
 Initial state probabilities:
Π = {π_i}, i ∈ S
(tell us the initial probability of each state)
 Symbol emission probabilities:
B = {b_{ijk}}, i, j ∈ S, k ∈ K
(tell us the probability b of seeing observation O_t = k, given
that X_t = s_i and X_{t+1} = s_j)
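
To make these ingredients concrete, here is a small Python sketch of the weather/umbrella example in this notation. The slides define arc-emission probabilities b_{ijk}; since the umbrella table conditions only on the current state, the sketch simply sets b_{ijk} = P(k | s_i) for every successor state j, and assumes a uniform initial distribution Π (both assumptions are for illustration only).

```python
# States, output alphabet, and parameters mu = (A, B, Pi) for the
# weather/umbrella example, in the notation of these slides.

S = ["sunny", "rainy", "foggy"]        # hidden states
K = ["+umbrella", "-umbrella"]         # output alphabet

# Initial state probabilities Pi (uniform here; an assumption for illustration).
Pi = {s: 1.0 / len(S) for s in S}

# Transition probabilities A = {a_ij}, from the earlier weather table.
A = {
    "sunny": {"sunny": 0.8, "rainy": 0.05, "foggy": 0.15},
    "rainy": {"sunny": 0.2, "rainy": 0.6,  "foggy": 0.2},
    "foggy": {"sunny": 0.2, "rainy": 0.3,  "foggy": 0.5},
}

# P(umbrella | weather), from the emission table earlier.
p_umbrella = {"sunny": 0.1, "rainy": 0.8, "foggy": 0.3}

# Arc-emission probabilities B = {b_ijk}: here the emission depends only on
# the source state i, so b_ijk = P(k | s_i) for every successor state j.
B = {
    i: {j: {"+umbrella": p_umbrella[i], "-umbrella": 1.0 - p_umbrella[i]} for j in S}
    for i in S
}
```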
Trellis diagram of an HMM
[Figure: trellis fragment showing state s1 with transition probabilities a_{1,1}, a_{1,2}, a_{1,3} to states s1, s2, s3]
Trellis diagram of an HMM
[Figure: the same trellis unrolled over the observation sequence o1, o2, o3 at times t1, t2, t3]
Trellis diagram of an HMM
[Figure: the trellis with symbol emission probabilities b_{1,1,k}, b_{1,2,k}, b_{1,3,k} added to the transitions]
The fundamental questions for HMMs
1. Given a model μ = (A, B, Π), how do we compute the likelihood of an observation, P(O | μ)?
2. Given an observation sequence O, and model μ, which is the state sequence (X_1, ..., X_{T+1}) that best explains the observations?
    This is the decoding problem.
3. Given an observation sequence O, and a space of possible models μ = (A, B, Π), which model best explains the observed data?
Application of question 1 (ASR)
 Given a model μ = (A, B, Π), how do we compute the likelihood of an observation P(O | μ)?
 Input of an ASR system: a continuous stream of sound waves, which is ambiguous
 Need to decode it into a sequence of phones.
 is the input the sequence [n iy d] or [n iy]?
 which sequence is the most probable?
Application of question 2 (POS Tagging)
 Given an observation sequence O, and model μ, which is the state sequence (X_1, ..., X_{T+1}) that best explains the observations?
 this is the decoding problem
 Consider a POS Tagger
 Input observation sequence:
 I can read
 need to find the most likely sequence of underlying POS tags:
 e.g. is can a modal verb, or the noun?
 how likely is it that can is a noun, given that the previous word is a pronoun?
Part 4
Finding the probability of an observation sequence
Simplified trellis diagram representation
[Figure: trellis with hidden phone states n, iy, dh between start and end, above the observations o_1 ... o_T]
 Hidden layer: transitions between sounds forming the words need, knee…
 This is our model μ = (A, B, Π)
Simplified trellis diagram representation
[Figure: the same trellis; the observation layer o_1 ... o_T is highlighted]
 Visible layer is what ASR is given as input
Computing the probability of an observation
[Figure: trellis with hidden phone states n, iy, dh and observations o_1 ... o_T]
O = (o_1 ... o_T),  μ = (A, B, Π)
Compute P(O | μ)
Computing the probability of an observation
[Figure: hidden states x_1 ... x_T emitting observations o_1 ... o_T]
P(O | X, μ) = b_{x_1 x_2 o_1} · b_{x_2 x_3 o_2} · ... · b_{x_T x_{T+1} o_T}
Computing the probability of an observation
[Figure: hidden states x_1 ... x_T emitting observations o_1 ... o_T]
P(O | X, μ) = b_{x_1 x_2 o_1} · b_{x_2 x_3 o_2} · ... · b_{x_T x_{T+1} o_T}
P(X | μ) = π_{x_1} · a_{x_1 x_2} · a_{x_2 x_3} · ... · a_{x_T x_{T+1}}
Computing the probability of an observation
[Figure: hidden states x_1 ... x_T emitting observations o_1 ... o_T]
P(O | X, μ) = b_{x_1 x_2 o_1} · b_{x_2 x_3 o_2} · ... · b_{x_T x_{T+1} o_T}
P(X | μ) = π_{x_1} · a_{x_1 x_2} · a_{x_2 x_3} · ... · a_{x_T x_{T+1}}
P(O, X | μ) = P(O | X, μ) · P(X | μ)
Computing the probability of an observation
[Figure: hidden states x_1 ... x_T emitting observations o_1 ... o_T]
P(O | X, μ) = b_{x_1 x_2 o_1} · b_{x_2 x_3 o_2} · ... · b_{x_T x_{T+1} o_T}
P(X | μ) = π_{x_1} · a_{x_1 x_2} · a_{x_2 x_3} · ... · a_{x_T x_{T+1}}
P(O, X | μ) = P(O | X, μ) · P(X | μ)
P(O | μ) = Σ_X P(O | X, μ) · P(X | μ)
Computing the probability of an observation
[Figure: hidden states x_1 ... x_T emitting observations o_1 ... o_T]
P(O | μ) = Σ_{x_1 ... x_{T+1}} π_{x_1} ∏_{t=1}^{T} a_{x_t x_{t+1}} b_{x_t x_{t+1} o_t}
A final word on observation probabilities
 Since we’re computing the probability of an observation
given a model, we can use these methods to compare
different models
 if we take observations in our corpus as given, then the best
model is the one which maximises the probability of these
observations
 (useful for training/parameter setting)
Part 5
The forward procedure
Forward Procedure
 Given our phone input, how do we decide whether the actual
word is need, knee, …?
 Could compute p(O|μ) for every single word
 Highly expensive in terms of computation
Forward procedure
 An efficient solution to resolving the problem
 based on dynamic programming (memoisation)
 rather than perform separate computations for all possible
sequences X, keep in memory partial solutions
Forward procedure
 Network representation of all sequences (X) of states that could generate
the observations
 sum of probabilities for those sequences
 E.g. O=[n iy] could be generated by
 X1 = [n iy d] (need)
 X2 = [n iy t] (neat)
 shared histories can help us save on memory
 Fundamental assumption:
 Given several state sequences of length t+1 with shared history up to t
 probability of first t observations is the same in all of them
Forward Procedure
[Figure: hidden states x_1 ... x_T emitting observations o_1 ... o_T]
• Probability of the first t observations is the same for all possible t+1-length state sequences.
• Define a forward variable:
α_i(t) = P(o_1 ... o_{t-1}, X_t = i | μ)
(the probability of ending up in state s_i at time t after observations 1 to t-1)
Forward Procedure: initialisation
[Figure: hidden states x_1 ... x_T emitting observations o_1 ... o_T]
• Define:
α_i(1) = π_i
(the probability of being in state s_i first is just equal to the initialisation probability)
Forward Procedure (inductive step)
[Figure: hidden states x_1 ... x_T emitting observations o_1 ... o_T]
α_j(t+1) = Σ_{i=1}^{N} α_i(t) a_{ij} b_{ij o_t}
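
The recurrence translates directly into code. Below is a compact sketch of the forward procedure (arc-emission form, as in the slides), again assuming the S, Pi, A, B dictionaries from the earlier sketch; the function names are ours.

```python
def forward(O, S, Pi, A, B):
    """Forward variables alpha[t][j] = P(o_1 ... o_{t-1}, X_t = j | mu).

    Time is 1-based as on the slides, so t runs from 1 to T+1.
    """
    T = len(O)
    alpha = [None] * (T + 2)                      # index 0 unused
    alpha[1] = {j: Pi[j] for j in S}              # initialisation: alpha_i(1) = pi_i
    for t in range(1, T + 1):                     # inductive step
        alpha[t + 1] = {
            j: sum(alpha[t][i] * A[i][j] * B[i][j][O[t - 1]] for i in S)
            for j in S
        }
    return alpha

def likelihood_forward(O, S, Pi, A, B):
    """P(O | mu) = sum_i alpha_i(T+1)."""
    alpha = forward(O, S, Pi, A, B)
    return sum(alpha[len(O) + 1].values())
```

This does the same job as the brute-force sum above, but in time linear in T, because partial solutions for shared histories are computed only once.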
Looking backward
 The forward procedure caches the probability of sequences
of states leading up to an observation (left to right).
 The backward procedure works the other way:
 probability of seeing the rest of the obs sequence given that we
were in some state at some time
Backward procedure: basic structure
 Define:
β_i(t) = P(o_t ... o_T | X_t = i, μ)
 probability of the remaining observations given that current obs is emitted by state i
 Initialise:
β_i(T+1) = 1
 probability at the final state
 Inductive step:
β_i(t) = Σ_{j=1}^{N} a_{ij} b_{ij o_t} β_j(t+1)
 Total:
P(O | μ) = Σ_{i=1}^{N} π_i β_i(1)
Combining forward & backward variables
 Our two variables can be combined:
P(O, X_t = i | μ) = α_i(t) · β_i(t)
 the likelihood of being in state i at time t with our sequence of observations is a function of:
 the probability of ending up in i at t given what came previously
 the probability of being in i at t given the rest
 Therefore:
P(O | μ) = Σ_{i=1}^{N} α_i(t) · β_i(t),  for any 1 ≤ t ≤ T+1
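
Using the forward and backward sketches above, this identity can be checked numerically on the example model: Σ_i α_i(t)·β_i(t) should give the same value for every t (up to floating-point error) and agree with both likelihood computations.

```python
O = ["+umbrella", "+umbrella", "-umbrella"]       # an example observation sequence
alpha = forward(O, S, Pi, A, B)
beta = backward(O, S, Pi, A, B)

# The combined quantity should be the same for every t, and should match
# the forward-only and backward-only computations of P(O | mu).
for t in range(1, len(O) + 2):
    print(t, sum(alpha[t][i] * beta[t][i] for i in S))
print(likelihood_forward(O, S, Pi, A, B), likelihood_backward(O, S, Pi, A, B))
```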
Part 6
Decoding: Finding the best state sequence
Best state sequence: example
 Consider the ASR problem again
 Input observation sequence:
 [aa n iy dh ax]
 (corresponds to I need the…)
 Possible solutions:
 I need a…
 I need the…
 I kneed a…
 …
Problem is to find best word
segmentation and most likely
underlying phonetic input.
 NB: each possible solution corresponds to a state sequence.
Some difficulties…
 If we focus on the likelihood of each individual state, we run
into problems
 context effects mean that what is individually likely may
together yield an unlikely sequence
 the ASR program needs to look at the probability of entire
sequences
Viterbi algorithm
 Given an observation sequence O and a model μ, find:
 argmax_X P(X, O | μ)
 the sequence of states X such that P(X, O | μ) is highest
 Basic idea:
 run a type of forward procedure (computes probability of all possible
paths)
 store partial solutions
 at the end, look back to find the best path
Illustration: path through the trellis
[Figure: trellis with states S1–S4 over times t = 1 to 7, with a path traced through it]
At every node (state) and time, we store:
• the likelihood of reaching that state at that time by the most probable path leading to that state (denoted δ)
• the preceding state leading to the current state (denoted ψ)
Viterbi Algorithm: definitions
[Figure: hidden states x_1 ... x_T emitting observations o_1 ... o_T]
δ_j(t) = max_{x_1 ... x_{t-1}} P(X_1 ... X_{t-1}, o_1 ... o_{t-1}, X_t = j | μ)
The probability of the most probable path from observation 1 to t-1, landing us in state j at t.
Viterbi Algorithm: initialisation
[Figure: hidden states x_1 ... x_T emitting observations o_1 ... o_T]
δ_j(1) = π_j
The probability of being in state j at the beginning is just
the initialisation probability of state j.
Viterbi Algorithm: inductive step
[Figure: hidden states x_1 ... x_T emitting observations o_1 ... o_T]
δ_j(t) = max_{x_1 ... x_{t-1}} P(X_1 ... X_{t-1}, o_1 ... o_{t-1}, X_t = j | μ)
δ_j(t+1) = max_i δ_i(t) a_{ij} b_{ij o_t}
Probability of being in j at t+1 depends on:
• the best previous state i (the one maximising δ_i(t) · a_{ij})
• the probability of emitting the observed symbol on the transition from i to j
Viterbi Algorithm: inductive step
[Figure: hidden states x_1 ... x_T emitting observations o_1 ... o_T]
δ_j(t) = max_{x_1 ... x_{t-1}} P(X_1 ... X_{t-1}, o_1 ... o_{t-1}, X_t = j | μ)
δ_j(t+1) = max_i δ_i(t) a_{ij} b_{ij o_t}
ψ_j(t+1) = argmax_i δ_i(t) a_{ij} b_{ij o_t}
Backtrace store: the most probable state from which state j can be reached.
Illustration
[Figure: trellis with states S1–S4 over times t = 1 to 7, with the most probable path into state 2 at t = 6 marked]
δ_2(t=6) = probability of reaching state 2 at time t=6 by the most probable path (marked) through state 2 at t=6
ψ_2(t=6) = 3 is the state preceding state 2 at t=6 on the most probable path through state 2 at t=6
Viterbi Algorithm: backtrace
[Figure: hidden states x_1 ... x_T emitting observations o_1 ... o_T]
X̂_T = argmax_i δ_i(T)
The best state at T is that state i for which the probability δ_i(T) is highest.
Viterbi Algorithm: backtrace
[Figure: hidden states x_1 ... x_T emitting observations o_1 ... o_T]
X̂_T = argmax_i δ_i(T)
X̂_t = ψ_{X̂_{t+1}}(t+1)
Work backwards to the most likely preceding state.
Viterbi Algorithm: backtrace
[Figure: hidden states x_1 ... x_T emitting observations o_1 ... o_T]
X̂_T = argmax_i δ_i(T)
X̂_t = ψ_{X̂_{t+1}}(t+1)
P(X̂) = max_i δ_i(T)
The probability of the best state sequence is the maximum value stored for the final state T.
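
Putting the δ/ψ recurrences and the backtrace together, here is a compact Viterbi sketch under the same assumptions as the earlier code (S, Pi, A, B dictionaries, arc-emission form); the function name is ours.

```python
def viterbi(O, S, Pi, A, B):
    """Most probable state sequence X_1 ... X_{T+1} for observations O.

    delta[t][j]: probability of the best path ending in state j at time t.
    psi[t][j]:   the predecessor of j on that best path.
    """
    T = len(O)
    delta = [None] * (T + 2)
    psi = [None] * (T + 2)
    delta[1] = {j: Pi[j] for j in S}                          # delta_j(1) = pi_j
    for t in range(1, T + 1):
        delta[t + 1], psi[t + 1] = {}, {}
        for j in S:
            # best predecessor i, maximising delta_i(t) * a_ij * b_{ij o_t}
            best_i = max(S, key=lambda i: delta[t][i] * A[i][j] * B[i][j][O[t - 1]])
            delta[t + 1][j] = delta[t][best_i] * A[best_i][j] * B[best_i][j][O[t - 1]]
            psi[t + 1][j] = best_i
    # Backtrace from the best final state.
    path = [max(S, key=lambda i: delta[T + 1][i])]
    for t in range(T + 1, 1, -1):
        path.append(psi[t][path[-1]])
    path.reverse()
    return path, max(delta[T + 1].values())

# Example: most likely weather sequence behind umbrella, umbrella, no umbrella.
# print(viterbi(["+umbrella", "+umbrella", "-umbrella"], S, Pi, A, B))
```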
Summary
 We’ve looked at two algorithms for solving two of the
fundamental problems of HMMs:
 likelihood of an observation sequence given a model
(Forward/Backward Procedure)
 most likely underlying state sequence, given an observation sequence
(Viterbi Algorithm)
 Next up:
 we look at POS tagging