Corpora and Statistical Methods Lecture 15
Corpora and Statistical Methods
Albert Gatt
Acknowledgement
Some of the examples in this lecture are taken from a tutorial
on HMMs by Wolfgang Maass
Talking about the weather
Suppose we want to predict tomorrow’s weather. The
possible predictions are:
sunny
foggy
rainy
We might decide to predict tomorrow’s outcome based on
earlier weather
if it’s been sunny all week, it’s likelier to be sunny tomorrow than if it
had been rainy all week
how far back do we want to go to predict tomorrow’s weather?
Statistical weather model
Notation:
S: the state space, a set of possible values for the weather: {sunny, foggy,
rainy}
(each state is identifiable by an integer i)
X: a sequence of random variables, each taking a value from S
these model weather over a sequence of days
t is an integer standing for time
(X1, X2, X3, ... XT) models the value of a series of random variables
each takes a value from S with a certain probability P(X=si)
the entire sequence tells us the weather over T days
Statistical weather model
If we want to predict the weather for day t+1, our model might
look like this:
P(Xt+1 = sk | X1 … Xt)
E.g. P(weather tomorrow = sunny), conditional on the weather
in the past t days.
Problem: the larger t gets, the more calculations we have to
make.
Markov Properties I: Limited horizon
The probability that we’re in state si at time t+1 only
depends on where we were at time t:
P(Xt+1 = si | X1 … Xt) = P(Xt+1 = si | Xt)
Given this assumption, the probability of any sequence is
just:
P(X1, …, XT) = ∏i=1…T P(Xi | Xi-1)
Markov Properties II: Time invariance
The probability of being in state si given the previous state does
not change over time:
P(Xt+1 = si | Xt) = P(X2 = si | X1)
Concrete instantiation

                Day t+1
Day t      sunny   rainy   foggy
sunny      0.8     0.05    0.15
rainy      0.2     0.6     0.2
foggy      0.2     0.3     0.5
This is essentially a transition matrix, which gives us
probabilities of going from one state to the other.
We can denote state transition probabilities as aij (prob.
of going from state i to state j)
Graphical view
Components of the model:
1. states (s)
2. transitions
3. transition probabilities
4. initial probability distribution for states
Essentially, a non-deterministic finite
state automaton.
Example continued
If the weather today (Xt) is sunny, what’s the probability that
tomorrow (Xt+1) is sunny and the day after (Xt+2) is rainy?
P(Xt+1 = s, Xt+2 = r | Xt = s)
= P(Xt+2 = r | Xt+1 = s, Xt = s) · P(Xt+1 = s | Xt = s)
= P(Xt+2 = r | Xt+1 = s) · P(Xt+1 = s | Xt = s)   (by the Markov assumption)
= 0.05 × 0.8
= 0.04
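The computation above can be sketched directly from the transition table. The dictionary layout and the function name below are illustrative choices, not part of the lecture:

```python
# Transition probabilities from the lecture's weather model
# (rows: day t, columns: day t+1), indexed by state name.
A = {
    "sunny": {"sunny": 0.8, "rainy": 0.05, "foggy": 0.15},
    "rainy": {"sunny": 0.2, "rainy": 0.6,  "foggy": 0.2},
    "foggy": {"sunny": 0.2, "rainy": 0.3,  "foggy": 0.5},
}

def two_step_prob(today, day1, day2):
    """P(X_{t+1}=day1, X_{t+2}=day2 | X_t=today), using the Markov assumption."""
    return A[today][day1] * A[day1][day2]

print(two_step_prob("sunny", "sunny", "rainy"))  # 0.8 * 0.05 ≈ 0.04
```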
Formal definition
A Markov Model is a triple (S, Π, A) where:
S is the set of states
Π are the probabilities of being initially in some state
A are the transition probabilities
Part 2
Hidden Markov Models
A slight variation on the example
You’re locked in a room with no windows
You can’t observe the weather directly
You only observe whether the guy who brings you food is carrying an umbrella
or not
Need a model telling you the probability of seeing the umbrella, given the weather
distinction between observations and their underlying emitting state.
Define:
Ot as an observation at time t
K = {+umbrella, -umbrella} as the possible outputs
We’re interested in P(Ot=k|Xt=si)
i.e. p. of a given observation at t given that the underlying weather state at t is si
Symbol emission probabilities

weather   Probability of umbrella
sunny     0.1
rainy     0.8
foggy     0.3
This is the hidden model, telling us the probability that
Ot = k given that Xt = si
We assume that each underlying state Xt = si emits an
observation with a given probability.
Using the hidden model
Model gives: P(Ot = k | Xt = si)
Then, by Bayes' Rule we can compute: P(Xt = si | Ot = k)

P(Xt = si | Ot = k) = P(Ot = k | Xt = si) · P(Xt = si) / P(Ot = k)
Generalises easily to an entire sequence
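Bayes' Rule can be sketched on the umbrella model. The prior P(Xt = si) is not given in the slides, so a uniform prior is assumed here purely for illustration; the function name is mine:

```python
# Emission probabilities from the lecture: P(umbrella | weather).
p_umbrella = {"sunny": 0.1, "rainy": 0.8, "foggy": 0.3}

# Assumed uniform prior over weather states (not specified in the slides).
prior = {s: 1 / 3 for s in p_umbrella}

def posterior(saw_umbrella: bool):
    """P(Xt = si | Ot = k) via Bayes' rule, for every state si."""
    like = {s: p if saw_umbrella else 1 - p for s, p in p_umbrella.items()}
    evidence = sum(like[s] * prior[s] for s in prior)  # P(Ot = k)
    return {s: like[s] * prior[s] / evidence for s in prior}

print(posterior(True))  # rainy is the most probable state given an umbrella
```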
HMM in graphics

[Figure: an HMM drawn as a graph. Circles indicate states; arrows indicate probabilistic dependencies between states; green nodes are hidden states. Each hidden state depends only on the previous state (Markov assumption).]
Why HMMs?
HMMs are a way of thinking of underlying events
probabilistically generating surface events.
Example: Parts of speech
a POS is a class or set of words
we can think of language as an underlying Markov Chain of
parts of speech from which actual words are generated
(“emitted”)
So what are our hidden states here, and what are the
observations?
HMMs in POS Tagging

[Figure: hidden layer DET → ADJ → N → V, emitting the observed words the, tall, lady, is.]

The hidden layer (constructed through training) models the sequence of POSs in the training corpus.
Observations are words. They are "emitted" by their corresponding hidden state. The state depends on its previous state.
Why HMMs
There are efficient algorithms to train HMMs using
Expectation Maximisation
General idea:
training data is assumed to have been generated by some HMM
(parameters unknown)
try and learn the unknown parameters in the data
Similar idea is used in finding the parameters of some n-gram
models, especially those that use interpolation.
Part 3
Formalisation of a Hidden Markov model
Crucial ingredients (familiar)
Underlying states: S = {s1,…,sN}
Output alphabet (observations): K = {k1,…,kM}
State transition probabilities:
A = {aij}, i, j ∈ S
State sequence: X = (X1, …, XT+1)
+ a function mapping each Xt to a state s
Output sequence: O = (O1, …, OT)
where each Ot ∈ K
Crucial ingredients (additional)
Initial state probabilities:
Π = {πi}, i ∈ S
(tell us the initial probability of each state)
Symbol emission probabilities:
B = {bijk}, i, j ∈ S, k ∈ K
(tell us the probability b of seeing observation Ot = k, given that Xt = si and Xt+1 = sj)
Trellis diagram of an HMM

[Figure, built up over three slides: a trellis unfolding over times t1, t2, t3. States s1, s2, s3 are connected by transition probabilities a1,1, a1,2, a1,3; each transition carries an emission probability b1,1,k, b1,2,k, b1,3,k; the observation sequence o1, o2, o3 runs along the bottom.]
The fundamental questions for HMMs
1. Given a model μ = (A, B, Π), how do we compute the likelihood of an observation P(O|μ)?
2. Given an observation sequence O, and model μ, which is the state sequence (X1, …, XT+1) that best explains the observations? This is the decoding problem.
3. Given an observation sequence O, and a space of possible models μ = (A, B, Π), which model best explains the observed data?
Application of question 1 (ASR)
Given a model μ = (A, B, Π), how do we compute the
likelihood of an observation P(O| μ)?
Input of an ASR system: a continuous stream of sound
waves, which is ambiguous
Need to decode it into a sequence of phones.
is the input the sequence [n iy d] or [n iy]?
which sequence is the most probable?
Application of question 2 (POS Tagging)
Given an observation sequence O, and model μ, which is the state sequence
(X1,…,Xt+1) that best explains the observations?
this is the decoding problem
Consider a POS Tagger
Input observation sequence:
I can read
need to find the most likely sequence of underlying POS tags:
e.g. is can a modal verb, or the noun?
how likely is it that can is a noun, given that the previous word is
a pronoun?
Part 4
Finding the probability of an observation sequence
Simplified trellis diagram representation

[Figure: a trellis with a start and an end state; the hidden layer contains the sounds n, iy, dh; the visible layer is the observation sequence o1 … oT.]

The hidden layer models transitions between sounds forming the words need, knee, …
This is our model μ = (A, B, Π).
The visible layer is what the ASR system is given as input.
Computing the probability of an observation

Given O = (o1 … oT) and μ = (A, B, Π), compute P(O | μ).
Computing the probability of an observation

[Figure: hidden states x1 … xT, each emitting one of the observations o1 … oT.]

For a given state sequence X:
P(O | X, μ) = bx1x2o1 · bx2x3o2 · … · bxTxT+1oT

The probability of the state sequence itself:
P(X | μ) = πx1 · ax1x2 · ax2x3 · … · axTxT+1

The joint probability:
P(O, X | μ) = P(O | X, μ) · P(X | μ)

Marginalising over all state sequences:
P(O | μ) = ΣX P(O | X, μ) · P(X | μ)
= Σ{x1…xT+1} πx1 · ∏t=1…T axtxt+1 · bxtxt+1ot
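This marginalisation can be checked by brute-force enumeration for short sequences (it is exponential in T, which is exactly the cost the forward procedure avoids). A sketch on the weather/umbrella model: note it uses the simpler state-emission variant (b depends only on the current state) rather than the slides' arc emissions bijk, and assumes a uniform initial distribution; the "+umb"/"-umb" labels are mine:

```python
from itertools import product

# Weather HMM assembled from the lecture's tables, with assumed
# uniform initial probabilities and state-emission probabilities.
states = ["sunny", "rainy", "foggy"]
pi = {s: 1 / 3 for s in states}
A = {"sunny": {"sunny": 0.8, "rainy": 0.05, "foggy": 0.15},
     "rainy": {"sunny": 0.2, "rainy": 0.6,  "foggy": 0.2},
     "foggy": {"sunny": 0.2, "rainy": 0.3,  "foggy": 0.5}}
B = {"sunny": {"+umb": 0.1, "-umb": 0.9},
     "rainy": {"+umb": 0.8, "-umb": 0.2},
     "foggy": {"+umb": 0.3, "-umb": 0.7}}

def brute_force_likelihood(obs):
    """P(O | mu) = sum over all state sequences X of P(O | X, mu) P(X | mu)."""
    total = 0.0
    for seq in product(states, repeat=len(obs)):
        p = pi[seq[0]] * B[seq[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[seq[t - 1]][seq[t]] * B[seq[t]][obs[t]]
        total += p
    return total

print(brute_force_likelihood(["+umb", "-umb", "+umb"]))
```

Summing this likelihood over every possible observation sequence of a fixed length gives 1, a useful sanity check on the model.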
A final word on observation probabilities
Since we’re computing the probability of an observation
given a model, we can use these methods to compare
different models
if we take observations in our corpus as given, then the best
model is the one which maximises the probability of these
observations
(useful for training/parameter setting)
Part 5
The forward procedure
Forward Procedure
Given our phone input, how do we decide whether the actual
word is need, knee, …?
Could compute P(O|μ) for every single word
Highly expensive in terms of computation
Forward procedure
An efficient solution to resolving the problem
based on dynamic programming (memoisation)
rather than perform separate computations for all possible
sequences X, keep in memory partial solutions
Forward procedure
Network representation of all sequences (X) of states that could generate
the observations
sum of probabilities for those sequences
E.g. O=[n iy] could be generated by
X1 = [n iy d] (need)
X2 = [n iy t] (neat)
shared histories can help us save on memory
Fundamental assumption:
Given several state sequences of length t+1 with shared history up to t
probability of first t observations is the same in all of them
Forward Procedure

[Figure: trellis of hidden states x1 … xT with observations o1 … oT.]

Probability of the first t observations is the same for all possible t+1-length state sequences.

Define a forward variable:
αi(t) = P(o1 … ot-1, Xt = i | μ)

the probability of ending up in state i at time t after observations 1 to t-1.
Forward Procedure: initialisation

Define:
αi(1) = πi

The probability of being in state i first is just equal to the initialisation probability.
Forward Procedure (inductive step)

αj(t+1) = Σi=1…N αi(t) · aij · bijot
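The inductive step maps directly onto dynamic programming: each new column of α values is computed from the previous one. A minimal sketch on the weather/umbrella model, again using the common state-emission variant (b depends only on the current state, unlike the slides' arc emissions) with an assumed uniform initial distribution:

```python
# Weather HMM with assumed uniform Pi and state-emission probabilities.
states = ["sunny", "rainy", "foggy"]
pi = {s: 1 / 3 for s in states}
A = {"sunny": {"sunny": 0.8, "rainy": 0.05, "foggy": 0.15},
     "rainy": {"sunny": 0.2, "rainy": 0.6,  "foggy": 0.2},
     "foggy": {"sunny": 0.2, "rainy": 0.3,  "foggy": 0.5}}
B = {"sunny": {"+umb": 0.1, "-umb": 0.9},
     "rainy": {"+umb": 0.8, "-umb": 0.2},
     "foggy": {"+umb": 0.3, "-umb": 0.7}}

def forward(obs):
    """Return P(O | mu) by the forward recursion.

    alpha[i] holds P(o_1 ... o_t, X_t = i | mu) for the current t,
    so only one column of the trellis is kept in memory."""
    alpha = {i: pi[i] * B[i][obs[0]] for i in states}        # initialisation
    for t in range(1, len(obs)):                             # inductive step
        alpha = {j: sum(alpha[i] * A[i][j] for i in states) * B[j][obs[t]]
                 for j in states}
    return sum(alpha.values())                               # total

print(forward(["+umb", "+umb", "-umb"]))
```

The recursion costs O(T·N²) instead of the O(N^T) of brute-force enumeration.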
Looking backward
The forward procedure caches the probability of sequences
of states leading up to an observation (left to right).
The backward procedure works the other way:
probability of seeing the rest of the obs sequence given that we
were in some state at some time
Backward procedure: basic structure

Define:
βi(t) = P(ot … oT | Xt = i, μ)
probability of the remaining observations given that the current observation is emitted by state i

Initialise:
βi(T+1) = 1
probability at the final state

Inductive step:
βi(t) = Σj=1…N aij · bijot · βj(t+1)

Total:
P(O | μ) = Σi=1…N πi · βi(1)
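The backward procedure can be sketched under the same illustrative assumptions as before (state-emission b rather than the slides' arc emissions, assumed uniform initial distribution); running both directions on the same input checks that they agree on P(O | μ):

```python
# Weather HMM with assumed uniform Pi and state-emission probabilities.
states = ["sunny", "rainy", "foggy"]
pi = {s: 1 / 3 for s in states}
A = {"sunny": {"sunny": 0.8, "rainy": 0.05, "foggy": 0.15},
     "rainy": {"sunny": 0.2, "rainy": 0.6,  "foggy": 0.2},
     "foggy": {"sunny": 0.2, "rainy": 0.3,  "foggy": 0.5}}
B = {"sunny": {"+umb": 0.1, "-umb": 0.9},
     "rainy": {"+umb": 0.8, "-umb": 0.2},
     "foggy": {"+umb": 0.3, "-umb": 0.7}}

def forward_total(obs):
    """P(O | mu) computed left to right (forward procedure)."""
    alpha = {i: pi[i] * B[i][obs[0]] for i in states}
    for t in range(1, len(obs)):
        alpha = {j: sum(alpha[i] * A[i][j] for i in states) * B[j][obs[t]]
                 for j in states}
    return sum(alpha.values())

def backward_total(obs):
    """P(O | mu) computed right to left (backward procedure)."""
    beta = {i: 1.0 for i in states}                  # beta_i at the final time
    for t in range(len(obs) - 2, -1, -1):            # inductive step
        beta = {i: sum(A[i][j] * B[j][obs[t + 1]] * beta[j] for j in states)
                for i in states}
    return sum(pi[i] * B[i][obs[0]] * beta[i] for i in states)

obs = ["+umb", "-umb", "+umb"]
print(forward_total(obs), backward_total(obs))  # the two totals agree
```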
Combining forward & backward variables

Our two variables can be combined:
P(O, Xt = i | μ) = αi(t) · βi(t)

the likelihood of being in state i at time t with our sequence of observations is a function of:
the probability of ending up in i at t given what came previously
the probability of being in i at t given the rest

Therefore: P(O | μ) = Σi=1…N αi(t) · βi(t), 1 ≤ t ≤ T+1
Part 6
Decoding: Finding the best state sequence
Best state sequence: example
Consider the ASR problem again
Input observation sequence:
[aa n iy dh ax]
(corresponds to I need the…)
Possible solutions:
I need a…
I need the…
I kneed a…
…
Problem is to find best word
segmentation and most likely
underlying phonetic input.
NB: each possible solution corresponds to a state sequence.
Some difficulties…
If we focus on the likelihood of each individual state, we run
into problems
context effects mean that what is individually likely may
together yield an unlikely sequence
the ASR program needs to look at the probability of entire
sequences
Viterbi algorithm
Given an observation sequence O and a model μ, find:
argmaxX P(X, O | μ)
the sequence of states X such that P(X, O | μ) is highest
Basic idea:
run a type of forward procedure (computes probability of all possible
paths)
store partial solutions
at the end, look back to find the best path
Illustration: path through the trellis

[Figure: a trellis over states S1–S4 and times t = 1 … 7.]

At every node (state) and time, we store:
• the likelihood of reaching that state at that time by the most probable path leading to that state (denoted δ)
• the preceding state leading to the current state (denoted ψ)
Viterbi Algorithm: definitions

δj(t) = max{x1…xt-1} P(X1 … Xt-1, o1 … ot-1, Xt = j | μ)

The probability of the most probable path from observation 1 to t-1, landing us in state j at t.
Viterbi Algorithm: initialisation

δj(1) = πj

The probability of being in state j at the beginning is just the initialisation probability of state j.
Viterbi Algorithm: inductive step

δj(t+1) = maxi δi(t) · aij · bijot+1

Probability of being in j at t+1 depends on:
• the state i for which δi(t) · aij is highest
• the probability that j emits the symbol ot+1

ψj(t+1) = argmaxi δi(t) · aij · bijot+1

Backtrace store: the most probable state from which state j can be reached.
Illustration

[Figure: a trellis over states S1–S4 and times t = 1 … 7, with the most probable path marked.]

δ2(t=6) = probability of reaching state 2 at time t=6 by the most probable path (marked) through state 2 at t=6
ψ2(t=6) = 3 is the state preceding state 2 at t=6 on the most probable path through state 2 at t=6
Viterbi Algorithm: backtrace

X̂T = argmaxi δi(T)
The best state at T is that state i for which the probability δi(T) is highest.

X̂t = ψX̂t+1(t+1)
Work backwards to the most likely preceding state.

P(X̂) = maxi δi(T)
The probability of the best state sequence is the maximum value stored for the final state at time T.
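The three steps above (initialise δ, induct with backpointers ψ, backtrace from the best final state) can be sketched on the weather/umbrella model, under the same illustrative assumptions as the earlier sketches (state-emission b, assumed uniform initial distribution):

```python
# Weather HMM with assumed uniform Pi and state-emission probabilities.
states = ["sunny", "rainy", "foggy"]
pi = {s: 1 / 3 for s in states}
A = {"sunny": {"sunny": 0.8, "rainy": 0.05, "foggy": 0.15},
     "rainy": {"sunny": 0.2, "rainy": 0.6,  "foggy": 0.2},
     "foggy": {"sunny": 0.2, "rainy": 0.3,  "foggy": 0.5}}
B = {"sunny": {"+umb": 0.1, "-umb": 0.9},
     "rainy": {"+umb": 0.8, "-umb": 0.2},
     "foggy": {"+umb": 0.3, "-umb": 0.7}}

def viterbi(obs):
    """Return (most probable state sequence, its joint probability P(X, O | mu)).

    delta[t][j]: probability of the best path ending in state j at time t.
    psi[t][j]: backpointer to the best predecessor of j at time t."""
    delta = [{j: pi[j] * B[j][obs[0]] for j in states}]      # initialisation
    psi = [{}]
    for t in range(1, len(obs)):                             # inductive step
        d, p = {}, {}
        for j in states:
            best = max(states, key=lambda i: delta[-1][i] * A[i][j])
            d[j] = delta[-1][best] * A[best][j] * B[j][obs[t]]
            p[j] = best
        delta.append(d)
        psi.append(p)
    last = max(states, key=lambda j: delta[-1][j])           # best state at T
    path = [last]
    for t in range(len(obs) - 1, 0, -1):                     # backtrace
        path.insert(0, psi[t][path[0]])
    return path, delta[-1][last]

print(viterbi(["+umb", "+umb", "+umb"]))  # three umbrella days: all rainy
```

Note that the algorithm maximises over whole paths, not individual states, which is exactly the point made about context effects above.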
Summary
We’ve looked at two algorithms for solving two of the
fundamental problems of HMMs:
likelihood of an observation sequence given a model
(Forward/Backward Procedure)
the most likely underlying state sequence, given an observation sequence
(Viterbi Algorithm)
Next up:
we look at POS tagging