PowerPoint Presentation - A Risk Minimization Framework


Hidden Markov Models (HMMs)
1
Definition
• A Hidden Markov Model is a statistical model
where the system being modeled is assumed
to be a Markov process with unknown
parameters.
• The challenge is to determine the hidden
parameters from the observable parameters.
2
State Transitions
Markov Model Example:
– x = states of the Markov model
– a = transition probabilities
– b = output probabilities
– y = observable outputs
– How does this differ from a finite state machine?
3
Example
• A distant friend whom you talk to daily about his
activities (walk, shop, clean)
• You believe that the weather is a discrete
Markov chain (no memory) with two states
(rainy, sunny), but you can't observe them
directly. You know the average weather
patterns.
4
Code
states = ('Rainy', 'Sunny')
observations = ('walk', 'shop', 'clean')

# Initial (prior) probability of each weather state
start_probability = {'Rainy': 0.6, 'Sunny': 0.4}

# Transition probabilities between the hidden weather states
transition_probability = {
    'Rainy': {'Rainy': 0.7, 'Sunny': 0.3},
    'Sunny': {'Rainy': 0.4, 'Sunny': 0.6},
}

# Emission (output) probabilities of each activity given the weather
emission_probability = {
    'Rainy': {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},
    'Sunny': {'walk': 0.6, 'shop': 0.3, 'clean': 0.1},
}
5
Observations
• Given (walk, shop, clean)
– What is the probability of this sequence of
observations? (Is he really still at home, or did
he skip the country?)
– What was the most likely sequence of
rainy/sunny days?
6
Matrix
Trellis of (transition prob × output prob) entries for the observation
sequence (walk, shop, clean):

                      ending in Rainy       ending in Sunny
walk   (start)        .6 * .1               .4 * .6
shop   from Rainy     .7 * .4               .3 * .3
       from Sunny     .4 * .4               .6 * .3
clean  from Rainy     .7 * .5               .3 * .1
       from Sunny     .4 * .5               .6 * .1

P(Sunny, Rainy, Rainy; walk, shop, clean) = (.4*.6)(.4*.4)(.7*.5)
7
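The path probability on the last line can be computed directly from the slide-5 dictionaries; this is a minimal sketch (the loop layout and variable names are mine; the dictionaries are the ones above):

# P(Sunny, Rainy, Rainy; walk, shop, clean)
#   = start(Sunny)*emit(walk|Sunny) * trans(Sunny->Rainy)*emit(shop|Rainy)
#     * trans(Rainy->Rainy)*emit(clean|Rainy)
start_probability = {'Rainy': 0.6, 'Sunny': 0.4}
transition_probability = {'Rainy': {'Rainy': 0.7, 'Sunny': 0.3},
                          'Sunny': {'Rainy': 0.4, 'Sunny': 0.6}}
emission_probability = {'Rainy': {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},
                        'Sunny': {'walk': 0.6, 'shop': 0.3, 'clean': 0.1}}

path = ('Sunny', 'Rainy', 'Rainy')
obs = ('walk', 'shop', 'clean')

p = start_probability[path[0]] * emission_probability[path[0]][obs[0]]
for prev, cur, o in zip(path, path[1:], obs[1:]):
    p *= transition_probability[prev][cur] * emission_probability[cur][o]
print(p)   # (.4*.6)(.4*.4)(.7*.5) = 0.01344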
The CpG island problem
• Methylation in the human genome
– "CG" -> "TG" happens randomly except where
there is selection.
– One area of selection is the "start regions" of
genes
– CpG islands = 100-1,000 bases before a gene
starts
• Question
– Given a long sequence, how would we find the
CpG islands in it?
8
Hidden Markov Model
X = ATTGATGTGAACTGGGGATCGGGCGATATATGATTGG
[Figure: X segmented into three regions labeled "Other", "CpG Island", "Other"]
How can we identify a CpG island in a long sequence?
Idea 1: Test each window of a fixed number of nucleotides
Idea 2: Classify the whole sequence
Class label S1: OOOO…………………O
Class label S2: OOOO…………OCC
…
Class label Si: OOOO…OCC..CO…O
…
Class label SN: CCCC………………CC
$S^* = \arg\max_S P(S \mid X) = \arg\max_S P(S, X)$
$S^*$ = OOOO…OCC..CO…O  (the run of C's marks the CpG island)
9
HMM is just one way of
modeling p(X,S)…
10
A simple HMM
[Figure: a two-state HMM with states B ("Other") and I ("CpG Island")]

Parameters:
Initial state prob: p(B) = 0.5; p(I) = 0.5
State transition prob: p(BB) = 0.7, p(BI) = 0.3, p(IB) = 0.5, p(II) = 0.5
Output prob:
P(x|H_Other) = p(x|B):  P(a|B) = 0.25, P(t|B) = 0.40, P(c|B) = 0.10, P(g|B) = 0.25
P(x|H_CpG) = p(x|I):    P(a|I) = 0.25, P(t|I) = 0.25, P(c|I) = 0.25, P(g|I) = 0.25
11
A General Definition of HMM
An HMM is specified by $\lambda = (S, V, B, A, \Pi)$:

N states: $S = \{s_1, \ldots, s_N\}$
M symbols: $V = \{v_1, \ldots, v_M\}$

Initial state probability: $\Pi = \{\pi_1, \ldots, \pi_N\}$, with $\sum_{i=1}^{N} \pi_i = 1$
$\pi_i$: prob of starting at state $s_i$

State transition probability: $A = \{a_{ij}\}$, $1 \le i, j \le N$, with $\sum_{j=1}^{N} a_{ij} = 1$
$a_{ij}$: prob of going $s_i \to s_j$

Output probability: $B = \{b_i(v_k)\}$, $1 \le i \le N$, $1 \le k \le M$, with $\sum_{k=1}^{M} b_i(v_k) = 1$
$b_i(v_k)$: prob of "generating" $v_k$ at $s_i$
12
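The general definition can be written out concretely for the simple B/I model above, in the same dictionary style as the weather code on slide 5. This is a minimal sketch; the names start_p, trans_p, and emit_p are illustrative choices, not from the slides.

# The slide-11 B/I model as Python dicts; B = "Other", I = "CpG Island".
states = ('B', 'I')
symbols = ('a', 'c', 'g', 't')
start_p = {'B': 0.5, 'I': 0.5}                       # Pi
trans_p = {                                          # A = {a_ij}
    'B': {'B': 0.7, 'I': 0.3},
    'I': {'B': 0.5, 'I': 0.5},
}
emit_p = {                                           # B = {b_i(v_k)}
    'B': {'a': 0.25, 'c': 0.10, 'g': 0.25, 't': 0.40},
    'I': {'a': 0.25, 'c': 0.25, 'g': 0.25, 't': 0.25},
}

# Check the normalization constraints from the general definition.
assert abs(sum(start_p.values()) - 1.0) < 1e-9
for s in states:
    assert abs(sum(trans_p[s].values()) - 1.0) < 1e-9
    assert abs(sum(emit_p[s].values()) - 1.0) < 1e-9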
How to “Generate” a Sequence?
[Figure: the two-state B/I model above generating the sequence "a c g t …"; each
symbol can be emitted from either B or I, so many different state paths (e.g.
B I I I …, B B I I …) can produce the same observed sequence.]

Given a model, follow a path to generate the observations.
13
How to “Generate” a Sequence?
[Figure: the same model generating "a c g t t" along the state path B I I I B,
with the probability of each transition and emission shown along the path.]

$P(\text{"BIIIB"}, \text{"acgtt"}) = p(B)p(a|B)\; p(I|B)p(c|I)\; p(I|I)p(g|I)\; p(I|I)p(t|I)\; p(B|I)p(t|B)$
14
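The product above can be evaluated directly from the model's dictionaries. A minimal sketch (the function name and dict names are mine, as in the earlier sketch):

# Joint probability of one state path and one observation sequence:
# P(path, obs) = p(S1) b_S1(o1) * prod_t a_{S_{t-1} S_t} b_{S_t}(o_t)
start_p = {'B': 0.5, 'I': 0.5}
trans_p = {'B': {'B': 0.7, 'I': 0.3}, 'I': {'B': 0.5, 'I': 0.5}}
emit_p = {'B': {'a': 0.25, 'c': 0.10, 'g': 0.25, 't': 0.40},
          'I': {'a': 0.25, 'c': 0.25, 'g': 0.25, 't': 0.25}}

def path_probability(path, obs):
    p = start_p[path[0]] * emit_p[path[0]][obs[0]]
    for prev, cur, o in zip(path, path[1:], obs[1:]):
        p *= trans_p[prev][cur] * emit_p[cur][o]
    return p

print(path_probability('BIIIB', 'acgtt'))   # P("BIIIB", "acgtt")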
HMM as a Probabilistic Model
Sequential data (random variables / process):

Time/Index:             t1   t2   t3   t4  …
Data:                   o1   o2   o3   o4  …
Observation variable:   O1   O2   O3   O4  …
Hidden state variable:  S1   S2   S3   S4  …

State transition prob.: $p(S_1, S_2, \ldots, S_T) = p(S_1)\,p(S_2|S_1)\cdots p(S_T|S_{T-1})$

Probability of observations with known state transitions:
$p(O_1, O_2, \ldots, O_T \mid S_1, S_2, \ldots, S_T) = p(O_1|S_1)\,p(O_2|S_2)\cdots p(O_T|S_T)$

Joint probability (complete likelihood):
$p(O_1, \ldots, O_T, S_1, \ldots, S_T) = p(S_1)\,p(O_1|S_1)\,p(S_2|S_1)\,p(O_2|S_2)\cdots p(S_T|S_{T-1})\,p(O_T|S_T)$
(the $p(S_1)$ factor is the initial state distribution, the $p(S_t|S_{t-1})$ factors are the state transition probabilities, and the $p(O_t|S_t)$ factors are the output probabilities)

Probability of observations (incomplete likelihood):
$p(O_1, O_2, \ldots, O_T) = \sum_{S_1, \ldots, S_T} p(O_1, \ldots, O_T, S_1, \ldots, S_T)$
15
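For a short sequence, the incomplete likelihood can be checked by brute force: sum the complete likelihood over every possible state path. A minimal sketch, reusing the illustrative dict names from the earlier sketches:

from itertools import product

states = ('B', 'I')
start_p = {'B': 0.5, 'I': 0.5}
trans_p = {'B': {'B': 0.7, 'I': 0.3}, 'I': {'B': 0.5, 'I': 0.5}}
emit_p = {'B': {'a': 0.25, 'c': 0.10, 'g': 0.25, 't': 0.40},
          'I': {'a': 0.25, 'c': 0.25, 'g': 0.25, 't': 0.25}}

def complete_likelihood(path, obs):
    # p(O, S) = p(S1) p(O1|S1) p(S2|S1) p(O2|S2) ... p(ST|ST-1) p(OT|ST)
    p = start_p[path[0]] * emit_p[path[0]][obs[0]]
    for prev, cur, o in zip(path, path[1:], obs[1:]):
        p *= trans_p[prev][cur] * emit_p[cur][o]
    return p

def incomplete_likelihood(obs):
    # p(O) = sum over all state paths S of p(O, S); O(N^T), for checking only
    return sum(complete_likelihood(path, obs)
               for path in product(states, repeat=len(obs)))

print(incomplete_likelihood('acgt'))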
Three Problems
1. Decoding – finding the most likely path
Have: model, parameters, observations (data)
Want: the most likely state sequence
$S_1^* S_2^* \ldots S_T^* = \arg\max_{S_1 S_2 \ldots S_T} p(S_1 S_2 \ldots S_T \mid O) = \arg\max_{S_1 S_2 \ldots S_T} p(S_1 S_2 \ldots S_T, O)$

2. Evaluation – computing the observation likelihood
Have: model, parameters, observations (data)
Want: the likelihood of generating the observed data
$p(O \mid \lambda) = \sum_{S_1 S_2 \ldots S_T} p(O \mid S_1 S_2 \ldots S_T)\, p(S_1 S_2 \ldots S_T)$
16
Three Problems (cont.)
3. Training – estimating parameters
$\lambda^* = \arg\max_\lambda p(O \mid \lambda)$
– Supervised
Have: model, labeled data (data + state sequence)
Want: parameters
– Unsupervised
Have: model, data
Want: parameters
17
Problem I: Decoding/Parsing
Finding the most likely path
You can think of this as classification with all
the paths as class labels…
18
What’s the most likely path?
[Figure: the B/I model from slide 11 with an observed sequence (a, c, g, t, …);
the state generating each symbol is unknown, marked "?".]

$S_1^* S_2^* \ldots S_T^* = \arg\max_{S_1 S_2 \ldots S_T} p(S_1 S_2 \ldots S_T, O) = \arg\max_{S_1 S_2 \ldots S_T} \pi(S_1)\, b_{S_1}(o_1) \prod_{i=2}^{T} a_{S_{i-1} S_i}\, b_{S_i}(o_i)$
19
Viterbi Algorithm: An Example
[Figure: Viterbi trellis for the B/I model over the observation "a c g t …";
transitions 0.7/0.3 out of B and 0.5/0.5 out of I; output probabilities
P(a|B) = 0.251, P(t|B) = 0.40, P(c|B) = 0.098, P(g|B) = 0.251 and
P(a|I) = P(t|I) = P(c|I) = P(g|I) = 0.25; P(B) = P(I) = 0.5.]

t = 1 ("a"):  VP(B) = 0.5*0.251                  (path B)
              VP(I) = 0.5*0.25                   (path I)
t = 2 ("c"):  VP(B) = (0.5*0.251)*0.7*0.098      (path BB)
              VP(I) = (0.5*0.25)*0.5*0.25        (path II)
t = 3 ("g"), t = 4 ("t"), …

Remember the best paths so far.
20
Viterbi Algorithm
Observation:
$\max_{S_1 S_2 \ldots S_T} p(o_1 \ldots o_T, S_1 \ldots S_T) = \max_{s_i} \left[ \max_{S_1 \ldots S_{T-1}} p(o_1 \ldots o_T, S_1 \ldots S_{T-1}, S_T = s_i) \right]$

Algorithm (dynamic programming):
$VP_t(i) = \max_{S_1 \ldots S_{t-1}} p(o_1 \ldots o_t, S_1 \ldots S_{t-1}, S_t = s_i)$
$q_t^*(i) = \left[ \arg\max_{S_1 \ldots S_{t-1}} p(o_1 \ldots o_t, S_1 \ldots S_{t-1}, S_t = s_i) \right] \oplus (i)$   (the best path ending at $s_i$)

1. $VP_1(i) = \pi_i\, b_i(o_1)$, $q_1^*(i) = (i)$, for $i = 1, \ldots, N$
2. For $1 \le t < T$: $VP_{t+1}(i) = \max_{1 \le j \le N} VP_t(j)\, a_{ji}\, b_i(o_{t+1})$,
   $q_{t+1}^*(i) = q_t^*(k) \oplus (i)$, where $k = \arg\max_{1 \le j \le N} VP_t(j)\, a_{ji}\, b_i(o_{t+1})$, for $i = 1, \ldots, N$

The best path is $q_T^*(i)$ for the $i$ that maximizes $VP_T(i)$.   Complexity: $O(TN^2)$
21
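Here is a minimal Viterbi sketch in the style of the slide-5 code, run on the weather example (the function name and loop variables are mine; the parameter dictionaries are the ones from slide 5):

# Viterbi decoding: VP_t(i) = max_j VP_{t-1}(j) * a_ji * b_i(o_t),
# keeping back-pointers (here, whole paths) to recover the best state sequence.
states = ('Rainy', 'Sunny')
start_probability = {'Rainy': 0.6, 'Sunny': 0.4}
transition_probability = {'Rainy': {'Rainy': 0.7, 'Sunny': 0.3},
                          'Sunny': {'Rainy': 0.4, 'Sunny': 0.6}}
emission_probability = {'Rainy': {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},
                        'Sunny': {'walk': 0.6, 'shop': 0.3, 'clean': 0.1}}

def viterbi(obs, states, start_p, trans_p, emit_p):
    vp = {s: start_p[s] * emit_p[s][obs[0]] for s in states}   # VP_1
    path = {s: [s] for s in states}
    for o in obs[1:]:
        new_vp, new_path = {}, {}
        for i in states:
            prob, best_j = max((vp[j] * trans_p[j][i] * emit_p[i][o], j)
                               for j in states)
            new_vp[i] = prob
            new_path[i] = path[best_j] + [i]
        vp, path = new_vp, new_path
    best_prob, best_last = max((vp[s], s) for s in states)
    return best_prob, path[best_last]

print(viterbi(('walk', 'shop', 'clean'), states, start_probability,
              transition_probability, emission_probability))
# Most likely path for (walk, shop, clean): Sunny, Rainy, Rainy (cf. slide 7).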
Problem II: Evaluation
Computing the data likelihood
• Another use of an HMM, e.g., as a generative model for discrimination
• Also related to Problem III – parameter estimation
22
Data Likelihood: p(O|λ)
[Figure: the B/I model unrolled over t = 1, 2, 3, 4, … for the observation
"a c g t …"; $\lambda$ denotes all HMM parameters.]

$p(\text{"a c g t} \ldots\text{"} \mid \lambda) = p(\text{"a c g t} \ldots\text{"} \mid BB \ldots B)\, p(BB \ldots B) + p(\text{"a c g t} \ldots\text{"} \mid BI \ldots B)\, p(BI \ldots B) + \cdots + p(\text{"a c g t} \ldots\text{"} \mid II \ldots I)\, p(II \ldots I)$

In general,
$p(O \mid \lambda) = \sum_{S_1 S_2 \ldots S_T} p(O \mid S_1 S_2 \ldots S_T)\, p(S_1 S_2 \ldots S_T)$
23
The Forward Algorithm
Observation:
$p(o_1 \ldots o_T \mid \lambda) = \sum_{i=1}^{N} \sum_{S_1 S_2 \ldots S_{T-1}} p(o_1 \ldots o_T, S_1 S_2 \ldots S_{T-1}, S_T = s_i)$

Algorithm: let $\alpha_t(i)$ be the probability of generating $o_1 \ldots o_t$ with ending state $s_i$:
$\alpha_t(i) = \sum_{S_1 S_2 \ldots S_{t-1}} p(o_1 \ldots o_t, S_1 S_2 \ldots S_{t-1}, S_t = s_i)$
$= \sum_{S_1 S_2 \ldots S_{t-1}} p(o_1 \ldots o_{t-1}, S_1 S_2 \ldots S_{t-1})\, p(S_t = s_i \mid S_{t-1})\, p(o_t \mid S_t = s_i)$
$= \sum_{j=1}^{N} \Big[ \sum_{S_1 S_2 \ldots S_{t-2}} p(o_1 \ldots o_{t-1}, S_1 S_2 \ldots S_{t-1} = s_j) \Big]\, a_{ji}\, b_i(o_t)$
$= b_i(o_t) \sum_{j=1}^{N} \alpha_{t-1}(j)\, a_{ji}$

The data likelihood is $p(o_1 \ldots o_T \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$.   Complexity: $O(TN^2)$
24
Forward Algorithm: Example
$\alpha_t(i) = b_i(o_t) \sum_{j=1}^{N} \alpha_{t-1}(j)\, a_{ji}$,   $p(o_1 \ldots o_T \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$

[Figure: forward trellis for the B/I model over the observation "a c g t".]

α1(B) = 0.5 * p("a"|B)
α1(I) = 0.5 * p("a"|I)
α2(B) = [α1(B)*0.7 + α1(I)*0.5] * p("c"|B)   ……
α2(I) = [α1(B)*0.3 + α1(I)*0.5] * p("c"|I)   ……
…
P("a c g t") = α4(B) + α4(I)
25
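A minimal forward-algorithm sketch for the B/I model and the observation "acgt" (function and dict names are mine; the parameters are the slide-11 values):

# Forward algorithm: alpha_t(i) = b_i(o_t) * sum_j alpha_{t-1}(j) * a_ji,
# and p(O | lambda) = sum_i alpha_T(i). Cost O(T N^2) instead of O(N^T).
states = ('B', 'I')
start_p = {'B': 0.5, 'I': 0.5}
trans_p = {'B': {'B': 0.7, 'I': 0.3}, 'I': {'B': 0.5, 'I': 0.5}}
emit_p = {'B': {'a': 0.25, 'c': 0.10, 'g': 0.25, 't': 0.40},
          'I': {'a': 0.25, 'c': 0.25, 'g': 0.25, 't': 0.25}}

def forward(obs):
    alpha = {i: start_p[i] * emit_p[i][obs[0]] for i in states}      # alpha_1
    for o in obs[1:]:
        alpha = {i: emit_p[i][o] * sum(alpha[j] * trans_p[j][i] for j in states)
                 for i in states}
    return sum(alpha.values())                                       # sum_i alpha_T(i)

print(forward('acgt'))   # should equal the brute-force sum over all state paths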
The Backward Algorithm
Observation:
$p(o_1 \ldots o_T \mid \lambda) = \sum_{i=1}^{N} \sum_{S_2 \ldots S_T} p(o_1 \ldots o_T, S_1 = s_i, S_2 \ldots S_T) = \sum_{i=1}^{N} \pi_i\, b_i(o_1) \sum_{S_2 \ldots S_T} p(o_2 \ldots o_T, S_2 \ldots S_T \mid S_1 = s_i)$

Algorithm: let $\beta_t(i)$ be the probability of generating $o_{t+1} \ldots o_T$ starting from state $s_i$ (with $o_1 \ldots o_t$ already generated):
$\beta_t(i) = \sum_{S_{t+1} \ldots S_T} p(o_{t+1} \ldots o_T, S_{t+1} \ldots S_T \mid S_t = s_i)$
$= \sum_{S_{t+1} \ldots S_T} p(o_{t+2} \ldots o_T, S_{t+2} \ldots S_T \mid S_{t+1})\, p(S_{t+1} \mid S_t = s_i)\, p(o_{t+1} \mid S_{t+1})$
$= \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1}) \sum_{S_{t+2} \ldots S_T} p(o_{t+2} \ldots o_T, S_{t+2} \ldots S_T \mid S_{t+1} = s_j)$
$= \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$

Complexity: $O(TN^2)$

The data likelihood is
$p(o_1 \ldots o_T \mid \lambda) = \sum_{i=1}^{N} \pi_i\, b_i(o_1)\, \beta_1(i) = \sum_{i=1}^{N} \alpha_1(i)\, \beta_1(i) = \sum_{i=1}^{N} \alpha_t(i)\, \beta_t(i)$ for any $t$
26
Backward Algorithm: Example
$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$,   $\beta_T(i) = 1$

[Figure: backward trellis for the B/I model over the observation "a c g t".]

β4(B) = 1
β4(I) = 1
β3(B) = 0.7 * p("t"|B) * β4(B) + 0.3 * p("t"|I) * β4(I)
β3(I) = 0.5 * p("t"|B) * β4(B) + 0.5 * p("t"|I) * β4(I)
…
P("a c g t") = α1(B)*β1(B) + α1(I)*β1(I)
             = α2(B)*β2(B) + α2(I)*β2(I)
27
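A matching backward-algorithm sketch, including the check that $\sum_i \alpha_t(i)\,\beta_t(i)$ gives the same likelihood at every $t$ (function and dict names are mine):

# Backward algorithm: beta_T(i) = 1 and
# beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j).
# Check: sum_i alpha_t(i) * beta_t(i) is the same for every t.
states = ('B', 'I')
start_p = {'B': 0.5, 'I': 0.5}
trans_p = {'B': {'B': 0.7, 'I': 0.3}, 'I': {'B': 0.5, 'I': 0.5}}
emit_p = {'B': {'a': 0.25, 'c': 0.10, 'g': 0.25, 't': 0.40},
          'I': {'a': 0.25, 'c': 0.25, 'g': 0.25, 't': 0.25}}

def forward_trellis(obs):
    alphas = [{i: start_p[i] * emit_p[i][obs[0]] for i in states}]
    for o in obs[1:]:
        prev = alphas[-1]
        alphas.append({i: emit_p[i][o] * sum(prev[j] * trans_p[j][i] for j in states)
                       for i in states})
    return alphas

def backward_trellis(obs):
    betas = [{i: 1.0 for i in states}]
    for o in reversed(obs[1:]):
        nxt = betas[0]
        betas.insert(0, {i: sum(trans_p[i][j] * emit_p[j][o] * nxt[j] for j in states)
                         for i in states})
    return betas

obs = 'acgt'
alphas, betas = forward_trellis(obs), backward_trellis(obs)
for t in range(len(obs)):
    print(t + 1, sum(alphas[t][i] * betas[t][i] for i in states))  # same value each t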
Problem III: Training
Estimating Parameters
Where do we get the probability values
for all parameters?
Supervised vs. Unsupervised
28
Supervised Training
Given:
1. N – the number of states, e.g., 2 (s1 and s2)
2. V – the vocabulary, e.g., V = {a, b}
3. O – observations, e.g., O = aaaaabbbbb
4. State sequence, e.g., S = 1121122222

Task: Estimate the following parameters
1. π1, π2
2. a11, a12, a21, a22
3. b1(a), b1(b), b2(a), b2(b)

Estimates by counting:
π1 = 1/1 = 1;  π2 = 0/1 = 0
a11 = 2/4 = 0.5;  a12 = 2/4 = 0.5
a21 = 1/5 = 0.2;  a22 = 4/5 = 0.8
b1(a) = 4/4 = 1.0;  b1(b) = 0/4 = 0
b2(a) = 1/6 = 0.167;  b2(b) = 5/6 = 0.833

[Figure: the resulting two-state model: P(s1) = 1, P(s2) = 0; a11 = 0.5, a12 = 0.5,
a21 = 0.2, a22 = 0.8; P(a|s1) = 1, P(b|s1) = 0, P(a|s2) = 0.167, P(b|s2) = 0.833.]
29
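A minimal sketch of the counting behind these estimates (the variable names are mine; the data are the O and S above):

from collections import Counter

O = 'aaaaabbbbb'                 # observations
S = '1121122222'                 # labeled state sequence

# Initial state: with one training sequence, pi is 1 for its first state.
pi = {s: (1.0 if s == S[0] else 0.0) for s in '12'}

# Transitions: a_ij = count(i -> j) / count(transitions out of i).
trans = Counter(zip(S, S[1:]))
out = Counter(S[:-1])
a = {i: {j: trans[(i, j)] / out[i] for j in '12'} for i in '12'}

# Emissions: b_i(v) = count(state i emits v) / count(occurrences of state i).
emit = Counter(zip(S, O))
occ = Counter(S)
b = {i: {v: emit[(i, v)] / occ[i] for v in 'ab'} for i in '12'}

print(pi)   # pi_1 = 1, pi_2 = 0
print(a)    # a11 = a12 = 0.5, a21 = 0.2, a22 = 0.8
print(b)    # b1(a) = 1.0, b1(b) = 0, b2(a) = 1/6, b2(b) = 5/6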
Unsupervised Training
Given:
1. N – the number of states, e.g., 2 (s1 and s2)
2. V – the vocabulary, e.g., V = {a, b}
3. O – observations, e.g., O = aaaaabbbbb
4. The state sequence is not given (unlike the supervised case)

Task: Estimate the following parameters
1. π1, π2
2. a11, a12, a21, a22
3. b1(a), b1(b), b2(a), b2(b)
How could this be possible?

Maximum Likelihood: $\lambda^* = \arg\max_\lambda p(O \mid \lambda)$
30
Intuition
O = aaaaabbbbb, current parameters $\lambda$

Consider every possible state path $q_k$ and its probability $P(O, q_k \mid \lambda)$:
q1 = 1111111111    $P(O, q_1 \mid \lambda)$
q2 = 1111112211    $P(O, q_2 \mid \lambda)$
…
qK = 2222222222    $P(O, q_K \mid \lambda)$

New $\lambda'$:
$\pi_i = \dfrac{\sum_{k=1}^{K} p(O, q_k \mid \lambda)\, \delta[q_k(1) = i]}{\sum_{k=1}^{K} p(O, q_k \mid \lambda)}$

$a_{ij} = \dfrac{\sum_{t=1}^{T-1} \sum_{k=1}^{K} p(O, q_k \mid \lambda)\, \delta[q_k(t) = i,\, q_k(t+1) = j]}{\sum_{t=1}^{T-1} \sum_{k=1}^{K} p(O, q_k \mid \lambda)\, \delta[q_k(t) = i]}$

$b_i(v_j) = \dfrac{\sum_{t=1}^{T} \sum_{k=1}^{K} p(O, q_k \mid \lambda)\, \delta[q_k(t) = i,\, o_t = v_j]}{\sum_{t=1}^{T} \sum_{k=1}^{K} p(O, q_k \mid \lambda)\, \delta[q_k(t) = i]}$

Computing $P(O, q_k \mid \lambda)$ for every path is expensive …
31
Baum-Welch Algorithm
Basic "counters":
$\gamma_t(i) = p(q(t) = s_i \mid O, \lambda)$   (being at state $s_i$ at time $t$)
$\xi_t(i, j) = p(q(t) = s_i, q(t+1) = s_j \mid O, \lambda)$   (being at state $s_i$ at time $t$ and at state $s_j$ at time $t+1$)

Computation of counters:
$\gamma_t(i) = \dfrac{\alpha_t(i)\, \beta_t(i)}{\sum_{j=1}^{N} \alpha_t(j)\, \beta_t(j)}$

$\xi_t(i, j) = \dfrac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{j=1}^{N} \alpha_t(j)\, \beta_t(j)} = \gamma_t(i)\, \dfrac{a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\beta_t(i)}$

Complexity: $O(N^2)$
32
Baum-Welch Algorithm (cont.)
Updating formulas:

$\pi_i' = \gamma_1(i)$

$a_{ij}' = \dfrac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{j'=1}^{N} \sum_{t=1}^{T-1} \xi_t(i, j')}$

$b_i'(v_k) = \dfrac{\sum_{t=1}^{T} \gamma_t(i)\, \delta[o_t = v_k]}{\sum_{t=1}^{T} \gamma_t(i)}$

Overall complexity for each iteration: $O(TN^2)$
33
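A minimal sketch of one full Baum-Welch iteration on the two-state example from the unsupervised-training slide (all function and variable names, and the starting parameter values, are mine; the update step follows the formulas above):

# One Baum-Welch (EM) iteration for a discrete HMM:
# forward/backward -> counters gamma, xi -> re-estimated parameters.
states = ('1', '2')
symbols = ('a', 'b')
obs = 'aaaaabbbbb'                          # O from the unsupervised-training slide

# An arbitrary starting guess for lambda = (pi, a, b).
pi = {'1': 0.6, '2': 0.4}
a = {'1': {'1': 0.6, '2': 0.4}, '2': {'1': 0.3, '2': 0.7}}
b = {'1': {'a': 0.7, 'b': 0.3}, '2': {'a': 0.4, 'b': 0.6}}

def forward(obs, pi, a, b):
    al = [{i: pi[i] * b[i][obs[0]] for i in states}]
    for o in obs[1:]:
        al.append({i: b[i][o] * sum(al[-1][j] * a[j][i] for j in states)
                   for i in states})
    return al

def backward(obs, a, b):
    be = [{i: 1.0 for i in states}]
    for o in reversed(obs[1:]):
        be.insert(0, {i: sum(a[i][j] * b[j][o] * be[0][j] for j in states)
                      for i in states})
    return be

def baum_welch_step(obs, pi, a, b):
    T = len(obs)
    al, be = forward(obs, pi, a, b), backward(obs, a, b)
    p_obs = sum(al[T - 1][i] for i in states)             # p(O | lambda)

    # Counters: gamma_t(i) and xi_t(i, j).
    gamma = [{i: al[t][i] * be[t][i] / p_obs for i in states} for t in range(T)]
    xi = [{i: {j: al[t][i] * a[i][j] * b[j][obs[t + 1]] * be[t + 1][j] / p_obs
               for j in states} for i in states} for t in range(T - 1)]

    # Re-estimation (the updating formulas above).
    new_pi = {i: gamma[0][i] for i in states}
    new_a = {i: {j: sum(x[i][j] for x in xi) /
                    sum(x[i][jp] for x in xi for jp in states)
                 for j in states} for i in states}
    new_b = {i: {v: sum(g[i] for g, o in zip(gamma, obs) if o == v) /
                    sum(g[i] for g in gamma)
                 for v in symbols} for i in states}
    return new_pi, new_a, new_b, p_obs

for it in range(5):                          # p(O | lambda) should not decrease
    pi, a, b, p_obs = baum_welch_step(obs, pi, a, b)
    print(it, p_obs)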
Next Time
• Tutorial
• Posted on blackboard
34