Pair-HMMs and CRFs

Pair-HMMs and CRFs
Chuong B. Do
CS262, Winter 2009
Lecture #8
Outline
• I’ll cover two different topics today
– pair-HMMs
– conditional random fields (CRFs)
• Other resources
– For more information on pair-HMMs, see the Durbin et al. book
– For more information on CRFs, see the materials online.
Part I: Pair-HMMs
Quick recap of HMMs
• Formally, an HMM = (Σ, Q, A, a0, e).
– alphabet: Σ = {b1, …, bM}
– set of states: Q = {1, …, K}
– transition probabilities: A = [aij]
– initial state probabilities: a0i
– emission probabilities: ei(bk)
• Example:
  [Figure: two-state HMM for the dishonest casino. States FAIR and LOADED each have self-transition probability 0.95 and switch to the other state with probability 0.05.]
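As a concrete illustration (not part of the original slides), here is a minimal Python sketch encoding the casino HMM above as plain dictionaries. The 0.95/0.05 transition probabilities come from the figure; the uniform initial distribution and the loaded die's emission probabilities are illustrative assumptions.

```python
# A minimal sketch of the dishonest-casino HMM, written as plain Python
# dictionaries.  The 0.95/0.05 transition probabilities follow the figure;
# the loaded die's emissions are an assumption (a die that favors sixes).

alphabet = ['1', '2', '3', '4', '5', '6']            # Σ
states = ['FAIR', 'LOADED']                          # Q

# Transition probabilities A = [a_ij]
A = {
    'FAIR':   {'FAIR': 0.95, 'LOADED': 0.05},
    'LOADED': {'FAIR': 0.05, 'LOADED': 0.95},
}

# Initial state probabilities a_0i (assumed uniform here)
a0 = {'FAIR': 0.5, 'LOADED': 0.5}

# Emission probabilities e_i(b_k)
e = {
    'FAIR':   {b: 1.0 / 6 for b in alphabet},
    'LOADED': {b: (0.5 if b == '6' else 0.1) for b in alphabet},
}
```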
3 basic questions for HMMs
• Evaluation: Given a sequence of observations x and a sequence of states π, compute P(x, π)
• Decoding: Given a sequence of observations x, compute the maximum probability sequence of states πML = arg maxπ P(x, π)
• Learning: Given an HMM with unspecified parameters θ = (A, a0, e), compute the parameters that maximize the likelihood of x and π, i.e., θML = arg maxθ P(x, π | θ)
Pair-HMMs
• Roughly speaking, the generalization of HMMs with 1
observation sequence to 2 observation sequences.
• Unfortunately, the usage of the term “pair-HMM” has come
to mean two very different types of models in computational
biology. I’ll talk about both.
Pair-HMMs: Usage #1
• Consider the HMM = (Σ1 × Σ2, Q, A, a0, e).
• Alphabet consists of pairs of letters – one letter from Σ1 paired
with one letter from Σ2.
• Example: Dishonest casino dealer playing two games at once.
  [Figure: two-state pair-HMM with states HONEST and CHEAT. Each state has self-transition probability 0.95 and switches to the other state with probability 0.05; each state emits a pair of observations, one from each game.]
Pair-HMMs: Usage #1 (continued)
• QUESTION: How do we handle the 3 basic questions?
(evaluation, decoding, learning)
• QUESTION: How would you generalize to 3 sequences? What
about K sequences? Why might this be a bad idea?
• Where this version of pair-HMMs comes up in practice:
– Comparative gene recognition using pairwise alignments
– Protein secondary structure prediction using pairwise alignments as
input
– etc.
Pair-HMMs: Usage #2
• Consider the HMM = ((Σ1 ∪ {η}) × (Σ2 ∪ {η}), Q, A, a0, e).
• Similar to Usage #1, except that this time, instead of emitting
a pair of letters in every state, in some states we may choose
to emit a letter paired with η (the empty string).
– For simplicity, we’ll assume that η is never emitted for both
observation sequences simultaneously.
– We’ll call the two observation sequences x and y.
Application: sequence alignment
• Consider the following pair-HMM:
  [Figure: three-state alignment pair-HMM. State M emits an aligned pair (xi, yj) with probability P(xi, yj); state I emits (xi, η) and state J emits (η, yj), where for all c ∈ Σ, P(η, c) = P(c, η) = Q(c). From M, the model stays in M with probability 1 - 2δ and moves to I or J with probability δ each; from I or J, it continues the gap with probability ε and returns to M with probability 1 - ε. The figure also marks an optional transition between I and J.]
• QUESTION: What are the interpretations of P(c, d) and Q(c) for c, d ∈ Σ?
• QUESTION: What does this model have to do with alignments?
• QUESTION: What is the average length of a gapped region in alignments generated by this model? Average length of matched regions?
Pair-HMMs: Usage #2 (continued)
• Key difference from Usage #1: In these pair-HMMs, we
cannot rely on the algorithms for single-sequence HMMs in
order to do inference. 
• However, we can modify the algorithms for single sequence
HMMs to work for alignment pair-HMMs. 
Recap: Viterbi for single-sequence HMMs
  [Figure: Viterbi dynamic programming trellis with K states per column and one column per observation x1, x2, x3, …]
• Algorithm:
– Define Vk(i) = max π1 … πi-1 P(x1 … xi-1, π1 … πi-1, xi, πi = k)
– Compute using dynamic programming!
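Below is a rough Python sketch of this recurrence in log space, assuming the dictionary-based HMM layout (states, A, a0, e) from the earlier casino sketch; the function name and structure are illustrative, not taken from the lecture.

```python
import math

def viterbi(x, states, A, a0, e):
    """Log-space Viterbi: returns a max-probability state path for x.

    V[i][k] corresponds to the slide's V_k(i), filled in left to right.
    Assumes the A / a0 / e dictionaries sketched earlier.
    """
    V = [{} for _ in x]          # V[i][k] = best log-prob of a path ending in state k at position i
    back = [{} for _ in x]       # backpointers for traceback

    for k in states:
        V[0][k] = math.log(a0[k]) + math.log(e[k][x[0]])

    for i in range(1, len(x)):
        for k in states:
            best_j = max(states, key=lambda j: V[i - 1][j] + math.log(A[j][k]))
            back[i][k] = best_j
            V[i][k] = V[i - 1][best_j] + math.log(A[best_j][k]) + math.log(e[k][x[i]])

    # Traceback of the optimal path
    last = max(states, key=lambda k: V[-1][k])
    path = [last]
    for i in range(len(x) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))
```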
(Broken) Viterbi for pair-HMMs
• In the single sequence case, we defined
Vk(i) = max π1 … πi-1 P(x1 … xi-1, π1 … πi-1, xi, πi = k)
= ek(xi) ∙ maxj ajk Vj(i - 1)
– QUESTION: What’s the computational complexity of DP?
• Observation: In the pairwise case, the emitted pairs (x1, y1) … (xi-1, yi-1) no longer correspond to the first i – 1 letters of x and y respectively.
– QUESTION: Why does this cause a problem in the pair-HMM case?
(Fixed) Viterbi for pair-HMMs
• We’ll just consider the special case of the alignment pair-HMM from before (states M, I, J; for all c ∈ Σ, P(η, c) = P(c, η) = Q(c)):

  VM(i, j) = P(xi, yj) · max { (1 - 2δ) VM(i - 1, j - 1),
                               (1 - ε) VI(i - 1, j - 1),
                               (1 - ε) VJ(i - 1, j - 1) }

  VI(i, j) = Q(xi) · max { δ VM(i - 1, j),
                           ε VI(i - 1, j) }

  VJ(i, j) = Q(yj) · max { δ VM(i, j - 1),
                           ε VJ(i, j - 1) }

• A similar trick can be used when working with the forward/backward algorithms (see Durbin et al. for details).
• QUESTION: What’s the computational complexity of DP?
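A rough Python sketch of these three recurrences, in log space, is shown below. It assumes P is a dict of match emission probabilities keyed by letter pairs, Q a dict of gap emission probabilities keyed by letters, and delta/eps the gap-open/extend parameters from the slide; initialization, end-state terms, and traceback are all simplified.

```python
import math

NEG_INF = float('-inf')

def pair_viterbi(x, y, P, Q, delta, eps):
    """Sketch of the pair-HMM Viterbi recurrence from the slide.

    P[(c, d)] is the match emission probability, Q[c] the gap emission
    probability, delta/eps the gap-open/extend transition parameters.
    Only the fill of V_M, V_I, V_J is shown; boundary conditions and
    traceback are simplified.
    """
    n, m = len(x), len(y)
    VM = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    VI = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    VJ = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    VM[0][0] = 0.0                       # start in M with log-probability 0

    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                VM[i][j] = math.log(P[(x[i - 1], y[j - 1])]) + max(
                    math.log(1 - 2 * delta) + VM[i - 1][j - 1],
                    math.log(1 - eps) + VI[i - 1][j - 1],
                    math.log(1 - eps) + VJ[i - 1][j - 1],
                )
            if i > 0:
                VI[i][j] = math.log(Q[x[i - 1]]) + max(
                    math.log(delta) + VM[i - 1][j],
                    math.log(eps) + VI[i - 1][j],
                )
            if j > 0:
                VJ[i][j] = math.log(Q[y[j - 1]]) + max(
                    math.log(delta) + VM[i][j - 1],
                    math.log(eps) + VJ[i][j - 1],
                )
    # Best log-probability of a complete alignment (ignoring any end-state terms)
    return max(VM[n][m], VI[n][m], VJ[n][m])
```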
Connection to NW with affine gaps
  VM(i, j) = P(xi, yj) · max { (1 - 2δ) VM(i - 1, j - 1),
                               (1 - ε) VI(i - 1, j - 1),
                               (1 - ε) VJ(i - 1, j - 1) }

  VI(i, j) = Q(xi) · max { δ VM(i - 1, j),
                           ε VI(i - 1, j) }

  VJ(i, j) = Q(yj) · max { δ VM(i, j - 1),
                           ε VJ(i, j - 1) }

• QUESTION: How would the optimal alignment change if we divided the probability for every single alignment by ∏i=1 … |x| Q(xi) ∏j=1 … |y| Q(yj)?
Connection to NW with affine gaps
  VM(i, j) = [ P(xi, yj) / (Q(xi) Q(yj)) ] · max { (1 - 2δ) VM(i - 1, j - 1),
                                                   (1 - ε) VI(i - 1, j - 1),
                                                   (1 - ε) VJ(i - 1, j - 1) }

  VI(i, j) = max { δ VM(i - 1, j),
                   ε VI(i - 1, j) }

  VJ(i, j) = max { δ VM(i, j - 1),
                   ε VJ(i, j - 1) }
• Account for the extra terms “along the way.”
Connection to NW with affine gaps
  log VM(i, j) = log [ P(xi, yj) / (Q(xi) Q(yj)) ] + max { log (1 - 2δ) + log VM(i - 1, j - 1),
                                                           log (1 - ε) + log VI(i - 1, j - 1),
                                                           log (1 - ε) + log VJ(i - 1, j - 1) }

  log VI(i, j) = max { log δ + log VM(i - 1, j),
                       log ε + log VI(i - 1, j) }

  log VJ(i, j) = max { log δ + log VM(i, j - 1),
                       log ε + log VJ(i, j - 1) }
• Take logs, and ignore a couple terms.
Connection to NW with affine gaps
  M(i, j) = S(xi, yj) + max { M(i - 1, j - 1),
                              I(i - 1, j - 1),
                              J(i - 1, j - 1) }

  I(i, j) = max { d + M(i - 1, j),
                  e + I(i - 1, j) }

  J(i, j) = max { d + M(i, j - 1),
                  e + J(i, j - 1) }
• Rename!
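The renamed recurrence translates almost directly into code. The sketch below is a simplified, score-only version, assuming S is a substitution score function and d, e are (negative) gap-open and gap-extend scores; traceback is omitted.

```python
def nw_affine(x, y, S, d, e):
    """Sketch of the renamed recurrence: Needleman-Wunsch with affine gaps.

    S(a, b) is a substitution score; d and e are the (negative) gap-open and
    gap-extend scores, mirroring log δ and log ε from the previous slides.
    Returns the optimal alignment score only.
    """
    NEG_INF = float('-inf')
    n, m = len(x), len(y)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    I = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    J = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0

    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                M[i][j] = S(x[i - 1], y[j - 1]) + max(
                    M[i - 1][j - 1], I[i - 1][j - 1], J[i - 1][j - 1])
            if i > 0:
                I[i][j] = max(d + M[i - 1][j], e + I[i - 1][j])
            if j > 0:
                J[i][j] = max(d + M[i][j - 1], e + J[i][j - 1])
    return max(M[n][m], I[n][m], J[n][m])
```

For instance, calling nw_affine(x, y, lambda a, b: 2 if a == b else -1, d=-3, e=-1) would score matches +2, mismatches -1, gap opens -3, and gap extensions -1.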
Summary of pair-HMMs
• We briefly discussed two types of pair-HMMs that come up in
the literature.
• We talked about pair-HMMs for sequence alignment.
• We derived a Viterbi inference algorithm for alignment pair-HMMs and showed its rough “equivalence” to Needleman-Wunsch with affine gap penalties.
Part II: Conditional random fields
Motivation
• So far, we’ve seen HMMs, a type of probabilistic model for
labeling sequences.
– For an input sequence x and a parse π, an HMM defines P(π, x)
– In practice, when we want to perform inference, however, what we
really care about is P(π | x).
• In the case of the Viterbi algorithm, arg maxπ P(π | x) = arg maxπ P(x, π).
• But when using posterior decoding, we need P(π | x).
• Why do we need to parameterize P(x, π)? If we really are interested in the conditional distribution over possible parses π for a given input x, can’t we just model P(π | x) directly?
– Intuition: P(x, π) = P(x) P(π | x), so why waste effort modeling P(x)?
Conditional random fields
• Definition
  P(π | x) = exp( ∑i=1 … |x| wT F(πi, πi-1, x, i) ) / ∑π’ exp( ∑i=1 … |x| wT F(π’i, π’i-1, x, i) )

  where the denominator is the partition function, and
  – F : state × state × observations × index → Rn is the “local feature mapping”
  – w ∈ Rn is the “parameter vector”
  – The summation in the denominator is taken over all possible state sequences π’1 … π’|x|
  – aTb for two vectors a, b ∈ Rn denotes the inner product, ∑i=1 ... n ai bi
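To make the definition concrete, here is a brute-force sketch (usable only on toy problems) that computes P(π | x) by enumerating every state sequence, which makes the partition function explicit. The particular feature mapping F, the weight vector w, and the dummy 'START' state standing in for π0 are assumptions of this sketch, not part of the formal definition.

```python
import itertools
import math

def crf_prob(pi, x, states, w, F):
    """Brute-force P(pi | x) for a tiny CRF, directly from the definition.

    F(state, prev_state, x, i) returns a feature vector (list of floats);
    w is the parameter vector.  Enumerating all state sequences is only
    feasible for very short x, but it makes the partition function explicit.
    """
    def score(path):
        # sum_i w^T F(pi_i, pi_{i-1}, x, i); a fixed dummy 'START' state
        # plays the role of pi_0 here.
        total = 0.0
        prev = 'START'
        for i, state in enumerate(path):
            total += sum(wj * fj for wj, fj in zip(w, F(state, prev, x, i)))
            prev = state
        return total

    Z = sum(math.exp(score(p)) for p in itertools.product(states, repeat=len(x)))
    return math.exp(score(pi)) / Z
```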
Interpretation of weights
• Define the “global feature mapping”
  F(x, π) = ∑i=1 … |x| F(πi, πi-1, x, i)
  Then,
  P(π | x) = exp( wT F(x, π) ) / ∑π’ exp( wT F(x, π’) )
• QUESTION: If all features are nonnegative, what is the effect
of increasing some component wj?
Relationship with HMMs
• Recall that in an HMM,
log P(x, π) = log P(π0) + ∑i=1 … |x| [ log P(πi | πi-1) + log P(xi | πi) ]
= log a0(π0) + ∑i=1 … |x| [ log a(πi-1, πi) + log eπi(xi) ]
• Suppose that we list the logarithms of all the parameters of our model in a
long vector w.
  w = [ log a0(1), …, log a0(K), log a11, …, log aKK, log e1(b1), …, log eK(bM) ] ∈ Rn
Relationship with HMMs
log P(x, π) = log a0(π0) + ∑i=1 … |x| [ log a(πi-1, πi) + log eπi(xi) ]
• Now, consider a feature mapping with the same dimensionality as w. For
each component wj, define Fj to be a 0/1 indicator variable indicating
whether the corresponding parameter should be included in scoring x, π
at position i:
  w = [ log a0(1), …, log a0(K), log a11, …, log aKK, log e1(b1), …, log eK(bM) ] ∈ Rn

  F(πi, πi-1, x, i) = [ 1{i = 1 Λ πi-1 = 1}, …, 1{i = 1 Λ πi-1 = K},
                        1{πi-1 = 1 Λ πi = 1}, …, 1{πi-1 = K Λ πi = K},
                        1{xi = b1 Λ πi = 1}, …, 1{xi = bM Λ πi = K} ] ∈ Rn     (*)

• Then, log P(x, π) = ∑i=1 … |x| wT F(πi, πi-1, x, i).
Relationship with HMMs
log P(x, π) = ∑i=1 … |x| wT F(πi, πi-1, x, i)
• Equivalently,
  P(π | x) = P(x, π) / ∑π’ P(x, π’)
           = exp( ∑i=1 … |x| wT F(πi, πi-1, x, i) ) / ∑π’ exp( ∑i=1 … |x| wT F(π’i, π’i-1, x, i) )
• Therefore, an HMM can be converted to an equivalent CRF.
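The construction from the last few slides can be written out as a short sketch: pack the HMM's log-parameters into w and define the matching 0/1 indicator features. The dict-based HMM layout matches the earlier casino sketch, and for simplicity a0 is treated here as the distribution over the first state (so there is no transition feature at the first position), a slight departure from the slides' π0 notation.

```python
import math

def hmm_to_crf(states, alphabet, A, a0, e):
    """Pack an HMM's log-parameters into a CRF weight vector w, and build a
    matching 0/1 indicator feature mapping F(pi_i, pi_{i-1}, x, i).

    Uses the dict-based HMM format assumed in the earlier casino sketch.
    """
    w = ([math.log(a0[k]) for k in states] +
         [math.log(A[j][k]) for j in states for k in states] +
         [math.log(e[k][b]) for k in states for b in alphabet])

    def F(state, prev, x, i):
        # Initial-state indicators (fire only at the first position)
        init = [1.0 if (i == 0 and state == k) else 0.0 for k in states]
        # Transition indicators (fire at positions after the first)
        trans = [1.0 if (i > 0 and prev == j and state == k) else 0.0
                 for j in states for k in states]
        # Emission indicators
        emit = [1.0 if (x[i] == b and state == k) else 0.0
                for k in states for b in alphabet]
        return init + trans + emit

    return w, F
```

With this w and F, the brute-force crf_prob sketch above reproduces the HMM's posterior P(π | x) on small examples.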
CRFs ≥ HMMs
• All HMMs can be converted to CRFs. But, the reverse does
not hold true!
• Reason #1: HMM parameter vectors obey certain constraints
(e.g., ∑i a0i = 1) but CRF parameter vectors (w) are
unconstrained!
• Reason #2: CRFs can contain features that are NOT possible
to include in HMMs.
CRFs ≥ HMMs (continued)
• In an HMM, our features were of the form
F(πi, πi-1, x, i) = F(πi, πi-1, xi, i)
– I.e., when scoring position i in the sequence, the features only considered the emission xi at position i.
– Cannot look at other positions (e.g., xi-1, xi+1) since that would involve
“emitting” a character more than once – double-counting of
probability!
• CRFs don’t have this problem.
– Why? Because CRFs don’t attempt to model the observations x!
Examples of non-local features for CRFs
• Casino:
– Dealer looks at the previous 100 positions, and determines whether at least 50 of them were 6’s.
Fj(LOADED, FAIR, x, i) = 1{ xi-100 … xi has > 50 6s }
• CpG islands:
– Gene occurs near a CpG island.
Fj(*, EXON, x, i) = 1{ xi-1000 … xi+1000 has > 1/16 CpGs }.
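As a sketch, the casino feature above might be written as follows (0-based indexing; the FAIR/LOADED state names from the earlier example and the F(state, prev, x, i) calling convention are assumptions of these sketches):

```python
def loaded_after_many_sixes(state, prev, x, i):
    """Non-local feature from the casino example: fires on a FAIR -> LOADED
    transition when more than 50 of the rolls x[i-100..i] were sixes.

    The feature may read an arbitrary window of x, but it still depends only
    on (pi_i, pi_{i-1}, x, i), so the Viterbi recurrence is unaffected.
    """
    if state != 'LOADED' or prev != 'FAIR':
        return 0.0
    window = x[max(0, i - 100):i + 1]
    return 1.0 if sum(1 for roll in window if roll == '6') > 50 else 0.0
```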
3 basic questions for CRFs
• Evaluation: Given a sequence of observations x and a sequence of states π, compute P(π | x)
• Decoding: Given a sequence of observations x, compute the maximum probability sequence of states πML = arg maxπ P(π | x)
• Learning: Given a CRF with unspecified parameters w, compute the parameters that maximize the likelihood of π given x, i.e., wML = arg maxw P(π | x, w)
Viterbi for CRFs
• Note that:
  arg maxπ P(π | x) = arg maxπ [ exp( ∑i=1 … |x| wT F(πi, πi-1, x, i) ) / ∑π’ exp( ∑i=1 … |x| wT F(π’i, π’i-1, x, i) ) ]
                    = arg maxπ exp( ∑i=1 … |x| wT F(πi, πi-1, x, i) )
                    = arg maxπ ∑i=1 … |x| wT F(πi, πi-1, x, i)
• We can derive the following recurrence:
Vk(i) = maxj [ wTF(k, j, x, i) + Vj(i-1) ]
• Notes:
– Even though the features may depend on arbitrary positions in x, the
sequence x is constant. The correctness of DP depends only on knowing
the previous state.
– Computing the partition function can be done by a similar adaptation of
the forward/backward algorithms.
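Here is a sketch of this recurrence in code, reusing the F(state, prev, x, i) and weight-vector conventions of the earlier CRF sketches (including the dummy 'START' state standing in for π0, which is an assumption of these sketches).

```python
def crf_viterbi(x, states, w, F):
    """Viterbi decoding for a CRF: V_k(i) = max_j [ w^T F(k, j, x, i) + V_j(i-1) ].

    Uses the same F(state, prev, x, i) / weight-vector conventions as the
    earlier sketches, with a dummy 'START' state standing in for pi_0.
    """
    def local(state, prev, i):
        return sum(wj * fj for wj, fj in zip(w, F(state, prev, x, i)))

    V = [{} for _ in x]
    back = [{} for _ in x]
    for k in states:
        V[0][k] = local(k, 'START', 0)

    for i in range(1, len(x)):
        for k in states:
            best_j = max(states, key=lambda j: V[i - 1][j] + local(k, j, i))
            back[i][k] = best_j
            V[i][k] = V[i - 1][best_j] + local(k, best_j, i)

    last = max(states, key=lambda k: V[-1][k])
    path = [last]
    for i in range(len(x) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))
```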
Learning CRFs
• Key observation: –log P(π | x, w) is a differentiable, convex function of w.
  [Figure: a convex function]
• Any local minimum is a global minimum.
Learning CRFs (continued)
• Compute the partial derivative of log P(π | x, w) with respect to each parameter wj, and use the gradient ascent learning rule: wj ← wj + α (∂/∂wj) log P(π | x, w), for some step size α.
  [Figure: the gradient points in the direction of greatest function increase]
The CRF gradient
• It turns out (you’ll show this on your homework) that
  (∂/∂wj) log P(π | x, w) = Fj(x, π) – Eπ’ ~ P(π’ | x, w) [ Fj(x, π’) ]

  (the first term is the correct value of the jth feature; the second is the expected value of the jth feature, given the current parameters)
• This has a very nice interpretation:
– We increase parameters for which the correct feature values are greater
than the predicted feature values.
– We decrease parameters for which the correct feature values are less than the predicted feature values.
• This moves probability mass from incorrect parses to correct
parses.
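Putting the gradient and the update rule together, here is a brute-force sketch of one gradient-ascent step. The expectation is computed by enumerating all parses, which is only feasible for tiny toy problems; in practice it would come from an adaptation of forward/backward. The step size lr is an arbitrary choice of this sketch.

```python
import itertools
import math

def crf_gradient_step(x, pi, states, w, F, lr=0.1):
    """One gradient-ascent step on log P(pi | x, w), using the identity
    d/dw_j log P = F_j(x, pi) - E_{pi' ~ P(pi' | x, w)}[ F_j(x, pi') ].

    The expectation is computed by brute-force enumeration over all state
    sequences, so this is only meant for tiny examples.
    """
    def global_features(path):
        # F(x, path) = sum_i F(pi_i, pi_{i-1}, x, i), with a dummy 'START' state
        total = [0.0] * len(w)
        prev = 'START'
        for i, state in enumerate(path):
            for j, fj in enumerate(F(state, prev, x, i)):
                total[j] += fj
            prev = state
        return total

    def score(feats):
        return sum(wj * fj for wj, fj in zip(w, feats))

    # Unnormalized probability and feature totals for every parse pi'
    all_feats = [global_features(p) for p in itertools.product(states, repeat=len(x))]
    weights = [math.exp(score(f)) for f in all_feats]
    Z = sum(weights)

    observed = global_features(pi)
    expected = [sum(p * f[j] for p, f in zip(weights, all_feats)) / Z
                for j in range(len(w))]

    # Move each parameter toward the observed feature count and away from the expected one
    return [wj + lr * (o - ex) for wj, o, ex in zip(w, observed, expected)]
```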
Summary of CRFs
• We described a probabilistic model known as a CRF.
• We showed how to convert any HMM into an equivalent CRF.
• We mentioned briefly that inference in CRFs is very similar to
inference in HMMs.
• We described a gradient ascent approach to training CRFs
when true parses are available.