HMM for multiple sequences
Pair HMM
An HMM for pairwise sequence alignment that incorporates affine gap scores.
“Hidden” States
• Match (M)
• Insertion in x (X)
• Insertion in y (Y)
Observation Symbols
• Match (M): {(a, b) | a, b ∈ Σ}
• Insertion in x (X): {(a, −) | a ∈ Σ}
• Insertion in y (Y): {(−, a) | a ∈ Σ}
Pair HMMs
[State diagram: Begin, M, X, Y, End, with transition probabilities]
Alignment: a path, i.e. a hidden state sequence

x: A T - G T T A T
y: A T C G T - A C
   M M Y M M X M M
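The mapping from alignment columns to hidden states can be sketched in a few lines of Python (a toy illustration, not from the slides; the function name is my own):

```python
def alignment_to_states(row_x, row_y):
    """Map each column of a pairwise alignment to a pair-HMM state:
    M if both rows carry a residue, X if only x does, Y if only y does."""
    states = []
    for a, b in zip(row_x, row_y):
        if a != '-' and b != '-':
            states.append('M')      # aligned residue pair (a, b)
        elif b == '-':
            states.append('X')      # insertion in x: (a, -)
        else:
            states.append('Y')      # insertion in y: (-, a)
    return ''.join(states)
```

On the example above, `alignment_to_states("AT-GTTAT", "ATCGT-AC")` recovers the state path "MMYMMXMM".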
Multiple sequence alignment (Globin family)
Profile model (PSSM)
• A natural probabilistic model for a conserved region would be to specify independent probabilities e_i(a) of observing residue (nucleotide or amino acid) a in position i
• The probability of a new sequence x according to this model is
  P(x | M) = ∏_{i=1}^{L} e_i(x_i)
• DNA / protein segments of the same length L
• Often represented as a positional frequency matrix
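The product above is a one-liner in log space. A minimal Python sketch (the representation of the profile as a list of per-position emission dictionaries is my own assumption, not from the slides):

```python
import math

def profile_log_prob(profile, x):
    """log P(x | M) = sum_i log e_i(x_i) for an ungapped profile of length L.
    `profile` is a list of dicts mapping residue -> emission probability."""
    assert len(profile) == len(x)
    return sum(math.log(e_i[c]) for e_i, c in zip(profile, x))
```

For a two-column toy profile [{'A': 0.5, 'C': 0.5}, {'A': 0.25, 'C': 0.75}], P("AC" | M) = 0.5 × 0.75 = 0.375.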
Profile / PSSM
LTMTRGDIGNYLGLTVETISRLLGRFQKSGML LTMTRGDIGNYLGLTIETISRLLGRFQKSGMI LTMTRGDIGNYLGLTVETISRLLGRFQKSEIL LTMTRGDIGNYLGLTVETISRLLGRLQKMGIL LAMSRNEIGNYLGLAVETVSRVFSRFQQNELI LAMSRNEIGNYLGLAVETVSRVFTRFQQNGLI LPMSRNEIGNYLGLAVETVSRVFTRFQQNGLL VRMSREEIGNYLGLTLETVSRLFSRFGREGLI LRMSREEIGSYLGLKLETVSRTLSKFHQEGLI LPMCRRDIGDYLGLTLETVSRALSQLHTQGIL LPMSRRDIADYLGLTVETVSRAVSQLHTDGVL LPMSRQDIADYLGLTIETVSRTFTKLERHGAI
Searching profiles: inference
• Given a sequence S of length L, compute the likelihood ratio of its being generated from this profile vs. from the background model:
  – R(S | P) = ∏_{i=1}^{L} e_i(s_i) / b_{s_i}
  – Searching motifs in a sequence: sliding window approach
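The sliding-window scan can be sketched as follows (a toy illustration under the same assumed profile representation; not from the slides):

```python
import math

def scan_profile(seq, profile, background):
    """Slide a length-L profile along seq and return the log-odds score
    log R = sum_i log(e_i(s_i) / b_{s_i}) at every window offset."""
    L = len(profile)
    scores = []
    for start in range(len(seq) - L + 1):
        window = seq[start:start + L]
        s = sum(math.log(profile[i][c] / background[c])
                for i, c in enumerate(window))
        scores.append(s)
    return scores
```

High-scoring offsets are candidate motif occurrences.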
Match states for profile HMMs
• Match states
  – Emission probabilities e_{M_i}(a)
[State diagram: Begin → ... → M_j → ... → End]
Components of profile HMMs
• Insert states
  – Emission prob. e_{I_i}(a)
    • Usually the background distribution q_a
  – Transition prob.
    • M_i to I_i, I_i to itself, I_i to M_{i+1}
  – Log-odds score for a gap of length k (no log-odds contribution from emission, since inserts emit the background distribution):
    log a_{M_j I_j} + log a_{I_j M_{j+1}} + (k − 1) log a_{I_j I_j}
[State diagram: Begin, M_j, I_j, End]
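Because insert emissions cancel against the background, the log-odds cost of a length-k insertion depends only on the three transition probabilities, which yields an affine gap penalty. A small sketch (argument names are mine):

```python
import math

def insert_gap_score(a_MI, a_II, a_IM, k):
    """Log-odds score of a length-k insertion (k >= 1) at insert state I_j:
    open   = log a_{M_j I_j} + log a_{I_j M_{j+1}}
    extend = log a_{I_j I_j} for each of the k - 1 extra residues.
    Emissions contribute nothing because inserts emit the background q_a."""
    return math.log(a_MI) + (k - 1) * math.log(a_II) + math.log(a_IM)
```

The score is linear in k, i.e. affine: each extra inserted residue costs exactly log a_{I_j I_j}.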
Components of profile HMMs
• Delete states
  – No emission prob.
  – Cost of a deletion
    • M→D, D→D, D→M
    • Each D→D might be different
[State diagram: Begin, M_j, D_j, End]
Full structure of profile HMMs
[State diagram: Begin, D_j, I_j, M_j, End]
Deriving HMMs from multiple alignments
• Key idea behind profile HMMs
  – A model representing the consensus for the alignment of sequences from the same family
  – Not the sequence of any particular member
HBA_HUMAN ...VGA--HAGEY...
HBB_HUMAN ...V----NVDEV...
MYG_PHYCA ...VEA--DVAGH...
GLB3_CHITP ...VKG------D...
GLB5_PETMA ...VYS--TYETS...
LGB2_LUPLU ...FNA--NIPKH...
GLB1_GLYDI ...IAGADNGAGV...
*** *****
Deriving HMMs from multiple alignments
• Basic profile HMM parameterization
  – Aim: give higher probability to sequences from the family
• Parameters
  – The probability values: trivial if many independent aligned sequences are given:
    a_{kl} = A_{kl} / Σ_{l'} A_{kl'}
    e_k(a) = E_k(a) / Σ_{a'} E_k(a')
  – Length of the model: heuristics or a systematic way
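With enough independent aligned sequences, these ML estimates are just normalized counts. A sketch (the count dictionaries are an assumed representation, not from the slides):

```python
def estimate_probs(transition_counts, emission_counts):
    """ML estimates from observed counts:
    a_kl = A_kl / sum_l' A_kl'   and   e_k(a) = E_k(a) / sum_a' E_k(a')."""
    a = {k: {l: c / sum(row.values()) for l, c in row.items()}
         for k, row in transition_counts.items()}
    e = {k: {x: c / sum(row.values()) for x, c in row.items()}
         for k, row in emission_counts.items()}
    return a, e
```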
Sequence conservation: entropy profile of the emission probability distributions
Searching with profile HMMs
• Main usage of profile HMMs
  – Detecting potential sequences in a family
  – Matching a sequence to the profile HMM
• Viterbi algorithm or forward algorithm
  – Comparing the resulting probability with the random model
    P(x | R) = ∏_i q_{x_i}
Searching with profile HMMs
• Viterbi algorithm (optimal log-odds alignment)

V_j^M(i) = log( e_{M_j}(x_i) / q_{x_i} ) + max { V_{j−1}^M(i−1) + log a_{M_{j−1} M_j} ,
                                                 V_{j−1}^I(i−1) + log a_{I_{j−1} M_j} ,
                                                 V_{j−1}^D(i−1) + log a_{D_{j−1} M_j} };

V_j^I(i) = log( e_{I_j}(x_i) / q_{x_i} ) + max { V_j^M(i−1) + log a_{M_j I_j} ,
                                                 V_j^I(i−1) + log a_{I_j I_j} ,
                                                 V_j^D(i−1) + log a_{D_j I_j} };

V_j^D(i) = max { V_{j−1}^M(i) + log a_{M_{j−1} D_j} ,
                 V_{j−1}^I(i) + log a_{I_{j−1} D_j} ,
                 V_{j−1}^D(i) + log a_{D_{j−1} D_j} };
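A minimal Python sketch of these recursions (the data layout, the transition dictionary keyed by (state, index, state, index), and the omission of insert state I_0 and the End transition are my own simplifications, not part of the slides):

```python
import math

NEG_INF = float("-inf")

def viterbi_log_odds(x, e_M, e_I, q, a):
    """Profile-HMM Viterbi recursion in log-odds form (sketch).
    e_M[j][c], e_I[j][c]: emission probs for M_j / I_j, j = 1..L (index 0 unused);
    q[c]: background frequencies; a: dict of transition probs keyed
    (from_state, from_index, to_state, to_index), e.g. ('M', 0, 'M', 1).
    Returns the dynamic-programming tables V^M, V^I, V^D."""
    n, L = len(x), len(e_M) - 1
    V = {s: [[NEG_INF] * (n + 1) for _ in range(L + 1)] for s in "MID"}
    V["M"][0][0] = 0.0                       # Begin state (M_0), nothing emitted
    def step(prev, trans):                   # max over predecessor states
        return max((p + math.log(t) for p, t in zip(prev, trans) if t > 0),
                   default=NEG_INF)
    for j in range(1, L + 1):
        for i in range(0, n + 1):
            if i > 0:
                c = x[i - 1]
                V["M"][j][i] = math.log(e_M[j][c] / q[c]) + step(
                    [V[s][j - 1][i - 1] for s in "MID"],
                    [a.get((s, j - 1, "M", j), 0.0) for s in "MID"])
                V["I"][j][i] = math.log(e_I[j][c] / q[c]) + step(
                    [V[s][j][i - 1] for s in "MID"],
                    [a.get((s, j, "I", j), 0.0) for s in "MID"])
            V["D"][j][i] = step(
                [V[s][j - 1][i] for s in "MID"],
                [a.get((s, j - 1, "D", j), 0.0) for s in "MID"])
    return V
```

On a one-state toy model with Begin→M_1 probability 1, V_1^M(1) reduces to the emission log-odds log(e_{M_1}(x_1)/q_{x_1}).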
Searching with profile HMMs
• Forward algorithm: summing over all potential alignments

F_j^M(i) = log( e_{M_j}(x_i) / q_{x_i} )
         + log[ a_{M_{j−1} M_j} exp(F_{j−1}^M(i−1)) + a_{I_{j−1} M_j} exp(F_{j−1}^I(i−1)) + a_{D_{j−1} M_j} exp(F_{j−1}^D(i−1)) ];

F_j^I(i) = log( e_{I_j}(x_i) / q_{x_i} )
         + log[ a_{M_j I_j} exp(F_j^M(i−1)) + a_{I_j I_j} exp(F_j^I(i−1)) + a_{D_j I_j} exp(F_j^D(i−1)) ];

F_j^D(i) = log[ a_{M_{j−1} D_j} exp(F_{j−1}^M(i)) + a_{I_{j−1} D_j} exp(F_{j−1}^I(i)) + a_{D_{j−1} D_j} exp(F_{j−1}^D(i)) ];
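The forward recursions are the Viterbi recursions with max replaced by log Σ exp; evaluating log[Σ_s a_s exp(F_s)] naively overflows or underflows for long sequences, so the usual implementation trick is a stable log-sum-exp helper (a sketch, with names of my own):

```python
import math

NEG_INF = float("-inf")

def log_sum_exp(terms):
    """Numerically stable log(sum_i exp(terms_i))."""
    finite = [t for t in terms if t != NEG_INF]
    if not finite:
        return NEG_INF
    m = max(finite)
    return m + math.log(sum(math.exp(t - m) for t in finite))

def forward_combine(log_trans, prev_F):
    """One forward step: log[ sum_s a_s exp(F_s) ], with a_s in log space."""
    return log_sum_exp([lt + f for lt, f in zip(log_trans, prev_F)])
```

Subtracting the maximum before exponentiating keeps every exp argument ≤ 0, so terms like F ≈ 1000 that would overflow math.exp are handled exactly.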
Variants for non-global alignments
• Local alignments (flanking model)
  – Emission prob. in flanking states uses the background values q_a
  – Looping prob. close to 1, e.g. (1 − η) for some small η
[State diagram: flanking states Q before Begin and after End, around the D_j, I_j, M_j modules]
Variants for non-global alignments
• Overlap alignments
  – Only transitions to the first model state are allowed
  – Used when expecting the model to be either present as a whole or absent
  – A transition to the first delete state allows a missing first residue
[State diagram: flanking states Q, Begin, D_j, I_j, M_j, End]
Variants for non-global alignments
• Repeat alignments
  – Transition from the right flanking state back to the random model
  – Can find multiple matching segments in the query string
[State diagram: D_j, I_j, M_j modules with flanking state Q, Begin, End]
Estimation of prob.
• Maximum likelihood (ML) estimation
  – Given observed frequencies c_{ja} of residue a in position j:
    e_{M_j}(a) = c_{ja} / Σ_{a'} c_{ja'}
• Simple pseudocounts
  – q_a: background distribution
  – A: weight factor
    e_{M_j}(a) = ( c_{ja} + A q_a ) / ( Σ_{a'} c_{ja'} + A )
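The pseudocount formula is a one-liner in code; a sketch (function and argument names are mine):

```python
def match_emission(counts_j, q, A):
    """e_{M_j}(a) = (c_{ja} + A * q_a) / (sum_a' c_{ja'} + A):
    observed counts for column j smoothed toward the background q
    with total pseudocount weight A."""
    total = sum(counts_j.values()) + A
    return {a: (counts_j.get(a, 0) + A * q_a) / total for a, q_a in q.items()}
```

With counts {'A': 3, 'C': 1}, a uniform background over {A, C} and A = 2, the estimates are e(A) = 2/3 and e(C) = 1/3; the estimates always sum to 1.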
Optimal model construction: mark columns
(a) Multiple alignment (marked columns shown with x):
        x x . . . x
  bat   A G - - - C
  rat   A - A G - C
  cat   A G - A A -
  gnat  - - A A A C
  goat  A G - - - C
        1 2 . . . 3
(b) Profile-HMM architecture: Begin (0) → three match/insert/delete modules (1-3) → End (4)
(c) Observed emission/transition counts:

  match emissions      insert emissions     state transitions
       1  2  3              0  1  2  3           0  1  2  3
  A    4  0  0          A   0  0  6  0      M-M  4  3  2  4
  C    0  0  4          C   0  0  0  0      M-D  1  1  0  0
  G    0  3  0          G   0  0  1  0      M-I  0  0  1  0
  T    0  0  0          T   0  0  0  0      I-M  0  0  2  0
                                            I-D  0  0  1  0
                                            I-I  0  0  4  0
                                            D-M  -  0  0  1
                                            D-D  -  1  0  0
                                            D-I  -  0  2  0
Optimal model construction
• MAP (match-insert assignment)
  – Recursive calculation of a number S_j
    • S_j: log prob. of the optimal model for the alignment up to and including column j, assuming column j is marked
    • S_j is calculated from S_i and the summed log prob. between columns i and j
    • T_{ij}: summed log prob. of all the state transitions between marked columns i and j:
      T_{ij} = Σ_{x,y ∈ {M,D,I}} c_{xy} log a_{xy}
  – The counts c_{xy} are obtained from the partial state paths implied by marking i and j
Optimal model construction
• Algorithm: MAP model construction
  – Initialization:
    • S_0 = 0, M_{L+1} = 0
  – Recurrence: for j = 1, ..., L+1:
    S_j = max_{0 ≤ i < j} [ S_i + T_{ij} + M_j + I_{i+1, j−1} ];
    σ_j = argmax_{0 ≤ i < j} [ S_i + T_{ij} + M_j + I_{i+1, j−1} ]
  – Traceback: from j = σ_{L+1}, while j > 0:
    • Mark column j as a match column
    • j = σ_j
Weighting training sequences
• Are input sequences random?
• The assumption that all examples are independent samples might be incorrect
• Solutions
  – Weight sequences based on similarity
Weighting training sequences
• Simple weighting schemes derived from a tree
  – A phylogenetic tree is given
  – [Thompson, Higgins & Gibson 1994b]
  – [Gerstein, Sonnhammer & Chothia 1994]: the length t_n of the edge above each internal node n is distributed among the leaves below n in proportion to their current weights:
    Δw_i = t_n · w_i / Σ_{leaves k below n} w_k
Weighting training sequences
[Example tree: leaves 1-4 with edge lengths t_1 = 2, t_2 = 2, t_3 = 5, t_4 = 8; internal edges t_5 = 3 (above node 5, the ancestor of leaves 1 and 2) and t_6 = 3 (above node 6, the ancestor of leaves 1, 2, 3); node 7 is the root]
• Resulting weights: w_1 : w_2 : w_3 : w_4 = 35 : 35 : 50 : 64
• Electrical-network analogy (voltages V_5, V_6, V_7 at internal nodes; leaf currents, with I_1 + I_2 flowing through node 5 and I_1 + I_2 + I_3 through node 6): I_1 : I_2 : I_3 : I_4 = 20 : 20 : 32 : 47
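The Gerstein-Sonnhammer-Chothia scheme can be sketched and checked against the example above (the tree encoding is my own assumption: each node maps to (length of the edge above it, children), with the root keyed "root"):

```python
def gsc_weights(tree):
    """Gerstein-Sonnhammer-Chothia weights: each leaf starts with the length
    of its own edge; working up the tree, the length t_n of the edge above
    each internal node n is divided among the leaves below n in proportion
    to their current weights (delta w_i = t_n * w_i / sum_k w_k)."""
    w = {}
    def leaves_below(node):
        _, children = tree[node]
        if not children:
            return [node]
        return [leaf for c in children for leaf in leaves_below(c)]
    def visit(node):                      # post-order: children first
        t, children = tree[node]
        if not children:
            w[node] = t                   # leaf: weight = its own edge length
            return
        for c in children:
            visit(c)
        if t:                             # distribute this edge's length
            below = leaves_below(node)
            total = sum(w[k] for k in below)
            for k in below:
                w[k] += t * w[k] / total
    visit("root")
    return w
```

On the example tree (t_1 = t_2 = 2, t_3 = 5, t_4 = 8, t_5 = t_6 = 3) this reproduces the ratio 35 : 35 : 50 : 64.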
Multiple alignment by training profile HMM
• Sequence profiles could be represented as probabilistic models like profile HMMs.
– Profile HMMs could simply be used in place of standard profiles in progressive or iterative alignment methods.
– ML methods for building (training) profile HMM (described previously) are based on multiple sequence alignment.
– Profile HMMs can also be trained from initially unaligned sequences using the Baum-Welch (EM) algorithm
Multiple alignment with a known profile HMM
• Before we estimate a model and a multiple alignment simultaneously, we consider a simpler problem: deriving a multiple alignment from a known profile HMM model.
  – This can be applied to align a large number of sequences from the same family, based on the HMM built from the (seed) multiple alignment of a small representative set of sequences in the family.
Multiple alignment with a known profile HMM • Aligning a sequence to a profile HMM: the Viterbi algorithm • Constructing a multiple alignment just requires calculating a Viterbi alignment for each individual sequence.
– Residues aligned to the same match state in the profile HMM should be aligned in the same columns.
Multiple alignment with a known profile HMM • Given a preliminary alignment, HMM can align additional sequences.
Multiple alignment with a known profile HMM
Multiple alignment with a known profile HMM • Important difference with other MSA programs – Viterbi path through HMM identifies inserts – Profile HMM does not align inserts – Other multiple alignment algorithms align the whole sequences.
Profile HMM training from unaligned sequences • Harder problem – estimating both a model and a multiple alignment from initially unaligned sequences.
– Initialization: Choose the length of the profile HMM and initialize parameters.
– Training: estimate the model using the Baum-Welch algorithm (iteratively).
– Multiple Alignment: Align all sequences to the final model using the Viterbi algorithm and build a multiple alignment as described in the previous section.
Profile HMM training from unaligned sequences • Initial Model – The only decision that must be made in choosing an initial structure for Baum-Welch estimation is the length of the model M. – A commonly used rule is to set M to be the average length of the training sequences.
– We need some randomness in initial parameters to avoid local maxima.
Multiple alignment by profile HMM training
• Avoiding local maxima – The Baum-Welch algorithm is only guaranteed to find a LOCAL maximum.
• Models are usually quite long and there are many opportunities to get stuck in a wrong solution.
– Solution • Start many times from different initial models.
• Use some form of stochastic search algorithm, e.g. simulated annealing.
Multiple alignment by profile HMM training: similarity to Gibbs sampling • The ‘Gibbs sampler’ algorithm described by Lawrence et al. [1993] has substantial similarities.
– The problem was to simultaneously find the motif positions and to estimate the parameters for a consensus statistical model of them.
– The statistical model used is essentially a profile HMM with no insert or delete states.
Multiple alignment by profile HMM training: model surgery
• We can modify the model after (or during) training by manually checking the alignment produced from the model.
  – Some of the match states are redundant
  – Some insert states absorb too many sequences
• Model surgery
  – If a match state is used by fewer than ½ of the training sequences, delete its module (match-insert-delete states)
  – If more than ½ of the training sequences use a certain insert state, expand it into n new modules, where n is the average length of the insertions
  – Ad hoc, but works well
Phylo-HMMs: model multiple alignments of syntenic sequences
• A phylo-HMM is a probabilistic machine that generates a multiple alignment, column by column, such that each column is defined by a phylogenetic model • Unlike single-sequence HMMs, the emission probabilities of phylo-HMMs are complex distributions defined by phylogenetic models
Applications of Phylo-HMMs
• Improving phylogenetic models by allowing variation among sites in the rate of substitution (Felsenstein & Churchill, 1996; Yang, 1995) • Protein secondary structure prediction (Goldman et al., 1996; Thorne et al., 1996) • Detection of recombination from DNA multiple alignments (Husmeier & Wright, 2001) • Recently, comparative genomics (Siepel et al., 2005)
Phylo-HMMs: combining phylogeny and HMMs
• Molecular evolution can be viewed as a combination of two Markov processes – One that operates in the dimension of space (along a genome) – One that operates in the dimension of time (along the branches of a phylogenetic tree) • Phylo-HMMs model this combination
Single-sequence HMM Phylo-HMM
Phylogenetic models
• Stochastic process of substitution that operates independently at each site in a genome • A character is first drawn at random from the background distribution and assigned to the root of the tree; character substitutions then occur randomly along the tree branches, from root to leaves • The characters at the leaves define an alignment column
Phylogenetic Models
• The different phylogenetic models associated with the states of a phylo-HMM may reflect different overall rates of substitution (e.g. in conserved and non-conserved regions), different patterns of substitution or background distributions, or even different tree topologies (as with recombination)
Phylo-HMMs: Formal Definition
• θ = (S, ψ, A, b), where:
  – S = {s_1, ..., s_M}: set of hidden states
  – ψ = {ψ_1, ..., ψ_M}: set of associated phylogenetic models
  – A = {a_{j,k}} (1 ≤ j, k ≤ M): transition probabilities
  – b = (b_1, ..., b_M): initial probabilities
The Phylogenetic Model
• ψ_j = (Q_j, π_j, τ_j, β_j), where:
  – Q_j: substitution rate matrix
  – π_j: background frequencies
  – τ_j: binary tree
  – β_j: branch lengths
The Phylogenetic Model
• The model is defined with respect to an alphabet whose size is denoted d
• The substitution rate matrix Q_j has dimension d × d
• The background frequency vector π_j has dimension d
• The tree τ_j has n leaves, corresponding to n extant taxa
• The branch lengths β_j are associated with the edges of the tree
Probability of the Data
• Let X be an alignment consisting of L columns and n rows, with the i-th column denoted X_i
• The probability that column X_i is emitted by state s_j is simply the probability of X_i under the corresponding phylogenetic model, P(X_i | ψ_j)
• This is the likelihood of the column given the tree, which can be computed efficiently using Felsenstein’s “pruning” algorithm (which we will describe in later lectures)
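Although the slides defer pruning to a later lecture, the recursion is short enough to sketch here (the tree/column encodings, and the Jukes-Cantor substitution function used in the test, are my own assumptions):

```python
def pruning_likelihood(tree, column, pi, P):
    """Felsenstein's pruning algorithm for one alignment column X_i.
    tree: internal node -> list of (child, branch_length); root keyed "root".
    column: leaf -> observed base; pi: root (background) distribution;
    P(a, b, t): probability of substitution of b for a over a branch of length t.
    Returns P(X_i | psi) = sum_a pi_a * L_root(a)."""
    bases = "ACGT"
    def cond(node):
        # L(node)[a] = P(observed leaves below node | state a at node)
        if node in column:               # leaf: indicator on the observed base
            return {a: 1.0 if a == column[node] else 0.0 for a in bases}
        L = {a: 1.0 for a in bases}
        for child, t in tree[node]:
            Lc = cond(child)
            for a in bases:
                L[a] *= sum(P(a, b, t) * Lc[b] for b in bases)
        return L
    L_root = cond("root")
    return sum(pi[a] * L_root[a] for a in bases)
```

As a sanity check: with a Jukes-Cantor substitution model and two leaves joined to the root by zero-length branches, the likelihood of the column (A, A) reduces to π_A.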
Substitution Probabilities
• Felsenstein’s algorithm requires the conditional probabilities of substitution for all bases a, b and all branch lengths t
• The probability of substitution of a base b for a base a along a branch of length t, denoted P(b | a, t, ψ_j), is based on a continuous-time Markov model of substitution defined by the rate matrix Q_j
Substitution Probabilities
• In particular, for any given non-negative value t, the probabilities P(b | a, t, ψ_j) for all a, b are given by the d × d matrix
  P(t, ψ_j) = exp(Q_j t), where exp(Q_j t) = Σ_{k=0}^{∞} (Q_j t)^k / k!
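The matrix exponential can be sketched directly from this series (a truncated-series toy, adequate for small Q_j t; a production implementation would use e.g. scipy.linalg.expm or an eigendecomposition instead):

```python
def mat_exp(Q, t, terms=30):
    """P(t) = exp(Qt) = sum_{k>=0} (Qt)^k / k!, truncated after `terms` terms.
    Q is a rate matrix given as a list of lists (rows sum to zero)."""
    d = len(Q)
    A = [[Q[i][j] * t for j in range(d)] for i in range(d)]
    P = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # k = 0: I
    term = [row[:] for row in P]
    for k in range(1, terms):
        # term_k = term_{k-1} @ A / k
        term = [[sum(term[i][m] * A[m][j] for m in range(d)) / k
                 for j in range(d)] for i in range(d)]
        P = [[P[i][j] + term[i][j] for j in range(d)] for i in range(d)]
    return P
```

Each row of the result is a proper substitution distribution: because the rows of Q sum to zero, the rows of exp(Qt) sum to 1.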
Example: HKY model
• π_j = (π_{A,j}, π_{C,j}, π_{G,j}, π_{T,j}); κ_j represents the transition/transversion rate ratio for ψ_j
• ‘−’s indicate the quantities required to normalize each row (each row of Q_j sums to zero)
State sequences in Phylo-HMMs
• A state sequence through the phylo-HMM is a sequence φ = (φ_1, ..., φ_L) such that φ_i ∈ S for 1 ≤ i ≤ L
• The joint probability of a path and an alignment is
  P(φ, X | θ) = b_{φ_1} P(X_1 | ψ_{φ_1}) ∏_{i=2}^{L} a_{φ_{i−1}, φ_i} P(X_i | ψ_{φ_i})
Phylo-HMMs
• The likelihood is given by the sum over all paths (forward algorithm):
  P(X | θ) = Σ_φ P(φ, X | θ)
• The maximum-likelihood path is (Viterbi):
  φ̂ = argmax_φ P(φ, X | θ)
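Given the column likelihoods P(X_i | ψ_j) (e.g. precomputed by pruning), the phylo-HMM likelihood is the standard HMM forward recursion over columns. A sketch (probability space with no log-scaling, so only suitable for short toy alignments; names are mine):

```python
def phylo_hmm_forward(col_lik, b, a):
    """Forward algorithm for a phylo-HMM.
    col_lik[i][j] = P(X_i | psi_j) for column i, state j (precomputed);
    b[j]: initial state probabilities; a[j][k]: transition probabilities.
    Returns P(X | theta), the sum over all state paths."""
    M, L = len(b), len(col_lik)
    f = [b[j] * col_lik[0][j] for j in range(M)]       # f_1(j)
    for i in range(1, L):
        f = [col_lik[i][k] * sum(f[j] * a[j][k] for j in range(M))
             for k in range(M)]
    return sum(f)
```

Replacing the inner sum with a max (plus traceback) gives the Viterbi path instead of the likelihood.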
Computing the Probabilities
• The likelihood can be computed efficiently using the forward algorithm
• The maximum-likelihood path can be computed efficiently using the Viterbi algorithm
• The forward and backward algorithms can be combined to compute the posterior probability P(φ_i = j | X, θ)
Higher-order Markov Models for Emissions
• It is common with gene-finding HMMs to condition the emission probability of each observation on the observations that immediately precede it in the sequence
• For example, in a 3rd-codon-position state, the emission of a base x_i = “A” might have a fairly high probability if the previous two bases are x_{i−2} = “G” and x_{i−1} = “A” (GAA = Glu), but should have zero probability if the previous two bases are x_{i−2} = “T” and x_{i−1} = “A” (TAA = stop)
Higher-order Markov Models for Emissions
• Considering the N observations preceding each x_i corresponds to using an N-th order Markov model for emissions
• An N-th order model for emissions is typically parameterized in terms of (N+1)-tuples of observations, and conditional probabilities are computed as ratios of tuple probabilities:
  P(x_i | x_{i−N}, ..., x_{i−1}) = P(x_{i−N}, ..., x_i) / P(x_{i−N}, ..., x_{i−1})
N-th Order Phylo-HMMs
• Probability of the N-tuple: sum over all possible alignment columns Y (can be calculated efficiently by a slight modification of Felsenstein’s “pruning” algorithm)