
HMM for multiple sequences

Pair HMM

HMM for pairwise sequence alignment, which incorporates affine gap scores.

“Hidden” states
• Match (M)
• Insertion in x (X)
• Insertion in y (Y)

Observation symbols
• Match (M): {(a,b) | a,b ∈ Σ}
• Insertion in x (X): {(a,-) | a ∈ Σ}
• Insertion in y (Y): {(-,a) | a ∈ Σ}

(Σ denotes the alphabet.)

Pair HMMs

[State diagram: Begin, M, X, Y, End, with affine-gap transition probabilities (gap-open δ, gap-extend ε, and 1-2δ for staying in M)]

Alignment: a path, i.e. a hidden state sequence

x: A T - G T T A T
y: A T C G T - A C
   M M Y M M X M M

Multiple sequence alignment (Globin family)

Profile model (PSSM)

• A natural probabilistic model for a conserved region is to specify independent probabilities e_i(a) of observing nucleotide (or amino acid) a in position i.

• The probability of a new sequence x according to this model is

P(x|M) = ∏_{i=1}^{L} e_i(x_i)

• DNA / protein segments of the same length L; often represented as a positional frequency matrix.

Profile / PSSM

LTMTRGDIGNYLGLTVETISRLLGRFQKSGML
LTMTRGDIGNYLGLTIETISRLLGRFQKSGMI
LTMTRGDIGNYLGLTVETISRLLGRFQKSEIL
LTMTRGDIGNYLGLTVETISRLLGRLQKMGIL
LAMSRNEIGNYLGLAVETVSRVFSRFQQNELI
LAMSRNEIGNYLGLAVETVSRVFTRFQQNGLI
LPMSRNEIGNYLGLAVETVSRVFTRFQQNGLL
VRMSREEIGNYLGLTLETVSRLFSRFGREGLI
LRMSREEIGSYLGLKLETVSRTLSKFHQEGLI
LPMCRRDIGDYLGLTLETVSRALSQLHTQGIL
LPMSRRDIADYLGLTVETVSRAVSQLHTDGVL
LPMSRQDIADYLGLTIETVSRTFTKLERHGAI

Searching profiles: inference

• Given a sequence S of length L, compute the likelihood ratio of S being generated from this profile vs. from the background model:

R(S|P) = ∏_{i=1}^{L} e_i(s_i) / b_{s_i}

– Searching for motifs in a longer sequence: sliding-window approach

Match states for profile HMMs

• Match states
– Emission probabilities e_{M_j}(a)

[Diagram: Begin → M_1 → ... → M_j → ... → End]

Components of profile HMMs

• Insert states
– Emission prob. e_{I_j}(a)
  • Usually the background distribution q_a.
– Transition prob.
  • M_j to I_j, I_j to itself, I_j to M_{j+1}
– Log-odds score for a gap of length k (no log-odds contribution from emissions):

log a_{M_j I_j} + log a_{I_j M_{j+1}} + (k - 1) log a_{I_j I_j}

[Diagram: module j with insert state I_j above match state M_j, between Begin and End]

Components of profile HMMs

• Delete states
– No emission prob.
– Cost of a deletion
  • M→D, D→D, D→M
  • Each D→D might be different

[Diagram: module j with delete state D_j above match state M_j, between Begin and End]

Full structure of profile HMMs

[Diagram: full profile HMM with Begin, match states M_j, insert states I_j, delete states D_j, and End]

Deriving HMMs from multiple alignments

• Key idea behind profile HMMs
– A model representing the consensus for the alignment of sequences from the same family
– Not the sequence of any particular member

HBA_HUMAN ...VGA--HAGEY...

HBB_HUMAN ...V----NVDEV...

MYG_PHYCA ...VEA--DVAGH...

GLB3_CHITP ...VKG------D...

GLB5_PETMA ...VYS--TYETS...

LGB2_LUPLU ...FNA--NIPKH...

GLB1_GLYDI ...IAGADNGAGV...

*** *****

Deriving HMMs from multiple alignments

• Basic profile HMM parameterization
– Aim: assign high probability to sequences from the family
• Parameters
– The probability values: trivial if many independent aligned sequences are given. With observed transition counts A_{kl} and emission counts E_k(a):

a_{kl} = A_{kl} / ∑_{l'} A_{kl'};   e_k(a) = E_k(a) / ∑_{a'} E_k(a')

– Length of the model: heuristics or a systematic method

Sequence conservation: entropy profile of the emission probability distributions

Searching with profile HMMs

• Main usage of profile HMMs
– Detecting potential members of a sequence family
– Matching a sequence to the profile HMM
  • Viterbi algorithm or forward algorithm
– Comparing the resulting probability with the random (background) model:

P(x|R) = ∏_i q_{x_i}

Searching with profile HMMs

• Viterbi algorithm (optimal log-odds alignment)

V^M_j(i) = log( e_{M_j}(x_i) / q_{x_i} ) + max{ V^M_{j-1}(i-1) + log a_{M_{j-1} M_j},
                                                V^I_{j-1}(i-1) + log a_{I_{j-1} M_j},
                                                V^D_{j-1}(i-1) + log a_{D_{j-1} M_j} };

V^I_j(i) = log( e_{I_j}(x_i) / q_{x_i} ) + max{ V^M_j(i-1) + log a_{M_j I_j},
                                                V^I_j(i-1) + log a_{I_j I_j},
                                                V^D_j(i-1) + log a_{D_j I_j} };

V^D_j(i) = max{ V^M_{j-1}(i) + log a_{M_{j-1} D_j},
                V^I_{j-1}(i) + log a_{I_{j-1} D_j},
                V^D_{j-1}(i) + log a_{D_{j-1} D_j} };

Searching with profile HMMs

• Forward algorithm: summing over all possible alignments

F^M_j(i) = log( e_{M_j}(x_i) / q_{x_i} )
         + log[ a_{M_{j-1} M_j} exp(F^M_{j-1}(i-1)) + a_{I_{j-1} M_j} exp(F^I_{j-1}(i-1)) + a_{D_{j-1} M_j} exp(F^D_{j-1}(i-1)) ];

F^I_j(i) = log( e_{I_j}(x_i) / q_{x_i} )
         + log[ a_{M_j I_j} exp(F^M_j(i-1)) + a_{I_j I_j} exp(F^I_j(i-1)) + a_{D_j I_j} exp(F^D_j(i-1)) ];

F^D_j(i) = log[ a_{M_{j-1} D_j} exp(F^M_{j-1}(i)) + a_{I_{j-1} D_j} exp(F^I_{j-1}(i)) + a_{D_{j-1} D_j} exp(F^D_{j-1}(i)) ];

Variants for non-global alignments

• Local alignments (flanking model)
– Emission prob. in flanking states uses background values q_a.
– Looping prob. close to 1, e.g. (1 - η) for some small η.

[Diagram: profile HMM flanked by looping background states Q between Begin and End]

Variants for non-global alignments

• Overlap alignments
– Only transitions to the first model state are allowed.
– Used when the motif is expected to be either present as a whole or absent.
– A transition to the first delete state allows a missing first residue.

[Diagram: flanking states Q attached before Begin and after End of the model]

Variants for non-global alignments

• Repeat alignments
– Transition from the right flanking state back to the random model
– Can find multiple matching segments in the query string

[Diagram: flanking state Q with a transition from the right flank back into the model]

Estimation of prob.

• Maximum likelihood (ML) estimation
– Given observed frequencies c_{ja} of residue a in position j:

e_{M_j}(a) = c_{ja} / ∑_{a'} c_{ja'}

• Simple pseudocounts
– q_a: background distribution
– A: weight factor

e_{M_j}(a) = ( c_{ja} + A q_a ) / ( ∑_{a'} c_{ja'} + A )

Optimal model construction: mark columns

(a) Multiple alignment (marked match columns shown as x):

       x x . . . x
bat    A G - - - C
rat    A - A G - C
cat    A G - A A -
gnat   - - A A A C
goat   A G - - - C
       1 2 . . . 3

(b) Profile-HMM architecture: Begin (0), three modules each with D/I/M states, End (4)

(c) Observed emission/transition counts. Match emissions (positions 1-3): A = 4, 0, 0; C = 0, 0, 4; G = 0, 3, 0; T = 0, 0, 0. Insert emissions (all at I_2): A = 6, G = 1. State-transition counts (M-M, M-D, M-I, I-M, I-D, I-I, D-M, D-D, D-I) are tallied from the marked alignment in the same way.

Optimal model construction

• MAP (match-insert assignment)
– Recursive calculation of a number S_j
  • S_j: log prob. of the optimal model for the alignment up to and including column j, assuming column j is marked.
  • S_j is calculated from S_i and the summed log prob. of everything between columns i and j.
  • T_{ij}: summed log prob. of all the state transitions between marked columns i and j:

T_{ij} = ∑_{x,y ∈ {M,D,I}} c_{xy} log a_{xy}

  • The counts c_{xy} are obtained from the partial state paths implied by marking i and j.

Optimal model construction

• Algorithm: MAP model construction
– Initialization:
  • S_0 = 0, M_{L+1} = 0.
– Recurrence: for j = 1, ..., L+1:

S_j = max_{0 ≤ i < j} [ S_i + T_{ij} + M_j + I_{i+1, j-1} + λ ];
σ_j = argmax_{0 ≤ i < j} [ S_i + T_{ij} + M_j + I_{i+1, j-1} + λ ];

– Traceback: from j = σ_{L+1}, while j > 0:
  • Mark column j as a match column.
  • Set j = σ_j.

Weighting training sequences

• Are the input sequences independent random samples?
• The assumption that all examples are independent samples might be incorrect.
• Solutions
– Weight sequences based on similarity

Weighting training sequences

• Simple weighting schemes derived from a tree
– A phylogenetic tree is given.
– [Thompson, Higgins & Gibson 1994b]
– [Gerstein, Sonnhammer & Chothia 1994]: working up the tree, each node n with branch length t_n increments the weights of the leaves below it:

Δw_i = t_n · w_i / ∑_{leaves k below n} w_k

Weighting training sequences

[Example tree: leaves 1-4 with branch lengths t_1 = 2, t_2 = 2, t_3 = 5, t_4 = 8; internal nodes 5, 6 with t_5 = 3, t_6 = 3; root 7. Gerstein et al. weights: w_1 : w_2 : w_3 : w_4 = 35 : 35 : 50 : 64. Thompson et al. (electric-network analogy with currents I and voltages V_5, V_6, V_7): I_1 : I_2 : I_3 : I_4 = 20 : 20 : 32 : 47.]

Multiple alignment by training profile HMM

• Sequence profiles could be represented as probabilistic models like profile HMMs.

– Profile HMMs could simply be used in place of standard profiles in progressive or iterative alignment methods.

– ML methods for building (training) profile HMM (described previously) are based on multiple sequence alignment.

– Profile HMMs can also be trained from initially unaligned sequences using the Baum-Welch (EM) algorithm

Multiple alignment by profile HMM training

Multiple alignment with a known profile HMM
• Before we estimate a model and a multiple alignment simultaneously, we consider a simpler problem: derive a multiple alignment from a known profile HMM model.
– This can be applied to align a large number of sequences from the same family, based on the HMM built from the (seed) multiple alignment of a small representative set of sequences in the family.

Multiple alignment with a known profile HMM
• Aligning a sequence to a profile HMM → Viterbi algorithm
• Constructing a multiple alignment just requires calculating a Viterbi alignment for each individual sequence.

– Residues aligned to the same match state in the profile HMM should be aligned in the same columns.

Multiple alignment with a known profile HMM • Given a preliminary alignment, HMM can align additional sequences.

Multiple alignment with a known profile HMM

Multiple alignment with a known profile HMM
• Important difference from other MSA programs
– The Viterbi path through the HMM identifies inserts
– The profile HMM does not align inserts
– Other multiple alignment algorithms align the whole sequences.

Profile HMM training from unaligned sequences • Harder problem – estimating both a model and a multiple alignment from initially unaligned sequences.

– Initialization: Choose the length of the profile HMM and initialize parameters.

– Training: estimate the model using the Baum-Welch algorithm (iteratively).

– Multiple Alignment: Align all sequences to the final model using the Viterbi algorithm and build a multiple alignment as described in the previous section.

Profile HMM training from unaligned sequences
• Initial model
– The only decision that must be made in choosing an initial structure for Baum-Welch estimation is the length of the model M.
– A commonly used rule is to set M to the average length of the training sequences.
– We need some randomness in the initial parameters to avoid local maxima.

Multiple alignment by profile HMM training

• Avoiding local maxima
– The Baum-Welch algorithm is only guaranteed to find a LOCAL maximum.
• Models are usually quite long, and there are many opportunities to get stuck in a wrong solution.

– Solution • Start many times from different initial models.

• Use some form of stochastic search algorithm, e.g. simulated annealing.

Multiple alignment by profile HMM training: similarity to Gibbs sampling
• The ‘Gibbs sampler’ algorithm described by Lawrence et al. [1993] has substantial similarities.

– The problem was to simultaneously find the motif positions and to estimate the parameters for a consensus statistical model of them.

– The statistical model used is essentially a profile HMM with no insert or delete states.

Multiple alignment by profile HMM training: model surgery

• We can modify the model after (or during) training by manually checking the alignment produced from the model.

– Some of the match states are redundant
– Some insert states absorb too many sequences
• Model surgery
– If a match state is used by less than half of the training sequences, delete its module (match-insert-delete states)
– If more than half of the training sequences use a certain insert state, expand it into n new modules, where n is the average length of the insertions
– Ad hoc, but works well

Phylo-HMMs: model multiple alignments of syntenic sequences

• A phylo-HMM is a probabilistic machine that generates a multiple alignment, column by column, such that each column is defined by a phylogenetic model • Unlike single-sequence HMMs, the emission probabilities of phylo-HMMs are complex distributions defined by phylogenetic models

Applications of Phylo-HMMs

• Improving phylogenetic models that allow for variation among sites in the rate of substitution (Felsenstein & Churchill, 1996; Yang, 1995)
• Protein secondary structure prediction (Goldman et al., 1996; Thorne et al., 1996)
• Detection of recombination from DNA multiple alignments (Husmeier & Wright, 2001)
• Recently, comparative genomics (Siepel et al., 2005)

Phylo-HMMs: combining phylogeny and HMMs

• Molecular evolution can be viewed as a combination of two Markov processes – One that operates in the dimension of space (along a genome) – One that operates in the dimension of time (along the branches of a phylogenetic tree) • Phylo-HMMs model this combination

[Figure: single-sequence HMM vs. phylo-HMM]

Phylogenetic models

• Stochastic process of substitution that operates independently at each site in a genome • A character is first drawn at random from the background distribution and assigned to the root of the tree; character substitutions then occur randomly along the tree branches, from root to leaves • The characters at the leaves define an alignment column

Phylogenetic Models

• The different phylogenetic models associated with the states of a phylo-HMM may reflect different overall rates of substitution (e.g. in conserved and non-conserved regions), different patterns of substitution or background distributions, or even different tree topologies (as with recombination)


Phylo-HMMs: Formal Definition

A phylo-HMM is a 4-tuple θ = (S, ψ, A, b), where:
– S = {s_1, ..., s_M}: set of hidden states
– ψ = {ψ_1, ..., ψ_M}: set of associated phylogenetic models
– A = {a_{j,k}} (1 ≤ j, k ≤ M): transition probabilities
– b = (b_1, ..., b_M): initial probabilities

The Phylogenetic Model

• ψ_j = (Q_j, π_j, τ_j, β_j), where:
– Q_j: substitution rate matrix
– π_j: background frequencies
– τ_j: binary tree
– β_j: branch lengths

The Phylogenetic Model

• The model is defined with respect to an alphabet Σ whose size is denoted d
• The substitution rate matrix Q_j has dimension d × d
• The background frequency vector π_j has dimension d
• The tree τ_j has n leaves, corresponding to n extant taxa
• The branch lengths β_j are associated with the branches of the tree

Probability of the Data

• Let X be an alignment consisting of L columns and n rows, with the i-th column denoted X_i
• The probability that column X_i is emitted by state s_j is simply the probability of X_i under the corresponding phylogenetic model, P(X_i | ψ_j)
• This is the likelihood of the column given the tree, which can be computed efficiently using Felsenstein’s “pruning” algorithm (which we will describe in later lectures)



Substitution Probabilities

• Felsenstein’s algorithm requires the conditional probabilities of substitution for all bases a, b ∈ Σ and branch lengths t ∈ β_j
• The probability of substitution of a base b for a base a along a branch of length t, denoted P(b | a, t, ψ_j), is based on a continuous-time Markov model of substitution defined by the rate matrix Q_j

Substitution Probabilities

• In particular, for any given non-negative value t, the probabilities P(b | a, t, ψ_j) for all a, b ∈ Σ are given by the d × d matrix

P_j(t) = exp(Q_j t),   where   exp(Q_j t) = ∑_{k=0}^{∞} (Q_j t)^k / k!



Example: HKY model

• π_j = (π_{A,j}, π_{C,j}, π_{G,j}, π_{T,j}) are the background base frequencies
• κ_j represents the transition/transversion rate ratio for ψ_j
• ‘-’s indicate the quantities required to normalize each row of the rate matrix.



State sequences in Phylo-HMMs

• A state sequence through the phylo-HMM is a sequence φ = (φ_1, ..., φ_L) such that φ_i ∈ S for 1 ≤ i ≤ L
• The joint probability of a path and an alignment is

P(φ, X | θ) = b_{φ_1} P(X_1 | ψ_{φ_1}) ∏_{i=2}^{L} a_{φ_{i-1}, φ_i} P(X_i | ψ_{φ_i})



Phylo-HMMs

• The likelihood is given by the sum over all paths (forward algorithm):

P(X | θ) = ∑_{φ} P(φ, X | θ)

• The maximum-likelihood path (Viterbi) is

φ̂ = argmax_{φ} P(φ, X | θ)



Computing the Probabilities

• The likelihood can be computed efficiently using the forward algorithm
• The maximum-likelihood path can be computed efficiently using the Viterbi algorithm
• The forward and backward algorithms can be combined to compute the posterior probability P(φ_i = s_j | X, θ)

Higher-order Markov Models for Emissions

• It is common with gene-finding HMMs to condition the emission probability of each observation on the observations that immediately precede it in the sequence
• For example, in a third-codon-position state, the emission of a base x_i = “A” might have a fairly high probability if the previous two bases are x_{i-2} = “G” and x_{i-1} = “A” (GAA = Glu), but should have zero probability if the previous two bases are x_{i-2} = “T” and x_{i-1} = “A” (TAA = stop)

Higher-order Markov Models for Emissions

• Considering the N observations preceding each x_i corresponds to using an N-th order Markov model for emissions
• An N-th order model for emissions is typically parameterized in terms of (N+1)-tuples of observations, with conditional probabilities computed as ratios of tuple probabilities

N-th Order Phylo-HMMs

• Probability of the N-tuple: requires a sum over all possible alignment columns Y (can be calculated efficiently by a slight modification of Felsenstein’s “pruning” algorithm)