4 Pairwise alignment


Pairwise alignment - Score-based alignments, dynamic programming - Problems with score-based alignments - Markov chains, hidden Markov models, examples - The Viterbi algorithm - Pair HMMs - Forward, Backward, and Posterior probabilities - Posterior Decoding, comparison with score-based alignments

Sequence alignment

To study how nucleotide (or amino acid) sequences evolve, first a precise matching between homologous nucleotides in two (homologous) nucleotide sequences has to be made.

“Homology” is an old concept in biology, and means “having a similar form”. In evolutionary biology it means, more specifically, having a similar form because of shared ancestry. For example, the skeletons of mammals are homologous in this sense, but eyes have evolved independently a number of times, so they are not homologous (even though they may be very similar in form and function). The word is also used for nucleotide sequences, where it means sequence similarity because of shared ancestry (although it is often used as a synonym of “sequence similarity”). Ancestry is tricky to determine, especially when sequences are short, highly divergent, or contain many repetitive patterns (because such repeats may have arisen independently). So in practice, sequence similarity is often used as a proxy for homology. Other characteristics, such as the function or 3D structure of the protein encoded by the DNA sequence, are used as well; for example, Hexokinase I (a glucose-metabolism enzyme) is found in mammals, plants, and yeast.

Needleman-Wunsch algorithm

Two sequences are considered homologous if they are likely to have evolved from a common ancestor, as judged by their sequence similarity. To quantify this, it is first assumed that during evolution the sequences have been changed by two processes: (1) single-nucleotide substitutions, and (2) short sequence insertions and deletions (indels). Other mutations do occur (e.g. multi-nucleotide substitutions, duplications, inversions, ...), but they are usually ignored because of their relatively low frequency and/or the technical difficulty of dealing with them.

Considering only nucleotide substitutions and indels, the homology between two sequences can be summarized by an alignment, which implies a minimal set of substitutions and indels consistent with that homology.

CGACATTAA--ATAGGCATAGCAGGACCAGATACCAGATCAAAGGCTTCAGGCGCA
CGACGTTAACGATTGGC---GCAGTATCAGATACCCGATCAAAG----CAGACGCA

Usually the preferred alignment is the one that optimizes a score, which penalizes both nucleotide mismatches at aligned positions and gaps. This approach is a form of weighted parsimony.

The score function has the following form: S = Σ_{aligned (n,m)} M(n,m) + Σ_{gaps} h(length), where M(n,m) is the score matrix and h(k) is the gap penalty function. The latter is almost always taken to be a linear affine function, h(k) = a + b*(k-1), where a is the gap opening penalty and b is the gap extension penalty.
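As a concrete illustration, here is a minimal Python sketch that scores a given alignment under such a score function. The function names are mine, and the default parameter values follow the example used later in these slides (match +10, mismatch -5, gap opening -10, gap extension -5).

# Minimal sketch: score a given alignment under an affine gap penalty.
def gap_penalty(k, a=-10, b=-5):
    """Affine gap penalty h(k) = a + b*(k-1) for a gap of length k."""
    return a + b * (k - 1)

def substitution_score(x, y, match=10, mismatch=-5):
    """Simple score matrix M(x, y): reward matches, penalize mismatches."""
    return match if x == y else mismatch

def alignment_score(row1, row2):
    """Score an alignment given as two equal-length gapped strings."""
    assert len(row1) == len(row2)
    score = 0
    gap1 = gap2 = 0                    # lengths of the currently open gaps
    for x, y in zip(row1, row2):
        if x == '-':                   # gap in the first sequence
            gap1 += 1
            if gap2: score += gap_penalty(gap2); gap2 = 0
        elif y == '-':                 # gap in the second sequence
            gap2 += 1
            if gap1: score += gap_penalty(gap1); gap1 = 0
        else:                          # aligned pair of nucleotides
            if gap1: score += gap_penalty(gap1); gap1 = 0
            if gap2: score += gap_penalty(gap2); gap2 = 0
            score += substitution_score(x, y)
    if gap1: score += gap_penalty(gap1)
    if gap2: score += gap_penalty(gap2)
    return score

print(alignment_score("CGACATTAA--ATA", "CGACGTTAACGATT"))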

Needleman-Wunsch algorithm

An alignment can be represented as a path through a table, with the two sequences along the edges of the table. For example, the alignment

CGACATTAA--ATA
CGACGTTAACGATT

corresponds to a path through a table with CGACATTAAATA along one edge and CGACGTTAACGATT along the other.

Needleman-Wunsch algorithm

The problem with determining the top-scoring alignment is that there are very many paths. If we ignore the diagonal steps (i.e. consider only alignments consisting entirely of gaps), the number of paths to each cell can be computed recursively, by adding the entries above and to the left and writing the result in the box.

The numbers you get form “Pascal’s triangle”: they are the binomial coefficients. The number of alignments of sequences of length N and M is therefore at least C(N+M, N) = (N+M)! / (N! M!), which grows exponentially with the sequence lengths. Simply going through all possible alignments is therefore impossible.

(Figure: the table for the two example sequences, filled in with these path counts; each entry is the sum of the entry above and the entry to its left, producing Pascal’s triangle of binomial coefficients.)

Needleman-Wunsch algorithm

Needleman-Wunsch is, yet again, an application of dynamic programming. Here, the intermediate problem to solve is the computation of the maximum score S(n,m) over alignments of the first n nucleotides of sequence u with the first m nucleotides of sequence v.

The recursive formula for this score is S(n,m) = max{ S(n-1,m-1) + M( u(n), v(m) ), max_{k≥1} S(n-k,m) + h(k), max_{k≥1} S(n,m-k) + h(k) }. (To make the formulas look clean, boundary conditions such as S(-1,n)=0, and the allowed range of k, are not written explicitly.)

This algorithm takes O(n³) time. It can be improved to O(n²) by using 3 tables which collect the scores that (1) end in an aligned pair, (2) end in a gap in u, and (3) end in a gap in v. The linear affine form of h allows simple updating of the gap-ending scores using neighbouring values.
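Here is a minimal Python sketch of the recursion in its direct O(n³) form (the O(n²) three-table speed-up is not shown), together with a traceback that re-checks which term achieved each maximum, as described below. The function names and example call are mine; the default parameters follow the slides’ example (match +10, mismatch -5, gap opening -10, gap extension -5).

def h(k, a=-10, b=-5):
    """Affine gap penalty h(k) = a + b*(k-1)."""
    return a + b * (k - 1)

def M(x, y, match=10, mismatch=-5):
    """Score matrix M(x, y)."""
    return match if x == y else mismatch

def needleman_wunsch(u, v):
    """Fill the score table S; S[n][m] is the best score for u[:n] vs v[:m]."""
    N, L = len(u), len(v)
    S = [[0] * (L + 1) for _ in range(N + 1)]
    for n in range(1, N + 1):
        S[n][0] = h(n)                                # leading gap in v
    for m in range(1, L + 1):
        S[0][m] = h(m)                                # leading gap in u
    for n in range(1, N + 1):
        for m in range(1, L + 1):
            best = S[n-1][m-1] + M(u[n-1], v[m-1])    # aligned pair
            for k in range(1, n + 1):                 # gap in v of length k
                best = max(best, S[n-k][m] + h(k))
            for k in range(1, m + 1):                 # gap in u of length k
                best = max(best, S[n][m-k] + h(k))
            S[n][m] = best
    return S

def traceback(S, u, v):
    """Recover one optimal alignment by re-checking which term gave each max."""
    n, m = len(u), len(v)
    row1, row2 = [], []
    while n > 0 or m > 0:
        if n > 0 and m > 0 and S[n][m] == S[n-1][m-1] + M(u[n-1], v[m-1]):
            row1.append(u[n-1]); row2.append(v[m-1])
            n, m = n - 1, m - 1
            continue
        found = False
        for k in range(1, n + 1):                     # gap in v
            if S[n][m] == S[n-k][m] + h(k):
                row1.extend(reversed(u[n-k:n])); row2.extend('-' * k)
                n, found = n - k, True
                break
        if found:
            continue
        for k in range(1, m + 1):                     # gap in u
            if S[n][m] == S[n][m-k] + h(k):
                row1.extend('-' * k); row2.extend(reversed(v[m-k:m]))
                m = m - k
                break
    return ''.join(reversed(row1)), ''.join(reversed(row2))

u, v = "CGACATTAAATA", "CGACGTTAACGATT"
S = needleman_wunsch(u, v)
print(S[-1][-1])                                      # best global alignment score
print(*traceback(S, u, v), sep="\n")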

(Figure: the Needleman-Wunsch score table for the example alignment CGACATTAA--ATA / CGACGTTAACGATT, filled in with gap opening = -10, gap extension = -5, match = +10, mismatch = -5.)

Needleman-Wunsch algorithm

With the table filled out, the bottom-right entry gives the score of the best-scoring alignment.

CGACATTAA--ATA
CGACGTTAACGATT

By starting in this entry, and re-computing the score, you can figure out which term of the max{ } formula gave rise to this score. Go to the corresponding entry and repeat until you arrive at the top-left entry. The path now corresponds to the alignment.

This algorithm computes the best-scoring global alignment: all nucleotides participate, and gaps at the beginning or end are penalized as much as gaps in the middle.

Smith-Waterman algorithm is a variation of Needleman-Wunsch which does not penalize gaps at the beginning or end. It computes the best-scoring local alignment.

(Figure: the filled-in score table again, with the same parameters: gap opening = -10, gap extension = -5, match = +10, mismatch = -5.)

Alignments are often wrong

(Figure: sensitivity for true homology, and incorrect bases in the alignment.)

Biases in alignments

(Figure, from Holmes & Durbin, J. Comput. Biol. 5, 1998: A: gap wander; B, C: gap attraction; D: gap annihilation.)

Problems with score-based alignments

The Needleman-Wunsch algorithm (and its elaborations) optimize a score to arrive at a single preferred alignment. This approach is unsatisfactory for several reasons:

1. It is unclear how to determine the score matrix and gap penalties given only unaligned sequences.

2. Even if somehow the underlying evolutionary process is known (e.g. as substitution and indel rates), it is unclear how the scores and penalties relate to this.

3. It is impossible to recover the “true” alignment with certainty, particularly if the sequences are quite divergent. But how to quantify the alignment’s reliability?

4. The N-W alignments are biased in certain ways, and provide no obvious ways to address the problem.

These observations suggest that a probabilistic (or statistical) approach to alignments might be helpful.

Markov chains

A Markov chain is a discrete-time stochastic process X(t) that has the Markov property .

“Discrete time” means that t ∈ T, where T is the set of integers {..., -2, -1, 0, 1, 2, ...}, a discrete set.

The state space may be either discrete or continuous. We’ll only consider discrete state spaces here (but continuous state spaces are also useful in biology, e.g. to model the evolution of gene expression levels).

The “Markov property” is the property that the future dynamics of X(t), given the present state, is independent of the past states. Symbolically, P( X(t+1)=y_{t+1} | X(t)=y_t, X(t-1)=y_{t-1}, X(t-2)=y_{t-2}, ... ) = P( X(t+1)=y_{t+1} | X(t)=y_t ). This means that the behaviour of X(t) is completely determined by the matrices A_pq(t) = P( X(t+1)=q | X(t)=p ). (Strictly speaking, they are matrices only if the state space is finite.) These matrices are called the transition matrices of X(t). They may well depend on t; if they do not, X(t) is called a time-homogeneous Markov chain.

Markov chains

Suppose you know the probabilities that the Markov chain is in a particular state at time t; say you know the vector u where u_a = P( X(t)=a ). Then the probability that X(t+1)=b is (*) Σ_a P( X(t)=a ) * P( X(t+1)=b | X(t)=a ) = Σ_a u_a A_ab = [ u A ]_b, because the summation (*) is just the definition of the product of a vector and a matrix. Applying the same formula with “u” = uA some number of times, you get P( X(t+n)=b ) = [ u A^n ]_b. Conditioning on X(t)=a means setting u = (0,...,1,...,0), with the 1 at the coordinate corresponding to a: P( X(t+n)=b | X(t)=a ) = [ A^n ]_ab.
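A short numerical sketch of this in Python (the transition matrix values here are illustrative, not taken from the lecture):

import numpy as np

A = np.array([[0.9, 0.1],      # A[p, q] = P( X(t+1)=q | X(t)=p )
              [0.2, 0.8]])

u = np.array([1.0, 0.0])       # condition on X(t) = state 0

n = 5
An = np.linalg.matrix_power(A, n)
print(u @ An)                  # distribution of X(t+n), i.e. u A^n
print(An[0, 1])                # P( X(t+n)=1 | X(t)=0 ) = [A^n]_01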

Markov chains

Simple example: Two states, and transition probabilities between them. This system models the fact that yesterday’s weather is a reasonable prediction for today’s weather.

(Figure: a typical realization of this two-state weather chain, a sequence of sunny and rainy days.)

Hidden Markov model (HMM)

Elaboration of the simple weather model, now as an HMM. The underlying states, which are considered unobservable, are H and L representing periods of high and low atmospheric pressure. The observable is the weather (sunny or rainy). Here is a typical realization of this HMM:

(Figure: a typical realization; hidden state sequence H H H H H H H H H L L L H H H H H L L L L H H, together with the emitted sunny/rainy weather.)
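A minimal generative sketch of this weather HMM in Python; the transition and emission probabilities are illustrative assumptions, since the lecture’s actual numbers are not reproduced in this transcript.

import random

transition = {'H': {'H': 0.8, 'L': 0.2},          # assumed values
              'L': {'H': 0.3, 'L': 0.7}}
emission   = {'H': {'sunny': 0.8, 'rainy': 0.2},
              'L': {'sunny': 0.3, 'rainy': 0.7}}

def draw(dist):
    """Draw a key from a {value: probability} dictionary."""
    r, acc = random.random(), 0.0
    for key, p in dist.items():
        acc += p
        if r < acc:
            return key
    return key                                    # guard against rounding

def simulate(n, start='H'):
    """Generate n steps: hidden states and emitted observations."""
    states, observations = [], []
    x = start
    for _ in range(n):
        states.append(x)
        observations.append(draw(emission[x]))    # emit from the current state
        x = draw(transition[x])                   # then move to the next state
    return states, observations

print(*simulate(20), sep='\n')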

Hidden Markov models

A hidden Markov model (HMM) is a Markov chain X(t) , and a stochastic variable Y(t) representing “emissions” or “observations”.

X(t) has the usual Markov property. Y(t) depends only on X(t) (more precisely, Y(t) is conditionally independent of all other Y(s) given X(t)). These dependencies are summarized in the usual graphical-model picture (see Wikipedia, “Hidden Markov model”). The state X(t) is considered unobservable; the only observables are the Y(t).

(Note: The Y(t) may also emit the “empty” symbol, denoted ε. If these are used, they are simply left out of the emitted symbol sequence. Only this symbol sequence is considered to be observable; this then also makes the time t unobservable, since it is unknown how many empty symbols have been output.)

Usually the Markov chain starts at t=0 , and in a particular state, which is called the ‘start’ state . Often one of the states in the Markov chain is labeled ‘end’ state , and the process stops when it reaches this state. In this way HMMs can be made to produce finite symbol sequences.

Hidden Markov models

The unobserved Markov chain is defined as usual: a number of states {1,2,...,N}, and a transition matrix A .

The Y(t) are stochastic variables whose discrete state space is called the emission alphabet. A probability distribution P( Y=y | X=x ) determines the emissions conditional on the state of X(t).

HMMs are generative models: they specify how a sequence of observations is generated randomly. This can be useful for simulations. Their power, however, derives from the existence of a number of efficient algorithms that compute the answer to important questions relating to HMMs.

Questions you may want to ask include: (i) What is the most likely state path X(t) generating a given observed sequence Y ?

(ii) What is the likelihood of a given model generating a sequence Y ?

(iii) What is the posterior probability that a symbol Y i from the sequence Y was generated by some state X ?

(iv) For a given sequence Y , what matrix A and emission probabilities P( Y=y | X=x ) maximize the likelihood of Y ?

These questions are solved by (i) the Viterbi algorithm, (ii) the Forward or Backward algorithms, (iii) the Forward and Backward algorithms combined, and (iv) the Baum-Welch algorithm.

Profile HMMs

Profile HMMs are used to model DNA sequence motifs in longer DNA sequences. Examples of sequence motifs are DNA binding motifs for transcription factors, or protein domains (in which case amino acids can also be used as alphabet).

Profile HMMs

Adding the “x” states (which emit nucleotides according to the background distribution) allows the motif to be anywhere inside a longer DNA sequence. The black dots are silent states (states that do not emit symbols).

Profile HMMs

Adding more silent states allows letters in the motif to be skipped (with a certain probability), so that e.g. protein domains with slightly variable lengths can be modelled.

Profile HMMs

Similarly, “insert” states allow the insertion of, in principle, any length of sequence somewhere within the motif.

This is the profile HMM architecture used in the HMMER and SAM packages. An extensive list of HMMs for protein domains can be found in the Pfam database.

Profile HMMs

(Figure: Sean Eddy, Washington U., author of HMMER, and David Haussler, UCSC, author of SAM.) A loopback transition allows multiple motifs to occur in a single sequence (but note that in this model the sequence always includes at least one motif; using this model to find motifs will therefore always get you one).

Andrew Viterbi (1935-), Italian-American engineer

Viterbi algorithm

This algorithm determines the most likely state path conditional on producing the observed sequence Y. Straightforward computation would involve examining all possible paths, which is virtually impossible. The Viterbi algorithm solves the problem efficiently, using dynamic programming, in a way that is very similar to the Needleman-Wunsch algorithm (except that this one is in one dimension – but see later). The intermediate result that can be computed recursively is v(i,x) := the probability of the most probable path that begins in the ‘start’ state, emits the first i symbols of Y, and ends up in state x.

Suppose for convenience that all states except the start state emit a symbol. Then v(0,'start') = 1 and v(0,x) = 0 for other states x. For i>0, v can be computed recursively: v(i,x) = max_s { v(i-1,s) * A_sx * P( Y=Y_i | X=x ) } (*). The factor v(i-1,s) has already been computed. After all v(i,x) have been computed, a traceback procedure recovers the most probable path.

(Note 1: The convention has been taken that a state emits its corresponding symbol as soon as it is entered. One can also choose to emit once a state is left. Further, emissions may be assigned to transitions rather than states; this is called a “Mealy model” or machine; the conventional choice is called a “Moore model”.)

(Note 2: If some states x do not emit a symbol, the factor v(i-1,s) must be replaced by v(i,s) in (*). This makes the formula self-referential. For the Viterbi algorithm, the computation (*) can simply be carried out a number of times, until none of the terms v(i,x) changes any further. Depending on the structure of the Markov chain, computing the v(i,x) in a suitable order may decrease the number of iterations required.)
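A minimal Python sketch of this recursion for an ordinary (single-sequence) HMM, reusing the illustrative weather model from before; emission happens on entering a state, and there are no silent states apart from ‘start’.

states = ['H', 'L']
A = {'start': {'H': 0.5, 'L': 0.5},               # assumed transition probabilities
     'H': {'H': 0.8, 'L': 0.2},
     'L': {'H': 0.3, 'L': 0.7}}
E = {'H': {'sunny': 0.8, 'rainy': 0.2},           # assumed emission probabilities
     'L': {'sunny': 0.3, 'rainy': 0.7}}

def viterbi(Y):
    # v[i][x]: probability of the best path emitting Y[:i] and ending in x
    # (plain probabilities are fine for short sequences; use logs for long ones)
    v = [{'start': 1.0}]
    back = [{}]
    for i, y in enumerate(Y, start=1):
        v.append({}); back.append({})
        for x in states:
            s = max(v[i-1], key=lambda s: v[i-1][s] * A[s].get(x, 0.0))
            v[i][x] = v[i-1][s] * A[s].get(x, 0.0) * E[x][y]
            back[i][x] = s
    x = max(v[-1], key=v[-1].get)                 # best final state
    path = [x]
    for i in range(len(Y), 1, -1):                # traceback
        x = back[i][x]
        path.append(x)
    return list(reversed(path))

print(viterbi(['sunny', 'sunny', 'rainy', 'rainy', 'sunny']))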

Pair HMMs

Up until now, the HMMs emitted a single sequence of symbols. A straightforward generalization allows the emission of two symbol sequences at the same time. Such HMMs are called pair HMMs .

Each state emits one symbol to each sequence. As before, either or both may be the empty symbol (written “—” rather than ε). Pair HMMs can be used to model sequence alignments. Simultaneously emitted nucleotides correspond to homologous pairs, and are correlated because of their shared ancestry. States that emit nucleotides to only one sequence model gaps in the other sequence.

Viterbi for pair HMMs

To show what changes going from standard to pair HMMs, here is the Viterbi algorithm for pair HMMs. The observable sequences are denoted Y and Z . The probabilities to be computed are v(i,j,x) , and represent the probability of the most probable path that begins in the ‘ start ’ state, emits the first i symbols from Y and the first j symbols from Z , and ends up in state x .

The recursion is: v(i,j,x) = max_s { v(i-p, j-q, s) * A_sx * P( Y=Y_i, Z=Z_j | X=x ) }. Here, p=1 if x emits into Y, 0 otherwise; similarly q=1 if x emits into Z, 0 otherwise. If x does not emit into both sequences, the emission probability should change accordingly (e.g. P( Y=Y_i | X=x ) for emission into Y only). The maximum path probability can be read off from v(length(Y), length(Z), 'end'); the actual path can be obtained by backtracing.

For simplicity I assume that none of the states (except for ‘ start ’) is silent, i.e. they all emit into at least one sequence. If not, the recursion again becomes self-referential, and the same remarks as with the 1 dimensional Viterbi algorithm apply.
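As a concrete sketch, here is this recursion for a minimal three-state pair HMM (a match state ‘M’ and two gap states); all transition and emission probabilities below are illustrative assumptions, and for simplicity there is no explicit ‘end’ state (the best state at the bottom-right corner is taken instead).

import math

delta, eps = 0.1, 0.3                             # assumed gap-open / gap-extend probabilities
A = {'start': {'M': 0.8, 'IY': 0.1, 'IZ': 0.1},
     'M':  {'M': 1 - 2*delta, 'IY': delta, 'IZ': delta},
     'IY': {'M': 1 - eps,     'IY': eps,   'IZ': 0.0},   # IY: emit into Y only (gap in Z)
     'IZ': {'M': 1 - eps,     'IY': 0.0,   'IZ': eps}}   # IZ: emit into Z only (gap in Y)

p_match, p_mismatch, p_bg = 0.2, 0.05 / 3, 0.25   # 4*0.2 + 12*(0.05/3) = 1

def emit(x, a, b):
    """Emission probability of state x for the symbols it emits."""
    if x == 'M':
        return p_match if a == b else p_mismatch
    return p_bg                                   # single-sequence (background) emission

def viterbi_pair(Y, Z):
    NEG = -math.inf
    n, m = len(Y), len(Z)
    emits = {'M': (1, 1), 'IY': (1, 0), 'IZ': (0, 1)}    # (p, q) per state
    v = {(0, 0, 'start'): 0.0}                           # log-probabilities
    back = {}
    for i in range(n + 1):
        for j in range(m + 1):
            for x, (p, q) in emits.items():
                if i < p or j < q:
                    continue
                a = Y[i-1] if p else None
                b = Z[j-1] if q else None
                best_s, best = None, NEG
                for s in ('start', 'M', 'IY', 'IZ'):
                    prev = v.get((i - p, j - q, s), NEG)
                    trans = A[s].get(x, 0.0)
                    if prev > NEG and trans > 0.0:
                        cand = prev + math.log(trans) + math.log(emit(x, a, b))
                        if cand > best:
                            best_s, best = s, cand
                if best_s is not None:
                    v[(i, j, x)] = best
                    back[(i, j, x)] = best_s
    # traceback from the best state at (n, m); the path encodes the alignment
    x = max(('M', 'IY', 'IZ'), key=lambda s: v.get((n, m, s), NEG))
    i, j, row1, row2 = n, m, [], []
    while (i, j, x) in back:
        p, q = emits[x]
        row1.append(Y[i-1] if p else '-')
        row2.append(Z[j-1] if q else '-')
        x, i, j = back[(i, j, x)], i - p, j - q
    return ''.join(reversed(row1)), ''.join(reversed(row2))

print(*viterbi_pair("CGACATTAA", "CGACGTTAACG"), sep='\n')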

Viterbi for pair HMMs

The Viterbi algorithm computes the single path {X(0),X(1),...,X(t)} through the Markov chain that has the highest probability of simultaneously emitting the two observed sequences, denoted Y and Z . The path encodes which nucleotides from Y and Z were emitted simultaneously, i.e. it encodes the alignment.

Symbolically, the Viterbi algorithm first computes max X P( X, Y, Z ) and then the traceback part computes arg max X P( X , Y, Z ) = arg max X P( X | Y, Z ) The last equation holds because P(X|Y,Z) = P(X,Y,Z) / P(Y,Z), and because the denominator does not depend on X, the full and conditional probabilities reach their maxima for the same path X.

Alignments: Best scoring / most likely path (Needleman-Wunsch, Smith-Waterman, Viterbi)

(Figure: a single best-scoring / most likely alignment path through the dynamic-programming table of two short example sequences.)

Likelihood: the Forward algorithm

The Viterbi algorithm is very similar to the Needleman-Wunsch algorithm (the states play the role of the index into the three tables mentioned briefly above); it has the same time complexity, and indeed, for the three-state HMM of the example, you can give a score matrix and gap penalties that make the two compute identical alignments.

The HMM approach has the important advantage over score-based approaches such as N-W that it is fairly straightforward to determine the model parameters if you know how the sequences evolve, i.e. if you know the evolutionary model (drawback 2 of score-based aligners).

Drawback 1 concerned determining the model parameters (or the evolutionary model) from sequences alone. The HMM approach also has a method to do this, by maximum likelihood parameter estimation. For this, it is necessary to be able to calculate the likelihood. This involves considering (summing over) all possible paths, instead of just the most likely one. The Forward algorithm does this.

The Forward algorithm (for pair HMMs)

The algorithm is very similar to the Viterbi algorithm (without the traceback). The probabilities to be computed are slightly different: w(i,j,x) now represents the sum of the probabilities of all paths through the Markov chain that emit the first i symbols from Y and the first j symbols from Z, and end up in state x. The final result, which appears in the table entry w(length(Y), length(Z), 'end'), is the likelihood of the sequences Y and Z.

Symbolically, the Forward algorithm computes P(Y, Z) = Σ_X P( X, Y, Z ). The recursion is: w(i,j,x) = Σ_s { w(i-p, j-q, s) * A_sx * P( Y=Y_i, Z=Z_j | X=x ) }. Here, p and q have the same definitions as before, and the same caveats-for-the-sake-of-simplicity apply.

Note: Dealing with silent states is more involved than with Viterbi. As before, silent states (where neither Y nor Z emits a symbol) create self-references in the recursion. In this case they are linear equations, which in general must be solved by a matrix inversion. In practice, most situations involve 1x1 matrices, which of course can be dealt with by a simple division.

Note: the numbers w(i,j,x) are not conditional probabilities; they are not even true probabilities, as in some cases they can be >1.
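To isolate the one change relative to Viterbi (a sum over previous states instead of a maximum), here is the one-sequence analogue of the Forward recursion in Python, again with the illustrative weather HMM. With no explicit ‘end’ state, the likelihood is the sum over the final states.

states = ['H', 'L']
A = {'start': {'H': 0.5, 'L': 0.5},               # assumed transition probabilities
     'H': {'H': 0.8, 'L': 0.2},
     'L': {'H': 0.3, 'L': 0.7}}
E = {'H': {'sunny': 0.8, 'rainy': 0.2},           # assumed emission probabilities
     'L': {'sunny': 0.3, 'rainy': 0.7}}

def forward(Y):
    # w[i][x]: summed probability of all paths emitting Y[:i] and ending in x
    w = [{'start': 1.0}]
    for y in Y:
        prev = w[-1]
        w.append({x: sum(prev[s] * A[s].get(x, 0.0) for s in prev) * E[x][y]
                  for x in states})
    return sum(w[-1].values())                    # likelihood P(Y)

print(forward(['sunny', 'sunny', 'rainy', 'rainy', 'sunny']))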

The Backward algorithm (for pair HMMs)

The Backward algorithm computes the same thing as the Forward algorithm: the likelihood that the HMM emits the sequences Y and Z; symbolically, P(Y, Z) = Σ_X P( X, Y, Z ). So why bother? The answer is that the tables of intermediate results are different, and the Forward and Backward tables can be used together to compute other interesting quantities, such as posterior probabilities.

The algorithm is very similar to the Forward algorithm, but starts the recursion at the bottom-right end of the table, working its way (backwards) to the top-left.

The quantities u(i,j,x) now represent the conditional probability of the model emitting the symbols i+1, i+2, ... of Y and j+1, j+2, ... of Z, using paths that start in x and end up in the ‘end’ state. The emission associated with x is not included in this likelihood. Initialization is u(length(Y), length(Z), 'end') = 1, and the recursion is: u(i,j,x) = Σ_s { u(i+p_s, j+q_s, s) * A_xs * P( Y=Y_{i+1}, Z=Z_{j+1} | X=s ) }. Here, p and q have the same definitions as before, but now refer to the state s rather than x.

Note: Dealing with silent states is again more involved than with Viterbi. As before, silent states (where neither Y nor Z emits a symbol) create self-references in the recursion. In this case they are linear equations, which in general must be solved by a matrix inversion. In practice, most situations involve 1x1 matrices, which can be dealt with by a simple division.

The Backward algorithm & Sampling from the posterior

Recall that the quantities u computed by the Backward algorithm represent the conditional probability of the model emitting the symbols i+1,i+2,... of Y and j+1,j+2,... of Z , using paths that start in x and end up in the ‘end’ state.

In other words, u(i,j,x) gives the probability that the model will produce the remainder of Y and Z, conditional on starting in state x. These probabilities can be used to sample from the posterior distribution of paths that generate Y and Z:

x ← 'start', i ← 0, j ← 0
While x is not 'end':
    Compute r_s = u(i+p_s, j+q_s, s) * A_xs * P( Y=Y_{i+1}, Z=Z_{j+1} | X=s ) for all states s
    Normalize the r_s so that they sum to 1, and draw a state s from this distribution
    x ← s, i ← i+p_s, j ← j+q_s
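Here is a runnable sketch of this sampling procedure for the one-sequence analogue, using the Backward table of the same illustrative weather HMM (no explicit ‘end’ state, so u(L, x) = 1 for every state x).

import random

states = ['H', 'L']
A = {'start': {'H': 0.5, 'L': 0.5},               # assumed transition probabilities
     'H': {'H': 0.8, 'L': 0.2},
     'L': {'H': 0.3, 'L': 0.7}}
E = {'H': {'sunny': 0.8, 'rainy': 0.2},           # assumed emission probabilities
     'L': {'sunny': 0.3, 'rainy': 0.7}}

def backward_table(Y):
    """u[i][x] = P( Y_{i+1}, ..., Y_L | X_i = x )."""
    L = len(Y)
    u = [dict() for _ in range(L + 1)]
    u[L] = {x: 1.0 for x in states}
    for i in range(L - 1, -1, -1):
        u[i] = {x: sum(A[x].get(s, 0.0) * E[s][Y[i]] * u[i + 1][s] for s in states)
                for x in states}
    return u

def sample_posterior_path(Y):
    """Draw one state path from P( X | Y ) using the Backward quantities."""
    u = backward_table(Y)
    x, path = 'start', []
    for i in range(len(Y)):
        # r_s proportional to u(i+1, s) * A_xs * P( Y_{i+1} | X=s )
        r = {s: u[i + 1][s] * A[x].get(s, 0.0) * E[s][Y[i]] for s in states}
        pick, acc = random.random() * sum(r.values()), 0.0
        for s, weight in r.items():
            acc += weight
            if pick < acc:
                break
        x = s
        path.append(x)
    return path

print(sample_posterior_path(['sunny', 'sunny', 'rainy', 'rainy', 'sunny']))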

Alignments: Posterior probabilities (Durbin, Eddy, Krogh, Mitchison 1998)

(Figure: posterior alignment probabilities over the dynamic-programming table for the same example sequences.)

Posterior probabilities

Using the table of intermediate results w(i,j,x) and u(i,j,x) computed by the Forward and Backward algorithms, it is possible to compute certain posterior probabilities.

For example, let’s denote by [i j x] the event that the HMM emitted the symbols Y_i and Z_j from state x (assuming that x emits into both sequences; an alternative interpretation can be formulated when x does not). The probability of this event, and of the HMM producing the sequences Y and Z, is P(Y, Z, [i j x]) = w(i,j,x) * u(i,j,x). Since the likelihood of emitting Y and Z is P(Y, Z) = w(length(Y), length(Z), 'end') = u(0,0,'start'), the probability of the event [i j x] conditional on observing Y and Z is P( [i j x] | Y, Z ) = w(i,j,x) * u(i,j,x) / u(0,0,'start'). If the state x represents the “aligned” state in an alignment HMM, this is the posterior probability that Y_i and Z_j are aligned.
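A self-contained Python sketch of this computation for the one-sequence analogue: the Forward and Backward tables of the illustrative weather HMM are combined into P( X_i = x | Y ) = w(i,x) * u(i,x) / P(Y) (no explicit ‘end’ state, so the likelihood is taken as the sum over the final Forward column).

states = ['H', 'L']
A = {'start': {'H': 0.5, 'L': 0.5},               # assumed transition probabilities
     'H': {'H': 0.8, 'L': 0.2},
     'L': {'H': 0.3, 'L': 0.7}}
E = {'H': {'sunny': 0.8, 'rainy': 0.2},           # assumed emission probabilities
     'L': {'sunny': 0.3, 'rainy': 0.7}}

def forward_table(Y):
    """w[i][x] = P( Y_1, ..., Y_i, X_i = x )."""
    w = [{'start': 1.0}]
    for y in Y:
        prev = w[-1]
        w.append({x: sum(prev[s] * A[s].get(x, 0.0) for s in prev) * E[x][y]
                  for x in states})
    return w

def backward_table(Y):
    """u[i][x] = P( Y_{i+1}, ..., Y_L | X_i = x )."""
    L = len(Y)
    u = [dict() for _ in range(L + 1)]
    u[L] = {x: 1.0 for x in states}
    for i in range(L - 1, -1, -1):
        u[i] = {x: sum(A[x].get(s, 0.0) * E[s][Y[i]] * u[i + 1][s] for s in states)
                for x in states}
    return u

def posteriors(Y):
    """P( X_i = x | Y ) for every position i and state x."""
    w, u = forward_table(Y), backward_table(Y)
    likelihood = sum(w[len(Y)][x] for x in states)
    return [{x: w[i][x] * u[i][x] / likelihood for x in states}
            for i in range(1, len(Y) + 1)]

for i, post in enumerate(posteriors(['sunny', 'sunny', 'rainy']), start=1):
    print(i, post)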

Posterior probabilities

(Figure: posterior probabilities plotted on a 0–1 scale along two long example nucleotide sequences.)

Posterior probabilities (*)

As another example, let’s consider the event that the HMM transitioned from state x to s just after emitting the symbols Y_i and Z_j (from state x, assuming it emits into both sequences; again, this is not essential). Denote this event by [i j x s]. The probability of this event, and of the HMM producing the sequences Y and Z, is P(Y, Z, [i j x s]) = w(i,j,x) * A_xs * P( Y=Y_{i+1}, Z=Z_{j+1} | X=s ) * u(i+p, j+q, s), where again p and q refer to the emission of state s. Dividing by the total likelihood of observing the sequences Y and Z results in the conditional probability of [i j x s]: P( [i j x s] | Y, Z ) = w(i,j,x) * A_xs * P( Y=Y_{i+1}, Z=Z_{j+1} | X=s ) * u(i+p, j+q, s) / u(0,0,'start'). Conditional on the sequences Y and Z, and on the position within these sequences, this is the expected number of times that the transition x→s was taken by the HMM. Now we can marginalize over the position variables i and j: E( number of times the transition x→s was taken | emitting Y, Z ) = Σ_{i,j} w(i,j,x) * A_xs * P( Y=Y_{i+1}, Z=Z_{j+1} | X=s ) * u(i+p, j+q, s) / u(0,0,'start'). The conditional expected number of times that particular symbols were emitted from particular states can be computed in a very similar way.
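The same computation for the one-sequence analogue, as a sketch: the expected transition counts (the quantities that the Baum-Welch algorithm mentioned in the summary uses) are obtained by summing the formula above over positions. The weather-HMM parameters are again illustrative.

states = ['H', 'L']
A = {'start': {'H': 0.5, 'L': 0.5},               # assumed transition probabilities
     'H': {'H': 0.8, 'L': 0.2},
     'L': {'H': 0.3, 'L': 0.7}}
E = {'H': {'sunny': 0.8, 'rainy': 0.2},           # assumed emission probabilities
     'L': {'sunny': 0.3, 'rainy': 0.7}}

def forward_table(Y):
    w = [{'start': 1.0}]
    for y in Y:
        prev = w[-1]
        w.append({x: sum(prev[s] * A[s].get(x, 0.0) for s in prev) * E[x][y]
                  for x in states})
    return w

def backward_table(Y):
    L = len(Y)
    u = [dict() for _ in range(L + 1)]
    u[L] = {x: 1.0 for x in states}
    for i in range(L - 1, -1, -1):
        u[i] = {x: sum(A[x].get(s, 0.0) * E[s][Y[i]] * u[i + 1][s] for s in states)
                for x in states}
    return u

def expected_transition_counts(Y):
    """E( #x->s transitions | Y ) = sum_i w(i,x) * A_xs * E_s(Y_{i+1}) * u(i+1,s) / P(Y)."""
    w, u = forward_table(Y), backward_table(Y)
    likelihood = sum(w[len(Y)][x] for x in states)
    counts = {(x, s): 0.0 for x in states for s in states}
    for i in range(1, len(Y)):                    # transition taken after emitting Y_i
        for x in states:
            for s in states:
                counts[(x, s)] += w[i][x] * A[x][s] * E[s][Y[i]] * u[i + 1][s] / likelihood
    return counts

print(expected_transition_counts(['sunny', 'sunny', 'rainy', 'rainy', 'sunny']))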

Posteriors: Good predictors of accuracy

Posterior Decoding

The good correspondence between posterior probabilities and the actual frequencies with which particular states were visited suggests that posteriors may be interpreted as the “reliability” of any particular prediction. The idea of Posterior Decoding is to use these posteriors to construct alignments. PD is not a single algorithm, but a set of closely related procedures. Generally, P.D. maximizes some overall measure of the posterior probability over a set of paths.

- The paths can consist of HMM states, or of alignment columns.

- The overall measure can be a (weighted) sum of posteriors, or the product of posteriors.

As before, this is accomplished by a dynamic programming algorithm.

For alignments, the product-of-posteriors measure turned out to work best. PD is a heuristic procedure; in contrast to e.g. the Viterbi algorithm, there is no direct (probabilistic) interpretation of the alignment. For instance, when HMM states are used to build paths, the resulting PD path may not be a valid path through the HMM.

Posterior decoding

(Figure: a pair HMM with states a, b, c, d and e, where e is the “aligned” state and a–d are gap states.)

Paths: alignment columns. Posteriors: the sum of posteriors for all states contributing to a column.

E.g. the posterior for Y_i and Z_j to be aligned is P( [i j e] | Y, Z ). The posterior for Y_i to be involved in a gap is Σ_j P( [i j a] or [i j b] or [i j c] or [i j d] | Y, Z ). To emphasize the marginalization of the gap position j within the second sequence, the resulting algorithm is called Marginalized Posterior Decoding.


Posterior Decoding is better than Viterbi

Posterior decoding is better than score-based alignment

Summary

• HMMs are a powerful framework to model sequential data with randomness.

• Modelling alignments using HMMs provides a connection between alignment models and models of sequence evolution.

• This provides a practical and conceptual framework to deal with uncertainties.

• Posterior decoding is more accurate and less biased than either score-based approaches or Viterbi.

• The Baum-Welch algorithm (not described) allows efficient parameter estimation from unaligned sequences.