Computational Physics Applied to Macromolecules
Probabilistic Models for Biological Sequences
Pier Luigi Martelli
Università di Bologna
[email protected]
051 2094005
338 3991609
PROLOGUE:
Pitfalls of standard alignments
Scoring a pairwise alignment
A: ALAEVLIRLITKLYP
B: ASAKHLNRLITELYP
Score(A, B) = Σi s(Ai, Bi)
where s is a substitution matrix (e.g. Blosum62)
Alignment of a family (globins)
Different positions are not equivalent
Sequence logos
The substitution score IN A FAMILY should depend on the position (and the same holds for gaps)
For modelling families we need more flexible tools
Probabilistic Models for Biological Sequences
•What are they?
Probabilistic models for sequences
Generative definition:
•Objects producing different outcomes (sequences) with
different probabilities
•The probability distribution over the sequences space
determines the model specificity
Sequence space
M
Generates si with probability P(si | M)
e.g.: M is the representation of the family of globins
Probabilistic models for sequences
We do not actually need a generator of new biological sequences:
the generative definition is useful as an operative definition
Associative definition:
•Objects that, given an outcome (sequence), compute a
probability value
Sequence space
M
Associates probability P(si | M) to si
e.g.: M is the representation of the family of globins
Probabilistic models for sequences
Most useful probabilistic models are Trainable systems
The probability density function over the sequence space is
estimated from known examples by means of a learning
algorithm
Known examples
Sequence space
Pdf estimate (generalization)
Sequence space
e.g.: Writing a generic representation of the sequences of
globins starting from a set of known globins
Probabilistic Models for Biological Sequences
•What are they?
•Why use them?
Modelling a protein family
Probabilistic model
Given a protein class (e.g. Globins), a probabilistic
model trained on this family can compute a
probability value for each new sequence
Seq1  0.98
Seq2  0.21
Seq3  0.12
Seq4  0.89
Seq5  0.47
Seq6  0.78
This value measures the similarity between the
new sequence and the family described by the
model
Probabilistic Models for Biological Sequences
•What are they?
•Why use them?
•Which probabilities do they compute?
P( s | M ) or P( M | s ) ?
A model M associates to a sequence s the probability
P( s | M )
This probability answers the question:
What is the probability that a model describing the Globins generates the sequence s?
The question we want to answer is:
Given a sequence s, is it a Globin?
We need to compute P( M | s ) !!
Bayes Theorem
Joint probability:  P(X,Y) = P(X | Y) P(Y) = P(Y | X) P(X)
Therefore:  P(Y | X) = P(X | Y) P(Y) / P(X)
So:
P(M | s) = P(s | M) P(M) / P(s)
P(M) and P(s) are the a priori probabilities.
P(s | M): probability of the evidence s given the conclusion M
P(M | s): probability of the conclusion M given the evidence s
Bayes’ rule: Example
• A rare disease affects 1 out of 100,000 people.
• A test shows positive
– with probability 0.99 when applied to an ill person, and
– with probability 0.01 when applied to a healthy person.
• What is the probability that you have the
disease given that you test positive?
Bayes’ rule: Example
P(+ | ill) = 0.99
P(+ | healthy) = 0.01
P(ill) = 10^-5
P(ill | +) = P(+ | ill) P(ill) / [ P(+ | ill) P(ill) + P(+ | healthy) P(healthy) ]
           = 0.99·10^-5 / [ 0.99·10^-5 + 0.01·(1 - 10^-5) ] ≈ 9.89·10^-4
Happy End:
It is far more likely that the test is incorrect!!
Is the pope an alien?
Since the probability
P(Pope|Human) =1/(6,000,000,000)
does this imply that the Pope is not a human being?
Beck-Bornholdt HP, Dubben HH, Nature 381, 730 (1996)
P(Pope|Human) is not the same as P(Human|Pope)
P(Human | Pope) = P(Pope | Human) P(Human) / [ P(Pope | Human) P(Human) + P(Pope | Alien) P(Alien) ]
but P(Alien) ≈ 0
So P(Human | Pope) ≈ 1.0
The pope is (probably) not an alien
S Eddy and D MacKay’s answer
The A priori probabilities
P(M | s) = P(s | M) P(M) / P(s)
P(M) and P(s) are the a priori probabilities.
P(M) is the probability of the model (i.e. of the class
described by the model) BEFORE we know the sequence:
can be estimated as the abundance of the class
P(s) is the probability of the sequence in the sequence
space.
Cannot be reliably estimated!!
Comparison between models
We can overcome the problem comparing the probability of
generating s from different models
P(M1 | s) / P(M2 | s) = [ P(s | M1) P(M1) / P(s) ] / [ P(s | M2) P(M2) / P(s) ] = [ P(s | M1) P(M1) ] / [ P(s | M2) P(M2) ]
P(M1) / P(M2) is the ratio between the abundances of the two classes.
Null model
Otherwise we can score a sequence for a model M by comparing it to a Null Model:
a model that generates ALL the possible sequences with probabilities depending ONLY on the statistical amino acid abundance.
S(M, s) = log [ P(s | M) / P(s | N) ]
[Distribution of S(M, s): sequences NOT belonging to model M score low, sequences belonging to model M score high]
In this case we need a threshold and a statistic for
evaluating the significance (E-value, P-value)
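As an illustration, a minimal Python sketch of such a log-odds score for a simple position-independent model (all probability values are invented for the example):

```python
import math

# S(M, s) = log [ P(s|M) / P(s|N) ]: the model M uses family-specific residue
# frequencies, the null model N uses background frequencies (both illustrative).
def log_odds(seq, p_model, p_null):
    return sum(math.log(p_model[c] / p_null[c]) for c in seq)

p_null  = {"A": 0.08, "L": 0.10, "K": 0.06, "I": 0.06}   # background frequencies (example)
p_model = {"A": 0.20, "L": 0.15, "K": 0.05, "I": 0.10}   # family frequencies (example)
print(log_odds("ALKALI", p_model, p_null))               # positive: more family-like than background
```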
The simplest probabilistic models:
Markov Models
•Definition
Markov Models Example: Weather
Register the weather conditions day by day: as a first hypothesis, the weather condition on a given day depends ONLY on the weather condition of the day before.
States: C = Clouds, R = Rain, F = Fog, S = Sun
Define the conditional probabilities P(C|C), P(C|R), …, P(R|C), …
The probability of the 5-day record CRRCS is:
P(CRRCS) = P(C)·P(R|C)·P(R|R)·P(C|R)·P(S|C)
Markov Model
Stochastic generator of sequences in which the probability of the state at position i depends ONLY on the state at position i-1.
Given an alphabet C = {c1, c2, c3, …, cN}, a Markov model is described by N×(N+2) parameters:
{ a_rt, a_BEGIN,t, a_r,END ;  r, t ∈ C }
a_rq = P( s_i = q | s_i-1 = r )
a_BEGIN,q = P( s_1 = q )
a_r,END = P( s_i = END | s_i-1 = r )
Σ_t a_rt + a_r,END = 1   for every r
Σ_t a_BEGIN,t = 1
[State diagram: BEGIN and END states connected to the states c1 … cN]
Markov Models
Given the sequence s = s1 s2 s3 s4 … sT, with s_i ∈ C = {c1, c2, …, cN}:
P( s | M ) = P( s1 ) · ∏_{i=2..T} P( s_i | s_i-1 ) = a_BEGIN,s1 · ∏_{i=2..T} a_{s_i-1, s_i} · a_{sT,END}
P("ALKALI") = a_BEGIN,A · a_AL · a_LK · a_KA · a_AL · a_LI · a_I,END
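A minimal Python sketch of this computation on the weather alphabet. The transition values below are invented for illustration (the exercise diagrams that follow are figures); the uniform starting probabilities follow the exercise, P(X|BEGIN) = 0.25:

```python
# P(s|M) for a first-order Markov model:
# P(s|M) = a_BEGIN,s1 * prod_i a_{s_{i-1} s_i} * (a_{sT,END} if used)
def markov_prob(seq, a_begin, a, a_end=None):
    p = a_begin[seq[0]]
    for prev, curr in zip(seq, seq[1:]):
        p *= a[prev][curr]
    if a_end is not None:
        p *= a_end[seq[-1]]
    return p

a_begin = {"C": 0.25, "R": 0.25, "F": 0.25, "S": 0.25}        # P(X|BEGIN) = 0.25
a = {"C": {"C": 0.3, "R": 0.3, "F": 0.2, "S": 0.2},           # illustrative transitions
     "R": {"C": 0.3, "R": 0.4, "F": 0.2, "S": 0.1},
     "F": {"C": 0.4, "R": 0.2, "F": 0.2, "S": 0.2},
     "S": {"C": 0.2, "R": 0.1, "F": 0.2, "S": 0.5}}
print(markov_prob("CRRCS", a_begin, a))                        # termination ignored here
```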
Markov Models: Exercise
1) Fill in the undefined transition probabilities.
[Two weather-model state diagrams over the states C, R, F, S, with transition probabilities on the arcs; some values are marked "??" and must be chosen so that the outgoing probabilities of each state sum to 1.]
Markov Models: Exercise
2) Which model better describes the weather in summer? Which one describes the weather in winter?
[The two weather-model state diagrams, now with all transition probabilities filled in.]
Markov Models: Exercise
3) Given the sequence CSSSCFS, which model gives the higher probability?
[Consider the starting probabilities: P(X|BEGIN) = 0.25]
[Winter and Summer state diagrams with their transition probabilities.]
Markov Models: Exercise
P(CSSSCFS | Winter) = 0.25·0.1·0.2·0.2·0.3·0.2·0.2 = 1.2·10^-5
P(CSSSCFS | Summer) = 0.25·0.4·0.8·0.8·0.1·0.1·1.0 = 6.4·10^-4
4) Can we conclude that the observation sequence refers to a summer week?
Markov Models: Exercise
P(Seq | Winter) = 1.2·10^-5
P(Seq | Summer) = 6.4·10^-4
P(Winter | Seq) / P(Summer | Seq) = [ P(Seq | Winter) P(Winter) ] / [ P(Seq | Summer) P(Summer) ]
The answer therefore also depends on the a priori probabilities of Winter and Summer.
Simple Markov Model for DNA sequences
[State diagram with the four states A, C, G, T]
DNA alphabet: C = {Adenine, Cytosine, Guanine, Thymine}
16 transition probabilities (12 of which are independent) + 4 Begin probabilities + 4 End probabilities.
The parameters of the model are different in different zones of the DNA.
They describe the overall composition and the dinucleotide (couple) recurrences.
Example of Markov Models: GpC Islands
[Two four-state Markov models over {A, C, G, T}: one trained on GpC islands (e.g. GATGCGTCGC), one on non-GpC islands (e.g. CTACGCAGCG)]
In the Markov model of GpC islands, a_GC is higher than in the Markov model of non-GpC islands.
Given a sequence s we can evaluate:
P(GpC | s) = P(s | GpC)·P(GpC) / [ P(s | GpC)·P(GpC) + P(s | nonGpC)·P(nonGpC) ]
The simplest probabilistic models:
Markov Models
•Definition
•Training
Training of Markov Models
Let M be the set of parameters of model M.
During the training phase, M parameters are estimated from the set of
known data D
Maximum Likelihood Extimation (ML)
ML = argmax P( D | M,  )
It can be proved that:
nik
Frequency of occurrence as counted in the
aik =
Sj nij data set D
Maximum A Posteriori Extimation (MAP)
MAP = argmax P(  | M, D ) = argmax [P( D | M,  )  P( ) ]
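A minimal Python sketch of the ML estimate by counting transitions in a toy training set (the sequences are invented for the example):

```python
from collections import defaultdict

# ML estimation of Markov-model transitions: a_ik = n_ik / sum_j n_ij.
# BEGIN and END are treated as extra states; sequences are plain strings.
def train_markov(sequences):
    counts = defaultdict(lambda: defaultdict(int))
    for s in sequences:
        path = ["BEGIN"] + list(s) + ["END"]
        for prev, curr in zip(path, path[1:]):
            counts[prev][curr] += 1
    a = {}
    for prev, row in counts.items():
        tot = sum(row.values())
        a[prev] = {curr: n / tot for curr, n in row.items()}
    return a

D = ["CRRCS", "CSSR", "RRCS"]        # toy training set (illustrative)
a = train_markov(D)
print(a["C"])                        # ML transition probabilities out of state C
```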
Maximum Likelihood training: Proof
Given a sequence s contained in D: s = s1 s2 s3 … sT
P( s | M ) = a_BEGIN,s1 · ∏_{i=2..T} a_{s_i-1, s_i} · a_{sT,END}
We can count the number of transitions n_jk between any two states j and k; then
P( s | M ) = ∏_{j=0..N+1} ∏_{k=0..N+1} ( a_jk )^n_jk
where states 0 and N+1 are BEGIN and END.
The normalisation constraints are taken into account using the Lagrange multipliers λ_j:
P~( s | M ) = P( s | M ) + Σ_j λ_j · ( Σ_k a_jk - 1 )
Setting the derivatives to zero:
∂P~( s | M ) / ∂a_jk = n_jk · P( s | M ) / a_jk + λ_j = 0
and imposing Σ_k' a_jk' = 1 for every j, we obtain:
a_jk = n_jk / Σ_k' n_jk'
Hidden Markov Models
•Preliminary examples
Loaded dice
We have 99 regular dice (R) and 1 loaded die (L).
        R     L
P(1)   1/6   1/10
P(2)   1/6   1/10
P(3)   1/6   1/10
P(4)   1/6   1/10
P(5)   1/6   1/10
P(6)   1/6   1/2
Given a sequence of outcomes:
4156266656321636543662152611536264162364261664616263
we don't know the sequence of dice that generated it, e.g.:
RRRRRLRLRRRRRRRLRRRRRRRRRRRRLRLRRRRRRRRLRRRRLRRRRLRR
Loaded dice
Hypothesis:
We choose a different die for each roll.
Two stochastic processes give origin to the sequence of observations:
1) choosing the die (R or L);
2) rolling the die.
The sequence of dice is hidden.
The first process is assumed to be Markovian (in this case a 0-order MM).
The outcome of the second process depends only on the state reached in the first process (that is, on the chosen die).
Loaded dice
Model
[Two-state model over R and L: at each roll the regular die R is chosen with probability 0.99 and the loaded die L with probability 0.01]
Each state (R and L) generates a character of the alphabet C = {1, 2, 3, 4, 5, 6}.
The emission probabilities depend only on the state.
The transition probabilities describe a Markov model that generates a state path: the hidden sequence (π).
The observation sequence (s) is generated by two concomitant stochastic processes.
Some semi-serious examples
1) DEMOGRAPHY
Observable: number of births and/or deaths per year in a given place
Hidden variable: economic condition (to a first approximation, if we regard the economic "fortune" of a given year as a random process, the accumulated wealth is a Markovian process)
---> can we reconstruct the economic condition starting from the civil registries?
2) WEATHER-SENSITIVE TEACHER
Observable: daily average of the marks recorded in the register of a weather-sensitive teacher
Hidden variable: weather condition
---> can we reconstruct the weather starting from the teacher's register?
Some more serious examples
1) SECONDARY STRUCTURE
Observable: protein sequence
Hidden variable: secondary structure
---> can we predict the secondary structure starting from the sequence?
2) ALIGNMENTS
Observable: protein sequence
Hidden variable: position of each residue in the alignment of a protein family
---> can we reconstruct the alignment of the sequence to the protein family starting from the sequence?
GpC Islands
Given a long non-annotated DNA sequence, we want to localise the GpC islands (if they exist).
We build a model that unifies the two Markov models for GpC islands and non-GpC islands; transitions between any state of the first and any state of the second are added.
[Eight-state model: A+, C+, G+, T+ (GpC islands) and A-, C-, G-, T- (non-GpC islands)]
Now there is no one-to-one correspondence between states and symbols.
GpC Island: conditioning events
Instead of the model above (two four-state Markov models over {A, C, G, T}, one for GpC islands and one for non-GpC islands), can we use a model similar to that of the dice example, with only two states (GpC and non-GpC) emitting the characters of the alphabet C = {A, G, C, T}?
Using such a model, all the characters of the generated sequence would be independent of the preceding character.
Hidden Markov Models
•Preliminary examples
•Formal definition
Formal definition of Hidden Markov Models
A HMM is a stochastic generator of sequences characterised by:
• N states
• A set of transition probabilities between states {a_kj}:  a_kj = P( π(i) = j | π(i-1) = k )
• A set of starting probabilities {a_0k}:  a_0k = P( π(1) = k )
• A set of ending probabilities {a_k0}:  a_k0 = P( π(i) = END | π(i-1) = k )
• An alphabet C with M characters
• A set of emission probabilities for each state {e_k(c)}:  e_k(c) = P( s_i = c | π(i) = k )
s: sequence;  π: path through the states
Constraints:
Σ_k a_0k = 1
a_k0 + Σ_j a_kj = 1   for every k
Σ_{c∈C} e_k(c) = 1    for every k
Generating a sequence with a HMM
1. Choose the initial state π(1) following the probabilities a_0k; set i = 1.
2. Choose the character s_i from the alphabet C following the emission probabilities e_k(c) of the current state.
3. Choose the next state following the probabilities a_kj and a_k0.
4. If the END state has been chosen, stop; otherwise set i ← i+1 and go back to step 2.
GpC Island, simple model
Two emitting states, Y (GpC island) and N (non-GpC island), plus BEGIN and END.
Transitions: a_0Y = 0.2, a_0N = 0.8, a_YY = 0.7, a_YN = 0.2, a_Y0 = 0.1, a_NN = 0.8, a_NY = 0.1, a_N0 = 0.1
Emissions (Y): e_Y(A) = 0.1, e_Y(C) = 0.4, e_Y(G) = 0.4, e_Y(T) = 0.1
Emissions (N): e_N(A) = e_N(C) = e_N(G) = e_N(T) = 0.25
s: A G C G C G T A A T C T G
π: Y Y Y Y Y Y Y N N N N N N
P( s, π | M ) can be easily computed.
P( s, π | M ) can be easily computed:
[Same model as above]
s: A   G   C   G   C   G   T   A    A    T    C    T    G
π: Y   Y   Y   Y   Y   Y   Y   N    N    N    N    N    N
Emission:   0.1 · 0.4 · 0.4 · 0.4 · 0.4 · 0.4 · 0.1 · 0.25 · 0.25 · 0.25 · 0.25 · 0.25 · 0.25
Transition: 0.2 · 0.7 · 0.7 · 0.7 · 0.7 · 0.7 · 0.7 · 0.2 · 0.8 · 0.8 · 0.8 · 0.8 · 0.8 · 0.1
Multiplying all the probabilities gives the probability of having the sequence AND the path through the states.
Evaluation of the joint probability of the sequence and the path
P( s, π | M ) = P( s | π, M ) · P( π | M )
P( π | M ) = a_0,π(1) · ∏_{i=2..T} a_π(i-1),π(i) · a_π(T),0
P( s | π, M ) = ∏_{i=1..T} e_π(i)( s_i )
P( s, π | M ) = a_π(T),0 · ∏_{i=1..T} a_π(i-1),π(i) · e_π(i)( s_i )
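A minimal Python sketch of this computation, using the toy GpC-island model defined above:

```python
# Joint probability P(s, pi | M) = a_{0,pi(1)} * prod_i [a_{pi(i-1),pi(i)} * e_{pi(i)}(s_i)] * a_{pi(T),0}
a = {"BEGIN": {"Y": 0.2, "N": 0.8},
     "Y": {"Y": 0.7, "N": 0.2, "END": 0.1},
     "N": {"Y": 0.1, "N": 0.8, "END": 0.1}}
e = {"Y": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
     "N": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}}

def joint_prob(seq, path):
    p, prev = 1.0, "BEGIN"
    for c, k in zip(seq, path):
        p *= a[prev][k] * e[k][c]
        prev = k
    return p * a[prev]["END"]

print(joint_prob("AGCGCGTAATCTG", "YYYYYYYNNNNNN"))
```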
Hidden Markov Models
•Preliminary examples
•Formal definition
•Three questions
GpC Island, simple model
[Same model as above]
s: A G C G C G T A A T C T G
π: ? ? ? ? ? ? ? ? ? ? ? ? ?
P( s, π | M ) can be easily computed.
How to evaluate P( s | M )?
How to evaluate P( s | M )?
[Same model as above]
P( s | M ) = Σ_π P( s, π | M )
s:  A G C G C G T A A T C T G
π1: Y Y Y Y Y Y Y Y Y Y Y Y Y
π2: Y Y Y Y Y Y Y Y Y Y Y Y N
π3: Y Y Y Y Y Y Y Y Y Y Y N Y
π4: Y Y Y Y Y Y Y Y Y Y Y N N
π5: Y Y Y Y Y Y Y Y Y Y N Y Y
…
There are 2^13 different paths.
Summing over all the paths will give the probability of having the sequence.
Resumé: GpC Island, simple model
[Same model as above]
s: A G C G C G T A A T C T G
π: ? ? ? ? ? ? ? ? ? ? ? ? ?
P( s, π | M ) can be easily computed.
How to evaluate P( s | M )?
Can we show the hidden path?
Can we show the hidden path?
[Same model as above]
π* = argmax_π [ P( π | s, M ) ] = argmax_π [ P( π, s | M ) ]
s:  A G C G C G T A A T C T G
π1: Y Y Y Y Y Y Y Y Y Y Y Y Y
π2: Y Y Y Y Y Y Y Y Y Y Y Y N
π3: Y Y Y Y Y Y Y Y Y Y Y N Y
π4: Y Y Y Y Y Y Y Y Y Y Y N N
π5: Y Y Y Y Y Y Y Y Y Y N Y Y
…
Among the 2^13 different paths, the Viterbi path is the path that gives the best joint probability.
Can we show the hidden path?
A posteriori decoding
For each position i choose the state:
π(i) = argmax_k [ P( π(i) = k | s, M ) ]
The contribution to this probability derives from all the paths that go through state k at position i.
The a posteriori path can be a nonsense path: it may not be a legitimate path if some transitions are not permitted in the model.
GpC Island, simple model
[Same model as above, but now with all transition and emission parameters unknown ("?")]
s: A G C G C G T A A T C T G
π: Y Y Y Y Y Y Y N N N N N N
P( s, π | M ) can be easily computed.
How to evaluate P( s | M )?
Can we show the hidden path?
Can we evaluate the parameters starting from known examples?
Can we evaluate the parameters starting from known examples?
[Same model as above, with unknown parameters]
s: A G C G C G T A A T C T G
π: Y Y Y Y Y Y Y N N N N N N
Emission:   e_Y(A)·e_Y(G)·e_Y(C)·e_Y(G)·e_Y(C)·e_Y(G)·e_Y(T)·e_N(A)·e_N(A)·e_N(T)·e_N(C)·e_N(T)·e_N(G)
Transition: a_0Y·a_YY·a_YY·a_YY·a_YY·a_YY·a_YY·a_YN·a_NN·a_NN·a_NN·a_NN·a_NN·a_N0
How to find the parameters e and a that maximise this probability?
What if we don't know the path?
Hidden Markov Models:Algorithms
•Resumé
•Evaluating P(s | M): Forward Algorithm
Computing P( s, π | M ) for each path is a redundant operation
[Same model as above]
s: A   G   C   G   C   G   T   A   A   T   C   T   G
π: Y   Y   Y   Y   Y   Y   Y   Y   Y   Y   Y   Y   Y
Emission:   0.1 · 0.4 · 0.4 · 0.4 · 0.4 · 0.4 · 0.1 · 0.1 · 0.1 · 0.1 · 0.4 · 0.1 · 0.4
Transition: 0.2 · 0.7 · 0.7 · 0.7 · 0.7 · 0.7 · 0.7 · 0.7 · 0.7 · 0.7 · 0.7 · 0.7 · 0.7 · 0.1
s: A   G   C   G   C   G   T   A   A   T   C   T   G
π: Y   Y   Y   Y   Y   Y   Y   Y   Y   Y   Y   Y   N
Emission:   0.1 · 0.4 · 0.4 · 0.4 · 0.4 · 0.4 · 0.1 · 0.1 · 0.1 · 0.1 · 0.4 · 0.1 · 0.25
Transition: 0.2 · 0.7 · 0.7 · 0.7 · 0.7 · 0.7 · 0.7 · 0.7 · 0.7 · 0.7 · 0.7 · 0.7 · 0.2 · 0.1
Computing P( s, π | M ) for each path is a redundant operation
[Trellis over the two states R and L, from BEGIN (iteration 0) to END (iteration T+1)]
For two paths π1 and π2 that coincide up to position T-1 and differ only in the last state:
P( s, π1 | M ) = ( ∏_{t=1..T-1} a_π1(t-1),π1(t) · e_π1(t)(s_t) ) · a_π1(T-1),π1(T) · e_π1(T)(s_T) · a_π1(T),0
P( s, π2 | M ) = ( ∏_{t=1..T-1} a_π2(t-1),π2(t) · e_π2(t)(s_t) ) · a_π2(T-1),π2(T) · e_π2(T)(s_T) · a_π2(T),0
If we compute the common part only once we gain 2·(T-1) operations.
Summing over all the possible paths
[Same GpC model as above]
First two positions (s1 = A, s2 = G):
π = Y Y:  Emission 0.1 · 0.4,   Transition 0.2 · 0.7  →  0.0056
π = N Y:  Emission 0.25 · 0.4,  Transition 0.8 · 0.1  →  0.008
Sum (paths ending in Y): 0.0136
π = Y N:  Emission 0.1 · 0.25,  Transition 0.2 · 0.2  →  0.001
π = N N:  Emission 0.25 · 0.25, Transition 0.8 · 0.8  →  0.04
Sum (paths ending in N): 0.041
Summing over all the possible paths
[Same GpC model as above]
Third position (s3 = C):
Paths ending in Y:  ( 0.0136 · 0.7 + 0.041 · 0.1 ) · 0.4  = 0.005448
Paths ending in N:  ( 0.0136 · 0.2 + 0.041 · 0.8 ) · 0.25 = 0.00888
BEGIN
GpC Island
a0Y= 0.2
aYN = 0.2
Y
aYY = 0.7
eY (A) = 0.1 eY (G) = 0.4
eY (C) = 0.4 eY (T) = 0.1
a0N = 0.8
Non- GpC Island
aNN = 0.8
N
aNY = 0.1
aN0 = 0.1
aY0 = 0.1
END
eN (A) = 0.25 eN (G) = 0.25
eN (C) = 0.25 eN (T) = 0.25
Iterating until the last position of the sequence:
A
X
G
X
C
X
G
X
C
X
G
X
T
X
A
X
A
X
T
X
C
X
T
X
G
Y
 0.1 (aY0)
A
X
G
X
C
X
G
X
C
X
G
X
T
X
A
X
A
X
T
X
C
X
T
X
+
G
N
 0.1 (aN0)
P(s|M)
States
Summing over all the possible paths
[Trellis over the states R and L, from BEGIN (iteration 0) to END (iteration T+1)]
If we know the probabilities of emitting the first two characters of the sequence ending the path in states R and L respectively:
F_R(2) = P( s1, s2, π(2) = R | M )    and    F_L(2) = P( s1, s2, π(2) = L | M )
then we can compute:
P( s1, s2, s3, π(3) = R | M ) = F_R(2) · a_RR · e_R(s3) + F_L(2) · a_LR · e_R(s3)
Forward Algorithm
On the basis of the preceding observations, the computation of P( s | M ) can be decomposed into simpler problems.
For each state k and each position i in the sequence, we compute:
F_k(i) = P( s1 s2 s3 … s_i, π(i) = k | M )        (the conditioning on M will be understood from now on)
Initialisation:  F_BEGIN(0) = 1;  F_k(0) = 0 for every k ≠ BEGIN
Recurrence:      F_l(i+1) = P( s1 s2 … s_i s_i+1, π(i+1) = l )
                          = Σ_k P( s1 s2 … s_i, π(i) = k ) · a_kl · e_l( s_i+1 )
                          = e_l( s_i+1 ) · Σ_k F_k(i) · a_kl
Termination:     P( s ) = P( s1 s2 s3 … sT, π(T+1) = END ) = Σ_k P( s1 s2 … sT, π(T) = k ) · a_k0 = Σ_k F_k(T) · a_k0
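A minimal Python sketch of the Forward algorithm on the toy GpC-island model; the values of F after the second symbol reproduce the 0.0136 and 0.041 obtained in the worked example above:

```python
# Forward algorithm: F_k(i) = P(s_1..s_i, pi(i)=k); P(s|M) = sum_k F_k(T) * a_k0
a0 = {"Y": 0.2, "N": 0.8}                                 # a_0k (BEGIN -> k)
a  = {"Y": {"Y": 0.7, "N": 0.2}, "N": {"Y": 0.1, "N": 0.8}}
a_end = {"Y": 0.1, "N": 0.1}                              # a_k0 (k -> END)
e  = {"Y": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
      "N": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}}
states = ["Y", "N"]

def forward(seq):
    F = [{k: a0[k] * e[k][seq[0]] for k in states}]                          # initialisation
    for c in seq[1:]:                                                        # recurrence
        F.append({l: e[l][c] * sum(F[-1][k] * a[k][l] for k in states) for l in states})
    return sum(F[-1][k] * a_end[k] for k in states), F                       # termination

p, F = forward("AGCGCGTAATCTG")
print(F[1])    # {'Y': 0.0136, 'N': 0.041}, as in the worked example
print(p)       # P(s | M)
```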
Forward Algorithm
[Trellis over the states, from BEGIN (iteration 0) to END (iteration T+1): each entry F_B(2) is obtained from the previous column as e_B(s2) · Σ_i F_i(1) · a_iB; P(s | M) is collected at the END state]
Forward algorithm: computational complexity
Naïve method:  P( s | M ) = Σ_π P( s, π | M )
There are N^T possible paths.
Each path requires about 2T operations.
The time for the computation is O( T · N^T ).
[Example: the two explicit path computations shown above, differing only in the last state.]
Forward algorithm: computational complexity
Forward algorithm: T positions, N values for each position.
Each element requires about 2N products and 1 sum.
The time for the computation is O( T · N² ).
[Example: the forward step computed above for the third position.]
Forward algorithm: computational complexity
[Plot: number of operations versus sequence length T (1 to 7) for the naïve method and for the forward algorithm; the naïve method grows exponentially with T, the forward algorithm grows linearly.]
Hidden Markov Models:Algorithms
•Resumé
•Evaluating P(s | M): Forward Algorithm
•Evaluating P(s | M): Backward Algorithm
Backward Algorithm
Similar to the Forward algorithm: it computes P( s | M ), reconstructing the sequence from the end.
For each state k and each position i in the sequence, we compute:
B_k(i) = P( s_i+1 s_i+2 s_i+3 … sT | π(i) = k )
Initialisation:  B_k(T) = P( π(T+1) = END | π(T) = k ) = a_k0
Recurrence:      B_l(i-1) = P( s_i s_i+1 … sT | π(i-1) = l )
                          = Σ_k P( s_i+1 s_i+2 … sT | π(i) = k ) · a_lk · e_k( s_i )
                          = Σ_k B_k(i) · e_k( s_i ) · a_lk
Termination:     P( s ) = P( s1 s2 s3 … sT | π(0) = BEGIN )
                        = Σ_k P( s2 … sT | π(1) = k ) · a_0k · e_k( s1 )
                        = Σ_k B_k(1) · a_0k · e_k( s1 )
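A minimal Python sketch of the Backward algorithm on the same toy model; it returns the same P(s | M) as the Forward algorithm:

```python
# Backward algorithm: B_k(i) = P(s_{i+1}..s_T | pi(i)=k); P(s|M) = sum_k B_k(1) * a_0k * e_k(s_1)
a0 = {"Y": 0.2, "N": 0.8}
a  = {"Y": {"Y": 0.7, "N": 0.2}, "N": {"Y": 0.1, "N": 0.8}}
a_end = {"Y": 0.1, "N": 0.1}
e  = {"Y": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
      "N": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}}
states = ["Y", "N"]

def backward(seq):
    T = len(seq)
    B = [dict() for _ in range(T)]
    B[T - 1] = {k: a_end[k] for k in states}                              # initialisation
    for i in range(T - 2, -1, -1):                                        # recurrence
        B[i] = {l: sum(B[i + 1][k] * e[k][seq[i + 1]] * a[l][k] for k in states) for l in states}
    p = sum(B[0][k] * a0[k] * e[k][seq[0]] for k in states)               # termination
    return p, B

p, B = backward("AGCGCGTAATCTG")
print(p)   # same value as the forward algorithm
```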
Backward Algorithm
[Trellis over the states, filled from iteration T back to 1: each entry B_B(T-1) is obtained from the following column as Σ_k B_k(T) · a_Bk · e_k(sT); P(s | M) is collected at the BEGIN state]
Hidden Markov Models:Algorithms
•Resumé
•Evaluating P(s | M): Forward Algorithm
•Evaluating P(s | M): Backward Algorithm
•Showing the path: Viterbi decoding
Finding the best path
[Same GpC model as above]
First two positions (s1 = A, s2 = G):
π = Y Y:  Emission 0.1 · 0.4,   Transition 0.2 · 0.7  →  0.0056
π = N Y:  Emission 0.25 · 0.4,  Transition 0.8 · 0.1  →  0.008
Max (paths ending in Y): 0.008   (best: π = N Y)
π = Y N:  Emission 0.1 · 0.25,  Transition 0.2 · 0.2  →  0.001
π = N N:  Emission 0.25 · 0.25, Transition 0.8 · 0.8  →  0.04
Max (paths ending in N): 0.04    (best: π = N N)
Finding the best path
[Same GpC model as above]
Third position (s3 = C):
Paths ending in Y:  max( 0.008 · 0.7, 0.04 · 0.1 ) · 0.4  = 0.00224   (best: π = N Y Y)
Paths ending in N:  max( 0.008 · 0.2, 0.04 · 0.8 ) · 0.25 = 0.008     (best: π = N N N)
Finding the best path
[Same GpC model as above]
Iterating until the last position of the sequence, we obtain the best path ending in Y and the best path ending in N:
s: A G C G C G T A A T C T G
π: N Y Y Y Y Y Y N N N Y Y Y    · 0.1 (a_Y0)
π: N N N N N N N N N N N N N    · 0.1 (a_N0)
Choose the maximum.
Viterbi Algorithm
π* = argmax_π [ P( π, s | M ) ]
The computation of P( s, π* | M ) can be decomposed into simpler problems.
Let V_k(i) be the probability of the most probable path generating the subsequence s1 s2 s3 … s_i and ending in state k at iteration i.
Initialisation:  V_BEGIN(0) = 1;  V_k(0) = 0 for every k ≠ BEGIN
Recurrence:      V_l(i+1) = e_l( s_i+1 ) · max_k ( V_k(i) · a_kl )
                 ptr_i(l) = argmax_k ( V_k(i) · a_kl )
Termination:     P( s, π* ) = max_k ( V_k(T) · a_k0 )
Traceback:       π*(T) = argmax_k ( V_k(T) · a_k0 )
                 π*(i-1) = ptr_i( π*(i) )
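A minimal Python sketch of the Viterbi algorithm with traceback, on the same toy model:

```python
# Viterbi algorithm: V_k(i), back-pointers, and traceback of the most probable path.
a0 = {"Y": 0.2, "N": 0.8}
a  = {"Y": {"Y": 0.7, "N": 0.2}, "N": {"Y": 0.1, "N": 0.8}}
a_end = {"Y": 0.1, "N": 0.1}
e  = {"Y": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
      "N": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}}
states = ["Y", "N"]

def viterbi(seq):
    V, ptr = [{k: a0[k] * e[k][seq[0]] for k in states}], []
    for c in seq[1:]:                                        # recurrence with back-pointers
        col, back = {}, {}
        for l in states:
            best = max(states, key=lambda k: V[-1][k] * a[k][l])
            back[l] = best
            col[l] = e[l][c] * V[-1][best] * a[best][l]
        V.append(col)
        ptr.append(back)
    last = max(states, key=lambda k: V[-1][k] * a_end[k])    # termination
    best_p = V[-1][last] * a_end[last]
    path = [last]
    for back in reversed(ptr):                               # traceback
        path.append(back[path[-1]])
    return best_p, "".join(reversed(path))

print(viterbi("AGCGCGTAATCTG"))
```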
Viterbi Algorithm
[Trellis over the states, from BEGIN (iteration 0) to END (iteration T+1): each entry V_B(i+1) is obtained as e_B(s_i+1) · max_k V_k(i) · a_kB, and a pointer records the argmax; P(s, π* | M) is collected at the END state]
Viterbi Algorithm
[Trellis showing the Viterbi path, recovered by traceback from the END state]
Different paths can have the same probability.
Hidden Markov Models:Algorithms
•Resumé
•Evaluating P(s | M): Forward Algorithm
•Evaluating P(s | M): Backward Algorithm
•Showing the path: Viterbi decoding
•Showing the path: A posteriori decoding
•Training a model: EM algorithm
If we know the path generating the training sequence
[Same GpC model as above, with unknown parameters]
s: A G C G C G T A A T C T G
π: Y Y Y Y Y Y Y N N N N N N
Emission:   e_Y(A)·e_Y(G)·e_Y(C)·e_Y(G)·e_Y(C)·e_Y(G)·e_Y(T)·e_N(A)·e_N(A)·e_N(T)·e_N(C)·e_N(T)·e_N(G)
Transition: a_0Y·a_YY·a_YY·a_YY·a_YY·a_YY·a_YY·a_YN·a_NN·a_NN·a_NN·a_NN·a_NN·a_N0
Just count!
Example:
a_YY = n_YY / ( n_YY + n_YN ) = 6/7
e_Y(A) = n_Y(A) / [ n_Y(A) + n_Y(C) + n_Y(G) + n_Y(T) ] = 1/7
Expectation-Maximisation algorithm
We need to estimate the Maximum Likelihood parameters when the
paths generating the training sequences are unknown
ML = argmax [P ( s | , M)]
Given a model with parameters 0 the EM algorithm finds new
parameters  that increase the likelihood of the model:
P( s |  ) > P( s| 0 )
Expectation-Maximisation algorithm
s:  A G C G C G T A A T C T G
π1: Y Y Y Y Y Y Y Y Y Y Y Y Y
π2: Y Y Y Y Y Y Y Y Y Y Y Y N
π3: Y Y Y Y Y Y Y Y Y Y Y N Y
π4: Y Y Y Y Y Y Y Y Y Y Y N N
…
Given a path π we can count:
the number of transitions between states k and l: A_kl(π)
the number of emissions of character c from state k: E_k(c,π)
We can compute the expected values over all the paths:
A_kl = Σ_π P( π | s, θ0 ) · A_kl(π)
E_k(c) = Σ_π P( π | s, θ0 ) · E_k(c,π)
The updated parameters are:
a_kl = A_kl / Σ_{m=1..N} A_km
e_k(c) = E_k(c) / Σ_{c'} E_k(c')
Expectation-Maximisation algorithm
We need to estimate the Maximum Likelihood parameters when the
paths generating the training sequences are unknown
ML = argmax [P ( s | , M)]
Given a model with parameters 0 the EM algorithm finds new
parameters  that increase the likelihood of the model:
P( s |  ) > P( s| 0 )
or equivalentely
log P( s |  ) > log P( s| 0 )
Expectation-Maximisation algorithm
log P( s |  ) = log P(s,p | ) - log P(p | s, )
Multiplying for P(p | s,0) and summing over all the possible paths:
log P( s |  ) =
=Sp P(p | s,0) ·log P(s,p |  ) - Sp P(p | s,0) · log P(p | s,)
Q( |0) : Expectation value of log P(s,p | ) over all the “current” paths
log P( s |  ) - log P(s | 0) =
P(π | s, )
0
P(
π
|
s,

)

log
= Q( |0) - Q(0|0) +π
P(π | s, 0 )
 Q( |0) - Q(0|0)
0
Expectation-Maximisation algorithm
The EM algorithm is an iterative process
Each iteration performs two steps:
E-step: evaluation of Q( θ | θ0 ) = Σ_π P( π | s, θ0 ) · log P( s, π | θ )
M-step: maximisation of Q( θ | θ0 ) over all θ
The EM algorithm is NOT guaranteed to converge to the GLOBAL maximum of the likelihood.
Baum-Welch implementation of the EM algorithm
E-step:
Q( θ | θ0 ) = Σ_π P( π | s, θ0 ) · log P( s, π | θ )
P( s, π | θ ) = a_0,π(1) · ∏_{i=1..T} a_π(i),π(i+1) · e_π(i)( s_i )
             = ∏_{k=0..N} ∏_{l=1..N} ( a_kl )^A_kl(π) · ∏_{k=1..N} ∏_{c∈C} ( e_k(c) )^E_k(c,π)
A_kl(π): number of transitions between the states k and l in path π
E_k(c,π): number of emissions of character c in path π
A_kl = Σ_π P( π | s, θ0 ) · A_kl(π)
E_k(c) = Σ_π P( π | s, θ0 ) · E_k(c,π)
(expected values over all the "actual" paths)
So:
Q( θ | θ0 ) = Σ_{k=0..N} Σ_{l=1..N} A_kl · log a_kl + Σ_{k=1..N} Σ_{c∈C} E_k(c) · log e_k(c)
Baum-Welch implementation of the EM algorithm
M-step:
∂Q/∂a_kl = 0      for every pair of states k and l, with the constraint Σ_l a_kl = 1
∂Q/∂e_k(c) = 0    for every state k and character c, with the constraint Σ_c e_k(c) = 1
By means of the Lagrange multiplier technique, we can solve the system:
a_kl = A_kl / Σ_{m=1..N} A_km
e_k(c) = E_k(c) / Σ_{c'∈C} E_k(c')
Baum-Welch implementation of the EM algorithm
How to compute the expected numbers of transitions and emissions over all the paths:
F_k(i) = P( s1 s2 s3 … s_i, π(i) = k )
B_k(i) = P( s_i+1 s_i+2 s_i+3 … sT | π(i) = k )
A_kl = Σ_i P( π(i) = k, π(i+1) = l | s, θ ) = Σ_i F_k(i) · a_kl · e_l( s_i+1 ) · B_l(i+1) / P(s)
E_k(c) = Σ_{i : s_i = c} P( π(i) = k | s, θ ) = Σ_{i : s_i = c} F_k(i) · B_k(i) / P(s)
Baum-Welch implementation of the EM algorithm
Algorithm:
1. Start with random parameters.
2. Compute the Forward and Backward matrices on the known sequences.
3. Compute A_kl and E_k(c), the expected numbers of transitions and emissions.
4. Update the parameters: a_kl ← normalised A_kl, e_k(c) ← normalised E_k(c).
5. If P( s | M ) has increased, go back to step 2; otherwise stop.
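A minimal Python sketch of one such iteration on the toy GpC model used above (single training sequence, BEGIN/END transitions left fixed; a practical implementation iterates until convergence, handles multiple sequences, and works in log space or with scaling to avoid underflow):

```python
# One Baum-Welch iteration: expected counts from Forward/Backward, then re-normalisation.
seq, states, alphabet = "AGCGCGTAATCTG", ["Y", "N"], "ACGT"
a0 = {"Y": 0.2, "N": 0.8}
a  = {"Y": {"Y": 0.7, "N": 0.2}, "N": {"Y": 0.1, "N": 0.8}}
a_end = {"Y": 0.1, "N": 0.1}
e  = {"Y": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}, "N": {c: 0.25 for c in alphabet}}
T = len(seq)

# Forward
F = [{k: a0[k] * e[k][seq[0]] for k in states}]
for c in seq[1:]:
    F.append({l: e[l][c] * sum(F[-1][k] * a[k][l] for k in states) for l in states})
P = sum(F[-1][k] * a_end[k] for k in states)
# Backward
B = [dict() for _ in range(T)]
B[T - 1] = dict(a_end)
for i in range(T - 2, -1, -1):
    B[i] = {l: sum(B[i + 1][k] * e[k][seq[i + 1]] * a[l][k] for k in states) for l in states}

# E-step: expected transition counts A_kl and emission counts E_k(c)
A = {k: {l: 0.0 for l in states} for k in states}
E = {k: {c: 0.0 for c in alphabet} for k in states}
for i in range(T - 1):
    for k in states:
        for l in states:
            A[k][l] += F[i][k] * a[k][l] * e[l][seq[i + 1]] * B[i + 1][l] / P
for i in range(T):
    for k in states:
        E[k][seq[i]] += F[i][k] * B[i][k] / P

# M-step: re-normalise (END transitions are ignored in this sketch)
new_a = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
new_e = {k: {c: E[k][c] / sum(E[k].values()) for c in alphabet} for k in states}
print(new_a)
print(new_e)
```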
Profile HMMs
•HMMs for alignments
How to align?
[Chain of states M0 M1 M2 M3 M4 M5: each state represents a position in the alignment]
A C G G T A   →  M0 M1 M2 M3 M4 M5
A C G A T C   →  M0 M1 M2 M3 M4 M5
A G A T C     →  M0 ?  M2 M3 M4 M5   (no character for position 1: how do we align it?)
Silent states
Each state represents a position in the alignment. Red transitions (which skip one or more positions) allow gaps: (N-1)! transitions.
To reduce the number of parameters we can use states that do not emit any character (silent states): 4N-5 transitions.
Profile HMMs
[Architecture: match states M0 … M5 (one per alignment position), insert states I0 … I4 (emitting, between positions) and delete states D1 … D4 (silent)]
A C G G T A        →  M0 M1 M2 M3 M4 M5
A C G C A G T C    →  M0 I0 I0 M1 M2 M3 M4 M5
A G A T C          →  M0 D1 M2 M3 M4 M5
Example of alignment
[Profile HMM with match states M0 … M5, insert states I0 … I4 and delete states D1 … D4]
Sequence 1: A S T R A L    Viterbi path: M0 M1 M2 M3 M4 M5
Sequence 2: A S T A I L    Viterbi path: M0 M1 M2 D3 M4 I4 M5
Sequence 3: A R T I        Viterbi path: M0 M1 M2 D3 D4 M5
Example of alignment
Grouping by vertical layers (positions 0 1 2 3 4 . 5):
Sequence 1:  M0 M1 M2 M3 M4 .  M5   →   A S T R A - L
Sequence 2:  M0 M1 M2 D3 M4 I4 M5   →   A S T - A I L
Sequence 3:  M0 M1 M2 D3 D4 .  M5   →   A R T - - - I
Alignment:
ASTRA-L
AST-AIL
ART---I
-Log P( s | M ) is an alignment score.
Searching for a structural/functional pattern in protein sequences
Zn binding loop:
[Small multiple alignment of the loop: mostly conserved Cys and His positions, hydrophobic (I/L) positions and a variable (R/K/S/T) position in between]
Cysteines can be replaced by an Aspartic Acid, but only ONCE in each sequence.
Searching for a structural/functional pattern in protein sequences
[Profile HMM with match states M0 … M7, insert states I0 … I6 and delete states D1 … D6, trained on the Zn-binding-loop pattern]
..ALCPCHCLCRICPLIY..
obtains a higher probability than
..WERWDHCIDSICLKDE..
because M0 and M4 have low emission probability for Aspartic Acid and, in the second sequence, we multiply that low probability twice.
Profile HMMs
•HMMs for alignments
•Example on globins
Structural alignment of globins
Bashford D, Chothia C & Lesk AM (1987) Determinants of a protein fold: unique features of the globin amino acid sequences. J.Mol.Biol. 196, 199-216
Alignment of globins reconstructed with profile HMMs
Krogh A, Brown M, Mian IS, Sjolander K & Haussler D (1994) Hidden Markov Models in computational
biology: applications to protein modelling. J.Mol.Biol. 235, 1501-1531
Discrimination power of profile HMMs
Z-score = [ Log(P(s | M)) - <Log(P(s | M))> ] / σ( Log(P(s | M)) )
Krogh A, Brown M, Mian IS, Sjolander K & Haussler D (1994) Hidden Markov Models in computational biology: applications to protein modelling. J.Mol.Biol. 235, 1501-1531
Profile HMMs
•HMMs for alignments
•Example on globins
•Other applications
Finding a domain
[Begin and End states connected through two self-looping insert states that flank a profile HMM specific for the considered domain]
Clustering subfamilies
[BEGIN and END states connected in parallel to HMM 1, HMM 2, HMM 3, …, HMM n]
Each sequence s contributes to the update of HMM i with a weight equal to P( s | Mi ).
Profile HMMs
•HMMs for alignments
•Example on globins
•Other applications
•Available codes and servers
HMMER at WUSTL: http://hmmer.wustl.edu/
Eddy SR (1998) Profile hidden Markov models. Bioinformatics 14:755-763
HMMER applications:
http://www.sanger.ac.uk/Software/Pfam/
SAM at UCSC: http://www.soe.ucsc.edu/research/compbio/sam.html
Krogh A, Brown M, Mian IS, Sjolander K & Haussler D (1994) Hidden Markov Models in
computational biology: applications to protein modelling. J.Mol.Biol. 235, 1501-1531
SAM applications:
http://www.cse.ucsc.edu/research/compbio/HMM-apps/T02-query.html
HMMPRO: http://www.netid.com/html/hmmpro.html
Pierre Baldi, Net-ID
HMMs for Mapping problems
•Mapping problems in protein prediction
Secondary structure
Covalent structure:
TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN
Secondary structure:
EEEE..HHHHHHHHHHHH....HHHHHHHH.EEEE...........
3D structure
Topology of membrane proteins
Topography: position of the TransMembrane Segments along the sequence
ALALMLCMLTYRHKELKLKLKK …
Outer Membrane: β-barrel, e.g. porin (Rhodobacter capsulatus)
Inner Membrane: α-helices, e.g. bacteriorhodopsin (Halobacterium salinarum)
HMMs for Mapping problems
•Mapping problems in protein prediction
•Labelled HMMs
HMM for secondary structure prediction
Simplest model:
[Three fully connected emitting states, one for each conformation: α, β and c (coil)]
Introducing a grammar:
[The label α is represented by a chain of states α1, α2, α3, …, the label β by a chain β1, β2, plus a coil state c]
HMM for secondary structure prediction
Labels
The states α1, α2, α3, … share the same label (α); similarly, the states β1 and β2 share the label β.
Decoding the Viterbi path for emitting a sequence s defines a mapping between the sequence s and a sequence of labels y:
s (sequence):  S A L K M N Y T R E I M V A S N Q
π (path):      c α1 α2 α3 α4 α4 α4 c c c c β1 β2 β2 β2 c c
Y(π) (labels): c α  α  α  α  α  α  c c c c β  β  β  β  c c
Computing P(s, y | M)
P( s, y | M ) = Σ_{π : Y(π) = y} P( s, π | M )
Only the paths whose labelling is y have to be considered in the sum.
In the Forward and Backward algorithms this means setting
F_k(i) = 0 and B_k(i) = 0    whenever    Y(k) ≠ y_i
s (sequence): S A L K M N Y T R E I M V A S N Q
y (labels):   c α α α α α α c c c c β β β β c c
[States α1 … α4, β1, β2 and c, with their labels α, β and c]
Baum-Welch training algorithm for labelled HMMs
Given a set of known labelled sequences (e.g. amino acid sequences
and their native secondary structure) we want to find the parameters of
the model, without knowing the generating paths:
ML = argmax [P ( s, y | , M)]
The algorithm is the same as in the non-labelled case if we use the
Forward and Backward matrices defined in the last slide.
Supervised learning of the mapping
HMMs for Mapping problems
•Mapping problems in protein prediction
•Labelled HMMs
•Duration modelling
Self loops and geometric decay
[Model: Begin → a single state with a self-loop of probability p → End, reached with probability 1-p]
P(l) = p^(l-1) · ( 1 - p )
[Plot: P(l) versus segment length l (1-20) for p = 0.9, p = 0.5 and p = 0.1]
The length distribution of the generated segments is always exponential-like.
Limited case
p1
Begin
P(l)
0,350
0,300
0,250
0,200
0,150
0,100
0,050
0,000
p2
1
2
p3
3
p4
4
….
N
pN
End
P(1)
P(6)
P(2) P(4)
P(3)
P(5)
1
2
3
4
5
P(8)
This topology can model any length
distribution between 1 and N
P(7)
6

 P (1)  p
1

 P (2)  (1  p1 )  p2

 P (3)  (1  p1 )  (1  p2 )  p3
......................

N 1

 P ( N )   (1  pi )  p N
i 1

7
l
8
 p1  P (1)
 p  P ( 2) /(1  p )
1
 2
..............

 p  P(k )
 k k 1
(1  pi )


i 1

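A minimal Python sketch of the inversion above, converting a target length distribution into the exit probabilities p_i (the target distribution is illustrative):

```python
# p_k = P(k) / prod_{i<k} (1 - p_i)
def chain_parameters(P):
    p, survive = [], 1.0             # survive = prod_{i<k} (1 - p_i)
    for Pk in P:
        pk = Pk / survive if survive > 0 else 0.0
        p.append(pk)
        survive *= (1.0 - pk)
    return p

target = [0.1, 0.3, 0.3, 0.2, 0.1]   # arbitrary length distribution on 1..5 (illustrative)
print([round(x, 4) for x in chain_parameters(target)])   # p_N = 1 when the target sums to 1
```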
How can we model other length distributions?
Non-limited case
[Chain of N states as above, where the last state N also has a self-loop with probability p_N and exits towards End with probability p_N+1]
This topology can model any length distribution between 1 and N-1 and a geometric decay from N to ∞.
Secondary structure: length statistics
[Plot: frequency distribution of segment lengths (in residues, 0-34) for Helix, Strand and Coil]
Secondary structure: model
[Model combining the labels with duration modelling: separate chains of states for helix, strand and coil segments]
Do we use the same emission probabilities for states sharing the same label?
HMMs for Mapping problems
•Mapping problems in protein prediction
•Labelled HMMs
•Duration modelling
•Models for membrane proteins
Outer Membrane: β-barrel, e.g. porin (Rhodobacter capsulatus)
Inner Membrane: α-helices, e.g. bacteriorhodopsin (Halobacterium salinarum)
Topology of membrane proteins
Topography: position of the TransMembrane Segments along the sequence
ALALMLCMLTYRHKELKLKLKK …
A generic model for membrane proteins (TMHMM)
[Model with three groups of states between Begin and End: Inner Side, Transmembrane and Outer Side]
Model of β-barrel membrane proteins
[Cyclic model alternating Outer Side, Transmembrane and Inner Side states between Begin and End]
Labels: Transmembrane states, Loop states
Length of transmembrane β-strands: minimum 6 residues, maximum unbounded.
Six different sets of emission parameters: outer loop, TM strand edges, inner loop, long globular domains, TM strand core.
Model of α-helical membrane proteins (HMM1)
[Transmembrane helices modelled by chains of 10-13 states connecting the Inner Side and Outer Side states]
Model of α-helical membrane proteins (HMM2)
[As HMM1, with the transmembrane helices modelled by chains of 10 states]
Dynamic programming filtering procedure
[Plot: TMS probability along the sequence of 1A0S (positions 1-451)]
Dynamic programming filtering procedure
[Plot: TMS probability along the sequence of 1A0S, with the predicted TMS segments marked]
Maximum-scoring subsequences with constrained segment length and number
Dynamic programming filtering procedure
[Plot: TMS probability along the sequence of 1A0S, with the observed and the predicted TMS segments marked]
Maximum-scoring subsequences with constrained segment length and number
Predictors of alpha-transmembrane topology
www.cbs.dtu.dk/services/TMHMM
Hybrid systems: Basics
•Sequence profile based HMMs
MSA
Sequence profiles
[Multiple sequence alignment of 10 homologous sequences over 10 positions; each column gives the residue frequencies at that position]
Sequence profile
[Table: rows are the 20 residue types (A, C, D, E, F, G, H, K, I, L, M, N, P, Q, R, S, T, V, W, Y), columns are the sequence positions; each entry is the percentage of sequences in the alignment showing that residue at that position]
Sequence-profile-based HMM
Input: instead of a sequence of characters s_t, a sequence of M-dimensional vectors v_t (the columns of the sequence profile). For proteins M = 20.
[Example: a short sequence stretch with, below each position t, the corresponding 20-dimensional profile column v_t]
Constraints:
0 ≤ v_t(n) ≤ S    for every t, n
Σ_{k=1..M} v_t(k) = S
Sequence-profile-based HMM
Probability of emission from state k:
For a sequence of characters:  P( s_t | k ) = e_k( s_t ),  with the constraint  Σ_{n=1..M} e_k(n) = 1
For a sequence of M-dimensional vectors:  P( v_t | k ) = (1/Z) · Σ_{n=1..M} v_t(n) · e_k(n),  with the constraint  ∫ d^M v_t P( v_t | k ) = 1
If Σ_{n=1..M} e_k(n) = 1, the normalisation constant Z (= S^A / A!) is independent of the state.
Algorithms for training and probability computation can be derived.
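A minimal Python sketch of the vector emission described above (illustrative values; the state-independent normalisation Z is dropped, since it cancels whenever different states are compared, e.g. in Viterbi or posterior decoding):

```python
# Unnormalised emission of a profile column v_t from state k:
# P(v_t | k) proportional to sum_n v_t(n) * e_k(n)
def profile_emission(v, e_k):
    return sum(v[n] * e_k[n] for n in e_k)

e_k = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}        # emission parameters of one state (example)
v_t = {"A": 60.0, "C": 0.0, "G": 30.0, "T": 10.0}     # one profile column, S = 100 (example)
print(profile_emission(v_t, e_k))                     # unnormalised emission score
```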
Hybrid systems: Basics
•Sequence profile based HMMs
•Membrane protein topology
Scoring the prediction
1) Accuracy:
Q2 = P / N
where P is the total number of correctly predicted residues and N is the total number of residues.
2) Correlation coefficient C:
C(s) = [ p(s)·n(s) - u(s)·o(s) ] / [ (p(s)+u(s)) (p(s)+o(s)) (n(s)+u(s)) (n(s)+o(s)) ]^1/2
where, for each class s, p(s) and n(s) are respectively the total numbers of correct predictions and correctly rejected assignments, while u(s) and o(s) are the numbers of under- and over-predictions.
3) Accuracy for each discriminated structure s:
Q(s) = p(s) / [ p(s) + u(s) ]
where p(s) and u(s) are the same as in equation 2.
4) Probability of correct predictions P(s):
P(s) = p(s) / [ p(s) + o(s) ]
where p(s) and o(s) are the same as in equation 2.
5) Segment-based measure (Sov):
Sov = (1/N) · Σ_{S1} [ |S1 ∩ S2| / |S1 ∪ S2| ] · L1
where the sum runs over the pairs of overlapping observed (S1) and predicted (S2) segments and L1 is the length of the observed segment S1.
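A minimal Python sketch of the per-residue measures Q2, C(s), Q(s) and P(s) on toy two-class labels (the segment-based Sov is not included):

```python
def scores(observed, predicted, s):
    p  = sum(1 for o, q in zip(observed, predicted) if o == s and q == s)   # correct predictions of s
    n  = sum(1 for o, q in zip(observed, predicted) if o != s and q != s)   # correctly rejected
    o_ = sum(1 for o, q in zip(observed, predicted) if o != s and q == s)   # over-predictions
    u  = sum(1 for o, q in zip(observed, predicted) if o == s and q != s)   # under-predictions
    q2 = (p + n) / len(observed)
    denom = ((p + u) * (p + o_) * (n + u) * (n + o_)) ** 0.5
    c = (p * n - u * o_) / denom if denom > 0 else 0.0
    return {"Q2": q2, "C": c, "Q(s)": p / (p + u), "P(s)": p / (p + o_)}

obs  = "LLLTTTTTTLLLLTTTTTLL"    # observed labels (T = transmembrane, L = loop); toy data
pred = "LLTTTTTTTLLLLTTTTLLL"
print(scores(obs, pred, "T"))
```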
Topology of -barrel membrane proteins
Performance of sequence-profile-based HMM
Q2
83%
76%
78%
QTMS Qloop
83%
77%
74%
82%
76%
82%
PTMS
79%
72%
81%
Ploop
85%
80%
76%
Corr
Sov
HMM based on
Multiple Sequence
Alignment
0.65
0.83
0.53
Standard HMM based
0.64 on Single Sequence
0.56
0.79
NN based on
Multiple Sequence
Alignment
Martelli PL, Fariselli P, Krogh A, Casadio R -A sequence-profile-based HMM for
predicting and discriminating beta barrel membrane proteins- Bioinformatics 18:
S46-S53 (2002)
Topology of -barrel membrane proteins
Discriminative power of the profile-based HMM
100
90
80
Percentage
70
Beta barrel membrane proteins (145)
Globular proteins(1239)
All helical membrane proteins (188)
60
50
40
30
20
10
0
2.75
2.8
2.85
2.9
I(s | M) = -1/L log P(s | M)
2.95
The Bologna predictor for the topology of all-α membrane proteins
[Pipeline: the sequence and its sequence profile are fed to a NN, to HMM1 and to HMM2; the outputs are summed (Σ) and filtered with MaxSubSeq to give the topography; the Von Heijne rule then fixes the topology prediction]
Topology of all- membrane proteins
Performance
Q2 %
Corr
QTM %
NN°
85.8
0.714
84.1
QLoop
%
87.3
PTM
%
84.7
PLoop
%
86.8
Qtopography
Qtoplogy
49/59 (83%)
38/59 (64%)
0.908
HMM1°
84.4
0.692
88.6
80.9
79.5
89.4
48/59 (81%)
38/59 (64%)
0.896
HMM2°
82.4
0.658
88.0
78.1
77.1
88.1
48/59 (81%)
39/59 (66%)
0.872
Jury°
85.3
0.708
86.9
84.1
82.1
88.4
53/59 (90%)
42/59 (71%)
0.926
TMHMM 2.0 82.3
0.661
70.9
92.9
89.3
79.2
42/59 (71%)
32/59 (54%)
0.840
MEMSAT
83.6
0.672
70.6
93.9
90.5
75.7
35/49 (71%)
24/49 (49%)
0.823
PHD
80.1
0.614
63.6
94.0
89.9
75.5
43/59 (73%)
30/59 (51%)
0.847
HMMTOP
81.2
0.627
68.5
91.8
87.5
77.7
45/59 (76%)
35/59 (59%)
0.862
Nir Ben-Tal
46/59 (78%)
KD
43/59 (73%)
SOV
° ° Test
Martelli PL, Fariselli P, Casadio R -An ENSEMBLE machine learning approach for
the prediction of all-alpha membrane proteins- Bioinformatics (in press, 2003)
HMM: Application in gene finding
•Basics
Eukaryotic gene structure
A. Krogh
Simple model for coding regions
A. Krogh
Simple model for unspliced gene
A. Krogh
Simple model for spliced gene
A. Krogh