PROLOGUE:
Pitfalls of standard alignments
Scoring a pairwise alignment
A: ALAEVLIRLITKLYP
B: ASAKHLNRLITELYP

Score(A, B) = Σi s(Ai, Bi)      with the substitution scores s(·,·) taken from the Blosum62 matrix
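A minimal sketch of this position-independent scoring, assuming Biopython is available (the two sequences are the ones above):

from Bio.Align import substitution_matrices

blosum62 = substitution_matrices.load("BLOSUM62")

def ungapped_score(a: str, b: str) -> float:
    # Sum position-wise Blosum62 scores; every column is treated the same way.
    assert len(a) == len(b)
    return sum(blosum62[x, y] for x, y in zip(a, b))

print(ungapped_score("ALAEVLIRLITKLYP", "ASAKHLNRLITELYP"))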
Alignment of a family (globins)
Different positions are not equivalent
Sequence logos
http://weblogo.berkeley.edu/cache/file5h2DWc.png
The substitution score IN A FAMILY should depend on the position (and the same holds for gaps).
For modelling families we need more flexible tools
Probabilistic Models for Biological Sequences
•What are they?
Probabilistic models for sequences
Generative definition:
•Objects producing different outcomes (sequences) with
different probabilities
•The probability distribution over the sequences space
determines the model specificity
[Figure: a model M over the sequence space]
M generates si with probability P(si | M)
e.g.: M is a representation of the family of globins
Probabilistic models for sequences
We don't need a generator of new biological sequences: the generative definition is useful as an operative definition.

Associative definition:
•Objects that, given an outcome (sequence), compute a probability value

[Figure: a model M over the sequence space]
M associates the probability P(si | M) to si
e.g.: M is a representation of the family of globins
Probabilistic models for sequences
Most useful probabilistic models are trainable systems: the probability density function over the sequence space is estimated from known examples by means of a learning algorithm.

[Figure: from a set of known examples in the sequence space, training produces an estimate of the pdf over the whole sequence space (generalization)]

e.g.: writing a generic representation of the sequences of globins starting from a set of known globins
Probabilistic Models for Biological Sequences
•What are they?
•Why to use them?
Modelling a protein family
Probabilistic model
Given a protein class (e.g. globins), a probabilistic model trained on this family can compute a probability value for each new sequence, e.g.:

Seq1 → 0.98,  Seq2 → 0.21,  Seq3 → 0.12,  Seq4 → 0.89,  Seq5 → 0.47,  Seq6 → 0.78

This value measures the similarity between the new sequence and the family described by the model.
Probabilistic Models for Biological Sequences
•What are they?
•Why to use them?
•Which probabilities do they compute?
P( s | M ) or P( M | s ) ?
A model M associates to a sequence s the probability
P( s | M )
This probability answers the question:
what is the probability that a model describing the globins generates the sequence s?
The question we want to answer is:
Given a sequence s, is it a Globin?
We need to compute P( M | s ) !!
Bayes Theorem
Joint probability:   P(X, Y) = P(X | Y) · P(Y) = P(Y | X) · P(X)

so that

P(Y | X) = P(X | Y) · P(Y) / P(X)

and therefore

P(M | s) = P(s | M) · P(M) / P(s)

where P(M) and P(s) are the a priori probabilities.

P(s | M): evidence M, conclusion s
P(M | s): evidence s, conclusion M
The A priori probabilities
P(M | s) = P(s | M) · P(M) / P(s)      (P(M), P(s): a priori probabilities)
P(M) is the probability of the model (i.e. of the class
described by the model) BEFORE we know the sequence:
can be estimated as the abundance of the class
P(s) is the probability of the sequence in the sequence
space.
Cannot be reliably estimated!!
Comparison between models
We can overcome the problem by comparing the probabilities of generating s from different models:

P(M1 | s) / P(M2 | s) = [ P(s | M1) · P(M1) / P(s) ] / [ P(s | M2) · P(M2) / P(s) ] = [ P(s | M1) · P(M1) ] / [ P(s | M2) · P(M2) ]

where P(M1) / P(M2) is the ratio between the abundances of the two classes.
Null model
Otherwise we can score a sequence for a model M by comparing it to a Null Model N: a model that generates ALL the possible sequences with probabilities depending ONLY on the statistical amino acid abundance.

S(M, s) = log [ P(s | M) / P(s | N) ]

[Figure: the distributions of S(M, s) for sequences belonging and not belonging to model M]
In this case we need a threshold and a statistic for
evaluating the significance (E-value, P-value)
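A minimal sketch of null-model scoring; the per-residue frequencies below are illustrative placeholders, not a trained family model:

import math

model_freq = {"A": 0.10, "L": 0.12, "K": 0.08, "I": 0.07}      # P(residue | M), toy values
null_freq  = {"A": 0.074, "L": 0.099, "K": 0.058, "I": 0.068}  # background abundances, toy values

def log_odds_score(seq: str) -> float:
    # S(M, s) = log P(s|M)/P(s|N), assuming independent positions in both models.
    return sum(math.log(model_freq[c] / null_freq[c]) for c in seq)

print(log_odds_score("ALKALI"))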
The simplest probabilistic models:
Markov Models
•Definition
Markov Models Example: Weather
Register the weather conditions day by day: as a first hypothesis, the weather condition in a day depends ONLY on the weather conditions in the day before.

States: C (Clouds), R (Rain), F (Fog), S (Sun)

Define the conditional probabilities P(C|C), P(C|R), …, P(R|C), …

The probability of the 5-day registration CRRCS is

P(CRRCS) = P(C) · P(R|C) · P(R|R) · P(C|R) · P(S|C)
Markov Model
Stochastic generator of sequences in which the probability
of state in position i depends ONLY on the state in position
i-1
Given an alphabet C = {c1, c2, c3, …, cN}, a Markov model is described by N×(N+2) parameters {a_rt, a_BEGIN,t, a_r,END ; r, t ∈ C}:

a_rq      = P(si = q | si-1 = r)
a_BEGIN,q = P(s1 = q)
a_r,END   = P(sT = END | sT-1 = r)

Σt a_rt + a_r,END = 1   for every r
Σt a_BEGIN,t = 1

[Figure: fully connected state diagram over c1 … cN plus the BEGIN and END states]
Markov Models
Given the sequence:
s = s1 s2 s3 s4 s5 … sT      with si ∈ C = {c1, c2, c3, …, cN}

P(s | M) = P(s1) · Π(i=2..T) P(si | si-1) = a_BEGIN,s1 · Π(i=2..T) a_si-1,si · a_sT,END

P("ALKALI") = a_BEGIN,A · aAL · aLK · aKA · aAL · aLI · aI,END
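A minimal sketch of this computation, with a made-up transition table (the values are placeholders, not trained parameters):

# Toy transition table for a Markov chain over amino-acid letters.
# "^" marks BEGIN and "$" marks END; all numbers are illustrative placeholders.
a = {
    ("^", "A"): 0.05, ("A", "L"): 0.10, ("L", "K"): 0.06,
    ("K", "A"): 0.08, ("L", "I"): 0.07, ("I", "$"): 0.04,
}

def markov_prob(seq: str) -> float:
    # P(s | M) = a(BEGIN,s1) * prod_i a(s_{i-1},s_i) * a(sT,END)
    p = a[("^", seq[0])]
    for prev, cur in zip(seq, seq[1:]):
        p *= a[(prev, cur)]
    return p * a[(seq[-1], "$")]

print(markov_prob("ALKALI"))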
Markov Models: Exercise
1) Fill in the undefined values of the transition probabilities.

[Figure: two weather Markov models (Winter and Summer) over the states C, R, F, S, with partially specified transition probabilities; the missing values are marked ?? and can be obtained from the normalisation constraint on the outgoing transitions of each state]

Markov Models: Exercise

2) Which model better describes the weather in summer? Which one describes the weather in winter?

[Figure: the same two transition diagrams, with all values filled in]

Markov Models: Exercise

3) Given the sequence CSSSCFS, which model gives the higher probability?
[Consider the starting probabilities: P(X|BEGIN) = 0.25]

Markov Models: Exercise

P(CSSSCFS | Winter) = 0.25 × 0.1 × 0.2 × 0.2 × 0.3 × 0.2 × 0.2 = 1.2 × 10^-5
P(CSSSCFS | Summer) = 0.25 × 0.4 × 0.8 × 0.8 × 0.1 × 0.1 × 1.0 = 6.4 × 10^-4

4) Can we conclude that the observation sequence refers to a summer week?

Markov Models: Exercise

P(Seq | Winter) = 1.2 × 10^-5
P(Seq | Summer) = 6.4 × 10^-4

P(Winter | Seq) / P(Summer | Seq) = [ P(Seq | Winter) · P(Winter) ] / [ P(Seq | Summer) · P(Summer) ]

so the answer also depends on the a priori probabilities of Winter and Summer.
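A quick numerical check of the two likelihoods and of the posterior ratio, assuming (for illustration only) equal priors P(Winter) = P(Summer) = 0.5:

from functools import reduce

winter = [0.25, 0.1, 0.2, 0.2, 0.3, 0.2, 0.2]   # per-step probabilities for CSSSCFS
summer = [0.25, 0.4, 0.8, 0.8, 0.1, 0.1, 1.0]

p_w = reduce(lambda x, y: x * y, winter)        # 1.2e-5
p_s = reduce(lambda x, y: x * y, summer)        # 6.4e-4

prior_w = prior_s = 0.5                         # assumed equal priors
posterior_w = p_w * prior_w / (p_w * prior_w + p_s * prior_s)
print(p_w, p_s, posterior_w)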
Simple Markov Model for DNA sequences
[Figure: fully connected Markov model over the four states A, C, G, T]

DNA: C = {Adenine, Cytosine, Guanine, Thymine}

16 transition probabilities (12 of which independent) +
4 Begin probabilities +
4 End probabilities.

The parameters of the model are different in different zones of DNA.
They describe the overall composition and the dinucleotide (couple) recurrences.
Example of Markov Models: GpC Island
[Figure: two Markov models over the states A, C, G, T, one trained on GpC islands (e.g. GATGCGTCGC) and one on non-GpC islands (e.g. CTACGCAGCG)]

In the Markov model of GpC islands, aGC is higher than in the Markov model of non-GpC islands.

Given a sequence s we can evaluate

P(GpC | s) = P(s | GpC) · P(GpC) / [ P(s | GpC) · P(GpC) + P(s | nonGpC) · P(nonGpC) ]
The simplest probabilistic models:
Markov Models
•Definition
•Training
Probabilistic training of a parametric method

Generally speaking, a parametric model M (with parameters θ) aims to reproduce a set of known data.

[Figure: the real data D are compared with the data modelled by M; how to compare them?]
Training of Markov Models
Let M be the set of parameters of model M.
During the training phase, M parameters are estimated from the set of
known data D
Maximum Likelihood Extimation (ML)
ML = argmax P( D | M,  )
It can be proved that:
nik
Frequency of occurrence as counted in the
aik =
Sj nij data set D
Maximum A Posteriori Extimation (MAP)
MAP = argmax P(  | M, D ) = argmax [P( D | M,  )  P( ) ]
Example (coin-tossing)
Given N tosses of a coin (our data D), the outcomes are h heads and t tails (N = t + h).

ASSUME the model:
P(D | M) = p^h · (1 - p)^t

Computing the maximum of the likelihood P(D | M):

dP(D | M)/dp = p^(h-1) · (1 - p)^(t-1) · [ h·(1 - p) - t·p ] = 0

We obtain that our estimate of p is

p = h / (h + t) = h / N
Example (Error measure)
Suppose you think that your data are affected by a Gaussian error, so that they are distributed according to

F(xi) = A · exp[ -(xi - m)² / (2s²) ]      with A = 1 / (s·√(2π))

If your measures are independent, the data likelihood is

P(Data | model) = Πi F(xi)

Find the m and s that maximize P(Data | model).
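A minimal sketch of this exercise: for independent Gaussian errors, the log-likelihood is maximised by the sample mean and by the square root of the mean squared deviation (the data below are made-up numbers):

import math

data = [4.1, 3.8, 4.4, 4.0, 3.9]            # made-up measurements

# Closed-form ML estimates for a Gaussian: sample mean and sqrt of mean squared deviation
m = sum(data) / len(data)
s = math.sqrt(sum((x - m) ** 2 for x in data) / len(data))

def log_likelihood(m, s):
    return sum(-0.5 * math.log(2 * math.pi * s * s) - (x - m) ** 2 / (2 * s * s) for x in data)

print(m, s, log_likelihood(m, s))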
Maximum Likelihood training: Proof
Given a sequence s contained in D:

s = s1 s2 s3 s4 s5 … sT

P(s | M) = a_BEGIN,s1 · Π(i=2..T) a_si-1,si · a_sT,END

We can count the number of transitions between any two states j and k: njk

P(s | M) = Πj,k ajk^njk

where states 0 and N+1 are BEGIN and END.

The normalisation constraints are taken into account using the Lagrange multipliers λj:

P~(s | M) = P(s | M) + Σj λj ( Σk ajk - 1 )

Setting the derivatives to zero,

∂P~(s | M)/∂ajk = njk · P(s | M) / ajk + λj = 0

and imposing Σk' ajk' = 1, we obtain

ajk = njk / Σk' njk'
Hidden Markov Models
•Preliminary examples
Loaded dice
We have 99 regular dice (R) and 1 loaded die (L).

        R      L
P(1)   1/6    1/10
P(2)   1/6    1/10
P(3)   1/6    1/10
P(4)   1/6    1/10
P(5)   1/6    1/10
P(6)   1/6    1/2

Given a sequence:
4156266656321636543662152611536264162364261664616263
we don't know the sequence of dice that generated it:
RRRRRLRLRRRRRRRLRRRRRRRRRRRRLRLRRRRRRRRLRRRRLRRRRLRR
Loaded dice
Hypothesis:
We choose a different die for each roll.
Two stochastic processes give rise to the sequence of observations:
1) choosing the die (R or L);
2) rolling the die.
The sequence of dice is hidden.
The first process is assumed to be Markovian (in this case a 0-order MM).
The outcome of the second process depends only on the state reached in the first process (that is, the chosen die).
Casino Model

[Figure: two-state model R ⇄ L, with self-transition probability 0.99 and switching probability 0.01]

Each state (R and L) generates a character of the alphabet C = {1, 2, 3, 4, 5, 6}.
The emission probabilities depend only on the state.
The transition probabilities describe a Markov model that generates a state path: the hidden sequence (π).
The observation sequence (s) is generated by two concomitant stochastic processes.

For example, for the observation sequence 4156266656321636543662152611 and the path RRRRRLRLRRRRRRRLRRRRRRRRRRRR, each position is produced by first choosing the state (e.g. staying in R, transition probability 0.99) and then choosing the symbol (e.g. 1, emission probability 1/6 given R); switching to L instead has probability 0.01, and a symbol such as 5 is then emitted with probability 1/10 (given L).
Some not-so-serious examples

1) DEMOGRAPHY
Observable: number of births and deaths in a year in a village.
Hidden variable: economic conditions (as a first approximation we can consider the success in business as a random variable and, by consequence, the wealth as a Markov variable).
---> can we deduce the economic conditions of a village during a century by means of the register of births and deaths?

2) THE METEOROPATHIC TEACHER
Observable: average of the marks that a meteoropathic teacher gives to their students during a day.
Hidden variable: weather conditions.
---> can we deduce the weather conditions during a year by means of the class register?
To be more serious
1) SECONDARY STRUCTURE
Observable: protein sequence
Hidden variable: secondary structure
---> can we deduce (predict) the secondary structure of
a protein given its amino acid sequence?
2) ALIGNMENT
Observable: protein sequence
Hidden variable: position of each residue along the
alignment of a protein family
---> can we align a protein to a family, starting from its
amino acid sequence?
Hidden Markov Models
•Preliminary examples
•Formal definition
Formal definition of Hidden Markov Models

An HMM is a stochastic generator of sequences characterised by:
• N states
• a set of transition probabilities between two states {akj}:   akj = P(π(i) = j | π(i-1) = k)
• a set of starting probabilities {a0k}:   a0k = P(π(1) = k)
• a set of ending probabilities {ak0}:   ak0 = P(π(i) = END | π(i-1) = k)
• an alphabet C with M characters
• a set of emission probabilities for each state {ek(c)}:   ek(c) = P(si = c | π(i) = k)

s: sequence      π: path through the states

Constraints:
Σk a0k = 1
ak0 + Σj akj = 1      for every k
Σc∈C ek(c) = 1      for every k
Generating a sequence with an HMM

Choose the initial state π(1) following the probabilities a0k; set i = 1.
Choose the character si from the alphabet C following the probabilities ek(c).
Choose the next state following the probabilities akj and ak0; set i ← i + 1.
If the END state has been chosen, stop; otherwise repeat from the character-emission step.
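A minimal sketch of this generation procedure for the casino model of the previous slides; the BEGIN probabilities and the small END probability are illustrative assumptions, not values given in the slides:

import random

start  = {"R": 0.5, "L": 0.5}                              # assumed BEGIN probabilities
trans  = {"R": {"R": 0.989, "L": 0.01, "END": 0.001},      # assumed small END probability
          "L": {"R": 0.01, "L": 0.989, "END": 0.001}}
emit   = {"R": {str(f): 1/6 for f in range(1, 7)},
          "L": {**{str(f): 1/10 for f in range(1, 6)}, "6": 1/2}}

def pick(dist):
    return random.choices(list(dist), weights=list(dist.values()))[0]

def generate():
    state, seq, path = pick(start), [], []
    while state != "END":
        seq.append(pick(emit[state]))    # emission depends only on the current state
        path.append(state)
        state = pick(trans[state])       # next state depends only on the current state
    return "".join(seq), "".join(path)

print(*generate(), sep="\n")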
GpC Island, simple model

Model parameters:
a0Y = 0.2   a0N = 0.8
aYY = 0.7   aYN = 0.2   aY0 = 0.1
aNN = 0.8   aNY = 0.1   aN0 = 0.1
eY(A) = 0.1   eY(C) = 0.4   eY(G) = 0.4   eY(T) = 0.1      (Y: GpC island)
eN(A) = 0.25  eN(C) = 0.25  eN(G) = 0.25  eN(T) = 0.25     (N: non-GpC island)

s: AGCGCGTAATCTG
π: YYYYYYYNNNNNN

P(s, π | M) can be easily computed:

s: A   G   C   G   C   G   T   A    A    T    C    T    G
π: Y   Y   Y   Y   Y   Y   Y   N    N    N    N    N    N
Emission:   0.1 · 0.4 · 0.4 · 0.4 · 0.4 · 0.4 · 0.1 · 0.25 · 0.25 · 0.25 · 0.25 · 0.25 · 0.25
Transition: 0.2 · 0.7 · 0.7 · 0.7 · 0.7 · 0.7 · 0.7 · 0.2 · 0.8 · 0.8 · 0.8 · 0.8 · 0.8 · 0.1

Multiplying all the probabilities gives the probability of having the sequence AND the path through the states.

Evaluation of the joint probability of the sequence and the path:

P(s, π | M) = P(s | π, M) · P(π | M)

P(π | M) = a0,π(1) · Π(i=2..T) aπ(i-1),π(i) · aπ(T),0

P(s | π, M) = Π(i=1..T) eπ(i)(si)

P(s, π | M) = aπ(T),0 · Π(i=1..T) aπ(i-1),π(i) · eπ(i)(si)
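A minimal sketch of this joint-probability computation with the GpC-island parameters of the slide:

# GpC-island HMM parameters (Y = island, N = non-island); "0" marks BEGIN/END.
a = {("0", "Y"): 0.2, ("0", "N"): 0.8,
     ("Y", "Y"): 0.7, ("Y", "N"): 0.2, ("Y", "0"): 0.1,
     ("N", "N"): 0.8, ("N", "Y"): 0.1, ("N", "0"): 0.1}
e = {"Y": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
     "N": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}}

def joint_prob(s, path):
    # P(s, pi | M) = a_{0,pi(1)} * prod_i a_{pi(i-1),pi(i)} e_{pi(i)}(s_i) * a_{pi(T),0}
    p, prev = 1.0, "0"
    for c, k in zip(s, path):
        p *= a[(prev, k)] * e[k][c]
        prev = k
    return p * a[(prev, "0")]

print(joint_prob("AGCGCGTAATCTG", "YYYYYYYNNNNNN"))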
Hidden Markov Models
•Preliminary examples
•Formal definition
•Three questions
GpC Island, simple model

(model parameters as above)

s: AGCGCGTAATCTG
π: ?????????????

P(s, π | M) can be easily computed.
How to evaluate P(s | M)?

How to evaluate P(s | M)?

P(s | M) = Σπ P(s, π | M)

s:  A G C G C G T A A T C T G
π1: Y Y Y Y Y Y Y Y Y Y Y Y Y
π2: Y Y Y Y Y Y Y Y Y Y Y Y N
π3: Y Y Y Y Y Y Y Y Y Y Y N Y
π4: Y Y Y Y Y Y Y Y Y Y Y N N
π5: Y Y Y Y Y Y Y Y Y Y N Y Y
…

There are 2^13 different paths. Summing over all the paths gives the probability of having the sequence.

Resumé: GpC Island, simple model

s: AGCGCGTAATCTG
π: ?????????????

P(s, π | M) can be easily computed.
How to evaluate P(s | M)?
Can we show the hidden path?

Can we show the hidden path?

π* = argmaxπ [ P(π | s, M) ] = argmaxπ [ P(π, s | M) ]

(again, 2^13 different paths)

Viterbi path: the path that gives the best joint probability.
Can we show the hidden path?
A Posteriori decoding
For each position i choose the state π(i):

π(i) = argmaxk [ P(π(i) = k | s, M) ]

The contribution to this probability derives from all the paths that go through state k at position i.
The a posteriori path can be a nonsense path (it may not be a legitimate path if some transitions are not permitted in the model).
GpC Island, simple model

(same architecture as above, but all the transition and emission parameters a0Y, a0N, aYY, aYN, aY0, aNN, aNY, aN0, eY(·), eN(·) are unknown)

s: AGCGCGTAATCTG
π: YYYYYYYNNNNNN

P(s, π | M) can be easily computed.
How to evaluate P(s | M)?
Can we show the hidden path?
Can we evaluate the parameters starting from known examples?

Can we evaluate the parameters starting from known examples?

s: A G C G C G T A A T C T G
π: Y Y Y Y Y Y Y N N N N N N

P(s, π | M) = a0Y · eY(A) · aYY · eY(G) · aYY · eY(C) · aYY · eY(G) · aYY · eY(C) · aYY · eY(G) · aYY · eY(T) · aYN · eN(A) · aNN · eN(A) · aNN · eN(T) · aNN · eN(C) · aNN · eN(T) · aNN · eN(G) · aN0

How to find the parameters e and a that maximise this probability?
What if we don't know the path?
Hidden Markov Models:Algorithms
•Resumé
•Evaluating P(s | M): Forward Algorithm
Computing P(s, π | M) for each path is a redundant operation.

Consider, for the GpC-island model above, two paths that differ only at the last position:

s:  A   G   C   G   C   G   T   A   A   T   C   T   G
π1: Y   Y   Y   Y   Y   Y   Y   Y   Y   Y   Y   Y   Y
Emission:   0.1 · 0.4 · 0.4 · 0.4 · 0.4 · 0.4 · 0.1 · 0.1 · 0.1 · 0.1 · 0.4 · 0.1 · 0.4
Transition: 0.2 · 0.7 · 0.7 · 0.7 · 0.7 · 0.7 · 0.7 · 0.7 · 0.7 · 0.7 · 0.7 · 0.7 · 0.7 · 0.1

π2: Y   Y   Y   Y   Y   Y   Y   Y   Y   Y   Y   Y   N
Emission:   0.1 · 0.4 · 0.4 · 0.4 · 0.4 · 0.4 · 0.1 · 0.1 · 0.1 · 0.1 · 0.4 · 0.1 · 0.25
Transition: 0.2 · 0.7 · 0.7 · 0.7 · 0.7 · 0.7 · 0.7 · 0.7 · 0.7 · 0.7 · 0.7 · 0.7 · 0.2 · 0.1

[Figure: trellis over the states Begin, R, L, End and the iterations 0, 1, 2, 3, …, T-1, T]

P(s, π1 | M) + P(s, π2 | M) =
= [ Π(t=1..T-1) aπ(t-1),π(t) · eπ(t)(st) ] · [ aπ1(T-1),π1(T) · eπ1(T)(sT) · aπ1(T),0  +  aπ2(T-1),π2(T) · eπ2(T)(sT) · aπ2(T),0 ]

If we compute the common part only once we gain 2·(T-1) operations.
Summing over all the possible paths

(GpC-island model as above)

Position 2 (s = AG…): sum the joint probabilities of all length-2 path prefixes, grouped by the state at position 2.

Prefixes ending in Y:   (0.2 · 0.1) · (0.7 · 0.4) = 0.0056   and   (0.8 · 0.25) · (0.1 · 0.4) = 0.008;   sum = 0.0136
Prefixes ending in N:   (0.2 · 0.1) · (0.2 · 0.25) = 0.001   and   (0.8 · 0.25) · (0.8 · 0.25) = 0.04;   sum = 0.041

Position 3 (s = AGC…): extend the sums computed at position 2.

Prefixes ending in Y:   0.0136 · 0.7 · 0.4  +  0.041 · 0.1 · 0.4  = 0.005448
Prefixes ending in N:   0.0136 · 0.2 · 0.25  +  0.041 · 0.8 · 0.25  = 0.00888

Iterating until the last position of the sequence AGCGCGTAATCTG:

P(s | M) = [sum over prefixes ending in Y at position T] · 0.1 (aY0)  +  [sum over prefixes ending in N at position T] · 0.1 (aN0)

[Figure: trellis over the states Begin, R, L, End and the iterations 0, 1, 2, 3, …, T-1, T]

If we know the probabilities of emitting the first two characters of the sequence and ending the path in states R and L respectively:

FR(2) ≡ P(s1, s2, π(2) = R | M)   and   FL(2) ≡ P(s1, s2, π(2) = L | M)

then we can compute:

P(s1, s2, s3, π(3) = R | M) = FR(2) · aRR · eR(s3) + FL(2) · aLR · eR(s3)
Forward Algorithm

On the basis of the preceding observations, the computation of P(s | M) can be decomposed into simpler problems.

For each state k and each position i in the sequence we compute:

Fk(i) = P(s1 s2 s3 … si, π(i) = k | M)      (from now on, the conditioning on M is understood)

Initialisation:   FBEGIN(0) = 1;   Fk(0) = 0 for every k ≠ BEGIN

Recurrence:   Fl(i+1) = P(s1 s2 … si si+1, π(i+1) = l) = Σk P(s1 s2 … si, π(i) = k) · akl · el(si+1) = el(si+1) · Σk Fk(i) · akl

Termination:   P(s) = P(s1 s2 s3 … sT, π(T+1) = END) = Σk P(s1 s2 … sT, π(T) = k) · ak0 = Σk Fk(T) · ak0
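A minimal sketch of the forward recursion for the GpC-island model (the parameters are restated so the snippet is self-contained):

# GpC-island HMM (Y = island, N = non-island), parameters from the earlier slides.
states = ["Y", "N"]
a0  = {"Y": 0.2, "N": 0.8}                               # BEGIN -> state
a   = {"Y": {"Y": 0.7, "N": 0.2}, "N": {"Y": 0.1, "N": 0.8}}
aE  = {"Y": 0.1, "N": 0.1}                               # state -> END
e   = {"Y": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
       "N": {c: 0.25 for c in "ACGT"}}

def forward(s):
    # Returns P(s | M) using F_l(i+1) = e_l(s_{i+1}) * sum_k F_k(i) * a_kl.
    F = {k: a0[k] * e[k][s[0]] for k in states}          # initialisation at position 1
    for c in s[1:]:
        F = {l: e[l][c] * sum(F[k] * a[k][l] for k in states) for l in states}
    return sum(F[k] * aE[k] for k in states)             # termination

print(forward("AGCGCGTAATCTG"))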
Forward Algorithm

[Figure: trellis over the states (BEGIN, A, S, B, …, END) and the iterations 0, 1, 2, …, T, T+1; FB(2) is computed from the values Fi(1) through the transitions aiB and the emission eB(s2), and P(s | M) is collected in the END state]
Forward algorithm: computational complexity

Naïve method:   P(s | M) = Σπ P(s, π | M)

There are N^T possible paths, and each path requires about 2T operations, so the time for the computation is O(T · N^T).
Forward algorithm: computational complexity

Forward algorithm: T positions, with N values to compute at each position; each element requires about 2N products and 1 sum, so the time for the computation is O(T · N²).
Forward algorithm: computational complexity

[Plot: number of operations versus sequence length T, comparing the naïve method (exponential growth) with the forward algorithm (linear in T)]
Hidden Markov Models:Algorithms
•Resumé
•Evaluating P(s | M): Forward Algorithm
•Evaluating P(s | M): Backward Algorithm
Backward Algorithm
Similar to the Forward algorithm: it computes P(s | M) reconstructing the sequence from the end.

For each state k and each position i in the sequence we compute:

Bk(i) = P(si+1 si+2 si+3 … sT | π(i) = k)

Initialisation:   Bk(T) = P(π(T+1) = END | π(T) = k) = ak0

Recurrence:   Bl(i-1) = P(si si+1 … sT | π(i-1) = l) = Σk P(si+1 si+2 … sT | π(i) = k) · alk · ek(si) = Σk Bk(i) · ek(si) · alk

Termination:   P(s) = P(s1 s2 s3 … sT | π(0) = BEGIN) = Σk P(s2 … sT | π(1) = k) · a0k · ek(s1) = Σk Bk(1) · a0k · ek(s1)
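A companion sketch of the backward recursion for the same model; it assumes the parameter dictionaries (states, a0, a, aE, e) defined in the forward sketch above and should return the same P(s | M):

def backward(s):
    # Returns P(s | M) using B_l(i-1) = sum_k B_k(i) * e_k(s_i) * a_lk.
    B = {k: aE[k] for k in states}                         # initialisation: B_k(T) = a_k0
    for c in reversed(s[1:]):
        B = {l: sum(B[k] * e[k][c] * a[l][k] for k in states) for l in states}
    return sum(B[k] * a0[k] * e[k][s[0]] for k in states)  # termination

print(backward("AGCGCGTAATCTG"))   # same value as forward("AGCGCGTAATCTG")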
Backward Algorithm

[Figure: trellis over the states (BEGIN, A, S, B, …, END) and the iterations 0, 1, 2, …, T-1, T, T+1; BB(T-1) is computed from the B values at iteration T, and P(s | M) is collected at the BEGIN state]
Hidden Markov Models:Algorithms
•Resumé
•Evaluating P(s | M): Forward Algorithm
•Evaluating P(s | M): Backward Algorithm
•Showing the path: Viterbi decoding
Finding the best path

(GpC-island model as above)

Position 2 (s = AG…): for each state at position 2, keep only the most probable length-2 path prefix.

Prefixes ending in Y:   Y Y: (0.2 · 0.1) · (0.7 · 0.4) = 0.0056;   N Y: (0.8 · 0.25) · (0.1 · 0.4) = 0.008   → max: 0.008 (N Y)
Prefixes ending in N:   Y N: (0.2 · 0.1) · (0.2 · 0.25) = 0.001;   N N: (0.8 · 0.25) · (0.8 · 0.25) = 0.04   → max: 0.04 (N N)

Position 3 (s = AGC…): extend the best values computed at position 2.

Ending in Y:   from Y: 0.008 · 0.7 · 0.4 = 0.00224 (max);   from N: 0.04 · 0.1 · 0.4 = 0.0016
Ending in N:   from Y: 0.008 · 0.2 · 0.25 = 0.0004;   from N: 0.04 · 0.8 · 0.25 = 0.008 (max)

Best partial paths so far: N Y Y with 0.00224 and N N N with 0.008.

Iterating until the last position of the sequence, we obtain the best path ending in Y and the best path ending in N:

s:  A G C G C G T A A T C T G
πY: N Y Y Y Y Y Y N N N Y Y Y      · 0.1 (aY0)
πN: N N N N N N N N N N N N N      · 0.1 (aN0)

Choose the maximum.
Viterbi Algorithm

π* = argmaxπ [ P(π, s | M) ]

The computation of P(s, π* | M) can be decomposed into simpler problems.

Let Vk(i) be the probability of the most probable path generating the subsequence s1 s2 s3 … si and ending in state k at iteration i.

Initialisation:   VBEGIN(0) = 1;   Vk(0) = 0 for every k ≠ BEGIN

Recurrence:   Vl(i+1) = el(si+1) · maxk ( Vk(i) · akl )
              ptri(l) = argmaxk ( Vk(i) · akl )

Termination:   P(s, π*) = maxk ( Vk(T) · ak0 )

Traceback:   π*(T) = argmaxk ( Vk(T) · ak0 )
             π*(i-1) = ptri( π*(i) )
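A sketch of the Viterbi recursion and traceback for the same GpC-island model, again assuming the parameter dictionaries defined in the forward sketch:

def viterbi(s):
    # Returns (P(s, pi*), pi*) for the two-state model defined above.
    V = {k: a0[k] * e[k][s[0]] for k in states}
    ptr = []                                               # back-pointers, one dict per position
    for c in s[1:]:
        prev = {l: max(states, key=lambda k: V[k] * a[k][l]) for l in states}
        V = {l: e[l][c] * V[prev[l]] * a[prev[l]][l] for l in states}
        ptr.append(prev)
    last = max(states, key=lambda k: V[k] * aE[k])
    path = [last]
    for prev in reversed(ptr):                             # traceback
        path.append(prev[path[-1]])
    return V[last] * aE[last], "".join(reversed(path))

print(viterbi("AGCGCGTAATCTG"))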
Viterbi Algorithm

[Figure: trellis over the states and the iterations 0, 1, 2, …, T, T+1; VB(2) is computed from the values Vi(1) through Vi(1)·aiB and the emission eB(s2), the back-pointer ptr2(B) records the maximising predecessor, and P(s, π* | M) is collected in the END state]

[Figure: the Viterbi path traced back through the trellis from iteration T+1 to 0; different paths can have the same probability]
Hidden Markov Models:Algorithms
•Resumé
•Evaluating P(s | M): Forward Algorithm
•Evaluating P(s | M): Backward Algorithm
•Showing the path: Viterbi decoding
•Showing the path: A posteriori decoding
•Training a model: EM algorithm
If we know the path generating the training sequence

(GpC-island architecture, parameters unknown)

s: A G C G C G T A A T C T G
π: Y Y Y Y Y Y Y N N N N N N

P(s, π | M) = a0Y · eY(A) · aYY · eY(G) · aYY · eY(C) · aYY · eY(G) · aYY · eY(C) · aYY · eY(G) · aYY · eY(T) · aYN · eN(A) · aNN · eN(A) · aNN · eN(T) · aNN · eN(C) · aNN · eN(T) · aNN · eN(G) · aN0

Just count! Example:

aYY = nYY / (nYY + nYN) = 6/7
eY(A) = nY(A) / [ nY(A) + nY(C) + nY(G) + nY(T) ] = 1/7
Expectation-Maximisation algorithm
We need to estimate the Maximum Likelihood parameters when the paths generating the training sequences are unknown:

θML = argmaxθ [ P(s | θ, M) ]

Given a model with parameters θ⁰, the EM algorithm finds new parameters θ that increase the likelihood of the model:

P(s | θ) > P(s | θ⁰)

Expectation-Maximisation algorithm

s:  A G C G C G T A A T C T G
π1: Y Y Y Y Y Y Y Y Y Y Y Y Y
π2: Y Y Y Y Y Y Y Y Y Y Y Y N
π3: Y Y Y Y Y Y Y Y Y Y Y N Y
π4: Y Y Y Y Y Y Y Y Y Y Y N N
…

Given a path π we can count:
the number of transitions between states k and l: Akl(π)
the number of emissions of character c from state k: Ek(c, π)

We can compute the expected values over all the paths:
Akl = Σπ P(π | s, θ⁰) · Akl(π)
Ek(c) = Σπ P(π | s, θ⁰) · Ek(c, π)

The updated parameters are:
akl = Akl / Σ(m=1..N) Akm
ek(c) = Ek(c) / Σc' Ek(c')

Expectation-Maximisation algorithm

The new parameters increase the likelihood:

P(s | θ) > P(s | θ⁰),   or equivalently   log P(s | θ) > log P(s | θ⁰)
Expectation-Maximisation algorithm
log P(s | θ) = log P(s, π | θ) - log P(π | s, θ)

Multiplying by P(π | s, θ⁰) and summing over all the possible paths:

log P(s | θ) = Σπ P(π | s, θ⁰) · log P(s, π | θ) - Σπ P(π | s, θ⁰) · log P(π | s, θ)

Q(θ | θ⁰): expectation value of log P(s, π | θ) over all the "current" paths

log P(s | θ) - log P(s | θ⁰) = Q(θ | θ⁰) - Q(θ⁰ | θ⁰) + Σπ P(π | s, θ⁰) · log [ P(π | s, θ⁰) / P(π | s, θ) ]
                             ≥ Q(θ | θ⁰) - Q(θ⁰ | θ⁰)

(the last sum is a relative entropy, which is never negative)
Expectation-Maximisation algorithm
The EM algorithm is an iterative process
Each iteration performs two steps:
E-step: evaluation of Q(θ | θ⁰) = Σπ P(π | s, θ⁰) · log P(s, π | θ)
M-step: maximisation of Q(θ | θ⁰) over all θ

It is NOT guaranteed to converge to the GLOBAL maximum likelihood.
Baum-Welch implementation of the EM algorithm
E-step:

Q(θ | θ⁰) = Σπ P(π | s, θ⁰) · log P(s, π | θ)

P(s, π | θ) = a0,π(1) · Π(i=1..T) aπ(i),π(i+1) · eπ(i)(si) = Π(k=0..N) Π(l=1..N) akl^Akl(π) · Π(k=1..N) Πc∈C ek(c)^Ek(c,π)

Akl(π): number of transitions between the states k and l in path π
Ek(c, π): number of emissions of character c in path π

Akl = Σπ P(π | s, θ⁰) · Akl(π)
Ek(c) = Σπ P(π | s, θ⁰) · Ek(c, π)      (expected values over all the "actual" paths)

So:

Q(θ | θ⁰) = Σ(k=0..N) Σ(l=1..N) Akl · log akl + Σ(k=1..N) Σc∈C Ek(c) · log ek(c)

Baum-Welch implementation of the EM algorithm

M-step:

∂Q/∂akl = 0   for any pair of states k and l, with Σl akl = 1
∂Q/∂ek(c) = 0   for any state k and character c, with Σc ek(c) = 1

By means of the Lagrange multipliers technique we can solve the system:

akl = Akl / Σ(m=1..N) Akm
ek(c) = Ek(c) / Σc' Ek(c')

Baum-Welch implementation of the EM algorithm

How to compute the expected numbers of transitions and emissions over all the paths:

Fk(i) = P(s1 s2 s3 … si, π(i) = k)
Bk(i) = P(si+1 si+2 si+3 … sT | π(i) = k)

Akl = Σi Fk(i) · akl · el(si+1) · Bl(i+1) / P(s)      (expected number of times π(i) = k and π(i+1) = l, given s and θ)

Ek(c) = Σ(i : si = c) Fk(i) · Bk(i) / P(s)      (expected number of times state k emits character c, given s and θ)
Baum-Welch implementation of the EM algorithm
Algorithm:

Start with random parameters.
Compute the Forward and Backward matrices on the known sequences.
Compute Akl and Ek(c), the expected numbers of transitions and emissions.
Update akl ← Akl (normalised) and ek(c) ← Ek(c) (normalised).
If P(s | M) has increased, iterate again from the Forward/Backward step; otherwise stop.
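A compact sketch of one Baum-Welch update for the same two-state model, assuming the parameter dictionaries of the forward sketch and a single training sequence; the BEGIN/END probabilities are kept fixed for brevity, so this only illustrates the expected-count formulas above:

def forward_matrix(s):
    F = [{k: a0[k] * e[k][s[0]] for k in states}]
    for c in s[1:]:
        F.append({l: e[l][c] * sum(F[-1][k] * a[k][l] for k in states) for l in states})
    return F

def backward_matrix(s):
    B = [{k: aE[k] for k in states}]
    for c in reversed(s[1:]):
        B.insert(0, {l: sum(B[0][k] * e[k][c] * a[l][k] for k in states) for l in states})
    return B

def baum_welch_step(s):
    F, B = forward_matrix(s), backward_matrix(s)
    Ps = sum(F[-1][k] * aE[k] for k in states)
    # Expected transition counts A_kl and emission counts E_k(c)
    A = {k: {l: sum(F[i][k] * a[k][l] * e[l][s[i + 1]] * B[i + 1][l]
                    for i in range(len(s) - 1)) / Ps for l in states} for k in states}
    E = {k: {c: sum(F[i][k] * B[i][k] for i in range(len(s)) if s[i] == c) / Ps
             for c in "ACGT"} for k in states}
    # Re-normalise (END transitions kept fixed, so rows are normalised over emitting states only)
    new_a = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
    new_e = {k: {c: E[k][c] / sum(E[k].values()) for c in "ACGT"} for k in states}
    return new_a, new_e

print(baum_welch_step("AGCGCGTAATCTG"))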
Profile HMMs
•HMMs for alignments
Profile HMMs
•HMMs for alignments
How to align?

[Figure: a linear chain of match states M0 → M1 → M2 → M3 → M4 → M5]

Each state represents a position in the alignment, and each position has a peculiar composition:

A C G G T A
A C G A T C
A T G T T C

Given this set of sequences, we can train the model M0 … M5, estimating the emission probabilities:

      M0    M1    M2    M3    M4    M5
A     1     0     0     0.33  0     0.33
C     0     0.66  0     0     0     0.66
G     0     0     1     0.33  0     0
T     0     0.33  0     0.33  1     0
Given a trained model (the emission table above), we can align a new sequence, e.g. ACGATC, computing the emission probability:

P(s | M) = 1 × 0.66 × 1 × 0.33 × 1 × 0.66

And for the sequence AGATC?

With only five characters, one of the six states must be skipped (e.g. A→M0, G→M2, A→M3, T→M4, C→M5, leaving out M1): we need a way to introduce gaps.
Silent states

Adding direct transitions between non-adjacent match states would allow gaps, but requires on the order of N² extra transitions.
To reduce the number of parameters we can use states that do not emit any character: 4N-8 transitions.
Profile HMMs
Delete states, insert states and match states

[Figure: profile HMM with match states M0 … M5, insert states I0 … I4 and delete (silent) states D1 … D4]

Examples of paths through the model:

A C G G T A          →  M0 M1 M2 M3 M4 M5
A C G C A G T C      →  M0 I0 I0 M1 M2 M3 M4 M5
A G A T C            →  M0 D1 M2 M3 M4 M5

Example of alignment

Sequence 1: A S T R A L      Viterbi path: M0 M1 M2 M3 M4 M5
Sequence 2: A S T A I L      Viterbi path: M0 M1 M2 D3 M4 I4 M5
Sequence 3: A R T I          Viterbi path: M0 M1 M2 D3 D4 M5

Grouping by vertical layers (model positions 0 1 2 3 4 5):

s1  A S T R A - L
s2  A S T - A I L
s3  A R T - - - I

-log P(s | M) is an alignment score.
Searching for a structural/functional pattern in protein sequence
Zn binding loop:

[Figure: aligned family sequences; the conserved pattern columns are roughly (C/D) H C (I/L) (C/D) (R/K/S/T) (I/L) … C]

Cysteines can be replaced by an Aspartic Acid, but only ONCE in each sequence.
Searching for a structural/functional pattern in protein sequences

[Figure: profile HMM with match states M0 … M7, insert states I0 … I6 and delete states D1 … D6 modelling the pattern]

..ALCPCHCLCRICPLIY.. obtains a higher probability than ..WERWDHCIDSICLKDE.. because M0 and M4 have a low emission probability for Aspartic Acid, and for the second sequence we would multiply that low probability twice.
Profile HMMs
•HMMs for alignments
•Example on globins
Structural alignment of globins
Structural alignment of globins
Bashford D, Chothia C & Lesk AM (1987) Determinants of a protein fold: unique features of the globin amino acid sequences. J. Mol. Biol. 196, 199-216
Alignment of globins reconstructed with profile HMMs
Krogh A, Brown M, Mian IS, Sjolander K & Haussler D (1994) Hidden Markov Models in computational
biology: applications to protein modelling. J.Mol.Biol. 235, 1501-1531
Discrimination power of profile HMMs
Z-score = [ log P(s | M) - ⟨log P(s | M)⟩ ] / σ( log P(s | M) )
Krogh A, Brown M, Mian IS, Sjolander K & Haussler D (1994) Hidden Markov Models in computational biology:
applications to protein modelling. J.Mol.Biol. 235, 1501-1531
Profile HMMs
•HMMs for alignments
•Example on globins
•Other applications
Finding a domain
Profile HMM specific for
the considered domain
[Figure: Begin and End states connected, through flanking background insert states, to a profile HMM specific for the considered domain]

Clustering subfamilies

[Figure: BEGIN and END states connected in parallel to HMM 1, HMM 2, HMM 3, …, HMM n]

Each sequence s contributes to update HMM i with a weight equal to P(s | Mi).
Profile HMMs
•HMMs for alignments
•Example on globins
•Other applications
•Available codes and servers
HMMER at WUSTL: http://hmmer.wustl.edu/
Eddy SR (1998) Profile hidden Markov models. Bioinformatics 14:755-763
HMMER

Alignment of a protein family → hmmbuild → trained profile-HMM → hmmcalibrate → HMM calibrated with accurate E-value statistics

hmmbuild takes the aligned sequences, checks for redundancy and sets the emission and transition probabilities of an HMM.
hmmcalibrate takes a trained HMM, generates a great number of random sequences, scores them and fits the Extreme Value Distribution to the computed scores.

HMMER

hmmalign:   HMM + set of sequences  →  alignment of all the sequences to the model
hmmsearch:  HMM + set of sequences  →  list of sequences that match the HMM (sorted by E-value)
hmmpfam:    sequence + set of HMMs  →  list of HMMs that match the sequence
MSF Format: globins50.msf

!!AA_MULTIPLE_ALIGNMENT 1.0
PileUp of: *.pep
Symbol comparison table: GenRunData:blosum62.cmp  CompCheck: 6430
GapWeight: 12
GapLengthWeight: 4

pileup.msf  MSF: 308  Type: P  August 16, 1999 09:09  Check: 9858 ..

Name: lgb1_pea     Len: 308  Check: 2200  Weight: 1.00
Name: lgb1_vicfa   Len: 308  Check:  214  Weight: 1.00
Name: myg_escgi    Len: 308  Check: 3961  Weight: 1.00
Name: myg_horse    Len: 308  Check: 5619  Weight: 1.00
Name: myg_progu    Len: 308  Check: 6401  Weight: 1.00
Name: myg_saisc    Len: 308  Check: 6606  Weight: 1.00

//

             1                                               50
lgb1_pea     ~~~~~~~~~G FTDKQEALVN SSSE.FKQNL PGYSILFYTI VLEKAPAAKG
lgb1_vicfa   ~~~~~~~~~G FTEKQEALVN SSSQLFKQNP SNYSVLFYTI ILQKAPTAKA
myg_escgi    ~~~~~~~~~V LSDAEWQLVL NIWAKVEADV AGHGQDILIR LFKGHPETLE
myg_horse    ~~~~~~~~~G LSDGEWQQVL NVWGKVEADI AGHGQEVLIR LFTGHPETLE
myg_progu    ~~~~~~~~~G LSDGEWQLVL NVWGKVEGDL SGHGQEVLIR LFKGHPETLE
myg_saisc    ~~~~~~~~~G LSDGEWQLVL NIWGKVEADI PSHGQEVLIS LFKGHPETLE
hmmbuild globin.hmm globins50.msf
Alignment of a protein family → hmmbuild → trained profile-HMM

All the transition and emission parameters are estimated by means of the Expectation-Maximisation algorithm on the aligned sequences.

In principle we could also use NON-aligned sequences to train the model. Nevertheless, it is more efficient to build the starting alignment using, for example, CLUSTALW.
hmmcalibrate [--num N] --histfile globin.histo globin.hmm

Trained profile-HMM → hmmcalibrate → HMM calibrated with accurate E-value statistics

E-value(S): expected number of random sequences with a score > S.

A number N (default 5000) of random sequences are generated and scored with the model.

[Histogram: distribution of log P(s|M)/P(s|N) for the random sequences, compared with the score range of true globin sequences]
Trained model (globin.hmm):

HMMER2.0 [2.3.2]
NAME  globins50
LENG  143
ALPH  Amino
RF    no
CS    no
MAP   yes
COM   /home/gigi/bin/hmmbuild globin.hmm globins50.msf
COM   /home/gigi/bin/hmmcalibrate --histfile globin.histo globin.hmm
NSEQ  50
DATE  Sun May 29 19:03:18 2005
CKSUM 9858
XT    -8455 -4 -1000 -1000 -8455 -4 -8455 -4
NULT  -4 -8455
NULE  595 -1558 85 338 -294 453 -1158 197 249 …
EVD   -38.893742 0.243153
HMM   A    C    D    E    F    G    H    I    K  …        (one column per residue type)
      m->m m->i m->d i->m i->i d->m d->d b->m m->e
1     591 -1587 159 1351 -1874 -201 151 -1600 998 …
2     -926 -2616 2221 2269 -2845 -1178 -325 -2678 -300 …
3     -638 -1715 -680 497 -2043 -1540 23 -1671 2380 …
…
Trained model (globin.hmm): reading the file

(the same file is shown again several times, highlighting in turn the null model, the transitions and the emissions)

Scores are stored as Score = INT [ 1000 · log(prob / null_prob) ], where null_prob = 1 for the transitions and the natural abundance of each residue type for the emissions.

The columns m->m, m->i, m->d, i->m, i->i, d->m, d->d, b->m, m->e hold the transition scores; for each model position, the numbered lines hold the match and insert emission scores for the 20 residue types.
hmmemit [-n N] globin.hmm
Trained profile-HMM
hmmemit
Sequences generated
by the model
The parameters of the model
are used to generate new
sequences
hmmsearch globin.hmm Artemia.fa > Artemia.globin
Set of
sequences
Trained
profile-HMM
hmmsearch
List of sequences that
match the HMM (sorted
by E-value)
Search results (Artemia.globin):

Sequence   Description    Score    E-value    N
--------   -----------    -----    -------    --
S13421     S13421         474.3    1.7e-143   9

Parsed for domains (sorted by E-value; N = number of domains; seq-f/seq-t = start/end in the sequence):

Sequence  Domain  seq-f  seq-t      hmm-f  hmm-t    score   E-value
--------  ------  -----  -----      -----  -----    -----   -------
S13421    7/9     932    1075 ..    1      143 []   76.9    7.3e-24
S13421    2/9     153    293  ..    1      143 []   63.7    6.8e-20
S13421    3/9     307    450  ..    1      143 []   59.8    9.8e-19
S13421    8/9     1089   1234 ..    1      143 []   57.6    4.5e-18
S13421    9/9     1248   1390 ..    1      143 []   52.3    1.8e-16
S13421    1/9     1      143  [.    1      143 []   51.2    4e-16
S13421    4/9     464    607  ..    1      143 []   46.7    8.6e-15
S13421    6/9     775    918  ..    1      143 []   42.2    2e-13
S13421    5/9     623    762  ..    1      143 []   23.9    6.6e-08

Alignments of top-scoring domains:

S13421: domain 7 of 9, from 932 to 1075: score 76.9, E = 7.3e-24

(consensus)       *->eekalvksvwgkveknveevGaeaLerllvvyPetkryFpkFkdLss
                     +e a vk+ w+ v+ ++ vG +++ l++ +P+ +++FpkF d+
S13421     932       REVAVVKQTWNLVKPDLMGVGMRIFKSLFEAFPAYQAVFPKFSDVPL 978

(consensus)          adavkgsakvkahgkkVltalgdavkkldd...lkgalakLselHaqklr
                      d++++++ v +h    V t+l++ ++ ld++   +l+   ++L+e H+  lr
S13421     979       -DKLEDTPAVGKHSISVTTKLDELIQTLDEpanLALLARQLGEDHIV-LR 1026

(consensus)          vdpenfkllsevllvvlaeklgkeftpevqaalekllaavataLaakYk<
                     v+    fk +++vl+ l++ lg+ f+ ++  +++k+++++++ +++ +
S13421     1027      VNKPMFKSFGKVLVRLLENDLGQRFSSFASRSWHKAYDVIVEYIEEGLQ 1075
hmmalign globin.hmm globins630.fa

Set of sequences + trained profile-HMM → hmmalign → alignment of all the sequences to the model

Insertions are reported in lower case (and as '.' in the other sequences); gaps as '-':

BAHG_VITSP    QAG-..VAAAHYPIV.GQELLGAIKEV.L.G.D.AATDDILDAWGKAYGV
GLB1_ANABR    TR-K..ISAAEFGKI.NGPIKKVLAS-.-.-.K.NFGDKYANAWAKLVAV
GLB1_ARTSX    NRGT..-DRSFVEYL.KESL-----GD.S.V.D.EFT------VQSFGEV
GLB1_CALSO    TRGI..TNMELFAFA.LADLVAYMGTT.I.S.-.-FTAAQKASWTAVNDV
GLB1_CHITH    -KSR..ASPAQLDNF.RKSLVVYLKGA.-.-.T.KWDSAVESSWAPVLDF
GLB1_GLYDI    GNKH..IKAQYFEPL.GASLLSAMEHR.I.G.G.KMNAAAKDAWAAAYAD
GLB1_LUMTE    ER-N..LKPEFFDIF.LKHLLHVLGDR.L.G.T.HFDF---GAWHDCVDQ
GLB1_MORMR    QSFY..VDRQYFKVL.AGII-------.-.-.A.DTTAPGDAGFEKLMSM
GLB1_PARCH    DLNK..VGPAHYDLF.AKVLMEALQAE.L.G.S.DFNQKTRDSWAKAFSI
GLB1_PETMA    KSFQ..VDPQYFKVL.AAVI-------.-.-.V.DTVLPGDAGLEKLMSM
GLB1_PHESE    QHTErgTKPEYFDLFrGTQLFDILGDKnLiGlTmHFD---QAAWRDCYAV
HMMER applications:PFAM
http://www.sanger.ac.uk/Software/Pfam/
PFAM Exercise
Generate with hmmemit a sequence from the globin model and search it against the PFAM database.

PFAM Exercise

Retrieve from the SwissProt database the sequences CG301_HUMAN and Q9H5F4_HUMAN.
1) Search them in the PFAM database.
2) Launch PSI-BLAST searches. Is it possible to annotate the sequences by means of the BLAST results?
SAM at UCSC: http://www.soe.ucsc.edu/research/compbio/sam.html
Krogh A, Brown M, Mian IS, Sjolander K & Haussler D (1994) Hidden Markov Models in
computational biology: applications to protein modelling. J.Mol.Biol. 235, 1501-1531
SAM applications:
http://www.cse.ucsc.edu/research/compbio/HMM-apps/T02-query.html
HMMPRO: http://www.netid.com/html/hmmpro.html
Pierre Baldi, Net-ID
HMMs for Mapping problems
•Mapping problems in protein prediction
Secondary structure

Covalent structure:    TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN
Secondary structure:   EEEE..HHHHHHHHHHHH....HHHHHHHH.EEEE...........

[Figure: the corresponding 3D structure]

Topology of membrane proteins

Topography: position of the TransMembrane Segments along the sequence

[Figure: Outer membrane, β-barrel: porin (Rhodobacter capsulatus); Inner membrane, α-helices: bacteriorhodopsin (Halobacterium salinarum); the sequence ALALMLCMLTYRHKELKLKLKK… is mapped onto the membrane]
HMMs for Mapping problems
•Mapping problems in protein prediction
•Labelled HMMs
HMM for secondary structure prediction

Simplest model: three fully connected states, one for each conformation (α, β, c).

Introducing a grammar: each conformation is described by a short chain of states (e.g. α1 … α4, β1 β2, and a coil state c).

HMM for secondary structure prediction: Labels

The states α1, α2, α3, α4 share the same label (α); so do the states β1 and β2.

Decoding the Viterbi path for emitting a sequence s makes a mapping between the sequence s and a sequence of labels y:

s (sequence):   S A L K M N Y T R E I M V A S N Q
π (path):       c α1 α2 α3 α4 α4 α4 c c c c β1 β2 β2 β2 c c
Y(π) (labels):  c α α α α α α c c c c β β β β c c

Computing P(s, y | M)

P(s, y | M) = Σ(π : Y(π) = y) P(s, π | M)

Only the paths whose labelling is y have to be considered in the sum.
In the Forward and Backward algorithms this means setting

Fk(i) = 0,  Bk(i) = 0      if  Y(k) ≠ yi

s: S A L K M N Y T R E I M V A S N Q      (sequence)
y: c α α α α α α c c c c β β β β c c      (labels)
Baum-Welch training algorithm for labelled HMMs
Given a set of known labelled sequences (e.g. amino acid sequences and their native secondary structure) we want to find the parameters of the model, without knowing the generating paths:

θML = argmaxθ [ P(s, y | θ, M) ]

The algorithm is the same as in the non-labelled case if we use the Forward and Backward matrices defined in the last slide.

This is a supervised learning of the mapping.
HMMs for Mapping problems
•Mapping problems in protein prediction
•Labelled HMMs
•Duration modelling
Self-loops and geometric decay

[Figure: Begin → a state with self-loop probability p → End with probability 1-p]

P(l) = p^(l-1) · (1 - p)

[Plot: P(l) versus segment length l (1–20) for p = 0.9, 0.5 and 0.1]

The length distribution of the generated segments is always exponential-like.

How can we model other length distributions?

Limited case

[Figure: a chain of states 1 → 2 → 3 → … → N; from state i the model exits to End with probability pi or continues to state i+1 with probability 1-pi]

This topology can model any length distribution between 1 and N:

P(1) = p1
P(2) = (1 - p1) · p2
P(3) = (1 - p1) · (1 - p2) · p3
…
P(N) = Π(i=1..N-1) (1 - pi) · pN

and, conversely,

p1 = P(1)
p2 = P(2) / (1 - p1)
…
pk = P(k) / Π(i=1..k-1) (1 - pi)
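A small sketch of the conversion above: given a target length distribution P(1..N), compute the exit probabilities p_i of the chain and check that the chain reproduces the distribution (the target values are made up):

# Convert a target segment-length distribution P(l), l = 1..N, into the exit
# probabilities p_l of a left-to-right chain of states (see formulas above).
target = [0.1, 0.3, 0.4, 0.2]           # example P(1)..P(4), must sum to 1

p, survive = [], 1.0
for P_l in target:
    p.append(P_l / survive)             # p_k = P(k) / prod_{i<k} (1 - p_i)
    survive *= 1.0 - p[-1]

# Check: regenerate the length distribution from the chain
regen, survive = [], 1.0
for pk in p:
    regen.append(survive * pk)
    survive *= 1.0 - pk

print(p)       # exit probabilities
print(regen)   # should match `target`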
How can we model other length distributions?

Non-limited case

[Figure: a chain of states 1 → 2 → … → N where the last state also has a self-loop (probability pN) before exiting to End (probability pN+1)]

This topology can model any length distribution between 1 and N-1 plus a geometric decay from N to ∞.
Secondary structure: length statistics

[Plot: frequency versus segment length (0–34 residues) for helix, strand and coil segments]

Secondary structure: model

[Figure: a labelled HMM built from chains of states for strand (β1 … β4), coil (c1 … c4) and helix (a longer chain), so that the observed length distributions can be reproduced]

Do we use the same emission probabilities for states sharing the same label?
HMMs for Mapping problems
•Mapping problems in protein prediction
•Labelled HMMs
•Duration modelling
•Models for membrane proteins
[Figure: Outer membrane, β-barrel: porin (Rhodobacter capsulatus); Inner membrane, α-helices: bacteriorhodopsin (Halobacterium salinarum)]

Topology of membrane proteins

Topography: position of the TransMembrane Segments along the sequence

[Figure: as above, with the sequence ALALMLCMLTYRHKELKLKLKK… mapped onto the membrane]

A generic model for membrane proteins (TMHMM)

[Figure: Begin and End states connected to three labelled sub-models: outer side, transmembrane, inner side]

Model of β-barrel membrane proteins

[Figure: Begin/End connected to outer-side, transmembrane and inner-side sub-models]

Labels: transmembrane states and loop states.

Length of transmembrane β-strands: minimum 6 residues, maximum unbounded.

Six different sets of emission parameters, including: outer loop, inner loop, long globular domains, TM strand edges, TM strand core.

Model of α-helix membrane proteins (HMM1)

[Figure: transmembrane helices modelled with duration-modelling chains of states (×10, ×12, ×13), plus outer-side and inner-side loop states]

Model of α-helix membrane proteins (HMM2)

[Figure: a simpler variant with ×10 chains for the transmembrane helices]

Dynamic programming filtering procedure

[Plot: per-residue TMS probability along the sequence of 1A0S (residues 1–451)]

[Plot: the same profile with the predicted TMS segments, obtained as maximum-scoring subsequences with constrained segment length and number]

[Plot: the same profile with both the predicted and the observed TMS segments]
Predictors of alpha-transmembrane topology
www.cbs.dtu.dk/services/TMHMM
Hybrid systems: Basics
•Sequence profile based HMMs
MSA → sequence profiles

[Figure: a multiple sequence alignment of 10 sequences; the first column, for example, contains Y Y Y Y Y Y Y G T T]

Sequence profile: for each sequence position, the percentage of each of the 20 residue types (A, C, D, E, F, G, H, K, I, L, M, N, P, Q, R, S, T, V, W, Y) observed in the corresponding alignment column; for the example column above: G = 10, T = 20, Y = 70, all other types = 0.
Sequence-profile-based HMM

Instead of a sequence of characters st, the input is a sequence of M-dimensional vectors vt (the profile columns); for proteins M = 20.

Constraints:
0 ≤ vt(n) ≤ S   for every t, n
Σ(k=1..M) vt(k) = S

Sequence-profile-based HMM

Probability of emission from state k:

sequence of characters:   P(st | k) = ek(st)
sequence of vectors:      P(vt | k) = (1/Z) · Σ(n=1..M) vt(n) · ek(n)

with the constraints Σ(n=1..M) ek(n) = 1 and ∫ P(vt | k) d^M vt = 1, where Z is a normalisation factor.

If Z is independent of the state, the algorithms for training and probability computation can be derived as in the standard case.
Hybrid systems: Basics
•Sequence profile based HMMs
•Membrane protein topology
Scoring the prediction

1) Accuracy:
Q2 = P / N
where P is the total number of correctly predicted residues and N is the total number of residues.

2) Correlation coefficient C:
C(s) = [ p(s)·n(s) - u(s)·o(s) ] / [ (p(s)+u(s))·(p(s)+o(s))·(n(s)+u(s))·(n(s)+o(s)) ]^(1/2)
where, for each class s, p(s) and n(s) are respectively the total number of correct predictions and of correctly rejected assignments, while u(s) and o(s) are the numbers of under- and over-predictions.

3) Accuracy for each discriminated structure s:
Q(s) = p(s) / [ p(s) + u(s) ]
where p(s) and u(s) are the same as in equation 2.

4) Probability of correct predictions P(s):
P(s) = p(s) / [ p(s) + o(s) ]
where p(s) and o(s) are the same as in equation 2.

5) Segment-based measure (Sov), computed over the overlaps between observed and predicted segments.
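A sketch of the per-class measures above (Q2, C(s), Q(s), P(s)) computed from an observed and a predicted label string; the two strings are made-up examples:

import math

obs  = "LLTTTTLLLLTTTTTLLL"   # observed labels (made up): T = transmembrane, L = loop
pred = "LLLTTTLLLLTTTTLLLL"   # predicted labels, same length

def class_scores(s):
    p  = sum(o == s and q == s for o, q in zip(obs, pred))   # correct predictions of class s
    n  = sum(o != s and q != s for o, q in zip(obs, pred))   # correctly rejected
    u  = sum(o == s and q != s for o, q in zip(obs, pred))   # under-predictions
    o_ = sum(o != s and q == s for o, q in zip(obs, pred))   # over-predictions
    C = (p * n - u * o_) / math.sqrt((p + u) * (p + o_) * (n + u) * (n + o_))
    return {"Q": p / (p + u), "P": p / (p + o_), "C": C}

q2 = sum(o == q for o, q in zip(obs, pred)) / len(obs)
print(q2, class_scores("T"))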
Topology of -barrel membrane proteins
Performance of sequence-profile-based HMM
Q2
83%
76%
78%
QTMS Qloop
83%
77%
74%
82%
76%
82%
PTMS
79%
72%
81%
Ploop
85%
80%
76%
Corr
Sov
HMM based on
Multiple Sequence
Alignment
0.65
0.83
0.53
Standard HMM based
0.64 on Single Sequence
0.56
0.79
NN based on
Multiple Sequence
Alignment
Martelli PL, Fariselli P, Krogh A, Casadio R -A sequence-profile-based HMM for
predicting and discriminating beta barrel membrane proteins- Bioinformatics 18:
S46-S53 (2002)
Topology of -barrel membrane proteins
Discriminative power of the profile-based HMM
100
90
80
Percentage
70
Beta barrel membrane proteins (145)
Globular proteins(1239)
All helical membrane proteins (188)
60
50
40
30
20
10
0
2.75
2.8
2.85
2.9
I(s | M) = -1/L log P(s | M)
2.95
The Bologna predictor for the topology of all-α membrane proteins

[Figure: sequence → sequence profiles → NN, HMM1 and HMM2 in parallel → Σ → MaxSubSeq (dynamic programming) → topography → Von Heijne rule → topology prediction]
Topology of all- membrane proteins
Performance
Q2 %
Corr
QTM %
NN°
85.8
0.714
84.1
QLoop
%
87.3
PTM
%
84.7
PLoop
%
86.8
Qtopography
Qtoplogy
49/59 (83%)
38/59 (64%)
0.908
HMM1°
84.4
0.692
88.6
80.9
79.5
89.4
48/59 (81%)
38/59 (64%)
0.896
HMM2°
82.4
0.658
88.0
78.1
77.1
88.1
48/59 (81%)
39/59 (66%)
0.872
Jury°
85.3
0.708
86.9
84.1
82.1
88.4
53/59 (90%)
42/59 (71%)
0.926
TMHMM 2.0 82.3
0.661
70.9
92.9
89.3
79.2
42/59 (71%)
32/59 (54%)
0.840
MEMSAT
83.6
0.672
70.6
93.9
90.5
75.7
35/49 (71%)
24/49 (49%)
0.823
PHD
80.1
0.614
63.6
94.0
89.9
75.5
43/59 (73%)
30/59 (51%)
0.847
HMMTOP
81.2
0.627
68.5
91.8
87.5
77.7
45/59 (76%)
35/59 (59%)
0.862
Nir Ben-Tal
46/59 (78%)
KD
43/59 (73%)
SOV
° ° Test
Martelli PL, Fariselli P, Casadio R -An ENSEMBLE machine learning approach for
the prediction of all-alpha membrane proteins- Bioinformatics (in press, 2003)
HMM: Application in gene finding
•Basics
Eukaryotic gene structure
A. Krogh
Simple model for coding regions
A. Krogh
Simple model for unspliced gene
A. Krogh
Simple model for spliced gene
A. Krogh