Sequence classification & hidden Markov models


Sequence classification & hidden Markov models
Bioinformatics, Models & algorithms, 8th November 2005
Patrik Johansson, Dept. of Cell & Molecular Biology, Uppsala University
A family of proteins shares a similar structure, but not necessarily a similar sequence.
Classification of an unknown sequence s to family A or B using HMMs.
[Figure: the unknown sequence s lies between a cluster of family A sequences and a cluster of family B sequences]
Hidden Markov Models, introduction
• General method for pattern recognition, cf. neural networks
• An HMM generates sequences / sequence distributions
• Markov chain of events
Three coins A, B & C give a Markov chain Γ = CAABA..
[Figure: three-state Markov chain over the coins A, B and C]
Outcome, e.g. Heads Heads Tails, generated by the hidden Markov chain Γ
Hidden Markov Models, introduction..
• Model M emits a symbol (T or H) in each state i according to an emission probability e_i
• The next state j is chosen according to a transition probability a_i,j
e.g. the sequence s = ‘Tails Heads Tails’ generated over the path Γ = BCC
[Figure: the outcome Tails, Heads, Tails emitted along the state path B → C → C]
$$P(s \mid M) = a_{0,B_1}\, e_{B_1}(\mathrm{Tails}) \cdot a_{B_1,C_2}\, e_{C_2}(\mathrm{Heads}) \cdot a_{C_2,C_3}\, e_{C_3}(\mathrm{Tails})$$

$$\sum_{s' \in S} P(s' \mid M) = 1$$
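To make the bookkeeping concrete, here is a minimal Python sketch of the same product of transition and emission probabilities. All numerical values for the coin model are made-up assumptions, not taken from the slides.

```python
# Minimal sketch: probability of one outcome sequence along one fixed path
# through a three-coin HMM. All numbers are illustrative assumptions.
transition = {            # a[i][j]: probability of moving from state i to j
    "start": {"A": 0.3, "B": 0.4, "C": 0.3},
    "A": {"A": 0.2, "B": 0.4, "C": 0.4},
    "B": {"A": 0.3, "B": 0.3, "C": 0.4},
    "C": {"A": 0.3, "B": 0.3, "C": 0.4},
}
emission = {              # e_i(x): probability that coin i shows x
    "A": {"Heads": 0.5, "Tails": 0.5},
    "B": {"Heads": 0.7, "Tails": 0.3},
    "C": {"Heads": 0.4, "Tails": 0.6},
}

def path_probability(symbols, path):
    """P(s, path | M): product of transition and emission probabilities."""
    p = 1.0
    prev = "start"
    for state, symbol in zip(path, symbols):
        p *= transition[prev][state] * emission[state][symbol]
        prev = state
    return p

print(path_probability(["Tails", "Heads", "Tails"], ["B", "C", "C"]))
```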
Profile hidden Markov Model architecture
• A first approach for sequence distribution modelling
[Figure: a linear chain of match states, B → M1 → … → Mj → … → MN → E]

$$P(s \mid M) = \prod_{j=1}^{N} e_{M_j}(s_j)$$
Profile hidden Markov Model architecture..
• Insertion modelling
[Figure: insert state Ij placed above the match track, between Mj-1, Mj and Mj+1, from B to E]
Insertions are modelled as random: e_Ij(a) = q(a)

$$P(k \text{ insertions}) = a_{M_j,I_j}\, q(s_1) \left( \prod_{i=2}^{k} a_{I_j,I_j}\, q(s_i) \right) a_{I_j,M_{j+1}}$$
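As a quick check of the insertion formula, a minimal Python sketch; the transition probabilities and the background distribution q below are made-up values, not parameters from the slides.

```python
# Illustrative assumptions: transition probabilities around one insert state I_j
a_Mj_Ij  = 0.05   # open an insertion after match state M_j
a_Ij_Ij  = 0.30   # extend the insertion
a_Ij_Mj1 = 0.70   # close the insertion and continue to M_{j+1}

def p_insertions(inserted_residues, q):
    """P(k insertions) emitted by insert state I_j.

    Follows the slide's formula: one opening transition, k random emissions
    from the background distribution q, k-1 self-transitions, and one closing
    transition back to the match track.
    """
    p = a_Mj_Ij * q[inserted_residues[0]]
    for res in inserted_residues[1:]:
        p *= a_Ij_Ij * q[res]
    return p * a_Ij_Mj1

q = {"A": 0.08, "G": 0.07, "L": 0.10}   # tiny made-up background distribution
print(p_insertions(["L", "G"], q))       # k = 2 inserted residues
```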
Profile Hidden Markov Model architecture..
• Deletion modelling
[Figure: deletions modelled either as direct jump transitions past match state Mj or, alternatively, with a silent delete state Dj]
Profile Hidden Markov Model architecture..
Insert & delete states are generalized to all positions. The model M can generate sequences by starting in state B and making successive emissions and transitions until the end state E is reached.
[Figure: the full profile HMM architecture with begin state B, match states Mj, insert states Ij, delete states Dj and end state E]
Probabilistic sequence modelling
• Classification criteria
$$P(M_i \mid s) \;\geq\; P(M_j \mid s), \qquad j = 1 \ldots k,\; j \neq i \qquad (1)$$
Bayes' theorem:

$$P(M \mid s) = \frac{P(s \mid M)\, P(M)}{P(s)} \qquad (2)$$
..but, P(M) & P(s)..?
$$\frac{P(M \mid s)}{P(N \mid s)} = \frac{P(s \mid M)\, P(M)}{P(s \mid N)\, P(N)} \qquad (3)$$
Probabilistic sequence modelling..
If N models the whole sequence space (N = q)
$$\frac{P(M \mid s)}{P(q \mid s)} \;\geq\; 1 \qquad (4)$$

Since P(s | M) << 1, logarithmic probabilities are more convenient:
$$\log P(M \mid s) - \log P(q \mid s) = \log P(s \mid M) - \log P(s \mid q) + \log \frac{P(M)}{P(q)}$$

Definition, log-odds score V:

$$\mathrm{score} = \log P(s \mid M) - \log P(s \mid q) \qquad (5)$$
Probabilistic sequence modelling..
Eqs. (4) & (5) give a new classification criterion:

$$\mathrm{score} = \log_z P(s \mid M) - \log_z P(s \mid q) \;\geq\; \log_z P(q) - \log_z P(M) = d \qquad (6)$$

..for a certain significance level ε (i.e. the expected number of incorrect classifications in a database of size n), a threshold d is required:

$$\varepsilon = n\, z^{-d} \qquad (7)$$

$$d = \log_z n - \log_z \varepsilon$$
Probabilistic sequence modelling..
Example
If z = e or z = 2, and the significance level is chosen as one incorrect classification
(false positive) per 1000 trials in a database of n = 10000 sequences:

$$d = \ln n - \ln \varepsilon \approx 16 \text{ nits} \qquad \text{or} \qquad d = \log_2 n - \log_2 \varepsilon \approx 23 \text{ bits}$$
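A quick numeric check of equation (7), as a small Python sketch using only the standard library:

```python
import math

def threshold(n, eps, base=math.e):
    """Score threshold d such that eps = n * base**(-d), i.e. d = log_base(n / eps)."""
    return math.log(n / eps, base)

n, eps = 10_000, 1e-3
print(f"{threshold(n, eps):.1f} nits")          # ~16.1 nits (base e)
print(f"{threshold(n, eps, base=2):.1f} bits")  # ~23.3 bits (base 2)
```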
Large vs. small threshold d
[Figure: two panels comparing a high threshold d, where only true family A members score above the cutoff, with a low threshold d, where a family B sequence also scores above the cutoff and becomes a false positive among the true positives]
Model characteristics
One can define the sensitivity, ‘how many are found’:

$$\mathrm{recall,\ accuracy,\ sensitivity} = \frac{\text{true positives}}{\text{true examples}}$$

..and the selectivity, ‘how many are correct’:

$$\mathrm{precision,\ reliability,\ selectivity} = \frac{\text{true positives}}{\text{all positives}}$$
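A minimal sketch of both measures in Python, assuming the search outcome is given as simple counts; the numbers in the example are hypothetical, not results from the slides.

```python
def sensitivity(true_positives, true_examples):
    """Recall / sensitivity: fraction of real family members that are found."""
    return true_positives / true_examples

def selectivity(true_positives, all_positives):
    """Precision / selectivity: fraction of reported hits that are correct."""
    return true_positives / all_positives

# Hypothetical search result: 120 family members in the database,
# 100 sequences reported, 90 of them genuine.
print(sensitivity(90, 120))   # 0.75
print(selectivity(90, 100))   # 0.90
```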
Model construction
• From initial alignment
Most common method. Start from an initial multiple
alignment of e.g. a protein family
• Iteratively
By successive database searches incorporating new similar
sequences into the model
• Neural-inspired
The model is trained using some continuous minimization
algorithm, e.g. Baum-Welch, steepest descent etc.
Model construction..
A short family alignment gives a simple model M; potential match states are marked with an asterisk (*)

    *  *        *
    A  _  _  _  K
    A  D  _  _  R
    A  D  _  _  R
    S  D  _  _  K
    A  E  L  G  R

[Figure: the corresponding model graph from begin state B to end state E]
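A minimal sketch of how match columns and residue counts could be read off from such an alignment; the gap-fraction rule below is a common heuristic assumption, not something specified on the slide.

```python
from collections import Counter

# The small family alignment from the slide ('_' marks a gap).
alignment = [
    "A___K",
    "AD__R",
    "AD__R",
    "SD__K",
    "AELGR",
]

def match_columns(aln, max_gap_fraction=0.5):
    """Columns with at most this fraction of gaps become match states."""
    n_seq = len(aln)
    cols = []
    for j in range(len(aln[0])):
        gaps = sum(1 for row in aln if row[j] == "_")
        if gaps / n_seq <= max_gap_fraction:
            cols.append(j)
    return cols

def match_counts(aln, cols):
    """Observed residue counts c_j(a) for each match column."""
    return {j: Counter(row[j] for row in aln if row[j] != "_") for j in cols}

cols = match_columns(alignment)
print(cols)                          # [0, 1, 4] -> the three starred columns
print(match_counts(alignment, cols))
```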
Model construction..
A more generalized model
Example: evaluate the sequence s = ‘AIEH’
[Figure: the same family alignment with insert and delete states added at every position, and several alternative ways of aligning s = ‘AIEH’ to the model between the begin state B and the end state E]
Sequence evaluation
The optimal alignment, i.e. the path that has the greatest probability of generating the sequence s, can be determined through dynamic programming.

The maximum log-odds score V_j^M(s_i) for match state j emitting s_i is calculated from the emission score plus the previous maximum score plus the corresponding transition score:

$$V_j^M(s_i) = \log_z \frac{e_{M_j}(s_i)}{q(s_i)} + \max \begin{cases} V_{j-1}^M(s_{i-1}) + \log_z a_{M_{j-1},M_j} \\ V_{j-1}^I(s_{i-1}) + \log_z a_{I_{j-1},M_j} \\ V_{j-1}^D(s_{i-1}) + \log_z a_{D_{j-1},M_j} \end{cases}$$

[Figure: match state Mj with incoming transitions from Mj-1, Ij-1 and Dj-1]
Sequence evaluation..
Viterbi's algorithm:

$$V_j^M(s_i) = \log_z \frac{e_{M_j}(s_i)}{q(s_i)} + \max \begin{cases} V_{j-1}^M(s_{i-1}) + \log_z a_{M_{j-1},M_j} \\ V_{j-1}^I(s_{i-1}) + \log_z a_{I_{j-1},M_j} \\ V_{j-1}^D(s_{i-1}) + \log_z a_{D_{j-1},M_j} \end{cases} \qquad (8)$$

$$V_j^I(s_i) = \log_z \frac{e_{I_j}(s_i)}{q(s_i)} + \max \begin{cases} V_j^M(s_{i-1}) + \log_z a_{M_j,I_j} \\ V_j^I(s_{i-1}) + \log_z a_{I_j,I_j} \\ V_j^D(s_{i-1}) + \log_z a_{D_j,I_j} \end{cases} \qquad (9)$$

$$V_j^D(s_i) = \max \begin{cases} V_{j-1}^M(s_i) + \log_z a_{M_{j-1},D_j} \\ V_{j-1}^I(s_i) + \log_z a_{I_{j-1},D_j} \\ V_{j-1}^D(s_i) + \log_z a_{D_{j-1},D_j} \end{cases} \qquad (10)$$

Initialisation:

$$V^B(s_0) = 0 \;\Leftrightarrow\; P(\text{the path begins with } s_1) = 1$$
$$V^B(s_i) = -\infty, \; i \neq 0 \;\Leftrightarrow\; P(\text{the path does not begin with } s_1) = 0$$
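A minimal Python sketch of this log-odds Viterbi recursion. It simplifies the slides' model: the begin state is treated as M_0, there are no flanking insert states, and one shared transition table is used for all positions; the model parameters and the example sequence are made-up values.

```python
import math

NEG_INF = float("-inf")

def viterbi_score(seq, match_emission, background, trans, n_states, z=math.e):
    """Log-odds Viterbi score of seq against a small profile HMM (eqs. 8-10)."""
    def log(p):
        return math.log(p, z) if p > 0 else NEG_INF

    L = len(seq)
    # V[x][j][i]: best log-odds score ending in state x_j after emitting i symbols
    V = {x: [[NEG_INF] * (L + 1) for _ in range(n_states + 1)] for x in "MID"}
    V["M"][0][0] = 0.0   # treat M_0 as the begin state B: the path starts before s_1

    for j in range(1, n_states + 1):
        for i in range(L + 1):
            if i > 0:
                res = seq[i - 1]
                em = log(match_emission[j][res]) - log(background[res])
                V["M"][j][i] = em + max(
                    V["M"][j - 1][i - 1] + log(trans[("M", "M")]),
                    V["I"][j - 1][i - 1] + log(trans[("I", "M")]),
                    V["D"][j - 1][i - 1] + log(trans[("D", "M")]),
                )
                # insert emissions use the background q, so their log-odds term is 0
                V["I"][j][i] = max(
                    V["M"][j][i - 1] + log(trans[("M", "I")]),
                    V["I"][j][i - 1] + log(trans[("I", "I")]),
                    V["D"][j][i - 1] + log(trans[("D", "I")]),
                )
            V["D"][j][i] = max(
                V["M"][j - 1][i] + log(trans[("M", "D")]),
                V["I"][j - 1][i] + log(trans[("I", "D")]),
                V["D"][j - 1][i] + log(trans[("D", "D")]),
            )
    return max(V["M"][n_states][L], V["I"][n_states][L], V["D"][n_states][L])

# Hypothetical two-match-state model and the two-residue sequence 'AG'
background = {"A": 0.5, "G": 0.5}
match_emission = {1: {"A": 0.9, "G": 0.1}, 2: {"A": 0.2, "G": 0.8}}
trans = {("M", "M"): 0.8, ("M", "I"): 0.1, ("M", "D"): 0.1,
         ("I", "M"): 0.6, ("I", "I"): 0.3, ("I", "D"): 0.1,
         ("D", "M"): 0.6, ("D", "I"): 0.1, ("D", "D"): 0.3}
print(viterbi_score("AG", match_emission, background, trans, n_states=2))
```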
Parameter estimation, background
• Proteins with similar structures can have very different sequences
• Classical sequence alignment based only on heuristic rules &
parameters cannot deal with sequence identities below ~ 50-60%
• Substitution matrices add static a priori information about amino acids
and protein sequences → good alignments down to ~ 25-30% sequence
identity, e.g. CLUSTAL
• How to get further down into ‘the twilight zone’..?
- More and dynamic a priori information..!
Parameter estimation
    *  *        *
    A  _  _  _  K
    A  D  _  _  R
    A  D  _  _  R
    S  D  _  _  K
    A  E  L  G  R
Probability of emitting an alanine in the first match state, e_M1(‘A’)..?

• Maximum likelihood estimation

$$e_{M_j}(a) = \frac{c_j(a)}{\sum_{a'} c_j(a')}$$
Parameter estimation..
• Add-one pseudocount estimation

$$e_{M_j}(a) = \frac{c_j(a) + 1}{\sum_{a'} c_j(a') + 20}$$

• Background pseudocount estimation

$$e_{M_j}(a) = \frac{c_j(a) + A\, q(a)}{\sum_{a'} c_j(a') + A}$$
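A minimal sketch of the three estimators side by side, using the counts of the first match column of the example alignment (A A A S A). The uniform background q is a made-up assumption for illustration.

```python
def ml_estimate(counts, alphabet):
    """Maximum likelihood: e(a) = c(a) / sum_a' c(a')."""
    total = sum(counts.get(a, 0) for a in alphabet)
    return {a: counts.get(a, 0) / total for a in alphabet}

def add_one_estimate(counts, alphabet):
    """Add-one pseudocount: e(a) = (c(a) + 1) / (sum_a' c(a') + |alphabet|)."""
    total = sum(counts.get(a, 0) for a in alphabet) + len(alphabet)
    return {a: (counts.get(a, 0) + 1) / total for a in alphabet}

def background_estimate(counts, alphabet, q, A=20.0):
    """Background pseudocount: e(a) = (c(a) + A*q(a)) / (sum_a' c(a') + A)."""
    total = sum(counts.get(a, 0) for a in alphabet) + A
    return {a: (counts.get(a, 0) + A * q[a]) / total for a in alphabet}

alphabet = "ACDEFGHIKLMNPQRSTVWY"              # the 20 amino acids
counts = {"A": 4, "S": 1}                      # first match column: A A A S A
q = {a: 1 / len(alphabet) for a in alphabet}   # made-up uniform background

print(ml_estimate(counts, alphabet)["A"])            # 4 / 5 = 0.80
print(add_one_estimate(counts, alphabet)["A"])       # (4 + 1) / (5 + 20) = 0.20
print(background_estimate(counts, alphabet, q)["A"]) # (4 + 1) / (5 + 20) = 0.20
```

With a uniform background and A = 20, the background pseudocount reduces to add-one; a non-uniform q would shift the estimates toward the typical amino acid frequencies.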
Parameter estimation..
• Substitution mixture estimation
Score:

$$s(a,b) = \log \frac{P(a,b)}{q(a)\, q(b)} \;\Rightarrow\; P(b \mid a) = q(b)\, e^{s(a,b)}$$

Maximum likelihood gives pseudocounts α:

$$f_j(a) = \frac{c_j(a)}{\sum_{a'} c_j(a')}, \qquad \alpha_j(a) = A \sum_b f_j(b)\, P(a \mid b)$$

Total estimation:

$$e_{M_j}(a) = \frac{c_j(a) + \alpha_j(a)}{\sum_{a'} \left( c_j(a') + \alpha_j(a') \right)}$$
[Figure: bar chart of a substitution-mixture emission distribution over the 20 amino acids A..Y]
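A minimal sketch of the substitution-mixture pseudocounts in Python. The three-letter alphabet and the conditional matrix P(a|b) are made-up assumptions; in practice P(a|b) would come from a real substitution matrix.

```python
def substitution_pseudocounts(counts, cond, alphabet, A=20.0):
    """Pseudocounts alpha_j(a) = A * sum_b f_j(b) * P(a|b), then the total
    estimate e_Mj(a) = (c_j(a) + alpha_j(a)) / sum_a' (c_j(a') + alpha_j(a'))."""
    total = sum(counts.get(a, 0) for a in alphabet)
    f = {a: counts.get(a, 0) / total for a in alphabet}
    alpha = {a: A * sum(f[b] * cond[b][a] for b in alphabet) for a in alphabet}
    denom = sum(counts.get(a, 0) + alpha[a] for a in alphabet)
    return {a: (counts.get(a, 0) + alpha[a]) / denom for a in alphabet}

# Toy three-letter alphabet with a made-up conditional matrix P(a|b)
# (rows are b, columns are a; each row sums to 1).
alphabet = "ASD"
cond = {
    "A": {"A": 0.7, "S": 0.2, "D": 0.1},
    "S": {"A": 0.3, "S": 0.6, "D": 0.1},
    "D": {"A": 0.1, "S": 0.1, "D": 0.8},
}
counts = {"A": 4, "S": 1}    # first match column: A A A S A
print(substitution_pseudocounts(counts, cond, alphabet, A=5.0))
```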
Parameter estimation..
All of the above methods are, in spite of their dynamic implementation, still based on heuristic parameters.

A method that compensates for and complements the lack of data in a statistically correct way:

• Dirichlet mixture estimation

Looking at sequence alignments, several different amino acid distributions seem to be recurring, not just the background distribution q.

Assume that there are k probability densities ρ_j(p) that generate these, combined as a mixture:

$$\rho(p) = g_1\, \rho_1(p) + \ldots + g_k\, \rho_k(p), \qquad \sum_{j=1}^{k} g_j = 1$$
Parameter estimation, Dirichlet Mixture style..
Given the data, a count vector n = (n_1 … n_20), this method allows a linear combination of k individual estimations, weighted with the probability that n is generated by each component:

$$e_i = \sum_{j=1}^{k} P(\rho_j \mid n)\, \frac{n_i + \alpha_i^{(j)}}{\sum_{l=1}^{20} \left( n_l + \alpha_l^{(j)} \right)}$$

The k components can be modelled from a curated database of alignments. Using some parametric form of the probability density, an explicit expression for the probability that n has been generated by the jth component can be derived.

Ex. with a Dirichlet density

$$\rho_j(p) \propto \prod_{i=1}^{20} p_i^{\alpha_i^{(j)} - 1}$$

$$P(n \mid \rho_j) = \frac{\Gamma(n+1)\, \Gamma(\alpha^{(j)})}{\Gamma(n + \alpha^{(j)})} \prod_{i=1}^{20} \frac{\Gamma(n_i + \alpha_i^{(j)})}{\Gamma(n_i + 1)\, \Gamma(\alpha_i^{(j)})}$$

where n = Σ_i n_i and α^{(j)} = Σ_i α_i^{(j)}.
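A minimal sketch of the Dirichlet-mixture estimate in Python. The two components, their mixture weights and the toy three-letter alphabet are made-up assumptions; real components would be fitted to a curated alignment database.

```python
import math

def log_p_n_given_component(n, alpha):
    """log P(n | rho_j) for one Dirichlet component with parameters alpha."""
    N, A = sum(n), sum(alpha)
    lp = math.lgamma(N + 1) + math.lgamma(A) - math.lgamma(N + A)
    for ni, ai in zip(n, alpha):
        lp += math.lgamma(ni + ai) - math.lgamma(ni + 1) - math.lgamma(ai)
    return lp

def dirichlet_mixture_estimate(n, components):
    """e_i = sum_j P(rho_j | n) * (n_i + alpha_i^(j)) / sum_l (n_l + alpha_l^(j)).

    components: list of (mixture weight g_j, parameter vector alpha^(j)).
    """
    # Posterior component weights P(rho_j | n) proportional to g_j * P(n | rho_j)
    log_post = [math.log(g) + log_p_n_given_component(n, a) for g, a in components]
    m = max(log_post)
    post = [math.exp(lp - m) for lp in log_post]
    total = sum(post)
    post = [p / total for p in post]

    e = [0.0] * len(n)
    for w, (_, alpha) in zip(post, components):
        denom = sum(n) + sum(alpha)
        for i in range(len(n)):
            e[i] += w * (n[i] + alpha[i]) / denom
    return e

# Toy three-letter alphabet; one made-up component favours the first letter,
# the other is close to uniform.
components = [(0.5, [5.0, 1.0, 1.0]), (0.5, [2.0, 2.0, 2.0])]
n = [4, 1, 0]                      # observed count vector
print(dirichlet_mixture_estimate(n, components))
```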
Parameter estimation, Dirichlet Mixture style..
[Figure: the count vector n positioned among the component peaks in sequence space]

The k components describe peaks of amino acid distributions in some kind of multidimensional space.

Depending on where in sequence space our count vector n lies, i.e. depending on which components can be assumed to have generated n, distribution information is incorporated into the probability estimate e.
Classification example
Alignment of some known glycoside hydrolase family 16 sequences
• Define which columns are to be regarded as match states (*)
• Build the corresponding model M & HMM graph
• Estimate all emission and transition probabilities, ej & ajk
• Evaluate the log-odds score / probability that an unknown sequence s has
been generated by M using Viterbi's algorithm
• If score(s | M) > d, the sequence can be classified as a GH16 family member
Classification example..
A certain sequence s1=WHKLRQ.. is evaluated and gets a score of -17.63
nits, i.e. the probability that M has generated s1 is very small
Another sequence s2=SDGSYT.. gets a score of 27.49 nits and can with
good significance be classified as a family member
Summary
• Hidden Markov models are used mainly for classification /
searching (PFAM), but also for sequence mapping / alignment
• As compared to normal alignment, a position specific approach is
used for sequence distributions, insertions and deletions
• Model building is usually a compromise between sensitivity and
selectivity. If more a priori information is incorporated, the
sensitivity goes up whereas the selectivity goes down