CS5263 Bioinformatics - Department of Computer Science

Download Report

Transcript CS5263 Bioinformatics - Department of Computer Science

CS5263 Bioinformatics
Lecture 14: Hidden Markov
Models and applications
• Last lecture
– Some practical issues in HMM modeling
– HMMs for pairwise sequence alignment
• Today
– HMM for multiple sequence alignment
– HMM for gene prediction
Duration models
s
P(L)
p
1-p
L
p
s
s
s
s
1-p
Min, then geometric
p
p
p
p
s
s
s
s
1-p
1-p
Negative binominal
1-p
1-p
Explicit duration modeling
Exon
Generalized HMM.
Often used in gene
finders
Intron
Intergenic
P(A | I) = 0.3
P(C | I) = 0.2
P(G | I) = 0.2
P(T | I) = 0.3
P
L
Empirical intron length distribution
Silent states
• Silent states are states that do not emit symbols
(e.g., the state 0 in our previous examples)
• Silent states can be introduced in HMMs to
reduce the number of transitions
Dj
B
E
B
Mj
Silent state
E
FSA for global alignment
e


Xi aligned
to a gap
d
Xi and Yj
aligned
d

Yj aligned
to a gap
e
Pair-HMM for global alignment
1- 
1-2
Xi and Yj
aligned
P(xi,yj)
16 emission probabilities
1-
Xi aligned
to a gap
q(xi): 4 emission probabilities



q(yj): 4 emission probabilities
Yj aligned
to a gap

Pair-wise HMM: emit two sequences simultaneously
Algorithm is similar to regular HMM, but need an additional dimension.
e.g. in Viterbi, we need Vk(i, j) instead of Vk(i)
Pair-HMM and FSA for alignment
• After proper transformation between the
probabilities and substitution scores, the two are
identical
 (a, b)  log [p(a, b) / (q(a) q(b))]
 d  log 
 e  log 
• Details in Durbin book chap 4
• Finding an optimal FSA alignment is equivalent
to finding the most probable path with Viterbi
• Has more use in comparative gene finding than
in seq alignment
HMM for multiple sequence alignment
Question
• How to represent a multiple alignment so
that comparing a single sequence with an
alignment can be done efficiently?
Solution 1
• Use regular expression to represent the
consensus sequences
– Implemented in the ProSite database
– for example
C - x(2) - P - F - x - [FYWIV] - x(7) - C x(8,10) - W - C - x(4) - [DNSR] - [FYW] x(3,5) - [FYW] - x - [FYWI] - C
Multi-alignments represented by
regular expression
•
•
•
•
•
Intuitive
Flexible
Efficient matching algorithms
Mismatches allowed possibly
Problems
– Limited power in expressing more divergent positions
– Specify boundaries of indels not so easy
– May have to change the regular expression when a
new member of a protein family is identified
Solution 2
• For a non-gapped
alignment, we can
create a positionspecific weight matrix
(PWM)
– PSWM
– PSSM
• Also called a profile
• May use
pseudocounts
A
4
8
3
4
7
6
1
1
5
6
R
3
1
0
10
2
1
1
2
1
13
N
3
1
2
8
40
1
4
0
7
1
D
6
0
8
5
0
1
1
3
5
2
C
0
9
12
0
0
3
0
2
2
4
E
1
5
3
0
2
0
2
4
3
3
Q
3
0
1
2
3
3
2
0
2
11
G
3
6
5
3
5
4
2
1
0
6
H
2
4
4
2
4
4
32
7
6
7
I
7
2
25
13
2
2
1
50
6
4
L
4
4
6
8
0
1
1
3
10
8
K
33
5
1
2
4
1
1
9
31
2
M
7
1
2
10
4
2
1
4
1
2
F
6
7
8
3
2
4
2
1
7
10
P
1
27
2
7
9
1
3
3
1
1
S
2
4
1
9
2
2
1
0
1
4
T
5
0
8
8
2
60
37
1
2
4
W
2
7
1
3
7
2
3
1
3
6
Y
4
0
5
1
4
1
1
5
3
1
V
4
8
2
1
1
0
4
3
2
6
Scoring by PWMs
A
4
8
3
4
7
6
1
1
5
6
R
3
1
0
10
2
1
1
2
1
13
N
3
1
2
8
40
1
4
0
7
1
D
6
0
8
5
0
1
1
3
5
2
C
0
9
12
0
0
3
0
2
2
4
E
1
5
3
0
2
0
2
4
3
3
Q
3
0
1
2
3
3
2
0
2
11
G
3
6
5
3
5
4
2
1
0
6
H
2
4
4
2
4
4
32
7
6
7
I
7
2
25
13
2
2
1
50
6
4
L
4
4
6
8
0
1
1
3
10
8
K
33
5
1
2
4
1
1
9
31
2
M
7
1
2
10
4
2
1
4
1
2
F
6
7
8
3
2
4
2
1
7
10
P
1
27
2
7
9
1
3
3
1
1
S
2
4
1
9
2
2
1
0
1
4
T
5
0
8
8
2
60
37
1
2
4
W
2
7
1
3
7
2
3
1
3
6
Y
4
0
5
1
4
1
1
5
3
1
V
4
8
2
1
1
0
4
3
2
6
x = KCIDNTHIKR
P(x | M) = i ei(xi)
Random model: each amino
acid xi can be generated with
probability q(xi)
P(x | random) = i q(xi)
Log-odds ratio:
S = log P(X|M) / P(X|random)
= i log ei(xi) / q(xi)
PWMs
• Advantage:
– Can be used to identify both strong and weak
homologies
– Easy to implement and use
– Probabilistic interpretation
• Problem:
– Not intuitive to deal with gaps
PWMs are HMMs
B
M1
Mk
Transition probability = 1
20 emission probabilities for each state
• This can only be used to search for sequences without
insertion / deletions (indels)
• We can add additional states for indels!
E
Dealing with insertions
Ij
B
M1
Mj
Mk
E
• This would allow an arbitrary number of
insertions after the j-th position
– i.e. the sequence being compared can be
longer than the PWM
Dealing with insertions
B
I1
Ij
Ik
M1
Mj
Mk
E
• This allows insertions at any position
Dealing with Deletions
B
Mi
Mj
E
• This would allow a subsequence [i, j] being
deleted
– i.e. the sequence being compared can be
shorter than the PWM
Dealing with Deletions
B
E
• This would allow an arbitrary length of
deletion
– Completely forward connected
– Too many transitions
Deletion with silent states
B
Dj
Silent state
Mj
E
• Still allows any length of deletions
• Fewer parameters
• Less detailed control
Full model
• Called profile-HMM
Dj
D: deletion state
I: insertion state
M: matching state
Ij
B
Mj
E
Algorithm: basically the same as a regular HMM
Using profile HMM
• Alignment
– Align a sequence to a profile HMM
– Viterbi
• Searching
– Given a sequence and HMMs for different protein families, which
family the sequence may belong to?
– Given a HMM for a protein family and many proteins, which
protein may belong to the family?
– Viterbi or forward
– How to score?
• Training / Learning
– Given a multiple alignment, how to construct a HMM?
– Given an unaligned protein family, how to construct a HMM?
Pfam
• A database of protein families
– Developed by Sean Eddy and colleagues while
working in Durbin’s lab
•
•
•
•
Hand-curated “seed” multiple alignment
Train HMM from seed alignment
Hand-chosen score thresholds
Automatic classification / classification of all
other protein sequences
• 7973 families in Rfam 18.0, 8/2005
(covers ~75% of proteins)
Modeling building from aligned sequences
• Match for columns with
no gaps
• When gaps exist, how to
decide whether they are
insertions or matches?
– Heuristic rule: >50%
gaps, insertion,
otherwise, match
• How to add
pseudocount?
– Simply add one
– According to
background distribution
– Use a mixture of priors
(Dirchlet mixtures)
• Sequence weighting
– Very similar sequences
should each get less
weight
Modeling building from unaligned sequences
• Choose a model length and initial parameters
– Commonly use average seq length as model length
• Baum-Welch or Viterbi training
– Usually necessary to use multiple starting points or
other heuristics to escape from local optima
• Align all sequences to the final model using
Viterbi
– Why not F-B?
Align a seq to profile HMMs
Viterbi
Searching
Protein
Database
+
• Scoring
– Log likelihood: Log P(X | M)
– Log odds: Log P(X | M) / P(X | random)
– Length-normalization
• Is the hit real?
– How does the score compare with that of
the sequences already in the family?
– How does the score compare with that of
random sequences?
Score for each protein
Example: modeling and searching
for globins
• 300 random picked globin sequence
• Build a profile HMM from scratch (without
pre-alignment)
• Align 60,000 proteins to the HMM
– Majority are non-globin
• Even after length normalization, LL is still lengthdependent
• Log-odd score provides better separation
– Takes amino acid composition into account
– Real globins could have scores less than 0
• Estimate mean score and standard deviation for nonglobin sequences for each length
• Z-score = (raw score – mean) / (standard deviation)
– Z-score is length-invariant
– Real globins have positive scores
Gene prediction
Gene structure
exon1
intron1
exon2
intron2
exon3 Intergenic
5’
3’
transcription
splicing
translation
Exon: coding
Intron: non-coding
Intergenic: non-coding
Finding genes
GATCGGTCGAGCGTAAGCTAGCTAG
ATCGATGATCGATCGGCCATATATC
ACTAGAGCTAGAATCGATAATCGAT
CGATATAGCTATAGCTATAGCCTAT
Human
Fugu
worm
E.coli
As the coding/non-coding length ratio decreases, exon
prediction becomes more complex
Gene prediction in prokaryote
• Finding long ORFs (open reading frame)
• An ORF may not contain stop codons
– Average ORF length = 64/3
– Expect 300bp ORF per 36kbp
– Actual ORF length ~ 1000bp
• Codon biases
– Some triplets are used more frequently than
others
– Codon third position biases
HMM for eukaryote gene finding
• Most basic idea is the same: the distributions of
nucleotides is different in exon and other regions
– Alone won’t work very well
• More signals are needed
exon1
intron1
exon2
intron2
exon3
Intergenic
5’
3’
Promoter
ATG
5’-UTR
Splicing acceptor: AG
Splicing donor: GT
STOP Poly-A
3’-UTR
• How to combine all the signal together?
Simplest model
Intergenic
4 emission
probability
exon
64 triplets
emission
probabilities
intron
4 emission
probability
Actually more accurate at the di-aminoacid level, i.e. 2 codons. Many methods
use 5th-order Markov model for all regions
• Exon length may not be exact multiple of 3
• Basically have to triple the number of states to remember the excess
number of bases in the previous state
More detailed model
Single exon
Init
exon
intron
Intergenic
Term
exon
Internal
Exon
Sub-models
CDS: coding sequence
Init exon
Term exon
5’-UTR
START
CDS
CDS
STOP
3’-UTR
• START, STOP are PWMs
• Including start and stop codons and surrounding bases
Sub-model for intron
Intron
Splice donor
Intron
body
Splice
acceptor
• Sequence logos: an informative display of PWMs
• Within each column, relative height represents probability
• Height of each column reflects “information content”
Explicit duration modeling
x1 x2 x3 ………………………………………..xN
1
2
Pk(i)
K
• For each position j and each state i
– Need to consider the transition from all previous
positions
• Time: O(N2K2)
• N can be 108
Speedup GHMM
• Restrict maximum duration length to be L
– O(LNK2)
• However, intergenic and intron can be quite long
– L can be 105
• Compromise: explicit duration for exons only,
geometric for all other states
• Pre-compute all possible starting points of ORFs
– For init exon: ATG
– For internal/terminal exon: splice donor signal (GT)
GeneScan model
Approaches to gene finding
• Homology
– BLAST, BLAT, etc.
• Ab initio
– Genscan, Glimmer, Fgenesh, GeneMark, etc.
– Each one has been tuned towards certain organisms
• Hybrids
– Twinscan, SLAM
– Use pair-HMM, or pre-compute score for potential
coding regions based on alignment
• None are perfect, never used alone in practice
Current status
• More accurate on internal exons
• Determining boundaries of init and term
exons is hard
• Biased towards multiple-exon genes
• Alternative splicing is hard
• Non-coding RNA is hard
• State of the Art:
– predictions ~ 60% similar to real proteins
– ~80% if database similarity used
– lab verification still needed, still expensive
HMM wrap up
• We’ve talked about
–
–
–
–
–
Probability, mainly Bayes Theorem
Markov models
Hidden Markov models
HMM parameter estimation given state path
Decoding given HMM and parameters
• Viterbi
• F-B
– Learning
• Baum-Welch (Expectation-Maximization)
• Viterbi
HMM wrap up
• We’ve also talked about
– Extension to gHMMs
– HMM for multiple sequence alignment
– gHMM for gene finding
• We did not talk about
– Higher-order Markov models
– How to escape from local optima in learning
Next lectures
• String pattern matching algorithms
– Suffix tree and its applications
• Motif finding
– Biology
– Algorithm
– Combinatorial
– Probabilistic