CSCE590/822 Data Mining Principles and Applications
CSCE555 Bioinformatics
Lecture 6 Hidden Markov Models
HAPPY CHINESE NEW YEAR
Meeting: MW 4:00PM-5:15PM SWGN2A21
Instructor: Dr. Jianjun Hu
Course page: http://www.scigen.org/csce555
University of South Carolina
Department of Computer Science and Engineering
2008 www.cse.sc.edu.
Roadmap
Probabilistic Models of Sequences
Introduction to HMM
Profile HMMs as MSA models
Measuring Similarity between Sequence and HMM
Profile model
Summary
7/20/2015
Multiple Sequence Alignment
Alignment containing multiple DNA / protein sequences
Look for conserved regions → similar function
Example:
#Rat      ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT
#Mouse    ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT
#Rabbit   ATGGTGCATCTGTCCAGT--GAGGAGAAGTCTGC
#Human    ATGGTGCACCTGACTCCT--GAGGAGAAGTCTGC
#Opossum  ATGGTGCACTTGACTTTT--GAGGAGAAGAACTG
#Chicken  ATGGTGCACTGGACTGCT--GAGGAGAAGCAGCT
#Frog     ---ATGGGTTTGACAGCACATGATCGT---
Probabilistic Model: Position-specific scoring matrices (PSSM)
Limitations of PSSM?
Difficulty in biological sequences
◦ Variation in a family of sequences
◦ Gaps of variable lengths
◦ Conserved segments with different degrees of conservation
◦ PSSM cannot handle variable-length gaps
Need a statistical sequence model
Regular Expressions Model
Regular expressions
◦ Protein spelling is much freer than English spelling
◦ [AT] [CG] [AC] [ACGT]* A [TG] [GC]
Limitation of the regular expression model?
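As a rough illustration of what the pattern above can and cannot express, here is a minimal Python sketch using the standard `re` module. The helper name `matches` and the three test sequences are chosen for illustration and are not part of the lecture.

```python
import re

# The slide's pattern with the spaces removed.
MOTIF = re.compile(r"[AT][CG][AC][ACGT]*A[TG][GC]")

def matches(seq: str) -> bool:
    """Return True if the whole sequence fits the motif."""
    return MOTIF.fullmatch(seq) is not None

# Illustrative test sequences (assumptions, not lecture data).
print(matches("ACACATC"))   # fits the pattern
print(matches("TGCTAGG"))   # also fits
print(matches("ACACCTC"))   # does not fit: the fixed A is replaced by C
```

The answer is only yes or no: both matching sequences are treated as equally good members of the family, which is one reason to move to a model that assigns probabilities.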
Hidden Markov Model (HMM)
HMM is:
◦ Statistical model
◦ Well suited for many tasks in molecular biology
Using HMM in molecular biology
◦ Probabilistic profile (profile HMM)
   From a family of proteins, for searching a database for other members of the family
   Resemble the profile and weight matrix methods
◦ Grammatical structure
   Gene finding
   Recognize signals
   Prediction (must follow the rules of a gene)
Detect Cheating in Coin Toss Game
Fair and biased coins could be used
Question: is it possible to determine whether a biased coin has been used, based on the output sequence of heads and tails?
HTTTHTHTHTTHHHHTHTHTHTHHHHTHT
EXAMPLE : Fair Coin Toss
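Consider the single coin scenario
We could model the process producing the sequence of H’s and T’s as a Markov model with two states, and equal transition probabilities:
[Figure: two-state Markov model with states H and T; each of the four transitions (H to H, H to T, T to H, T to T) has probability 0.5]
Only one fair coin is used here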
Example: Fair and Biased Coins
Consider the scenario where there are two coins: a fair coin and a biased coin
Visible states do not correspond to hidden states
- Visible state: output of H or T
- Hidden state: which coin was tossed
HTTTHTHTHTTHHHHTHTHTHTHHHHTHT
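One simple way to approach the question is to compare how probable the observed sequence is under each coin taken alone. The sketch below does this in Python; the bias P(H) = 0.75 is an assumption made for illustration, since the lecture does not state the biased coin's probabilities.

```python
import math

# Observed toss sequence from the slide.
TOSSES = "HTTTHTHTHTTHHHHTHTHTHTHHHHTHT"

def log_likelihood(seq: str, p_heads: float) -> float:
    """Log-probability of seq if every toss uses one coin with P(H) = p_heads."""
    heads = seq.count("H")
    tails = len(seq) - heads
    return heads * math.log(p_heads) + tails * math.log(1.0 - p_heads)

fair = log_likelihood(TOSSES, 0.5)
biased = log_likelihood(TOSSES, 0.75)   # assumed bias, not from the lecture
print(f"fair: {fair:.2f}  biased: {biased:.2f}  log ratio: {fair - biased:.2f}")
```

This comparison assumes a single coin is used for the whole sequence. If the coin can be swapped between tosses, which coin produced each toss is hidden, and that is the situation a hidden Markov model describes.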
Hidden Markov Models
Ingredients of an HMM
Collection of states: $\{S_1, S_2, \ldots, S_N\}$
State transition probabilities (transition matrix): $A_{ij} = P(q_{t+1} = S_j \mid q_t = S_i)$
Initial state distribution: $\pi_i = P(q_1 = S_i)$
Observations: $\{O_1, O_2, \ldots, O_M\}$
Observation probabilities: $B_j(k) = P(v_t = O_k \mid q_t = S_j)$
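Ingredients of Our HMM
States: $\{S_{sunny}, S_{rainy}, S_{snowy}\}$
State transition probabilities (transition matrix; each row sums to 1):
$A = \begin{pmatrix} .8 & .15 & .05 \\ .38 & .6 & .02 \\ .75 & .05 & .2 \end{pmatrix}$
Initial state distribution: $\pi = (.7, \; .25, \; .05)$
Observations: $\{O_1, O_2, \ldots, O_M\}$
Observation probabilities (emission matrix):
$B = \begin{pmatrix} .08 & .15 & .05 \\ .38 & .6 & .02 \\ .75 & .05 & .2 \end{pmatrix}$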
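Probability of a Sequence of Events
$P(O) = P(O_{gloves}, O_{gloves}, O_{umbrella}, \ldots, O_{umbrella})$
$= \sum_{\text{all } Q} P(O \mid Q)\, P(Q) = \sum_{q_1, \ldots, q_7} P(O \mid q_1, \ldots, q_7)\, P(q_1, \ldots, q_7)$
$= 0.7 \times 0.86 \times 0.32 \times 0.14 \times 0.6 + \ldots$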
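The sum over all state paths grows exponentially with the sequence length, so in practice it is computed with the forward recursion. The Python sketch below compares the two on a short sequence. It makes several assumptions: the transition matrix is read with 0.8 in its top-left entry so that every row sums to 1, the three-symbol observation alphabet and the emission matrix `B` are made-up values, and the function names are illustrative.

```python
from itertools import product

STATES = ["sunny", "rainy", "snowy"]
SYMBOLS = ["gloves", "umbrella", "coat"]   # hypothetical observation alphabet

PI = [0.7, 0.25, 0.05]                     # initial distribution from the slide
A = [[0.80, 0.15, 0.05],                   # A[i][j] = P(next = j | current = i)
     [0.38, 0.60, 0.02],
     [0.75, 0.05, 0.20]]
B = [[0.10, 0.20, 0.70],                   # B[i][k] = P(symbol k | state i); made-up values
     [0.10, 0.70, 0.20],
     [0.60, 0.10, 0.30]]

def brute_force(obs):
    """Sum P(O | Q) P(Q) over every state path Q."""
    total = 0.0
    for path in product(range(len(STATES)), repeat=len(obs)):
        p = PI[path[0]] * B[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
        total += p
    return total

def forward(obs):
    """Same quantity via the forward recursion, O(T * N^2) instead of O(N^T)."""
    n = len(STATES)
    alpha = [PI[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)

obs = [SYMBOLS.index(s) for s in ["gloves", "gloves", "umbrella", "umbrella"]]
print(brute_force(obs), forward(obs))
```

Both functions return the same value; the forward version reuses partial sums so that a seven-observation sequence needs a few dozen multiplications rather than an enumeration of 3^7 paths.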
Typical HMM Problems
Annotation: Given a model M and an observed string S, what is the most probable path through M generating S?
Classification: Given a model M and an observed string S, what is the total probability of S under M?
Consensus: Given a model M, what is the string having the highest probability under M?
Training: Given a set of strings and a model structure, find transition and emission probabilities assigning high probabilities to the strings
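The annotation problem is usually solved with the Viterbi algorithm. Here is a minimal Python sketch applied to the two-coin example, with F for the fair coin and B for the biased coin. All start, transition and emission probabilities are assumptions chosen for illustration; the lecture does not give them.

```python
import math

STATES = ["F", "B"]                        # fair coin, biased coin
START = {"F": 0.5, "B": 0.5}               # assumed
TRANS = {"F": {"F": 0.9, "B": 0.1},        # assumed: coins are rarely swapped
         "B": {"F": 0.1, "B": 0.9}}
EMIT = {"F": {"H": 0.5, "T": 0.5},
        "B": {"H": 0.75, "T": 0.25}}       # assumed bias

def viterbi(obs):
    """Return (log-probability, state path) of the most probable path."""
    score = {s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}
    back = []
    for symbol in obs[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            # Best predecessor of state s at this position.
            best = max(STATES, key=lambda r: prev[r] + math.log(TRANS[r][s]))
            score[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][symbol])
            pointers[s] = best
        back.append(pointers)
    # Trace back from the best final state.
    last = max(STATES, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return score[last], "".join(reversed(path))

logp, path = viterbi("HTTTHTHTHTTHHHHTHTHTHTHHHHTHT")
print(path, round(logp, 2))
```

Working in log space avoids numerical underflow on long sequences. Replacing `max` with a sum over predecessors (in probability space) gives the forward algorithm used for the classification problem.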
HMM Profiles as Sequence Models
Given the multiple alignment of sequences, we can use HMM to model the sequences
Each column of the alignment may be represented by a hidden state that produced that column
Insertions and deletions may be represented by other states
Profile HMMs
HMM with a structure that in a natural way allows position-dependent gap penalties
◦ Main (match) states model the columns of the alignment
◦ Insert states model highly variable regions
◦ Delete states jump over one or more columns, i.e. they model the situation when just a few of the sequences have a “-” in the multiple alignment at a position
HMM Sequences Continued
Profile HMM Example
Consider the six sequences shown below
A multiple sequence alignment of these sequences is the first step in the process of inducing the hidden Markov model
SEQ1  G C C CA
SEQ2  AGC
SEQ3  AA G C
SEQ4  A GAA
SEQ5  AAA C
SEQ6  AGC
Profile HMM Topology
The topology of the HMM is established using the consensus sequence
The structure of a profile HMM is shown below:
Square boxes represent match states
Diamonds represent insert states
Circles represent delete states
Profile HMM Example Continued
The aligned columns correspond to either emissions from the match state or to emissions from the insert state
The consensus columns are used to define the match states M1, M2, M3 for the HMM
After defining the match states, the corresponding insert and delete states are used to define the complete HMM topology
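To make the topology concrete, the following Python sketch enumerates the states and allowed transitions of a profile HMM with three match states. The naming is an assumption made for this sketch: M0 stands for the begin state, M4 for the end state, I0 to I3 for insert states and D1 to D3 for delete states.

```python
def profile_hmm_transitions(n_match: int):
    """List allowed (source, target) transitions for a profile HMM.

    M0 stands for the begin state and M{n_match + 1} for the end state.
    """
    end = n_match + 1
    edges = []
    for k in range(end):
        # There is no delete state at position 0.
        sources = [f"M{k}", f"I{k}"] + ([f"D{k}"] if k >= 1 else [])
        for src in sources:
            edges.append((src, f"M{k + 1}"))        # move on to the next column
            edges.append((src, f"I{k}"))            # insert extra residues here
            if k + 1 <= n_match:
                edges.append((src, f"D{k + 1}"))    # skip the next column
    return edges

edges = profile_hmm_transitions(3)
states = sorted({s for edge in edges for s in edge})
print(states)
print(len(edges), "transitions")
```

Each position k has up to nine transition types (M, I or D to M, I or D), fewer at the two ends where some states do not exist.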
Transition Probabilities
The values of the transition probabilities are computed using the frequency of the transitions as each sequence is considered
The model parameters are computed using the state transition sequences shown in the figure below:
Transition Probabilities Continued
The frequency of each of the transitions and the corresponding emission probabilities are shown below
State   0   1   2   3
MM      4   5   6   4
MD      1   0   0   -
MI      1   0   0   2
IM      1   0   0   2
ID      0   0   0   -
II      0   0   0   2
DM      -   1   0   0
DD      -   0   0   -
DI      -   0   0   0
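Turning the transition counts above into probabilities amounts to normalising the counts that leave each state. The Python sketch below assumes one reading of the table: the columns are positions 0 to 3, each row name gives the source and target state types, and the missing entries belong to states or transitions that do not exist (there is no D0 and no D4). The function name is illustrative.

```python
# Counts as read from the table; None marks transitions that do not exist.
COUNTS = {
    "MM": [4, 5, 6, 4], "MD": [1, 0, 0, None], "MI": [1, 0, 0, 2],
    "IM": [1, 0, 0, 2], "ID": [0, 0, 0, None], "II": [0, 0, 0, 2],
    "DM": [None, 1, 0, 0], "DD": [None, 0, 0, None], "DI": [None, 0, 0, 0],
}

def transition_probs(counts, source, k):
    """Normalise the counts leaving state `source` (M, I or D) at position k."""
    out = {name: row[k] for name, row in counts.items()
           if name[0] == source and row[k] is not None}
    total = sum(out.values())
    if total == 0:
        return None          # state never visited: no maximum-likelihood estimate
    return {name: c / total for name, c in out.items()}

print(transition_probs(COUNTS, "M", 0))   # leaving the begin state
print(transition_probs(COUNTS, "I", 3))   # leaving the last insert state
print(transition_probs(COUNTS, "D", 2))   # never visited, so None
```

States that no training sequence passes through have no estimate under plain frequency counting. A common remedy, not shown in the lecture text, is to add pseudocounts before normalising.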
Emission Probabilities
The emission probability is computed using the formula:
The emission probability specifies the probability of emitting each of the symbols of the alphabet ∑ in the state k
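The formula itself is not reproduced in this transcript. One common estimator, given here as an assumption rather than as the lecture's own formula, is the relative frequency of each symbol among the residues assigned to state k, optionally smoothed with a pseudocount. A minimal Python sketch with a made-up column:

```python
from collections import Counter

ALPHABET = "ACGT"

def emission_probs(column, pseudocount=0.0):
    """Estimate P(symbol | state k) from the residues aligned to that state."""
    counts = Counter(column)
    total = len(column) + pseudocount * len(ALPHABET)
    return {a: (counts[a] + pseudocount) / total for a in ALPHABET}

column = ["A", "A", "G", "A", "A"]            # hypothetical match-state column
print(emission_probs(column))                  # plain relative frequencies
print(emission_probs(column, pseudocount=1))   # add-one smoothing avoids zeros
```

Without smoothing, any symbol absent from the training column gets probability zero, so a new sequence carrying that symbol at this position would receive zero probability under the whole model.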
Emission Probabilities Continued
The emission probability for each state is
computed as shown below:
Searching the Profile HMM
Sequences can be searched against the HMM to detect whether or not they belong to the particular family of sequences described by the profile HMM
Using a global alignment, the probability of the most probable alignment of a sequence to the model can be determined using the Viterbi algorithm
The full probability of a sequence aligning to the profile HMM is determined using the forward algorithm
How Well Does a Sequence Fit a Model?
◦ Probability depends on the length of the sequence
◦ Not suitable to use as a score
Length-independent Score
Log-odds score
◦ The logarithm of the probability of the sequence divided by the probability according to a null model
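A small Python sketch of the log-odds idea follows. It assumes a null model in which every nucleotide is equally likely (probability 0.25 each), and a toy model probability in which each position emits its observed nucleotide with probability 0.7. Both are illustrative choices, not values from the lecture.

```python
import math

def log_odds(log_p_model: float, length: int, alphabet_size: int = 4) -> float:
    """log P(S | model) - log P(S | null) for a uniform null model."""
    log_p_null = length * math.log(1.0 / alphabet_size)
    return log_p_model - log_p_null

# Toy model (assumed): every position emits its observed nucleotide with prob 0.7.
for length in (5, 10, 20):
    log_p = length * math.log(0.7)
    print(length, round(log_p, 2), round(log_odds(log_p, length), 2))
```

The raw log-probability becomes more negative as the sequence grows, even for a sequence that fits well. The log-odds score is positive whenever the model explains the sequence better than the null model and negative otherwise, which makes scores for sequences of different lengths easier to compare.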
Length-independent Score
HMM using log-odds
Summary
HMM
How to build a profile HMM model
Scoring fit between sequence and HMM model
Next Lecture
Gene-finding
Reading:
◦ Textbook (CG) chapter 4
◦ Textbook (EB) chapter 8