CSCE590/822 Data Mining Principles and Applications

CSCE555 Bioinformatics
Lecture 6 Hidden Markov Models
HAPPY CHINESE NEW YEAR
Meeting: MW 4:00PM-5:15PM SWGN2A21
Instructor: Dr. Jianjun Hu
Course page: http://www.scigen.org/csce555
University of South Carolina
Department of Computer Science and Engineering
2008 www.cse.sc.edu.
Roadmap

Probabilistic Models of Sequences

Introduction to HMM

Profile HMMs as MSA models

Measuring Similarity between Sequence and HMM Profile model

Summary
7/20/2015
Multiple Sequence Alignment
• Alignment containing multiple DNA / protein sequences
• Look for conserved regions → similar function
• Example:

#Rat      ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT
#Mouse    ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT
#Rabbit   ATGGTGCATCTGTCCAGT--GAGGAGAAGTCTGC
#Human    ATGGTGCACCTGACTCCT--GAGGAGAAGTCTGC
#Opossum  ATGGTGCACTTGACTTTT--GAGGAGAAGAACTG
#Chicken  ATGGTGCACTGGACTGCT--GAGGAGAAGCAGCT
#Frog     ---ATGGGTTTGACAGCACATGATCGT---
Probabilistic Model: Position-specific scoring matrices (PSSM)
Limitations of PSSM?
Difficulty in biological sequences

Variation in a family of sequences
◦ Gaps of variable lengths
◦ Conserved segments with different degrees of conservation
◦ PSSM cannot handle variable-length gaps
◦ Need a statistical sequence model
Regular Expressions Model

Regular expressions
◦ Protein spelling is much freer than English spelling
◦ [AT] [CG] [AC] [ACGT]* A [TG] [GC]
Limitation of regular expression model?
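The regular expression on this slide can be tried directly. The sketch below uses Python's standard `re` module; the test sequences are hypothetical and chosen only for illustration. It also hints at the limitation raised on the slide: the answer is a plain yes or no, with no indication of how well a sequence fits.

```python
import re

# Pattern from the slide, with the spaces between character classes removed.
PATTERN = re.compile(r"[AT][CG][AC][ACGT]*A[TG][GC]")

def matches_family(seq: str) -> bool:
    """Return True if the whole sequence is accepted by the regular expression."""
    return PATTERN.fullmatch(seq) is not None

# Hypothetical test sequences (not from the lecture).
for seq in ["ACAATG", "TGCTTTTATC", "ACAGTG"]:
    print(seq, matches_family(seq))
```

A sequence that differs from the pattern at a single position is rejected just as firmly as a completely unrelated one, which is one motivation for a probabilistic model.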
Roadmap

Probabilistic Models of Sequences

Introduction to HMM

Profile HMMs as MSA models

Measuring Similarity between Sequence and HMM Profile model

Summary
Hidden Markov Model (HMM)

HMM is:
◦ Statistical model
◦ Well suited for many tasks in molecular biology

Using HMM in molecular biology
◦ Probabilistic profile (profile HMM)
  • From a family of proteins, for searching a database for other members of the family
  • Resembles the profile and weight matrix methods
◦ Grammatical structure
  • Gene finding
  • Recognize signals
  • Prediction (must follow the rules of a gene)
Detect Cheating in Coin Toss Game
• Fair and biased coins could be used
• Question: is it possible to determine whether a biased coin has been used, based on the observed sequence of heads and tails?
• HTTTHTHTHTTHHHHTHTHTHTHHHHTHT

EXAMPLE : Fair Coin Toss
• Consider the single coin scenario
• We could model the process producing the sequence of H’s and T’s as a Markov model with two states, and equal transition probabilities:

[Figure: two-state Markov model with states H and T; all four transitions (H→H, H→T, T→H, T→T) have probability 0.5. Only one fair coin is used here.]
Example: Fair and Biased Coins


Consider the scenario where there are two coins: a fair coin and a biased coin
Visible states do not correspond to hidden states
- Visible state : Output of H or T
- Hidden state : Which coin was tossed
HTTTHTHTHTTHHHHTHTHTHTHHHHTHT
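The distinction between hidden and visible states can be made concrete by generating tosses from a two-coin model. This is a minimal sketch: the bias of the second coin (`0.8` heads) and the probability of keeping the same coin (`STAY = 0.9`) are assumptions made for illustration, not values from the lecture.

```python
import random

# Assumed parameters (not from the lecture).
EMIT = {"fair": {"H": 0.5, "T": 0.5}, "biased": {"H": 0.8, "T": 0.2}}
STAY = 0.9  # probability of tossing the same coin again

def toss_sequence(n, seed=0):
    """Return (hidden coin sequence, visible H/T string) of length n."""
    rng = random.Random(seed)
    coin = rng.choice(["fair", "biased"])
    hidden, visible = [], []
    for _ in range(n):
        hidden.append(coin)
        # Visible state: the outcome of tossing the current coin.
        visible.append("H" if rng.random() < EMIT[coin]["H"] else "T")
        # Hidden state: possibly switch coins before the next toss.
        if rng.random() >= STAY:
            coin = "biased" if coin == "fair" else "fair"
    return hidden, "".join(visible)

hidden, visible = toss_sequence(29)
print(visible)                            # what an observer sees
print("".join(c[0] for c in hidden))      # f/b per toss, normally unknown
```

An observer only receives the first printed line; the HMM questions later in the lecture are about reasoning back to the second.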
Hidden Markov Models
Ingredients of a HMM

Collection of states:
{S1, S2,…,SN}

State transition probabilities (transition matrix)
Aij = P(qt+1 = Sj | qt = Si)

Initial state distribution
πi = P(q1 = Si)

Observations:{O1, O2,…,OM}

Observation probabilities:
Bj(k) = P(vt = Ok | qt = Sj)
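The five ingredients map naturally onto a small data structure. The sketch below is one possible layout, not a standard API: the class name `HMM`, its field names, and the coin parameters are all hypothetical. Here `A[i][j]` is taken to mean the probability of moving from state i to state j, so every row of `A` and `B` must sum to 1.

```python
from dataclasses import dataclass
import math

@dataclass
class HMM:
    states: list   # S1..SN
    symbols: list  # O1..OM
    pi: list       # pi[i]   = P(q1 = Si)
    A: list        # A[i][j] = P(next state = Sj | current state = Si)
    B: list        # B[j][k] = P(symbol Ok | state Sj)

    def validate(self):
        """Check that pi and every row of A and B are probability distributions."""
        n, m = len(self.states), len(self.symbols)
        if len(self.A) != n or len(self.B) != n:
            raise ValueError("A and B need one row per state")
        rows = [("pi", self.pi, n)]
        rows += [(f"A[{i}]", r, n) for i, r in enumerate(self.A)]
        rows += [(f"B[{j}]", r, m) for j, r in enumerate(self.B)]
        for name, row, size in rows:
            if len(row) != size or not math.isclose(sum(row), 1.0):
                raise ValueError(f"{name} is not a distribution of length {size}")
        return True

# Hypothetical two-coin model (values are assumptions).
coin_hmm = HMM(states=["fair", "biased"], symbols=["H", "T"],
               pi=[0.5, 0.5],
               A=[[0.9, 0.1], [0.1, 0.9]],
               B=[[0.5, 0.5], [0.8, 0.2]])
print(coin_hmm.validate())
```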
Ingredients of Our HMM

States:
{Ssunny, Srainy, Ssnowy}

State transition probabilities (transition matrix)
A =
.08 .15 .05
.38 .6  .02
.75 .05 .2

Initial state distribution
πi = (.7 .25 .05)

Observations: {O1, O2,…,OM}

Observation probabilities (emission matrix)
B =
.08 .15 .05
.38 .6  .02
.75 .05 .2
Probability of a Sequence of Events
P(O) = P(Ogloves, Ogloves, Oumbrella,…, Oumbrella)
     = Σ_{all Q} P(O | Q) P(Q)
     = Σ_{q1,…,q7} P(O | q1,…,q7) P(q1,…,q7)
     = 0.7 × 0.8^6 × 0.3^2 × 0.1^4 × 0.6 + …
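The sum over all state paths Q can be evaluated literally for short sequences, and the forward algorithm gives the same number without enumerating paths. The sketch below compares the two on a toy weather model; the two-state model and all of its probabilities are assumptions chosen for illustration, not the lecture's matrices.

```python
from itertools import product

# Assumed toy parameters (illustrative, not the lecture's values).
STATES = ["sunny", "rainy"]
SYMBOLS = ["gloves", "umbrella"]
PI = {"sunny": 0.7, "rainy": 0.3}
A = {"sunny": {"sunny": 0.8, "rainy": 0.2},
     "rainy": {"sunny": 0.4, "rainy": 0.6}}
B = {"sunny": {"gloves": 0.7, "umbrella": 0.3},
     "rainy": {"gloves": 0.2, "umbrella": 0.8}}

def brute_force(obs):
    """P(O) as the explicit sum over every state path Q of P(O | Q) P(Q)."""
    total = 0.0
    for path in product(STATES, repeat=len(obs)):
        p = PI[path[0]] * B[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
        total += p
    return total

def forward(obs):
    """P(O) by the forward recursion, linear in the sequence length."""
    alpha = {s: PI[s] * B[s][obs[0]] for s in STATES}
    for o in obs[1:]:
        alpha = {s: B[s][o] * sum(alpha[r] * A[r][s] for r in STATES)
                 for s in STATES}
    return sum(alpha.values())

obs = ["gloves", "gloves", "umbrella", "umbrella"]
print(brute_force(obs), forward(obs))
```

The brute-force version visits N^T paths for N states and T observations, while the forward recursion needs on the order of N^2 × T multiplications, which is why it is the one used in practice.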
Typical HMM Problems
Annotation: Given a model M and an observed string S, what is the most probable path through M generating S?
Classification: Given a model M and an observed string S, what is the total probability of S under M?
Consensus: Given a model M, what is the string having the highest probability under M?
Training: Given a set of strings and a model structure, find transition and emission probabilities assigning high probabilities to the strings.
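The annotation problem is usually solved with the Viterbi algorithm. The following is a minimal sketch in log space for a two-coin model; the state names `F` and `B` and every probability are assumptions for illustration only.

```python
import math

# Assumed two-coin model: F = fair coin, B = biased coin.
states = ["F", "B"]
start = {"F": 0.5, "B": 0.5}
trans = {"F": {"F": 0.9, "B": 0.1}, "B": {"F": 0.1, "B": 0.9}}
emit = {"F": {"H": 0.5, "T": 0.5}, "B": {"H": 0.9, "T": 0.1}}

def viterbi(obs):
    """Return (most probable state path, its log-probability)."""
    # score[s] = log-probability of the best path that ends in state s
    score = {s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for o in obs[1:]:
        prev, score, pointers = score, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + math.log(trans[r][s]))
            pointers[s] = best
            score[s] = prev[best] + math.log(trans[best][s]) + math.log(emit[s][o])
        back.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return "".join(reversed(path)), score[last]

print(viterbi("HTTHHHHHHHHT"))
```

Replacing `max` with a sum over predecessors (and working with probabilities instead of the best score) turns the same recursion into the forward algorithm used for the classification problem.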
Roadmap

Probabilistic Models of Sequences

Introduction to HMM

Profile HMMs as MSA models

Measuring Similarity between Sequence and HMM Profile model

Summary
HMM Profiles as Sequence Models
• Given the multiple alignment of sequences, we can use an HMM to model the sequences
• Each column of the alignment may be represented by a hidden state that produced that column
• Insertions and deletions may be represented by other states

Profile HMMs

HMM with a structure that in a natural way allows position-dependent gap penalties
◦ Main states
  • model the columns of the alignment
◦ Insert states
  • model highly variable regions
◦ Delete states
  • to jump over one or more columns
  • i.e. to model the situation when just a few of the sequences have a “-” in the multiple alignment at a position
HMM Sequences Continued
Profile HMM Example


Consider the six sequences shown below
A multiple sequence alignment of these sequences is the first step in the process of inducing the hidden Markov model

SEQ1  G C C CA
SEQ2  AGC
SEQ3  AA G C
SEQ4  A GAA
SEQ5  AAA C
SEQ6  AGC
Profile HMM Topology





The topology of the HMM is established using the consensus sequence
The structure of a profile HMM is shown below:
Square boxes represent match states
Diamonds represent insert states
Circles represent delete states
Profile HMM Example Continued



The aligned columns correspond to either emissions from the match state or to emissions from the insert state
The consensus columns are used to define the match states M1, M2, M3 for the HMM
After defining the match states, the corresponding insert and delete states are used to define the complete HMM topology
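One way to make the choice of consensus columns concrete is the sketch below. It assumes a common heuristic, that a column becomes a match column when fewer than half of its entries are gaps; the lecture does not state which rule it uses, and the small alignment is hypothetical. Once the match columns are fixed, each aligned sequence translates into a path through match (M), insert (I) and delete (D) states.

```python
def match_columns(alignment, max_gap_fraction=0.5):
    """Indices of columns treated as match (consensus) columns.

    Assumed rule: a column is a match column if its gap fraction is
    below max_gap_fraction.
    """
    n = len(alignment)
    cols = []
    for j in range(len(alignment[0])):
        gaps = sum(1 for row in alignment if row[j] == "-")
        if gaps / n < max_gap_fraction:
            cols.append(j)
    return cols

def state_path(row, match_cols):
    """Translate one aligned sequence into its M/I/D state path."""
    path, k = [], 0  # k = number of match columns passed so far
    for j, ch in enumerate(row):
        if j in match_cols:
            k += 1
            path.append(f"M{k}" if ch != "-" else f"D{k}")
        elif ch != "-":
            path.append(f"I{k}")  # residue in a non-consensus column
    return path

# Hypothetical alignment (not the one from the slides).
ALN = ["AC--G",
       "AC--G",
       "A-TTG",
       "-C--G"]
cols = match_columns(ALN)
for row in ALN:
    print(row, state_path(row, cols))
```

Counting how often each transition occurs in these paths yields the transition frequencies used to estimate the model parameters.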
Transition Probabilities


The values of the transition probabilities are computed using the frequency of the transitions as each sequence is considered
The model parameters are computed using the state transition sequences shown in the figure below:
Transition Probabilities Continued

The frequency of each of the transitions and the corresponding emission probabilities are shown below

State   0   1   2   3
MM      4   5   6   4
MD      1   0   0
MI      1   0   0   2
IM      1   0   0   2
ID      0   0   0
II      0   0   0   2
DM      -   1   0   0
DD      -   0   0
DI      -   0   0   0
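Turning these frequencies into probabilities amounts to normalizing, for each position and each source state type, over the transitions that leave it. The sketch below uses the counts listed above, reading the "-" entries and the missing final entries as transitions that do not exist in the model (an assumption about the original table layout). The function name `transition_probs` is hypothetical.

```python
# Transition counts per position 0..3; None = transition assumed not to exist.
counts = {
    "M": {"M": [4, 5, 6, 4], "D": [1, 0, 0, None], "I": [1, 0, 0, 2]},
    "I": {"M": [1, 0, 0, 2], "D": [0, 0, 0, None], "I": [0, 0, 0, 2]},
    "D": {"M": [None, 1, 0, 0], "D": [None, 0, 0, None], "I": [None, 0, 0, 0]},
}

def transition_probs(src, k):
    """Relative frequencies of transitions leaving state type src at position k.

    Returns None when the state was never visited (all counts are zero)
    or does not exist at that position.
    """
    row = {dst: counts[src][dst][k] for dst in "MDI"
           if counts[src][dst][k] is not None}
    total = sum(row.values())
    if total == 0:
        return None
    return {dst: c / total for dst, c in row.items()}

print(transition_probs("M", 0))  # transitions out of the begin state
print(transition_probs("I", 3))  # transitions out of the last insert state
```

Plain relative frequencies assign probability zero to every transition that was not seen in the training alignment; adding pseudocounts before normalizing is a common remedy.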
Emission Probabilities

The emission probability is computed using the formula:

The emission probability specifies the probability of emitting each of the symbols of the alphabet Σ in state k
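A common way to estimate emission probabilities is the relative frequency of each symbol among the residues assigned to a state, optionally with a pseudocount. The sketch below assumes that estimator, a DNA alphabet, and a hypothetical alignment column; it is not necessarily the exact formula from the slide.

```python
from collections import Counter

ALPHABET = "ACGT"  # assumed alphabet

def emission_probs(column_symbols, pseudocount=0.0):
    """Estimate e_k(a) for one state from the symbols it emitted.

    Gap characters are ignored; pseudocount is added to every symbol count.
    """
    counts = Counter(s for s in column_symbols if s != "-")
    total = sum(counts[a] for a in ALPHABET) + pseudocount * len(ALPHABET)
    if total == 0:
        raise ValueError("no symbols observed and no pseudocount given")
    return {a: (counts[a] + pseudocount) / total for a in ALPHABET}

# Hypothetical match column: four A's, one G, one gap.
print(emission_probs("AAAAG-"))
print(emission_probs("AAAAG-", pseudocount=1.0))
```

With `pseudocount=1.0` the unseen symbols C and T receive a small non-zero probability, so a sequence containing them is penalized rather than ruled out.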
Emission Probabilities Continued

The emission probability for each state is computed as shown below:
Searching the Profile HMM

Sequences can be searched against the HMM to detect whether or not they belong to a particular family of sequences described by the profile HMM

Using a global alignment, the probability of the most probable alignment and sequence can be determined using the Viterbi algorithm

The full probability of a sequence aligning to the profile HMM is determined using the forward algorithm
How Well Does a Sequence Fit a Model?
◦ Probability depends on the length of the sequence
◦ Not suitable to use as a score
Length-independent Score

Log-odds score
◦ The logarithm of the probability of the
sequence divided by the probability according
to a null model
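The log-odds idea can be sketched with a deliberately simple stand-in for the HMM. Both the "model" and the null model below are position-independent symbol frequencies, and all numbers are assumptions; in a real profile HMM the model probability would come from the Viterbi or forward algorithm instead.

```python
import math

# Assumed symbol frequencies (illustrative only).
MODEL = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}       # stand-in for P(x | M)
NULL = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}    # background / null model

def log_prob(seq, freqs):
    """log2 probability of seq under independent symbol frequencies."""
    return sum(math.log2(freqs[ch]) for ch in seq)

def log_odds(seq, model=MODEL, null=NULL):
    """log2 of P(seq | model) / P(seq | null)."""
    return log_prob(seq, model) - log_prob(seq, null)

for seq in ["AAAA", "AAAAAAAA", "CGTCGT"]:
    print(seq, round(log_prob(seq, MODEL), 2), round(log_odds(seq), 2))
```

The raw log-probability becomes more negative for every added symbol, whatever the symbol is. The log-odds score is positive when the model explains the sequence better than the background and negative otherwise, so its sign can be interpreted without reference to the sequence length.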
Length-independent Score

HMM using log-odds
Summary
• HMM
• How to build a profile HMM model
• Scoring the fit between a sequence and an HMM model

Next Lecture
• Gene-finding
• Reading:
◦ Textbook (CG) chapter 4
◦ Textbook (EB) chapter 8
