Using the Pfam database's profile HMMs in the MATLAB Bioinformatics Toolbox
Presentation by: Athina Ropodi
University of Athens, Information Technology in Medicine and Biology
• Introduction
  - HMMs
  - Profile HMMs
• Pfam Database
  - General info
  - Useful links
  - Available data
• Bioinformatics Toolbox
  - Function presentation
• Other available software
• Bibliography
• To model sequential data without losing the correlation between observations that lie close to each other, we need a probabilistic model of the joint distribution over the sequence of observations.
• A simple way to do this is to assume a Markov chain model. The probability of going from one state to another is called the transition probability.
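As a rough illustration of how transition probabilities combine into a joint probability, the sketch below scores one state sequence under a toy two-state Markov chain. The matrix `A`, the initial distribution `p0` and the example `path` are made-up values chosen for illustration; they do not come from the slides.

```matlab
% Toy two-state Markov chain (all numbers are illustrative assumptions).
A = [0.9 0.1;        % A(i,j) = P(next state = j | current state = i)
     0.2 0.8];
p0 = [0.5 0.5];      % initial state distribution
path = [1 1 2 2 2];  % an example state sequence

% Joint probability of the path: the initial term times one
% transition probability per step.
p = p0(path(1));
for t = 2:numel(path)
    p = p * A(path(t-1), path(t));
end
fprintf('P(path) = %.4f\n', p);
```

Each row of `A` sums to 1, since from any state the chain must move to some state (possibly the same one).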
• In a Hidden Markov Model (HMM), given a sequence of symbols X (e.g. nucleotides in a DNA sequence, or amino acids in a protein sequence), the emission probability is the probability of observing symbol b when in state k.
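To make the roles of the transition and emission probabilities concrete, here is a minimal sketch of the forward algorithm on a toy two-state HMM over DNA. It computes the total probability of an observed sequence by summing over all hidden state paths. The matrices `A` and `E`, the initial distribution `p0` and the sequence `x` are hypothetical values, not taken from the presentation.

```matlab
% Toy 2-state HMM over DNA (all numbers are illustrative assumptions).
alphabet = 'ACGT';
A  = [0.9 0.1; 0.2 0.8];       % A(k,l): transition probability from state k to l
E  = [0.25 0.25 0.25 0.25;     % E(k,b): probability of emitting symbol b in state k
      0.10 0.40 0.40 0.10];
p0 = [0.5 0.5];                % initial state distribution
x  = 'ACGCG';                  % observed sequence

[~, obs] = ismember(x, alphabet);   % map letters to column indices of E

% Forward algorithm: f(k) = P(x_1..x_t, state at time t = k)
f = p0 .* E(:, obs(1))';
for t = 2:numel(obs)
    f = (f * A) .* E(:, obs(t))';
end
likelihood = sum(f);           % P(x), summed over all state paths
fprintf('P(x) = %.6g\n', likelihood);
```

The recursion costs a number of operations proportional to the sequence length, whereas enumerating every state path explicitly grows exponentially with it.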

• The M match states (mi) each produce one of the 20 amino-acid letters, according to P(x|mi).
• For each match state there is a delete state (di), in which no amino acid is produced.
• There is a total of M+1 insert states (ii), located before, between and after the match states, which produce amino acids according to P(x|ii).

• Pfam is a collection of multiple sequence alignments and profile hidden Markov models (HMMs). Each Pfam HMM represents a protein family or domain.
• For each Pfam entry there is a family page, which can be accessed in several ways.
• Pfam contains two types of families, Pfam-A and Pfam-B. Pfam-A families are manually curated, HMM-based families built from an alignment of a small number of representative sequences.
• For each family, two HMMs are built: one to represent fragment matches and one to represent full-length matches. The HMMER2 software is used to build and search the profile HMMs.
• Available links:
  http://pfam.sanger.ac.uk/
  http://hmmer.janelia.org/
Each family has the following data:
• A seed alignment, which is a hand-edited multiple alignment representing the family.
• Hidden Markov Models (HMMs) derived from the seed alignment, which can be used to find new members of the domain and to realign a set of sequences to the model. One HMM is an ls mode (global) model, the other an fs mode (local) model.
• A full alignment, which is an automatic alignment of all the examples of the domain, using the two HMMs to find and then align the sequences.
• Annotation, which contains a brief description of the domain, links to other databases and some Pfam-specific data recording how the family was constructed.
• Version 3.1 for MATLAB (R2008a)
• Uses the profile HMMs found in Pfam.
• The search is usually done by accession number or by name of the family.
• Multiple sequence profiles: MATLAB implementations of multiple alignment and profile hidden Markov model algorithms (gethmmprof, gethmmalignment, gethmmtree, pfamhmmread, hmmprofalign, hmmprofestimate, hmmprofgenerate, hmmprofmerge, hmmprofstruct, showhmmprof).
>> HMMStruct = gethmmprof(2)
    Name: '7tm_2'
    PfamAccessionNumber: 'PF00002.14'
    ModelDescription: [1x42 char]
    ModelLength: 296
    Alphabet: 'AA'
    MatchEmission: [296x20 double]
    InsertEmission: [296x20 double]
    NullEmission: [1x20 double]
    BeginX: [297x1 double]
    MatchX: [295x4 double]
    InsertX: [295x2 double]
    DeleteX: [295x2 double]
    FlankingInsertX: [2x2 double]
    LoopX: [2x2 double]
    NullX: [2x1 double]

Field notes:
• ModelLength: number of match states.
• MatchEmission: emission probabilities in the MATCH states.
• NullEmission: symbol emission probabilities in the MATCH and INSERT states for the NULL model.
>> site = 'http://pfam.sanger.ac.uk/';
hmm = pfamhmmread([site 'family/gethmm?mode=ls&id=7tm_2']);
or
>> pfamhmmread('pf00002.ls');
>>model = pfamhmmread('pf00002.ls');
showhmmprof(model, 'Scale', 'logodds');
hydrophobic = 'IVLFCMAGTSWYPHNDQEKR';
showhmmprof(model, 'Order', hydrophobic);
Choices for the 'Scale' value are:
• 'logprob': log probabilities
• 'prob': probabilities
• 'logodds': log-odds ratios
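The relationship between the three scales can be sketched by hand for a single emission row. The example below uses a hypothetical 4-letter alphabet with made-up probabilities, and base-2 logarithms are an assumption made for illustration; the slides do not state which base showhmmprof uses.

```matlab
% One emission row over a toy 4-letter alphabet (illustrative assumptions).
p       = [0.70 0.10 0.10 0.10];   % 'prob': emission probabilities in one match state
nullP   = [0.25 0.25 0.25 0.25];   % background (null model) probabilities

logprob = log2(p);                 % 'logprob': log probabilities
logodds = log2(p ./ nullP);        % 'logodds': log of emission vs. background

disp([p; logprob; logodds]);
```

A positive log-odds value marks a symbol that is more likely in this state than under the null model, and a negative value marks one that is less likely.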
Choices for TypeValue are:
• 'seed': returns a tree with only the alignments used to generate the HMM.
• 'full' (default): returns a tree with all of the alignments that match the model.
>> tree = gethmmtree(2, 'type', 'seed');
and
>> tr = phytreeread('pf00002.tree');
• gethmmalignment: retrieve the multiple sequence alignment associated with an HMM profile from the Pfam database.
• hmmprofalign: align a query sequence to a profile using hidden Markov model alignment.
>> load('hmm_model_examples','model_7tm_2');   % load example model
load('hmm_model_examples','sequences');        % load example sequences
SCCR_RABIT = sequences(2).Sequence;
[a,s] = hmmprofalign(model_7tm_2,SCCR_RABIT,'showscore',true)
a =
  514.7448
s =
LLKLKVMYTVGYSSSLVMLLVALGILCAFRRLHCTRNYIHMHLFLSFILRALSNFI
KDAVLFSSDdaihcdahrvgCKLVMVFFQYCIMANYAWLLV
EGLYLHSLLVVS--FFSERKCLQGFVVLGWGSPAMFVTSWAVTR-----------HFLEDSGC-WDINANAAIWWVIRGPVILSILINFILFINILRILTRKLR---TQETRGQDMNHYKRLARSTLLLIPLFGVHYIVFVFSPEG
-----AMEIQLFFELALGSFQGLVVAVLYCFLNGEV
• hmmprofestimate: estimate profile hidden Markov model (HMM) parameters using pseudocounts.
• hmmprofgenerate: generate a random sequence drawn from a profile hidden Markov model (HMM).
• hmmprofmerge: concatenate prealigned strings of several sequences to a profile hidden Markov model (HMM).
>> load('hmm_model_examples','model_7tm_2')   % load model
load('hmm_model_examples','sequences')         % load sequences
for ind = 1:length(sequences)
    [scores(ind),sequences(ind).Aligned] = ...
        hmmprofalign(model_7tm_2,sequences(ind).Sequence);
end
hmmprofmerge(sequences, scores)
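The pseudocount idea mentioned for hmmprofestimate can be illustrated without the toolbox. The sketch below estimates emission probabilities for one alignment column using simple add-one (Laplace) pseudocounts. This scheme is swapped in for simplicity and is not claimed to be the weighting hmmprofestimate itself applies; the alphabet and column contents are made up.

```matlab
% One column of a toy alignment (alphabet and residues are illustrative assumptions).
alphabet = 'ACDE';
column   = 'AAAC';                 % residues observed at one match position

% Count how often each letter of the alphabet occurs in the column.
counts = zeros(1, numel(alphabet));
for k = 1:numel(alphabet)
    counts(k) = sum(column == alphabet(k));
end

pseudo   = 1;                      % add-one (Laplace) pseudocount per letter
emission = (counts + pseudo) / sum(counts + pseudo);
disp(emission);
```

Without the pseudocount, the letters D and E would get probability zero, so any sequence containing them at this position would be scored as impossible.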
• HMMER: http://hmmer.wustl.edu/
• SAM: http://www.cse.ucsc.edu/research/compbio/sam.html
• PFTOOLS: http://www.isrec.isb-sib.ch/ftp-server/pftools/
• GENEWISE: http://www.ebi.ac.uk/Wise2/
• PROBE: ftp://ftp.ncbi.nih.gov/pub/neuwald/probe1.0/
• META-MEME: http://metameme.sdsc.edu/
• PSI-BLAST: http://www.ncbi.nlm.nih.gov/BLAST/newblast.html
[1] Durbin et al., "Biological Sequence Analysis", Cambridge University Press, 1998
[2] Anders Krogh et al., "Hidden Markov Models in Computational Biology: Applications to protein modeling", 1994
[3] Sean R. Eddy, "Profile Hidden Markov Models", 1998
[4] Sean R. Eddy, "Hidden Markov Models", 1996
[5] http://hmmer.janelia.org/#thanks
[6] E.L.L. Sonnhammer, S.R. Eddy and R. Durbin, "Pfam: a comprehensive database of protein families based on seed alignments", 1997
[7] R.D. Finn et al., "Pfam: clans, web tools and services", 2006
[8] http://www.mathworks.com/