Sequence motifs, information content, logos, and HMM’s Morten Nielsen, CBS, BioCentrum, DTU Objectives • Visualization of binding motifs – Construction of sequence logos • Understand the concepts.

Download Report

Transcript Sequence motifs, information content, logos, and HMM’s Morten Nielsen, CBS, BioCentrum, DTU Objectives • Visualization of binding motifs – Construction of sequence logos • Understand the concepts.

Sequence motifs, information content,
logos, and HMM’s
Morten Nielsen,
CBS, BioCentrum,
DTU
Objectives
• Visualization of binding motifs
– Construction of sequence logos
• Understand the concepts of weight matrix
construction
– One of the most important methods of bioinformatics
• Introduce Hidden Markov models and
understand that they are just weight matrices
with gaps
Binding Motif. MHC class I with peptide
Anchor positions
Sequence information
SLLPAIVEL
LLDVPTAAV
HLIDYLVTS
ILFGHENRV
LERPGGNEI
PLDGEYFTL
ILGFVFTLT
KLVALGINA
KTWGQYWQV
SLLAPGAKQ
ILTVILGVL
TGAPVTYST
GAGIGVAVL
KARDPHSGH
AVFDRKSDA
GLCTLVAML
VLHDDLLEA
ISNDVCAQV
YTAFTIPSI
NMFTPYIGV
VVLGVVFGI
GLYDGMEHL
EAAGIGILT
YLSTAFARV
FLDEFMEGV
AAGIGILTV
AAGIGILTV
YLLPAIVHI
VLFRGGPRG
ILAPPVVKL
ILMEHIHKL
ALSNLEVKL
GVLVGVALI
LLFGYPVYV
DLMGYIPLV
TITDQVPFS
KIFGSLAFL
KVLEYVIKV
VIYQYMDDL
IAGIGILAI
KACDPHSGH
LLDFVRFMG
FIDSYICQV
LMWITQCFL
VKTDGNPPE
RLMKQDFSV
LMIIPLINV
ILHNGAYSL
KMVELVHFL
TLDSQVMSL
YLLEMLWRL
ALQPGTALL
FLPSDFFPS
FLPSDFFPS
TLWVDPYEV
MVDGTLLLL
ALFPQLVIL
ILDQKINEV
ALNELLQHV
RTLDKVLEV
GLSPTVWLS
RLVTLKDIV
AFHHVAREL
ELVSEFSRM
FLWGPRALV
VLPDVFIRC
LIVIGILIL
ACDPHSGHF
VLVKSPNHV
IISAVVGIL
SLLMWITQC
SVYDFFVWL
RLPRIFCSC
TLFIGSHVV
MIMVKCWMI
YLQLVFGIE
STPPPGTRV
SLDDYNHLV
VLDGLDVLL
SVRDRLARL
AAGIGILTV
GLVPFLVSV
YMNGTMSQV
GILGFVFTL
SLAGGIIGV
DLERKVESL
HLSTAFARV
WLSLLVPFV
MLLAVLYCL
YLNKIQNSL
KLTPLCVTL
GLSRYVARL
VLPDVFIRC
LAGIGLIAA
SLYNTVATL
GLAPPQHLI
VMAGVGSPY
QLSLLMWIT
FLYGALLLA
FLWGPRAYA
SLVIVTTFV
MLGTHTMEV
MLMAQEALA
KVAELVHFL
RTLDKVLEV
SLYSFPEPE
SLREWLLRI
FLPSDFFPS
KLLEPVLLL
MLLSVPLLL
STNRQSGRQ
LLIENVASL
FLGENISNF
RLDSYVRSL
FLPSDFFPS
AAGIGILTV
MMRKLAILS
VLYRYGSFS
FLLTRILTI
AVGIGIAVV
VDGIGILTI
RGPGRAFVT
LLGRNSFEV
LLWTLVVLL
LLGATCMFV
VLFSSDFRI
RLLQETELV
VLQWASLAV
MLGTHTMEV
LMAQEALAF
IMIGVLVGV
GLPVEYLQV
ALYVDSLFF
LLSAWILTA
AAGIGILTV
LLDVPTAAV
SLLGLLVEV
GLDVLTAKV
FLLWATAEA
ALSDHHIYL
YMNGTMSQV
CLGGLLTMV
YLEPGPVTA
AIMDKNIIL
YIGEVLVSV
HLGNVKYLV
LVVLGLLAV
GAGIGVLTA
NLVPMVATV
PLTFGWCYK
SVRDRLARL
RLTRFLSRV
LMWAKIGPV
SLFEGIDFY
ILAKFLHWL
SLADTNSLA
VYDGREHTV
ALCRWGLLL
KLIANNTRV
SLLQHLIGL
AAGIGILTV
FLWGPRALV
LLDVPTAAV
ALLPPINIL
RILGAVAKV
SLPDFGISY
GLSEFTEYL
GILGFVFTL
FIAGNSAYE
LLDGTATLR
IMDKNIILK
CINGVCWTV
GIAGGLALL
ALGLGLLPV
AAGIGIIQI
GLHCYEQLV
VLEWRFDSR
LLMDCSGSI
YMDGTMSQV
SLLLELEEV
SLDQSVVEL
STAPPHVNV
LLWAARPRL
YLSGANLNL
LLFAGVQCQ
FIYAGSLSA
ELTLGEFLK
AVPDEIPPL
ETVSEQSNV
LLDVPTAAV
TLIKIQHTL
QVCERIPTI
KKREEAPSL
STAPPAHGV
ILKEPVHGV
KLGEFYNQM
ITDQVPFSV
SMVGNWAKV
VMNILLQYV
GLQDCTMLV
GIGIGVLAA
QAGIGILLA
PLKQHFQIV
TLNAWVKVV
CLTSTVQLV
FLTPKKLQC
SLSRFSWGA
RLNMFTPYI
LLLLTVLTV
GVALQTMKQ
RMFPNAPYL
VLLCESTAV
KLVANNTRL
MINAYLDKL
FAYDGKDYI
ITLWQRPLV
Outline
• Pattern recognition
– Regular expressions and
probabilities
• Information content
– Sequence logos
• Multiple alignment and
sequence motifs
• Weight matrix construction
– Sequence weighting
– Low (pseudo) counts
• Examples from the real world
• HMM’s
– Viterbi decoding
• HMM’s in immunology
– Profile HMMs
– TAP binding
• MHC class II binding
– Gibbs sampling
• Links to HMM packages
Sequence Information
• Say that a peptide must have L
at P2 in order to bind, and that
A,F,W,and Y are found at P1.
Which position has most
information?
• How many questions do I need
to ask to tell if a peptide binds
looking at only P1 or P2?
Sequence Information
• Say that a peptide must have L
at P2 in order to bind, and that
A,F,W,and Y are found at P1.
Which position has most
information?
• How many questions do I need
to ask to tell if a peptide binds
looking at only P1 or P2?
• P1: 4 questions (at most)
• P2: 1 question (L or not)
• P2 has the most information
Sequence Information
• Say that a peptide must have L
at P2 in order to bind, and that
A,F,W,and Y are found at P1.
Which position has most
information?
• How many questions do I need
to ask to tell if a peptide binds
looking at only P1 or P2?
• P1: 4 questions (at most)
• P2: 1 question (L or not)
• P2 has the most information
Calculate pa at each position
Entropy
Information content
Conserved positions
– PV=1, P!v=0 => S=0, I=log(20)
Mutable positions
– Paa=1/20 => S=log(20), I=0
Information content
1
2
3
4
5
6
7
8
9
A
0.10
0.07
0.08
0.07
0.04
0.04
0.14
0.05
0.07
R
0.06
0.00
0.03
0.04
0.04
0.03
0.01
0.09
0.01
N
0.01
0.00
0.05
0.02
0.04
0.03
0.03
0.04
0.00
D
0.02
0.01
0.10
0.11
0.04
0.01
0.03
0.01
0.00
C
0.01
0.01
0.02
0.01
0.01
0.02
0.02
0.01
0.02
Q
0.02
0.00
0.02
0.04
0.04
0.03
0.03
0.05
0.02
E
0.02
0.01
0.01
0.08
0.05
0.03
0.04
0.07
0.02
G
0.09
0.01
0.12
0.15
0.16
0.04
0.03
0.05
0.01
H
0.01
0.00
0.02
0.01
0.04
0.02
0.05
0.02
0.01
I
0.07
0.08
0.03
0.10
0.02
0.14
0.07
0.04
0.08
L
0.11
0.59
0.12
0.04
0.08
0.13
0.15
0.14
0.26
K
0.06
0.01
0.01
0.03
0.04
0.02
0.01
0.04
0.01
M
0.04
0.07
0.03
0.01
0.01
0.03
0.03
0.02
0.01
F
0.08
0.01
0.05
0.02
0.06
0.07
0.07
0.05
0.02
P
0.01
0.00
0.06
0.09
0.10
0.03
0.06
0.05
0.00
S
0.11
0.01
0.06
0.07
0.02
0.05
0.07
0.08
0.04
T
0.03
0.06
0.04
0.04
0.06
0.08
0.04
0.10
0.02
W
0.01
0.00
0.04
0.02
0.02
0.01
0.03
0.01
0.00
Y
0.05
0.01
0.04
0.00
0.05
0.03
0.02
0.04
0.01
V
0.08
0.08
0.07
0.05
0.09
0.15
0.08
0.03
0.38
S
3.96
2.16
4.06
3.87
4.04
3.92
3.98
4.04
2.78
I
0.37
2.16
0.26
0.45
0.28
0.40
0.34
0.28
1.55
Sequence logos
Height of a column equal to I
Relative height of a letter is p
Highly useful tool to visualize
sequence motifs
http://www.cbs.dtu.dk/~gorodkin/appl/plogo.html
HLA-A0201
High information
positions
Characterizing a binding motif
from small data sets
10 MHC restricted peptides
ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
What can we learn?
1.
A at P1 favors
binding?
2. I is not allowed at P9?
3. K at P4 favors binding?
4. Which positions are
important for binding?
Simple motifs
Yes/No rules
10 MHC restricted peptides
ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
• Only 11 of 212 peptides identified!
• Need more flexible rules
•If not fit P1 but fit P2 then ok
• Not all positions are equally important
•We know that P2 and P9
determines binding more than
other positions
•Cannot discriminate between good and
very good binders
Simple motifs
Yes/No rules
10 MHC restricted peptides
ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
• Example
RLLDDTPEV 84 nM
GLLGNVSTV 23 nM
ALAKAAAAL 309 nM
•Two first peptides will not fit the motif.
They are all good binders (aff< 500nM)
Extended motifs
Fitness of aa at each position
given by P(aa)
Example P1
PA = 6/10
PG = 2/10
PT = PK = 1/10
PC = PD = …PV = 0
Problems
– Few data
– Data redundancy/duplication
ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
RLLDDTPEV 84 nM
GLLGNVSTV 23 nM
ALAKAAAAL 309 nM
Sequence information
Raw sequence counting
ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
Sequence weighting
Poor or biased sampling
of sequence space
Example P1
PA = 2/6
PG = 2/6
PT = PK = 1/6
PC = PD = …PV = 0
ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
}
Similar
sequences
Weight 1/5
RLLDDTPEV 84 nM
GLLGNVSTV 23 nM
ALAKAAAAL 309 nM
Sequence weighting
ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
Pseudo counts
I is not found at position P9.
Does this mean that I is
forbidden (P(I)=0)?
No! Use Blosum substitution
matrix to estimate pseudo
frequency of I at P9
ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
The Blosum matrix
A
R
N
D
C
Q
E
G
H
I
L
K
M
F
P
S
T
W
Y
V
A
0.29
0.04
0.04
0.04
0.07
0.06
0.06
0.08
0.04
0.05
0.04
0.06
0.05
0.03
0.06
0.11
0.07
0.03
0.04
0.07
R
0.03
0.34
0.04
0.03
0.02
0.07
0.05
0.02
0.05
0.02
0.02
0.11
0.03
0.02
0.03
0.04
0.04
0.02
0.03
0.02
N
0.03
0.04
0.32
0.07
0.02
0.04
0.04
0.04
0.05
0.01
0.01
0.04
0.02
0.02
0.02
0.05
0.04
0.02
0.02
0.02
D
0.03
0.03
0.08
0.40
0.02
0.05
0.09
0.03
0.04
0.02
0.02
0.04
0.02
0.02
0.03
0.05
0.04
0.02
0.02
0.02
C
0.02
0.01
0.01
0.01
0.48
0.01
0.01
0.01
0.01
0.02
0.02
0.01
0.02
0.01
0.01
0.02
0.02
0.01
0.01
0.02
Q
0.03
0.05
0.03
0.03
0.01
0.21
0.06
0.02
0.04
0.01
0.02
0.05
0.03
0.01
0.02
0.03
0.03
0.02
0.02
0.02
E
0.04
0.05
0.05
0.09
0.02
0.10
0.30
0.03
0.05
0.02
0.02
0.07
0.03
0.02
0.04
0.05
0.04
0.02
0.03
0.02
G
0.08
0.03
0.07
0.05
0.03
0.04
0.04
0.51
0.04
0.02
0.02
0.04
0.03
0.03
0.04
0.07
0.04
0.03
0.02
0.02
H
0.01
0.02
0.03
0.02
0.01
0.03
0.03
0.01
0.35
0.01
0.01
0.02
0.02
0.02
0.01
0.02
0.01
0.02
0.05
0.01
I
0.04
0.02
0.02
0.02
0.04
0.03
0.02
0.02
0.02
0.27
0.12
0.03
0.10
0.06
0.03
0.03
0.05
0.03
0.04
0.16
L
0.06
0.05
0.03
0.03
0.07
0.05
0.04
0.03
0.04
0.17
0.38
0.04
0.20
0.11
0.04
0.04
0.07
0.05
0.07
0.13
K
0.04
0.12
0.05
0.04
0.02
0.09
0.08
0.03
0.05
0.02
0.03
0.28
0.04
0.02
0.04
0.05
0.05
0.02
0.03
0.03
M
0.02
0.02
0.01
0.01
0.02
0.02
0.01
0.01
0.02
0.04
0.05
0.02
0.16
0.03
0.01
0.02
0.02
0.02
0.02
0.03
F
0.02
0.02
0.02
0.01
0.02
0.01
0.02
0.02
0.03
0.04
0.05
0.02
0.05
0.39
0.01
0.02
0.02
0.06
0.13
0.04
P
0.03
0.02
0.02
0.02
0.02
0.02
0.03
0.02
0.02
0.01
0.01
0.03
0.02
0.01
0.49
0.03
0.03
0.01
0.02
0.02
S
0.09
0.04
0.07
0.05
0.04
0.06
0.06
0.05
0.04
0.03
0.02
0.05
0.04
0.03
0.04
0.22
0.09
0.02
0.03
0.03
Some amino acids are highly conserved (i.e. C),
some have a high change of mutation (i.e. I)
T
0.05
0.03
0.05
0.04
0.04
0.04
0.04
0.03
0.03
0.04
0.03
0.04
0.04
0.03
0.04
0.08
0.25
0.02
0.03
0.05
W
0.01
0.01
0.00
0.00
0.00
0.01
0.01
0.01
0.01
0.01
0.01
0.01
0.01
0.02
0.00
0.01
0.01
0.49
0.03
0.01
Y
0.02
0.02
0.02
0.01
0.01
0.02
0.02
0.01
0.06
0.02
0.02
0.02
0.02
0.09
0.01
0.02
0.02
0.07
0.32
0.02
V
0.07
0.03
0.03
0.02
0.06
0.04
0.03
0.02
0.02
0.18
0.10
0.03
0.09
0.06
0.03
0.04
0.07
0.03
0.05
0.27
What is a pseudo count?
A
A 0.29
R 0.04
N 0.04
D 0.04
C 0.07
….
Y 0.04
V 0.07
R
0.03
0.34
0.04
0.03
0.02
N
0.03
0.04
0.32
0.07
0.02
D
0.03
0.03
0.08
0.40
0.02
C
0.02
0.01
0.01
0.01
0.48
Q
0.03
0.05
0.03
0.03
0.01
E
0.04
0.05
0.05
0.09
0.02
G
0.08
0.03
0.07
0.05
0.03
H
0.01
0.02
0.03
0.02
0.01
I
0.04
0.02
0.02
0.02
0.04
L
0.06
0.05
0.03
0.03
0.07
K
0.04
0.12
0.05
0.04
0.02
M
0.02
0.02
0.01
0.01
0.02
F
0.02
0.02
0.02
0.01
0.02
P
0.03
0.02
0.02
0.02
0.02
S
0.09
0.04
0.07
0.05
0.04
T
0.05
0.03
0.05
0.04
0.04
W
0.01
0.01
0.00
0.00
0.00
Y
0.02
0.02
0.02
0.01
0.01
V
0.07
0.03
0.03
0.02
0.06
0.03 0.02 0.02 0.01 0.02 0.03 0.02 0.05 0.04 0.07 0.03 0.02 0.13 0.02 0.03 0.03 0.03 0.32 0.05
0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.01 0.16 0.13 0.03 0.03 0.04 0.02 0.03 0.05 0.01 0.02 0.27
• Say V is observed at P2
• Knowing that V at P2 binds, what is the probability that
a peptide could have I at P2?
• P(I|V) = 0.16
Pseudo count estimation
• Calculate observed amino acids
frequencies fa
• Pseudo frequency for amino acid b
• Example
ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
Weight on pseudo count
• Pseudo counts are important when only
limited data is available
• With large data sets only “true”
observation should count
•  is the effective number of sequences
(N-1),  is the weight on prior
ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
Weight on pseudo count
• Example
• If  large, p ≈ f and only the observed
data defines the motif
• If  small, p ≈ g and the pseudo counts
(or prior) defines the motif
•  is [50-200] normally
ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
Sequence weighting and pseudo
counts
ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
RLLDDTPEV 84nM
GLLGNVSTV 23nM
ALAKAAAAL 309nM
P7P and P7S > 0
Position specific weighting
We know that positions 2 and 9
are anchor positions for most
MHC binding motifs
– Increase weight on high
information positions
Motif found on large data set
Weight matrices
Estimate amino acid frequencies from alignment including
sequence weighting and pseudo count
1
2
3
4
5
6
7
8
9
A
0.08
0.04
0.08
0.08
0.06
0.06
0.10
0.05
0.08
R
0.06
0.01
0.04
0.05
0.04
0.03
0.02
0.07
0.02
N
0.02
0.01
0.05
0.03
0.05
0.03
0.04
0.04
0.01
D
0.03
0.01
0.07
0.10
0.03
0.03
0.04
0.03
0.01
C
0.02
0.01
0.02
0.01
0.01
0.03
0.02
0.01
0.02
Q
0.02
0.01
0.03
0.05
0.04
0.03
0.03
0.04
0.02
E
0.03
0.02
0.03
0.08
0.05
0.04
0.04
0.06
0.03
G
0.08
0.02
0.08
0.13
0.11
0.06
0.05
0.06
0.02
H
0.02
0.01
0.02
0.01
0.03
0.02
0.04
0.03
0.01
I
0.08
0.11
0.05
0.05
0.04
0.10
0.08
0.06
0.10
What do the numbers mean?
L
0.11
0.44
0.11
0.06
0.09
0.14
0.12
0.13
0.23
K
0.06
0.02
0.03
0.05
0.04
0.04
0.02
0.06
0.03
M
0.04
0.06
0.03
0.01
0.02
0.03
0.03
0.02
0.02
F
0.06
0.03
0.06
0.03
0.06
0.05
0.06
0.05
0.04
P
0.02
0.01
0.04
0.08
0.06
0.04
0.07
0.04
0.01
S
0.09
0.02
0.06
0.06
0.04
0.06
0.06
0.08
0.04
T
0.04
0.05
0.05
0.04
0.05
0.06
0.05
0.07
0.04
W
0.01
0.00
0.03
0.02
0.02
0.01
0.03
0.01
0.00
Y
0.04
0.01
0.05
0.01
0.05
0.03
0.03
0.04
0.02
V
0.08
0.10
0.07
0.05
0.08
0.13
0.08
0.05
0.25
– P2(V)>P2(M). Does this mean that V enables binding more than M.
– In nature not all amino acids are found equally often
• In nature V is found more often than M, so we must somehow
rescale with the background
• qM = 0.025, qV = 0.073
• Finding 7% V is hence not significant, but 7% M highly significant
Weight matrices
• A weight matrix is given as
Wij = log(pij/qj)
– where i is a position in the motif, and j an amino acid. qj is the
background frequency for amino acid j.
1
2
3
4
5
6
7
8
9
A
0.6
-1.6
0.2
-0.1
-1.6
-0.7
1.1
-2.2
-0.2
R
0.4
-6.6
-1.3
-0.1
-0.1
-1.4
-3.8
1.0
-3.5
N
-3.5
-6.5
0.1
-2.0
0.1
-1.0
-0.2
-0.8
-6.1
D
-2.4
-5.4
1.5
2.0
-2.2
-2.3
-1.3
-2.9
-4.5
C
-0.4
-2.5
0.0
-1.6
-1.2
1.1
1.3
-1.4
0.7
Q
-1.9
-4.0
-1.8
0.5
0.4
-1.3
-0.3
0.4
-0.8
E
-2.7
-4.7
-3.3
0.8
-0.5
-1.4
-1.3
0.1
-2.5
G
0.3
-3.7
0.4
2.0
1.9
-0.2
-1.4
-0.4
-4.0
H
I
L
K
M
F
-1.1 1.0 0.3 0.0 1.4 1.2
-6.3 1.0 5.1 -3.7 3.1 -4.2
0.5 -1.0 0.3 -2.5 1.2 1.0
-3.3 0.1 -1.7 -1.0 -2.2 -1.6
1.2 -2.2 -0.5 -1.3 -2.2 1.7
-1.0 1.8 0.8 -1.9 0.2 1.0
2.1 0.6 0.7 -5.0 1.1 0.9
0.2 -0.0 1.1 -0.5 -0.5 0.7
-2.6 0.9 2.8 -3.0 -1.8 -1.4
• W is a L x 20 matrix, L is motif length
P
-2.7
-4.3
-0.1
1.7
1.2
-0.4
1.3
-0.3
-6.2
S
1.4
-4.2
-0.3
-0.6
-2.5
-0.6
-0.5
0.8
-1.9
T
-1.2
-0.2
-0.5
-0.2
-0.1
0.4
-0.9
0.8
-1.6
W
-2.0
-5.9
3.4
1.3
1.7
-0.5
2.9
-0.7
-4.9
Y
V
1.1 0.7
-3.8 0.4
1.6 0.0
-6.8 -0.7
1.5 1.0
-0.0 2.1
-0.4 0.5
1.3 -1.1
-1.6 4.5
Scoring a sequence to a weight matrix
Score sequences to weight matrix by looking up and
adding L values from the matrix
1
2
3
4
5
6
7
8
9
A
0.6
-1.6
0.2
-0.1
-1.6
-0.7
1.1
-2.2
-0.2
R
0.4
-6.6
-1.3
-0.1
-0.1
-1.4
-3.8
1.0
-3.5
N
-3.5
-6.5
0.1
-2.0
0.1
-1.0
-0.2
-0.8
-6.1
D
-2.4
-5.4
1.5
2.0
-2.2
-2.3
-1.3
-2.9
-4.5
C
-0.4
-2.5
0.0
-1.6
-1.2
1.1
1.3
-1.4
0.7
RLLDDTPEV
GLLGNVSTV
ALAKAAAAL
Q
-1.9
-4.0
-1.8
0.5
0.4
-1.3
-0.3
0.4
-0.8
E
-2.7
-4.7
-3.3
0.8
-0.5
-1.4
-1.3
0.1
-2.5
G
0.3
-3.7
0.4
2.0
1.9
-0.2
-1.4
-0.4
-4.0
H
I
L
K
M
F
-1.1 1.0 0.3 0.0 1.4 1.2
-6.3 1.0 5.1 -3.7 3.1 -4.2
0.5 -1.0 0.3 -2.5 1.2 1.0
-3.3 0.1 -1.7 -1.0 -2.2 -1.6
1.2 -2.2 -0.5 -1.3 -2.2 1.7
-1.0 1.8 0.8 -1.9 0.2 1.0
2.1 0.6 0.7 -5.0 1.1 0.9
0.2 -0.0 1.1 -0.5 -0.5 0.7
-2.6 0.9 2.8 -3.0 -1.8 -1.4
11.9 84nM
14.7 23nM
4.3 309nM
P
-2.7
-4.3
-0.1
1.7
1.2
-0.4
1.3
-0.3
-6.2
S
1.4
-4.2
-0.3
-0.6
-2.5
-0.6
-0.5
0.8
-1.9
T
-1.2
-0.2
-0.5
-0.2
-0.1
0.4
-0.9
0.8
-1.6
W
-2.0
-5.9
3.4
1.3
1.7
-0.5
2.9
-0.7
-4.9
Y
V
1.1 0.7
-3.8 0.4
1.6 0.0
-6.8 -0.7
1.5 1.0
-0.0 2.1
-0.4 0.5
1.3 -1.1
-1.6 4.5
Which peptide is most
likely to bind?
Which peptide second?
Example from real life
• 10 peptides from MHCpep
database
• Bind to the MHC complex
• Relevant for immune
system recognition
• Estimate sequence motif
and weight matrix
• Evaluate motif
“correctness” on 528
peptides
ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
Prediction accuracy
Measured affinity
Pearson correlation 0.45
Prediction score
Predictive performance
End of first part
Take a deep breath
Smile to you neighbor
Hidden Markov Models
• Weight matrices do not deal with insertions and
deletions
• In alignments, this is done in an ad-hoc manner
by optimization of the two gap penalties for
first gap and gap extension
• HMM is a natural frame work where
insertions/deletions are dealt with explicitly
Why hidden?
The unfair casino: Loaded die p(6) = 0.5; switch fair to load:0.05;
switch load to fair: 0.1
Model generates numbers
– 312453666641
Does not tell which die was
used
Alignment (decoding) can give
the most probable
solution/path (Viterby)
– FFFFFFLLLLLL
Or most probable set of
states
– FFFFFFLLLLLL
0.95
1:1/6
1:1/10
2:1/6 0.05 2:1/10
3:1/6
3:1/10
4:1/6
4:1/10
5:1/6 0.10 5:1/10
6:1/6
6:1/2
Fair
Loaded
0.9
HMM (a simple example)
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC
Core of alignment
• Example from A. Krogh
• Core region defines the
number of states in the
HMM (red)
• Insertion and deletion
statistics are derived
from the non-core part
of the alignment (black)
HMM construction
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC
• 5 matches. A, 2xC, T, G
• 5 transitions in gap region
• C out, G out
• A-C, C-T, T out
• Out transition 3/5
• Stay transition 2/5
.4
.2
.4
.2
.2
A
C
G
T
.6
.6
A
C
G
T
.8
A
1. C
G
T
.2
ACA---ATG
A
.8 1. C
.2
G
T
A
.8
.2 .4 C
G
T
1
A
1. C
G
T
0.8x1x0.8x1x0.8x0.4x1x1x0.8x1x0.2
A
1. C
.2
.8
G
T
= 3.3x10-2
.8
.2
Align sequence to HMM
ACA---ATG 0.8x1x0.8x1x0.8x0.4x1x0.8x1x0.2 = 3.3x10-2
TCAACTATC 0.2x1x0.8x1x0.8x0.6x0.2x0.4x0.4x0.4x0.2x0.6x1x1x0.8x1x0.8
= 0.0075x10-2
ACAC--AGC = 1.2x10-2
Consensus:
ACAC--ATC = 4.7x10-2, ACA---ATC = 13.1x10-2
Exceptional:
TGCT--AGG = 0.0023x10-2
Align sequence to HMM - Null model
• Score depends
strongly on length
• Null model is a random
model. For length L
the score is 0.25L
• Log-odds score for
sequence S
Log( P(S)/0.25L)
• Positive score means
more likely than Null
model
ACA---ATG = 4.9
TCAACTATC = 3.0
ACAC--AGC = 5.3
AGA---ATC = 4.9
ACCG--ATC = 4.6
Consensus:
ACAC--ATC = 6.7
ACA---ATC = 6.3
Exceptional:
TGCT--AGG = -0.97
Note!
Model decoding (Viterby)
Example: 1245666. What was the series of
dice used to generate this output?
0.9
0.95
1:1/6
2:1/6
3:1/6
4:1/6
5:1/6
6:1/6
Fair
0.05
0.10
1:1/10
2:1/10
3:1/10
4:1/10
5:1/10
6:1/2
Loaded
Model decoding (Viterby)
Example: 1245666. What was the series of dice used to
generate this output?
-0.02
Log model
1:-0.78
2:-0.78 -1.3
3:-0.78
4:-0.78
5:-0.78
-1
6:-0-78
Fair
1
F
-0.78
L
Null
2
-3.08
4
5
6
1:-1
2:-1
3:-1
4:-1
5:-1
6:-0.3
Loaded
6
6
-0.05
Model decoding (Viterby)
-0.02
Log model
1:-0.78
2:-0.78 -1.3
3:-0.78
4:-0.78
5:-0.78
-1
6:-0-78
Fair
1
2
F
-0.78
-1.58
L
Null
-3.08
4
-3.88
5
6
1:-1
2:-1
3:-1
4:-1
5:-1
6:-0.3
Loaded
6
6
-0.05
Model decoding (Viterby)
-0.02
Log model
1:-0.78
2:-0.78 -1.3
3:-0.78
4:-0.78
5:-0.78
-1
6:-0-78
Fair
1:-1
2:-1
3:-1
4:-1
5:-1
6:-0.3
Loaded
1
2
4
5
6
6
6
F
-0.78
-1.58
-2.38
-3.18
-3.98
-4.78
-5.58
L
Null
-3.08
-3.88
-4.68
-4.78
-5.13
-5.48
Series of dice is FFFFLLL
-0.05
HMM’s and weight matrices
• In the case of un-gapped alignments
HMM’s become simple weight matrices
• To achieve high performance, the emission
frequencies are estimated using the
techniques of
– Sequence weighting
– Pseudo counts
Profile HMM’s
• Alignments based on conventional scoring
matrices (BLOSUM62) scores all positions in a
sequence in an equal manner
• Some positions are highly conserved, some are
highly variable (more than what is described in
the BLOSUM matrix)
• Profile HMM’s are ideal suited to describe such
position specific variations
Profile HMM’s
ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLN
TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHWHGLLVPFGMDGVPGVSFPG---I
-TSMAPAFGVQEFYRTVKQGDEVTVTIT-----NIDQIED-VSHGFVVVNHGVSME---I
IE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFRGEMMTKD---TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------IDDLTHGFTMGNHGVAME---V
ASAETMVFEPDFLVLEIGPGDRVRFVPTHK-SHNAATIDGMVPEGVEGFKSRINDE---TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDPMERNTAGVP
TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI
Must have a G
Core: Position with < 2 gaps
Any thing can match
HMM vs. alignment
• Detailed description of core
– Conserved/variable positions
• Price for insertions/deletions varies at
different locations in sequence
• These features cannot be captured in
conventional alignments
Profile HMM’s
All M/D pairs must
be visited once
L1- Y2A3V4R5- I6
P1D2P3P4I4P5D6P7
Example. Sequence profiles
• Alignment of protein sequences 1PLC._ and 1GYC.A
• E-value > 1000
• Profile alignment
– Align 1PLC._ against Swiss-prot
– Make position specific weight matrix from alignment
– Use this matrix to align 1PLC._ against 1GYC.A
• E-value < 10-22. Rmsd=3.3
Example continued
Score = 97.1 bits (241), Expect = 9e-22
Identities = 13/107 (12%), Positives = 27/107 (25%), Gaps = 17/107 (15%)
Query: 3
Sbjct: 26
Query: 57
Sbjct: 80
ADDGSLAFVPSEFSISPGEKI------VFKNNAGFPHNIVFDEDSIPSGVDASKIS 56
F
+
G++
N+
+
+G + +
------VFPSPLITGKKGDRFQLNVVDTLTNHTMLKSTSIHWHGFFQAGTNWADGP 79
MSEEDLLNAKGETFEVAL---SNKGEYSFYCSP--HQGAGMVGKVTV 98
A G +F
G + ++
G+ G
V
AFVNQCPIASGHSFLYDFHVPDQAGTFWYHSHLSTQYCDGLRGPFVV 126
Rmsd=3.3 Å
Model red
Structure blue
Class II MHC binding
• MHC class II binds
peptides in the class II
antigen presentation
pathway
• Binds peptides of
length 9-18 (even whole
proteins can bind!)
• Binding cleft is open
• Binding core is 9 aa
Gibbs sampler
www.cbs.dtu.dk/biotools/EasyGibbs
100 10mer peptides
2100~1030 combinations
Monte Carlo
simulations can do it
Gibbs sampler. Prediction accuracy
HMM packages
•
•
•
HMMER (http://hmmer.wustl.edu/)
– S.R. Eddy, WashU St. Louis. Freely available.
SAM (http://www.cse.ucsc.edu/research/compbio/sam.html)
– R. Hughey, K. Karplus, A. Krogh, D. Haussler and others, UC Santa Cruz.
Freely available to academia, nominal license fee for commercial users.
META-MEME (http://metameme.sdsc.edu/)
– William Noble Grundy, UC San Diego. Freely available. Combines
features of PSSM search and profile HMM search.
• NET-ID, HMMpro (http://www.netid.com/html/hmmpro.html)
– Freely available to academia, nominal license fee for commercial users.
– Allows HMM architecture construction.
• EasyGibbs (http://www.cbs.dtu.dk/biotools/EasyGibbs/)
– Webserver for Gibbs sampling of proteins sequences