Morten Nielsen, CBS, BioCentrum, DTU CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU Sequence motifs, information content, logos, and HMM’s.

Download Report

Transcript Morten Nielsen, CBS, BioCentrum, DTU CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU Sequence motifs, information content, logos, and HMM’s.

Morten Nielsen,
CBS, BioCentrum,
DTU
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Sequence motifs, information
content, logos, and HMM’s
• Multiple alignment and sequence motifs
• Weight matrix construction and consensus
sequence
– Sequence weighting
– Low (pseudo) counts
• Information content
– Sequence logos
– Mutual information
• Example from the real world
• HMM’s and profile HMM’s
– TMHMM (trans-membrane protein)
– Gene finding
• Links to HMM packages
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Outline
• Core
• Consensus
sequence
• Weight matrices
• Problems
– Sequence weights
– Low counts
----------MLEFVVEADLPGIKA-----------------MLEFVVEFALPGIKA-----------------MLEFVVEFDLPGIAA--------------------YLQDSDPDSFQD----------GSDTITLPCRMKQFINMWQE------------RNQEERLLADLMQNYDPNLR----------------YDPNLRPAERDSDVVNVSLK---------------NVSLKLTLTNLISLNEREEA------EREEALTTNVWIEMQWCDYR------------------WCDYRLRWDPRDYEGLWVLR----LWVLRVPSTMVWRPDIVLEN----------------------IVLENNVDGVFEVALYCNVL-------------YCNVLVSPDGCIYWLPPAIF
---------PPAIFRSACSISVTYFPFDW---*********
FVVEFDLPG
Consensus
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Multiple alignment and
sequence motifs
Clustering (slow, but accurate)
----------MLEFVVEADLPGIKA-----------------MLEFVVEFALPGIKA-----------------MLEFVVEFDLPGIAA--------------------YLQDSDPDSFQD----------GSDTITLPCRMKQFINMWQE------------RNQEERLLADLMQNYDPNLR----------------YDPNLRPAERDSDVVNVSLK---------------NVSLKLTLTNLISLNEREEA------EREEALTTNVWIEMQWCDYR------------------WCDYRLRWDPRDYEGLWVLR----LWVLRVPSTMVWRPDIVLEN----------------------IVLENNVDGVFEVALYCNVL-------------YCNVLVSPDGCIYWLPPAIF
---------PPAIFRSACSISVTYFPFDW----
*********
}
Homologous sequences
Weight = 1/n (1/3)
Consensus sequence
YRQELDPLV
Previous
FVVEFDLPG
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Sequences weighting 1 -
Henikoff & Henikoff (fast)
• waa’ = 1/rs
• r: Number of different aa in a column
• s: Number occurrences
• Normalize S waa= 1 for each column
• Sequence weight is sum of waa in
sequence
F: r=7 (FYMLPVW), s=4
w’=1/28, w = 0.055
Y: s=3, w`=1/21, w = 0.073
M,P,W: s=1, w’=1/7, w = 0.218
L,V: s=2, w’=1/14, w = 0.109
FVVEADLPG
FVVEFALPG
FVVEFDLPG
YLQDSDPDS
MKQFINMWQ
LMQNYDPNL
PAERDSDVV
LKLTLTNLI
VWIEMQWCD
YRLRWDPRD
WRPDIVLEN
VLENNVDGV
YCNVLVSPD
FRSACSISV
w
0.37
0.43
0.32
0.59
0.90
0.68
0.75
0.85
0.84
0.51
0.71
0.59
0.71
0.75
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Sequences weighting 2 -
P1
--------MLEFVVEADLPGIKA---------------MLEFVVEFALPGIKA---------------MLEFVVEFDLPGIAA------------------YLQDSDPDSFQD--------GSDTITLPCRMKQFINMWQE----------RNQEERLLADLMQNYDPNLR--------------YDPNLRPAERDSDVVNVSLK-------------NVSLKLTLTNLISLNEREEA----EREEALTTNVWIEMQWCDYR----------------WCDYRLRWDPRDYEGLWVLR--LWVLRVPSTMVWRPDIVLEN--------------------IVLENNVDGVFEVALYCNVL-----------YCNVLVSPDGCIYWLPPAIF
-------PPAIFRSACSISVTYFPFDW---*********
• Limited number of data
• Poor sampling of
sequence space
• I is not found at position
P1. Does this mean
that I can never be
found at P1?
• No! Use Blosum matrix
to estimate pseudo
frequency of I
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Low count correction
• Every time for instance
L/V is observed, I is
also likely to occur
• Estimate low (pseudo)
count correction using
this approach
• As more data are
included the pseudo
count correction
becomes less important

Blosum62 substitution frequencies
#
L
V
I
L
V
0.12 0.38 0.10
0.16 0.13 0.27
N L  2,NV  2,N eff  12,
2  0.12  2  0.16
p 
 0.05
12
s
I
p
N eff  p I    p sI
N eff  
,
p  120100.05
 0.02
1210
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Low count correction using
Blosum matrices
• Information and entropy
– Conserved amino acid regions contain high degree of
information (high order == low entropy)
– Variable amino acid regions contain low degree of
information (low order == high entropy)
• Shannon information
D = log2(N) + S pi log2 pi (for proteins N=20, DNA N=4)
• Conserved residue pA=1, pi<>A=0,
D = log2(N) ( = 4.3 for proteins)
• Variable region pA=0.05, pC=0.05, ..,
D=0
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Information content
MHC class I
• Height of a column equal
to D
• Relative height of a letter
is pA
• Highly useful tool to
visualize sequence
motifs
High information
positions
http://www.cbs.dtu.dk/~gorodkin/appl/plogo.html
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Sequence logo
• Information content
D = S pi log2 (pi/qi)
• Shannon, qi = 1/N = 0.05
D = S pi log2 (pi) - S pi log2 (1/N)
= log2 N + S pi log2 (pi)
• Kullback-Leibler, qi = background frequency
– V/L/A more frequent than for instance C/H/W
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
More on logos
P1
I    P(aai,aa j )  log
aai aa j
P(aai ,aa j )
P(aai )  P(aa j )
P(G1) = 2/9 = 0.22, ..
P(V6) = 4/9 = 0.44,..
P(G1,V6) = 2/9 = 0.22,
P(G1)*P(V6) = 8/81 = 0.10
log(0.22/0.10) > 0
P6
ALWGFFPVA
ILKEPVHGV
ILGFVFTLT
LLFGYPVYV
GLSPTVWLS
YMNGTMSQV
GILGFVFTL
WLSLLVPFV
FLPSDFFPS
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Mutual information
313 binding peptides
313 random peptides
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Mutual information
• Neural networks can
learn higher order
correlations!
– What does this
mean?
0 0 => 0
0 1 => 1
1 0 => 1
1 1 => 0
No linear function
can learn this
pattern
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Learning higher order
correlation
Take a deep breath
Smile to you neighbor
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
End of first part
• Estimate amino acid frequencies from alignment
including sequence weighting and pseudo counts
• Construct a weight matrix as
Wij = log(pij/qj)
• Here i is a position in the motif, and j an amino acid.
qj is the prior frequency for amino acid j.
• W is a L x 20 matrix, L is motif length
• Score sequences to weight matrix by looking up and
adding L values from matrix
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Weight matrices
• What are log-odds scores?
– Does an monthly income of 2000 $ mean that you are rich?
– Depends on where you live
• In Denmark no
• In Argentina yes
– You must always compare your measured value to a
background
• For proteins the background is either the flat
distribution 0.05 or the distribution in Swiss-prot
• In nature not all amino acids are found equally often
– PA = 0.070, PW = 0.013
– Finding 6% A is hence not significant, but 6% W highly
significant
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Weight matrix 2
1
2
3
4
5
6
7
8
9
A
0.6
-1.6
0.2
-0.1
-1.6
-0.7
1.1
-2.2
-0.2
R
0.4
-6.6
-1.3
-0.1
-0.1
-1.4
-3.8
1.0
-3.5
N
-3.5
-6.5
0.1
-2.0
0.1
-1.0
-0.2
-0.8
-6.1
D
-2.4
-5.4
1.5
2.0
-2.2
-2.3
-1.3
-2.9
-4.5
C
-0.4
-2.5
0.0
-1.6
-1.2
1.1
1.3
-1.4
0.7
Q
-1.9
-4.0
-1.8
0.5
0.4
-1.3
-0.3
0.4
-0.8
ILYQVPFSV
ALPYWNFAT
MTAQWWLDA
E
-2.7
-4.7
-3.3
0.8
-0.5
-1.4
-1.3
0.1
-2.5
G
0.3
-3.7
0.4
2.0
1.9
-0.2
-1.4
-0.4
-4.0
H
I
L
K
M
F
-1.1 1.0 0.3 0.0 1.4 1.2
-6.3 1.0 5.1 -3.7 3.1 -4.2
0.5 -1.0 0.3 -2.5 1.2 1.0
-3.3 0.1 -1.7 -1.0 -2.2 -1.6
1.2 -2.2 -0.5 -1.3 -2.2 1.7
-1.0 1.8 0.8 -1.9 0.2 1.0
2.1 0.6 0.7 -5.0 1.1 0.9
0.2 -0.0 1.1 -0.5 -0.5 0.7
-2.6 0.9 2.8 -3.0 -1.8 -1.4
15.0
-3.4
0.8
P
-2.7
-4.3
-0.1
1.7
1.2
-0.4
1.3
-0.3
-6.2
S
1.4
-4.2
-0.3
-0.6
-2.5
-0.6
-0.5
0.8
-1.9
T
-1.2
-0.2
-0.5
-0.2
-0.1
0.4
-0.9
0.8
-1.6
W
-2.0
-5.9
3.4
1.3
1.7
-0.5
2.9
-0.7
-4.9
Y
V
1.1 0.7
-3.8 0.4
1.6 0.0
-6.8 -0.7
1.5 1.0
-0.0 2.1
-0.4 0.5
1.3 -1.1
-1.6 4.5
Which peptide bindes the best?
Which peptide second?
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Scoring sequences to a
weight matrix
• 10 peptides from
MHCpep database
• Bind to the MHC
complex
• Relevant for immune
system recognition
• Estimate sequence
motif and weight matrix
• Evaluate motif
“correctness” on 528
peptides
ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Example from real life
Pearson correlation 0.45
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Prediction accuracy
• Raw sequence counting
– No sequence weighting
– No pseudo count
– Prediction accuracy 0.45
• Sequence weighting
– No pseudo count
– Prediction accuracy 0.5
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Example (cont.)
ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
• Sequence weighting
and pseudo count
– Prediction accuracy
0.60
• Motif found on all
data (485)
– Prediction accuracy
0.79
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Example (cont.)
ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
• Weight matrices do not deal with insertions
and deletions
• In alignments, this is done in an ad-hoc
manner by optimization of the two gap
penalties for first gap and gap extension
• HMM is a natural frame work where
insertions/deletions are dealt with explicitly
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Hidden Markov Models
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC
Core of alignment
• Example from A. Krogh
• Core region defines the
number of states in the
HMM (red)
• Insertion and deletion
statistics are derived
from the non-core part
of the alignment (black)
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
HMM (a simple example)
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC
• 5 matches. A, 2xC, T, G
• 5 transitions in gap region
• C out, G out
• A-C, C-T, T out
• Out transition 3/5
• Stay transition 2/5
.4
.2
.4
.2
.2
A
C
G
T
.6
.6
A
C
G
T
.8
.2
A
1. C
G
T
ACA---ATG
A
.8 1. C
.2
G
T
A
.8
.4
C
.2
G
T
1
A
A
1. C
1. C
G
T
0.8x1x0.8x1x0.8x0.4x1x1x0.8x1x0.2
.2
.8
G
T
= 3.3x10-2
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
HMM construction
.8
.2
ACA---ATG
TCAACTATC
0.8x1x0.8x1x0.8x0.4x1x0.8x1x0.2
= 3.3x10-2
0.2x1x0.8x1x0.8x0.6x0.2x0.4x0.4x0.4x0.2x0.6x1x1x0.8x1x0.8
0.0075x10-2
ACAC--AGC = 1.2x10-2
AGA---ATC = 3.3x10-2
ACCG--ATC = 0.59x10-2
Consensus:
ACAC--ATC = 4.7x10-2, ACA---ATC = 13.1x10-2
Exceptional:
TGCT--AGG = 0.0023x10-2
=
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Align sequence to HMM
• Score depends strongly
on length
• Null model is a random
model. For length L the
score is 0.25L
• Log-odds score for
sequence S
Log( P(S)/0.25L)
• Positive score means
more likely than Null
model
ACA---ATG = 4.9
TCAACTATC = 3.0
ACAC--AGC = 5.3
AGA---ATC = 4.9
ACCG--ATC = 4.6
Consensus:
Note!
ACAC--ATC = 6.7
ACA---ATC = 6.3
Exceptional:
TGCT--AGG = -0.97
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Align sequence to HMM Null model
• In the case of un-gapped alignments
HMM’s become simple weight matrices
• It still might be useful to apply a HMM
tool package to estimate a weight matrix
– Sequence weighting
– Pseudo counts
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
HMM’s and weight matrices
• Alignments based on conventional scoring
matrices (BLOSUM62) scores all positions in
a sequence in an equal manner
• Some positions are highly conserved, some
are highly variable (more than what is
described in the BLOSUM matrix)
• Profile HMM’s are ideal suited to describe
such position specific variations
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Profile HMM’s
Sequence profiles
• Alignment of protein sequences 1PLC._ and
1GYC.A
• E-value > 1000
• Profile alignment
– Align 1PLC._ against Swiss-prot
– Make position specific weight matrix from
alignment
– Use this matrix to align 1PLC._ against 1GYC.A
• E-value < 10-22. Rmsd=3.3
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Example
Score = 97.1 bits (241), Expect = 9e-22
Identities = 13/107 (12%), Positives = 27/107 (25%), Gaps = 17/107 (15%)
Query: 3
Sbjct: 26
Query: 57
Sbjct: 80
ADDGSLAFVPSEFSISPGEKI------VFKNNAGFPHNIVFDEDSIPSGVDASKIS 56
F
+
G++
N+
+
+G + +
------VFPSPLITGKKGDRFQLNVVDTLTNHTMLKSTSIHWHGFFQAGTNWADGP 79
MSEEDLLNAKGETFEVAL---SNKGEYSFYCSP--HQGAGMVGKVTV 98
A G +F
G + ++
G+ G
V
AFVNQCPIASGHSFLYDFHVPDQAGTFWYHSHLSTQYCDGLRGPFVV 126
Rmsd=3.3 Å
Model red
Template blue
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Example continued
ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLN
TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHWHGLLVPFGMDGVPGVSFPG---I
-TSMAPAFGVQEFYRTVKQGDEVTVTIT-----NIDQIED-VSHGFVVVNHGVSME---I
IE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFRGEMMTKD---TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------IDDLTHGFTMGNHGVAME---V
ASAETMVFEPDFLVLEIGPGDRVRFVPTHK-SHNAATIDGMVPEGVEGFKSRINDE---TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDPMERNTAGVP
TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Profile HMM’s
All M/D pairs must
be visited once
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Profile HMM’s
(Sonnhammer, von Heijne, and Krogh)
Model TM length
distribution.
Power of HMM.
Difficult in alignment.
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
TMHMM (trans-membrane
HMM)
x
Inter-genic
region
xxxxxxxxATGccc
Region around
start codon
ccc
cccTAAxxxxxxxx
Coding
region
Region around
stop codon
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Combination of HMM’s Gene finding
•
•
•
HMMER (http://hmmer.wustl.edu/)
– S.R. Eddy, WashU St. Louis. Freely available.
SAM (http://www.cse.ucsc.edu/research/compbio/sam.html)
– R. Hughey, K. Karplus, A. Krogh, D. Haussler and others, UC Santa Cruz.
Freely available to academia, nominal license fee for commercial users.
META-MEME (http://metameme.sdsc.edu/)
– William Noble Grundy, UC San Diego. Freely available. Combines
features of PSSM search and profile HMM search.
• NET-ID, HMMpro (http://www.netid.com/html/hmmpro.html)
– Freely available to academia, nominal license fee for commercial users.
– Allows HMM architecture construction.
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
HMM packages
Copyright (C) 1998 by Anders Krogh
Header {alphabet ACGT;}
begin {trans s1 s2;}
S1 {trans s2 End;}
S2 {trans s1 End;}
End {letter NULL;}
0.5
B
S2
S1
0.5
1.0
A 0.25
C 0.25
G 0.25
T 0.25
End
0.5
A 0.25
C 0.25
G 0.25
T 0.25
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
trainanhmm 1.221