Predicting peptide MHC interactions
Morten Nielsen,
CBS, Department of Systems Biology,
DTU
MHC Class I pathway
Finding the needle in the haystack
1 in 200 peptides makes it
to the surface
Figure by Eric A.J. Reits
Or, Finding the needle in the haystack
Objectives
• Visualization of binding motifs
– Construction of sequence logos
• Understand the concepts of weight matrix construction
– One of the most important methods of bioinformatics
• A few words on artificial neural networks
• MHC binding rules
– No other factors in the MHC (I and II) pathways are
(as) decisive for T cell epitope identification
• All known T cell epitopes have specific MHC restrictions
matching their host
• MHC binding is the single most important feature for
understanding cellular immunity
Binding Motif. MHC class I with peptide
Anchor positions
Sequence information
SLLPAIVEL
LLDVPTAAV
HLIDYLVTS
ILFGHENRV
LERPGGNEI
PLDGEYFTL
ILGFVFTLT
KLVALGINA
KTWGQYWQV
SLLAPGAKQ
ILTVILGVL
TGAPVTYST
GAGIGVAVL
KARDPHSGH
AVFDRKSDA
GLCTLVAML
VLHDDLLEA
ISNDVCAQV
YTAFTIPSI
NMFTPYIGV
VVLGVVFGI
GLYDGMEHL
EAAGIGILT
YLSTAFARV
FLDEFMEGV
AAGIGILTV
AAGIGILTV
YLLPAIVHI
VLFRGGPRG
ILAPPVVKL
ILMEHIHKL
ALSNLEVKL
GVLVGVALI
LLFGYPVYV
DLMGYIPLV
TITDQVPFS
KIFGSLAFL
KVLEYVIKV
VIYQYMDDL
IAGIGILAI
KACDPHSGH
LLDFVRFMG
FIDSYICQV
LMWITQCFL
VKTDGNPPE
RLMKQDFSV
LMIIPLINV
ILHNGAYSL
KMVELVHFL
TLDSQVMSL
YLLEMLWRL
ALQPGTALL
FLPSDFFPS
FLPSDFFPS
TLWVDPYEV
MVDGTLLLL
ALFPQLVIL
ILDQKINEV
ALNELLQHV
RTLDKVLEV
GLSPTVWLS
RLVTLKDIV
AFHHVAREL
ELVSEFSRM
FLWGPRALV
VLPDVFIRC
LIVIGILIL
ACDPHSGHF
VLVKSPNHV
IISAVVGIL
SLLMWITQC
SVYDFFVWL
RLPRIFCSC
TLFIGSHVV
MIMVKCWMI
YLQLVFGIE
STPPPGTRV
SLDDYNHLV
VLDGLDVLL
SVRDRLARL
AAGIGILTV
GLVPFLVSV
YMNGTMSQV
GILGFVFTL
SLAGGIIGV
DLERKVESL
HLSTAFARV
WLSLLVPFV
MLLAVLYCL
YLNKIQNSL
KLTPLCVTL
GLSRYVARL
VLPDVFIRC
LAGIGLIAA
SLYNTVATL
GLAPPQHLI
VMAGVGSPY
QLSLLMWIT
FLYGALLLA
FLWGPRAYA
SLVIVTTFV
MLGTHTMEV
MLMAQEALA
KVAELVHFL
RTLDKVLEV
SLYSFPEPE
SLREWLLRI
FLPSDFFPS
KLLEPVLLL
MLLSVPLLL
STNRQSGRQ
LLIENVASL
FLGENISNF
RLDSYVRSL
FLPSDFFPS
AAGIGILTV
MMRKLAILS
VLYRYGSFS
FLLTRILTI
AVGIGIAVV
VDGIGILTI
RGPGRAFVT
LLGRNSFEV
LLWTLVVLL
LLGATCMFV
VLFSSDFRI
RLLQETELV
VLQWASLAV
MLGTHTMEV
LMAQEALAF
IMIGVLVGV
GLPVEYLQV
ALYVDSLFF
LLSAWILTA
AAGIGILTV
LLDVPTAAV
SLLGLLVEV
GLDVLTAKV
FLLWATAEA
ALSDHHIYL
YMNGTMSQV
CLGGLLTMV
YLEPGPVTA
AIMDKNIIL
YIGEVLVSV
HLGNVKYLV
LVVLGLLAV
GAGIGVLTA
NLVPMVATV
PLTFGWCYK
SVRDRLARL
RLTRFLSRV
LMWAKIGPV
SLFEGIDFY
ILAKFLHWL
SLADTNSLA
VYDGREHTV
ALCRWGLLL
KLIANNTRV
SLLQHLIGL
AAGIGILTV
FLWGPRALV
LLDVPTAAV
ALLPPINIL
RILGAVAKV
SLPDFGISY
GLSEFTEYL
GILGFVFTL
FIAGNSAYE
LLDGTATLR
IMDKNIILK
CINGVCWTV
GIAGGLALL
ALGLGLLPV
AAGIGIIQI
GLHCYEQLV
VLEWRFDSR
LLMDCSGSI
YMDGTMSQV
SLLLELEEV
SLDQSVVEL
STAPPHVNV
LLWAARPRL
YLSGANLNL
LLFAGVQCQ
FIYAGSLSA
ELTLGEFLK
AVPDEIPPL
ETVSEQSNV
LLDVPTAAV
TLIKIQHTL
QVCERIPTI
KKREEAPSL
STAPPAHGV
ILKEPVHGV
KLGEFYNQM
ITDQVPFSV
SMVGNWAKV
VMNILLQYV
GLQDCTMLV
GIGIGVLAA
QAGIGILLA
PLKQHFQIV
TLNAWVKVV
CLTSTVQLV
FLTPKKLQC
SLSRFSWGA
RLNMFTPYI
LLLLTVLTV
GVALQTMKQ
RMFPNAPYL
VLLCESTAV
KLVANNTRL
MINAYLDKL
FAYDGKDYI
ITLWQRPLV
Sequence Information
• Say that a peptide must have L at P2 in order to bind, and that A, F, W, and Y are found at P1. Which position has most information?
• How many questions do I need to ask to tell if a peptide binds looking at only P1 or P2?
• P1: 4 questions (at most)
• P2: 1 question (L or not)
• P2 has the most information
• Calculate $p_a$ at each position
• Entropy: $S = -\sum_a p_a \log_2 p_a$
• Information content: $I = \log_2(20) - S$
• Conserved positions
– $p_V = 1$, $p_{a \neq V} = 0$ => S = 0, I = log(20)
• Mutable positions
– $p_a = 1/20$ for all a => S = log(20), I = 0
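The two limiting cases above can be checked numerically. The following is a minimal sketch, assuming base-2 logarithms (which agrees with the S and I values tabulated for the HLA-A0201 peptides) and using two made-up alignment columns; the function names are illustrative only.

```python
import math
from collections import Counter

ALPHABET = "ARNDCQEGHILKMFPSTWYV"

def column_entropy(column):
    """Shannon entropy S (in bits) of the amino acids in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_content(column):
    """Information content I = log2(20) - S of one alignment column."""
    return math.log2(len(ALPHABET)) - column_entropy(column)

conserved = "V" * 20   # hypothetical column where only V is observed
mutable = ALPHABET     # hypothetical column where every amino acid occurs once

print(column_entropy(conserved), information_content(conserved))
print(column_entropy(mutable), information_content(mutable))
```

The conserved column gives zero entropy and the maximal information log2(20); the fully mutable column gives the opposite.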
Information content
Amino acid frequencies per position (P1-P9), with entropy S and information content I per position

                    1     2     3     4     5     6     7     8     9
A                0.10  0.07  0.08  0.07  0.04  0.04  0.14  0.05  0.07
R                0.06  0.00  0.03  0.04  0.04  0.03  0.01  0.09  0.01
N                0.01  0.00  0.05  0.02  0.04  0.03  0.03  0.04  0.00
D                0.02  0.01  0.10  0.11  0.04  0.01  0.03  0.01  0.00
C                0.01  0.01  0.02  0.01  0.01  0.02  0.02  0.01  0.02
Q                0.02  0.00  0.02  0.04  0.04  0.03  0.03  0.05  0.02
E                0.02  0.01  0.01  0.08  0.05  0.03  0.04  0.07  0.02
G                0.09  0.01  0.12  0.15  0.16  0.04  0.03  0.05  0.01
H                0.01  0.00  0.02  0.01  0.04  0.02  0.05  0.02  0.01
I                0.07  0.08  0.03  0.10  0.02  0.14  0.07  0.04  0.08
L                0.11  0.59  0.12  0.04  0.08  0.13  0.15  0.14  0.26
K                0.06  0.01  0.01  0.03  0.04  0.02  0.01  0.04  0.01
M                0.04  0.07  0.03  0.01  0.01  0.03  0.03  0.02  0.01
F                0.08  0.01  0.05  0.02  0.06  0.07  0.07  0.05  0.02
P                0.01  0.00  0.06  0.09  0.10  0.03  0.06  0.05  0.00
S                0.11  0.01  0.06  0.07  0.02  0.05  0.07  0.08  0.04
T                0.03  0.06  0.04  0.04  0.06  0.08  0.04  0.10  0.02
W                0.01  0.00  0.04  0.02  0.02  0.01  0.03  0.01  0.00
Y                0.05  0.01  0.04  0.00  0.05  0.03  0.02  0.04  0.01
V                0.08  0.08  0.07  0.05  0.09  0.15  0.08  0.03  0.38

Entropy S        3.96  2.16  4.06  3.87  4.04  3.92  3.98  4.04  2.78
Information I    0.37  2.16  0.26  0.45  0.28  0.40  0.34  0.28  1.55
Sequence logos
• Height of a column equals I
• Relative height of a letter is p
• Highly useful tool to visualize sequence motifs
http://www.cbs.dtu.dk/~gorodkin/appl/plogo.html
HLA-A0201
High information
positions
Characterizing a binding motif from
small data sets
ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
Sequence weighting
• Poor or biased sampling of sequence space
• The five ALAKAAAA* peptides are similar sequences: weight 1/5 each
• Example P1 (after weighting)
PA = 2/6
PG = 2/6
PT = PK = 1/6
PC = PD = … = PV = 0
RLLDDTPEV 84 nM
GLLGNVSTV 23 nM
ALAKAAAAL 309 nM
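The P1 frequencies quoted above can be reproduced with a small sketch. It assumes a simple greedy clustering rule (a peptide joins a cluster when it differs from the cluster's first member in at most one position), which is a hypothetical stand-in for whatever clustering the slides use; each peptide then gets weight 1/(cluster size).

```python
from collections import defaultdict

peptides = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]

def hamming(a, b):
    """Number of positions at which two equal-length peptides differ."""
    return sum(x != y for x, y in zip(a, b))

def cluster(seqs, max_diff=1):
    """Greedy clustering (assumed rule): join the first cluster whose
    representative differs in at most max_diff positions."""
    clusters = []
    for s in seqs:
        for c in clusters:
            if hamming(s, c[0]) <= max_diff:
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters

def weighted_frequencies(seqs, position):
    """Amino acid frequencies at one position, each peptide weighted 1/cluster size."""
    weights = {s: 1.0 / len(c) for c in cluster(seqs) for s in c}
    total = sum(weights.values())
    freq = defaultdict(float)
    for s in seqs:
        freq[s[position]] += weights[s] / total
    return dict(freq)

p1 = weighted_frequencies(peptides, 0)
print(p1)
```

With these ten peptides the five ALAKAAAA* sequences collapse into one cluster, so six clusters remain and A and G each end up with frequency 2/6 at P1.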
Pseudo counts
• I is not found at position P9. Does this mean that I is forbidden (P(I)=0)?
• No! Use the Blosum substitution matrix to estimate a pseudo frequency of I at P9
The Blosum (substitution frequency) matrix
The entry in row a, column b is q(a|b), the frequency with which amino acid a is found aligned to amino acid b.

      A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
A  0.29 0.04 0.04 0.04 0.07 0.06 0.06 0.08 0.04 0.05 0.04 0.06 0.05 0.03 0.06 0.11 0.07 0.03 0.04 0.07
R  0.03 0.34 0.04 0.03 0.02 0.07 0.05 0.02 0.05 0.02 0.02 0.11 0.03 0.02 0.03 0.04 0.04 0.02 0.03 0.02
N  0.03 0.04 0.32 0.07 0.02 0.04 0.04 0.04 0.05 0.01 0.01 0.04 0.02 0.02 0.02 0.05 0.04 0.02 0.02 0.02
D  0.03 0.03 0.08 0.40 0.02 0.05 0.09 0.03 0.04 0.02 0.02 0.04 0.02 0.02 0.03 0.05 0.04 0.02 0.02 0.02
C  0.02 0.01 0.01 0.01 0.48 0.01 0.01 0.01 0.01 0.02 0.02 0.01 0.02 0.01 0.01 0.02 0.02 0.01 0.01 0.02
Q  0.03 0.05 0.03 0.03 0.01 0.21 0.06 0.02 0.04 0.01 0.02 0.05 0.03 0.01 0.02 0.03 0.03 0.02 0.02 0.02
E  0.04 0.05 0.05 0.09 0.02 0.10 0.30 0.03 0.05 0.02 0.02 0.07 0.03 0.02 0.04 0.05 0.04 0.02 0.03 0.02
G  0.08 0.03 0.07 0.05 0.03 0.04 0.04 0.51 0.04 0.02 0.02 0.04 0.03 0.03 0.04 0.07 0.04 0.03 0.02 0.02
H  0.01 0.02 0.03 0.02 0.01 0.03 0.03 0.01 0.35 0.01 0.01 0.02 0.02 0.02 0.01 0.02 0.01 0.02 0.05 0.01
I  0.04 0.02 0.02 0.02 0.04 0.03 0.02 0.02 0.02 0.27 0.12 0.03 0.10 0.06 0.03 0.03 0.05 0.03 0.04 0.16
L  0.06 0.05 0.03 0.03 0.07 0.05 0.04 0.03 0.04 0.17 0.38 0.04 0.20 0.11 0.04 0.04 0.07 0.05 0.07 0.13
K  0.04 0.12 0.05 0.04 0.02 0.09 0.08 0.03 0.05 0.02 0.03 0.28 0.04 0.02 0.04 0.05 0.05 0.02 0.03 0.03
M  0.02 0.02 0.01 0.01 0.02 0.02 0.01 0.01 0.02 0.04 0.05 0.02 0.16 0.03 0.01 0.02 0.02 0.02 0.02 0.03
F  0.02 0.02 0.02 0.01 0.02 0.01 0.02 0.02 0.03 0.04 0.05 0.02 0.05 0.39 0.01 0.02 0.02 0.06 0.13 0.04
P  0.03 0.02 0.02 0.02 0.02 0.02 0.03 0.02 0.02 0.01 0.01 0.03 0.02 0.01 0.49 0.03 0.03 0.01 0.02 0.02
S  0.09 0.04 0.07 0.05 0.04 0.06 0.06 0.05 0.04 0.03 0.02 0.05 0.04 0.03 0.04 0.22 0.09 0.02 0.03 0.03
T  0.05 0.03 0.05 0.04 0.04 0.04 0.04 0.03 0.03 0.04 0.03 0.04 0.04 0.03 0.04 0.08 0.25 0.02 0.03 0.05
W  0.01 0.01 0.00 0.00 0.00 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.02 0.00 0.01 0.01 0.49 0.03 0.01
Y  0.02 0.02 0.02 0.01 0.01 0.02 0.02 0.01 0.06 0.02 0.02 0.02 0.02 0.09 0.01 0.02 0.02 0.07 0.32 0.02
V  0.07 0.03 0.03 0.02 0.06 0.04 0.03 0.02 0.02 0.18 0.10 0.03 0.09 0.06 0.03 0.04 0.07 0.03 0.05 0.27

Some amino acids are highly conserved (e.g. C),
some have a high chance of mutation (e.g. I)

Pseudo count estimation
• Calculate observed amino acid frequencies $f_a$
• Pseudo frequency for amino acid b: $g_b = \sum_a f_a \cdot q_{b|a}$
• Example
$g_I = 0.2 \cdot q_{I|M} + 0.1 \cdot q_{I|R} + \ldots + 0.3 \cdot q_{I|V} + 0.1 \cdot q_{I|L}$
$g_I = 0.2 \cdot 0.1 + 0.1 \cdot 0.02 + \ldots + 0.3 \cdot 0.16 + 0.1 \cdot 0.12 = 0.094$
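The pseudo frequency of I at P9 can be recomputed from the ten peptides. This sketch takes the observed P9 frequencies directly from the peptide list and reads the conditional frequencies q(I|b) from row I of the Blosum frequency matrix; only the six entries needed for this example are copied, and the variable names are illustrative.

```python
from collections import Counter

peptides = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]

# q(I | b): row I of the Blosum frequency matrix, restricted to the
# amino acids b that are actually observed at P9.
Q_I_GIVEN = {"M": 0.10, "N": 0.02, "R": 0.02, "T": 0.05, "V": 0.16, "L": 0.12}

def observed_frequencies(seqs, position):
    """Unweighted amino acid frequencies f_a at one position."""
    counts = Counter(s[position] for s in seqs)
    return {aa: n / len(seqs) for aa, n in counts.items()}

f9 = observed_frequencies(peptides, 8)               # P9 is index 8
g_I = sum(f9[b] * Q_I_GIVEN[b] for b in f9)          # g_I = sum_b f_b * q(I|b)
print(round(g_I, 3))
```

Although I is never observed at P9, its pseudo frequency is clearly non-zero because V, L and M are observed there and frequently substitute for I.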
Weight on pseudo count
• Pseudo counts are important when only limited data is available
• With large data sets only “true” observations should count
• Observed and pseudo frequencies are combined as $p_a = \frac{\alpha \cdot f_a + \beta \cdot g_a}{\alpha + \beta}$
• α is the effective number of sequences minus one (N-1), β is the weight on the prior
– In clustering: α = #clusters - 1
– In heuristics: α = <# different amino acids in each column> - 1
• If α is large, p ≈ f and only the observed data defines the motif
• If α is small, p ≈ g and the pseudo counts (or prior) define the motif
• β is normally in the range 50-200
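How the observed and pseudo frequencies are balanced can be sketched in a few lines. The sketch assumes the weighted average takes the form p = (α·f + β·g)/(α + β), with α the effective number of sequences minus one and β the weight on the prior; the example values (α = 5 for six clusters, β = 50, and a pseudo frequency of 0.094 for I at P9) are illustrative.

```python
def effective_frequency(f, g, alpha, beta):
    """Weighted average of observed frequency f and pseudo frequency g."""
    return (alpha * f + beta * g) / (alpha + beta)

# I is not observed at P9 (f = 0) but has a pseudo frequency of about 0.094.
p_small_data = effective_frequency(0.0, 0.094, alpha=5, beta=50)

# With very many effective sequences the observed frequency dominates.
p_large_data = effective_frequency(0.30, 0.094, alpha=1e6, beta=50)

print(round(p_small_data, 4), round(p_large_data, 4))
```

With few sequences the estimate stays close to the prior; with many it converges to the observed frequency.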
Sequence weighting and pseudo counts
Position specific weighting
• We know that positions 2 and
9 are anchor positions for
most MHC binding motifs
– Increase weight on high
information positions
• Motif found on large data set
Weight matrices
• Estimate amino acid frequencies from alignment including sequence weighting and pseudo count

       1     2     3     4     5     6     7     8     9
A   0.08  0.04  0.08  0.08  0.06  0.06  0.10  0.05  0.08
R   0.06  0.01  0.04  0.05  0.04  0.03  0.02  0.07  0.02
N   0.02  0.01  0.05  0.03  0.05  0.03  0.04  0.04  0.01
D   0.03  0.01  0.07  0.10  0.03  0.03  0.04  0.03  0.01
C   0.02  0.01  0.02  0.01  0.01  0.03  0.02  0.01  0.02
Q   0.02  0.01  0.03  0.05  0.04  0.03  0.03  0.04  0.02
E   0.03  0.02  0.03  0.08  0.05  0.04  0.04  0.06  0.03
G   0.08  0.02  0.08  0.13  0.11  0.06  0.05  0.06  0.02
H   0.02  0.01  0.02  0.01  0.03  0.02  0.04  0.03  0.01
I   0.08  0.11  0.05  0.05  0.04  0.10  0.08  0.06  0.10
L   0.11  0.44  0.11  0.06  0.09  0.14  0.12  0.13  0.23
K   0.06  0.02  0.03  0.05  0.04  0.04  0.02  0.06  0.03
M   0.04  0.06  0.03  0.01  0.02  0.03  0.03  0.02  0.02
F   0.06  0.03  0.06  0.03  0.06  0.05  0.06  0.05  0.04
P   0.02  0.01  0.04  0.08  0.06  0.04  0.07  0.04  0.01
S   0.09  0.02  0.06  0.06  0.04  0.06  0.06  0.08  0.04
T   0.04  0.05  0.05  0.04  0.05  0.06  0.05  0.07  0.04
W   0.01  0.00  0.03  0.02  0.02  0.01  0.03  0.01  0.00
Y   0.04  0.01  0.05  0.01  0.05  0.03  0.03  0.04  0.02
V   0.08  0.10  0.07  0.05  0.08  0.13  0.08  0.05  0.25

• What do the numbers mean?
– P2(V) > P2(M). Does this mean that V enables binding more than M?
– In nature not all amino acids are found equally often
• In nature V is found more often than M, so we must somehow rescale with the background
• qM = 0.025, qV = 0.073
• Finding 7% V is hence not significant, but 7% M is highly significant
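The effect of rescaling with the background can be illustrated with the P2 values for V and M. This sketch divides each frequency by its background frequency and takes the logarithm; the use of base 2 is an assumption made here for illustration, as is the variable naming.

```python
import math

p2 = {"V": 0.10, "M": 0.06}            # frequencies at P2 from the table
background = {"V": 0.073, "M": 0.025}  # qV and qM as quoted on the slide

# Log-odds of finding each amino acid at P2 relative to the background.
log_odds = {aa: math.log2(p2[aa] / background[aa]) for aa in p2}

for aa, value in log_odds.items():
    print(aa, round(value, 2))
```

Although V is more frequent than M at P2, M comes out with the larger log-odds value once the background is taken into account.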
Weight matrices
• A weight matrix is given as
$W_{ij} = \log(p_{ij}/q_j)$
– where i is a position in the motif, and j an amino acid. $q_j$ is the background frequency for amino acid j.
• W is an L x 20 matrix, where L is the motif length (shown here with amino acids as rows and positions as columns)

       1     2     3     4     5     6     7     8     9
A    0.6  -1.6   0.2  -0.1  -1.6  -0.7   1.1  -2.2  -0.2
R    0.4  -6.6  -1.3  -0.1  -0.1  -1.4  -3.8   1.0  -3.5
N   -3.5  -6.5   0.1  -2.0   0.1  -1.0  -0.2  -0.8  -6.1
D   -2.4  -5.4   1.5   2.0  -2.2  -2.3  -1.3  -2.9  -4.5
C   -0.4  -2.5   0.0  -1.6  -1.2   1.1   1.3  -1.4   0.7
Q   -1.9  -4.0  -1.8   0.5   0.4  -1.3  -0.3   0.4  -0.8
E   -2.7  -4.7  -3.3   0.8  -0.5  -1.4  -1.3   0.1  -2.5
G    0.3  -3.7   0.4   2.0   1.9  -0.2  -1.4  -0.4  -4.0
H   -1.1  -6.3   0.5  -3.3   1.2  -1.0   2.1   0.2  -2.6
I    1.0   1.0  -1.0   0.1  -2.2   1.8   0.6  -0.0   0.9
L    0.3   5.1   0.3  -1.7  -0.5   0.8   0.7   1.1   2.8
K    0.0  -3.7  -2.5  -1.0  -1.3  -1.9  -5.0  -0.5  -3.0
M    1.4   3.1   1.2  -2.2  -2.2   0.2   1.1  -0.5  -1.8
F    1.2  -4.2   1.0  -1.6   1.7   1.0   0.9   0.7  -1.4
P   -2.7  -4.3  -0.1   1.7   1.2  -0.4   1.3  -0.3  -6.2
S    1.4  -4.2  -0.3  -0.6  -2.5  -0.6  -0.5   0.8  -1.9
T   -1.2  -0.2  -0.5  -0.2  -0.1   0.4  -0.9   0.8  -1.6
W   -2.0  -5.9   3.4   1.3   1.7  -0.5   2.9  -0.7  -4.9
Y    1.1  -3.8   1.6  -6.8   1.5  -0.0  -0.4   1.3  -1.6
V    0.7   0.4   0.0  -0.7   1.0   2.1   0.5  -1.1   4.5
Scoring a sequence to a weight matrix
• Score sequences to the weight matrix by looking up and adding L values from the matrix (the same weight matrix as on the previous slide)

Peptide     Score   Measured affinity
RLLDDTPEV   11.9    84 nM
GLLGNVSTV   14.7    23 nM
ALAKAAAAL    4.3    309 nM

Which peptide is most likely to bind?
Which peptide second?
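The three scores can be reproduced by summing one matrix value per peptide position. The sketch below copies only the weight matrix rows for the amino acids that occur in the three peptides; the dictionary layout and function name are illustrative choices.

```python
# Weight matrix rows (positions P1..P9) for the amino acids needed here,
# copied from the weight matrix in the slides.
W = {
    "A": [0.6, -1.6, 0.2, -0.1, -1.6, -0.7, 1.1, -2.2, -0.2],
    "R": [0.4, -6.6, -1.3, -0.1, -0.1, -1.4, -3.8, 1.0, -3.5],
    "N": [-3.5, -6.5, 0.1, -2.0, 0.1, -1.0, -0.2, -0.8, -6.1],
    "D": [-2.4, -5.4, 1.5, 2.0, -2.2, -2.3, -1.3, -2.9, -4.5],
    "E": [-2.7, -4.7, -3.3, 0.8, -0.5, -1.4, -1.3, 0.1, -2.5],
    "G": [0.3, -3.7, 0.4, 2.0, 1.9, -0.2, -1.4, -0.4, -4.0],
    "L": [0.3, 5.1, 0.3, -1.7, -0.5, 0.8, 0.7, 1.1, 2.8],
    "K": [0.0, -3.7, -2.5, -1.0, -1.3, -1.9, -5.0, -0.5, -3.0],
    "P": [-2.7, -4.3, -0.1, 1.7, 1.2, -0.4, 1.3, -0.3, -6.2],
    "S": [1.4, -4.2, -0.3, -0.6, -2.5, -0.6, -0.5, 0.8, -1.9],
    "T": [-1.2, -0.2, -0.5, -0.2, -0.1, 0.4, -0.9, 0.8, -1.6],
    "V": [0.7, 0.4, 0.0, -0.7, 1.0, 2.1, 0.5, -1.1, 4.5],
}

def score(peptide, matrix):
    """Sum the matrix value of each residue at its position."""
    return sum(matrix[aa][pos] for pos, aa in enumerate(peptide))

scores = {p: round(score(p, W), 1) for p in ["RLLDDTPEV", "GLLGNVSTV", "ALAKAAAAL"]}
print(scores)
```

The ranking by score (GLLGNVSTV, then RLLDDTPEV, then ALAKAAAAL) follows the ranking by measured affinity (23 nM, 84 nM, 309 nM).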
Example from real life
• 10 peptides from MHCpep
database
• Bind to the MHC complex
• Relevant for immune
system recognition
• Estimate sequence motif
and weight matrix
• Evaluate motif
“correctness” on 528
peptides
ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
Prediction accuracy
[Scatter plot: prediction score versus measured affinity, Pearson correlation 0.45]
Predictive performance
[Bar chart: Pearson correlation (CC) per method]

Method         CC
Simple         0.45
Seq.W          0.5
Seq.W+SC       0.6
Seq.W+SC+PW    0.65
Large dataset  0.79
Summary I. PSSMs
• The sequence logo is a powerful tool to visualize (binding) motifs
– Information content identifies essential residues for function and/or structural stability
• Weight matrices and sequence profiles can be derived from a very limited amount of data using the techniques of
– Sequence weighting
– Pseudo counts
Is there anything beyond weight matrices?
• The effect on the binding affinity
of having a given amino acid at one
position can be influenced by the
amino acids at other positions in the
peptide (sequence correlations).
– Two adjacent amino acids may for
example compete for the space in a
pocket in the MHC molecule.
• Artificial neural networks (ANN)
are ideally suited to take such
correlations into account
Higher order sequence correlations
Neural networks can learn higher order correlations!
– What does this mean?
Say that the peptide needs one and only one large amino acid in the positions P3 and P4 to fill the binding cleft.
How would you formulate this to test if a peptide can bind?
(S: small amino acid, L: large amino acid)
S S => 0
L S => 1
S L => 1
L L => 0
No linear function can learn this (XOR) pattern
Linear functions (like PSSMs) cannot learn higher order signals
XOR
XOR function:
0 0 => 0
1 0 => 1
0 1 => 1
1 1 => 0
[Figure: the points (0,0), (1,0), (0,1) and (1,1) with the OR and AND separating lines. No linear function can separate the points for XOR]
Error estimates

Input   XOR   Predict   Error
0 0     0     0         0
1 0     1     1         0
0 1     1     1         0
1 1     0     1         1

Mean error: 1/4
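The claim that a linear function cannot do better than a mean error of 1/4 on XOR can be probed by brute force. This sketch scans a coarse grid of weights and biases for a thresholded linear function; the grid and the threshold at zero are arbitrary choices made for illustration, not part of the original slides.

```python
from itertools import product

XOR = {(0, 0): 0, (1, 0): 1, (0, 1): 1, (1, 1): 0}

def linear_predict(x, v1, v2, bias):
    """Thresholded linear function of the two inputs."""
    return 1 if x[0] * v1 + x[1] * v2 + bias > 0 else 0

def mean_error(v1, v2, bias):
    """Fraction of the four XOR patterns that are predicted wrongly."""
    wrong = sum(linear_predict(x, v1, v2, bias) != target for x, target in XOR.items())
    return wrong / len(XOR)

# Scan weights and bias from -3.0 to 3.0 in steps of 0.5.
grid = [i / 2 for i in range(-6, 7)]
best = min(mean_error(v1, v2, b) for v1, v2, b in product(grid, repeat=3))
print(best)
```

An OR-like choice such as v1 = v2 = 1 with bias -0.5 gets three of the four patterns right, and no setting on the grid gets all four.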
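Neural networks
Linear function
$y = x_1 \cdot v_1 + x_2 \cdot v_2$
Neural networks. How does it work?
[Figure: network with two inputs plus a bias input fixed at 1, input-to-hidden weights w11, w12, w21, w22, wt1, wt2 and hidden-to-output weights v1, v2, vt]
$o = \sum_i x_i \cdot w_i$
$O = \frac{1}{1 + \exp(-o)}$
Neural networks
Neural network learning higher order correlations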
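A network with one hidden layer can represent the XOR pattern that a linear function cannot. The sketch below uses the sigmoid neuron described above with hand-picked weights (one hidden neuron acting as OR, one as NAND, and an output neuron acting as AND); the weights are illustrative assumptions, not trained values from the lecture.

```python
import math

def neuron(inputs, weights, bias_weight):
    """Sigmoid neuron: o = sum(x_i * w_i) + bias_weight, O = 1 / (1 + exp(-o))."""
    o = sum(x * w for x, w in zip(inputs, weights)) + bias_weight
    return 1.0 / (1.0 + math.exp(-o))

def xor_network(x1, x2):
    """Two hidden neurons and one output neuron with hand-picked weights."""
    h1 = neuron([x1, x2], [20.0, 20.0], -10.0)    # behaves like OR
    h2 = neuron([x1, x2], [-20.0, -20.0], 30.0)   # behaves like NAND
    return neuron([h1, h2], [20.0, 20.0], -30.0)  # behaves like AND

for x1, x2 in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(x1, x2, round(xor_network(x1, x2)))
```

The hidden layer is what lets the network combine the two inputs in a non-linear way.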
Mutual information
• How is mutual information calculated?
• Information content was calculated as
$I = \sum_a p_a \log\left(\frac{p_a}{q_a}\right)$
• Gives information in a single position
• Similar relation for mutual information
$I = \sum_{a,b} p_{ab} \log\left(\frac{p_{ab}}{p_a \cdot p_b}\right)$
• Gives mutual information between two positions
Mutual information. Example
Knowing that you have G at P1 allows you to make an educated guess on what you will find at P6.
P(V6) = 4/9. P(V6|G1) = 1.0!
$I = \sum_{a,b} p_{ab} \log\left(\frac{p_{ab}}{p_a \cdot p_b}\right)$
P(G1) = 2/9 = 0.22, ..
P(V6) = 4/9 = 0.44, ..
P(G1,V6) = 2/9 = 0.22,
P(G1)*P(V6) = 8/81 = 0.10
log(0.22/0.10) > 0
Peptides (P1 is the first residue, P6 the sixth):
ALWGFFPVA
ILKEPVHGV
ILGFVFTLT
LLFGYPVYV
GLSPTVWLS
YMNGTMSQV
GILGFVFTL
WLSLLVPFV
FLPSDFFPS
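The probabilities in this example, and the full mutual information between P1 and P6, can be computed directly from the nine peptides. This is a minimal sketch; base-2 logarithms and the function name are assumptions made for illustration.

```python
import math
from collections import Counter

peptides = [
    "ALWGFFPVA", "ILKEPVHGV", "ILGFVFTLT", "LLFGYPVYV", "GLSPTVWLS",
    "YMNGTMSQV", "GILGFVFTL", "WLSLLVPFV", "FLPSDFFPS",
]

def mutual_information(seqs, i, j):
    """Sum over observed pairs (a, b) of p_ab * log2(p_ab / (p_a * p_b))."""
    n = len(seqs)
    pa = Counter(s[i] for s in seqs)
    pb = Counter(s[j] for s in seqs)
    pab = Counter((s[i], s[j]) for s in seqs)
    return sum(
        (c / n) * math.log2((c / n) / ((pa[a] / n) * (pb[b] / n)))
        for (a, b), c in pab.items()
    )

n = len(peptides)
p_g1 = sum(s[0] == "G" for s in peptides) / n
p_v6 = sum(s[5] == "V" for s in peptides) / n
p_g1_v6 = sum(s[0] == "G" and s[5] == "V" for s in peptides) / n
print(p_g1, p_v6, p_g1_v6, mutual_information(peptides, 0, 5))
```

With only nine peptides the estimate is dominated by small-sample noise, so it mainly illustrates the calculation.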
Mutual information
313 binding peptides
313 random peptides
Neural network training
• A network contains a very large set of parameters
– A network with 5 hidden neurons predicting binding for 9meric peptides has more than 9x20x5=900 weights
• Overfitting is a problem
• Stop training when test performance is optimal
[Figure: overfitting illustration, temperature plotted against years]
Neural network training. Cross validation
Cross validation
Train on 4/5 of data
Test on 1/5
=>
Produce 5 different
neural networks each
with a different
prediction focus
Neural network training curve
Maximum test set performance
Most capable of generalizing
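The 5-fold scheme (train on 4/5 of the data, test on the remaining 1/5, five times over) can be sketched as a small generator. The way items are assigned to folds here (every fifth item) is an arbitrary illustrative choice; the data are stand-in integers.

```python
def five_fold_splits(items, n_folds=5):
    """Yield (train, test) pairs; each fold serves as the test set exactly once."""
    folds = [items[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [x for j, fold in enumerate(folds) if j != k for x in fold]
        yield train, test

data = list(range(20))  # stand-in for 20 peptides
splits = list(five_fold_splits(data))
for train, test in splits:
    print(len(train), len(test), test)
```

Each of the five resulting networks sees a different 4/5 of the data, which is why they end up with different prediction focus.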
Network ensembles
5 fold training
Which network to choose?
[Bar chart: Pearson correlation (axis 0.7 to 0.95) on train and test data for the five networks syn.00 to syn.04]
5 fold training
[Bar chart: Pearson correlation (axis 0.7 to 0.95) on train, test and evaluation (Eval) data for syn.00 to syn.04 and the ensemble (ens)]
The Wisdom of the Crowds
• The Wisdom of Crowds. Why the Many are Smarter than the Few. James Surowiecki
One day in the fall of 1906, the British scientist Francis
Galton left his home and headed for a country fair… He
believed that only a very few people had the
characteristics necessary to keep societies healthy. He
had devoted much of his career to measuring those
characteristics, in fact, in order to prove that the vast
majority of people did not have them. … Galton came
across a weight-judging competition… Eight hundred people
tried their luck. They were a diverse lot, butchers,
farmers, clerks and many other non-experts… The crowd
had guessed … 1,197 pounds, the ox weighed 1,198
Network ensembles
• No single network with a particular architecture and sequence encoding scheme will consistently perform the best
• Enlightened despotism fails for neural network predictions too
– For some peptides, BLOSUM encoding with a four neuron hidden layer can best predict the peptide/MHC binding, for other peptides a sparse encoded network with zero hidden neurons performs the best
– Wisdom of the Crowd
• Never use just one neural network
• Use network ensembles
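A simple way to form an ensemble is to average the predictions of the individual networks. The sketch below uses made-up target values and made-up predictions from three hypothetical networks; it only illustrates that the averaged prediction has a squared error no larger than the average error of its members.

```python
def mse(predictions, targets):
    """Mean squared error between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

# Hypothetical measured values and predictions from three networks.
targets = [0.9, 0.1, 0.5, 0.7]
network_predictions = [
    [0.8, 0.3, 0.4, 0.9],
    [1.0, 0.0, 0.7, 0.6],
    [0.7, 0.2, 0.5, 0.5],
]

# Ensemble prediction: average over the networks for each peptide.
ensemble = [sum(column) / len(column) for column in zip(*network_predictions)]

individual_errors = [mse(p, targets) for p in network_predictions]
mean_individual_error = sum(individual_errors) / len(individual_errors)
ensemble_error = mse(ensemble, targets)
print(individual_errors, ensemble_error)
```

For squared error this inequality always holds, whichever single network happens to be best for a given peptide.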
Evaluation of prediction accuracy
[Bar chart: Pearson correlation (Pear) and Aroc per method, axis 0.5 to 1]

Method        Pear   Aroc
Motif         0.76   0.92
PSSM          0.80   0.95
NN-ensemble   0.92   0.98

NN-ensemble: Ensemble of neural networks trained using sparse,
Blosum
NetMHC
www.cbs.dtu.dk/services/NetMHC
Prediction of 10- and 11mers using 9mer prediction tools
Figure by Melani Zolfagharian Khodaie and Mikael Holm Thomsen
• Final prediction = average of the 6 log
scores:
– (0.477+0.405+0.564+0.505+0.559+0.521)/6 = 0.505
• Affinity:
– Exp(log(50000)*(1 - 0.505))
= 211.5 nM
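The arithmetic above can be reproduced in a few lines. The sketch averages the six log scores and converts the result to an affinity with exp(log(50000) * (1 - score)), as on the slide; the function and parameter names are illustrative.

```python
import math

def score_to_affinity_nm(score, max_affinity_nm=50000.0):
    """Convert a log-transformed score in [0, 1] to an affinity in nM."""
    return math.exp(math.log(max_affinity_nm) * (1.0 - score))

log_scores = [0.477, 0.405, 0.564, 0.505, 0.559, 0.521]
final_score = sum(log_scores) / len(log_scores)
affinity = score_to_affinity_nm(final_score)
print(round(final_score, 3), round(affinity, 1))
```

A score of 1 maps to 1 nM and a score of 0 to 50000 nM, so higher scores mean stronger predicted binding.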
Prediction using ANN trained on 10mer peptides
[Bar chart "9-10 mer approximation": Pearson correlation coefficient (axis 0 to 0.9) for the 9mer approximation and for the 10mer-trained approach]
Predicting binding for longer-mers

Allele       Length  Peptide          % Rank
H2-Db        11      SGVENPGGYCL      1.5
H2-Ld        12      IPQSLDSWWTSL     0.1
HLA-A*0101   11      YSEHPTFTSQY      0.05
HLA-A*0101   11      SSDYVIPIGTY      0.05
HLA-A*0101   11      FLEGNEVGKTY      1.5
HLA-A*0201   11      MLMAQEALAFL      0.15
HLA-A*0201   11      GLAPPQHLIRV      0.8
HLA-A*0201   11      LLPENNVLSPL      1
HLA-A*0301   11      RLRDLLLIVTR      1
HLA-A*1101   11      ACQGVGGPGHK      15
HLA-A*1101   12      SVLGPISGHVLK     0.1
HLA-A*3101   11      STLPETTVVRR      0.8
HLA-A*6801   11      FVFPTKDVALR      0.1
HLA-B*0702   11      SPSVDKARAEL      0.4
HLA-B*0702   11      RPHERNGFTVL      0.05
HLA-B*0702   13      RPQGGSRPEFVKL    0.3
HLA-B*2702   11      RRARSLSAERY      0.3
HLA-B*3501   11      HPVGEADYFEY      0.1
HLA-B*3501   11      EPLPQGQLTAY      0.2
HLA-B*3501   14      LPAVVGLSPGEQEY   0.05
HLA-B*3501   12      TPRLPSSADVEF     0.8
HLA-B*3508   13      LPEPLPQGQLTAY    0.1
HLA-B*3508   12      CPSQEPMSIYVY     0.05
HLA-B*4402   11      SELFRSGLDSY      0.15
HLA-B*5703   11      KAFSPEVIPMF      0.4
So we can find the needle in the haystack
• At least in some haystacks
Polymorphism of MHC
• Within a host limited number of loci (genes)
• only 6 different class I molecules (two A, B and C)
• only up to 12 different class II molecules
• Within a population > 100 alleles per locus
More MHC molecules: more diversity in the
presented peptides
~1% probability that an MHC molecule binds a peptide
Different hosts sample different peptides from same pathogen.
MHC polymorphism
Figure by Thomas Blicher
Immunological benefits of MHC
polymorphism
• Heterozygote advantage
– Heterozygotes have a selective advantage because
they can present more peptides (Hughes.n88).
• Coevolution
– Pathogens avoid presentation on common MHC alleles
(HIV)
– Frequency dependent selection
Variations among populations
• Allele frequency varies between populations
• Databases of HLA and MHC frequencies
– allelefrequencies.net
– dbMHC
Heterozygote disadvantage!
(for vaccine design)
• Few human beings will share the same
set of HLA alleles
– Different persons will react to a pathogen
infection in a non-similar manner
• A CTL based vaccine must include
epitopes specific for each HLA allele in a
population
– A CTL based vaccine must consist of ~800
HLA class I epitopes and ~400 class II
epitopes
HLA polymorphism
• The IMGT/HLA Sequence Database currently encompasses more than 1500 HLA class I proteins
Source: http://www.anthonynolan.com/HIG/index.html
HLA specificity clustering
A0201
A0101
A6802
B0702
HLA supertypes

Supertype    Selected allele
A1           A*0101
A2           A*0201
A3           A*1101
A24          A*2401
A26 (new*)   A*2601
B7           B*0702
B8 (new*)    B*0801
B27          B*2705
B39 (new*)   B*3901
B44          B*4001
B58          B*5801
B62          B*1501
Clustering in: O Lund et al., Immunogenetics. 2004 55:797-810
How little we know
• Alleles characterized with 5 or more data points
• 3% covered
HLA polymorphism
• ~70 HLA alleles are characterized by
binding data
• Reliable MHC class I binding predictions
(NetMHC-3.2) for ~50 HLA A and B
molecules
• No methods for HLA-C, and HLA-E
• Long way to cover 2500!
HLA polymorphism!
B0807
A6601
B4058
A3401
B5124
B2728
B4411
B0729
A0265
B3526
A3602
A0254
B4038
B1302
B0714
B3902
B0826
B7804
B3509
B4404
B4808
A2907
A1109
A2313
B4018
B4046
B0818
B5103
A2606
A0209
A2444
B5101
B1502
A6803
A2441
B4804
A0268
B1803
B5106
B4103
A3404
A0220
B3537
B5203
B4445
B0805
B2702
A0304
B4021
B1303
A2503
B3926
B0718
A3306
A3015
A7407
B4431
B3558
B0706
B4403
A0106
B5806
B5109
B1578
B0806
B4430
B1308
B3935
A0278
B5126
B0710
B0817
B1527
B3912
B0811
A6820
B1510
A2314
A3013
A0216
A6808
A6815
A7408
A2909
B1566
B1536
A2428
B4446
A6602
B5704
B1809
A0252
B5134
B1534
B1550
B9507
B0724
B5604
B1538
B4418
B0739
B4406
A2312
A3004
A2426
B1513
B5002
B3801
B1525
B3927
A3107
A2433
B0734
B3530
B1539
B4505
A3201
B7805
B3933
B2714
A0302
A1114
B4905
B1504
B4437
A0222
B4102
B5139
B5138
A0317
B3505
B7802
B1575
A2504
A2454
A3006
B4015
B4441
B4606
A1102
A6817
B5602
A6826
B5703
B4104
A2430
B5512
B3702
B4701
A3308
B1544
B1570
B3549
B4408
B3923
A3209
A2414
B9509
B5611
B4427
B4031
A2601
A0289
B0803
B4432
B4016
B3561
A3007
B1813
A2902
B2724
A2309
A3307
B1574
A2446
B5130
B3811
B5606
B4402
A1110
A0235
B5306
A0214
B4061
A2455
A0285
A0255
B1503
B4105
B5801
A0205
A3301
A0112
A2904
B8101
B1511
A6825
B5121
A2429
B4433
B3922
B0728
A2627
B4407
B8301
B1818
B8102
B1592
B1535
A0307
A0204
B4810
B0725
B0733
B1553
A2914
B1540
B4805
A0316
A0206
A3108
B5708
B4420
B0727
A2439
B2715
A0239
A0256
B3535
B4002
B4429
B5116
B4208
B5507
B3551
A7410
B1585
B3536
A0244
B4057
A2418
B0720
B0703
B1583
B1554
B3503
A0103
B5603
A2901
A2621
B1301
B5114
A0269
B4814
B4605
B5402
B4033
A1120
B5508
B2719
B5131
B4054
A6604
A2447
B3901
B1564
B5608
A0271
A6810
B9505
B1509
B2730
A2437
B1556
B5520
A3103
B4813
B4803
B1820
A0318
A2415
B1530
A0110
B0711
B5115
B4004
B3934
A3102
B2710
B2725
B6701
B4435
B1815
B4108
A0219
A0262
B0825
B4029
B6702
A1103
A2406
B4201
B2705
B1405
B8201
B0822
B4030
B3805
B5307
A2903
B5514
B3557
B0708
B3909
A3001
B0740
B4415
B1586
A6603
B1599
A2620
B5510
B5206
A7411
A0310
A6901
A2405
B5129
A3405
A2602
A6805
A0308
B1807
B1572
B3928
B1515
B5110
A2407
B2713
A3303
A3012
B4604
B4812
A0272
A6824
B0723
A6812
B5133
A2427
B1588
B3929
A3111
A3205
B3907
A0102
B1573
B1521
A6819
B3930
B4037
B0730
B4007
B0801
B1315
A2413
B5201
B3563
B5901
A2417
A2408
B5601
B4422
B4501
B3547
B5804
A0319
B3513
A1113
A2608
B1545
A2456
A2419
B1587
B5208
B3524
A0250
B7803
A0212
B4023
B5102
A0259
B0810
B3707
B0702
A1104
B4056
B4034
B0827
B3517
B1821
A1119
A0305
A2906
B1811
A6827
A2301
B2720
B3550
B4013
B4008
B4503
B3809
B5518
B2723
A0275
B4060
A0277
A0225
A0234
B3936
B5204
A6804
B3511
B2717
A0207
B0804
B5137
A3011
B5702
A2622
B5205
B4806
B5001
A1116
A0260
B1402
B4036
B1304
A2452
B1517
B4101
B2727
A2410
A3003
A0208
B5207
B5403
B3803
A2913
B4417
B5308
B4703
B5311
B0715
B3519
A2420
B3520
A2603
B4507
B4444
B1548
B3932
A1123
A1107
B5607
B1310
B5615
A3402
B0731
B4410
A0270
B1589
B3501
B3542
B0824
B3506
A3304
B2706
B5119
A0230
B1531
B3529
A0313
A2619
A0114
B3559
B5605
B0745
B0743
B4603
B1804
B3528
B5120
B4502
A3002
A2616
B4802
B1822
B7801
B4504
B5805
A0218
A0314
B4053
A6605
A2450
B1314
A2502
A2612
B1576
A0113
B1306
B1552
A3010
B1819
B3904
A2617
B3514
A0231
B3548
B1547
B9506
B5519
B0709
A2442
B3523
A2610
A0251
B4807
A6813
B5401
B4044
A6823
A0246
B4602
B1404
B3527
B4405
B1516
B1309
A1111
B1563
B5509
B1542
B4601
B5710
A2425
A1101
B0726
B2726
A2910
A3110
B9502
B2721
A0322
B5616
B3545
A0263
B5305
B1812
B3502
A6802
A3106
A2438
B5709
B0707
B3709
A4301
B3534
B1598
A2435
B3512
A2305
B4704
B8202
A3008
B4005
B4107
B1507
A2303
A7404
B5501
A0273
A3204
B3533
B5613
B5128
A6816
B4051
B0732
B4205
A0261
B1562
A0236
A0227
A3202
A2404
A6801
B1312
B5515
A2453
B3915
B3917
A0228
A3112
A2614
B0814
B4438
B1403
B4426
B3806
A3104
B2707
B5406
B4811
B3531
A0233
B1546
B3552
B4428
B0717
B3504
B3808
B1551
B4059
A7402
A2615
A2458
A0274
A2424
B0802
A7406
B5135
B1590
B4439
A2609
B2729
B4702
B1596
B0813
A7405
B5301
B4052
A6830
A2623
A6822
B4440
A0117
B3911
B4003
A0201
B0736
B3905
B3802
B5404
A2403
B3924
A2911
B5112
B3918
B4421
B5504
A2501
A2310
B0741
A3601
B0744
B1567
A0258
B1561
B3554
B3810
B5118
A3305
B5113
B1520
A6829
B0823
B5610
B4042
A0202
B5122
B4032
A2421
A2605
B4902
A2423
B4409
A3105
A0267
A2912
B3539
A0108
B4035
A0241
B4001
B4436
B4020
B4901
A1117
B4047
B3701
B4012
B5310
A2618
A0245
A0238
B3708
B2711
A0237
B3920
B4904
A8001
A3009
B1805
B5503
A3206
B3914
A2443
B1505
B1581
B1549
B5808
B4062
B1529
B3510
B5511
B1524
B2701
B5132
B1597
A7403
B4009
B5706
B3546
HLA polymorphism!
B1513
B3811
A3106
B3912
B5102
A3107
B3709
A2314
A7411
X
A0216
A3108
A2405
B4052
B4408
B4426
A0302
B4036
B5901
A2904
A3001
B1515
B4422
A0273
B4403
B5207
B3514
B1578
A6824
B2724
B5605
A2458
B0709
A2442
Predicting the specificity
Align A3001 (365) versus A3002 (365). Aln score 2445.000 Aln len 365 Id 0.9890

A3001   0 MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFSTSVSRPGSGEPRFIAVGYVDDTQFVRFDSDAA
          :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
A3002   0 MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFSTSVSRPGSGEPRFIAVGYVDDTQFVRFDSDAA

A3001  65 SQRMEPRAPWIEQERPEYWDQETRNVKAQSQTDRVDLGTLRGYYNQSEAGSHTIQIMYGCDVGSD
          :::::::::::::::::::::::::::: :::::  :::::::::::::::::::::::::::::
A3002  65 SQRMEPRAPWIEQERPEYWDQETRNVKAHSQTDRENLGTLRGYYNQSEAGSHTIQIMYGCDVGSD

A3001 130 GRFLRGYEQHAYDGKDYIALNEDLRSWTAADMAAQITQRKWEAARWAEQLRAYLEGTCVEWLRRY
          ::::::::::::::::::::::::::::::::::::::::::::: :::::::::::::::::::
A3002 130 GRFLRGYEQHAYDGKDYIALNEDLRSWTAADMAAQITQRKWEAARRAEQLRAYLEGTCVEWLRRY

A3001 195 LENGKETLQRTDPPKTHMTHHPISDHEATLRCWALGFYPAEITLTWQRDGEDQTQDTELVETRPA
          :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
A3002 195 LENGKETLQRTDPPKTHMTHHPISDHEATLRCWALGFYPAEITLTWQRDGEDQTQDTELVETRPA

A3001 260 GDGTFQKWAAVVVPSGEEQRYTCHVQHEGLPKPLTLRWELSSQPTIPIVGIIAGLVLLGAVITGA
          :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
A3002 260 GDGTFQKWAAVVVPSGEEQRYTCHVQHEGLPKPLTLRWELSSQPTIPIVGIIAGLVLLGAVITGA

A3001 325 VVAAVMWRRKSSDRKGGSYTQAASSDSAQGSDVSLTACKV
          ::::::::::::::::::::::::::::::::::::::::
A3002 325 VVAAVMWRRKSSDRKGGSYTQAASSDSAQGSDVSLTACKV
HLAA*3001
HLAA*3002
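The identity reported for the A3001/A3002 alignment can be checked with a small helper. The sketch compares the aligned segment starting at position 65, where three of the four differences between the two proteins fall; the function name is illustrative.

```python
def mismatches(a, b):
    """Number of differing positions between two aligned, equal-length sequences."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

# Aligned segment starting at position 65 of the two sequences.
a3001 = "SQRMEPRAPWIEQERPEYWDQETRNVKAQSQTDRVDLGTLRGYYNQSEAGSHTIQIMYGCDVGSD"
a3002 = "SQRMEPRAPWIEQERPEYWDQETRNVKAHSQTDRENLGTLRGYYNQSEAGSHTIQIMYGCDVGSD"

segment_mismatches = mismatches(a3001, a3002)

# Four mismatches over the full 365-residue alignment give the reported identity.
full_identity = (365 - 4) / 365
print(segment_mismatches, round(full_identity, 4))
```

A handful of substitutions in the binding groove can be enough to change the peptide binding specificity, even at almost 99% sequence identity.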
NetMHCpan - a pan-specific method
NetMHC
NetMHCpan
NetMHCpan, a Method for Quantitative Predictions of Peptide Binding to Any
HLA-A and -B Locus Protein of Known Sequence. Nielsen et al. PLoS ONE 2007
Example

Peptide     Amino acids of HLA pockets   HLA     Aff
VVLQQHSIA   YFAVLTWYGEKVHTHVDTLVRYHY     A0201   0.131751
SQVSFQQPL   YFAVLTWYGEKVHTHVDTLVRYHY     A0201   0.487500
SQCQAIHNV   YFAVLTWYGEKVHTHVDTLVRYHY     A0201   0.364186
LQQSTYQLV   YFAVLTWYGEKVHTHVDTLVRYHY     A0201   0.582749
LQPFLQPQL   YFAVLTWYGEKVHTHVDTLVRYHY     A0201   0.206700
VLAGLLGNV   YFAVLTWYGEKVHTHVDTLVRYHY     A0201   0.727865
VLAGLLGNV   YFAVWTWYGEKVHTHVDTLLRYHY     A0202   0.706274
VLAGLLGNV   YFAEWTWYGEKVHTHVDTLVRYHY     A0203   1.000000
VLAGLLGNV   YYAVLTWYGEKVHTHVDTLVRYHY     A0206   0.682619
VLAGLLGNV   YYAVWTWYRNNVQTDVDTLIRYHY     A6802   0.407855
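One way to present such peptide and pocket pairs to a single network is to encode both sequences and concatenate them. The sketch below uses sparse (one-hot) encoding; this is an illustrative assumption about the input representation, not a description of the actual NetMHCpan implementation.

```python
ALPHABET = "ARNDCQEGHILKMFPSTWYV"

def sparse_encode(sequence):
    """One-hot encode a sequence: 20 inputs per residue, a single 1 each."""
    vector = []
    for aa in sequence:
        one_hot = [0] * len(ALPHABET)
        one_hot[ALPHABET.index(aa)] = 1
        vector.extend(one_hot)
    return vector

def pan_input(peptide, pocket_residues):
    """Concatenate the encoded peptide and the encoded HLA pocket residues."""
    return sparse_encode(peptide) + sparse_encode(pocket_residues)

x_a0201 = pan_input("VLAGLLGNV", "YFAVLTWYGEKVHTHVDTLVRYHY")
x_a6802 = pan_input("VLAGLLGNV", "YYAVWTWYRNNVQTDVDTLIRYHY")
print(len(x_a0201), sum(x_a0201))
```

Because the HLA pocket residues are part of the input, the same peptide gives a different input vector for each allele, which is what allows one network to cover many HLA molecules.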
Evaluation. MHC ligands from SYFPEITHI
Sort on
binding
Top Rank: F-rank=0.0
Random Rank: F-rank=0.5
SYFPEITHI benchmark
(1400 ligands restricted to 46 HLA molecules)
More than 90% of ligands are predicted
with a rank less than 2.5%.
If you select 5 peptides from a source
protein, the ligand will in 90% of the
cases be part of the pool.
Pan-specific predictions
• Pan-specific MHC peptide binding
prediction is the single most important
recent (in silico) development for
understanding presentation of T cell
epitopes/ligands
NetMHCpan
www.cbs.dtu.dk/services/NetMHCpan
NetMHCpan output
SKADVIAKY. Known BoLA Tp5 CTL epitope
What is the % rank score?
1% rank (percentile) score
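A percentile rank score can be read as the percentage of background peptides that score at least as well as the peptide in question. The sketch below is a hedged illustration of that idea; the background scores are hypothetical integers, and the exact definition used by NetMHCpan may differ.

```python
def percent_rank(score, background_scores):
    """Percentage of background scores that are at least as high as the given score."""
    at_least_as_high = sum(1 for b in background_scores if b >= score)
    return 100.0 * at_least_as_high / len(background_scores)

# Hypothetical background: 1000 peptides with raw scores 0..999.
background = list(range(1000))

print(percent_rank(990, background))  # peptide among the top 10 of 1000
print(percent_rank(500, background))
```

A low % rank therefore means that few random peptides are predicted to bind as well, independent of the raw score scale of the allele.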
Rational epitope discovery
• Forward epitope discovery
– Identify antigens using overlapping peptides
– Identify epitope using peptide truncations
• Reverse epitope discovery
– Predict potential epitopes using bioinformatics tool
– Validate predictions using tetra-mers
• Forward/Backwards epitope discovery
– Identify antigens using overlapping peptides
– Use bioinformatics tool to predict epitopes
– Validate predictions using tetra-mers
Forward epitope discovery
• Some numbers
– YF 3,411 amino acids precursor protein
– ~ 900 15mers overlapping with 11 amino acids
– One positive 15mer peptide will contain 26 submer peptides of length 8-11
– Testing all 26 submer peptides against each of the 6 HLA alleles requires 156 validations
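The submer counts can be verified by enumeration. The sketch lists every 8- to 11-mer contained in a 15mer (the peptide used is one of the 15mers from the tetramer validation table) and multiplies by the six HLA class I alleles of a host; the function name is illustrative.

```python
def submers(peptide, min_len=8, max_len=11):
    """All contiguous sub-peptides with a length between min_len and max_len."""
    return [
        peptide[start:start + length]
        for length in range(min_len, max_len + 1)
        for start in range(len(peptide) - length + 1)
    ]

fifteen_mer = "CRAIRNIPRRIRQGL"
candidates = submers(fifteen_mer)
n_alleles = 6
print(len(candidates), len(candidates) * n_alleles)

# The same count for an 18mer, as used in the HIV EliSpot data.
print(len(submers("A" * 18)) * n_alleles)
```

For a 15mer this gives 8 + 7 + 6 + 5 = 26 submers and 156 peptide/allele combinations; for an 18mer it gives 38 submers and 228 combinations.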
Reverse discovery
• Problems
– Which alleles to include in selection of
potential epitopes
– Use HLA supertypes, predict 8-11mer, select
top 5% predicted binder => 8200 peptides
– And you might miss a lot
• Supertypes are not perfect, e.g. HLA-A*11:01
and HLA-A*03:01 do not bind the same set of
peptides
• Predictions are not perfect. Less than 80% of
predicted binders turn out to be actual
binders
Forward/Backwards epitope discovery
• ~ 900 15mers overlapping with 11 amino
acids
• Identify immunogenic peptides using
peptide pools
• Identify HLA restriction and minimal
epitope using bioinformatic tools
– Reduces peptide set by 95% at a sensitivity
of 92%
Peptide pools
The HLArestrictor
www.cbs.dtu.dk/services/HLArestrictor
Output
Known B42:01 epitope
5145 18mer HIV EliSpot positive peptides
(Kiepiela et al. 2004)
[Figure: stacked bar chart of percent predicted (legend: A, B, C, Not predicted) versus binding threshold (%rank: 0.5, 1, 2, 5, 10)]
• 92% of positive EliSpot responses are identified at a 2 %rank threshold
• 228 potential positives per peptide
• 5 predicted positives per peptide
=> Reduction of 98%
Tetramer validations
Patient      HIV 18mer with ELIspot response  Validated epitope  Validated allele  Pred. affinity  Pred. %rank
N080         PRTLNAWVKVIEEKAF                 VKVIEEKAF          B15:03            155 nM          6.0 %
N080         YHCLVCFQTKGLGISYGR               FQTKGLGISY         B15:03            8 nM            0.8 %
N080         VKAACWWAGIQQEFGIPY               IQQEFGIPY          B15:03            4 nM            0.1 %
N080         AVFIHNFKRKGGIGGYSA               FKRKGGIGGY         B15:03            24 nM           1.5 %
H044         WVKVIEEKAFSPEVIPMF               KAFSPEVIPMF        B57:03            NA              0.4 %
N021         ELKQEAVRHFPRPWLHGL               FPRPWLHGL          B42:01            49 nM           0.05 %
N012         ACQGVGGPSHKARVLAEA               GPSHKARVL*         B07:02            36 nM           0.3 %
R050 / N012  CRAIRNIPRRIRQGL                  IPRRIRQGL*         B07:02            10 nM           0.1 %
R050         NYTPGPGVRYPLTFGWCF               TPGPGVRYPL         B07:02            45 nM           0.3 %
R050         QGWKGSPAIFQSSMTKIL               SPAIFQSSM          B07:02            10 nM           0.1 %
R050 / R039  WVKVIEEKAFSPEVIPMF               KAFSPEVIPMF        B57:01            67 nM           0.1 %
R039         PVGEIYKRWIILGLNKIV               KRWIILGLNK         B27:05            22 nM           0.05 %
R039         AVFIHNFKRKGGIGGYSA               KRKGGIGGY          B27:05            289 nM          1.0 %
R035         ELKNEAVRHFPRIWLHSL               VRHFPRIWL          B27:05            357 nM          1.0 %
R014         MASEFNLPPIVAKEIVA                LPPIVAKEI          B42:01            NA              1.5 %
N067         TGSEELRSLYNTVATLY                SLYNTVATL          A02:01            436 nM          4.0 %
N067 / N096  ELAENREILKEPVHGVYY               ILKEPVHGV          A02:01            42 nM           1.5 %
N096         SDIAGTTSTLQEQIAWM                TSTLQEQIAW         B58:01            32 nM           0.2 %
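As a small illustration, the predicted %rank values of the 18 validated epitopes listed above can be checked against a %rank cut-off. This is only a sketch: the list is transcribed from the table, and the helper name is hypothetical.

```python
# Predicted %rank of the 18 tetramer-validated epitopes, in table order.
validated_ranks = [
    6.0, 0.8, 0.1, 1.5, 0.4, 0.05, 0.3, 0.1, 0.3,
    0.1, 0.1, 0.05, 1.0, 1.0, 1.5, 4.0, 1.5, 0.2,
]


def fraction_within(ranks, threshold):
    """Fraction of epitopes whose predicted %rank is at or below the threshold."""
    return sum(r <= threshold for r in ranks) / len(ranks)


n_within = sum(r <= 2.0 for r in validated_ranks)
print(n_within, len(validated_ranks), round(fraction_within(validated_ranks, 2.0), 2))
# 16 18 0.89
```

Only the two epitopes with %rank 6.0 and 4.0 fall outside a 2 %rank threshold in this small set.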
Visualization of binding motifs
www.cbs.dtu.dk/biotools/MHCMotifViewer
The MHC motif viewer: a visualization tool for MHC binding motifs. Rapin N, Hoof I, Lund O, Nielsen M. Curr Protoc Immunol.
2010 Feb;Chapter 18:Unit 18.17.
Going beyond humans
BoLA epitopes the hard way
Trimming prior to binding
QRSPMFEGTL – Rank = 6%
RSPMFEGTL – Rank = 0.1%
[Figure: % cytotoxicity versus peptide concentration (1000 to 0.1 ng/ml) for QRSPMFEGTL, RSPMFEGTL, RSPMFEGTLG, RSPMFEGT and SPMFEGTL]
BoLA Class I epitopes, work by Ivan Morrison and co-workers
Trimming happens at both ends
SKFPKMRMG – Rank 16%
SKFPKMRM – Rank 1%
[Figure: titration curves over peptide amounts from 100 ng to 0.01 ng for SKFPKMRMG, SKFPKMRMGKG, SKFPKMRMGK, KFPKMRMGK, SKFPKMRM and KFPKMRMG]
Processing effect: addition of GKG and possibly the G
BoLA Class I epitopes, work by Ivan Morrison and co-workers
BoLA CTL epitopes - the rational way
[Figure: Frank]
Average predicted rank of 16 CTL BoLA restricted epitopes is 3%
So, we can find the needle in the haystack
• Given a protein sequence and an HLA molecule, we can
accurately predict which peptides will bind (70-95%)
• 15-80% of these will in turn be epitopes
Conclusions II. MHC binding
• Pan-specific MHC prediction methods can deal
with the immense MHC polymorphism and are (in
my opinion) the most significant recent
contribution to our understanding of cellular
immune responses
• Rational epitope discovery is feasible
– Prediction methods are an important guide for
epitope identification
– Given a protein sequence and an HLA molecule, we can
predict the peptide binders (find the needle in the
haystack)
What defines a T cell epitope?
• Processing (Proteasomal cleavage, TAP)?
• MHC binding
• Other proteases
• T cell repertoire
• MHC:peptide complex stability
• Source protein abundance, cellular location and function
Evaluation. MHC ligands from SYFPEITHI
Sort on
binding
Top Rank: F-rank=0.0
Random Rank: F-rank=0.5
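A minimal sketch of how such an F-rank could be computed. It assumes that F-rank is the fraction of the other peptides from the source protein that score higher than the known ligand; the slide only gives the two reference values (0.0 for top rank, 0.5 for random), so this definition is an assumption that is consistent with them. The function name and the toy scores are hypothetical.

```python
def f_rank(scores, ligand_index):
    """Fraction of the other peptides that score higher than the known ligand.

    `scores` holds one predicted binding score per peptide of the source
    protein (higher means stronger predicted binding).
    """
    if len(scores) < 2:
        raise ValueError("need the ligand and at least one other peptide")
    ligand_score = scores[ligand_index]
    higher = sum(
        1 for i, s in enumerate(scores) if i != ligand_index and s > ligand_score
    )
    return higher / (len(scores) - 1)


# Toy scores: the ligand is the best-scoring peptide, so F-rank is 0.0.
print(f_rank([0.9, 0.1, 0.5], ligand_index=0))              # 0.0
# Ligand in the middle of the ranking: F-rank is 0.5.
print(f_rank([0.1, 0.9, 0.5, 0.3, 0.05], ligand_index=3))   # 0.5
```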
Processing
Do proteasomal cleavage and TAP matter?
NetCTL, MHC-pathway said yes (in 2005)
NetCTL, 2005
Wcl=0.05, Wt=0.1 (AUC)
2010, NetCTLpan says maybe
– Wcl=0, Wt=0 (AUC)
– Wcl=0.225, Wt= 0.025 (AUC0.1)

Benchmark (Ligands and HIV epitopes)
$S = \mathrm{MHC} + w_{cl} \cdot \mathrm{Cl}_{C\,term} + w_{tap} \cdot \mathrm{TAP}$
Wcl=0.225, wtap=0.025
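The combined score on this slide is a weighted sum of the MHC binding score, the C-terminal cleavage score and the TAP score. A minimal sketch, assuming all three inputs are already on comparable scales; the function name and the example input values are hypothetical, while the default weights are the ones quoted on the slide.

```python
def combined_score(mhc, cleavage, tap, w_cl=0.225, w_tap=0.025):
    """Weighted sum: MHC binding + w_cl * C-terminal cleavage + w_tap * TAP."""
    return mhc + w_cl * cleavage + w_tap * tap


# Hypothetical input scores for one peptide.
print(combined_score(mhc=0.5, cleavage=1.0, tap=2.0))

# With both weights at zero the combined score reduces to the MHC score alone,
# which corresponds to the "Wcl=0, Wt=0" setting mentioned for AUC.
print(combined_score(mhc=0.5, cleavage=1.0, tap=2.0, w_cl=0.0, w_tap=0.0))
```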
MHC class I pathway co-evolution
Nielsen, Kesmir, Immunogenetics, (2005) 57: 33–41
Going pan-specific does most of it
Objectives
• Visualization of binding motifs
– Construction of sequence logos
• Understand the concepts of weight matrix construction
– One of the most important methods of bioinformatics
• A few words on artificial neural networks
• MHC binding rules
– No other factors in the MHC (I and II) pathways are
(as) decisive for T cell epitope identification
• All known T cell epitopes have specific MHC restrictions
matching their host
• MHC binding is the single most important feature for
understanding cellular immunity
Class II MHC binding
• Binds peptides of length 9-18
(even whole proteins can bind!)
• Binding cleft is open
• Binding core is 9 aa
• Binding motif highly degenerate
• Amino acids flanking the binding
core affect binding
• Peptide structure might
determine binding
Gibbs sampler
www.cbs.dtu.dk/biotools/EasyGibbs
100 10mer peptides
$2^{100} \approx 10^{30}$ combinations
$E = \sum_{p,a} C_{pa} \log \frac{p_{pa}}{q_a}$
Monte Carlo
simulations can do it
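The quantity being optimized is a sum over core positions p and amino acids a of C_pa log(p_pa / q_a). The sketch below evaluates it for one fixed choice of cores. It is not the EasyGibbs implementation: it assumes C_pa is the raw count of amino acid a at position p, p_pa is the corresponding frequency, q_a is a uniform background of 1/20, and it uses no pseudocounts or sequence weighting. The function name is hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumption: uniform background frequencies q_a.
UNIFORM_BACKGROUND = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}


def alignment_energy(cores, background=UNIFORM_BACKGROUND):
    """Sum over positions p and amino acids a of C_pa * log(p_pa / q_a)."""
    n = len(cores)
    energy = 0.0
    for pos in range(len(cores[0])):
        counts = Counter(core[pos] for core in cores)
        for aa, count in counts.items():
            freq = count / n  # p_pa; unobserved amino acids contribute nothing
            energy += count * math.log(freq / background[aa])
    return energy


# 100 10mers with two possible 9mer cores each give 2**100 alignments.
print(f"{2**100:.2e}")  # 1.27e+30

# A perfectly conserved alignment scores higher than a mixed one.
print(alignment_energy(["AAAAAAAAA"] * 4) > alignment_energy(["AAAAAAAAA", "ACDEFGHIK"] * 2))
```

A Monte Carlo search would repeatedly shift the core of one peptide and accept or reject the move based on the change in this energy.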
The problem. Where is the binding core?
PEPTIDE            IC50(nM)
VPLTDLRIPS         48000
GWPYIGSRSQIIGRS    45000
ILVQAGEAETMTPSG    34000
HNWVNHAVPLAMKLI    120
SSTVKLRQNEFGPAR    8045
NMLTHSINSLISDNL    47560
LSSKFNKFVSPKSVS    4
GRWDEDGAKRIPVDV    49350
ACVKDLVSKYLADNE    86
NLYIKSIQSLISDTQ    67
IYGLPWMTTQTSALS    11
QYDVIIQHPADMSWC    15245
Effect of Peptide Flanking Residues
• PFRs can affect binding dramatically
– RFYKTLRAEQASQ 34 nM
– YKTLRAEQA >10000 nM
NN-align
[Diagram: training cycle]
• Predict binding affinity and core
• Calculate prediction error
• Update method to minimize prediction error
PEPTIDE            Pred   Meas
VPLTDLRIPS         0.00   0.03
GWPYIGSRSQIIGRS    0.19   0.08
ILVQAGEAETMTPSG    0.07   0.24
HNWVNHAVPLAMKLI    0.77   0.59
SSTVKLRQNEFGPAR    0.15   0.19
NMLTHSINSLISDNL    0.17   0.02
LSSKFNKFVSPKSVS    0.81   0.97
GRWDEDGAKRIPVDV    0.39   0.45
ACVKDLVSKYLADNE    0.58   0.57
NLYIKSIQSLISDTQ    0.84   0.66
IYGLPWMTTQTSALS    1.00   0.93
QYDVIIQHPADMSWC    0.12   0.11
Predict binding affinity and core, example:
GRWDEDGAKRIPVDV 0.45
GRWDEDGAKRIP 0.15
G RWDEDGAKRIPV 0.03
GR WDEDGAKRIPVD 0.39
GRW DEDGAKRIP VDV 0.05
Nielsen et al. BMC Bioinformatics 2009, 10:296
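The core-finding step can be sketched as follows: enumerate every 9mer core in a peptide, score each one, and keep the best-scoring core as the predicted binding register. This is a simplified illustration and not the NN-align implementation; in particular, `toy_score` is a hypothetical stand-in for the neural network, and the function names are made up for this example.

```python
def candidate_cores(peptide, core_len=9):
    """All (N-terminal flank, core, C-terminal flank) splits of a peptide."""
    return [
        (peptide[:i], peptide[i:i + core_len], peptide[i + core_len:])
        for i in range(len(peptide) - core_len + 1)
    ]


def best_core(peptide, score_core, core_len=9):
    """Return the split whose core gets the highest score."""
    return max(candidate_cores(peptide, core_len), key=lambda split: score_core(split[1]))


def toy_score(core):
    """Hypothetical scorer: rewards a hydrophobic residue at core position 1."""
    return 1.0 if core[0] in "FILMVWY" else 0.0


splits = candidate_cores("GRWDEDGAKRIPVDV")
print(len(splits))   # 7
print(splits[3])     # ('GRW', 'DEDGAKRIP', 'VDV')
print(best_core("GRWDEDGAKRIPVDV", toy_score))  # ('GR', 'WDEDGAKRI', 'PVDV')
```

During training, the prediction error would be computed for the best-scoring core only, and the scoring function updated to reduce it.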
NetMHCII (NN-align)
[Figure: comparison annotated with P<0.001, P<0.05 and P<0.05]
Nielsen et al. BMC Bioinformatics 2009, 10:296
Network ensembles
Pan NN-align
• Add MHC pseudo sequence to training
• Include polymorphic residues in potential
contact with the bound peptide
• The contact residues are defined as being
within 4.0 Å of the peptide in any of a
representative set of HLA-DR, -DQ, and -DP
structures with peptides.
• Only polymorphic residues are included
• Pseudo-sequence consisting of 25
amino acid residues.
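The pseudo-sequence construction described above can be sketched as: start from the contact positions, keep only those that are polymorphic across alleles, and read out the residues at those positions for each allele. The alignment, allele names and contact positions below are toy values invented for illustration; the real method uses positions derived from structures and ends up with 25 residues.

```python
def polymorphic_contact_positions(aligned_sequences, contact_positions):
    """Keep contact positions where the aligned MHC sequences differ."""
    return [
        pos for pos in contact_positions
        if len({seq[pos] for seq in aligned_sequences.values()}) > 1
    ]


def pseudo_sequence(sequence, positions):
    """Concatenate the residues found at the selected positions."""
    return "".join(sequence[pos] for pos in positions)


# Toy aligned sequences (hypothetical, not real HLA sequences).
toy_alignment = {
    "allele_1": "GDTRPRFLEQ",
    "allele_2": "GDTRPRFLWQ",
    "allele_3": "GDTQPRFLEQ",
}
toy_contacts = [1, 3, 5, 8]  # hypothetical 0-based contact positions

positions = polymorphic_contact_positions(toy_alignment, toy_contacts)
print(positions)  # [3, 8]
for name, seq in toy_alignment.items():
    print(name, pseudo_sequence(seq, positions))
```

The resulting string is what would be presented to the network alongside the peptide, so that one model can cover many MHC molecules.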
NetMHCIIPan-2.0
www.cbs.dtu.dk/services/NetMHCIIpan
But, can we find the haystack?
MTB (Mycobacterium tuberculosis)
• Bacterial genome coding for more than
4000 proteins
• 700 known epitopes, found in only 30
proteins (ORFs)
[Figure: TB epitopes]
• Is this biology, or history?
– More than 500,000 unique 9mer peptides
– Where to start?
• Each HLA allele will bind ~5000 of these
peptides
Functional bias in TB epitope proteins
Tang et al. J Immunol. 2011 Jan 15;186(2):1068-80.
Where are the epitopes?
Larsen MV et al., PLoS One. 2010 Sep 14;5
Conclusions
• Rational epitope discovery is feasible
– Prediction methods are an important guide for epitope
identification
– Given a protein sequence and an HLA molecule, we can
predict the peptide binders (find the needle in the
haystack)
• Pan-specific MHC prediction methods can deal with the
immense MHC polymorphism
• All CTL epitopes have specific MHC restrictions
matching their host
– There is no such thing as a non-binding CTL epitope
• Processing has little impact on the prediction of CTL
epitopes
• For large pathogens, we still have no good handle on how
to select immunogenic proteins
CBS immunology web servers
www.cbs.dtu.dk/services
Acknowledgements
• Immunological Bioinformatics group, CBS, DTU
– Ole Lund - Group leader
– Claus Lundegaard - Data bases, HLA
binding predictions
• Collaborators
– IMMI, University of Copenhagen
• Søren Buus: MHC binding
– La Jolla Institute of Allergy and
Infectious Diseases
• A. Sette, B. Peters: Epitope
database
• and many, many more