protws1 3815

De novo interpretation
of peptide mass spectra
Vineet Bafna, UCSD
(joint work with
Nathan Edwards, ABI and
Noah Zaitlen, UCSD)
Nobel Citation 2002
Talk Outline
 Tandem MS for Peptide Identification
 Earlier work
 Description of algorithm
 Results and applications
Tandem MS
Secondary Fragmentation
Ionized parent peptide
The peptide backbone
The peptide backbone breaks to form
fragments with characteristic masses.
H...-HN-CH-CO-NH-CH-CO-NH-CH-CO-…OH
[Figure: the backbone runs from the N-terminus to the C-terminus; side chains R(i-1), R(i), R(i+1) mark amino acid residues i-1, i, i+1.]
Ionization
[Figure: the same backbone, N-terminus to C-terminus, with a proton (H+) added; labeled "Ionized parent peptide".]
Fragment ion generation
[Figure: the ionized backbone cleaved at a peptide bond into H...-HN-CH-CO (N-terminal side, residue i-1) and NH-CH-CO-NH-CH-CO-…OH (C-terminal side, residues i and i+1); labeled "Ionized peptide fragment".]
Tandem MS for Peptide ID
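Peptide S-G-F-L-E-E-D-E-L-K: b-ion and y-ion masses at each residue

b ions   residue   y ions
    88      S       1166
   145      G       1080
   292      F       1022
   405      L        875
   534      E        762
   663      E        633
   778      D        504
   907      E        389
  1020      L        260
  1166      K        147

[Figure: spectrum of % intensity (0 to 100) vs. m/z (0 to 1000), with the [M+2H]2+ peak marked.]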
Peak Assignment
[Figure: the b/y ion table for S-G-F-L-E-E-D-E-L-K from the previous slide, above the spectrum (% intensity vs. m/z, 0 to 1000) with peaks labeled b3 to b9, y2 to y9, and [M+2H]2+.]
Peak assignment implies sequence (residue tag) reconstruction!
Database Searching
 For every peptide from a database
 Generate a hypothetical spectrum
 Compute a correlation between the observed (experimental) and
hypothetical spectra
 Choose the best
 Database searching is very powerful and
is the de facto standard for MS.
 Sequest, Mascot, and many others
A case for de novo sequencing
 With current technology, only about 30% of
spectra yield a meaningful hit in a database
 Multiple Reasons:
 Incomplete databases (ex: pathogens)
 Modifications/Mutations
 Poor quality of fragmentation
 Can we say anything about such ‘Orphan’
Spectra?
 In the absence of a complete interpretation,
can we have confidence in peak
assignments?
De Novo Interpretation: Example
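Peptide S-G-E-K

b-ions   residue   y-ions
    88      S        420
   145      G        333
   274      E        276
   402      K        147

Ion offsets (P = prefix residue mass, S = suffix residue mass, M = parent residue mass):
b = P + 1
y = S + 19 = M - P + 19

[Figure: spectrum on an m/z axis (100 to 500) with peaks labeled b1, b2, y1, y2.]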
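The ion offsets above can be checked with a short sketch. It assumes nominal (integer) residue masses, which is what the slide's numbers imply; the table `RESIDUE_MASS` and the function name `by_ions` are illustrative and not from the talk.

```python
from itertools import accumulate

# Nominal (integer) residue masses; an assumption chosen to match the slide.
RESIDUE_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "I": 113, "N": 114, "D": 115, "Q": 128, "K": 128,
    "E": 129, "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}

def by_ions(peptide):
    """Return (b_ions, y_ions), listed residue by residue as in the table."""
    masses = [RESIDUE_MASS[r] for r in peptide]
    parent = sum(masses)                  # M, the parent residue mass
    prefixes = list(accumulate(masses))   # P after each residue
    b = [p + 1 for p in prefixes]                        # b = P + 1
    y = [parent - p + 19 for p in [0] + prefixes[:-1]]   # y = M - P + 19
    return b, y

print(by_ions("SGEK"))  # ([88, 145, 274, 402], [420, 333, 276, 147])
```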
Putative Prefix Masses
Dancik et al., Recomb 98
Parent mass M = 401

Peak (m/z)             88    145    147    276
Prefix mass if b       87    144    146    275
Prefix mass if y      332    275    273    144

Correct prefix masses: 0 -S- 87 -G- 144 -E- 273 -K- 401
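As a minimal sketch of how the table is produced, each peak is turned into one putative prefix mass per ion type by inverting the offsets b = P + 1 and y = M - P + 19. The function name is hypothetical, and only b and y ions with integer masses are considered.

```python
def putative_prefix_masses(peak, parent_mass):
    """Prefix residue mass implied by a peak under each ion-type assumption."""
    return {
        "b": peak - 1,                  # from b = P + 1
        "y": parent_mass - peak + 19,   # from y = M - P + 19
    }

M = 401
for peak in (88, 145, 147, 276):
    print(peak, putative_prefix_masses(peak, M))
```

Only one of the two masses per peak can be correct, which is the source of the "forbidden pairs" discussed on the following slides.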
Earlier Work: Spectral Graph
Dancik et al., Recomb 98
 Each peak, when assigned to a prefix/suffix ion type,
generates a unique prefix residue mass (PRM).
 Spectral graph:
 Each node u defines a putative prefix residue mass M(u).
 (u,v) in E if M(v)-M(u) is the residue mass of an amino acid (a tag) or
0.
 Paths in the spectral graph correspond to interpretations
[Figure: spectral graph on a mass axis with nodes at 0, 87, 144, 146, 273, 275, 332, and 401; edges labeled S, G, E, K.]
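A small sketch of the spectral graph idea, under simplifying assumptions: nominal integer masses, exact mass matching, nodes merged by mass (so zero-weight edges are not needed), and no check that each peak contributes only one node. The names `RESIDUES` and `best_path` are illustrative.

```python
# Nominal residue mass -> one-letter code. I shares L's mass and Q shares
# K's nominal mass, so those letters are ambiguous in this sketch.
RESIDUES = {
    57: "G", 71: "A", 87: "S", 97: "P", 99: "V", 101: "T", 103: "C",
    113: "L", 114: "N", 115: "D", 128: "K", 129: "E", 131: "M",
    137: "H", 147: "F", 156: "R", 163: "Y", 186: "W",
}

def best_path(nodes, parent_mass):
    """Sequence along the path from 0 to parent_mass that visits most nodes."""
    nodes = sorted(set(nodes) | {0, parent_mass})
    best = {0: (1, "")}  # node mass -> (nodes on path, sequence so far)
    for v in nodes[1:]:
        options = [
            (best[u][0] + 1, best[u][1] + RESIDUES[v - u])
            for u in nodes
            if u < v and u in best and (v - u) in RESIDUES
        ]
        if options:
            best[v] = max(options)
    return best.get(parent_mass, (0, None))[1]

print(best_path([87, 144, 146, 273, 275, 332], 401))  # SGEK
```

A shorter path S-W-K also reaches 401 in this graph (87 + 186 = 273), which is why some objective function, here the number of nodes visited, is needed to pick among paths.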
Re-defining de novo interpretation
 Find a subset of nodes in spectral graph s.t.
 0, M are included
 Each peak contributes at most one node (interpretation)(*)
 Each adjacent pair (when sorted by mass) is connected by
an edge (valid residue mass)
 An appropriate objective function (ex: the number of peaks
interpreted) is maximized
 (*) In general, finding paths that avoid forbidden
pairs (two nodes from the same peak) is NP-hard
[Figure: mass axis from 0 to 400 with the path S, G, E, K.]
Earlier Work:
Non-Intersecting Forbidden pairs
Chen et al. , SODA 2000
[Figure: mass axis (0 to 400) showing the path S, G, E, K and the node pair 87 and 332.]
 If we consider only b and y ions, the ‘forbidden’ node pairs
are non-intersecting.
 In that case the de novo problem can be solved efficiently using
a dynamic programming technique.
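The non-intersecting property can be illustrated with a short check. This sketch assumes that "non-intersecting" means no two forbidden pairs cross when drawn as intervals on the mass axis (they may nest or be disjoint); that reading, and all function names, are assumptions for illustration.

```python
from itertools import combinations

def forbidden_pair(peak, parent_mass):
    """The two prefix masses a peak implies as a b ion or as a y ion."""
    b_node = peak - 1
    y_node = parent_mass - peak + 19
    return (min(b_node, y_node), max(b_node, y_node))

def crosses(p, q):
    """True if the intervals p and q partially overlap (neither nests)."""
    (a, b), (c, d) = sorted([p, q])
    return a < c < b < d

def non_intersecting(pairs):
    return not any(crosses(p, q) for p, q in combinations(pairs, 2))

pairs = [forbidden_pair(p, 401) for p in (88, 145, 147, 276)]
print(pairs)                    # [(87, 332), (144, 275), (146, 273), (144, 275)]
print(non_intersecting(pairs))  # True
```

With only b and y ions, every pair has the form (x, M + 18 - x) in these nominal units, so the intervals are all centered on the same point and can only nest.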
Multiple Ion Types
Peptide fragmentation possibilities
(ion types)
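[Figure: fragmentation sites along the backbone -HN-CH-CO-NH-CH-CO-NH-, with side chains R_i, R'_(i+1), R''_(i+1). N-terminal ion types shown: a_i, b_i, b_(i+1), c_i, d_(i+1). C-terminal ion types shown: v_(n-i), w_(n-i), x_(n-i), y_(n-i), y_(n-i-1), z_(n-i). The labels are grouped into low energy fragments and high energy fragments.]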
Spectra have Multiple Ion types
 b-ions and y-ions often constitute a minority of the
interpretable ions. a-ions and various neutral
losses are very common. High energy spectra
display a larger fraction of ions.
 Multiple ion types imply overlapping residue
assignments, so the algorithm of Chen et al. does not apply.
BE’2003
 An efficient algorithm that handles all
prefix/suffix fragmentations and their neutral
losses using a different d.p. formulation
 An algorithm that obtains the core interpretation
of a spectrum and assigns confidence
values to peak assignments
Multiple Ion Types
[Figure: the fragmentation diagram again (ion types a, b, c, d and v, w, x, y, z; low energy and high energy fragments), with a "Residue" label added.]
Simple Ion Lists
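[Figure: peaks i and j on a mass axis (0 to 400) with their putative prefix residue masses on either side of M/2; l_i and r_i mark the leftmost and rightmost masses for peak i, and S marks the span.]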
 Partition the putative residue masses for each peak around M/2
 l_i (r_i) is the smallest (largest) putative residue mass for peak i
 The span S of an ion list is the maximum difference between the
putative residue assignments on either the left-hand or the right-hand side.
 Define an ion list as simple if
 span <= minimum residue mass.
 l_i < l_j implies r_i > r_j
 Most natural ion lists are simple. Ex: a, b, y, b-NH3, b-H2O, y-NH3, y-H2O
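The definition can be tried out on the example ion list. This is a sketch under stated assumptions: nominal offsets (b = P + 1 and y = S + 19 as on the earlier slide, a = b - 28, NH3 = 17, H2O = 18), glycine (57) as the minimum residue mass, and hypothetical names throughout.

```python
MIN_RESIDUE_MASS = 57  # glycine, nominal mass

# Offset of each ion type from the prefix mass P (prefix ions)
# or from the suffix mass M - P (suffix ions), in nominal units.
PREFIX_OFFSETS = {"b": 1, "a": -27, "b-NH3": -16, "b-H2O": -17}
SUFFIX_OFFSETS = {"y": 19, "y-NH3": 2, "y-H2O": 1}

def span(offsets):
    """Largest spread among putative residue masses on one side."""
    return max(offsets.values()) - min(offsets.values())

def is_simple(prefix_offsets, suffix_offsets, min_residue=MIN_RESIDUE_MASS):
    return max(span(prefix_offsets), span(suffix_offsets)) <= min_residue

def extremes(peak, parent_mass):
    """(l_i, r_i): smallest and largest putative prefix mass for a peak."""
    masses = [peak - d for d in PREFIX_OFFSETS.values()]
    masses += [parent_mass - peak + d for d in SUFFIX_OFFSETS.values()]
    return min(masses), max(masses)

print(span(PREFIX_OFFSETS), span(SUFFIX_OFFSETS))  # 28 18
print(is_simple(PREFIX_OFFSETS, SUFFIX_OFFSETS))   # True
print(extremes(88, 401), extremes(145, 401))       # (87, 332) (144, 275)
```

For the two peaks shown, l_i < l_j and r_i > r_j, matching the second condition of the definition.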
Simple Ion Lists and Spectral Peak
ordering
[Figure: peaks i and j on a mass axis (0 to 400), with residue masses ri and rj marked.]
 Order the peaks by increasing rightmost putative residue mass
 Lemma: If the left-side residue mass ri is
chosen from peak i, and rj is chosen from peak j > i,
then ri >= rj
Ordering spectral peaks
[Figure: peaks i < j on the mass axis; span <= aa (one amino acid residue mass).]
Forward algorithm
[Figure: peaks i-1 and i, with nodes M[v] and M[w] on the mass axis.]
 Goal: We only have peaks 1..i, and want to find
the “best” path from node v (M[v] <= M/2) to node
w (M[w] > M/2). Denote its score S[i,v,w]
 Best: Many notions of best are possible.
Forward algorithm
[Figure: peaks i-1 and i, with nodes M[v], M[u], and M[w] on the mass axis.]
 Since i is the outermost peak, a PRM from it is either
connected to v, or to w, or NOT used ever.
 One option is that none of the PRMs from peak i are
used. In that case
 S[i,v,w] = S[i-1,v,w]
Forward algorithm
[Figure: peaks i-1 and i, with nodes M[v], M[u], and M[w]; u is a node of peak i.]
 Otherwise, we choose a node u from one of the
interpretations of peak i
 Node u must be joined by an edge to v or to w. If it is joined to v
 S[i,v,w] = S[i-1,u,w], plus a score for assigning peak i to u
The Forward Algorithm
[Figure: peaks i-1 and i, with nodes M[v], M[u], and M[w] on the mass axis.]
 Number the peaks in the order of increasing largest
prefix residue mass
 Define S[i,v,w] as the score of the best interpretation
from M[v] to M[w] using peaks 1..i
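$$
S[i,v,w] = \max \begin{cases}
S[i-1,v,w] + f(i) & \text{(peak } i \text{ not used)} \\
S[i-1,u,w] + g(i,M[u]) & \text{if valid (} u \text{ from peak } i \text{, edge between } v \text{ and } u \text{)} \\
S[i-1,v,u] + g(i,M[u]) & \text{if valid (} u \text{ from peak } i \text{, edge between } u \text{ and } w \text{)}
\end{cases}
$$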
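The recurrence can be sketched in a few lines for a toy case. This is not the talk's implementation: it assumes only b and y ions, nominal integer masses, exact matching, f(i) = 0, g = 1 per interpreted peak, and a base case S[0,v,w] = 0 when v and w are joined by an edge. All names are hypothetical.

```python
from functools import lru_cache

RESIDUES = {57, 71, 87, 97, 99, 101, 103, 113, 114, 115,
            128, 129, 131, 137, 147, 156, 163, 186}  # nominal masses

def forward_score(peaks, parent_mass):
    """Max number of peaks interpreted on a valid path from 0 to parent_mass."""
    half = parent_mass / 2

    def prms(peak):  # putative prefix masses as a b ion and as a y ion
        return (peak - 1, parent_mass - peak + 19)

    # Number the peaks by increasing largest prefix residue mass.
    ordered = sorted(peaks, key=lambda p: max(prms(p)))

    def edge(a, b):  # residue-mass edge, or a zero edge between equal masses
        return (b - a) in RESIDUES or a == b

    @lru_cache(maxsize=None)
    def S(i, v, w):
        if i == 0:  # no peaks left: v and w must be directly connected
            return 0.0 if (w - v) in RESIDUES else float("-inf")
        best = S(i - 1, v, w)  # peak i unused, f(i) = 0
        for u in prms(ordered[i - 1]):
            if v <= u <= half and edge(v, u):    # u attaches next to v
                best = max(best, S(i - 1, u, w) + 1)
            elif half < u <= w and edge(u, w):   # u attaches next to w
                best = max(best, S(i - 1, v, u) + 1)
        return best

    return S(len(ordered), 0, parent_mass)

print(forward_score([88, 145, 147, 276], 401))  # 4.0
```

In the example all four peaks are explained: 88 as b1, 145 as b2, 276 as y2 (the same prefix mass 144, joined by a zero edge), and 147 as y1.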
Forward Backward Paths
[Figure: mass axis from 0 to M with nodes M[v] and M[w]; peaks i and i+1 marked.]
 S[i][v][w] is the best (M[v],M[w]) inner
interpretation using peaks 1..i.
 Let T[i][v][w] be the best (M[v],M[w])
outer-interpretation using peaks i..m.
 It is the highest score of a path from 0 to v,
and from w to M using peaks i, i+1,…,m
Forward Backward Scoring
[Figure: mass axis from 0 to M with nodes M[v] and M[w]; peaks i and i+1 marked.]
 S[i][v][w] is the best (M[v],M[w]) inner interpretation
using peaks 1..i.
 Let T[i][v][w] be the best (M[v],M[w]) outer-interpretation using peaks i..m.
$$
T[i,v,w] = \max \begin{cases}
T[i+1,v,w] + f(i) \\
T[i+1,u,w] + g(i,M[u]) & \text{if valid} \\
T[i+1,v,u] + g(i,M[u]) & \text{if valid}
\end{cases}
$$
Core Interpretations
 Suppose we can answer the following
question:
 What is the best scoring interpretation if
we fix the interpretation of peak i to
something (e.g., b or y-H2O)?
 This is equivalent to assigning peak i to one of
its nodes
Do we care about this?
Core interpretations
 If the global score after assigning peak i to a
node u is much higher than the score due to
any other interpretation of peak i
 Then the interpretation of peak i is likely
to be correct even if the global
interpretation is not.
 This allows us to interpret peaks with
incomplete fragmentation
Core Interpretations
 Define H[i,u] as the score of the highest scoring
interpretation in which peak i is assigned to
node u
 If H[i,u] = S[m,0,n] for some u, and H[i,v] <<
S[m,0,n] for every other node v of peak i, then M[u] is probably the correct
interpretation for peak i
$$
H[i,u] = g(i,M[u]) + \max \begin{cases}
\max_{v<u} \left( S[i-1,v,u] + T[i+1,v,u] \right) \\
\max_{w>u} \left( S[i-1,u,w] + T[i+1,u,w] \right)
\end{cases}
$$
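One way to turn the H values into a per-peak confidence is to compare the best assignment with the runner-up. The sketch below assumes a percent score difference defined as 100 * (best - second) / best; that exact definition, the function name, and the example scores are assumptions, not taken from the talk.

```python
def score_gap_percent(h_for_peak):
    """Percent gap between the best and second-best H[i,u] for one peak.

    h_for_peak maps an interpretation (e.g. "b", "y-H2O") to its H score.
    A peak with a single interpretation is compared against a score of 0.
    """
    ranked = sorted(h_for_peak.values(), reverse=True)
    best = ranked[0]
    runner_up = ranked[1] if len(ranked) > 1 else 0.0
    return 100.0 * (best - runner_up) / best

# Hypothetical H scores for one peak under two interpretations.
print(score_gap_percent({"b": 10.0, "y-H2O": 4.0}))  # 60.0
```

A large gap suggests the peak belongs to the core interpretation; a gap near zero means competing assignments explain the spectrum about equally well.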
More theory
 Reduce dimensionality of the recurrence.
 Generate sub-optimal paths
Some results
 How good is de novo?
Simulation test data set
 Given a peptide, generate artificial tandem
MS with differing intensities (as in SEQUEST)
 b,y with intensity 1.0,
 a with intensity 0.5,
 appropriate neutral losses with intensity 0.2
 Parameters g, e are chosen (controlling fragmentation
probability and mass error, respectively).
 Each fragment is generated with probability
min{g*i, 1}, where i is the intensity. An error
offset is chosen uniformly at random from
[-e, e].
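The generation procedure above can be sketched as follows. The choice of neutral losses (H2O and NH3 from b and y), the nominal offsets, and every name in the block are assumptions for illustration; only the intensities and the min{g*i, 1} rule come from the slide.

```python
import random

RESIDUE_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "I": 113, "N": 114, "D": 115, "Q": 128, "K": 128,
    "E": 129, "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}

# (name, is_prefix_ion, nominal offset, intensity)
ION_TYPES = [
    ("b", True, 1, 1.0), ("y", False, 19, 1.0), ("a", True, -27, 0.5),
    ("b-H2O", True, -17, 0.2), ("b-NH3", True, -16, 0.2),
    ("y-H2O", False, 1, 0.2), ("y-NH3", False, 2, 0.2),
]

def simulate_spectrum(peptide, g, e, rng):
    """Return sorted (m/z, intensity, ion name) peaks for an artificial spectrum."""
    masses = [RESIDUE_MASS[r] for r in peptide]
    parent = sum(masses)
    peaks, prefix = [], 0
    for m in masses[:-1]:  # one cleavage site after each residue but the last
        prefix += m
        for name, is_prefix, offset, intensity in ION_TYPES:
            if rng.random() < min(g * intensity, 1.0):
                base = prefix if is_prefix else parent - prefix
                peaks.append((base + offset + rng.uniform(-e, e), intensity, name))
    return sorted(peaks)

rng = random.Random(0)
for peak in simulate_spectrum("SGEK", g=1.0, e=0.5, rng=rng):
    print(peak)
```

With g = 1.0 every b and y ion is always present, while a ions and neutral losses appear with probability 0.5 and 0.2.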
Results: % Positions predicted
[Figure: % positions predicted (0 to 100) vs. ions/position (0 to 4.5); four curves: Forward and Backward, each at parameter values 0.1 and 0.2.]
Results: Peptide ID
[Figure: % peptides identified (0 to 100) vs. ions/position (1.5 to 4.5); four curves: Forward and Backward, each at parameter values 0.1 and 0.2.]
Results: % TIC explained
[Figure: % TIC explained (0 to 100) vs. ions/position (0 to 4.5); four curves: Forward and Backward, each at parameter values 0.1 and 0.2.]
Results: Core Interpretation
[Figure: histogram with counts from 0 to 1800 on the vertical axis and % score difference (0 to 100, in steps of 5) on the horizontal axis.]
Performance on real data sets
 Zufar et al. (~150 spectra)
 ISB (~500 spectra)
 Performance
 The best spectral interpretation was chosen for each
spectrum.
 For tags of different lengths, the number of spectra
with a correctly predicted sequence tag was reported.
De novo Performance on real data sets
[Figure: spectra with correct tag prediction (0% to 80%) vs. tag length (0 to 7); series: Zufar, ISB.]
Parameter optimization
 Peak selection parameters
 Scoring parameters (for different
interpretations)
 A simulated annealing step is used to
optimize parameters on a learning data
set.
De novo performance (optimization)
[Figure: "Parameter optimization"; spectra correctly predicted (0% to 100%) vs. tag length (0 to 7); series: Zufar, ISB, Zufar*, ISB*.]
De novo analysis
 ~50% of spectra have a correct 5-mer
prediction
 Worthwhile if databases are incomplete.
 Very useful as filters.
 Mann & Wilm, MS Blast, GutenTag all use
generated tags as filters
De novo interpretation as a filter
 Only tags from high scoring paths are used.
Fewer tags implies fewer candidate peptides
 Experiment:
 All substrings of lengths 3-6 were chosen as tags
 The Uniprot database (143K proteins, 100Mb) was
searched with these tag filters.
 A peptide is a candidate if the tag and flanking
masses are consistent with the interpretation
[Figure: candidate peptide within the sequence IVLSDFYLDEERVADCVLL, showing a tag with flanking masses M1 and M2.]
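The candidate test can be sketched as: find the tag in a protein, then extend left and right until the flanking masses are matched. This assumes nominal integer masses, exact matching with no tolerance, and that M1 and M2 are the residue masses before and after the tag; the function names and the chosen tag are illustrative.

```python
RESIDUE_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "I": 113, "N": 114, "D": 115, "Q": 128, "K": 128,
    "E": 129, "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}

def extend(protein, pos, step, target):
    """Residues needed from pos (moving by step) to sum to target, else None."""
    total, count = 0, 0
    while total < target and 0 <= pos < len(protein):
        total += RESIDUE_MASS[protein[pos]]
        pos += step
        count += 1
    return count if total == target else None

def candidates(protein, tag, m1, m2):
    """Peptides in protein that contain tag with flanking masses m1 and m2."""
    hits = []
    start = protein.find(tag)
    while start != -1:
        left = extend(protein, start - 1, -1, m1)
        right = extend(protein, start + len(tag), +1, m2)
        if left is not None and right is not None:
            hits.append(protein[start - left:start + len(tag) + right])
        start = protein.find(tag, start + 1)
    return hits

# Tag FYL with 202 (S + D) before it and 529 (D + E + E + R) after it.
print(candidates("IVLSDFYLDEERVADCVLL", "FYL", 202, 529))  # ['SDFYLDEER']
```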
Tags as effective filters
[Figure: average number of candidates (0 to 1800) vs. tag size (0 to 7); series: Series1, Series2.]
Tags as filters
 Pros:
 Eliminate a large portion of the database from
consideration
 Allow for efficient search using keyword trees (tries).
This dictionary search is independent of the size of
the tag space
 Allow searching with post-translational modifications
 Cons:
 Filter out some true hits. Dependent on interpretation.
Suboptimal paths must be considered as well
 General filters based on interpretations can be built
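A keyword tree makes the scan cost depend on the database length and the longest tag rather than on how many tags are searched. The sketch below uses a plain dictionary trie without failure links (so it is simpler than Aho-Corasick); the names and the example tags are illustrative.

```python
def build_trie(tags):
    """Nested-dict trie; the key "$" marks the end of a stored tag."""
    root = {}
    for tag in tags:
        node = root
        for ch in tag:
            node = node.setdefault(ch, {})
        node["$"] = tag
    return root

def scan(protein, trie):
    """Return (start index, tag) for every tag occurrence in protein."""
    hits = []
    for start in range(len(protein)):
        node = trie
        pos = start
        while pos < len(protein) and protein[pos] in node:
            node = node[protein[pos]]
            pos += 1
            if "$" in node:
                hits.append((start, node["$"]))
    return hits

trie = build_trie(["SDF", "FYL", "DEE", "XYZ"])
print(scan("IVLSDFYLDEERVADCVLL", trie))  # [(3, 'SDF'), (5, 'FYL'), (8, 'DEE')]
```

Each start position walks at most as many steps as the longest tag, regardless of how many tags the trie holds.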
Mass spectra with PT modification
 Most database search software allows PT
modifications
 Search based on generation of modified
candidate peptides
 A combinatorial expansion occurs.
 Searching with PT modifications is
computationally intensive.
Efficient handling of PT modifications
 Consider the peptide SDFTYLDER
 S,T,Y can all be phosphorylated (or not) giving 8
possibilities
 Parent mass consideration reduces this, but there are
still a large number of possibilities.
 With an increase in the type of PT modifications, this
number can become very large
 De novo interpretation complexity is unchanged
in the presence of PT modifications
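The combinatorial expansion is easy to see by enumeration. This sketch assumes a nominal phosphorylation shift of +80 and treats every S, T, and Y as an optional site; the names are hypothetical.

```python
from itertools import product

PHOSPHO_SHIFT = 80        # nominal mass added per phosphorylation (assumed)
PHOSPHO_SITES = set("STY")

def phospho_variants(peptide):
    """All (modified positions, total mass shift) combinations for a peptide."""
    positions = [i for i, r in enumerate(peptide) if r in PHOSPHO_SITES]
    variants = []
    for flags in product([False, True], repeat=len(positions)):
        modified = tuple(p for p, on in zip(positions, flags) if on)
        variants.append((modified, PHOSPHO_SHIFT * len(modified)))
    return variants

variants = phospho_variants("SDFTYLDER")
print(len(variants))  # 8

# Parent mass consideration: keep only variants with exactly one phosphate.
single = sorted(v for v in variants if v[1] == PHOSPHO_SHIFT)
print(single)  # [((0,), 80), ((3,), 80), ((4,), 80)]
```

A database search must generate and score each surviving variant, whereas a spectral graph only needs one extra edge mass per modified residue.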
PT modifications
[Figure: % spectra correctly predicted (0% to 70%) vs. tag length (0 to 7); series: Zufar, ISB.]
Conclusions
 Computational methods for de novo analysis are
continually improving.
 Things should get much better with technology
improvement
 Great potential as filters
 Searching large genomic databases
 Searching in the presence of PT modifications
 Filtering without tags
 De novo interpretation should be revived from
the backwaters of MS analysis.
Acknowledgments
 Nathan Edwards, ABI
 UCSD:
 Nuno Bandeira, Ari Frank, Qian Peng, Pavel
A. Pevzner, Noah Zaitlen