Transcript XCorr

“shotgun sequencing”
[Figure: TOP10 acquisition cycle. One FTICR full scan is followed by ten LTQ MS2 scans (1st to 10th); fill times and scan times are plotted against Time [ms], 0 to 3000.]
[Figure: full-scan spectrum, Relative Intensity vs. m/z, with labeled peaks from 463.75125 to 1029.57788 (most intense: 926.49408); the precursors picked for MS2 are marked 1st to 10th.]
spectral matching
[Figure: MS/MS (MS2) spectrum, m/z axis 0 to 1500.]
[Figure: “shotgun sequencing” timeline. ms1 scans are acquired over time, and ms2 scans are acquired between them.]
distributed spectral matching
6000 spectra x 10 s/spectrum = 60,000 s, roughly 16 CPU hours
[Figure: LTQ Orbitrap base peak chromatogram, Relative Abundance (0 to 100) vs. Retention time (min).]
Run: 37 min LC-MS/MS run-time, 6186 MS/MS spectra, 2308 peptide IDs (false-positive rate 1%), 287 protein IDs
search time
  Server, single CPU: 16 hours
  Server, parallel CPUs (20 nodes): 0.8 hours
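The search-time arithmetic on this slide can be checked with a minimal sketch. It assumes spectra are split evenly across nodes with no scheduling or network overhead; `search_hours` is a hypothetical helper, not part of any search engine.

```python
def search_hours(n_spectra: int, seconds_per_spectrum: float, n_nodes: int = 1) -> float:
    """Wall-clock hours, assuming a perfectly even split across nodes (no overhead)."""
    cpu_seconds = n_spectra * seconds_per_spectrum
    return cpu_seconds / n_nodes / 3600.0

# Figures from the slide: 6000 spectra at 10 s each, one CPU vs. 20 nodes.
single = search_hours(6000, 10.0)
parallel = search_hours(6000, 10.0, n_nodes=20)
print(f"single CPU: {single:.1f} h, 20 nodes: {parallel:.2f} h")
```

The slide rounds the single-CPU time down to 16 hours, which is where the 0.8 hours (16 / 20) for 20 nodes comes from.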
sequest
XCorr: goodness of fit between the experimental MS/MS spectrum and the theoretical b and y ions of a candidate peptide from the database
dCn: fractional XCorr difference between the highest XCorr and the next-highest XCorr
Yates J.R. 3rd et al. J Am Soc Mass Spectrom 5:976-89 (1994)
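As a small worked example of the dCn definition above, the sketch below computes the fractional difference between the best and second-best XCorr for one spectrum. The function name and the example scores are illustrative assumptions.

```python
def delta_cn(xcorrs):
    """Fractional difference between the highest and next-highest XCorr."""
    if len(xcorrs) < 2:
        raise ValueError("dCn needs at least two candidate scores")
    ranked = sorted(xcorrs, reverse=True)
    best, second = ranked[0], ranked[1]
    return (best - second) / best

# Hypothetical XCorr values for the candidate peptides of one spectrum.
scores = [2.4, 3.0, 1.1]
dcn = delta_cn(scores)  # (3.0 - 2.4) / 3.0
```

A dCn close to 0 means the runner-up fits almost as well as the top hit; a dCn close to 1 means the top hit stands alone.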
sequest
[Figure: input files. ms1 and ms2 scans are acquired over time; one LC run yields 5000 - 25000 ms2 spectra.]
raw: all ms2 in LC run (all ms2 = 1 file)
ms2 array: all ms2 in LC run
dta: 1 ms2 = 1 file (all ms2 = ~10000 files); each dta carries a precursor m/z and a charge state (values shown: 501.000 and 1001.500; +2 and +3)
>IPI00000001.2
MSQVQVQVQNPSAALSGSQILNKNQSLLSQPLMSIPSTTSSLPSENAGRPIQNSALPSASITSTSAAAESITPTVELNAL
CMKLGKKPMYKPVDPYSRMQSTYNYNMRGGAYPPRYFYPFPVPPLLYQVELSVGGQQFNGKGKTRQAAKHDAAAKALRIL
QNEPLPERLEVNGRESEEENLNKSEISQVFEIALKRNLPVNFEVARESGPPHMKNFVTKVSVGEFVGEGEGKSKKISKKN
AAIAVLEELKKLPPLPAVERVKPRIKKKTKPIVKPQTSPEYGQGINPISRLAQIQQAKKEKEPEYTLLTERGLPRRREFV
MQVKVGNHTAEGTGTNKKVAKRNAAENMLEILGFKVPQAQPTKPALKSEEKTPIKKPGDGRKVTFFEPGSGDENGTSNKE
DEFRMPYLSHQQLPAGILPMVPEVAQAVGVSQGHHTKDFTRAAPNPAKATVTAMIARELLYGGTSPTAETILKNNISSGH
VPHGPLTRPSEQLDYLSRVQGFQVEYKDFPKNNKNEFVSLINCSSQPPLISHGIGKDVESCHDMAALNILKLLSELDQQS
TEMPRTGNGPMSVCGRC
For each dta (1 dta, 2 dta, 3 dta, ... 10000 dta), searched against the human ipi database (61236 proteins):
digest to next peptide: MSQVQVQVQNPSAALSGSQILNK
calculate peptide mass: 2426.258812
compare with precursor peptide mass: 1000.000, 3000.000 +/- 1Da
not a candidate
if cand., calc. theoretical spectrum
correlate, score & return
1 x 3,250,000 times, 2 x 3,250,000 times, 3 x 3,250,000 times, ... 10000 x 3,250,000 times
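The inner loop above (digest, calculate mass, compare with precursor) can be sketched as follows. This is a simplified illustration, not SEQUEST's implementation: it assumes cleavage after K or R except before P, no missed cleavages, and standard monoisotopic residue masses.

```python
# Monoisotopic residue masses (Da); a neutral peptide adds one water.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056

def tryptic_digest(protein):
    """Cleave after K or R unless the next residue is P (assumed rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        is_last = i == len(protein) - 1
        if is_last or (aa in "KR" and protein[i + 1] != "P"):
            peptides.append(protein[start:i + 1])
            start = i + 1
    return peptides

def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER

def is_candidate(peptide, precursor_mass, tol_da=1.0):
    """True if the peptide mass falls within the precursor tolerance window."""
    return abs(peptide_mass(peptide) - precursor_mass) <= tol_da

# First residues of the IPI00000001.2 entry shown above.
protein_start = "MSQVQVQVQNPSAALSGSQILNKNQSLLSQPLMSIPSTTSSLPSENAGRPIQ"
peptides = tryptic_digest(protein_start)
first = peptides[0]
print(first, round(peptide_mass(first), 4), is_candidate(first, 1000.0))
```

With these residue masses the first peptide comes out within about 0.001 Da of the 2426.258812 on the slide, and it is rejected for a 1000.000 +/- 1 Da precursor.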
[Figure: theoretical “candidate” spectrum and experimental peptide spectrum, with their correlation spectrum plotted over offsets from -2000 to 2000.]
Yates J.R. 3rd et al. J Am Soc Mass Spectrom 5:976-89 (1994)
similarity scoring
[Figure: correlation spectrum (offsets -2000 to 2000) with the Xcorr score marked. Yates J.R. 3rd et al. J Am Soc Mass Spectrom 5:976-89 (1994)]
similarity scoring – cross-correlation vs dot product
[Figure: correlation spectrum (offsets -2000 to 2000) with the Dot product and the Xcorr (cross-correlation) score marked.]
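The difference between the two scores can be illustrated with a minimal sketch. It assumes unit-width binned spectra and takes XCorr as the correlation at zero offset minus the mean correlation over the surrounding offsets; the window of 75 bins, the function names, and the toy spectra are assumptions for illustration, not SEQUEST's exact preprocessing.

```python
def correlation(x, y, tau):
    """Sum of x[i] * y[i + tau] over bins where both indices are valid."""
    return sum(x[i] * y[i + tau] for i in range(len(x)) if 0 <= i + tau < len(y))

def dot_product(x, y):
    """Similarity at zero offset only."""
    return correlation(x, y, 0)

def xcorr(x, y, window=75):
    """Zero-offset correlation minus the mean correlation at nonzero offsets."""
    offsets = [t for t in range(-window, window + 1) if t != 0]
    background = sum(correlation(x, y, t) for t in offsets) / len(offsets)
    return correlation(x, y, 0) - background

def binned(peaks, n_bins=50):
    """Toy spectrum: intensity 1.0 in each listed bin, 0.0 elsewhere."""
    spectrum = [0.0] * n_bins
    for b in peaks:
        spectrum[b] = 1.0
    return spectrum

experimental = binned([5, 12, 30])
matching = binned([5, 12, 30])   # theoretical spectrum with the same peaks
dense = [1.0] * 50               # "theoretical" spectrum with a peak in every bin

print(dot_product(experimental, matching), dot_product(experimental, dense))
print(xcorr(experimental, matching, window=10), xcorr(experimental, dense, window=10))
```

In this toy case both candidates have the same dot product, because the dense spectrum overlaps every experimental peak by chance. The cross-correlation score separates them, since the dense spectrum matches equally well at every offset and the background term removes that.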
non-indexed searching
human ipi database, 61236 proteins, read from first to last entry for each precursor (1200 +/- 1Da)
1st: >ipi00000001.2 MSQVQVQVQNPSAALSGSQILNKNQSLLSQPLMSIPSTTSSLPSENAGRPIQNSALPSASITSTSAAAESITPTVELNAL….
61236th: >ipi00853644.1 ….AKPNINLITGHLEEPMPNPIDEMTEEQKEYEAMKLVNMLDKLSREELLKPMGLKPDGTIT
indexed searching
human ipi database, 61236 proteins, indexed by peptide mass
75 Da: >ipi00001234.11 G
1200 +/- 1Da: >ipi00344567.1 WEFGGHTVLR
20245 Da: >ipi00853644.1 AKPNINLITGHLEEPMPNPIDEMTEEQEYEAMLVNMLDLSEELLKPMGLKPDGTITAKPNINLITGHLEEPMPNPIDEMTEEQEYEAMLVNMLDLSEELLKPMGLKPDGTIT
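The idea behind indexed searching can be sketched with a mass-sorted list and a binary search, so that only peptides inside the precursor window are ever visited. This is a minimal illustration under assumptions (in-memory index, standard monoisotopic residue masses, hypothetical function names), not the index format of any real search engine.

```python
import bisect

# Monoisotopic residue masses (Da); a neutral peptide adds one water.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056

def peptide_mass(peptide):
    return sum(MONO[aa] for aa in peptide) + WATER

def build_index(peptides):
    """Return (sorted masses, peptides in the same order)."""
    pairs = sorted((peptide_mass(p), p) for p in set(peptides))
    return [m for m, _ in pairs], [p for _, p in pairs]

def candidates(index, precursor_mass, tol_da=1.0):
    """Binary-search the mass window instead of scanning every peptide."""
    masses, peptides = index
    lo = bisect.bisect_left(masses, precursor_mass - tol_da)
    hi = bisect.bisect_right(masses, precursor_mass + tol_da)
    return peptides[lo:hi]

index = build_index(["G", "WEFGGHTVLR", "MSQVQVQVQNPSAALSGSQILNK"])
hits = candidates(index, 1200.0)
print(hits)
```

The lookup cost grows with the logarithm of the index size plus the number of hits, whereas the non-indexed search re-reads and re-digests the whole database for every spectrum.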
scoring & analysis
            Score/Metric 1   Score/Metric 2   Score/Metric 3
Peptide A   7.65             0.99             97
Peptide B   6.99             0.87             97
Peptide C   6.21             0.65             97
Peptide D   5.57             0.71             96
Peptide E   3.31             0.44             50
Peptide F   1.85             0.41             41
[Figure: frequency vs. score/criterion, with a cutoff/threshold dividing the two distributions into TN, FN, FP, and TP regions.]
sensitivity = TP / (TP + FN)
precision = TP / (TP + FP)
specificity = TN / (TN + FP)
accuracy = (TP + TN) / (TP + TN + FN + FP)
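The four definitions translate directly into code. The sketch below uses made-up counts purely to show the arithmetic; `classification_metrics` is a hypothetical helper.

```python
def classification_metrics(tp, tn, fp, fn):
    """Sensitivity, precision, specificity, and accuracy from confusion counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fn + fp),
    }

# Illustrative counts only (not from the slides).
metrics = classification_metrics(tp=90, tn=880, fp=10, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Moving the cutoff to a higher score trades sensitivity for precision: fewer FP pass, but more TP become FN.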
The Results: Distinguishing Right from Wrong
In large proteomics data sets (for which manual data inspection is impossible), how can we distinguish between correct and incorrect peptide assignments?
Use “decoy” sequences to distract non-peptidic, nonuniquely matchable, or otherwise unmatchable spectra into a search space that is known a priori to be incorrect
Use the frequency of “decoy” sequences among total sequences to estimate the overall frequency of wrong answers (False Positive Rate)
Adjust filtering criteria to achieve a ~1% False Positive Rate
Decoy Sequences? A “Reversed” Database!
We generate decoy sequences by reversing each protein sequence in a given database, such that the resultant in silico digest contains nonsense peptides, then append the reversed database to the end of the forward database.
SEARCHING
Decoy references are labeled with #
Database searching with SEQUEST proceeds from top to bottom. An incorrect match is equally likely to land on a decoy sequence or on a non-decoy sequence, so each decoy match implies roughly one undetected wrong match among the forward hits. Our FPR is therefore (# of decoys) x 2 / total matches.
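A minimal sketch of the reversed-database construction and the FPR estimate described above. It assumes the database is held as a dict of reference to sequence and that decoy references carry the `#` prefix mentioned on the slide; the function names are hypothetical.

```python
def add_reversed_decoys(proteins):
    """Return the forward entries followed by their reversed, '#'-labeled decoys."""
    composite = dict(proteins)          # forward database first
    for ref, seq in proteins.items():   # reversed database appended at the end
        composite["#" + ref] = seq[::-1]
    return composite

def estimated_fpr(matched_refs):
    """FPR = (# of decoys) x 2 / total matches."""
    decoys = sum(1 for ref in matched_refs if ref.startswith("#"))
    return 2 * decoys / len(matched_refs)

composite = add_reversed_decoys({"P1": "MAGFA"})
# Illustrative final list: 98 forward matches and 2 decoy matches.
fpr = estimated_fpr(["P1"] * 98 + ["#P1"] * 2)
print(composite, fpr)
```

With 2 decoy hits in 100 matches the estimate is 4%.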
Target/Decoy Database Searching
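Forward database: 1. MAGFA→ → →SHTRP
Reversed database: 1. PRTHS→ → →AFGAM
Composite Database (forward + reversed) → Sequest → Filter (scoring, mass accuracy, etc) → Generate final list
[Figure: composition of the final list. Right: 100% forward (F). Wrong (random): 50% forward (F), 50% reversed (R). Reversed hits are Known FP; wrong forward hits are Unknown FP.]
Estimate FP rate from 2 x Rev (i.e., 4%)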
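The Filter step can be tuned against the decoy-based estimate. The sketch below walks down a score-sorted list and reports the lowest XCorr cutoff whose estimated FP rate (2 x Rev / total) stays at or below a target. It assumes a single score, distinct score values, and a hypothetical `(xcorr, is_decoy)` input layout; real filtering would combine several criteria.

```python
def xcorr_cutoff_for_fpr(psms, target_fpr=0.01):
    """psms: list of (xcorr, is_decoy). Returns the lowest passing cutoff, or None."""
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    cutoff, decoys = None, 0
    for n_accepted, (score, is_decoy) in enumerate(ranked, start=1):
        decoys += is_decoy
        # Estimated FP rate if every PSM scoring >= this one is accepted.
        if 2 * decoys / n_accepted <= target_fpr:
            cutoff = score
    return cutoff

# Illustrative data: 300 forward hits above 4.0, then a mix of forward and decoy hits.
psms = [(4.0 + i * 0.01, False) for i in range(300)]
psms += [(3.5, True), (3.2, False), (3.0, True)]
cutoff = xcorr_cutoff_for_fpr(psms, target_fpr=0.01)
print(cutoff)
```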
sequest scores: finding true positives
[Figure: DCn (0 to 0.8) vs. XCorr (0 to 8) scatter plots for Forward + Reverse and for Forward Sequences.]
[Figure: PSM number (0 to 50) vs. XCorr (0 to 8), showing the TP and FP distributions.]
High Mass Accuracy
Mass “Accuracy” in Proteomics:
Precision of mass errors between observed and actual m/z
[Figure: histograms of Pept. IDs vs. Mass accuracy (ppm), -20 to 20, for LTQ FT (SIM) and for LTQ Orbitrap & LTQ FT (AGC target 50,000 to avoid space-charge effects). Annotated distributions: -0.2 ± 1.0 ppm and 0.1 ± 0.4 ppm.]
Performance is related to the width of the distribution, not the average error
Haas et al. (2006) Mol. Cell. Proteomics 5, 1326
Olsen et al. (2004) Mol. Cell. Proteomics 3, 608
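To make the point about distribution width concrete, the sketch below computes ppm mass errors and compares two made-up error sets: one offset but narrow, one centered but wide. The values are illustrative assumptions, not data from the cited papers.

```python
from statistics import mean, stdev

def ppm_error(observed_mz, theoretical_mz):
    """Mass measurement error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6

example = ppm_error(500.001, 500.000)  # about 2 ppm

# Hypothetical ppm errors of confident identifications on two instruments.
offset_but_narrow = [4.8, 5.0, 5.2, 4.9, 5.1]
centered_but_wide = [-6.0, -3.0, 0.0, 3.0, 6.0]

for name, errors in (("narrow", offset_but_narrow), ("wide", centered_but_wide)):
    print(f"{name}: {mean(errors):+.1f} ± {stdev(errors):.1f} ppm")
```

A constant offset can be removed by recalibration, after which the narrow set fits inside a tolerance window of a fraction of a ppm. The wide set needs a window of several ppm no matter how well it is centered.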
MMA: True Positives and False Positives
[Figure: True Positives and False Positives plotted against MMA, with 0 marked on the axis.]
False positives are distributed evenly across MMA space
[Figure: PSM number (0 to 50) with the TP and FP distributions over a 0 to 8 score axis.]
MS/MS vs MMA: Precision vs Sensitivity
[Figure: PSM number (0 to 50) with the TP and FP distributions over a 0 to 8 score axis, shown alongside the MMA axis.]
MS/MS criteria are strong precision filters, but they require TP / FP separation for sensitivity
MMA criteria are weak precision filters that assist MS/MS criteria in improving sensitivity
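Distracting Wrong from Right: MMA
[Figure: Search Space. True Positives and False Positives plotted against MMA, with 0 marked on the axis.]
[Figure: Extended Search Space. True Positives and False Positives plotted against MMA, with the regions outside the accepted window marked Filtered.]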
Mass Accuracy: Another dimension of selectivity
[Figure: DCn (0 to 0.8) vs. XCorr (0 to 8) scatter plots for Forward Sequences and for Forward + Reverse. Panels: Tryptic Search, +/- 2Da; and Tryptic Search, +/- 2Da, with 5ppm filter.]
Distracting Wrong from Right: Trypticity
[Figure: Tryptic Search. True Positives and False Positives are both of the form K/R-PeptideK/R-.]
[Figure: Partial Enzyme Search. Candidates may be K/R-Peptide or PeptideK/R-, with the other end adjacent to any residue (A- G- C- S- T- I- L- F- P- M- V- H- D- E- Y- W- Q- N-); True Positives and False Positives are shown with regions marked Filtered.]
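One way to apply trypticity as a filter is to search with partial enzyme specificity and then keep only matches that are tryptic at both ends. The sketch below counts tryptic termini for a match; it assumes the flanking residues are known, uses `-` for a protein terminus, and ignores the proline rule. The function name is hypothetical.

```python
def tryptic_termini(prev_residue, peptide, next_residue):
    """Number of peptide ends (0, 1, or 2) consistent with trypsin cleavage."""
    # N-terminal end: the residue before the peptide is K/R, or the protein starts here.
    n_ok = prev_residue in ("K", "R", "-")
    # C-terminal end: the peptide ends in K/R, or the protein ends here.
    c_ok = peptide[-1] in ("K", "R") or next_residue == "-"
    return int(n_ok) + int(c_ok)

# (preceding residue, peptide, following residue); flanking residues are illustrative.
matches = [
    ("K", "WEFGGHTVLR", "A"),   # K/R-PeptideK/R-: fully tryptic
    ("A", "WEFGGHTVLR", "A"),   # PeptideK/R- only
    ("A", "WEFGGHTVL", "R"),    # neither end tryptic
]
fully_tryptic = [m for m in matches if tryptic_termini(*m) == 2]
print(fully_tryptic)
```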
What do we have here, hm?
[Figure: dCn (0 to 1) vs. XCorr (0 to 8), n = 286, with Unphosphorylated, Phosphorylated, and Reversed Hits plotted.]
Phosphopeptides: Chemically disadvantaged…
Dataset of phosphorylated and unphosphorylated peptide MS/MS pairs
Example pair: MSFEILR with and without a phosphate (P)
Singly Phosphorylated (n=207)
Doubly Phosphorylated (n=79)
[Figure: XCorr (Phosphorylated) vs. XCorr (Unphosphorylated), both 0 to 8, n = 286.]
[Figure: dCn (Phosphorylated) vs. dCn (Unphosphorylated), both 0.0 to 1.0, n = 286.]
Phosphopeptides: Less power in XCorr & dCn
[Figure: XCorr (Ph/UnPh) ratio, 0 to 2, for Singly Phosphorylated and Doubly Phosphorylated peptides relative to Unphosphorylated; annotated 86%.]
[Figure: dCn (Ph/UnPh) ratio, 0 to 2, for the same peptides relative to Unphosphorylated; annotated 93%.]
Mass Accuracy: Can it help for phosphorylation?
Sample preparation: Yeast Whole-Cell Lysate → Red., Alkyl. → SDS-PAGE (60-80 kDa) → Trypsin → IMAC-purification
[Figure: LTQ scan cycle. MS/MS scans 1 to 10 over Time (sec), 0 to 4.]
[Figure: Orbitrap scan cycle over Time (sec), 0 to 4. Ion Accumulation for Full MS (1x10^6), Orbitrap Full MS Scan (R 6x10^4), and LTQ MS/MS scans 1 to 10.]
Mass Accuracy: Rescuing phosphopeptides
SEQUEST partial enzyme search, fully tryptic peptide spectral matches
[Figure: XCorr (0 to 8) vs. MMA (ppm) for Orbitrap TOP10 and LTQ TOP10. MMA axes span -50 to 50 ppm and -750 to 750 ppm. Panel annotations: n=1311 and n=1390. XCorr values annotated by charge state: +2: 1.3, +3: 2.3 and +2: 2.7, +3: 3.5.]
Mission: Phosphopeptide rescue – accomplished!
[Figure: bar chart of # of phosphopeptides (axis 0 to 1200) for LTQ, Orbitrap (No MMA), and Orbitrap (MMA). Bar labels: 600 (1.0% FP), 715 (1.0% FP), 1046 (0.4% FP); 74% increase.]
search algorithms & phosphorylation
[Figure: overlap between sequest and omssa results; region counts 98, 936, and 928.]
Bakalarski et al., Anal. Bioanal. Chem., 2007
phosphorylation site localization
GFDSNQpTWR or GFDpSNQTWR?
Beausoleil et al., Nat. Biotechnol, 2006
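The localization question starts from the set of possible site isomers. The sketch below enumerates them for a peptide with a given number of phosphates, assuming S, T, and Y as the candidate acceptors and the `p` prefix notation used on the slide; scoring the isomers against the spectrum is a separate step not shown here.

```python
from itertools import combinations

def site_isomers(peptide, n_phospho=1, acceptors="STY"):
    """All ways to place n_phospho phosphates on the acceptor residues."""
    sites = [i for i, aa in enumerate(peptide) if aa in acceptors]
    isomers = []
    for chosen in combinations(sites, n_phospho):
        isomers.append(
            "".join("p" + aa if i in chosen else aa for i, aa in enumerate(peptide))
        )
    return isomers

isomers = site_isomers("GFDSNQTWR")
print(isomers)
```

For this peptide there are only two candidates, and they differ only in the fragment ions that fall between the S and the T.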
phosphorylation site localization
Taus et al., JPR, 2011
phosphorylation false localization rate (FLR)
use non-native phosphoacceptors as “decoys”
Ser + Thr (human proteome): 14.1%
Pro + Glu (human proteome): 14.5%
allow search engine / localization assessment tools to consider pP and pE as true negative “decoys”
calculate dataset FLR based on frequency of pP + pE “decoys”
Baker et al., MCP, 2011
Chalkley & Clauser, MCP, 2012
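One possible way to turn the decoy counts into a dataset FLR is sketched below. It is an assumption-laden illustration, not the procedure from the cited papers: it assumes a wrong localization lands on a residue type in proportion to that residue's frequency, so the number of wrong Ser/Thr localizations is estimated from the pP + pE count scaled by 14.1 / 14.5. The function name and the example counts are hypothetical.

```python
SER_THR_FREQ = 0.141  # Ser + Thr frequency in the human proteome (from the slide)
PRO_GLU_FREQ = 0.145  # Pro + Glu frequency in the human proteome (from the slide)

def estimated_flr(n_decoy_sites, n_total_sites):
    """Estimated fraction of reported Ser/Thr sites that are mislocalized.

    Assumption: wrong localizations fall on S/T and on P/E in proportion
    to the residue frequencies above.
    """
    expected_false_st = n_decoy_sites * SER_THR_FREQ / PRO_GLU_FREQ
    n_st_sites = n_total_sites - n_decoy_sites
    return expected_false_st / n_st_sites

# Illustrative counts: 1010 localized sites, 10 of them on Pro or Glu.
flr = estimated_flr(n_decoy_sites=10, n_total_sites=1010)
print(f"{flr:.4f}")
```

Because the two residue frequencies are nearly equal, the pP + pE count is close to a one-to-one estimate of the wrong Ser/Thr localizations, which is what makes Pro and Glu convenient decoys.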