Computational Methods and Bioinformatics in
Proteomic Studies
Bioinformatics: Building Bridges April 14, 2005
Tim Griffin
Dept. Biochemistry, Molecular Biology and Biophysics
Interdisciplinary biology in the 21st century
Genome-era biology: system-wide studies
The yeast genome on a chip
DeRisi et al, 1997, Science 278:680
The simple view: one gene, one protein
The reality: biological systems are complex
Protein interaction network in Drosophila
Science (2003) 302, p. 1727
Why analyze at the protein level?
control of eukaryotic gene expression
[Figure: pathway from DNA to active protein, with a control point at each step]
Nucleus: DNA → (transcriptional control) → primary RNA transcript → (RNA processing control) → mRNA
Cytosol: mRNA → (RNA transport control) → mRNA → (translational control) → protein, or inactive mRNA
Protein → (protein activity control) → active protein or inactive protein
What is proteomics?
“Proteomics includes not only the identification
and quantification of proteins, but also the
determination of their localization, modifications,
interactions, activities, and, ultimately, their
function.”
-Stan Fields in Science, 2001.
Alternatively: proteomics = “fast biochemistry”
Proteomics: a complement to genomics
What proteomic analysis has to offer:
• measurement of protein response, which is not always
indicated by mRNA response
• post-translational modifications
• macromolecular interactions
• sub-cellular location
• high-resolution structural and molecular characterization
Genomics, Proteomics, and Systems Biology
[Figure: genomics and proteomics feeding into computational biology]
genomics: genomic DNA, mRNA (arrays, sequencing)
proteomics: protein products, functional protein (protein cataloguing, quantitative profiling, protein phosphorylation, protein modifications, sub-cellular location, catalytic activity, 3D structure, protein dynamics, descriptive protein interaction maps)
technology status legend: mature, prototype, emerging
computational biology: identify system components; measure and define properties; system interactions between components
Proteomics technologies and methods
• Two-dimensional gel electrophoresis
• mass spectrometry
• protein chips
• yeast 2-hybrid
• phage display
• antibody engineering
• high-throughput protein expression
• high-throughput X-ray crystallography
The 1990s revolution: mass spectrometry
Development of physical methods to mass analyze large biomolecules
• ionization: MALDI; electrospray (liquid chromatography, nanospray)
• separation by m/z: quadrupole, ion trap, time-of-flight
• detection
• mass analysis of proteins, peptides, DNA
Electrospray ionization (ESI)
• protein and peptide analysis, multiply charged
ions
• quadrupole and TOF detection
• tandem mass spectrometry
• solution phase ionization – enables online
coupling with liquid chromatography (LC)
Separations of complex mixtures: crowd control
• Enables the processing of the many components
in big protein mixtures
[Figure: turnstile analogy, with components passing through one at a time: 1, 2, 3, ...]
Identification of protein mixtures by tandem mass spectrometry
Protein mixture → trypsin → peptides → µLC → ESI → tandem mass spectrometer
1. MS “survey” scan
2. select specific peptide
3. CID (Ar): fragment peptide
4. detect fragments
[Figure: tandem mass spectrum (MS/MS), relative abundance vs m/z]
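The trypsin step of this workflow can be mimicked in silico. The following is a minimal sketch, assuming the commonly used simplified rule (cleave after K or R unless the next residue is P) and no missed cleavages. The function name `tryptic_digest` and the toy sequence are illustrative and do not come from the talk.

```python
def tryptic_digest(protein: str) -> list[str]:
    """Split a protein sequence into tryptic peptides (no missed cleavages).

    Simplified rule (an assumption): cleave C-terminal to K or R,
    except when the following residue is P.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    # Keep any C-terminal remainder that does not end in K or R.
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


# Hypothetical toy sequence built around the example peptide from the slides.
toy_protein = "MKNSGDIVNLGSIAGRAKPLR"
print(tryptic_digest(toy_protein))
```

Real search engines also consider missed cleavages and modifications, which this sketch leaves out.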
Peptide sequence determination from MS/MS spectra
Collision-induced dissociation (CID) creates two prominent ion series:
y-series: y14 y13 y12 y11 y10 y9 y8 y7 y6 y5 y4 y3 y2 y1
H2N-N--S--G--D--I--V--N--L--G--S--I--A--G--R-COOH
b-series: b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 b13 b14
[Figure: MS/MS spectrum, relative abundance vs m/z (200-1200)]
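The b- and y-series for the example peptide can be computed from residue masses. This is a minimal sketch, assuming singly charged fragments and standard monoisotopic residue masses. Only the residues that occur in NSGDIVNLGSIAGR are tabulated, and the function name `fragment_ions` is illustrative.

```python
# Monoisotopic residue masses (Da) for the residues in the example peptide.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841,
    "L": 113.08406, "I": 113.08406, "N": 114.04293, "D": 115.02694,
    "R": 156.10111,
}
WATER = 18.01056   # H2O, added to C-terminal (y) fragments
PROTON = 1.00728   # charge carrier for a 1+ ion


def fragment_ions(peptide):
    """Return (b_ions, y_ions): singly charged m/z values for b1..bn and y1..yn."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(peptide) + 1):
        b_ions.append(sum(masses[:i]) + PROTON)           # N-terminal fragment
        y_ions.append(sum(masses[-i:]) + WATER + PROTON)  # C-terminal fragment
    return b_ions, y_ions


b_ions, y_ions = fragment_ions("NSGDIVNLGSIAGR")
for i, (b, y) in enumerate(zip(b_ions, y_ions), start=1):
    print(f"b{i:<2} {b:9.3f}   y{i:<2} {y:9.3f}")
```

Because leucine and isoleucine have the same mass, fragment masses alone cannot tell L from I.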
Identification of protein mixtures by mass spectrometry
1. De novo (i.e. manually)
2. Database searching: observed spectrum compared with theoretical spectra (DNA or protein database) → peptide identification → protein identification
[Figure: observed and theoretical MS/MS spectra, relative abundance vs m/z (200-1200)]
Peptide sequence identifies the protein
H2N-NSGDIVNLGSIAGR-COOH
GDIVNLGSIAGR
DIVNLGSIAGR
IVNLGSIAGR
VNLGSIAGR
NLGSIAGR
LGSIAGR
GSIAGR
SIAGR
IAGR
AGR
GR
R
[Figure: MS/MS spectrum, relative abundance vs m/z (200-1200)]
YMR134W, yeast protein involved in iron metabolism
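Once a peptide sequence is known, finding its parent protein is a substring lookup in a sequence database. The following is a minimal sketch with a hypothetical toy database. The entry names and sequences are invented for illustration and are not real yeast proteins.

```python
def proteins_containing(peptide, database):
    """Return the names of database entries whose sequence contains the peptide."""
    return [name for name, sequence in database.items() if peptide in sequence]


# Hypothetical toy database (invented names and sequences).
toy_database = {
    "toy_protein_A": "MKNSGDIVNLGSIAGRAKPLR",
    "toy_protein_B": "MSTLLVNQGSIAGKDER",
    "toy_protein_C": "MAGRGRSIAGRLK",
}

# The full-length peptide points to a single entry.
print(proteins_containing("NSGDIVNLGSIAGR", toy_database))
# A short fragment such as AGR matches more than one entry.
print(proteins_containing("AGR", toy_database))
```

The second call illustrates why longer peptide sequences are more useful for identification: short sequences tend to occur in many proteins.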
High-throughput protein identification by LC-MS/MS and automated sequence database searching
Raw MS/MS spectrum → protein sequence and/or DNA sequence database search → peptide sequence match → protein identification
Peptide sequence match: H2N-NSGDIVNLGSIAGR-COOH
Direct identification of 1000+ proteins from complex mixtures
[Figure: raw and matched MS/MS spectra, relative abundance vs m/z (200-1200)]
Case Study: Proteomic Analysis of Oral Cancer Progression
• Mouth cancer, tongue cancer, throat cancer
• In the USA, ~30,000 people are newly diagnosed
with oral cancer each year; one person dies from
oral cancer every hour of every day
• 350,000 to 400,000 new cases annually
worldwide
• Fewer than half will be alive in 5 years; 20x higher
risk of developing second primary tumors
• However, 80 to 90% cure rate when found
early. Unfortunately, at this time, the majority
are found as late-stage cancers
Progression of oral cancer
[Figure: progression from insult or injury to malignancy; transformation rate = 5-17%; the transition step is marked “??”]
Can we find molecular markers that predict this transition?
(adapted from Dr. Nelson Rhodus, U of M Dental School)
Saliva as a diagnostic fluid in oral cancer progression
• Readily available, non-invasive collection
• Heterogeneous human fluid with large dynamic range of protein
abundances – requires fractionation
• Many post-translationally modified proteins
• Currently only 100-150 proteins have been identified in whole
saliva (LC-MS/MS)
First step: obtain a comprehensive profile of the protein
components from a normal individual saliva sample
Multidimensional separations followed by mass spectrometry
1. Whole saliva protein mixture
2. FFE fractionation (70 fractions)
3. RP-capLC
4. ESI-MS/MS (500,000+ spectra)
5. Protein sequence and/or DNA sequence database search
6. Protein identification
Raw data processing: Automated database searching
Computational algorithms for searching MS/MS spectra against protein sequence databases, mRNA sequences, DNA sequences:
• ProFound
• Mascot
• PepSea
• MS-Fit
• MOWSE
• Peptident
• Multident
• Sequest
• PepFrag
• MS-Tag
[Figure: MS/MS spectrum (relative abundance vs m/z) → protein identification]
Choosing a sequence database
• National Center for Biotechnology Information (NCBI)
• Swiss-Prot/TrEMBL
• Protein Information Resource (PIR)
• European Bioinformatics Institute (EBI)
Considerations: organism-specificity, redundancy, annotation
Analysis of processed data: quality control of protein matches
Unfiltered: 10^5+ matches (lots of noise and junk)
→ filtering →
Filtered: thousands of “true” matches
Probability of sequence match via statistical modeling
Keller et al (2002) Analytical Chemistry 74, 5383
Sequence matches automatically assigned a P score between 0 and 1
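Once every match carries a probability, filtering reduces to a threshold, and the same probabilities give an estimate of how many retained matches are wrong. The following is a minimal sketch of that downstream step only. It does not reproduce the statistical model of Keller et al.; the 0.9 cutoff and the example pairs are assumptions chosen for illustration.

```python
def filter_matches(matches, min_probability=0.9):
    """Keep matches with probability >= min_probability.

    Returns (kept, estimated_error_rate). If each probability is the chance
    that a match is correct, the expected number of wrong matches among
    those kept is sum(1 - p).
    """
    kept = [(peptide, p) for peptide, p in matches if p >= min_probability]
    if not kept:
        return kept, 0.0
    expected_wrong = sum(1.0 - p for _, p in kept)
    return kept, expected_wrong / len(kept)


# Hypothetical (peptide, probability) pairs.
toy_matches = [
    ("NSGDIVNLGSIAGR", 0.99),
    ("EACDPLR", 0.95),
    ("LGSIAGR", 0.40),
    ("AGR", 0.05),
]
kept, error_rate = filter_matches(toy_matches)
print(kept)
print(f"estimated error rate among kept matches: {error_rate:.3f}")
```

Raising the cutoff lowers the estimated error rate at the cost of discarding more correct matches.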
Collating and interpreting the data: Interact software tool
http://www.systemsbiology.org/Default.aspx?pagename=proteomicssoftware
Result: Processed and Filtered Data
Saliva example: 433 unique proteins identified
Interpreting the data: annotated protein databases
National Center for Biotechnology Information (NCBI)
ExPASy/UniProt
European Bioinformatics Institute (EBI)
Organism/biology specific:
Saccharomyces Genome Database (SGD)
Human Mitochondrial Protein Database
Human Proteome Organization (HUPO)
Mining databases for data interpretation: Example 1
Mining databases for data interpretation: Example 2
Classification of interpreted data: subcellular localization
Subcellular Localization
Category                   Proteins   Share
Secreted/Extracellular     132        30.5%
Cytoplasmic                87         20.1%
Unknown                    56         12.9%
Structure/Cytoskeleton     47         10.8%
Cytoplasmic/Nuclear        30         6.9%
Membrane                   25         5.8%
Nuclear                    17         3.9%
Endoplasmic                9          2.1%
Lysosomal                  9          2.1%
Mitochondrial              7          1.6%
Ribosomal                  6          1.4%
Desmosomal                 3          0.7%
Endosomal                  3          0.7%
Peroxisomal                2          0.5%
Classification of interpreted data: functional characterization
Biological Function
Category                                Proteins   Share
Structural/Cytoskeletal                 65         15.0%
Protein Degradation/Inhibitor           57         13.2%
Defense/Immunoresponse                  43         9.9%
Transport                               41         9.5%
Unknown                                 39         9.0%
Signaling                               37         8.5%
Metabolism-Glycolysis/Carbohydrates     33         7.6%
Protein Folding/Repair                  28         6.5%
Metabolism-Other                        23         5.3%
Redox                                   16         3.7%
Protein Synthesis                       14         3.2%
Cell Adhesion/Communication             12         2.8%
RNA Binding/Modification                10         2.3%
DNA Binding/Transcription               6          1.4%
Protein Modification/Polymerization     6          1.4%
Cell Growth/Differentiation             3          0.7%
What about quantitative measurements?
[Figure: progression from insult or injury to malignancy; transformation rate = 5-17%; the transition step is marked “??”]
Can we find molecular markers that predict this transition?
(adapted from Dr. Nelson Rhodus, U of M Dental School)
Stable-isotope labeling of proteins for quantitative profiling
State 1 vs. State 2 (figure labels: 37° and 20°)
1. label with “light” (-L) or “heavy” (-H) reagent
2. combine and proteolyze
3. analyze by MS
-L and -H labels are chemically identical, but isotopically different due to incorporation of stable isotopes (e.g. 2H, 15N, 13C…)
Chemically identical but isotopically different peptides ionize with the same efficiency and act as mutual internal standards
relative protein abundance = intensity[light] / intensity[heavy]
[Figure: MS spectrum, intensity vs m/z, showing paired L and H peaks]
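The light-to-heavy ratio can be computed directly from the peak intensities of one peptide pair. The following is a minimal sketch, assuming each form is represented by a short list of isotope peaks whose intensities are summed; summing is one possible choice, and the m/z values and intensities are hypothetical.

```python
def abundance_ratio(light_peaks, heavy_peaks):
    """Relative abundance = summed light intensity / summed heavy intensity.

    Each argument is a list of (m/z, intensity) tuples for one peptide form.
    """
    light = sum(intensity for _, intensity in light_peaks)
    heavy = sum(intensity for _, intensity in heavy_peaks)
    if heavy == 0:
        raise ValueError("heavy signal is zero; ratio is undefined")
    return light / heavy


# Hypothetical isotope peaks for one light/heavy peptide pair.
light_peaks = [(550.3, 800.0), (550.8, 500.0), (551.3, 200.0)]
heavy_peaks = [(554.3, 400.0), (554.8, 250.0), (555.3, 100.0)]

ratio = abundance_ratio(light_peaks, heavy_peaks)
print(f"light/heavy = {ratio:.2f}")
```

Because both forms are measured in the same scan, the ratio does not depend on the absolute signal level of that scan.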
Quantitative analysis of mRNA data
Sample 1
Sample 2
DeRisi et al, 1997, Science 278:680
Automated Quantitative Proteomics
mixture 1 (light) + mixture 2 (heavy) → combine and proteolyze → multi-dimensional separation → mass analysis
• quantify: light/heavy peak pair in the MS scan (m/z 550-580)
• identify: MS/MS of the peptide (example: NH2-EACDPLR-COOH)
Quantitative analysis
[Figure: +TOF MS spectrum, 20 MCA scans from mm_sample.wiff; intensity (counts, max. 274.0) vs m/z (amu), 1914-1934]
Two isotope clusters, labeled Sample 1 (around m/z 1917-1921) and Sample 2 (around m/z 1925-1931)
Labeled peaks: 1916.9909, 1917.9946, 1918.9924, 1920.0007, 1921.0165, 1924.9803, 1926.0240, 1927.0231, 1928.0203, 1929.0322, 1930.0176, 1931.0077
Disease proteomics: androgen-induced effects in prostate cancer
- androgen
+ androgen
306 peptides, 79 differentially expressed (26%)
(d0/d8 > 1.5 or < 0.67)
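The cutoff on this slide (d0/d8 > 1.5 or < 0.67) can be applied to a list of ratios as follows. This is a minimal sketch: the category names and the example ratios are hypothetical, and only the two cutoffs and the 79-of-306 count come from the slide.

```python
def classify_ratio(ratio, upper=1.5, lower=0.67):
    """Label a d0/d8 ratio using the slide's cutoffs."""
    if ratio > upper:
        return "d0 higher"
    if ratio < lower:
        return "d8 higher"
    return "unchanged"


def fraction_changed(ratios, upper=1.5, lower=0.67):
    """Fraction of ratios that fall outside the [lower, upper] band."""
    changed = sum(1 for r in ratios if classify_ratio(r, upper, lower) != "unchanged")
    return changed / len(ratios)


# Hypothetical d0/d8 ratios for a handful of peptides.
toy_ratios = [0.5, 0.9, 1.0, 1.2, 2.4, 3.1, 0.66, 1.5]
for r in toy_ratios:
    print(r, classify_ratio(r))
print(f"fraction changed: {fraction_changed(toy_ratios):.2f}")

# The slide's own numbers: 79 of 306 peptides is about 26%.
print(f"{79 / 306:.0%}")
```

The two cutoffs are approximately reciprocal (1/1.5 ≈ 0.67), so a 1.5-fold change counts in either direction.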
Dealing with the data
Data acquisition
Raw data processing
(Database searching)
Analysis of processed data
(Statistical filtering, quantitative analysis)
Data organization and interpretation
Archiving and databasing
Modeling
(Computational Biology)
Need for better data archives and repositories
http://proteomics.jhu.edu/dl/pathidb.php
Archiving challenges: different data formats
http://sashimi.sourceforge.net/software_glossolalia.html
Computational Biology: Integrating proteomics and genomics data
control of eukaryotic gene expression
[Figure: pathway from DNA to active protein, with a control point at each step]
Nucleus: DNA → (transcriptional control) → primary RNA transcript → (RNA processing control) → mRNA
Cytosol: mRNA → (RNA transport control) → mRNA → (translational control) → protein, or inactive mRNA
Protein → (protein activity control) → active protein or inactive protein
Integrating proteomics and genomics data:
Elucidating gene expression regulatory networks
[Figure A: mRNA versus protein abundance ratios, Gal/Eth; x axis: mRNA abundance ratio (log10); y axis: protein abundance ratio (log10)]
Labeled genes: FBA1, GAL5, PFK1, PDC1, GAL10, ACO1, GAL1, GAL3, GAL7, FBP1, MLS1, CDC19, PCK1, ICL1, TPI1
Griffin TJ et al (2002) Mol Cell Proteomics 1: 323
Post-transcriptionally regulated proteins?
[Figure: protein abundance ratio (log10) vs mRNA abundance ratio (log10)]
Highlighted groups: mitochondrially located proteins, rRNA processing, protein synthesis
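One simple way to look for candidates of this kind is to flag genes whose protein and mRNA ratios disagree on the log10 scale used in the plots. The following is a minimal sketch: the 0.5 log-unit cutoff, the function name, and the gene entries are hypothetical, and this is not the analysis used in Griffin et al.

```python
import math


def discordant_genes(ratios, min_log_gap=0.5):
    """Flag genes whose log10 protein and mRNA ratios differ by more than min_log_gap.

    `ratios` maps gene name -> (mRNA ratio, protein ratio).
    Returns a list of (gene, gap), where gap = log10(protein) - log10(mRNA).
    """
    flagged = []
    for gene, (mrna_ratio, protein_ratio) in ratios.items():
        gap = math.log10(protein_ratio) - math.log10(mrna_ratio)
        if abs(gap) > min_log_gap:
            flagged.append((gene, gap))
    return flagged


# Hypothetical abundance ratios (not measured values).
toy_ratios = {
    "gene_a": (10.0, 10.0),   # mRNA and protein agree
    "gene_b": (1.0, 10.0),    # protein changes, mRNA does not
    "gene_c": (100.0, 1.0),   # mRNA changes, protein does not
}
for gene, gap in discordant_genes(toy_ratios):
    print(f"{gene}: log10(protein) - log10(mRNA) = {gap:+.1f}")
```

Genes on the diagonal of the scatter plot have a gap near zero; the flagged genes are the ones that sit far from it.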
Computational biology: integrating information to assign function
Cytoscape: http://www.cytoscape.org/
Modeling cellular circuitry based on genomic and proteomic data
Is the virtual human on the horizon???
Acknowledgements
Griffin Laboratory
Mikel Roe
Sri Bandhakavi
Hongwei Xie
Clive Nyauncho
U of M Dental School
Dr. Nelson Rhodus
MSI
Patton Fast
University of Minnesota
Funding
Minnesota Medical Foundation
NIH