Transcript GST II: ---Title--- - University of Missouri
Computational Proteomics
Dong Xu
Computer Science Department 109 Engineering Building West E-mail: [email protected]
http://digbio.missouri.edu
573-882-7064 (O)
Outline
Introduction
Protein identification using Mass-spec
Protein interaction and pathway
Summary
Introduction – What is Proteomics?
“The identification, characterization and quantification of all proteins involved in a particular pathway, organelle, cell, tissue, organ or organism that can be studied in concert to provide accurate and comprehensive data about that system.” http://www.inproteomics.com/prodef.html
Scope of proteomics
Graves and Haystead (2002) Microbiol. Mol. Biol. Rev. 66, 39-63
Eukaryote Gene/Protein Expression Control
[Figure: control points of gene/protein expression. Nucleus: DNA, primary RNA transcript (transcriptional control), mRNA (RNA processing control). Nuclear membrane: RNA transport control. Cytosol: mRNA, inactive mRNA (mRNA degradation control), protein (translation control), inactive protein (protein degradation control), modified protein (post-translational control).]
Methods: Mass spec, Microarray
2D PAGE
[Figure: 2D gels for Control and Toxicant samples; axis label: isoelectric point]
Bruno ME et al., Arch Biochem Biophys (2002) 406,153-164
Mass Spectroscopy Techniques
Matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF): mainly for peptide mass mapping
Electrospray MS/MS: more sensitive for protein identification; de novo amino acid sequencing
MS fingerprint for protein
Protein: MPSESSYKVHRPAKSGGS
Trypsin digestion
Peptides: MPSESSYK, VHR, PAK, SGGS
In-silico Digestion
Protein in database, in-silico digestion: MPSESSYK, VHR, PAK, SGGS
Another protein, in-silico digestion: ......
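The digestion step can be sketched in a few lines. This is a minimal illustration, assuming the simplified rule "cleave after every K or R" that reproduces the peptides on the slide; it ignores missed cleavages and the usual exception for a following proline. The function names and the mass table layout are choices made for this sketch.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565  # mass of H2O added to the residue sum of a free peptide


def tryptic_digest(protein: str) -> list[str]:
    """Split a protein after every K or R (simplified trypsin rule)."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        if residue in "KR":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):  # C-terminal peptide without K/R at its end
        peptides.append(protein[start:])
    return peptides


def peptide_mass(peptide: str) -> float:
    """Neutral monoisotopic mass of a peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER


for pep in tryptic_digest("MPSESSYKVHRPAKSGGS"):
    print(pep, round(peptide_mass(pep), 4))
```

The list of peptide masses produced this way is the theoretical fingerprint that is compared against the measured peaks.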
Peak Picking
|PM(a) - PM(b)| < Error
score(TM(a), TM(b_i))
MOWSE Score (1)
Popular scoring scheme used.
Protein score based on frequency of occurrence of peptides.
Frequency table is created for every database used.
MOWSE Score (2)
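[Table: frequency of occurrence of peptides. One axis is the protein mass bin (0 - 10 kDa, 10 - 20 kDa, 20 - 30 kDa, ...), the other is the peptide mass bin (0 - 100 Da, 100 - 200 Da, ...). Example counts shown on the slide: 54, 34, 23, 7, 12, 23, ...]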
MOWSE Score (3)
Bin frequencies are normalized by dividing by maximum number in the column.
Scoring scheme
S_j = 50 / (P_n × H)
where P_n is the product of the n normalized frequencies of the matching peptides and H is the protein molecular weight.
Proteins are ranked by their scores.
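The three MOWSE steps (count, normalize per column, score) can be sketched as follows. This is an illustration under stated assumptions, not the original MOWSE implementation: the bin widths (10 kDa for proteins, 100 Da for peptides) are read off the table on the slide, H is taken in kDa because the slide gives no unit, and all function names are hypothetical.

```python
from collections import defaultdict


def build_frequency_table(database):
    """database: iterable of (protein_mass_da, [peptide_mass_da, ...]).

    Returns {protein_bin: {peptide_bin: normalized_frequency}}, where each
    protein-mass column is divided by its largest count.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for protein_mass, peptide_masses in database:
        column = int(protein_mass // 10_000)   # 10 kDa protein bins
        for mass in peptide_masses:
            counts[column][int(mass // 100)] += 1  # 100 Da peptide bins
    table = {}
    for column, cells in counts.items():
        largest = max(cells.values())
        table[column] = {row: n / largest for row, n in cells.items()}
    return table


def mowse_score(protein_mass, matched_peptide_masses, table):
    """S = 50 / (P_n * H) for one candidate protein from the database."""
    column = table[int(protein_mass // 10_000)]
    p_n = 1.0
    for mass in matched_peptide_masses:
        p_n *= column[int(mass // 100)]
    h = protein_mass / 1000.0  # assumption: H expressed in kDa
    return 50.0 / (p_n * h)


toy_db = [(15_000, [450, 460, 1250]), (12_000, [455, 2300])]
table = build_frequency_table(toy_db)
print(mowse_score(15_000, [450, 1250], table))
```

Matching a peptide from a rarely populated bin lowers P_n and therefore raises the score, which is the point of weighting by frequency.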
Too many matches
For each mass, there are very many peptides in the database with that mass.
Many peaks are missing from the mass spectrum.
There is a lot of noise in the mass spectrum.
For each mass spectrum, many proteins in the database could match it.
From Peptides to Protein
Computational Studies on Confidence Assessment for Protein Identification
We have developed a statistical model that gives a p-value indicating the confidence that a protein identification is true. The model is based on the extreme value distribution of protein identification scores obtained from randomly shuffled MS spectral peaks.
Score: 1268 P-value: 0.025
[Figure: distribution of scores for Swiss-Prot with a large number of input spectra; cumulative distribution of scores]
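One way to turn a null score distribution into a p-value is sketched below. It assumes the Gumbel (type I) form of the extreme value distribution and a method-of-moments fit; both are choices made for this illustration, and the slides do not say which form or fitting procedure the actual model uses. The null scores here are simulated stand-ins for scores obtained from shuffled spectra.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649


def fit_gumbel(null_scores):
    """Method-of-moments estimates (mu, beta) of a Gumbel distribution."""
    beta = statistics.stdev(null_scores) * math.sqrt(6) / math.pi
    mu = statistics.mean(null_scores) - EULER_GAMMA * beta
    return mu, beta


def p_value(score, mu, beta):
    """P(null score >= score) = 1 - exp(-exp(-(score - mu) / beta))."""
    z = (score - mu) / beta
    if z < -700:  # avoid overflow in exp; the p-value is 1 to machine precision
        return 1.0
    return -math.expm1(-math.exp(-z))


# Simulated null: each value is the best of 50 random scores (hypothetical data).
rng = random.Random(0)
null_scores = [max(rng.gauss(500, 100) for _ in range(50)) for _ in range(1000)]
mu, beta = fit_gumbel(null_scores)
print(p_value(800.0, mu, beta))
```

A small p-value means that a score this high is rarely produced by shuffled peaks, so the identification is unlikely to be a chance match.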
Tandem Mass (MS/MS) Spectrum
Protein (tryptic cleavage sites K and R set apart): M R IMV R TL R GD R VALDVDGATTTVAQV K GMVMA R E R
Peptide VALDVDGATTTVAQVK: b-ion VALDVD, y-ion GATTTVAQVK
Assumption: The peptide will break between every two amino acids, providing a unique sequence pattern.
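Under the assumption that the peptide breaks between every pair of adjacent residues, the expected b- and y-ion ladders follow directly from the residue masses. The sketch below computes singly charged ions only; the function name and return layout are choices made for this illustration.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def fragment_ions(peptide: str):
    """Return (b, y) lists of singly charged m/z values.

    b[i-1] is the b_i ion (first i residues); y[i-1] is the y_i ion
    (last i residues). Both lists have len(peptide) - 1 entries.
    """
    masses = [MONO[aa] for aa in peptide]
    b, y = [], []
    running = 0.0
    for m in masses[:-1]:            # N-terminal prefixes
        running += m
        b.append(running + PROTON)
    running = 0.0
    for m in reversed(masses[1:]):   # C-terminal suffixes
        running += m
        y.append(running + WATER + PROTON)
    return b, y


b_ions, y_ions = fragment_ions("VALDVDGATTTVAQVK")
print([round(mz, 3) for mz in b_ions])
print([round(mz, 3) for mz in y_ions])
```

For every cleavage position, the b ion and its complementary y ion add up to the peptide mass plus two protons, which is a handy sanity check on a ladder.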
MS/MS Fragmentation Pattern
[Figure: backbone fragmentation of a four-residue peptide (R1 to R4). N-terminal fragment ions: a1, b1, c1, a2, b2, c2, a3, b3, c3. C-terminal fragment ions: x1, y1, z1, x2, y2, z2, x3, y3, z3.]
A real MS/MS spectrum with good quality
LGSSEVEQVQLVVDGVK
SEQUEST: Preliminary Score
MKFLILLFNILCLFPVLAADNHGVGPQGAS ...
While parsing through the database, every peptide that matches the input mass within a user-specified mass tolerance (e.g., ±1.0 amu) gets a preliminary score (Sp):
Sp = Σ(i_m) × n_m × (1 + β) × (1 + ρ) / n_t
Σ(i_m) = sum of matched intensities
n_m = number of matched fragment ions
n_t = number of total fragment ions
β = fragment ion continuity factor
ρ = immonium ion factor
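The preliminary score can be evaluated directly from its definition. The sketch below is not SEQUEST code: the peak-matching rule (strongest observed peak within the tolerance) and the function names are assumptions for illustration, and the continuity and immonium factors are simply passed in because the slide does not say how they are computed.

```python
def match_peaks(theoretical_mz, observed, tolerance=1.0):
    """Intensities of observed peaks that match theoretical fragment ions.

    observed: list of (mz, intensity). For each theoretical ion, the
    strongest observed peak within the tolerance counts as its match.
    """
    matched = []
    for t in theoretical_mz:
        hits = [inten for mz, inten in observed if abs(mz - t) <= tolerance]
        if hits:
            matched.append(max(hits))
    return matched


def preliminary_score(matched_intensities, n_total, beta, rho):
    """Sp = sum(i_m) * n_m * (1 + beta) * (1 + rho) / n_t."""
    n_matched = len(matched_intensities)
    return (sum(matched_intensities) * n_matched
            * (1 + beta) * (1 + rho) / n_total)


theoretical = [100.0, 200.0, 300.0, 400.0]           # hypothetical fragment m/z
observed = [(100.2, 10.0), (199.5, 20.0), (350.0, 5.0)]
matched = match_peaks(theoretical, observed)
print(preliminary_score(matched, n_total=len(theoretical), beta=0.0, rho=0.0))
```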
X-Correlation Score
• Sequence database has been parsed.
• Candidate peptides for correlation analysis are the top 500 preliminary scoring peptides.
• A theoretical spectrum is constructed for each candidate peptide and compared against the input spectrum via correlation analysis.
Discrete correlation function: R[τ] = Σ_t x[t] y[t+τ]
Calculated via Fourier transforms: R[τ] <=> X(f) Y*(f)
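The discrete correlation function can be evaluated by direct summation for small inputs. The sketch below does exactly that instead of using Fourier transforms, which give the same values faster on long spectra. It assumes both spectra are already binned into equal-length intensity vectors and treats positions outside the vectors as zero; the function name is hypothetical.

```python
def cross_correlation(x, y, max_lag):
    """R[tau] = sum over t of x[t] * y[t + tau], for tau in [-max_lag, max_lag].

    x and y are binned intensity vectors; indices outside y contribute zero.
    Returns a dict mapping each lag tau to R[tau].
    """
    result = {}
    for tau in range(-max_lag, max_lag + 1):
        total = 0.0
        for t, x_t in enumerate(x):
            s = t + tau
            if 0 <= s < len(y):
                total += x_t * y[s]
        result[tau] = total
    return result


theoretical = [0, 1, 0, 0, 1, 0, 1, 0]   # hypothetical binned spectra
experimental = [0, 1, 0, 0, 1, 0, 0, 1]
r = cross_correlation(theoretical, experimental, max_lag=3)
print(r)
```

A peptide whose theoretical spectrum lines up with the experimental one shows a pronounced value at lag 0 compared with the other lags.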
Calculation of X-Correlation Score
[Figure: theoretical spectrum and experimental spectrum, each plotted as intensity against m/z, compared for the cross-correlation calculation]
De Novo Sequencing Using Spectrum Graph Approach
Each node of the graph represents a peak in the spectrum.
Two nodes have an edge if and only if the masses of the two corresponding peaks differ by the mass of an amino acid.
The path that connects the two ends corresponds to a feasible solution.
Multiple paths on the spectral ladder
From Graph to Sequence
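A small spectrum graph shows why several paths can exist. The sketch below assumes an idealized ladder of prefix masses starting at 0 (no noise, no missing peaks, no mixing of ion types) and a tolerance of 0.01 Da; the function names are hypothetical. Isoleucine is left out because it has the same mass as leucine.

```python
# Monoisotopic residue masses (Da); "L" stands for both leucine and isoleucine.
RESIDUES = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def spectrum_graph(peaks, tolerance=0.01):
    """Nodes are peaks; an edge i -> j is labeled with the amino acid whose
    mass equals peaks[j] - peaks[i] within the tolerance."""
    peaks = sorted(peaks)
    edges = {i: [] for i in range(len(peaks))}
    for i, low in enumerate(peaks):
        for j in range(i + 1, len(peaks)):
            diff = peaks[j] - low
            for aa, mass in RESIDUES.items():
                if abs(diff - mass) <= tolerance:
                    edges[i].append((j, aa))
    return peaks, edges


def candidate_sequences(peaks, tolerance=0.01):
    """All amino acid strings spelled by paths from the first to the last peak."""
    peaks, edges = spectrum_graph(peaks, tolerance)
    goal = len(peaks) - 1
    found = []

    def walk(node, prefix):
        if node == goal:
            found.append(prefix)
            return
        for nxt, aa in edges[node]:
            walk(nxt, prefix + aa)

    walk(0, "")
    return found


# Prefix masses of the tripeptide G-A-S.
print(candidate_sequences([0.0, 57.02146, 128.05857, 215.0906]))
```

Because glycine plus alanine has the same mass as glutamine, the ladder above supports both G-A-S and Q-S, which is one source of multiple paths.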
Protein Complex
Nucleosome
Protein-Protein Interactions
Protein complexes, molecular machines
Protein interaction cascade (signal transduction)
Transient vs. stable interaction
Binary interaction vs. complex
[Figure: a bait protein and its prey proteins, with nodes labeled a, b, d, e, f, h, k, m]
Genetic vs. Physical Interaction
[Figure labels: signal transduction, complex system, physical interaction; transcription factor, regulatory network, genetic interaction, expressed gene]
Experimental methods
Yeast Two-hybrid screens
Mass Spectrometry
Immunoprecipitation
Affinity binding
Antibody blockage
Protein chips
Rosetta stone approach for predicting protein interaction
• protein A is homologous to subsequence from protein C
• protein B is homologous to subsequence from protein C
• subsequences from A and B are NOT homologous to each other
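The three conditions can be checked mechanically once homology results are available. The sketch below assumes the alignments have already been computed by some external search tool and are supplied as regions on the fused protein C. Requiring the two regions not to overlap is an assumption added here as a stand-in for "different subsequences of C"; the data layout and names are hypothetical.

```python
from itertools import combinations


def rosetta_stone_pairs(hits, homologous_pairs):
    """Predict interacting pairs (A, B) linked by a fused protein C.

    hits: {fusion_protein: {query_protein: (start, end)}} giving the region
          of the fusion protein that each query protein is homologous to.
    homologous_pairs: set of frozenset({a, b}) for proteins that are
          homologous to each other (these pairs are excluded).
    Returns a set of (a, b, fusion_protein) tuples with a < b.
    """
    predicted = set()
    for fusion, regions in hits.items():
        for a, b in combinations(sorted(regions), 2):
            (a_start, a_end), (b_start, b_end) = regions[a], regions[b]
            overlap = a_start <= b_end and b_start <= a_end
            if not overlap and frozenset((a, b)) not in homologous_pairs:
                predicted.add((a, b, fusion))
    return predicted


hits = {"C": {"A": (1, 100), "B": (120, 250), "D": (50, 110)}}
print(rosetta_stone_pairs(hits, homologous_pairs=set()))
```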
Online Databases
Database: DIP, BIND, MIPS, MINT, BRITE, PathCalling, Interact, PIMRider, GRID
URL (in slide order):
http://dip.doe-mbi.ucla.edu
http://binddb.org
http://mips.gsf.de/proj/yeast/CYGD/
http://cbm.bio.uniroma2.it/mint/
http://www.genome.ad.jp/brite/
http://genome.c.kanazawa-u.ac.jp/Y2H/
http://www.bioinf.man.ac.uk/resources/interactpr.shtml
http://pim.hybrigenics.com/
http://biodata.mshri.on.ca/grid/
Database size (Binary and Complex columns, values in slide order): 18,000; 6171; 11,200; 3786; 5506; 957; 851; 1050; 782; 1000; 200; 1400; 14,318
Yeast Protein Interaction Network
Deletion phenotype:
Red = lethal
Green = non-lethal
Orange = slow growth
Yellow = unknown
An example of a scale-free network
Most nodes have few connections
A small number of nodes (network hubs) are connected to a large number of other nodes
PPI Viewer
o Protein-Protein-Interaction and Complex Viewer
o http://mips.gsf.de/proj/yeast/CYGD/interaction/
o Search ste20 (YHL007c, STE20, Ste20p, ste20Δ)
Binary interaction:
cdc28 >genetic< ste20
Bem1p >physical< Ste20p
Ste20p >physical< Prp20p
...
Complex data (Bait: Rad1p):
Rad1p, Car2p, Dun1p, Far1p, Gpd1p, Gpd2p, Msi1p, Pdc6p, Sec6p, Sen1p, Ste20p, Ubi4p, YDR324c, YGR086c, YHR033w, YLR368w, YNL116w, YPL004c
Protein Interaction Graph
http://portal.curagen.com/extpc/com.curagen.portal.servlet.Yeast
Predict cellular function for hypothetical protein
Function inference based on neighbors
Consensus approach
Markov random field
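The consensus approach can be illustrated with a simple majority vote over annotated interaction partners. The sketch below is an illustration only and does not implement the Markov random field method; the protein names, function labels, and the use of the vote fraction as a rough confidence are assumptions made here.

```python
from collections import Counter


def predict_function(protein, interactions, annotations):
    """Majority vote over the annotated neighbors of a protein.

    interactions: list of (p, q) pairs (undirected).
    annotations: {protein: function} for proteins of known function.
    Returns (function, fraction of annotated neighbors supporting it),
    or (None, 0.0) when no neighbor is annotated.
    """
    neighbors = set()
    for p, q in interactions:
        if p == protein:
            neighbors.add(q)
        elif q == protein:
            neighbors.add(p)
    votes = Counter(annotations[n] for n in neighbors if n in annotations)
    if not votes:
        return None, 0.0
    function, count = votes.most_common(1)[0]
    return function, count / sum(votes.values())


# Hypothetical toy network around an unannotated protein X.
interactions = [("X", "A"), ("B", "X"), ("X", "C"), ("A", "B")]
annotations = {"A": "transport", "B": "transport", "C": "signaling"}
print(predict_function("X", interactions, annotations))
```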
Overview of Signal Transduction
[Figure: stimuli (signals) acting on a cell, with responses: secretion, motility, metabolism, genetic transfer, cell-cell communication, gene transcription, sporulation/apoptosis]
Essential for understanding disease and designing drugs
Problem Formulation
1. Define cascade proteins
2. Find interaction path
[Figure labels: signal, sensor, Protein-1, Protein-2, Protein-3, transcription factor, Gene-1, Gene-2, nucleus]
Cascade of (physical) protein interaction chains
Finding a plausible signal cascade path
Short path
Biologically meaningful (function, subcellular location)
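The "short path" criterion alone corresponds to a shortest-path search on the physical interaction graph. The sketch below uses breadth-first search on a toy network with hypothetical protein names; it does not check whether the path is biologically meaningful, which would need function and subcellular location data on top.

```python
from collections import deque


def shortest_cascade(interactions, sensor, transcription_factor):
    """Shortest chain of physical interactions from sensor to transcription
    factor, found by breadth-first search. Returns a list of proteins, or
    None when the two are not connected."""
    graph = {}
    for p, q in interactions:                 # undirected interaction graph
        graph.setdefault(p, set()).add(q)
        graph.setdefault(q, set()).add(p)
    previous = {sensor: None}
    queue = deque([sensor])
    while queue:
        node = queue.popleft()
        if node == transcription_factor:
            path = []
            while node is not None:           # walk back to the sensor
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in sorted(graph.get(node, ())):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None


toy = [("S", "P1"), ("P1", "P2"), ("P2", "TF"),
       ("S", "P3"), ("P3", "P4"), ("P4", "P5"), ("P5", "TF")]
print(shortest_cascade(toy, "S", "TF"))
```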
Pathway Construction for Amino Acid Transport in Yeast
[Figure: working model of the pathway. Conditions: poor nitrogen, rich amino acid. Transporter genes: GAP1... (general), BAP2... (specific), PTR2 (peptide transport). Proteins shown: Ubc4p, Ubc2p, Ssy5p, Ptr3p, Ptr1p, Cup9p, Aut10p, Tup1p, Ssy1p, YPL158C, Cln1p, Gcn4p, Rpn6p, Mai1p, Ssn6p, Cdc28p, Jsn1p, Pre1p, Clb3p, Rtg3p, Mig1p, Sho1p, Gln3p, Vma22p, Dal80p, Cns1p, Stp1p, Gap1p, Bap2p, Tat2p, Ptr2p. Other labels: amino acid synthesis, energy metabolism, glucose metabolism, dipeptide, transcriptional control. Edge legend: two-hybrid; complex from mass spec; coprecipitation or pull-down; other biochemical methods.]
Reading Assignments
Suggested reading:
http://www.astbury.leeds.ac.uk/Facil/MStut/mstutorial.htm
Yu Chen and Dong Xu. Computational Analyses of High Throughput Protein-Protein Interaction Data. Current Protein and Peptide Science. 4:159-181. 2003.
Optional reading:
www.bio.davidson.edu/courses/genomics/proteomics.html
Optional Assignment (1)
1. Make a yeast protein-interaction network connecting Rho2p, Rom2p, Ste20p, and Pfy1p. Use binary physical protein-protein interactions to connect all the edges. Try to make the network as simple as possible (i.e., involving few proteins).
2. Can you predict the function of the yeast gene YLR269C based on high-throughput protein-protein interaction data? How confident are you in this prediction?
Optional Assignment (2)
3. A protein complex was identified containing Rpn5p, Rri1p, YDR179Cp, YIL071Cp, YMR025Wp, YOL117Wp. Can you find the bait of this complex? How many possible binary interactions in this complex can be verified by yeast two-hybrid data?
4. It is known that Cup9p is degraded by the 26S proteasome. Identify as many proteins in the yeast 26S proteasome as possible. Find a physical interaction network between proteins in the 26S proteasome and Cup9p.