GST II: ---Title--- - University of Missouri


Computational Proteomics

Dong Xu

Computer Science Department, 109 Engineering Building West

E-mail: [email protected]

http://digbio.missouri.edu

573-882-7064 (O)

Outline

• Introduction
• Protein identification using Mass-spec
• Protein interaction and pathway
• Summary

Introduction – What is Proteomics?

“The identification, characterization and quantification of all proteins involved in a particular pathway, organelle, cell, tissue, organ or organism that can be studied in concert to provide accurate and comprehensive data about that system.” http://www.inproteomics.com/prodef.html

Scope of proteomics

Graves and Haystead (2002) Microbiol. Mol. Biol. Rev. 66, 39-63


Eukaryotic Gene/Protein Expression Control

[Figure: control points of gene expression, from the nucleus across the nuclear membrane to the cytosol. Labeled steps: transcriptional control (DNA to primary RNA transcript), RNA processing control (to mRNA), RNA transport control, mRNA degradation control (inactive mRNA), translation control (protein), protein degradation control, and post-translational control (modified protein, inactive protein).]

Methods: mass spec, microarray

2D PAGE

[Figure: 2D PAGE gels for a control sample and a toxicant-treated sample; one axis is the isoelectric point.]

Bruno ME et al., Arch Biochem Biophys (2002) 406,153-164

Mass Spectrometry Techniques

• Matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF): mainly for peptide mass mapping
• Electrospray MS/MS: more sensitive for protein identification; de novo amino acid sequencing

MS fingerprint for protein

Protein: MPSESSYKVHRPAKSGGS

Trypsin digestion yields the peptides: MPSESSYK, VHR, PAK, SGGS

In-silico Digestion

[Figure: the protein sequence is digested in silico into the peptides MPSESSYK, VHR, PAK, SGGS; every other protein in the database is digested in silico in the same way.]
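The digestion step can be sketched in a few lines of Python. This is a minimal illustration, not the tool used in the lecture: the names `digest` and `peptide_mass` are hypothetical, the masses are standard monoisotopic residue masses, and the cleavage rule is the simple one shown on the slide (cut after every K or R). Real trypsin usually does not cut when the next residue is P; the optional flag models that and is an assumption beyond the slide.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added per free peptide


def digest(protein, skip_before_proline=False):
    """Cut after every K or R; optionally keep K/R-P bonds intact."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        if aa in "KR":
            next_is_p = i + 1 < len(protein) and protein[i + 1] == "P"
            if skip_before_proline and next_is_p:
                continue
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal peptide
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of a peptide in Da."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


protein = "MPSESSYKVHRPAKSGGS"
for peptide in digest(protein):
    print(peptide, round(peptide_mass(peptide), 4))
```

With the default rule the slide's four peptides are reproduced; with `skip_before_proline=True` the R-P bond is kept and VHR and PAK stay joined as VHRPAK.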

Peak Picking

• |PM(a) − PM(b)| < Error
• score(TM(a), TM(b_i))

MOWSE Score (1)

• A popular scoring scheme.
• The protein score is based on the frequency of occurrence of peptides.
• A frequency table is created for every database used.

MOWSE Score (2)

Frequency table: peptide counts by peptide mass bin (rows) and protein mass bin (columns).

Peptide \ Protein    0–10 kDa    10–20 kDa    20–30 kDa    …
0–100 Da             54          23           7            …
100–200 Da           34          12           23           …
…                    …           …            …            …

MOWSE Score (3)

• Bin frequencies are normalized by dividing by the maximum number in the column.
• Scoring scheme:

  S_j = 50 / (P_n × H)

  where P_n is the product of the n normalized frequencies of the matching peptides and H is the protein molecular weight.
• Proteins are ranked by their scores.
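The normalization and scoring steps can be sketched as follows. This is a toy illustration under stated assumptions, not the original MOWSE code: the function names are hypothetical, the bin widths (100 Da for peptides, 10 kDa for proteins) follow the frequency table slide, and H is taken in kDa because the slide does not state its unit.

```python
from collections import defaultdict
from math import prod


def build_frequency_table(database):
    """database: list of (protein_mass_da, [peptide_mass_da, ...]) entries.

    Returns table[protein_bin][peptide_bin] = frequency normalized by the
    largest count in that protein-mass column.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for protein_mass, peptide_masses in database:
        column = int(protein_mass // 10_000)      # 10 kDa protein bins
        for mass in peptide_masses:
            counts[column][int(mass // 100)] += 1  # 100 Da peptide bins
    table = {}
    for column, rows in counts.items():
        top = max(rows.values())
        table[column] = {row: n / top for row, n in rows.items()}
    return table


def mowse_score(protein_mass, matched_peptide_masses, table):
    """S = 50 / (P_n * H), with H in kDa (an assumption, see lead-in)."""
    column = int(protein_mass // 10_000)
    p_n = prod(table[column][int(m // 100)] for m in matched_peptide_masses)
    return 50.0 / (p_n * protein_mass / 1000.0)


# Hypothetical two-protein database, masses in Da.
toy_db = [(15000.0, [150.0, 160.0, 250.0]), (12000.0, [155.0, 950.0])]
table = build_frequency_table(toy_db)
print(mowse_score(15000.0, [150.0, 250.0], table))
```

Matches to rare peptide masses have small normalized frequencies, so they shrink P_n and raise the score; that is why the frequency table has to be rebuilt for every database.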

Too many matches

• For each mass, many peptides in the database have that mass.
• Many peaks are missing from the MS.
• There is a lot of noise in the MS.
• For each MS, many proteins in the database could match it.

From Peptides to Protein

Computational Studies on Confidence Assessment for Protein Identification

We have developed a statistical model that gives a p-value indicating the confidence that a protein identification is true. The model is based on the extreme value distribution of the protein identification scores obtained from randomly shuffled MS spectral peaks.

Score: 1268 P-value: 0.025

[Figures: distribution of scores for Swiss-Prot with a large number of input spectra; cumulative distribution of scores.]
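One way to turn such a null distribution into a p-value is sketched below. It is only an illustration of the idea: the slides do not say how the extreme value distribution was fitted, so the method-of-moments Gumbel fit, the function names, and the synthetic null scores are all assumptions.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649


def fit_gumbel(null_scores):
    """Method-of-moments fit of a Gumbel (type I extreme value) distribution."""
    beta = statistics.stdev(null_scores) * math.sqrt(6) / math.pi  # scale
    mu = statistics.mean(null_scores) - EULER_GAMMA * beta         # location
    return mu, beta


def p_value(score, mu, beta):
    """P(null score >= observed score) = 1 - exp(-exp(-(score - mu) / beta))."""
    z = (score - mu) / beta
    return -math.expm1(-math.exp(-z))


# Synthetic stand-in for scores obtained from shuffled spectra:
# each null score is the best of 50 random candidate scores.
random.seed(0)
null_scores = [max(random.gauss(100, 15) for _ in range(50)) for _ in range(2000)]
mu, beta = fit_gumbel(null_scores)
print(p_value(160.0, mu, beta))
```

In practice the null scores would come from searching shuffled spectra against the same database, so that the fitted distribution reflects the database size and scoring scheme in use.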

Tandem Mass (MS/MS) Spectrum

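[Figure: the protein MRIMVRTLRGDRVALDVDGATTTVAQVKGMVMARER with its tryptic cleavage sites (after R and K) marked. The tryptic peptide VALDVDGATTTVAQVK is shown breaking into a b-ion (VALDVD) and a y-ion (GATTTVAQVK).]

Assumption: the peptide breaks between every two adjacent amino acids, providing a unique sequence pattern.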

MS/MS Fragmentation Pattern

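[Figure: backbone of a four-residue peptide, H2N-CHR1-CO-NH-CHR2-CO-NH-CHR3-CO-NH-CHR4-COOH, with the fragmentation sites labeled. N-terminal fragment ions: a1, b1, c1, a2, b2, c2, a3, b3, c3. C-terminal fragment ions: x3, y3, z3, x2, y2, z2, x1, y1, z1.]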
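The b- and y-ion ladders implied by the fragmentation assumption can be computed directly from the residue masses. The sketch below assumes singly charged ions and monoisotopic masses; `fragment_ions` is a hypothetical helper name, not part of any tool mentioned in the slides.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def fragment_ions(peptide):
    """Return (b, y): m/z lists of singly charged b_i and y_i, i = 1..n-1.

    b_i holds the first i residues; y_i holds the last i residues
    plus one water.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    n = len(masses)
    b = [sum(masses[:i]) + PROTON for i in range(1, n)]
    y = [sum(masses[-i:]) + WATER + PROTON for i in range(1, n)]
    return b, y


b_ions, y_ions = fragment_ions("VALDVDGATTTVAQVK")
for i, (b_mz, y_mz) in enumerate(zip(b_ions, y_ions), start=1):
    print(f"b{i} {b_mz:.3f}   y{i} {y_mz:.3f}")
```

Complementary ions b_i and y_(n-i) come from the same bond, so their m/z values always add up to the neutral peptide mass plus two protons; that relation is a handy sanity check on any ion table.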

A real MS/MS spectrum with good quality

LGSSEVEQVQLVVDGVK

SEQUEST: Preliminary Score

Database sequence (excerpt): MKFLILLFNILCLFPVLAADNHGVGPQGAS ...

While parsing through the database, every peptide that matches the input mass within a user-specified mass tolerance (e.g. ±1.0 amu) gets a preliminary score (Sp):

Sp = S(i_m) × n_m × (1 + b) × (1 + r) / n_t

where

S(i_m) = sum of matched intensities
n_m = number of matched fragment ions
n_t = number of total fragment ions
b = fragment ion continuity factor
r = immonium ion factor
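The Sp formula translates almost literally into code. The sketch below is not SEQUEST itself: `preliminary_score` is a hypothetical name, and the continuity and immonium factors are passed in as plain numbers because the slide does not say how they are derived.

```python
def preliminary_score(matched_intensities, n_total, continuity=0.0, immonium=0.0):
    """Sp = S(i_m) * n_m * (1 + b) * (1 + r) / n_t.

    matched_intensities: intensities of the matched fragment ions
                         (their sum is S(i_m), their count is n_m)
    n_total:             number of predicted fragment ions, n_t
    continuity:          fragment ion continuity factor b
    immonium:            immonium ion factor r
    """
    n_matched = len(matched_intensities)
    return (sum(matched_intensities) * n_matched
            * (1.0 + continuity) * (1.0 + immonium) / n_total)


# Hypothetical candidate: 3 of 6 predicted fragment ions were matched.
print(preliminary_score([10.0, 20.0, 30.0], n_total=6, continuity=0.15))
```

Because the score multiplies the summed intensity by the fraction of matched ions, a candidate is rewarded both for matching intense peaks and for matching many of its predicted fragments.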

X-Correlation Score

• Sequence database has been parsed.

• Candidate peptides for correlation analysis are the top 500 preliminary scoring peptides.

• A theoretical spectrum is constructed for each candidate peptide and compared against the input spectrum via correlation analysis.

Discrete correlation function: R[τ] = Σ_t x[t] y[t + τ]

Calculated via Fourier transforms: R[τ] <=> X(f) Y*(f)
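The discrete correlation function can be evaluated directly, without Fourier transforms, for small binned spectra. In the sketch below the helper names, the 1 m/z bin width, and the background correction (R[0] minus the mean of R[τ] over a window of shifts, with ±75 bins as default) are assumptions for illustration; the slides give only the correlation function itself.

```python
def bin_spectrum(peaks, n_bins, width=1.0):
    """Turn (m/z, intensity) pairs into a list of binned intensities."""
    bins = [0.0] * n_bins
    for mz, intensity in peaks:
        i = int(mz / width)
        if 0 <= i < n_bins:
            bins[i] += intensity
    return bins


def cross_correlation(x, y, tau):
    """R[tau] = sum over t of x[t] * y[t + tau]; terms outside y are zero."""
    total = 0.0
    for t, x_t in enumerate(x):
        j = t + tau
        if 0 <= j < len(y):
            total += x_t * y[j]
    return total


def xcorr(experimental, theoretical, max_shift=75):
    """R[0] minus the mean of R[tau] for tau in [-max_shift, max_shift]."""
    shifts = range(-max_shift, max_shift + 1)
    background = sum(cross_correlation(experimental, theoretical, s)
                     for s in shifts) / len(shifts)
    return cross_correlation(experimental, theoretical, 0) - background


# Toy binned spectra (hypothetical intensities).
experimental = [0.0, 1.0, 0.0, 2.0]
theoretical = [0.0, 1.0, 0.0, 2.0]
print(xcorr(experimental, theoretical, max_shift=2))
```

The direct sum costs one pass over the spectrum per shift; the Fourier route mentioned on the slide computes all shifts at once, which matters when hundreds of candidate peptides are scored per spectrum.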

Calculation of X-Correlation Score

[Figure: a theoretical spectrum and an experimental spectrum, each with m/z on the horizontal axis. Peak labels extracted with the theoretical spectrum: 88.1, 185.2, 361.5, 490.6, 561.7, 692.9, 806.0, 893.1, 1050.2, 1226.4. Peak labels extracted with the experimental spectrum: 185.3, 255.7, 360.9, 403.0, 519.1, 662.3, 805.5, 892.6, 1007.4, 1155.5, 1226.8, 1324.8.]

De Novo Sequencing Using Spectrum Graph Approach

• Each node of the graph represents a peak in the spectrum.
• Two nodes are joined by an edge if and only if the two corresponding peaks differ by the mass of an amino acid.
• A path that connects the two ends of the graph corresponds to a feasible solution.

Multiple paths on the spectral ladder

From Graph to Sequence
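A toy version of the spectrum graph makes the step from graph to sequence concrete. The sketch below is an illustration under simplifying assumptions, not a production de novo sequencer: peaks are treated as an ideal ladder of prefix masses starting at 0, the tolerance of 0.02 Da is arbitrary, I is folded into L because the two have identical masses, and the function names are hypothetical.

```python
# Monoisotopic residue masses (Da); "L" stands for both L and I.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def build_graph(peaks, tol=0.02):
    """Nodes are sorted peaks; an edge i -> j is labeled with every residue
    whose mass equals peaks[j] - peaks[i] within tol."""
    nodes = sorted(peaks)
    edges = {i: [] for i in range(len(nodes))}
    for i, low in enumerate(nodes):
        for j in range(i + 1, len(nodes)):
            diff = nodes[j] - low
            for aa, mass in RESIDUE_MASS.items():
                if abs(diff - mass) <= tol:
                    edges[i].append((j, aa))
    return nodes, edges


def sequences(edges, start, end):
    """All residue strings spelled by paths from node start to node end."""
    if start == end:
        return [""]
    found = []
    for nxt, aa in edges[start]:
        for tail in sequences(edges, nxt, end):
            found.append(aa + tail)
    return found


# Ideal prefix-mass ladder of the tripeptide GAS (hypothetical input).
nodes, edges = build_graph([0.0, 57.02146, 128.05857, 215.09060])
print(sequences(edges, 0, len(nodes) - 1))
```

On this ladder two paths connect the ends, GAS and QS, because the masses of G plus A and of Q agree to within the tolerance. This is a small instance of the multiple-paths problem: ambiguous mass differences and missing peaks both create alternative routes through the graph.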


Protein Complex

Nucleosome

Protein-Protein Interactions

Protein complexes, molecular machines

Protein interaction cascade (signal transduction)

Transient vs. stable interaction

Binary interaction vs. complex

[Figure: a bait protein and its prey proteins; nodes labeled a, b, d, e, f, h, k, m.]

Genetic vs. Physical Interaction

[Figure: physical interaction (signal transduction, complex system) contrasted with genetic interaction (regulatory network linking a transcription factor to an expressed gene).]

Experimental methods

Yeast Two-hybrid screens

Mass Spectrometry

Immunoprecipitation

Affinity binding

Antibody blockage

Protein chips

Rosetta stone approach for predicting protein interaction

Proteins A and B are predicted to interact when:

• protein A is homologous to a subsequence of protein C
• protein B is homologous to a subsequence of protein C
• the subsequences matched by A and B are NOT homologous to each other
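The three conditions can be checked mechanically once homology hits are available. The sketch below assumes hits are already computed (for example by a sequence search) and given as tuples; the tuple layout, the function name, and the use of non-overlapping match regions as a proxy for distinct subsequences are assumptions made for illustration.

```python
from itertools import combinations


def rosetta_stone_pairs(hits, homologous_pairs=frozenset()):
    """Predict interacting pairs from hits against composite proteins.

    hits:             (query, composite, start, end) tuples; start and end
                      are the matched region on the composite protein
    homologous_pairs: frozensets {A, B} of queries known to be homologous
                      to each other (these are excluded)
    Returns a set of (A, B, composite) predictions with A < B.
    """
    predictions = set()
    for (a, c1, s1, e1), (b, c2, s2, e2) in combinations(hits, 2):
        if a == b or c1 != c2:
            continue
        overlap = s1 <= e2 and s2 <= e1          # matched regions intersect
        related = frozenset((a, b)) in homologous_pairs
        if not overlap and not related:
            predictions.add((min(a, b), max(a, b), c1))
    return predictions


# Hypothetical hits on a composite protein C.
hits = [("A", "C", 1, 120), ("B", "C", 150, 300), ("D", "C", 100, 200)]
print(rosetta_stone_pairs(hits))
```

Only A and B are reported here: D overlaps both of them on C, so it could simply be a homolog of either region rather than evidence of a fused pair.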

Online Databases

Databases: DIP, BIND, MIPS, MINT, BRITE, PathCalling, Interact, PIMRider, GRID

URLs (in the order extracted):

http://dip.doe-mbi.ucla.edu
http://binddb.org
http://mips.gsf.de/proj/yeast/CYGD/
http://cbm.bio.uniroma2.it/mint/
http://www.genome.ad.jp/brite/
http://genome.c.kanazawa-u.ac.jp/Y2H/
http://www.bioinf.man.ac.uk/resources/interactpr.shtml
http://pim.hybrigenics.com/
http://biodata.mshri.on.ca/grid/

Database size (columns: Binary, Complex; values in the order extracted): 18,000 6171 11,200 3786 5506 957 851 1050 782 1000 200 1400 14,318

Yeast Protein Interaction Network

Deletion phenotype: red = lethal, green = non-lethal, orange = slow growth, yellow = unknown

An example of a scale-free network:

• Most nodes have few connections
• A small number of nodes (network hubs) are connected to a large number of other nodes
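Both properties can be read off the degree distribution of the interaction graph. The sketch below uses a tiny hypothetical edge list, not yeast data, and the function names are illustrative.

```python
from collections import Counter


def degrees(edges):
    """Number of interaction partners per protein in an undirected edge list."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return degree


def degree_distribution(edges):
    """Map each degree k to the number of proteins that have degree k."""
    return Counter(degrees(edges).values())


# Hypothetical network: one hub and several sparsely connected proteins.
edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("c", "d")]
print(degrees(edges).most_common(1))  # the best-connected protein
print(sorted(degree_distribution(edges).items()))
```

For a scale-free network, the counts in the degree distribution fall off roughly as a power of k, which shows up as an approximately straight line when both axes are plotted on a logarithmic scale.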

PPI Viewer

o Protein-Protein-Interaction and Complex Viewer
o http://mips.gsf.de/proj/yeast/CYGD/interaction/
o Search ste20 (YHL007c, STE20, Ste20p, ste20Δ)

Binary interaction:

cdc28 >genetic< ste20
Bem1p >physical< Ste20p
Ste20p >physical< Prp20p
...

Complex data (Bait: Rad1p):

Rad1p, Car2p, Dun1p, Far1p, Gpd1p, Gpd2p, Msi1p, Pdc6p, Sec6p, Sen1p, Ste20p, Ubi4p, YDR324c, YGR086c, YHR033w, YLR368w, YNL116w, YPL004c

Protein Interaction Graph

http://portal.curagen.com/extpc/com.curagen.portal.servlet.Yeast

Predict cellular function for hypothetical protein

Function inference based on neighbors

Consensus approach

Markov random field
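The consensus approach amounts to a vote among annotated interaction partners. The sketch below illustrates only that idea (not the Markov random field model); the function name and the toy proteins and annotations are hypothetical.

```python
from collections import Counter


def predict_function(protein, interactions, annotations):
    """Rank functions by how often they occur among annotated neighbors.

    interactions: iterable of (protein_a, protein_b) pairs
    annotations:  dict mapping a protein to a list of known functions
    Returns [(function, fraction_of_votes), ...], best candidate first.
    """
    votes = Counter()
    for a, b in interactions:
        if protein in (a, b):
            partner = b if a == protein else a
            for function in annotations.get(partner, ()):
                votes[function] += 1
    total = sum(votes.values())
    if total == 0:
        return []  # no annotated neighbor, no prediction
    return [(function, n / total) for function, n in votes.most_common()]


# Hypothetical data: X is the uncharacterized protein.
interactions = [("X", "P1"), ("P2", "X"), ("X", "P3"), ("P1", "P2")]
annotations = {"P1": ["transport"], "P2": ["transport"], "P3": ["signaling"]}
print(predict_function("X", interactions, annotations))
```

The vote fraction gives a rough sense of confidence, but it treats every interaction as equally reliable; weighting votes by the quality of the supporting experiment is one natural refinement.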

Overview of Signal Transduction

[Figure: stimuli (signals) act on the cell and trigger responses: secretion, motility, metabolism, genetic transfer, cell-cell communication, gene transcription, sporulation/apoptosis.]

Essential for understanding disease and designing drugs.

Problem Formulation

[Figure: a signal reaches a sensor and is passed through Protein-1, Protein-2 and Protein-3 to a transcription factor in the nucleus, which acts on Gene-1 and Gene-2: a cascade of (physical) protein interaction chains.]

1. Define cascade proteins
2. Find interaction path

Finding a plausible signal cascade path

• Short path
• Biologically meaningful (function, subcellular location)
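The "short path" criterion corresponds to a shortest-path search in the physical interaction graph. The sketch below uses breadth-first search on an unweighted, undirected graph; the function name and the toy network are hypothetical, and the biological filters (function, subcellular location) are left out.

```python
from collections import deque


def shortest_path(interactions, source, target):
    """Breadth-first search for a shortest chain of interactions.

    interactions: iterable of (protein_a, protein_b) physical interactions
    Returns the list of proteins from source to target, or None.
    """
    graph = {}
    for a, b in interactions:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    previous = {source: None}          # also serves as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:    # walk back to the source
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in sorted(graph.get(node, ())):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


# Hypothetical network with a short and a long route from sensor to TF.
interactions = [("sensor", "P1"), ("P1", "P2"), ("P2", "TF"),
                ("sensor", "P3"), ("P3", "P4"), ("P4", "P5"), ("P5", "TF")]
print(shortest_path(interactions, "sensor", "TF"))
```

To favor biologically meaningful paths, the same search can be run on a filtered graph, for example after dropping proteins whose annotated location or function is inconsistent with the cascade.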

Pathway Construction for Amino Acid Transport in Yeast

[Figure: pathway construction for amino acid transport in yeast.

Starting components: Ubc4p, Ubc2p, Ssy5p, Ptr3p, Ptr1p, Cup9. Conditions and target genes: poor nitrogen, GAP1… (general); rich amino acid, BAP2… (specific); PTR2, peptide transport.

Interaction network: Aut10p, Tup1p, Ptr3p, Ssy1p, Ssy5p, YPL158C, Cln1p, Gcn4p, Rpn6p, Mai1p, Ssn6p, Cdc28p, Jsn1p, Pre1p, Clb3p, Rtg3p, Mig1p, Sho1p, Gln3p, Vma22p, Dal80p, Cns1p, Stp1p, Gap1p, Bap2p, Tat2p, with the group labels amino acid synthesis, energy metabolism and glucose metabolism.

Edge legend: Two hybrid; Complex from Mass; Coprecipitate or pull-down; Other biochemical methods.

Working model: Ubc4p, Ubc2p, Ptr1p, Cup9p, Ptr2p, dipeptide, transcriptional control.]


Reading Assignments

Suggested reading:

• http://www.astbury.leeds.ac.uk/Facil/MStut/mstutorial.htm
• Yu Chen and Dong Xu. Computational Analyses of High Throughput Protein-Protein Interaction Data. Current Protein and Peptide Science. 4:159-181. 2003.

Optional reading:

www.bio.davidson.edu/courses/genomics/proteomics.html

Optional Assignment (1)

1. Make a yeast protein-interaction network connecting Rho2p, Rom2p, Ste20p, and Pfy1p. Use binary physical protein-protein interactions for all the edges. Try to make the network as simple as possible (i.e., involving few proteins).

2. Can you predict the function of the yeast gene YLR269C based on high-throughput protein-protein interaction data? How confident are you in this prediction?

Optional Assignment (2)

3. A protein complex was identified containing Rpn5p, Rri1p, YDR179Cp, YIL071Cp, YMR025Wp, YOL117Wp. Can you find the bait of this complex? How many possible binary interactions in this complex can be verified by yeast two-hybrid data?

4. It is known that Cup9p is degraded by the 26S proteasome. Identify as many proteins in the yeast 26S proteasome as possible. Find a physical interaction network between proteins in the 26S proteasome and Cup9p.