Transcript Topic 13
2° structure, TM regions, and solvent accessibility
Chapter 29, Gu and Bourne, “Structural Bioinformatics”

The Truth (Information) is Out (In) There
But we’re still having a tough time finding it.

Protein Secondary Structure Prediction
Given a protein sequence (primary structure), predict its secondary structure.

GHWIATRGQLIREAYEDYRHFSSECPFIP
CEEEEECCCEEEEECCCHHHHHHCCCCCC

E: β-strand, H: α-helix, C: coil

Mapping of the eight assignment states to three classes:
H: (H: α-helix, G: 3-10 helix, I: π-helix)
E: (E: β-strand, B: bridge)
C: (T: turn, S: bend, C: coil)

Assumption: short stretches of residues have a propensity to adopt a certain conformation ⇒ the conformation of the central residue in a sequence fragment depends only on its flanking residues (sliding window).

Why secondary structure prediction?
-- Because we can (kind of).
-- Because it could be a first step towards prediction of protein tertiary structure.

“Have solution, need problem.”
Nearly every imaginable algorithm has been applied to secondary structure prediction.

Secondary Structure Prediction Methods
1. First generation: single amino acid propensities. Chou-Fasman method (1974), GOR I-IV. ~56-60% accuracy
2. Second generation: segments of 3-51 adjacent residues. NNSSP, SSPAL. ~65% accuracy
3. Neural network: PHD, Psi-Pred, J-Pred
4. Support vector machine (SVM)
5. Hidden Markov Models (HMM)
Methods 3-5 are third generation methods using evolutionary information. ~76% accuracy

Secondary Structure Prediction Accuracy
1. Three-state per-residue prediction accuracy:

$$Q_3 = \frac{100}{N_{obs}} \sum_{i=1}^{3} M_{ii}$$

M_ii: number of residues observed in state i and predicted in state i
N_obs: total number of residues observed in the 3 states

2. Per-segment prediction accuracy (SOV, Segment OVerlap)
Per-stage segment overlap, where S1 is the observed SS segment and S2 is the predicted SS segment.

Single Residue Propensity Methods
Calculate the propensity for a given amino acid to adopt a certain SS type:

$$P_{\alpha}^{i} = \frac{P(\alpha \mid aa_i)}{p(\alpha)} = \frac{p(\alpha, aa_i)}{p(\alpha)\, p(aa_i)}$$

i: amino acid; α: secondary structure state

Example: from a data set with 30 proteins
#Ala = 2,000, #residues = 20,000, #helix = 4,000, #Ala in helix = 580
p(α, aa) = 580/20,000, p(α) = 4,000/20,000, p(aa) = 2,000/20,000
P = 580 / (4,000/10) = 1.45

Amino Acid Propensities to Secondary Structures
Residue  T   S   P   T   A    E    L    M    R   S   T   G
P(H)     69  77  57  69  142  151  121  145  98  77  69  57
Chou-Fasman method

Nearest Neighbor Methods
The idea is simple: predict the SS of the central residue of a given segment from homologous segments (neighbors).
For example, search the database for some number of the closest sequences to a subsequence defined by a window around the central residue, then use max(N_α, N_β, N_c) to assign the SS.

[Figure: query segment RSTEVRASRQLAKEKVN with a window around its central residue, compared against homologous sequences labeled with their secondary structure states (E, C, H).]

Key parameters:
1. How to define similarity?
2. What size window of sequence should be examined?
3. How many close sequences should be selected?
The Devil is in the details…

Psi-Pred Method
D. Jones, J. Mol. Biol. 292, 195 (1999).
Method: neural network
Input data: PSSM generated by PSI-BLAST
Bigger and better sequence database
-- Several databases combined, with data filtering
Training and test set preparation
-- SS prediction only makes sense for proteins with no homologous structure.
-- No sequence or structural homologues between training and test sets, checked by CATH and PSI-BLAST (mimicking the realistic situation).
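The two calculations defined above (the single-residue propensity and the Q3 accuracy) can be sketched in a few lines. This is a minimal illustration, not code from the slides: the function names `propensity` and `q3` are hypothetical, the propensity call uses the Ala-in-helix counts from the example, and the two short SS strings are made-up toy data.

```python
def propensity(n_joint: int, n_state: int, n_aa: int, n_total: int) -> float:
    """Propensity P = p(state, aa) / (p(state) * p(aa)), computed from raw counts."""
    p_joint = n_joint / n_total
    p_state = n_state / n_total
    p_aa = n_aa / n_total
    return p_joint / (p_state * p_aa)


def q3(observed: str, predicted: str) -> float:
    """Three-state per-residue accuracy: 100 * (sum of M_ii) / N_obs."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted strings must have equal length")
    if not observed:
        raise ValueError("empty input")
    # Summing M_ii over the three states is the same as counting matching positions
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)


# Counts from the slide example: Ala in helix, helix, Ala, all residues
ala_helix = propensity(n_joint=580, n_state=4000, n_aa=2000, n_total=20000)
print(round(ala_helix, 2))  # 1.45

# Toy strings, made up for illustration (not from the slides)
obs = "CEEEEECCCHHHHHHC"
pred = "CCEEEECCCHHHHHCC"
print(q3(obs, pred))  # 87.5
```

A propensity above 1 means the residue occurs in that state more often than expected by chance, which is why Ala (1.45) counts as a helix former.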
Psi-Pred Method--Neural Network
Window size = 15
Two networks

First network (sequence-to-structure):
-- 315 = (20 + 1) × 15 inputs; the extra unit indicates where the window spans either the N or C terminus
-- Data are scaled to the [0-1] range by using 1/[1+exp(-x)]
-- 75 hidden units
-- 3 outputs (H, E, L)

Second network (structure-to-structure):
-- Structural correlation between adjacent positions
-- 60 = (3 + 1) × 15 inputs
-- 60 hidden units
-- 3 outputs

Accuracy ~76%

Sample Psi-Pred Output
Conf: Confidence (0=low, 9=high) ---very important!!!!
Pred: Predicted secondary structure (H=helix, E=strand, C=coil)
AA: Target sequence

# PSIPRED HFORMAT (PSIPRED V2.3 by David Jones)

Conf: 966899999997542002357777557999999716898188034435788873356776
Pred: CCHHHHHHHHHHHHHHHCCCCCCCHHHHHHHHHHHCCCCCEEECCCCEEEEEEECCCCCC
  AA: MMWEQFKKEKLRGYLEAKNQRKVDFDIVELLDLINSFDDFVTLSSCSGRIAVVDLEKPGD
              10        20        30        40        50        60

Conf: 777179998337888888988751235636899718261220179868899999998557
Pred: CCCCEEEEEECCCCCHHHHHHHHHCCCCCEEEEECCCEEEEECCCHHHHHHHHHHHHHCC
  AA: KASSLFLGKWHEGVEVSEVAEAALRSRKVAWLIQYPPIIHVACRNIGAAKLLMNAANTAG
              70        80        90       100       110       120

Conf: 200242314703799714651435541487355188999999999999999889999999
Pred: CCCCCCEECCCEEEEEECCCEEEEEECCCCCEEECHHHHHHHHHHHHHHHHHHHHHHHHH
  AA: FRRSGVISLSNYVVEIASLERIELPVAEKGLMLVDDAYLSYVVRWANEKLLKGKEKLGRL
             130       140       150       160       170       180

***Compare the predictions for residues 9 and 17: note the difference in confidence (high versus very low).***

Sample Psi-Pred Output-II
Again, voting-rule (consensus) methods tend to be best.

2SOD  ATKAVCVLKGDGPVQGTIHFEAKGDTVVVTGSITGLTEGDHGFHVHQFGDNTQGCTSAGP
BPS   CCCCCCCCCCCCCCCCEEHCCHHECEEEEEEEEEEEECCCCCCCCCCCCCCCCCCCCCCC
D_R   CCHEEEEECCCCCCCCEEEHHHCCCEEEEEEEEECECCCCCCEEEECCCCCCCCCCCCCC
DSC   CCCEEEEEECCCCCEEEEEEEECCCEEEEEEEEEEEECCCCCEEEEECCCCCCCCCCCCC
GGR   CCCEEEEECCCCCCCEEEEEECCCCEEEEEEEEECCCCCCCCEEEEEECCCCCCCCCCCC
GOR   HHHCEEEECCCCCCCEEEEEECCCCEEEEEECEEEEEECCCCEEEEECCCCCCEEECCCC
H_K   CCCCEEEECCCCCCCCCEEECCCCCCEEEEECEEECCCCCCCEEEECCCCCCCCEEECCC
K_S   CCCCEEEEECCCCCCCCCEEECCCCCEEEECCCCCCCCCCCEEEEEEEECCCCCCCCCCC
JOI   CCCCEEEECCCCCCCCEEEEECCCCEEEEEEEEEEECCCCCCEEEEECCCCCCCCCCCCC
2SOD  ---EEEEE------EEEEEEEEE--EEEEEEEEE-----EEEEEEEE-------------

2SOD  HFNPLSKKHGGPKDEERHVGDLGNVTADKNGVAIVDIVDPLISLSGEYSIIGRTMVVHEK
BPS   CCCCCCCCCCCCCCCCCCCCCCECCCCCCHEECCCCCCCCCECCEECEEEEEEEEEEECC
D_R   CCCCCCCCCCCCCCCHHCECCCCCECCCCCCEEEEEEECCEEEECCCEEEEEEEEEEECC
DSC   CCCCCCCCCCCCCCEEEEECCCCCCCCCCCCEEEEEECCCCCCCCCCEEEEEEEEEEECC
GGR   CCCCCCCCCCCCCCCCEEECCCCCCCCCCCCCEEEEECCCCCCCCCCEEEECEEEEEECC
GOR   CCCCCCCCCCCCCCHHEEECCCCCCCCCCCCEEEEEEECCEEECCCCEEEEEEEEEECCC
H_K   CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCEECCCCCCCCCCCCCCHHHHHHEECCC
K_S   CCCCCCCCCCCCCCCCEEECCCCCCCCCCCCCEEEEEEEEEEEEECCCEEECCEEEEEEE
JOI   CCCCCCCCCCCCCCCCEEECCCCCCCCCCCCEEEEEECCCCECCCCCEEEEEEEEEEECC
2SOD  --------------------EEEEEE------EEEEEEE--------------EEEEE--

Prediction Accuracy (EVA)
[Figure: distribution of per-protein accuracy for PSIPRED, SSpro, PROF, PHDpsi, JPred2, and PHD. x-axis: percentage of correctly predicted residues per protein (30-100); y-axis: percentage of all 150 proteins (0-25).]
EVA: Automatic evaluation of prediction servers

How Far Can We Go?
Currently ~76%
Proteins with more than 100 homologues: 80%
Assignment is ambiguous (5-15%). Recall DSSP vs STRIDE.
-- Non-unique protein structures (dynamic), H-bond cutoff, etc.
Different secondary structures between homologues (~12%).
Non-locality. Secondary structure is influenced by long-range interactions.
-- Some segments can have multiple structure types (chameleon sequences).

Solvent accessibility
Conceptually similar problem to SS prediction: Buried vs. Exposed.
Weighted Ensemble Solvent Accessibility predictor: http://pipe.scs.fsu.edu/wesa.html
Example per-residue labels (E = exposed, B = buried): E E E E B B B B B B E E
Why bother? To provide structural context for putative mutations that one wants to characterize biochemically or biophysically.

Transmembrane Segment Prediction
Again, conceptually similar problem to SS prediction: TM vs. Not.
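The voting idea behind the 2SOD comparison above (several methods, one joint prediction) can be sketched as a per-residue majority vote. This is only an assumption about how such a consensus might work: the slides do not give the rule used for the JOI row, the function name `consensus` and the coil tie-break are hypothetical, and the input strings are toy data rather than rows from the 2SOD table.

```python
from collections import Counter


def consensus(predictions: list[str], tie_break: str = "C") -> str:
    """Per-residue majority vote over equal-length H/E/C prediction strings."""
    if not predictions:
        raise ValueError("need at least one prediction")
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must have the same length")
    result = []
    # zip(*predictions) yields one column (one residue position) at a time
    for column in zip(*predictions):
        counts = Counter(column)
        best = max(counts.values())
        winners = [state for state, n in counts.items() if n == best]
        # A unique winner is used as-is; ties fall back to tie_break (assumed rule)
        result.append(winners[0] if len(winners) == 1 else tie_break)
    return "".join(result)


# Toy predictions, made up for illustration (not taken from the 2SOD table)
preds = ["CCEEEHHC", "CEEEEHHC", "CCEECHCC", "HCEEEHHC"]
print(consensus(preds))  # CCEEEHHC
```

Isolated single-method errors (the lone H at position 1, the lone C at position 7) are outvoted, which is one plausible reason combined predictions tend to look cleaner than any individual row.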