Protein Secondary Structures
Assignment and prediction
Pernille Andersen
23.04.2007

Outline
• What is protein secondary structure?
• How can it be used?
• Different prediction methods
  – Alignment to homologues
  – Propensity methods
  – Neural networks
• Evaluation of prediction methods
• Links to prediction servers

Secondary Structure Elements
[Figure: protein structure with labelled β-strand, helix, bend and turn]

Use of secondary structure
• Classification of protein structures
• Definition of loops (active sites)
• Use in fold recognition methods
• Improvements of alignments
• Definition of domain boundaries
• Input for a number of alternative bioinformatics tools

Classification of secondary structure
• Defining features
  – Dihedral angles
  – Hydrogen bonds
  – Geometry
• Assigned manually by crystallographers or
• Automatic
  – DSSP (Kabsch & Sander, 1983)
  – STRIDE (Frishman & Argos, 1995)
  – DSSPcont (Andersen et al., 2002)

Dihedral Angles
From http://www.imb-jena.de
phi    dihedral angle of the N-Calpha bond
psi    dihedral angle of the Calpha-C bond
omega  dihedral angle of the C-N (peptide) bond

Helices
              phi(deg)  psi(deg)  H-bond pattern
-------------------------------------------------
alpha-helix    -57.8     -47.0    i+4
pi-helix       -57.1     -69.7    i+5
3-10 helix     -74.0      -4.0    i+3
(omega = 180 deg)
From http://www.imb-jena.de

Beta Strands
              phi(deg)  psi(deg)  omega(deg)
-------------------------------------------------
beta strand    -120      120      180
[Figure: antiparallel and parallel beta sheets]
From http://broccoli.mfn.ki.se/pps_course_96/

Secondary Structure Type Descriptions
* H = alpha helix
* G = 3-10 helix
* I = 5-helix (pi helix)
* E = extended strand, participates in beta ladder
* B = residue in isolated beta-bridge
* T = hydrogen bonded turn
* S = bend
* C = coil

Automatic assignment programs
• DSSP (http://www.cmbi.kun.nl/gv/dssp/)
• STRIDE (http://bioweb.pasteur.fr/seqanal/interfaces/stride.html)
• DSSPcont (http://cubic.bioc.columbia.edu/services/DSSPcont/)
• The Protein Data Bank
  visualizes DSSP assignments on structures in the database (go to the sequence details tab)

  # RESIDUE AA  STRUCTURE BP1 BP2  ACC   N-H-->O   O-->H-N   N-H-->O   O-->H-N     TCO  KAPPA  ALPHA    PHI    PSI   X-CA  Y-CA  Z-CA
  1     4 A  E              0   0  205    0, 0.0    2,-0.3    0, 0.0    0, 0.0   0.000  360.0  360.0  360.0  113.5    5.7  42.2  25.1
  2     5 A  H              0   0  127    2, 0.0    2,-0.4   21, 0.0   21, 0.0  -0.987  360.0 -152.8 -149.1  154.0    9.4  41.3  24.7
  3     6 A  V              0   0   66   -2,-0.3   21,-2.6    2, 0.0    2,-0.5  -0.995    4.6 -170.2 -134.3  126.3   11.5  38.4  23.5
  4     7 A  I  E     -A   23  0A  106   -2,-0.4    2,-0.4   19,-0.2   19,-0.2  -0.976   13.9 -170.8 -114.8  126.6   15.0  37.6  24.5
  5     8 A  I  E     -A   22  0A   74   17,-2.8   17,-2.8   -2,-0.5    2,-0.9  -0.972   20.8 -158.4 -125.4  129.1   16.6  34.9  22.4
  6     9 A  Q  E     -A   21  0A   86   -2,-0.4    2,-0.4   15,-0.2   15,-0.2  -0.910   29.5 -170.4  -98.9  106.4   19.9  33.0  23.0
  7    10 A  A  E     +A   20  0A   18   13,-2.5   13,-2.5   -2,-0.9    2,-0.3  -0.852   11.5  172.8 -108.1  141.7   20.7  31.8  19.5
  8    11 A  E  E     +A   19  0A   63   -2,-0.4    2,-0.3   11,-0.2   11,-0.2  -0.933    4.4  175.4 -139.1  156.9   23.4  29.4  18.4
  9    12 A  F  E     -A   18  0A   31    9,-1.5    9,-1.8   -2,-0.3    2,-0.4  -0.967   13.3 -160.9 -160.6  151.3   24.4  27.6  15.3
 10    13 A  Y  E     -A   17  0A   36   -2,-0.3    2,-0.4    7,-0.2    7,-0.2  -0.994   16.5 -156.0 -136.8  132.1   27.2  25.3  14.1
 11    14 A  L  E  >> -A   16  0A   24    5,-3.2    4,-1.7   -2,-0.4    5,-1.3  -0.929   11.7 -122.6 -120.0  133.5   28.0  24.8  10.4
 12    15 A  N  T  45S+     0   0   54   -2,-0.4   -2, 0.0    2,-0.2    0, 0.0  -0.884   84.3    9.0 -113.8  150.9   29.7  22.0   8.6
 13    16 A  P  T  45S+     0   0  114    0, 0.0   -1,-0.2    0, 0.0   -2, 0.0  -0.963  125.4   60.5  -86.5    8.5   32.0  21.6   6.8
 14    17 A  D  T  45S-     0   0   66    2,-0.1   -2,-0.2    1,-0.1    3,-0.1   0.752   89.3 -146.2  -64.6  -23.0   33.0  25.2   7.6

Secondary Structure Prediction
• What to predict?
  – All 8 types or pool types into groups

  DSSP                              Q3
  * H = alpha helix                 H
  * G = 3-10 helix                  H
  * I = 5-helix (pi helix)          H
  * E = extended strand             E
  * B = beta-bridge                 E
  * T = hydrogen bonded turn        C
  * S = bend                        C
  * C = coil                        C

Secondary Structure Prediction
• What to predict?
  – All 8 types or pool types into groups

  Straight HEC                      Q3
  * H = alpha helix                 H
  * E = extended strand             E
  * T = hydrogen bonded turn        C
  * S = bend                        C
  * C = coil                        C
  * G = 3-10 helix                  C
  * I = 5-helix (pi helix)          C
  * B = beta-bridge                 C

Secondary Structure Prediction
• Simple alignments
  – Align to a close homolog for which the structure has been experimentally solved.
• Heuristic methods (e.g., Chou-Fasman, 1974)
  – Apply scores for each amino acid and sum up over a window.
• Neural networks
  – Raw sequence (late 80’s)
  – BLOSUM matrix (e.g., PHD, early 90’s)
  – Position specific alignment profiles (e.g., PSIPRED, late 90’s)
  – Multiple networks balloting, probability conversion, output expansion (Petersen et al., 2000).

Improvement of accuracy
1974  Chou & Fasman      ~50-53%
1978  Garnier            63%
1987  Zvelebil           66%
1988  Qian & Sejnowski   64.3%
1993  Rost & Sander      70.8-72.0%
1997  Frishman & Argos   <75%
1999  Cuff & Barton      72.9%
1999  Jones              76.5%
2000  Petersen et al.    77.9%

Simple Alignments
• Solved structure of a homolog to the query is needed
• Homologous proteins have ~88% identical (3 state) secondary structure
• If no close homologue can be identified, alignments will give almost random results

Propensities: Amino acid preferences in α-helix
Propensities: Amino acid preferences in β-strand
Propensities: Amino acid preferences in coil

Chou-Fasman propensities
Name  P(a)  P(b)  P(turn)  f(i)   f(i+1)  f(i+2)  f(i+3)
Ala   142    83     66     0.060  0.076   0.035   0.058
Arg    98    93     95     0.070  0.106   0.099   0.085
Asp   101    54    146     0.147  0.110   0.179   0.081
Asn    67    89    156     0.161  0.083   0.191   0.091
Cys    70   119    119     0.149  0.050   0.117   0.128
Glu   151    37     74     0.056  0.060   0.077   0.064
Gln   111   110     98     0.074  0.098   0.037   0.098
Gly    57    75    156     0.102  0.085   0.190   0.152
His   100    87     95     0.140  0.047   0.093   0.054
Ile   108   160     47     0.043  0.034   0.013   0.056
Leu   121   130     59     0.061  0.025   0.036   0.070
Lys   114    74    101     0.055  0.115   0.072   0.095
Met   145   105     60     0.068  0.082   0.014   0.055
Phe   113   138     60     0.059  0.041   0.065   0.065
Pro    57    55    152     0.102  0.301   0.034   0.068
Ser    77    75    143     0.120  0.139   0.125   0.106
Thr    83   119     96     0.086  0.108   0.065   0.079
Trp   108   137     96     0.077  0.013   0.064   0.167
Tyr    69   147    114     0.082  0.065   0.114   0.125
Val   106   170     50     0.062  0.048   0.028   0.053

Chou-Fasman
• Generally applicable
• Works for sequences with no solved homologs
• But the accuracy is low!
• The problem is that the method does not use enough information about the structural context of a residue

Neural Networks
• Benefits
  – Generally applicable
  – Can capture higher order correlations
  – Inputs other than sequence information
• Drawbacks
  – Needs a large amount of data (different solved structures). However, today nearly 7000 structures with low sequence identity/high resolution are solved
  – Complex method with several pitfalls

Architecture
[Figure: feed-forward network. A window slides over the sequence IKEEHVIIQAEFYLNPDQSGEF…; the residues in the window feed the input layer, which is connected through weights to a hidden layer and an output layer with H, E and C units]

Sparse encoding
        Inp Neuron
AAcid   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
A       1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
R       0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
N       0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
D       0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
C       0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
Q       0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0
E       0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0
[Figure: input layer filled with the sparse (0/1) encoding of the residues in the window]

BLOSUM 62
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
B -2 -1  3  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3
Z -1  0  0  1 -3  3  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -3 -2 -2
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1
* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4
[Figure: input layer filled with BLOSUM 62 scores for the residues in the window instead of 0/1 values]

Secondary networks (Structure-to-Structure)
[Figure: second network. A window over the H/E/C predictions made by the first network for the sequence IKEEHVIIQAEFYLNPDQSGEF… feeds the input layer, which is connected through weights to a hidden layer and an output layer]

PHD method (Rost and Sander)
• Combine neural networks with sequence profiles
  – 6-8 percentage points increase in prediction accuracy over standard neural networks
• Use second layer “Structure to structure” network to filter predictions
• Jury of predictors
• Set up as mail server

PSI-Pred (Jones)
• Use alignments from iterative sequence searches (PSI-Blast) as input to a neural network
• Better predictions due to better sequence profiles
• Available as stand alone program and via the web

Position specific scoring matrices (PSI-BLAST profiles)
Pos  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
Res  I  K  E  E  H  V  I  I  Q  A  E  F  Y  L  N  P  D
A   -2 -1  5 -4 -4 -3  0 -3 -2  2 -1 -3  3 -1 -1 -2 -3
R   -4 -1 -3 -3  2  0 -2  0 -3 -4  3 -5 -5 -3 -4  4 -2
N   -5 -2 -3  2  1 -4 -4 -5 -2 -4  1 -5 -5 -4  4 -4  1
D   -5 -2 -3  5  1 -5  1 -5 -3 -3  1 -5 -6 -2  1 -4  5
C   -2 -3 -3 -6 -5 -4 -4 -4 -5  2 -1 -4  3  1  5 -5 -6
Q   -4 -1  3  1  1 -4 -2 -2  4 -3  0 -4 -4  5 -3  0 -2
E   -4  3  1  5 -2 -2 -4 -5 -1 -1  1 -4 -5  1 -4 -3  2
G   -5 -3 -2 -4 -4 -3 -4 -6  3 -4 -4 -1 -2 -1  2  3  2
H   -5 -2 -3 -3  9 -5 -5  1  5 -2 -3 -1 -1 -1 -4  2 -1
I    6 -2 -3 -6 -5  1  1  2 -5  1 -1  1  0 -1 -4 -5 -2
L    0 -3 -3 -6 -2 -2  0  4 -3 -1 -3  1 -4  1 -4 -4 -2
K   -4  4 -2 -2 -3  1 -2 -4 -3 -4  0 -5 -5 -3 -3  0 -3
M    0 -2 -2 -5 -4  0  0 -1 -4 -3  3  2 -3 -3 -2 -4 -5
F   -2 -4 -4 -6 -4  1  2  0 -2 -4 -5  5  3  1 -4 -3 -4
P   -4 -3 -3 -4 -5 -4 -5 -5 -4  1  4 -1 -5 -5 -5  0 -5
S   -4  1 -1 -2 -3 -3  1 -2  2  2 -1 -4 -2 -1  2  1 -1
T   -2  1 -2 -3 -4  3 -1  0 -1  3 -3 -4 -2 -1  0 -2  2
W   -4 -4 -4 -6 -5 -5 -5 -3 -4 -5 -6 -3 -2 -2 -5 -1 -6
Y   -3 -3 -3 -5  1 -3 -3  5  2 -1 -3  5  7  3  0  5 -3
V    4  2  1 -5 -5  5  4 -1 -2  1 -1  2  1 -2  0 -3 -4

Several different architectures
• Sequence-to-structure
  – Window sizes 15, 17, 19 and 21
  – Hidden units 50 and 75
  – 10-fold cross validation => 80 predictions (4 window sizes x 2 hidden layer sizes x 10 folds)
  – Output: C C H H C C C
• Structure-to-structure
  – Window size 17
  – Hidden units 40
  – 10-fold cross validation => 800 predictions (80 first-level predictions x 10 folds)
  – Output: CCCCCCC

The majority rules
• Combining predictions from several networks improves the prediction
• Combinations of 800 different networks were used in the method described by Petersen TN et al. 2000, Prediction of protein secondary structure at 80 % accuracy. Proteins 41: 17-20

Activities to probabilities
[Figure: the helix, strand and coil output activities of the networks are converted into calculated probabilities through a conversion table]

Benchmarking secondary structure predictions
• EVA
  – Newly solved structures are sent to prediction servers.
  – Every week
    http://cubic.bioc.columbia.edu/eva/sec/res_sec.html

EVA results (Rost et al., 2001)
• PROFphd      77.0%
• PSIPRED      76.8%
• SAM-T99sec   76.1%
• SSpro        76.0%
• Jpred2       75.5%
• PHD          71.7%
  – cubic.columbia.edu/eva

Links to servers
• Several links:
  http://cubic.bioc.columbia.edu/eva/doc/explain_methods.html#type_sec
• ProfPHD  http://www.predictprotein.org/
• PSIPRED  http://bioinf.cs.ucl.ac.uk/psipred/
• JPred    http://www.compbio.dundee.ac.uk/~www-jpred/
• SAM T02  http://www.cse.ucsc.edu/research/compbio/HMM-apps/T02-query.html

Practical Conclusions
• If you need a secondary structure prediction, use one of the newer methods based on advanced machine learning, such as:
  – ProfPHD
  – PSIPRED
  – JPred
  – SAM T02
• And not one of the older ones, such as:
  – Chou-Fasman
  – Garnier
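
Sketch: propensity scoring over a window

The heuristic-methods slide describes applying a score to each amino acid and summing over a window. The Python sketch below illustrates only that idea, using the P(a) and P(b) columns of the Chou-Fasman propensity table. It is not the full Chou-Fasman procedure (no nucleation or extension rules, no turn prediction). The window size of 5, the threshold of 100 and the function names are assumptions made for illustration.

```python
# P(a) (helix) and P(b) (strand) propensities from the Chou-Fasman table,
# keyed by one-letter amino acid code. 100 means "no preference".
P_ALPHA = {
    "A": 142, "R": 98, "D": 101, "N": 67, "C": 70, "E": 151, "Q": 111,
    "G": 57, "H": 100, "I": 108, "L": 121, "K": 114, "M": 145, "F": 113,
    "P": 57, "S": 77, "T": 83, "W": 108, "Y": 69, "V": 106,
}
P_BETA = {
    "A": 83, "R": 93, "D": 54, "N": 89, "C": 119, "E": 37, "Q": 110,
    "G": 75, "H": 87, "I": 160, "L": 130, "K": 74, "M": 105, "F": 138,
    "P": 55, "S": 75, "T": 119, "W": 137, "Y": 147, "V": 170,
}


def window_average(seq, table, window=5):
    """Average propensity in a window centred on each residue.

    Windows are truncated at the sequence ends (an assumption).
    """
    half = window // 2
    scores = []
    for i in range(len(seq)):
        lo, hi = max(0, i - half), min(len(seq), i + half + 1)
        values = [table[aa] for aa in seq[lo:hi]]
        scores.append(sum(values) / len(values))
    return scores


def predict(seq, window=5, threshold=100):
    """Assign H, E or C per residue from the windowed averages.

    The decision rule (hypothetical): helix if the helix average reaches
    the threshold and is at least the strand average, strand if the
    strand average reaches the threshold, coil otherwise.
    """
    helix = window_average(seq, P_ALPHA, window)
    strand = window_average(seq, P_BETA, window)
    states = []
    for h, e in zip(helix, strand):
        if h >= threshold and h >= e:
            states.append("H")
        elif e >= threshold:
            states.append("E")
        else:
            states.append("C")
    return "".join(states)


if __name__ == "__main__":
    print(predict("IKEEHVIIQAEFYLNPDQSGEF"))
```

Because each residue is scored only from its immediate neighbours, the sketch also shows the limitation named on the Chou-Fasman slide: very little structural context enters the prediction.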
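
Sketch: reducing 8 DSSP states to 3 and scoring a prediction

The two "What to predict?" slides give two ways of pooling the eight DSSP states into H, E and C. The sketch below encodes both mappings and adds a per-residue three-state accuracy (often called Q3: the percentage of residues assigned to the correct class), which is presumably the kind of figure quoted in the accuracy tables. The function names are assumptions, and treating DSSP's blank state as coil is also an assumption.

```python
# Pooled scheme: H, G, I -> H; E, B -> E; T, S, C -> C.
POOLED = {"H": "H", "G": "H", "I": "H",
          "E": "E", "B": "E",
          "T": "C", "S": "C", "C": "C"}

# "Straight HEC" scheme: only H and E are kept, everything else is coil.
STRAIGHT_HEC = {"H": "H", "E": "E",
                "G": "C", "I": "C", "B": "C",
                "T": "C", "S": "C", "C": "C"}


def reduce_states(dssp, mapping=POOLED):
    """Map an 8-state DSSP string to a 3-state string.

    Any symbol not in the mapping (for example the blank that DSSP
    writes for unassigned residues) is treated as coil.
    """
    return "".join(mapping.get(state, "C") for state in dssp)


def q3(predicted, observed):
    """Percentage of residues whose 3-state class is predicted correctly."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("empty sequences cannot be scored")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


if __name__ == "__main__":
    observed = reduce_states("   EEEEEEEETTT")
    print(observed, q3("CCEEEEEEEEECCC", observed))
```

The choice of mapping matters when comparing methods: the same prediction can score differently under the pooled and the straight HEC scheme, so reported accuracies are only comparable when the same reduction was used.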
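
Sketch: sparse encoding of a sequence window

The sparse encoding slide shows each amino acid switching on exactly one of 20 input neurons, and the architecture slide shows a window of residues feeding the input layer. The sketch below combines the two. The amino acid order follows the first rows of the sparse encoding table and the column order of the BLOSUM 62 slide. The window size of 7, the function names and the all-zero padding beyond the sequence ends are assumptions (some implementations add a 21st input for "outside the sequence" instead).

```python
# Amino acid order: A R N D C Q E as on the sparse encoding slide,
# continued in the BLOSUM 62 column order (assumption).
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def sparse_encode(residue):
    """Return a 20-element 0/1 vector with a single 1 for the residue.

    Unknown symbols give an all-zero vector (assumption).
    """
    vector = [0] * len(AMINO_ACIDS)
    index = AMINO_ACIDS.find(residue)
    if index >= 0:
        vector[index] = 1
    return vector


def encode_window(seq, center, window=7):
    """Concatenate the sparse encodings of the residues in a window.

    Positions that fall outside the sequence are padded with zeros.
    """
    half = window // 2
    inputs = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(seq):
            inputs.extend(sparse_encode(seq[pos]))
        else:
            inputs.extend([0] * len(AMINO_ACIDS))
    return inputs


if __name__ == "__main__":
    x = encode_window("IKEEHVIIQAEFYLNPDQSGEF", center=3, window=7)
    print(len(x), sum(x))
```

Replacing `sparse_encode` with a lookup of the residue's BLOSUM 62 row, or of the matching PSI-BLAST profile column, gives the richer inputs described on the later slides without changing the window logic.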