Protein Secondary Structures: Assignment and prediction
Claus Lundegaard
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS, TECHNICAL UNIVERSITY OF DENMARK (DTU)
April 8, 2003

Use of secondary structure
• Classification of protein structures
• Definition of loops/core
• Use in fold recognition methods
• Improvements of alignments
• Definition of domain boundaries

Secondary Structure Elements
[Figure slide]

Classification of secondary structure
• Defining features
  – Dihedral angles
  – Hydrogen bonds
  – Geometry
• Assigned manually by crystallographers or
• Automatic
  – DSSP (Kabsch & Sander, 1983)
  – STRIDE (Frishman & Argos, 1995)
  – Continuum (Andersen et al.)

Dihedral Angles
phi    - dihedral angle about the N-Calpha bond
psi    - dihedral angle about the Calpha-C bond
omega  - dihedral angle about the C-N (peptide) bond
From http://www.imb-jena.de

Alpha helices
                            phi(deg)   psi(deg)   H-bond pattern
-----------------------------------------------------------------
right-handed alpha-helix     -57.8      -47.0      i+4
pi-helix                     -57.1      -69.7      i+5
3-10 helix                   -74.0       -4.0      i+3
(omega is 180 deg in all cases)
-----------------------------------------------------------------
From http://www.imb-jena.de

Beta Strands
                            phi(deg)   psi(deg)   omega(deg)
-----------------------------------------------------------------
beta strand                  -120        120        180
-----------------------------------------------------------------
Hydrogen bond patterns in beta sheets. Here a four-stranded beta sheet is drawn schematically which contains three antiparallel and one parallel strand. Hydrogen bonds are indicated with red lines (antiparallel strands) and green lines (parallel strands) connecting the hydrogen and receptor oxygen.
From http://broccoli.mfn.ki.se/pps_course_96/

Secondary Structure Types
* H = alpha helix
* B = residue in isolated beta-bridge
* E = extended strand, participates in beta ladder
* G = 3-helix (3/10 helix)
* I = 5 helix (pi helix)
* T = hydrogen bonded turn
* S = bend

Automatic assignment programs
• DSSP ( http://www.cmbi.kun.nl/gv/dssp/ )
• STRIDE ( http://www.hgmp.mrc.ac.uk/Registered/Option/stride.html )

  # RESIDUE AA  STRUCTURE BP1 BP2  ACC   N-H-->O   O-->H-N   N-H-->O   O-->H-N     TCO  KAPPA  ALPHA    PHI    PSI  X-CA Y-CA Z-CA
  1     4 A  E               0   0  205    0, 0.0    2,-0.3    0, 0.0    0, 0.0   0.000  360.0  360.0  360.0  113.5   5.7 42.2 25.1
  2     5 A  H               0   0  127    2, 0.0    2,-0.4   21, 0.0   21, 0.0  -0.987  360.0 -152.8 -149.1  154.0   9.4 41.3 24.7
  3     6 A  V               0   0   66   -2,-0.3   21,-2.6    2, 0.0    2,-0.5  -0.995    4.6 -170.2 -134.3  126.3  11.5 38.4 23.5
  4     7 A  I  E -A       23   0A 106   -2,-0.4    2,-0.4   19,-0.2   19,-0.2  -0.976   13.9 -170.8 -114.8  126.6  15.0 37.6 24.5
  5     8 A  I  E -A       22   0A  74   17,-2.8   17,-2.8   -2,-0.5    2,-0.9  -0.972   20.8 -158.4 -125.4  129.1  16.6 34.9 22.4
  6     9 A  Q  E -A       21   0A  86   -2,-0.4    2,-0.4   15,-0.2   15,-0.2  -0.910   29.5 -170.4  -98.9  106.4  19.9 33.0 23.0
  7    10 A  A  E +A       20   0A  18   13,-2.5   13,-2.5   -2,-0.9    2,-0.3  -0.852   11.5  172.8 -108.1  141.7  20.7 31.8 19.5
  8    11 A  E  E +A       19   0A  63   -2,-0.4    2,-0.3   11,-0.2   11,-0.2  -0.933    4.4  175.4 -139.1  156.9  23.4 29.4 18.4
  9    12 A  F  E -A       18   0A  31    9,-1.5    9,-1.8   -2,-0.3    2,-0.4  -0.967   13.3 -160.9 -160.6  151.3  24.4 27.6 15.3
 10    13 A  Y  E -A       17   0A  36   -2,-0.3    2,-0.4    7,-0.2    7,-0.2  -0.994   16.5 -156.0 -136.8  132.1  27.2 25.3 14.1
 11    14 A  L  E >> -A    16   0A  24    5,-3.2    4,-1.7   -2,-0.4    5,-1.3  -0.929   11.7 -122.6 -120.0  133.5  28.0 24.8 10.4
 12    15 A  N  T 45S+      0   0   54   -2,-0.4   -2, 0.0    2,-0.2    0, 0.0  -0.884   84.3    9.0 -113.8  150.9  29.7 22.0  8.6
 13    16 A  P  T 45S+      0   0  114    0, 0.0   -1,-0.2    0, 0.0   -2, 0.0  -0.963  125.4   60.5  -86.5    8.5  32.0 21.6  6.8
 14    17 A  D  T 45S-      0   0   66    2,-0.1   -2,-0.2    1,-0.1    3,-0.1   0.752   89.3 -146.2  -64.6  -23.0  33.0 25.2  7.6
 15    18 A  Q  T <5 +      0   0  132   -4,-1.7    2,-0.3    1,-0.2   -3,-0.2   0.936   51.1  134.1   52.9   50.0  33.3 24.2 11.2
 16    19 A  S  E < +A     11   0A  44   -5,-1.3   -5,-3.2    2, 0.0    2,-0.3  -0.877   28.9  174.9 -124.8  156.8  32.1 27.7 12.3
 17    20 A  G  E -A       10   0A  28   -2,-0.3    2,-0.3   -7,-0.2   -7,-0.2  -0.893   15.9 -146.5 -151.0 -178.9  29.6 28.7 14.8
 18    21 A  E  E -A        9   0A  14   -9,-1.8   -9,-1.5   -2,-0.3    2,-0.4  -0.979    5.0 -169.6 -158.6  146.0  28.0 31.5 16.7
 19    22 A  F  E +A        8   0A   3   12,-0.4   12,-2.3   -2,-0.3    2,-0.3  -0.982   27.8  149.2 -139.1  120.3  26.5 32.2 20.1
 20    23 A  M  E -AB       7  30A   0  -13,-2.5  -13,-2.5   -2,-0.4    2,-0.4  -0.983   39.7 -127.8 -152.1  161.6  24.5 35.4 20.6
 21    24 A  F  E -AB       6  29A  45    8,-2.4    7,-2.9   -2,-0.3    8,-1.0  -0.934   23.9 -164.1 -112.5  137.7  21.7 37.0 22.6
 22    25 A  D  E -AB       5  27A   6  -17,-2.8  -17,-2.8   -2,-0.4    2,-0.5  -0.948    6.9 -165.0 -123.7  138.3  18.9 38.9 20.8
 23    26 A  F  E > S-AB    4  26A  76    3,-3.5    3,-2.1   -2,-0.4  -19,-0.2  -0.947   78.4  -27.2 -127.3  111.5  16.4 41.3 22.3
 24    27 A  D  T 3 S-      0   0   74  -21,-2.6  -20,-0.1   -2,-0.5   -1,-0.1   0.904  128.9  -46.6   50.4   45.0  13.4 42.1 20.2
 25    28 A  G  T 3 S+      0   0   20  -22,-0.3    2,-0.4    1,-0.2   -1,-0.3   0.291  118.8  109.3   84.7  -11.1  15.4 41.4 17.0
 26    29 A  D  E < S-B    23   0A 114   -3,-2.1   -3,-3.5  109, 0.0    2,-0.3  -0.822   71.8 -114.7 -103.1  140.3  18.4 43.4 18.1
 27    30 A  E  E -B       22   0A   8   -2,-0.4   -5,-0.3   -5,-0.2    3,-0.1  -0.525   24.9 -177.7  -74.1  127.5  21.8 41.8 19.1
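
The PHI and PSI columns in the DSSP output are dihedral angles of the kind defined on the Dihedral Angles slide. As a minimal sketch (not DSSP's own code), a signed dihedral angle can be computed from four atom positions as below. The function name `dihedral` and the helper names are illustrative, and the points in the usage lines are synthetic, not real protein coordinates. For phi of residue i the four atoms are C(i-1), N(i), CA(i), C(i); for psi they are N(i), CA(i), C(i), N(i+1).

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees about the p1-p2 bond."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = (b1[0] / norm, b1[1] / norm, b1[2] / norm)
    # Remove the components along the central bond, leaving the
    # projections of the two outer bonds onto the perpendicular plane.
    v = _sub(b0, tuple(_dot(b0, b1) * c for c in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * c for c in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))

# Synthetic test points (illustrative only).
a, b, c = (1, 0, 0), (0, 0, 0), (0, 0, 1)
print(dihedral(a, b, c, (1, 0, 1)))    # cis arrangement
print(dihedral(a, b, c, (0, 1, 1)))    # quarter turn
print(dihedral(a, b, c, (-1, 0, 1)))   # trans arrangement
```

Using `atan2` on the two projected components gives the sign of the angle directly, which is what distinguishes, for example, the negative phi of a right-handed helix from its mirror image.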

Secondary Structure Prediction
• What to predict?
  – All 8 types, or pool types into groups (Q3)
    * H = alpha helix
    * B = residue in isolated beta-bridge
    * E = extended strand, participates in beta ladder
    * G = 3-helix (3/10 helix)
    * I = 5 helix (pi helix)
    * T = hydrogen bonded turn
    * S = bend
    * C/. = random coil
  – Pooled classes: H, E, C (pooling schemes shown on the slide: Straight, CASP)

Secondary Structure Prediction
• Simple alignments
• Heuristic methods (e.g., Chou-Fasman, 1974)
• Neural networks (different inputs)
  – Raw sequence (late 80's)
  – Blosum matrix (e.g., PHD, early 90's)
  – Position specific alignment profiles (e.g., PsiPred, late 90's)
  – Multiple networks balloting, probability conversion, output expansion (Petersen et al., 2000)

Improvement of accuracy
1974  Chou & Fasman      ~50-53%
1978  Garnier            63%
1987  Zvelebil           66%
1988  Qian & Sejnowski   64.3%
1993  Rost & Sander      70.8-72.0%
1997  Frishman & Argos   <75%
1999  Cuff & Barton      72.9%
1999  Jones              76.5%
2000  Petersen et al.    77.9%

Simple Alignments
• Solved structures homologous to query needed
• Homologous proteins have ~88% identical (3 state) secondary structure
• If no homologue can be identified, alignment will give almost random results

Amino acid preferences in alpha helix
[Figure slide]

Amino acid preferences in beta strand
[Figure slide]

Amino acid preferences in coil
[Figure slide]

Chou-Fasman
Name  P(a)  P(b)  P(turn)  f(i)   f(i+1)  f(i+2)  f(i+3)
Ala   142    83     66     0.060  0.076   0.035   0.058
Arg    98    93     95     0.070  0.106   0.099   0.085
Asp   101    54    146     0.147  0.110   0.179   0.081
Asn    67    89    156     0.161  0.083   0.191   0.091
Cys    70   119    119     0.149  0.050   0.117   0.128
Glu   151    37     74     0.056  0.060   0.077   0.064
Gln   111   110     98     0.074  0.098   0.037   0.098
Gly    57    75    156     0.102  0.085   0.190   0.152
His   100    87     95     0.140  0.047   0.093   0.054
Ile   108   160     47     0.043  0.034   0.013   0.056
Leu   121   130     59     0.061  0.025   0.036   0.070
Lys   114    74    101     0.055  0.115   0.072   0.095
Met   145   105     60     0.068  0.082   0.014   0.055
Phe   113   138     60     0.059  0.041   0.065   0.065
Pro    57    55    152     0.102  0.301   0.034   0.068
Ser    77    75    143     0.120  0.139   0.125   0.106
Thr    83   119     96     0.086  0.108   0.065   0.079
Trp   108   137     96     0.077  0.013   0.064   0.167
Tyr    69   147    114     0.082  0.065   0.114   0.125
Val   106   170     50     0.062  0.048   0.028   0.053

Chou-Fasman
1. Assign all of the residues in the peptide the appropriate set of parameters.
2. Scan through the peptide and identify regions where 4 out of 6 contiguous residues have P(alpha-helix) > 100. That region is declared an alpha-helix. Extend the helix in both directions until a set of four contiguous residues with an average P(alpha-helix) < 100 is reached. That is declared the end of the helix. If the segment defined by this procedure is longer than 5 residues and the average P(alpha-helix) > P(beta-sheet) for that segment, the segment can be assigned as a helix.
3. Repeat this procedure to locate all of the helical regions in the sequence.
4. Scan through the peptide and identify a region where 3 out of 5 of the residues have a value of P(beta-sheet) > 100. That region is declared a beta-sheet. Extend the sheet in both directions until a set of four contiguous residues with an average P(beta-sheet) < 100 is reached. That is declared the end of the beta-sheet. Any segment of the region located by this procedure is assigned as a beta-sheet if the average P(beta-sheet) > 105 and the average P(beta-sheet) > P(alpha-helix) for that region.
5. Any region containing overlapping alpha-helical and beta-sheet assignments is taken to be helical if the average P(alpha-helix) > P(beta-sheet) for that region. It is a beta-sheet if the average P(beta-sheet) > P(alpha-helix) for that region.
6. To identify a bend at residue number j, calculate the following value:
   p(t) = f(j) × f(j+1) × f(j+2) × f(j+3)
   where f(j) is the f(i) value of residue j, f(j+1) is the f(i+1) value of residue j+1, f(j+2) is the f(i+2) value of residue j+2, and f(j+3) is the f(i+3) value of residue j+3. If (1) p(t) > 0.000075; (2) the average value of P(turn) in the tetrapeptide is > 100 (on the scale of the table above); and (3) the averages for the tetrapeptide obey the inequality P(alpha-helix) < P(turn) > P(beta-sheet), then a beta-turn is predicted at that location.
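
Two of the Chou-Fasman steps, helix nucleation (step 2) and the bend test (step 6), can be sketched in a few lines. This is only an illustration under stated assumptions, not a full implementation: helix extension, the sheet scan and overlap resolution are left out. The parameter values are copied from the table on the slide and keyed by one-letter code; the names `CF`, `helix_nuclei` and `turn_at` are made up here; and the P(turn) threshold is taken as 100, on the same scale as the table.

```python
# Per residue: P(a), P(b), P(turn), f(i), f(i+1), f(i+2), f(i+3)
CF = {
    "A": (142, 83, 66, 0.060, 0.076, 0.035, 0.058),
    "R": (98, 93, 95, 0.070, 0.106, 0.099, 0.085),
    "D": (101, 54, 146, 0.147, 0.110, 0.179, 0.081),
    "N": (67, 89, 156, 0.161, 0.083, 0.191, 0.091),
    "C": (70, 119, 119, 0.149, 0.050, 0.117, 0.128),
    "E": (151, 37, 74, 0.056, 0.060, 0.077, 0.064),
    "Q": (111, 110, 98, 0.074, 0.098, 0.037, 0.098),
    "G": (57, 75, 156, 0.102, 0.085, 0.190, 0.152),
    "H": (100, 87, 95, 0.140, 0.047, 0.093, 0.054),
    "I": (108, 160, 47, 0.043, 0.034, 0.013, 0.056),
    "L": (121, 130, 59, 0.061, 0.025, 0.036, 0.070),
    "K": (114, 74, 101, 0.055, 0.115, 0.072, 0.095),
    "M": (145, 105, 60, 0.068, 0.082, 0.014, 0.055),
    "F": (113, 138, 60, 0.059, 0.041, 0.065, 0.065),
    "P": (57, 55, 152, 0.102, 0.301, 0.034, 0.068),
    "S": (77, 75, 143, 0.120, 0.139, 0.125, 0.106),
    "T": (83, 119, 96, 0.086, 0.108, 0.065, 0.079),
    "W": (108, 137, 96, 0.077, 0.013, 0.064, 0.167),
    "Y": (69, 147, 114, 0.082, 0.065, 0.114, 0.125),
    "V": (106, 170, 50, 0.062, 0.048, 0.028, 0.053),
}

def helix_nuclei(seq, window=6, needed=4):
    """Start indices of windows with at least `needed` residues having P(a) > 100."""
    starts = []
    for i in range(len(seq) - window + 1):
        hits = sum(1 for aa in seq[i:i + window] if CF[aa][0] > 100)
        if hits >= needed:
            starts.append(i)
    return starts

def turn_at(seq, j):
    """Apply the three bend conditions of step 6 to the tetrapeptide starting at j."""
    tetra = seq[j:j + 4]
    if len(tetra) < 4:
        return False
    p_t = 1.0
    for k, aa in enumerate(tetra):
        p_t *= CF[aa][3 + k]          # f(i), f(i+1), f(i+2), f(i+3) in turn

    def mean(col):
        return sum(CF[aa][col] for aa in tetra) / 4

    p_a, p_b, p_turn = mean(0), mean(1), mean(2)
    return p_t > 0.000075 and p_turn > 100 and p_a < p_turn > p_b

seq = "IKEEHVIIQAEFYLNPDQSGEF"       # example sequence used on the slides
print(helix_nuclei(seq))
print([j for j in range(len(seq) - 3) if turn_at(seq, j)])
```

Keeping all seven parameters in one tuple per residue mirrors the layout of the slide table, so each value can be checked against it by eye.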

Chou-Fasman
• Generally applicable
• Works for sequences with no solved homologs
• Low accuracy

Neural Networks
• Benefits
  – Generally applicable
  – Can capture higher order correlations
  – Inputs other than sequence information
• Drawbacks
  – Needs many data (different solved structures)
  – Risk of overtraining

Architecture
[Figure: a window over the sequence IKEEHVIIQAEFYLNPDQSGEF… feeds the input layer, which is connected through weights to a hidden layer and an output layer with units H, E, C]

Sparse encoding
AAcid  Inp Neuron
        1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
A       1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
R       0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
N       0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
D       0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
C       0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
Q       0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0
E       0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0

Input Layer
[Figure: a sparse-encoded residue (0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0) presented to the input layer for one position of the sequence window]

BLOSUM 62
     A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A    4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
R   -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
N   -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
D   -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
C    0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
Q   -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
E   -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
G    0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
H   -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
I   -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
L   -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
K   -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
M   -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
F   -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
P   -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
S    1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
T    0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
W   -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
Y   -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
V    0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
B   -2 -1  3  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3
Z   -1  0  0  1 -3  3  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -3 -2 -2
X    0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1
*   -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4

Input Layer
[Figure: the BLOSUM 62 row for E (-1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2) presented to the input layer for one position of the sequence window]

Structure to Structure
[Figure: a window over per-residue H/E/C predictions for the sequence IKEEHVIIQAEFYLNPDQSGEF… feeds the input layer of a second network, which is connected through weights to a hidden layer and an output layer]
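
The two input encodings shown on the Sparse encoding and BLOSUM 62 slides can be sketched as follows. This is a minimal illustration, not the code of any of the published predictors. The names `sparse_row`, `encode_window` and `BLOSUM_ROWS` are hypothetical, padding positions outside the sequence with zeros is an assumption, and only the E row of BLOSUM 62 (copied from the slide) is filled in to keep the example short.

```python
ALPHABET = "ARNDCQEGHILKMFPSTWYV"   # column order used on the slides

def sparse_row(aa):
    """One-hot vector: 1 at the residue's own position, 0 elsewhere."""
    return [1 if a == aa else 0 for a in ALPHABET]

# Substitution-matrix rows keyed by residue; only E is given here.
BLOSUM_ROWS = {
    "E": [-1, 0, 0, 2, -4, 2, 5, -2, 0, -3, -3, 1, -2, -3, -1, 0, -1, -3, -2, -2],
}

def encode_window(seq, center, size, row_for):
    """Concatenate per-residue vectors for a window centred on `center`.

    Window positions that fall outside the sequence are encoded as
    all zeros (an assumption made for this sketch).
    """
    half = size // 2
    vec = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(seq):
            vec.extend(row_for(seq[pos]))
        else:
            vec.extend([0] * len(ALPHABET))
    return vec

seq = "IKEEHVIIQAEFYLNPDQSGEF"
x = encode_window(seq, center=3, size=7, row_for=sparse_row)
print(len(x))                        # 7 positions x 20 input neurons
e_only = encode_window(seq, center=2, size=1, row_for=BLOSUM_ROWS.__getitem__)
print(e_only)
```

Passing the per-residue lookup as `row_for` keeps the window logic identical for both encodings, so switching from sparse input to a substitution matrix or a profile only changes the lookup.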

PHD method (Rost and Sander)
• Combine neural networks with sequence profiles
  – 6-8 percentage points increase in prediction accuracy over standard neural networks
• Use second layer "Structure to structure" network to filter predictions
• Jury of predictors
• Set up as mail server

Position specific scoring matrices
(BLAST profiles)
Rows are amino acids; columns are sequence positions 1-17.
Pos   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
Res   I  K  E  E  H  V  I  I  Q  A  E  F  Y  L  N  P  D
A    -2 -1  5 -4 -4 -3  0 -3 -2  2 -1 -3  3 -1 -1 -2 -3
R    -4 -1 -3 -3  2  0 -2  0 -3 -4  3 -5 -5 -3 -4  4 -2
N    -5 -2 -3  2  1 -4 -4 -5 -2 -4  1 -5 -5 -4  4 -4  1
D    -5 -2 -3  5  1 -5  1 -5 -3 -3  1 -5 -6 -2  1 -4  5
C    -2 -3 -3 -6 -5 -4 -4 -4 -5  2 -1 -4  3  1  5 -5 -6
Q    -4 -1  3  1  1 -4 -2 -2  4 -3  0 -4 -4  5 -3  0 -2
E    -4  3  1  5 -2 -2 -4 -5 -1 -1  1 -4 -5  1 -4 -3  2
G    -5 -3 -2 -4 -4 -3 -4 -6  3 -4 -4 -1 -2 -1  2  3  2
H    -5 -2 -3 -3  9 -5 -5  1  5 -2 -3 -1 -1 -1 -4  2 -1
I     6 -2 -3 -6 -5  1  1  2 -5  1 -1  1  0 -1 -4 -5 -2
L     0 -3 -3 -6 -2 -2  0  4 -3 -1 -3  1 -4  1 -4 -4 -2
K    -4  4 -2 -2 -3  1 -2 -4 -3 -4  0 -5 -5 -3 -3  0 -3
M     0 -2 -2 -5 -4  0  0 -1 -4 -3  3  2 -3 -3 -2 -4 -5
F    -2 -4 -4 -6 -4  1  2  0 -2 -4 -5  5  3  1 -4 -3 -4
P    -4 -3 -3 -4 -5 -4 -5 -5 -4  1  4 -1 -5 -5 -5  0 -5
S    -4  1 -1 -2 -3 -3  1 -2  2  2 -1 -4 -2 -1  2  1 -1
T    -2  1 -2 -3 -4  3 -1  0 -1  3 -3 -4 -2 -1  0 -2  2
W    -4 -4 -4 -6 -5 -5 -5 -3 -4 -5 -6 -3 -2 -2 -5 -1 -6
Y    -3 -3 -3 -5  1 -3 -3  5  2 -1 -3  5  7  3  0  5 -3
V     4  2  1 -5 -5  5  4 -1 -2  1 -1  2  1 -2  0 -3 -4

PSI-Pred (Jones, DT)
• Use alignments from iterative sequence searches (PSI-Blast) as input to a neural network
• Better predictions due to better sequence profiles
• Available as stand-alone program and via the web

Benchmarking secondary structure predictions
• CASP
  – Critical Assessment of Structure Predictions
  – Sequences from about-to-be-solved structures are given to groups who submit their predictions before the structure is published
• EVA
  – Newly solved structures are sent to prediction servers.

EVA results (Rost et al., 2001)
• PROFphd 77.0%
• PSIPRED 76.8%
• SAM-T99sec 76.1%
• SSpro 76.0%
• Jpred2 75.5%
• PHD 71.7%
  – Cubic.columbia.edu/eva

Output expansion
[Figure: a window over the sequence IKEEHVIIQAEFYLNPDQSGEF… feeds the input layer, which is connected through weights to a hidden layer; the output layer carries H/E/C units for several positions of the window]

Several different architectures
• Sequence-to-structure
  – Window sizes 15, 17, 19 and 21
  – Hidden units 50 and 75
  – 10-fold cross validation => 80 predictions
• Structure-to-structure
  – Window size 17
  – Hidden units 40
  – 10-fold cross validation => 800 predictions

Balloting procedure
• Confidence of a per-residue prediction
  – P(highest) minus P(second highest)
  – H: 0.80 E: 0.05 C: 0.15 => conf. = 0.65
• Mean per-chain confidence for all 800 predictions
  – Calculate mean and standard deviation
  – Averaging of per-chain predictions with Z >= 2

Activities to probabilities
[Figure: conversion of network output activities (0.05, 0.10, 0.15, …, 1.0 for helix, strand and coil) into probabilities, shown for coil conversion]

Petersen et al., Proteins, 41: 17-20, 2000
• Sequence profiles as input
• Neural network technology
• Balloting of large number of neural network predictions (0.2%)
• Output expansion (0.5%)
• Probability transformation (1.2%)
• EVA (400 low homology proteins)

Ranking  Group name         Q3 Performance
1        SBI-AT             77.2 %
2        PROFsec B.Rost     76.3 %
3        Psi-pred D.Jones   76.2 %

Links to servers
• Database of links
  – http://mmtsb.scripps.edu/cgibin/renderrelres?protmodel
• ProfPHD
  – http://cubic.bioc.columbia.edu/
• PSIPRED
  – http://bioinf.cs.ucl.ac.uk/psipred/
• JPred
  – www.compbio.dundee.ac.uk/Software/JPred/jpred.html

Practical Conclusion
• If you need a secondary structure prediction, use one of the newer methods such as
  – ProfPHD,
  – PSIPRED, and
  – JPred
• And not one of the older ones such as
  – Chou-Fasman, and
  – Garnier