Protein Structure Prediction Ram Samudrala University of Washington

Rationale for understanding protein structure and function • Protein sequence: large numbers of sequences, including whole genomes • Sequence to structure: structure determination, structure prediction • Protein structure: three-dimensional, complicated, mediates function (?) • Structure to function: homology, rational mutagenesis, biochemical analysis, model studies • Protein function: rational drug design and treatment of disease; protein and genetic engineering; build networks to model cellular pathways; study organismal function and evolution

Protein folding • DNA: …-CUA-AAA-GAA-GGU-GUU-AGC-AAG-GAU-… • Protein sequence: …-L-K-E-G-V-S-K-D-… (each codon specifies one amino acid) • Unfolded protein: not unique, mobile, inactive, expanded, irregular • Spontaneous self-organisation (~1 second) • Native state: unique shape, precisely ordered, stable/functional, globular/compact, helices and sheets

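Protein folding landscape • Large multi-dimensional space of changing conformations [Figure: free energy (ΔG) along the folding reaction, running from unfolded through molten globule and a barrier (ΔG*, barrier height) to native, with time labels of 10^-3 s and 10^-8 s]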

Protein primary structure • Twenty types of amino acids • Two amino acids join by forming a peptide bond • Each residue in the amino acid main chain has two degrees of freedom (φ and ψ) • The amino acid side chains can have up to four degrees of freedom (χ1-4) [Figure: chemical structures of the peptide bond and of a polypeptide chain with the φ, ψ, and χ angles marked]

Protein secondary structure • Many φ,ψ combinations are not possible [Figure: plot of φ against ψ, both from -180 to +180, with regions labelled β, L, and α; structures shown: α helix, β sheet (parallel), β sheet (anti-parallel)]

Protein tertiary and quaternary structures Ribonuclease inhibitor (2bnh) Haemoglobin (1hbh) Hemagglutinin (1hgd)

Methods for determining protein structure • Protein sequence: large numbers of sequences, including whole genomes • Sequence to structure: X-ray crystallography, NMR spectroscopy • Protein structure: three-dimensional, complicated, mediates function (?)

X-ray crystallography - concept • X-rays interact with electrons in protein molecules arranged in a crystal to produce diffraction patterns • The diffraction patterns of the X-rays can be used to determine the three-dimensional structure of proteins • Provides a “static” picture • From <http://info.bio.cmu.edu/courses/03231/LecF01/Lec25/lec25.html>

X-ray crystallography - details • Prepare protein crystals where the proteins are organised in a precise crystal lattice • Shine X-rays on the crystals, where they diffract off the electrons of the atoms; the intensities of the individual reflections are measured • Phases are usually obtained indirectly by isomorphous replacement, from the way one or a few heavy atoms incorporated into the same isomorphous crystal lattice affect the diffraction pattern • Intensities and phases of all reflections are combined in a Fourier transform to provide maps of electron density • Interpret the map by fitting the polypeptide chain to the contours • Refine the model by minimising the difference between the observed amplitudes and the calculated amplitudes

NMR spectroscopy - concept • The magnetic-spin properties of atomic nuclei within a molecule are used to obtain a list of distance constraints between atoms in the molecule, from which a three-dimensional structure of the protein molecule can be obtained • Provides a “dynamic” picture NK-lysin (1nkl) S1 RNA binding domain (1sro)

NMR spectroscopy - details • Protein molecules placed in a strong magnetic field have the nuclear spins of their hydrogen atoms aligned to the field; the alignment can be excited by applying radio frequency (RF) pulses • Possible to obtain a unique signal (chemical shift) for each hydrogen atom in a protein molecule • Structural information arises primarily from the Nuclear Overhauser Effect (NOE), which gives information about distances between atoms in a molecule • A pair of protons gives a detectable NOE cross-peak if they are within 5.0 Å of each other in space • After obtaining NOE data for protons throughout the structure, a number of independent structures can be generated that are consistent with the distance constraints

Computer representation of protein structure • Structures are stored in the Protein Data Bank (PDB), a repository of mostly experimental models based on X-ray crystallographic and NMR studies • <http://www.rcsb.org> • Atoms are defined by their Cartesian coordinates:

ATOM 1 N GLU 1 18.222 18.496 -16.203 1.00 21.95

ATOM 2 CA GLU 1 17.706 17.982 -14.905 1.00 16.74

ATOM 3 C GLU 1 17.368 16.466 -15.121 1.00 15.45

ATOM 4 O GLU 1 16.780 16.073 -16.175 1.00 18.81

ATOM 5 CB GLU 1 16.552 18.744 -14.351 1.00 17.35

ATOM 6 CG GLU 1 16.952 20.118 -13.803 1.00 24.48

ATOM 7 CD GLU 1 15.881 21.145 -13.597 1.00 31.51

ATOM 8 OE1 GLU 1 16.012 22.316 -13.292 1.00 29.12

ATOM 9 OE2 GLU 1 14.701 20.768 -13.799 1.00 35.19

ATOM 10 N PHE 2 17.762 15.746 -14.052 1.00 15.83

ATOM 11 CA PHE 2 17.509 14.262 -14.184 1.00 13.24

• These structures provide the basis for most of the theoretical work in protein folding and protein structure prediction
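The ATOM records above can be read into a simple data structure. The following is a minimal sketch, not a full PDB parser: it assumes the abbreviated, whitespace-separated layout shown on this slide (ten fields, no chain identifier), and it assumes the last two columns are the occupancy and temperature factor, as in standard PDB files. Real PDB files use fixed column positions, so a production parser should slice by column instead. The names `Atom` and `parse_atom_line` are hypothetical.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    residue: str
    residue_number: int
    x: float
    y: float
    z: float
    occupancy: float  # assumed meaning of the ninth field
    b_factor: float   # assumed meaning of the tenth field


def parse_atom_line(line: str) -> Atom:
    """Parse one abbreviated ATOM record by splitting on whitespace."""
    fields = line.split()
    if len(fields) != 10 or fields[0] != "ATOM":
        raise ValueError(f"not an abbreviated ATOM record: {line!r}")
    return Atom(
        serial=int(fields[1]),
        name=fields[2],
        residue=fields[3],
        residue_number=int(fields[4]),
        x=float(fields[5]),
        y=float(fields[6]),
        z=float(fields[7]),
        occupancy=float(fields[8]),
        b_factor=float(fields[9]),
    )


# Second record from the slide: the C-alpha atom of residue GLU 1.
ca = parse_atom_line("ATOM 2 CA GLU 1 17.706 17.982 -14.905 1.00 16.74")
print(ca.name, ca.residue, ca.x, ca.y, ca.z)
```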

Comparison of protein structures • Need ways to determine if two protein structures are related and to compare predicted models to experimental structures • Commonly used measure is the root mean square deviation (RMSD) of the Cartesian coordinates of the atoms in two structures after optimal superposition (McLachlan, 1979):

$$\mathrm{RMSD} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(dx_i^{2} + dy_i^{2} + dz_i^{2}\right)}$$

where N is the number of atom pairs and dx_i, dy_i, dz_i are the coordinate differences for pair i

• Usually use Cα atoms • Other measures include contact maps and torsion angle RMSDs [Figure: NK-lysin (1nkl), bacteriocin T102/as48 (1e68), and the T102 best model, labelled 3.6 Å and 2.9 Å]
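The RMSD formula on this slide can be sketched in a few lines. This example assumes the two coordinate lists are already optimally superposed and list equivalent atoms in the same order; it does not perform the superposition step itself (McLachlan, 1979). The function name `rmsd` and the coordinates in the usage lines are illustrative only.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) tuples.

    Assumes the structures are already superposed and that atom i in
    coords_a is equivalent to atom i in coords_b.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        dx, dy, dz = xa - xb, ya - yb, za - zb
        total += dx * dx + dy * dy + dz * dz
    return math.sqrt(total / len(coords_a))


# Two made-up atoms; the second pair differs by 2 Å along z.
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
print(rmsd(model, reference))
```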

Methods for predicting protein structure • Protein sequence: large numbers of sequences, including whole genomes • Sequence to structure: comparative modelling, fold recognition, ab initio prediction • Protein structure: three-dimensional, complicated, mediates function (?)

Comparative modelling of protein structure • Proteins that have similar sequences (i.e., related by evolution) have similar three-dimensional structures • A model of a protein whose structure is not known can be constructed if the structure of a related protein has been determined by experimental methods • Similarity must be obvious and significant for good models to be built • Need ways to build regions that are not similar between the two related proteins • Need ways to move model closer to the native structure

Comparative modelling of protein structure • Scan • Align:

… KDHPFGFAVPTKNPDGTMNLMNWECAIP …
… KDPPAGIGAPQDN----QNIMLWNAVIP …
  ** * *   *  *     * * *   **

• Build initial model • Construct non-conserved side chains and main chains • Refine
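Sequence identity is the usual way to quantify how similar two aligned sequences are. The sketch below counts identical columns in the alignment fragment from this slide; the function name `identity_counts` is hypothetical, and the choice to divide by the number of ungapped columns is one convention among several (dividing by the full alignment length is also common).

```python
def identity_counts(seq_a, seq_b, gap="-"):
    """Return (identical, aligned) column counts for two aligned sequences.

    Columns where either sequence has a gap are not counted as aligned.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    aligned = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    identical = sum(1 for a, b in aligned if a == b)
    return identical, len(aligned)


target = "KDHPFGFAVPTKNPDGTMNLMNWECAIP"
template = "KDPPAGIGAPQDN----QNIMLWNAVIP"
identical, aligned = identity_counts(target, template)
print(f"{identical} identical of {aligned} aligned columns "
      f"({100.0 * identical / aligned:.1f}% identity)")
```

The identical columns found this way correspond to the asterisks under the alignment.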

Fold recognition • The number of possible protein structures/folds is limited (large number of sequences but few folds) • Proteins that do not have similar sequences sometimes have similar three-dimensional structures [Figure: NK-lysin (1nkl) and bacteriocin T102/as48 (1e68), labelled 3.6 Å and 5% ID] • A sequence whose structure is not known is fitted directly (or “threaded”) onto a known structure and the “goodness of fit” is evaluated using a discriminatory function • Need ways to move model closer to the native structure

Fold recognition • Evaluate fit:

… KDHPFGFAVPTKNPDGTMNLMNWECAIP …
… KDPPAGIGAPQDN----QNIMLWNAVIP …
  ** * *   *  *     * * *   **

• Build initial model • Construct non-conserved side chains and main chains • Refine

Ab initio prediction of protein structure – concept • Go from sequence to structure by sampling the conformational space in a reasonable manner and selecting a native-like conformation using a good discrimination function • Problems: conformational space is astronomically large, and it is hard to design functions that are not fooled by non-native conformations (or “decoys”)

Ab initio prediction of protein structure • Sample: conformational space such that native-like conformations are found; astronomically large number of conformations (5 states per residue for 100 residues = 5^100 ≈ 10^70) • Select: hard to design functions that are not fooled by non-native conformations (“decoys”)
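The size estimate on this slide is easy to check with exact integer arithmetic. This short sketch only reproduces the arithmetic; the variable names are illustrative.

```python
import math

states_per_residue = 5
residues = 100

# Python integers are arbitrary precision, so 5**100 is computed exactly.
conformations = states_per_residue ** residues
exponent = residues * math.log10(states_per_residue)

print(f"5^100 has {len(str(conformations))} digits")
print(f"5^100 = 10^{exponent:.1f}")
```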

Sampling conformational space – continuous approaches • Most work in the field - Molecular dynamics - Continuous energy minimisation (follow a valley) - Monte Carlo simulation - Genetic Algorithms • Like real polypeptide folding process • Cannot be sure if native-like conformations are sampled

Molecular dynamics • Force = -dU/dx (slope of potential U); acceleration follows from m a(t) = force • All atoms are moving, so forces between atoms are complicated functions of time • Analytical solution for x(t) and v(t) is impossible; numerical solution is trivial • Atoms move for very short time steps of 10^-15 seconds, or 0.001 picoseconds (ps) • New position from old position, old velocity, and acceleration:

$$x(t+\Delta t) = x(t) + v(t)\,\Delta t + \frac{[4a(t) - a(t-\Delta t)]\,\Delta t^{2}}{6}$$

• New velocity from old velocity and acceleration:

$$v(t+\Delta t) = v(t) + \frac{[2a(t+\Delta t) + 5a(t) - a(t-\Delta t)]\,\Delta t}{6}$$

• Kinetic energy, where n is the number of coordinates (not atoms):

$$U_{\mathrm{kinetic}} = \tfrac{1}{2}\sum_i m_i v_i(t)^{2} = \tfrac{1}{2}\, n\, k_B T$$

• Total energy (U_potential + U_kinetic) must not change with time
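The position and velocity updates on this slide have the form commonly known as Beeman's algorithm. The sketch below applies them to a one-dimensional harmonic oscillator, U = ½kx², rather than to a protein; the spring constant, mass, time step, and the function name `beeman_harmonic` are all assumptions chosen for illustration. Because there is no earlier step at t = 0, the sketch also assumes a(t - Δt) = a(t) for the first update.

```python
def beeman_harmonic(x0, v0, k=1.0, m=1.0, dt=0.01, steps=1000):
    """Integrate a 1-D harmonic oscillator with the slide's update rules.

    Returns the final position, final velocity, and the total energy
    recorded after every step.
    """
    def accel(x):
        # Force = -dU/dx = -k x, so a = -k x / m.
        return -k * x / m

    x, v = x0, v0
    a = accel(x)
    a_prev = a  # assumption: no history at t = 0, so reuse a(t)
    energies = []
    for _ in range(steps):
        x_new = x + v * dt + (4.0 * a - a_prev) * dt * dt / 6.0
        a_new = accel(x_new)
        v_new = v + (2.0 * a_new + 5.0 * a - a_prev) * dt / 6.0
        x, v, a_prev, a = x_new, v_new, a, a_new
        # Total energy = kinetic + potential; it should stay nearly constant.
        energies.append(0.5 * m * v * v + 0.5 * k * x * x)
    return x, v, energies


x, v, energies = beeman_harmonic(x0=1.0, v0=0.0)
print(f"x(t=10) = {x:.4f}")
print(f"energy range: {min(energies):.6f} to {max(energies):.6f}")
```

Monitoring the spread of the recorded energies is a simple way to check the last bullet of the slide: a large drift usually means the time step is too long.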

Energy minimisation • For a given protein, the energy depends on thousands of x,y,z Cartesian atomic coordinates; reaching a deep minimum is not trivial • With convergence, we have an accurate equilibrium conformation and a well-defined energy value [Figures: energy against number of steps, from the starting conformation to a deep minimum; RMSD against number of steps, with curves labelled steepest descent (“give up”) and conjugate gradient (“converge”)]
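As a toy illustration of minimisation with a convergence test and a "give up" limit, the sketch below runs fixed-step steepest descent on a two-variable quadratic energy. The energy function, step size, tolerance, and step limit are all assumptions; a real protein energy has thousands of coordinates and many local minima, and this example says nothing about how the methods named on the slide compare.

```python
import math


def energy(x, y):
    # Hypothetical energy surface with a single minimum at (1, -2).
    return (x - 1.0) ** 2 + 10.0 * (y + 2.0) ** 2


def gradient(x, y):
    return 2.0 * (x - 1.0), 20.0 * (y + 2.0)


def steepest_descent(x, y, step=0.04, tolerance=1e-6, max_steps=1000):
    """Follow the negative gradient until it is nearly zero or we give up.

    Returns (x, y, steps_taken, converged).
    """
    for n in range(max_steps):
        gx, gy = gradient(x, y)
        if math.hypot(gx, gy) < tolerance:
            return x, y, n, True
        x -= step * gx
        y -= step * gy
    return x, y, max_steps, False


x, y, n, converged = steepest_descent(0.0, 0.0)
print(f"converged={converged} after {n} steps at ({x:.5f}, {y:.5f}), "
      f"energy {energy(x, y):.2e}")
```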

Monte Carlo simulation • Discrete moves in torsion or Cartesian conformational space • Evaluate energy after every move and compare to previous energy (ΔE) • Accept conformation based on Boltzmann probability: $P = \exp(-\Delta E / kT)$ • Many variations, including simulated annealing (starting with a high temperature so more moves are accepted initially and then cooling) • If run for infinite time, simulation will produce a Boltzmann distribution
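The acceptance rule and the simulated annealing variation can be sketched on a one-dimensional toy energy. This example assumes the usual Metropolis convention that downhill moves (ΔE ≤ 0) are always accepted, and it assumes a geometric cooling schedule; the energy function, temperatures, step size, and function names are illustrative and not taken from the slides.

```python
import math
import random


def acceptance_probability(delta_e, kt):
    """Metropolis rule: always accept downhill, else exp(-dE/kT)."""
    if delta_e <= 0.0:
        return 1.0
    return math.exp(-delta_e / kt)


def simulated_annealing(energy, x0, steps=5000, kt_start=5.0, kt_end=0.01,
                        step_size=0.5, seed=0):
    """Random-walk Monte Carlo with geometric cooling from kt_start to kt_end.

    Returns the lowest-energy point visited and its energy.
    """
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    for i in range(steps):
        kt = kt_start * (kt_end / kt_start) ** (i / (steps - 1))
        trial = x + rng.uniform(-step_size, step_size)
        trial_e = energy(trial)
        if rng.random() < acceptance_probability(trial_e - e, kt):
            x, e = trial, trial_e
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e


# Toy energy with its minimum at x = 2.
best_x, best_e = simulated_annealing(lambda x: (x - 2.0) ** 2, x0=-5.0)
print(f"best x = {best_x:.3f}, energy = {best_e:.4f}")
```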

Genetic Algorithms • Generate an initial pool of conformations • Perform crossover and mutation operations on this set to generate a much larger pool of conformations • Select a subset of the fittest conformations from this large pool • Repeat above two steps until convergence
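The four steps on this slide can be sketched as a toy genetic algorithm. Here a "conformation" is just a list of angles and fitness is the negative squared distance to a made-up target, both of which are assumptions for illustration. The sketch also assumes that parents compete with their offspring during selection and that a fixed number of rounds stands in for a convergence test; angle wrap-around at ±180 is ignored.

```python
import random


def fitness(conformation, target):
    # Higher is better; zero means the conformation matches the target.
    return -sum((a - t) ** 2 for a, t in zip(conformation, target))


def crossover(parent_a, parent_b, rng):
    # Single-point crossover: head of one parent, tail of the other.
    point = rng.randrange(1, len(parent_a))
    return parent_a[:point] + parent_b[point:]


def mutate(conformation, rng, rate=0.2, scale=20.0):
    # Perturb each angle with probability `rate`.
    return [a + rng.gauss(0.0, scale) if rng.random() < rate else a
            for a in conformation]


def evolve(target, pool_size=20, offspring_per_round=100, rounds=50, seed=0):
    """Return (best initial fitness, fittest conformation after all rounds)."""
    rng = random.Random(seed)
    pool = [[rng.uniform(-180.0, 180.0) for _ in target]
            for _ in range(pool_size)]
    initial_best = max(fitness(c, target) for c in pool)
    for _ in range(rounds):
        offspring = []
        for _ in range(offspring_per_round):
            parent_a, parent_b = rng.sample(pool, 2)
            offspring.append(mutate(crossover(parent_a, parent_b, rng), rng))
        # Select the fittest subset from the enlarged pool (parents included).
        candidates = pool + offspring
        candidates.sort(key=lambda c: fitness(c, target), reverse=True)
        pool = candidates[:pool_size]
    return initial_best, pool[0]


target = [-60.0, -45.0, -120.0, 130.0]  # hypothetical target angles
initial_best, best = evolve(target)
print(f"best fitness: {initial_best:.1f} initially, "
      f"{fitness(best, target):.1f} after evolution")
```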

Sampling conformational space – exhaustive approaches • Enumerate: all possible conformations; view entire space (perfect partition function) • Select • Computationally intractable: 5 states per residue for 100 residues = 5^100 ≈ 10^70 possible conformations • Must use discrete state models to minimise the number of conformations explored

Scoring/energy functions • Need a way to select native-like conformations from non-native ones • Physics-based functions: electrostatics, van der Waals, solvation, bond/angle terms • Knowledge-based scoring functions: derive information about atomic properties from a database of experimentally determined conformations; common parameters include pairwise atomic distances and amino acid burial/exposure

Requirements for sampling methods and scoring functions • Sampling methods must produce good decoy sets that are comprehensive and include several native-like structures • Scoring function scores must correlate well with RMSD of conformations (the better the score/energy, the lower the RMSD)
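One way to check the second requirement is to compute the correlation between score and RMSD over a decoy set. The sketch below uses the Pearson correlation coefficient, which is an assumed choice (rank correlations are also used), and the decoy values are made up purely to exercise the function. With an energy-like score, where lower is better, a coefficient near +1 is the desirable outcome.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical decoy set: RMSD to native (Å) and an energy-like score.
decoy_rmsds = [1.5, 3.0, 4.5, 6.0, 8.0]
decoy_scores = [-120.0, -110.0, -95.0, -90.0, -70.0]
print(f"score/RMSD correlation: {pearson(decoy_rmsds, decoy_scores):.2f}")
```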

Overview of CASP experiment • Three categories: comparative/homology modelling, fold recognition/threading, and ab initio prediction • Goal is to assess structure prediction methods in a blind and rigorous manner; blind prediction is necessary for accurate assessment of methods • Ask modellers to build models of structures as they are in the process of being solved experimentally • After prediction season is over, compare predicted models to the experimental structures • Discuss what went right, what went wrong, and why • Compare progress from CASP1 to CASP4 • Results published in special issues of Proteins: Structure, Function, Genetics (1995, 1997, 1999, 2002)

Comparative modelling at CASP - methods • Alignment: PSI-BLAST, FASTA, CLUSTALW; multiple sequence alignments carefully hand-edited using secondary structure information • More successful side chain prediction methods include: backbone-dependent rotamer libraries (Bower & Dunbrack); segment matching followed by energy minimisation (Levitt); self-consistent mean field optimisation (Bates et al); graph-theory + knowledge-based functions (Samudrala et al) • More successful loop building methods include: satisfaction of spatial restraints (Sali); internal coordinate mechanics energy optimisation (Abagyan et al); graph-theory + knowledge-based functions (Samudrala et al) • Overall model building: there is no substitute for careful hand-constructed models (Sternberg et al, Venclovas)

A graph theoretic representation of protein structure • Represent residues as nodes • Weigh nodes • Construct graph • Find cliques [Figure: example graph with node weights -0.5 (I), -1.0 (F), -0.6 (V1), -0.9 (V2), and -0.7 (K), edge weights between -0.1 and -0.3, and a clique of total weight W = -4.5]

Historical perspective on comparative modelling

         alignment   side chain   short loops   longer loops
BC       excellent   ~80%         1.0 Å         2.0 Å
CASP1    poor        ~50%         ~3.0 Å        >5.0 Å

Prediction for CASP4 target T128/sodm: Cα RMSD of 1.0 Å for 198 residues (PID 50%)

Prediction for CASP4 target T111/eno: Cα RMSD of 1.7 Å for 430 residues (PID 51%)

Prediction for CASP4 target T122/trpa: Cα RMSD of 2.9 Å for 241 residues (PID 33%)

Prediction for CASP4 target T125/sp18: Cα RMSD of 4.4 Å for 137 residues (PID 24%)

Prediction for CASP4 target T112/dhso: Cα RMSD of 4.9 Å for 348 residues (PID 24%)

Prediction for CASP4 target T92/yeco: Cα RMSD of 5.6 Å for 104 residues (PID 12%)

Comparative modelling at CASP - conclusions

         alignment   side chain   short loops   longer loops
BC       excellent   ~80%         1.0 Å         2.0 Å
CASP1    poor        ~50%         ~3.0 Å        >5.0 Å
CASP2    fair        ~75%         ~1.0 Å        ~3.0 Å
CASP3    fair        ~75%         ~1.0 Å        ~2.5 Å
CASP4    fair        ~75%         ~1.0 Å        ~2.0 Å

CASP4: overall model accuracy ranging from 1 Å to 6 Å for 50-10% sequence identity

**T128/sodm – 1.0 Å (198 residues; 50%)
**T111/eno – 1.7 Å (430 residues; 51%)
**T122/trpa – 2.9 Å (241 residues; 33%)
**T125/sp18 – 4.4 Å (137 residues; 24%)
**T112/dhso – 4.9 Å (348 residues; 24%)
**T92/yeco – 5.6 Å (104 residues; 12%)

Fold recognition at CASP - methods • Visual inspection with sequence comparison (Murzin group) • Procyon - potential of mean force based on pairwise interactions and global dynamic programming (Sippl group) • Threader - potential of mean force and double dynamic programming (Jones group) • Environmental 3D Profiles (Eisenberg group) • NCBI Threading Program using contact potentials and models of sequence-structure conservation (Bryant group) • Hidden Markov Models (Karplus group) • Combination of threading with ab initio approaches (Friesner group) • Environment-specific substitution tables and structure-dependent gap penalties (Blundell group)

Fold recognition at CASP - conclusions • Fold recognition has been one of the more successful approaches to predicting structure at all four CASPs • At CASP2 and CASP4, one of the best methods was simple sequence searching with careful manual inspection (Murzin group) • At CASP3 and CASP4, none of the threading targets could have been recognised by the best standard sequence comparison methods such as PSI-BLAST • For the most difficult targets, the methods were able to predict 60 residues to 6.0 Å Cα RMSD, approaching comparative modelling accuracies as the similarity between proteins increased

Ab initio prediction at CASP – methods • Assembly of fragments with simulated annealing (Simons et al) • Exhaustive sampling and pruning using knowledge-based scoring functions (Samudrala et al) • Constraint-based Monte Carlo optimisation (Skolnick et al) • Thermodynamic model for secondary structure prediction with manual docking of secondary structure elements and minimisation (Lomize et al) • Minimisation of a physical potential energy function with a simplified representation (Scheraga et al, Osguthorpe et al) • Neural networks to predict secondary structure (Jones, Rost)

Semi-exhaustive segment-based folding • Sequence: EFDVILKAAGANKVAVIKAVRGATGLGLKEAKDLVESAPAALKEGVSKDDAEALKKALEEAGAEVEVK • Generate: fragments from database; 14-state φ,ψ model • Minimise: Monte Carlo with simulated annealing; conformational space annealing; GA • Filter: all-atom pairwise interactions; bad contacts; compactness; secondary structure

Historical perspective on ab initio prediction • Before CASP (BC): “solved” (biased results) • CASP1: worse than random • CASP2: worse than random with one exception • CASP3: consistently predicted correct topology, ~6.0 Å for 60+ residues

*T56/dnab – 6.8 Å (60 residues; 67-126)
**T61/hdea – 7.4 Å (66 residues; 9-74)
**T64/sinr – 4.8 Å (68 residues; 1-68)
*T74/eps15 – 7.0 Å (60 residues; 154-213)
**T59/smd3 – 6.8 Å (46 residues; 30-75)
**T75/ets1 – 7.7 Å (77 residues; 55-131)

• CASP4: ?

Prediction for CASP4 target T110/rbfa: Cα RMSD of 4.0 Å for 80 residues (1-80)

Prediction for CASP4 target T97/er29: Cα RMSD of 6.2 Å for 80 residues (18-97)

Prediction for CASP4 target T106/sfrp3: Cα RMSD of 6.2 Å for 70 residues (6-75)

Prediction for CASP4 target T98/sp0a: Cα RMSD of 6.0 Å for 60 residues (37-105)

Prediction for CASP4 target T126/omp: Cα RMSD of 6.5 Å for 60 residues (87-146)

Prediction for CASP4 target T114/afp1: Cα RMSD of 6.5 Å for 45 residues (36-80)

Postdiction for CASP4 target T102/as48: Cα RMSD of 5.3 Å for 70 residues (1-70)

Ab initio prediction at CASP - conclusions • Before CASP (BC): “solved” (biased results) • CASP1: worse than random • CASP2: worse than random with one exception • CASP3: consistently predicted correct topology, ~6.0 Å for 60+ residues • CASP4: consistently predicted correct topology, ~4.0-6.0 Å for 60-80+ residues

**T97/er29 – 6.0 Å (80 residues; 18-97)
*T98/sp0a – 6.0 Å (60 residues; 37-105)
**T102/as48 – 5.3 Å (70 residues; 1-70)
**T106/sfrp3 – 6.2 Å (70 residues; 6-75)
**T110/rbfa – 4.0 Å (80 residues; 1-80)
*T114/afp1 – 6.5 Å (45 residues; 36-80)

Computational aspects of structural genomics • A. sequence space • B. comparative modelling • C. fold recognition • D. ab initio prediction • E. target selection • F. analysis [Figure: six panels, A to F, with asterisks marking points in each panel and the selected targets] (Figure idea by Steve Brenner.)

Key points • DNA/gene is the blueprint - proteins are the functional representatives of genes • Protein structure can be used to understand protein function • Large numbers of genes being sequenced - need structures • Protein folding (from primary sequence to tertiary structure) is a fast self-organising process where a disordered non-functional chain of amino acids becomes a stable, compact, and functional molecule • The free energy difference between the folded and unfolded states is not very high • Experimental methods to determine protein structures include x-ray crystallography and NMR spectroscopy • Theoretical methods to predict protein structures include comparative/homology modelling, fold recognition/threading, and ab initio prediction • For ab initio prediction, you need a method that samples the conformational space adequately (to find native-like conformations) and a function that can identify them • CASP experiment shows limited progress in protein structure prediction

Acknowledgements • Michael Levitt, Stanford University • John Moult, CARB • Patrice Koehl, Stanford University • Yu Xia, Stanford University • Levitt and Moult groups • <http://compbio.washington.edu>