Transcript Slide 1
Algorithms Exploiting the Chain Structure of Proteins Itay Lotan Computer Science Proteins 101 Involved in all functions of our body: metabolism, motion, defense, etc. Michael Levitt Protein representation Torsion angle model: Resi Resi+1 Resi+2 O C N C N C’ O Cα model: C C Resi+3 O C C’ N C N C’ O C C C’ Structure determination X-ray crystallography Bernhard Rupp Outline 1. Fast energy computation during Monte Carlo simulation 2. Model completion for protein X-ray crystallography 3. Large scale computation of similarity Exploit specific properties of proteins to perform the computation efficiently Outline 1. Fast energy computation during Monte Carlo simulation 2. Model completion for protein X-ray crystallography 3. Large scale computation of similarity Lotan, Schwarzer, Halperin* and Latombe. J. Comput. Bio. 2004 (to appear) *CS Department, Tel-Aviv University Monte Carlo simulation (MCS) Popular method for sampling the conformation space of proteins: Estimate thermodynamic quantities Search for low-energy conformations and the folded structure MCS: How it works 1. Propose random change in conformation 2. Compute energy E of new conformation 3. Accept with probability: P(accept ) min 1, eE / kbT Requires >>106 steps to sample adequately Energy function Bonded terms: Bond lengths: Bond angles: Dihedral angles: Non-bonded terms: Van der Waals: Electrostatic: Heuristic: Go models, HP models, etc. Pair-wise interactions Cutoff distance (6 - 12Å) Linear number of interactions contribute to energy (Halperin & Overmars ’98) Challenge: Find all interacting pairs without enumerating all pairs Related work Computer Science Bounding volume hierarchies for collision detection Biology Neighbor lists Verlet ’67 Brooks et al. ’83 Gotschalk et al. ’96 Larsen et al. ’00 Guibas et al. ’02 Grid Faverjon ’84 Halperin & Overmars ’98 Neighbor lists + grid Halperin et al. ’97 Guibas et al. ’02 Space partition methods for collision detection Collisions detection for chains Quentrec & Brot ’73 Hockney et al. ’74 Van Gunsteren et al. ’84 Yip & Elber ’89 Petrella ’02 Grid method d Linear complexity Optimal in worst case d: Cutoff distance Contributions Efficient maintenance and self-collision detection for kinematic chains Efficient computation of pair-wise interactions in MCS of proteins Scheme for caching and reusing partial energy sums during MCS MCS software* Much faster than existing algorithm (grid method) *Download at: http://robotics.stanford.edu/~itayl/mcs Properties of kinematic chains Small changes large effects Properties of kinematic chains Small changes large effects Properties of kinematic chains Small changes large effects Local changes global effects Properties of kinematic chains Small changes large effects Local changes global effects Few DoF changes long rigid subchains Properties of kinematic chains Small changes large effects Local changes global effects Few DoF changes long rigid subchains ChainTree: A tale of two hierarchies Transform hierarchy: approximates kinematics of protein backbone at successive resolutions Bounding volume hierarchy: approximates geometry of protein at successive resolutions Hierarchy of transforms Hierarchy of transforms TAI TAE TEI TAC TAB A TCE TBC B C TCD TEG TDE D E TEF TGI TFG F G TGH THI H I Hierarchy of bounding volumes BAH BEH BAD BCD BAB BA BB BC BEF BD BE BGH BF BG BH The ChainTree TAI BAH TAE BAD TAC BAB TAB BA A TEI BEH TCE BCD TBC BB B TCD BC C TEG BEF TDE BD D TEF BE E TGI BGH TFG BF F TGH BG G THI BH H I Updating the ChainTree TAI BAH TAE BAD TAC BAB TAB BA A TEI BEH TCE BCD TBC BB B TCD BC C TDE BD D TEG BEF TEF BE E TGI BGH TFG BF F TGH BG G THI BH H I Computing the energy Recursively search ChainTree for interactions P N O J A Pruning rules: K B C L D E M F G H 1. Prune search when distance between bounding volumes is more than cutoff distance 2. Do not search inside rigid sub-chains Computing the energy P N J O K L M A B C D E F G H [ P] Computing the energy P N J O K L M A B C D E F G H [ P] [N] Computing the energy P N J O K L M A B C D E F G H [ P] [N] [O ] Computing the energy P N J O K L M A B C D E F G H [ P] [N] [N-O] [O ] Computing the energy P N J O K L M A B C D E F G H [ P] [N] [J] [J-K] [K] [A-C] [A-D] [B-C] [B-D] [C] [C-D] [D] [N-O] [O ] Computing the energy P N J O K L M A B C D E F G H [ P] [N] [J] [A] [A-B] [B] [J-K] [K] [A-C] [A-D] [B-C] [B-D] [N-O] [J-L] [C] [A-E] [C-D] [A-F] [D] [B-E] [B-F] [O] [J-M] [K-L] [K-M] [A-G] [A-H] [B-G] [B-H] [C-E] [C-F] [D-E] [D-F] [L] [L-M] [C-G] [E] [C-H] [E-F] [D-G] [F] [D-H] [E-G] [E-H] [F-G] [F-H] [M] [H] [H-G] [G] Computing the energy P N J O K L M A B C D E F G H E(O) [ P] [N] [J] [A] [A-B] [B] [J-K] [K] [A-C] [A-D] [B-C] [B-D] [N-O] [J-L] [C] [A-E] [C-D] [A-F] [D] [B-E] [B-F] [O] [J-M] [K-L] [K-M] [A-G] [A-H] [B-G] [B-H] [C-E] [C-F] [D-E] [D-F] [L] [L-M] [C-G] [E] [C-H] [E-F] [D-G] [F] [D-H] [E-G] [E-H] [F-G] [F-H] [M] [H] [H-G] [G] Computing the energy Only changed interactions are found Reuse unaffected partial sums Better performance for Longer proteins Fewer simultaneous changes Computational complexity Updating: O k log n k Searching: n 4 3 worst case bound Much faster in practice Test 1-DoF change 5-DoF change 140 280 ChainTree Grid 260 120 Time (in mSec.) 140 Time (in mSec.) ChainTree Grid 120 100 80 60 100 80 60 40 40 20 20 1CTF [68 res.] 1LE2 [144 res.] 1HTB [374 res.] 1JB0 [755 res.] 1CTF 1LE2 1HTB 1JB0 [68 res.] [144 res.] [374 res.] [755 res.] Simulation of α-Synuclein 140 res. protein implicated in Parkinson’s disease Multi-canonical Replica-exchange MC regime Over 1000 CPU days of simulation Study conformations at room temp. Joint work with Vijay Pande Outline 1. Fast energy computation during Monte Carlo simulation 2. Model completion for protein X-ray crystallography 3. Large scale computation of similarity Lotan, van den Bedem*, Deacon* and Latombe, WAFR 2004 van den Bedem*, Lotan, Latombe and Deacon*, submitted to Acta. Cryst. D * Joint Center for Structural Genomics (JCSG) at SSRL Protein Structure Initiative 152K sequenced genes (30K/year) 25K determined structures (3.6K/year) Reduce cost and time to determine protein structure Develop software to automatically interpret the electron density map (EDM) EDM 3-D “image” of atomic structure High value (electron density) at atom centers Density falls off exponentially away from center Automated model building ~90% built at high resolution (2Å) ~66% built at medium to low resolution (2.5 – 2.8Å) Gaps left at noisy areas in EDM (blurred density) Gaps need to be resolved manually The Fragment completion problem Input EDM Partially resolved structure 2 Anchor residues Length of missing fragment Output A small number of candidate structures for missing fragment A robotics inverse kinematics (IK) problem Related work Computer Science Exact IK solvers Manocha & Canny ’94 Manocha et al. ’95 Wang & Chen ’91 Optimization IK solvers Khatib ’87 Burdick ’89 Motion planning for closed loops Han & Amato ’00 Yakey et al. ’01 Cortes et al. ’02, ’04 Exact IK solvers Redundant manipulators Biology/Crystallography Optimization IK solvers Fiser et al. ’00 Kolodny et al. ’03 Database search loop closure Fine et al. ’86 Canutescu & Dunbrack Jr. ’03 Ab-initio loop closure Wedemeyer & Scheraga ’99 Coutsias et al. ’04 Jones & Thirup ’86 Van Vlijman & Karplus ’97 Semi-automatic tools Jones & Kjeldgaard ’97 Oldfield ’01 Contributions Sampling of gap-closing fragments biased by the EDM Refinement of fit to density without breaking closure Fully automatic fragment completion software for X-ray Crystallography Novel application of a combination of inverse kinematics techniques Two-stage IK method 1. Candidate generations: Optimize density fit while closing the gap 2. Refinement: Optimize closed fragments without breaking closure Stage 1: candidate generation Generate random conformation Close using Cyclic Coordinate Descent (CCD) (Wang & Chen ’91, Canutescu & Dunbrack Jr. ’03) Stage 1: candidate generation Generate random conformation Close using Cyclic Coordinate Descent (CCD) (Wang & Chen ’91, Canutescu & Dunbrack ’03) Stage 1: candidate generation Generate random conformation Close using Cyclic Coordinate Descent (CCD) (Wang & Chen ’91, Canutescu & Dunbrack ’03) Stage 1: candidate generation Generate random conformation Close using Cyclic Coordinate Descent (CCD) (Wang & Chen ’91, Canutescu & Dunbrack ’03) Stage 1: candidate generation Generate random conformation Close using Cyclic Coordinate Descent (CCD) (Wang & Chen ’91, Canutescu & Dunbrack ’03) CCD moves biased toward high-density Stage 2: refinement Target function T (goodness of fit to EDM) Minimize T while retaining closure Closed conformations lie on Self-motion manifold of lower dimension 1-D manifold Stage 2: null-space minimization Jacobian: linear relation between joint velocities q and end-effector linear and angular velocity x . x J q q (6 n matrix) null J q | J q 0 Compute minimizing move using: qJ † q x N N N – orthonormal basis of null space T T q q Stage 2: minimization with closure 1. 2. 3. 4. Choose sub-fragment with n > 6 DOFs Compute null( J ) using SVD Project T (q) q onto null( J ) Move until minimum is reached or closure is broken Escape from local minima using Monte Carlo with simulated annealing Test: artificial gaps Completed structure (gold standard) Good density (1.6Å res.) Remove fragment and rebuild Length High (2.0Å) Medium (2.5Å) Low (2.8Å) 4 100% (0.14Å) 100% (0.19Å) 100% (0.32Å) 8 100% (0.18Å) 100% (0.23Å) 100% (0.36Å) 12 91% (0.51Å) 96% (0.41Å) 91% (0.52Å) 15 91% (0.53Å) 88% (0.63Å) 83% (0.76Å) Produced by H. van den Bedem Test: true gaps Completed structure (gold standard) O.K. density (2.4Å res.) 6 gaps left by model builder (RESOLVE) Length Top scorer Lowest error 4 0.44Å 0.40Å 4 0.22Å 0.22Å 5 0.78Å 0.78Å 5 0.36Å 0.36Å 7 0.72Å 0.66Å 10 0.43Å 0.43Å Produced by H. van den Bedem Example: TM0423 PDB: 1KQ3, 376 res. 2.0Å resolution 12 residue gap Best: 0.3Å aaRMSD Example: TM0813 PDB: 1J5X, 342 res. 2.8Å resolution 12 residue gap Best: 0.6Å aaRMSD GLU-83 GLY-96 Example: TM0813 PDB: 1J5X, 342 res. 2.8Å resolution 12 residue gap Best: 0.6Å aaRMSD GLU-83 GLY-96 Example: TM0813 PDB: 1J5X, 342 res. 2.8Å resolution 12 residue gap Best 0.6Å aaRMSD GLU-83 GLY-96 Outline 1. Fast energy computation during Monte Carlo simulation 2. Model completion for protein X-ray crystallography 3. Large scale computation of similarity Lotan and Schwarzer, J. Comput. Biol. 11(2–3): 299–317, 2004 Large scale similarity Analysis of simulation trajectories Molecular dynamics simulation Monte Carlo simulation Clustering of decoy sets (e.g., Shortle et al. ’98) Stochastic Roadmap Simulation (Apaydin et al. ’03) Fast similarity measures are needed for analyzing large sets of conformations Contributions Uniform simplification of protein structure for similarity computation Speed-up existing similarity measures Method offers trade-off between speed and precision Efficient computation of nearest neighbors m-Averaged approximation Cut chain into pieces of length m Replace each sequence of m Cα atoms by its centroid 3n coordinates 3n/m coordinates Chains and distances Proximity along the chain entails spatial proximity d 3l Far away links along the chain are spatially distant (on average) ci cj Similarity measures cRMS ( P, Q) minT dRMS ( P, Q ) n 1 pi Tqi n i 1 2 2 n i 2 P Q d ij d ij n( n 1) i 2 j 1 2 Evaluation: test sets 8 structurally diverse proteins (54 -76 residues) 1. Decoy sets: conformations from the Park-Levitt set (Park et al, ’97), N =10,000 2. Random sets: conformations generated by the program FOLDTRAJ (Feldman & Hogue, ’00), N = 5000 Evaluation results: decoy sets m cRMS dRMS 3 4 0.99 0.98-0.99 0.96-0.98 0.94-0.97 6 9 12 0.92-0.99 0.81-0.98 0.54-0.92 0.78-0.93 0.65-0.96 0.52-0.69 9x for cRMS (m = 9) 36x for dRMS (m = 6) Higher correlation for random sets! k Nearest-neighbors problem Given a set S of conformations of a protein and a query conformation c, find the k conformations in S most similar to c Brute force complexity: N log k L • N log k L • 2 N – size of S for all c S L – time to compute similarity Efficient nearest neighbor search kd-tree: O kd log N time per query 1 Limitations: d r r 1. Requires Minkowski metric: dist P, Q pi qi i 1 2. Less efficient when d>20 cRMS is not a Minkowski metric n(n 1) dRMS has dimensionality of 2 Reduce dRMS dimensionality using SVD Reduction using SVD 1. Stack m-averaged distance matrices as vectors 2. Compute the SVD of entire set 3. Project onto principle components dRMS is reduced to 20 dimensions Complexity of SVD ~ n 4 Testing the method Use decoy sets (N = 10,000) and random sets (N = 5,000) m-averaging with (m = 4) Project onto 16 PCs for decoys, 12 PCs for random sets Find k = 10, 25, 100 NNs for 250 conformations in each set Results Decoy sets: ~77% correct Furthest NN off by 10% - 15% (0.7Å – 1.5Å) ~4k approximate NNs contain all true k NNs Random sets: slightly better results Use reduction as fast filter Running Time N = 100,000, m=4, PC = 16 Find k = 100 for each conformation Brute-force: ~84 hours Brute-force + m-averaging: ~4.8 hours Brute-force + m-averaging + SVD: 41 minutes kd-tree + m-averaging + SVD: 19 minutes kd-tree has more impact for larger sets Contributions Energy computation in MCS Efficient maintenance and self-collision detection for kinematic chains Efficient computation of pair-wise interactions in MCS of proteins Caching scheme for partial energy sums during MCS MCS software sampling of gap-closing fragments biased towards the EDM Refinement of fit to density without breaking closure Fully automatic fragment completion software Uniform simplification of protein structure for similarity computation Speed-up existing similarity measures Method offers trade-off between speed and precision Model completion in X-ray crystallography Similarity computation for large conformation sets Efficient computation of nearest neighbors Take-home message Taking into account physical properties of proteins can lead to efficient algorithms for a wide variety of applications in structural biology Outlook biophysicist/biochemist Models that simplify the physics and chemistry of proteins computer scientist Algorithms that exploit properties of protein models Develop simplified protein models that lend themselves to efficient computations Acknowledgements Jean-Claude Latombe Vijay Pande Michael Levitt Leo Guibas Axel Brunger, Balaji Prabhakar, Serafim Batzoglou Fabian Schwarzer, Henry van den Bedem, Dan Halperin Carlo Tomasi Daniel Russakoff, Rachel Kolodny Latombe group Serkan Apaydin, Tim Bretl, Joel Brown, Phil Fong, Mitul Saha, Pekka Isto, Kris Hauser Pande group Bojan Zagrovic, Stefan Larson, Lillian Chong, Young Min Rhee, Sidney Elmer, Chris Snow, Guha Jayachandran, Eric Sorin, Sung-Joo Lee, Jim Cladwell, Michael Shirts, Nina Singhal, Relly Brandman, Vishal Vaidyanathan, Nick Kelley, Mark Engelhardt Levitt Group Patrice Koehl, Tanya Raschke, Erik Lindahl Thank you!