Robotics Algorithms for the Study of Protein Structure and

Download Report

Transcript Robotics Algorithms for the Study of Protein Structure and

Robotics Algorithms
for the Study of
Protein Structure and Motion
Jean-Claude Latombe
Computer Science Department
Stanford University
Based on Itay Lotan’s PhD
Many pathways
Unfolded (denatured) state
Folded (native) state
Folded State
Loops connect  helices and  strands
Protein Sequence Structure
amino-acid
(residue)
peptide
bonds
f-y Kinematic Linkage Model
 Conformational space
Molecule  Robot
Why Studying Proteins?
 They perform many vital functions, e.g.:
•
•
•
•
catalysis of reactions
storage of energy
transmission of signals
building blocks of muscles
 They are linked to key biological problems that
raise major computational challenges
mostly due to their large sizes (100s to several
1000s of atoms), many degrees of kinematic
freedom, and their huge number (millions)
Two problems
 Structure determination from electron
density maps
• Inverse kinematics techniques
[Itay Lotan, Henry van den Bedem, Ashley Deacon
(Joint Center for Structural Genomics)]
 Energy maintenance during Monte Carlo
simulation
• Distance computation techniques
[Itay Lotan, Fabian Schwarzer, and Danny Halperin
(Tel Aviv University)]
Structure Determination:
X-Ray Crystallography
Software
Software systems: RESOLVE, TEXTAL, ARP/wARP, MAID
• 1.0Å < d < 2.3Å
~ 90% completeness
• 2.3Å ≤ d < 3.0Å
~ 67% completeness (varies widely)1
1.0Å
3.0Å
JCSG: 43% of data sets  2.3Å
 Manually completing a model:
• Labor intensive, time consuming
• Existing tools are highly interactive
 Model completion is high-throughput bottleneck
1Badger
(2003) Acta Cryst. D59
The Completion Problem
 Input:
Anchor 1
(3 atoms)
• Electron-density map
• Partial structure
• Two anchor residues
• Amino-acid sequence of
missing fragment
(typically 4 – 15 residues long)
 Output:
Anchor 2
(3 atoms)
Protein fragment (fuzzy map)
Partial structure
(folded)
Main part of protein (folded)
• Ranked conformations Q of fragment that
- Respect the closure constraint
- Maximize target function T(Q) measuring fit with
electron-density map
- No atomic clashes
(Inverse Kinematics)
Two-Stage IK Method
1. Candidate generations
 Closed fragments
2. Candidate refinement
 Optimize fit with EDM
Stage 1: Candidate Generation
1.
Generate a random conformation of
fragment (only one end attached to anchor)
2. Close fragment (i.e., bring other end to
second anchor) using Cyclic Coordinate
Descent (CCD) (Wang & Chen ’91, Canutescu & Dunbrack ’03)
Closure Distance
Closure Distance: S  N - N  C - C  C - C
2
moving end
fixed end
2
2
A.A. Canutescu and R.L. Dunbrack Jr.
Cyclic coordinate descent: A robotics
algorithm for protein loop closure.
Prot. Sci. 12:963–972, 2003.
S
0
Compute qi s.t.
qi
+ bias toward
avoiding steric
clashes
Exact Inverse Kinematics
Repeat for each conformation of a
closed fragment:
1. Pick 3 amino-acids at random (3 pairs
of f-y angles)
2. Apply exact IK solver to generate all
IK solutions [Coutsias et al, 2004]
TM0813
GLU-83
GLY-96
Stage 2: Candidate Refinement
 Target function T (Q) measuring quality of
the fit with the EDM
 Minimize T while retaining closure
 Closed conformations lie on a self-motion
manifold of lower dimension
dq3
dq2
Null space
(q1,q2,q3)
dq1
1-D manifold
Closure and Null Space




dX = J dQ, where J is the 6n Jacobian
matrix (n > 6)
Null space {dQ | J dQ = 0} has dim = n – 6
N: orthonormal basis of null space
dQ = NNT T(Q)
X
Computation of N
SVD of J
dX
U66
VT6n
S66
s1
s2
dQ
0
=
s6
NT
(n-6) basis N of null space
Gram-Schmidt orthogonalization
Refinement Procedure
Repeat until minimum of T is reached:
1. Compute J and N at current Q
2. Compute T at current Q
(analytical expression of T + linear-time recursive
computation [Abe et al., Comput. Chem., 1984])
3. Move by small increment along dQ = NNT T
(+ Monte Carlo / simulated annealing protocol to
deal with local minima)
TM0813
GLU-83
GLY-96
Tests #1: Artificial Gaps
 TM1621 (234 residues) and TM0423 (376
residues), SCOP classification a/b
 Complete structures (gold standard)
resolved with EDM at 1.6Å resolution
 Compute EDM at 2, 2.5, and 2.8Å resolution
 Remove fragments and rebuild
TM1621
103 Fragments from TM1621 at 2.5Å
Short Fragments:
100% < 1.0Å aaRMSD
Long Fragments:
12: 96% < 1.0Å aaRMSD
15: 88% < 1.0Å aaRMSD
Produced by H. van den Bedem
Example: TM0423
PDB: 1KQ3, 376 res.
2.0Å resolution
12 residue gap
Best: 0.3Å aaRMSD
Tests #2: True Gaps




Structure computed by RESOLVE
Gaps completed independently (gold standard)
Example: TM1742 (271 residues)
2.4Å resolution; 5 gaps left by RESOLVE
Length
Top scorer
4
0.22Å
5
0.78Å
5
0.36Å
7
0.72Å
10
0.43Å
Produced by H. van den Bedem
TM1621
 Green: manually
completed
conformation
 Cyan: conformation
computed by stage 1
 Magenta: conformation
computed by stage 2
 The aaRMSD improved
by 2.4Å to 0.31Å
Current/Future Work
 Software actively being
used at the JCSG
 What about multi-modal
loops?
B
A
 TM0755: data at 1.8Å
 8-residue fragment crystallized in 2 conformations
 Overlapping density: Difficult to interpret manually
A323
Hist
A316
Ser
Algorithm successfully identified and built both conformations
Current/Future Work
 Software actively being
used at the JCSG
 What about multi-modal
loops?
 Fuzziness in EDM can
then be exploited
B
 Use EDM to infer
probability measure
over the conformation
space of the loop
A
Amylosucrase
J. Cortés, T. Siméon, M. Renaud-Siméon, and V. Tran.
J. Comp. Chemistry, 25:956-967, 2004
Energy maintenance during
Monte Carlo simulation
joint work with Itay Lotan, Fabian Schwarzer,
and Dan Halperin1
1 Computer Science Department, Tel Aviv University
Monte Carlo Simulation (MCS)
 Random walk through conformation space
 At each attempted step:
• Perturb current conformation at random
• Accept step with probability:

P(accept )  min 1, e
-E / kbT

 The conformations generated by an arbitrarily
long MCS are Boltzman distributed, i.e.,
#conformations in V ~

V
e
-
E
kT
dV
Monte Carlo Simulation (MCS)
 Used to:
• sample meaningful distributions of conformations
• generate energetically plausible motion pathways
 A simulation run may consist of millions of steps
 energy must be evaluated a large number of times
Problem: How to maintain energy efficiently?
Energy Function
 E = S bonded terms
+
S non-bonded terms
 Bonded terms
+ S solvation terms
- O(n)
 Non-bonded terms
- E.g., Van der Waals and electrostatic
- Depend on distances between pairs of atoms
- O(n2)  Expensive to compute
 Solvation terms
- May require computing molecular surface
Non-Bonded Terms
 Energy terms go to 0 when distance
increases
 Cutoff distance (6 - 12Å)
 vdW forces prevent atoms
from bunching up
 Only O(n) interacting pairs
[Halperin&Overmars 98]
Problem: How to find interacting pairs
without enumerating all atom pairs?
Grid Method
dcutoff
 Subdivide 3-space into
cubic cells
 Compute cell that
contains each atom
center
 Represent grid as hashtable
Grid Method
dcutoff
 Θ(n) time to build grid
 O(1) time to find
interactive pairs for each
atom
 Θ(n) to find all
interactive pairs of
atoms [Halperin&Overmars, 98]
 Asymptotically optimal
in worst-case
Can we do better on average?
 Few DOFs are changed at each MC step
0
simulation
of 100,000
attempted
steps
5
10
20
30
Number k
of DOF changes
Can we do better on average?
 Few DOFs are changed at each MC step
 Proteins are long chain kinematics
Long sub-chains stay rigid at each step
 Many interacting pairs of atoms are unchanged
 Many partial energy sums remain constant
Problem: How to find new interacting pairs
and retrieve unchanged partial sums?
Two New Data Structures
1. ChainTree
 Fast detection of interacting atom pairs
2. EnergyTree
 Retrieval of unchanged partial energy sums
ChainTree
(Twofold Hierarchy: BVs + Transforms)
links
ChainTree
(Twofold Hierarchy: BVs + Transforms)
TNO
TJK
TAB
joints
Updating the ChainTree
Update path to root:
– Recompute transforms that “shortcut” the DOF change
– Recompute BVs that contain the DOF change
– O(k log2(2n/k)) work for k changes
Finding Interacting Pairs

Finding Interacting Pairs
Finding Interacting Pairs
 Do not search inside
rigid sub-chains
(unmarked nodes)
Finding Interacting Pairs
 Do not search inside
rigid sub-chains
(unmarked nodes)
 Do not test two nodes
with no marked node
between them
 New interacting pairs
EnergyTree
E(N,N)
E(J,L)
E(K.L)
E(L,L)
E(M,M)
EnergyTree
E(N,N)
E(J,L)
E(K.L)
E(L,L)
E(M,M)
Complexity
 n : total number of DOFs
 k : number of DOF changes at each MCS step
 k << n
 Complexity of:
 updating ChainTree: O(k log2(2n/k))
 finding interacting pairs: O(n4/3)
but performs much better in practice!!!
Experimental Setup
 Energy function:




Van der Waals
Electrostatic
Attraction between native contacts
Cutoff at 12Å
 300,000 steps MCS with Grid and
ChainTree
 Steps are the same with both methods
 Early rejection for large vdW terms
Results: 1-DOF change
12.5
7.8
speedup
5.8
3.5
# amino acids
(68)
(144)
(374)
(755)
Results: 5-DOF change
5.9
speedup
4.5
3.4
2.2
(68)
(144)
(374)
(755)
Two-Pass ChainTree (ChainTree+)
1st pass: small cutoff distance to detect steric clashes
2nd pass: normal cutoff distance
>5
Tests around
native state
Interaction with Solvent
 Implicit solvent model: solvent as continuous medium,
interface is solvent-accessible surface
E. Eyal, D. Halperin. Dynamic Maintenance of Molecular Surfaces under
Conformational Changes.
http://www.give.nl/movie/publications/telaviv/EH04.pdf
Summary
 Inverse kinematics techniques 
Improve structure determination from
fuzzy electron density maps
 Collision detection techniques 
Speedup energy maintenance during
Monte Carlo simulation
About Computational Biology
 Computational Biology is more than mimicking
nature (e.g., performing Molecular Dynamic
simulation)
 One of its goals is to achieve algorithmic
efficiency by exploiting properties of molecules,
e.g.:
• Atoms cannot bunch up together
• Forces have relatively short ranges
• Proteins are long kinematic chains