Robotics Algorithms for the Study of Protein Structure and Motion Jean-Claude Latombe Computer Science Department Stanford University.
Download
Report
Transcript Robotics Algorithms for the Study of Protein Structure and Motion Jean-Claude Latombe Computer Science Department Stanford University.
Robotics Algorithms
for the Study of
Protein Structure and Motion
Jean-Claude Latombe
Computer Science Department
Stanford University
Protein
Long sequence of amino-acids (dozens to thousands),
from a dictionary of 20 distinct amino-acids
Central Dogma
of Molecular Biology
Physiological conditions:
aqueous solution, 37°C, pH 7,
atmospheric pressure
Why Proteins?
They are the workhorses of living organisms
• They perform many vital functions, e.g.:
-
catalysis of reactions
storage of energy
transmission of signals
building blocks of muscles
They raise challenging computational issues
• Large molecules (100s to several 1000s of atoms)
• Made of building blocks drawn from a small “dictionary”
• Unusual kinematic structure
They are associated with many critical problems
• Folded structure determination
• Global and local structural similarities
• Prediction of folding and binding motions
f-y Kinematic Linkage Model
peptide group
side-chain group
Molecule and Robot
Two problems
Structure determination from electron
density maps
• Inverse kinematics techniques
[Itay Lotan, Henry van den Bedem, Ashley Deacon
(Joint Center for Structural Genomics)]
Energy maintenance during Monte Carlo
simulation
• Collision detection techniques
[Itay Lotan, Fabian Schwarzer, and Danny Halperin
(Tel Aviv University)]
Structure
Determination/Prediction
Experimental tools
X-ray crystallography
Computational tools
• Homology, threading
• Molecular dynamics
NMR spectrometry
Protein Data Bank
Only about 10% of structures have been
determined for known protein sequences
Protein Structure Initiative (PSI)
1990
1999
2000
2004
250 new structures
2500 new structures
>20,000 structures total
~30,000 structures total
X-Ray Crystallography
Automated Model Building
Software systems: RESOLVE, TEXTAL, ARP/wARP, MAID
• 1.0Å < d < 2.3Å
~ 90% completeness
• 2.3Å ≤ d < 3.0Å
~ 67% completeness (varies widely)1
1.0Å
3.0Å
JCSG: 43% of data sets 2.3Å
Manually completing a model:
• Labor intensive, time consuming
• Existing tools are highly interactive
Model completion is high-throughput bottleneck
1Badger
(2003) Acta Cryst. D59
The Completion Problem
Input:
Anchor 1
(3 atoms)
• Electron-density map
• Partial structure
• Two anchor residues
• Amino-acid sequence of
missing fragment
(typically 4 – 15 residues long)
Anchor 2
(3 atoms)
Protein fragment (fuzzy map)
Main part of protein (folded)
Output:
• Few candidate conformation(s) of fragment that
- Respect the closure constraint (IK)
- Maximize match with electron-density map
IK Problem
Input:
• Closed kinematic chain with n > 6 degrees of freedom
• Relative positions/orientations X of end frames
• Target function T(Q) → R
Output:
• Joint angles Q that
- Achieve closure
- Optimize T
T
Related Work
Biology/Crystallography
Robotics/Computer Science
•
–
–
Manocha & Canny ’94
Manocha et al. ’95
–
Wang & Chen ’91
–
–
Khatib ’87
Burdick ’89
–
–
–
Han & Amato ’00
Yakey et al. ’01
Cortes et al. ’02, ’04
Optimization IK solvers
•
Redundant manipulators
Motion planning for closed loops
Exact IK solvers
–
–
Exact IK solvers
•
•
•
•
Optimization IK solvers
–
–
•
Fiser et al. ’00
Kolodny et al. ’03
Database search loop closure
–
–
•
Fine et al. ’86
Canutescu & Dunbrack Jr. ’03
Ab-initio loop closure
–
–
•
Wedemeyer & Scheraga ’99
Coutsias et al. ’04
Jones & Thirup ’86
Van Vlijman & Karplus ’97
Semi-automatic tools
–
–
Jones & Kjeldgaard ’97
Oldfield ’01
Two-Stage IK Method
1. Candidate generations
Closed fragments
2. Candidate refinement
Optimize fit with EDM
Stage 1: Candidate Generation
1.
Generate random conformation of fragment
(only one end attached to anchor)
2. Close fragment (i.e., bring other end to
second anchor) using Cyclic Coordinate
Descent (CCD) (Wang & Chen ’91, Canutescu & Dunbrack ’03)
Closure Distance
Closure Distance: S N - N C - C C - C
2
moving end
fixed end
2
2
A.A. Canutescu and R.L. Dunbrack Jr.
Cyclic coordinate descent: A robotics
algorithm for protein loop closure.
Prot. Sci. 12:963–972, 2003.
S
0
Compute qi s.t.
qi
+ bias toward EDM
+ avoid steric clashes
Stage 2: Candidate Refinement
Target function T (Q) measuring quality of
the fit with the EDM
Minimize T while retaining closure
Closed conformations lie on a self-motion
manifold of lower dimension
dq3
dq2
Null space
(q1,q2,q3)
dq1
1-D manifold
Closure and Null Space
dX = J dQ, where J is the 6n Jacobian
matrix (n > 6)
Null space {dQ | J dQ = 0} has dim = n – 6
N: orthonormal basis of null space
Pseudo-inverse J+ such that JJ+ = I
dQ = J+dX + NNTy
y = T(Q)
Computation of J+ and N
SVD of J
dX
U66
VT6n
S66
s1
s2
dQ
0
=
s6
NT
(n-6) basis N of null space
Gram-Schmidt orthogonalization
J+ = V S+ UT where S+=diag[1/si]
Refinement Procedure
Repeat until minimum is reached:
Compute J, J+ and N at current Q
• Compute T at current Q
(analytical expression of T + linear-time recursive
computation [Abe et al., Comput. Chem., 1984])
•
Move along dQ = J+dX + NNT T until
minimum is reached or closure is broken
+
Monte Carlo + simulated annealing protocol to
deal with local minima
Monte Carlo Optimization
Repeat:
1. Perform a random move of the
fragment:
– either by picking a random direction in null
space
– or by using an exact IK solver over 6 dofs
[Coutsias et al, 2004] ( big jumps)
2. Minimize T(Q)
3. Accept move with Metropolis-criterion
probability ~exp(-DT/Temp)
Tests #1: Artificial Gaps
TM1621 (234 residues) and TM0423 (376
residues), SCOP classification a/b
Complete structures (gold standard)
resolved with EDM at 1.6Å resolution
Compute EDM at 2, 2.5, and 2.8Å resolution
Remove fragments and rebuild
TM1621
103 Fragments from TM1621 at 2.5Å
Short Fragments:
100% < 1.0Å aaRMSD
Long Fragments:
12: 96% < 1.0Å aaRMSD
15: 88% < 1.0Å aaRMSD
Produced by H. van den Bedem
Comparison Across Resolutions
Resolution = 2.0Å
Resolution = 2.5Å
Resolution = 2.8Å
Example: TM0423
PDB: 1KQ3, 376 res.
2.0Å resolution
12 residue gap
Best: 0.3Å aaRMSD
Tests #2: True Gaps
Structure computed by RESOLVE
Gaps completed independently (gold standard)
Example: TM1742 (271 residues)
2.4Å resolution; 5 gaps left by RESOLVE
Length
Top scorer
Lowest error
4
0.22Å
0.22Å
5
0.78Å
0.78Å
5
0.36Å
0.36Å
7
0.72Å
0.66Å
10
0.43Å
0.43Å
Produced by H. van den Bedem
TM0813
PDB: 1J5X, 342 res.
2.8Å resolution
12 residue gap
GLU-83
GLY-96
TM0813
PDB: 1J5X, 342 res.
2.8Å resolution
12 residue gap
Best 0.6Å aaRMSD
GLU-83
GLY-96
TM1621
Green: manually
completed
conformation
Cyan: conformation
computed by stage 1
Magenta: conformation
computed by stage 2
The aaRMSD improved
by 2.4Å to 0.31Å
Alr1529
D72-D78
resolution:
initial model:
contour:
PDB:
aaRMSD:
2.0Å
ARP/wARP
1.0s
1VJG
0.33Å
TM0542
• Top-scoring fragment in cyan
• Manually completed fragment in green
• Residues A259 and A260 are flipped
Current/Future Work
Software actively being
used at the JCSG
What about multi-modal
loops?
B
A
TM0755: data at 1.8Å
8-residue fragment crystallized in 2 conformations
Overlapping density: Difficult to interpret manually
A323
Hist
A316
Ser
Algorithm successfully identified and built both conformations
Current/Future Work
Software actively being
used at the JCSG
What about multi-modal
loops?
Fuzziness in EDM can
then be exploited
B
Use EDM to infer
probability measure
over the conformation
space of the loop
A
Amylosucrase
J. Cortés, T. Siméon, M. Renaud-Siméon, and V. Tran.
J. Comp. Chemistry, 25:956-967, 2004
Energy maintenance during
Monte Carlo simulation
joint work with Itay Lotan, Fabian Schwarzer,
and Dan Halperin1
1 Computer Science Department, Tel Aviv University
Monte Carlo Simulation (MCS)
Random walk through conformation space
At each attempted step:
• Perturb current conformation at random
• Accept step with probability:
P(accept ) min 1, e
-DE / kbT
The conformations generated by an arbitrarily
long MCS are Boltzman distributed, i.e.,
#conformations in V ~
V
e
-
E
kT
dV
Monte Carlo Simulation (MCS)
Used to:
• sample meaningful distributions of conformations
• generate energetically plausible motion pathways
A simulation run may consist of millions of
steps
energy must be evaluated frequently
Problem: How to maintain energy efficiently?
Energy Function
E = S bonded terms
+
S non-bonded terms
Bonded terms
+ S solvation terms
- O(n)
Non-bonded terms
- E.g., e.g. Van der Waals and electrostatic
- Depend on distances between pairs of atoms
- O(n2) Expensive to compute
Solvation terms
- May require computing molecular surface
Non-Bonded Terms
Energy terms go to 0 when distance
increases
Cutoff distance (6 - 12Å)
vdW forces prevent atoms
from bunching up
Only O(n) interacting pairs
[Halperin&Overmars 98]
Problem: How to find interacting pairs
without enumerating all atom pairs?
Grid Method
dcutoff
Subdivide 3-space into
cubic cells
Compute cell that
contains each atom
center
Represent grid as hashtable
Grid Method
dcutoff
Θ(n) time to build grid
O(1) time to find
interactive pairs for each
atom
Θ(n) to find all
interactive pairs of
atoms [Halperin&Overmars, 98]
Asymptotically optimal
in worst-case
Can we do better on average?
Few DOFs are changed at each MC step
0
simulation
of 100,000
attempted
steps
5
10
20
30
Number k
of DOF changes
Can we do better on average?
Few DOFs are changed at each MC step
Proteins are long chain kinematics
Long sub-chains stay rigid at each step
Many partial energy sums remain constant
Problem: How to retrieve the unchanged
partial sums?
Hierarchical Collision Checking
Widely used technique
in robotics/graphics to
approximate distances
between objects
Pre-computation of
bounding-volume
hierarchy
How to update this
hierarchy if the objects
deform
Two New Data Structures
1. ChainTree
Fast detection of interacting atom pairs
2. EnergyTree
Retrieval of unchanged partial energy sums
ChainTree
(Twofold Hierarchy: BVs + Transforms)
links
ChainTree
(Twofold Hierarchy: BVs + Transforms)
TNO
TJK
TAB
joints
Updating the ChainTree
Update path to root:
– Recompute transforms that “shortcut” the DOF change
– Recompute BVs that contain the DOF change
– O(k log(n/k)) work for k changes
Finding Interacting Pairs
Finding Interacting Pairs
Finding Interacting Pairs
Do not search inside
rigid sub-chains
(unmarked nodes)
Finding Interacting Pairs
Do not search inside
rigid sub-chains
(unmarked nodes)
Do not test two nodes
with no marked node
between them
New interacting pairs
EnergyTree
E(N,N)
E(J,L)
E(K.L)
E(L,L)
E(M,M)
EnergyTree
E(N,N)
E(J,L)
E(K.L)
E(L,L)
E(M,M)
Complexity
n : total number of DOFs
k : number of DOF changes at each MCS step
k << n
Complexity of:
updating ChainTree: O(k log(n/k))
finding interacting pairs: O(n4/3)
but performs much better in practice!!!
Experimental Setup
Energy function:
Van der Waals
Electrostatic
Attraction between native contacts
Cutoff at 12Å
300,000 steps MCS with Grid and
ChainTree
Steps are the same with both methods
Early rejection for large vdW terms
Results: 1-DOF change
12.5
7.8
speedup
5.8
3.5
# amino acids
(68)
(144)
(374)
(755)
Results: 5-DOF change
5.9
speedup
4.5
3.4
2.2
(68)
(144)
(374)
(755)
Two-Pass ChainTree (ChainTree+)
1st pass: small cutoff distance to detect steric clashes
2nd pass: normal cutoff distance
>5
Tests around
native state
Interaction with Solvent
Explicit solvent models: 100s or 1000s of discrete
solvent molecules
Implicit solvent models: solvent as continuous medium,
interface is solvent-accessible surface
E. Eyal, D. Halperin. Dynamic Maintenance of Molecular Surfaces under
Conformational Changes.
http://www.give.nl/movie/publications/telaviv/EH04.pdf
Summary
Inverse kinematics techniques
Improve structure determination from
fuzzy electron density maps
Collision detection techniques
Speedup energy maintenance during
Monte Carlo simulation
About Computational Biology
Computational Biology is more than using
computers to biological problems or
mimicking nature (e.g., performing MD
simulation)
One of its goals is to achieve algorithmic
efficiency by exploiting properties of
molecules, e.g.:
• Proteins are long kinematic chains
• Atoms cannot bunch up together
• Forces have relatively short ranges