Bio-CS Exploration of Molecular Conformational Spaces Jean-Claude Latombe Computer Science Department Robotics Laboratory & Bio-X Clark Center.

Download Report

Transcript Bio-CS Exploration of Molecular Conformational Spaces Jean-Claude Latombe Computer Science Department Robotics Laboratory & Bio-X Clark Center.

Bio-CS
Exploration of Molecular
Conformational Spaces
Jean-Claude Latombe
Computer Science Department
Robotics Laboratory & Bio-X Clark Center
Range of Bio-CS Research
Body system
Tissue/Organs
Cells
Molecules
Gene
Molecular
structures,
similarities
and motions
Robotic
surgery
Soft-tissue
simulation and
surgical training
Simulation of
cell interaction
Range of Bio-CS Research
Body system
Tissue/Organs
Cells
Molecules
Gene
Molecular
structures,
similarities
and motions
Robotic
surgery
Soft-tissue
simulation and
surgical training
Simulation of
cell interaction
Accuray
Range of Bio-CS Research
Body system
Tissue/Organs
Cells
Molecules
Gene
Molecular
structures,
similarities
and motions
Robotic
surgery
Soft-tissue
simulation and
surgical training
Simulation of
cell interaction
Motion  Structure
4
2
1
3
Motion  Structure  Function
Develop efficient algorithms and data structures
to explore protein conformational spaces:
Sampling
Similarities
Pathways
Vision for the Future
In-silico experiments
Drugs on demand
 “Interactive” Biology
Analogy with Robotics
free space
[Kavraki, Svetska, Latombe,Overmars, 95]
But Biology  Robotics …
Energy field, instead of joint control
Continuous energy field, instead of
binary free and in-collision spaces
Multiple pathways, instead of single
collision-free path
Potentially many more degrees of
freedom
Relation to real world is more complex
Overview
 Part I
Probabilistic Roadmaps:
A Tool for Computing Ensemble Properties of Molecular
Motions
M.S. Apaydin, D.L. Brutlag, C. Guestrin, D. Hsu, J.C. Latombe, and C.
Varma. Stochastic Roadmap Simulation: An Efficient Representation and
Algorithm for Analyzing Molecular Motion. J. Computational Biology, 10(34):257-281, 2003.
 Part II
ChainTree:
A Data Structure for Efficient Monte Carlo Simulation of
Proteins
I. Lotan, F. Schwarzer, J.C. Latombe. Efficient Energy Computation for
Monte Carlo Simulation of Proteins. 3rd Workshop on Algorithms in
Bioinformatics (WABI), Budapest, Hungary, Sept., 2003.
Part I
Probabilistic Roadmaps:
A Tool for Computing Ensemble
Properties of Molecular Motions
Serkan Apaydin, Doug Brutlag1, Carlos Guestrin,
David Hsu2, Jean-Claude Latombe, Chris Varma
Computer Science Department
Stanford University
1 Department of Biochemistry, Stanford University
2 Computer Science Department, Nat. Univ. of Singapore
Initial Work
[Singh, Latombe, Brutlag, 99]
Study of ligand-protein binding
Probabilistic roadmaps with edges
weighted by energetic plausibility
vi
wij
vj
Initial Work
[Singh, Latombe, Brutlag, 99]
Study of ligand-protein binding
Probabilistic roadmaps with edges
weighted by energetic plausibility
vi
wij
vj
Search of most plausible path
Initial Work
[Singh, Latombe, Brutlag, 99]
 Study of energy profiles along most plausible paths
energy
Catalytic
Site
 Extensions to protein folding
[Song and Amato, 01] [Apaydin et al., 01]
But:
Molecules fold/bind along a myriad of pathways.
Any single pathway is of limited interest.
New Idea:
Capture the stochastic nature of molecular
motion by assigning probabilities to edges
vi
Pij
vj
Edge probabilities
 exp( Eij / k BT )
, if Eij  0;

Ni

Follow Metropolis criteria: Pij  
 1 , otherwise.
 N i
vi
Self-transition probability:
Pii  1   Pij
j i
Pii
Pij
vj
Stochastic Roadmap Simulation
Pij
S
Stochastic simulation on roadmap and Monte Carlo
simulation converge to same Boltzmann distribution
Problems with
Monte Carlo Simulation
 Much time is wasted escaping local minima
 Each run generates a single pathway
Proposed Solution
Pij
Treat a roadmap as a Markov chain and use
First-Step Analysis tool
Example #1:
Probability of Folding pfold
HIV integrase
[Du et al. ‘98]
1- pfold
pfold
“We stress that we do not suggest using pfold as a
transition coordinate for practical purposes as it is
Unfolded state very computationally intensive.”
Folded state
Du, Pande, Grosberg, Tanaka, and Shakhnovich “On the Transition Coordinate
for Protein Folding” Journal of Chemical Physics (1998).
First-Step Analysis
U: Unfolded set





F: Folded set
One linear equation per node
Solution gives pfold for all nodes l
k
No explicit simulation run
j
Pik
Pil
All pathways are taken
Pij into account
m
Pim
Sparse linear system
i
Pii
Let fi = pfold(i)
After one step: fi = Pii fi + Pij fj + Pik fk + Pil fl + Pim fm
=1
=1
In Contrast …
Computing pfold with MC simulation requires:
For every conformation c of interest
 Perform many MC simulation runs from c
 Count number of times F is attained first
Computational Tests
• 1ROP (repressor of
primer)
• 2 a helices
• 6 DOF
• 1HDD (Engrailed
homeodomain)
• 3 a helices
• 12 DOF
H-P energy model with steric clash exclusion [Sun et al., 95]
Correlation with MC Approach
1ROP
Computation Times (1ROP)
Monte Carlo:
49 conformations
Over 11 days of
computer time
Over 106 energy
computations
Roadmap:
5000 conformations 1.5 hours of
computer time
~15,000 energy
computations
~4 orders of magnitude speedup!
Example #2:
Ligand-Protein Interaction
Computation of escape time
from funnels of attraction
around potential binding sites
funnel = ball of 10Å rmsd
[Camacho, Vajda, 01]
Similar Computation Through
Simulation
[Sept, Elcock and McCammon `99]
10K to 30K independent simulations
Computing Escape Time with Roadmap
l
k
j
Pil
Pik
Pij
i
Pii
m
Pim
Funnel of Attraction
ti = 1 + Pii ti + Pij tj+ Pik tk + Pil tl + Pim tm
=0
(escape time is measured as number of steps
of stochastic simulation)
Distinguishing Catalytic Site
Given several potential binding sites,
which one is the catalytic site?
Energy: electrostatic + van der Waals + solvation free energy terms
Complexes Studied
ligand
protein # random
nodes
#
DOFs
oxamate
1ldm
8000
7
Streptavidin
1stp
8000
11
Hydroxylamine
4ts1
8000
9
COT
1cjw
8000
21
THK
1aid
8000
14
IPM
1ao5
8000
10
PTI
3tpi
8000
13
Distinction Based on Energy
Protein
Bound
state
Best potential
binding site
1stp
-15.1
-14.6
4ts1
-19.4
-14.6
3tpi
-25.2
-16.0
1ldm
-11.8
-13.6
1cjw
-11.7
-18.0
1aid
-11.2
-22.2
1ao5
-7.5
-13.1
Able to
distinguish
catalytic site
Not able
(kcal/mol)
Distinction Based on Escape Time
Protein
1stp
4ts1
3tpi
1ldm
1cjw
1aid
1ao5
Bound
state
3.4E+9
3.8E+10
1.3E+11
8.1E+5
5.4E+8
9.7E+5
6.6E+7
Best potential
binding site
1.1E+7
1.8E+6
5.9E+5
3.4E+6
4.2E+6
1.6E+8
5.7E+6
Able to
distinguish
catalytic site
Not able
(# steps)
Conclusion
Probabilistic roadmaps are a promising tool for
computing ensemble properties of molecular
pathways
Current work:
 Non-uniform sampling strategies to handle
more complex molecules
 More realistic energetic models
 Extension to molecular dynamic simulation
 Connection to in-vitro experiments
(interaction of two proteins)
Part II
ChainTree:
A Data Structure for Efficient
Monte Carlo Simulation of Proteins
Itay Lotan, Fabian Schwarzer, Dan Halperin1,
Jean-Claude Latombe
Computer Science Department
Stanford University
1 Computer Science Department, Tel Aviv University
Monte Carlo Simulation (MCS)
 Used to study thermodynamic and kinetic
properties of proteins
 Random walk through conformation space
 At each attempted step:
– Perturb current conformation at random
– Accept step with probability:

P(accept )  min 1, e
E / kbT

 Problem: How to maintain energy efficiently?
Energy Function
 E = S bonded terms + S non-bonded terms
 Bonded terms, e.g. bond length
Easy to compute
 Non-bonded terms, e.g. Van der Waals, depend
on distances between pairs of atoms
Expensive to compute, O(n2)
Energy Function
 Non-bonded terms
 Use cutoff distance (6 - 12Å)
 Only O(n) interacting pairs
[Halperin & Overmars ’98]
Problem: How to find interacting pairs
without enumerating all atom pairs?
Grid Method
 Subdivide space into cubic cells
 Compute cell that contains each atom center
 Store results in hash table
dcutof
f
• Θ(n) time to update grid
• O(1) time to find
interactions for each atom
• Θ(n) to find all interactions
Asymptotically optimal in worst-case!
Can We Do Better on Average?
 Proteins are long
kinematic chains
Protein’s Kinematic Structure
torsional dof
 Angles j, y for backbone
and c for side-chains
 Conformational space
Can We Do Better on Average?
 Proteins are long
chain kinematics
 Few DOFs are
perturbed at
each MC step
How to retrieve unchanged partial sums?
 Long sub-chains stay rigid at each step
 Many partial energy sums remain constant
Two New Data Structures
1. ChainTree
 Fast detection of interacting atom pairs
2. EnergyTree
 Reuse of unchanged partial energy sums
ChainTree
Combination of two hierarchies:
 Transform hierarchy:
 Bounding volume hierarchy:
ChainTree
Combination of two hierarchies:
 Transform hierarchy:
approximate kinematics of protein backbone
at successive resolutions
ChainTree
Combination of two hierarchies:
(Larsen et al., ’00)
 Bounding volume hierarchy: approximate
geometry of protein at successive resolutions
ChainTree
Updating the ChainTree
Update path to root
– Recompute transforms that shortcut change
– Recompute bounding volumes that contain change
Finding Interacting Pairs
vs.
• Do not search inside
rigid sub-chains
(unmarked nodes)
• Do not test two nodes
with no marked node
between them
Finding Interacting Pairs
vs.
• Do not search inside
rigid sub-chains
(unmarked nodes)
• Do not test two nodes
with no marked node
between them
Computational Complexity
• n : total number of DOFs in protein backbone
• k : number of simultaneous DOF changes at
each step of MCS
• Updating complexity:
n

O  k log 
k

• Worst-case complexity of finding all
4
interacting pairs:
3
 (n )
but performs much better in practice!!!
EnergyTree
E(P,P)
E(N,N)
E(N,O)
E(O,O)
EnergyTree
E(P,P)
E(N,N)
E(N,O)
E(O,O)
E (a ,  )  E (al , l )  E (a r ,  r )  E (al ,  r )  E (a r , l )
E (a ,a )  E (al ,al )  E (a r ,a r )  E (al ,a r )
Experimental Setup
Energy function:
–
–
–
–
Van der Waals
Electrostatic
Attraction between native contacts
Cutoff at 12Å
300,000 steps MCS
Early rejection for large vdW terms
Results: 1-DOF change
(68)
(144)
(374)
(755)
Results: 5-DOF change
(68)
(144)
(374)
(755)
Two-Pass ChainTree
(68)
(144)
(374)
(755)
Conclusion
• Chain/EnergyTree reduces average time
per step in MCS of proteins (vs. grid)
• Exploit chain kinematics of protein
• Larger speed-up for bigger proteins
and for smaller number of simultaneous
DOF changes
What is Computational Biology?
Using computers in Biology?
Designing efficient algorithms for analyzing biological
data and simulating biological processes?
Using Biology to design new algorithms and computing
hardware?
Cultural clash
Biology
 classification
Computer Science  abstraction
In any case, Computational Biology will be a critical
domain for the next 20 years, probably the next “big
thing” after the Internet