Transcript Slide 1

Algorithms Exploiting the
Chain Structure of Proteins
Itay Lotan
Computer Science
Proteins 101
Involved in all functions of our body:
metabolism, motion, defense, etc.
Michael Levitt
Protein representation
 Torsion angle model:
Resi
Resi+1
Resi+2
O
C

N
C

N
C’
O
 Cα model:


C
C
Resi+3
O
C

C’
N
C

N
C’
O


C
C
C’
Structure determination
X-ray crystallography
Bernhard Rupp
Outline
1. Fast energy computation during
Monte Carlo simulation
2. Model completion for protein X-ray
crystallography
3. Large scale computation of similarity
Exploit specific properties of proteins to
perform the computation efficiently
Outline
1. Fast energy computation during
Monte Carlo simulation
2. Model completion for protein X-ray
crystallography
3. Large scale computation of similarity
Lotan, Schwarzer, Halperin* and Latombe.
J. Comput. Bio. 2004 (to appear)
*CS Department, Tel-Aviv University
Monte Carlo simulation (MCS)
Popular method for sampling the
conformation space of proteins:
 Estimate thermodynamic quantities
 Search for low-energy conformations
and the folded structure
MCS: How it works
1. Propose random change in conformation
2. Compute energy E of new conformation
3. Accept with probability:

P(accept )  min 1, eE / kbT

Requires >>106 steps to sample adequately
Energy function
 Bonded terms:
 Bond lengths:
 Bond angles:
 Dihedral angles:
 Non-bonded terms:
 Van der Waals:
 Electrostatic:
 Heuristic: Go models, HP models, etc.
Pair-wise interactions
 Cutoff distance (6 - 12Å)
 Linear number of interactions
contribute to energy (Halperin & Overmars ’98)
Challenge: Find all interacting
pairs without enumerating all pairs
Related work
Computer Science
 Bounding volume
hierarchies for collision
detection
Biology
 Neighbor lists


Verlet ’67
Brooks et al. ’83



Gotschalk et al. ’96
Larsen et al. ’00
Guibas et al. ’02
 Grid


Faverjon ’84
Halperin & Overmars ’98
 Neighbor lists + grid


Halperin et al. ’97
Guibas et al. ’02
 Space partition methods
for collision detection
 Collisions detection for
chains





Quentrec & Brot ’73
Hockney et al. ’74
Van Gunsteren et al. ’84
Yip & Elber ’89
Petrella ’02
Grid method
d
 Linear complexity
 Optimal in worst case
d: Cutoff
distance
Contributions
 Efficient maintenance and self-collision
detection for kinematic chains
 Efficient computation of pair-wise
interactions in MCS of proteins
 Scheme for caching and reusing partial
energy sums during MCS
 MCS software*
Much faster than existing algorithm
(grid method)
*Download at: http://robotics.stanford.edu/~itayl/mcs
Properties of kinematic chains
 Small changes  large effects
Properties of kinematic chains
 Small changes  large effects
Properties of kinematic chains
 Small changes  large effects
 Local changes  global effects
Properties of kinematic chains
 Small changes  large effects
 Local changes  global effects
 Few DoF changes  long rigid subchains
Properties of kinematic chains
 Small changes  large effects
 Local changes  global effects
 Few DoF changes  long rigid subchains
ChainTree: A tale of two hierarchies
 Transform hierarchy: approximates
kinematics of protein backbone at
successive resolutions
 Bounding volume hierarchy:
approximates geometry of protein at
successive resolutions
Hierarchy of transforms
Hierarchy of transforms
TAI
TAE
TEI
TAC
TAB
A
TCE
TBC
B
C
TCD
TEG
TDE
D
E
TEF
TGI
TFG
F
G
TGH
THI
H
I
Hierarchy of bounding volumes
BAH
BEH
BAD
BCD
BAB
BA
BB
BC
BEF
BD
BE
BGH
BF
BG
BH
The ChainTree
TAI BAH
TAE BAD
TAC BAB
TAB
BA
A
TEI BEH
TCE BCD
TBC
BB
B
TCD
BC
C
TEG BEF
TDE
BD
D
TEF
BE
E
TGI BGH
TFG
BF
F
TGH
BG
G
THI
BH
H
I
Updating the ChainTree
TAI BAH
TAE BAD
TAC BAB
TAB
BA
A
TEI BEH
TCE BCD
TBC
BB
B
TCD
BC
C
TDE
BD
D
TEG BEF
TEF
BE
E
TGI BGH
TFG
BF
F
TGH
BG
G
THI
BH
H
I
Computing the energy
Recursively search ChainTree for
interactions
P
N
O
J
A
Pruning rules:
K
B
C
L
D
E
M
F
G
H
1. Prune search when distance between bounding volumes
is more than cutoff distance
2. Do not search inside rigid sub-chains
Computing the energy
P
N
J
O
K
L
M
A B C D E F G H
[ P]
Computing the energy
P
N
J
O
K
L
M
A B C D E F G H
[ P]
[N]
Computing the energy
P
N
J
O
K
L
M
A B C D E F G H
[ P]
[N]
[O ]
Computing the energy
P
N
J
O
K
L
M
A B C D E F G H
[ P]
[N]
[N-O]
[O ]
Computing the energy
P
N
J
O
K
L
M
A B C D E F G H
[ P]
[N]
[J]
[J-K] [K]
[A-C]
[A-D]
[B-C]
[B-D]
[C]
[C-D]
[D]
[N-O]
[O ]
Computing the energy
P
N
J
O
K
L
M
A B C D E F G H
[ P]
[N]
[J]
[A]
[A-B]
[B]
[J-K] [K]
[A-C]
[A-D]
[B-C]
[B-D]
[N-O]
[J-L]
[C] [A-E]
[C-D] [A-F]
[D] [B-E]
[B-F]
[O]
[J-M] [K-L] [K-M]
[A-G]
[A-H]
[B-G]
[B-H]
[C-E]
[C-F]
[D-E]
[D-F]
[L] [L-M]
[C-G] [E]
[C-H] [E-F]
[D-G] [F]
[D-H]
[E-G]
[E-H]
[F-G]
[F-H]
[M]
[H]
[H-G]
[G]
Computing the energy
P
N
J
O
K
L
M
A B C D E F G H
E(O)
[ P]
[N]
[J]
[A]
[A-B]
[B]
[J-K] [K]
[A-C]
[A-D]
[B-C]
[B-D]
[N-O]
[J-L]
[C] [A-E]
[C-D] [A-F]
[D] [B-E]
[B-F]
[O]
[J-M] [K-L] [K-M]
[A-G]
[A-H]
[B-G]
[B-H]
[C-E]
[C-F]
[D-E]
[D-F]
[L] [L-M]
[C-G] [E]
[C-H] [E-F]
[D-G] [F]
[D-H]
[E-G]
[E-H]
[F-G]
[F-H]
[M]
[H]
[H-G]
[G]
Computing the energy
 Only changed interactions are found
 Reuse unaffected partial sums
 Better performance for
 Longer proteins
 Fewer simultaneous changes
Computational complexity
 Updating:

O k log n
k

 Searching:
 
 n
4
3
worst case bound
Much faster in practice
Test
1-DoF change
5-DoF change
140
280
ChainTree
Grid
260
120
Time (in mSec.)
140
Time (in mSec.)
ChainTree
Grid
120
100
80
60
100
80
60
40
40
20
20
1CTF
[68 res.]
1LE2
[144 res.]
1HTB
[374 res.]
1JB0
[755 res.]
1CTF
1LE2
1HTB
1JB0
[68 res.]
[144 res.]
[374 res.]
[755 res.]
Simulation of α-Synuclein
 140 res. protein implicated in
Parkinson’s disease
 Multi-canonical Replica-exchange MC
regime
 Over 1000 CPU days of simulation
 Study conformations at room temp.
 Joint work with Vijay Pande
Outline
1. Fast energy computation during
Monte Carlo simulation
2. Model completion for protein X-ray
crystallography
3. Large scale computation of similarity
Lotan, van den Bedem*, Deacon* and
Latombe, WAFR 2004
van den Bedem*, Lotan, Latombe and
Deacon*, submitted to Acta. Cryst. D
*
Joint Center for Structural Genomics (JCSG) at SSRL
Protein Structure Initiative
152K sequenced genes
(30K/year)
25K determined structures
(3.6K/year)
 Reduce cost and time to determine protein
structure
 Develop software to automatically interpret the
electron density map (EDM)
EDM
3-D “image” of atomic structure
 High value (electron density) at atom
centers
 Density falls off exponentially away from
center
Automated model building
 ~90% built at high resolution (2Å)
 ~66% built at medium to low
resolution (2.5 – 2.8Å)
 Gaps left at noisy areas in EDM
(blurred density)
Gaps need to be resolved manually
The Fragment completion problem
 Input




EDM
Partially resolved structure
2 Anchor residues
Length of missing fragment
 Output
 A small number of candidate structures
for missing fragment
A robotics inverse kinematics (IK) problem
Related work
Computer Science



Exact IK solvers


Manocha & Canny ’94
Manocha et al. ’95

Wang & Chen ’91
Optimization IK solvers

Khatib ’87
Burdick ’89
Motion planning for closed
loops



Han & Amato ’00
Yakey et al. ’01
Cortes et al. ’02, ’04
Exact IK solvers



Redundant manipulators



Biology/Crystallography
Optimization IK solvers



Fiser et al. ’00
Kolodny et al. ’03
Database search loop closure



Fine et al. ’86
Canutescu & Dunbrack Jr. ’03
Ab-initio loop closure



Wedemeyer & Scheraga ’99
Coutsias et al. ’04
Jones & Thirup ’86
Van Vlijman & Karplus ’97
Semi-automatic tools


Jones & Kjeldgaard ’97
Oldfield ’01
Contributions
 Sampling of gap-closing fragments
biased by the EDM
 Refinement of fit to density without
breaking closure
 Fully automatic fragment completion
software for X-ray Crystallography
Novel application of a combination of
inverse kinematics techniques
Two-stage IK method
1. Candidate generations: Optimize
density fit while closing the gap
2. Refinement: Optimize closed
fragments without breaking closure
Stage 1: candidate generation
 Generate random conformation
 Close using Cyclic Coordinate Descent
(CCD) (Wang & Chen ’91, Canutescu & Dunbrack Jr. ’03)
Stage 1: candidate generation
 Generate random conformation
 Close using Cyclic Coordinate Descent
(CCD) (Wang & Chen ’91, Canutescu & Dunbrack ’03)
Stage 1: candidate generation
 Generate random conformation
 Close using Cyclic Coordinate Descent
(CCD) (Wang & Chen ’91, Canutescu & Dunbrack ’03)
Stage 1: candidate generation
 Generate random conformation
 Close using Cyclic Coordinate Descent
(CCD) (Wang & Chen ’91, Canutescu & Dunbrack ’03)
Stage 1: candidate generation
 Generate random conformation
 Close using Cyclic Coordinate Descent
(CCD) (Wang & Chen ’91, Canutescu & Dunbrack ’03)
CCD moves biased toward high-density
Stage 2: refinement
 Target function T (goodness of fit to EDM)
 Minimize T while retaining closure
 Closed conformations lie on Self-motion
manifold of lower dimension
1-D
manifold
Stage 2: null-space minimization
Jacobian: linear relation between joint
velocities q and end-effector linear and
angular velocity x .
x  J  q  q (6  n matrix)
null  J   q | J  q  0
Compute minimizing move using:
qJ
†
q x  N  N
N – orthonormal basis of null space
T

T  q 
q
Stage 2: minimization with closure
1.
2.
3.
4.
Choose sub-fragment with n > 6 DOFs
Compute null( J ) using SVD
Project T (q) q onto null( J )
Move until minimum is reached or
closure is broken
Escape from local minima using Monte
Carlo with simulated annealing
Test: artificial gaps
 Completed structure (gold standard)
 Good density (1.6Å res.)
 Remove fragment and rebuild
Length
High (2.0Å)
Medium (2.5Å)
Low (2.8Å)
4
100% (0.14Å)
100% (0.19Å)
100% (0.32Å)
8
100% (0.18Å)
100% (0.23Å)
100% (0.36Å)
12
91% (0.51Å)
96% (0.41Å)
91% (0.52Å)
15
91% (0.53Å)
88% (0.63Å)
83% (0.76Å)
Produced by H. van den Bedem
Test: true gaps
 Completed structure (gold standard)
 O.K. density (2.4Å res.)
 6 gaps left by model builder (RESOLVE)
Length
Top scorer
Lowest error
4
0.44Å
0.40Å
4
0.22Å
0.22Å
5
0.78Å
0.78Å
5
0.36Å
0.36Å
7
0.72Å
0.66Å
10
0.43Å
0.43Å
Produced by H. van den Bedem
Example: TM0423
PDB: 1KQ3, 376 res.
2.0Å resolution
12 residue gap
Best: 0.3Å aaRMSD
Example: TM0813
PDB: 1J5X, 342 res.
2.8Å resolution
12 residue gap
Best: 0.6Å aaRMSD
GLU-83
GLY-96
Example: TM0813
PDB: 1J5X, 342 res.
2.8Å resolution
12 residue gap
Best: 0.6Å aaRMSD
GLU-83
GLY-96
Example: TM0813
PDB: 1J5X, 342 res.
2.8Å resolution
12 residue gap
Best 0.6Å aaRMSD
GLU-83
GLY-96
Outline
1. Fast energy computation during
Monte Carlo simulation
2. Model completion for protein X-ray
crystallography
3. Large scale computation of similarity
Lotan and Schwarzer,
J. Comput. Biol. 11(2–3): 299–317, 2004
Large scale similarity
 Analysis of simulation trajectories
 Molecular dynamics simulation
 Monte Carlo simulation
 Clustering of decoy sets (e.g., Shortle et al. ’98)
 Stochastic Roadmap Simulation
(Apaydin et al. ’03)
Fast similarity measures are needed for
analyzing large sets of conformations
Contributions
 Uniform simplification of protein
structure for similarity computation
 Speed-up existing similarity measures
 Method offers trade-off between
speed and precision
 Efficient computation of nearest
neighbors
m-Averaged approximation
 Cut chain into pieces of length m
 Replace each sequence of m Cα atoms
by its centroid
3n coordinates
3n/m coordinates
Chains and distances
 Proximity along the chain entails
spatial proximity
d  3l
 Far away links along the chain are
spatially distant (on average)
ci
cj
Similarity measures
cRMS ( P, Q)  minT
dRMS ( P, Q ) 
n
1
pi  Tqi

n i 1

2
2
n
i
2
P
Q
d ij  d ij


n( n  1) i  2 j 1

2
Evaluation: test sets
8 structurally diverse proteins
(54 -76 residues)
1. Decoy sets: conformations from
the Park-Levitt set (Park et al, ’97), N
=10,000
2. Random sets: conformations
generated by the program
FOLDTRAJ (Feldman & Hogue, ’00), N = 5000
Evaluation results: decoy sets
m
cRMS
dRMS
3
4
0.99
0.98-0.99
0.96-0.98
0.94-0.97
6
9
12
0.92-0.99
0.81-0.98
0.54-0.92
0.78-0.93
0.65-0.96
0.52-0.69
 9x for cRMS (m = 9)
 36x for dRMS (m = 6)
Higher correlation for random sets!
k Nearest-neighbors problem
Given a set S of conformations of a
protein and a query conformation c,
find the k conformations in S most
similar to c
Brute force complexity:


  N  log k  L  
•  N  log k  L 
•
2
N – size of S
for all c  S
L – time to compute similarity
Efficient nearest neighbor search
kd-tree: O  kd log N  time per query
1
Limitations:
d
r r
1. Requires Minkowski metric: dist  P, Q     pi  qi  
 i 1

2. Less efficient when d>20
cRMS is not a Minkowski metric
n(n  1)
dRMS has dimensionality of
2
Reduce dRMS dimensionality using SVD
Reduction using SVD
1. Stack m-averaged distance matrices
as vectors
2. Compute the SVD of entire set
3. Project onto principle components
dRMS is reduced to 20 dimensions
Complexity of SVD ~ n
4
Testing the method
 Use decoy sets (N = 10,000) and
random sets (N = 5,000)
 m-averaging with (m = 4)
 Project onto 16 PCs for decoys, 12 PCs
for random sets
 Find k = 10, 25, 100 NNs for 250
conformations in each set
Results
 Decoy sets:
 ~77% correct
 Furthest NN off by 10% - 15% (0.7Å – 1.5Å)
 ~4k approximate NNs contain all true k NNs
 Random sets: slightly better results
Use reduction as fast filter
Running Time
N = 100,000, m=4, PC = 16
Find k = 100 for each conformation
Brute-force:
~84 hours
Brute-force + m-averaging:
~4.8 hours
Brute-force + m-averaging + SVD: 41 minutes
kd-tree + m-averaging + SVD:
19 minutes
kd-tree has more impact for larger sets
Contributions
 Energy computation in MCS


Efficient maintenance and self-collision detection for kinematic
chains
Efficient computation of pair-wise interactions in MCS of
proteins
Caching scheme for partial energy sums during MCS
MCS software



sampling of gap-closing fragments biased towards the EDM
Refinement of fit to density without breaking closure
Fully automatic fragment completion software

Uniform simplification of protein structure for similarity
computation
Speed-up existing similarity measures
Method offers trade-off between speed and precision


 Model completion in X-ray crystallography
 Similarity computation for large conformation sets


 Efficient computation of nearest neighbors
Take-home message
Taking into account physical properties
of proteins can lead to efficient
algorithms for a wide variety of
applications in structural biology
Outlook
biophysicist/biochemist
Models that simplify
the physics and
chemistry of proteins
computer scientist
Algorithms that
exploit properties
of protein models
Develop simplified
protein models that
lend themselves to
efficient computations
Acknowledgements









Jean-Claude Latombe
Vijay Pande
Michael Levitt
Leo Guibas
Axel Brunger, Balaji Prabhakar, Serafim Batzoglou
Fabian Schwarzer, Henry van den Bedem, Dan Halperin
Carlo Tomasi
Daniel Russakoff, Rachel Kolodny
Latombe group
Serkan Apaydin, Tim Bretl, Joel Brown, Phil Fong, Mitul Saha, Pekka Isto, Kris Hauser
 Pande group
Bojan Zagrovic, Stefan Larson, Lillian Chong, Young Min Rhee, Sidney Elmer, Chris
Snow, Guha Jayachandran, Eric Sorin, Sung-Joo Lee, Jim Cladwell, Michael Shirts, Nina
Singhal, Relly Brandman, Vishal Vaidyanathan, Nick Kelley, Mark Engelhardt
 Levitt Group
Patrice Koehl, Tanya Raschke, Erik Lindahl
Thank you!