Transcript week3

Week 3
Analysis of Trajectory Data
•
Lecture 5: Fundamental quantities; total energy, temperature,
pressure, volume, density. Structural quantities; root mean square
deviation (RMSD), distribution functions (pair or radial),
conformational analysis (Ramachandran plots, rotamers).
Kinematic quantities; time correlation functions and transport
coefficients. Complex structure determination via docking and MD.
•
Lecture 6: Dynamical quantities; free energy of a system,
Helmholtz and Gibbs free energies, calculation of free energy using
perturbation, umbrella sampling, and steered MD methods.
Structural analysis of proteins and complexes
1. Fundamental quantities
Total energy, temperature, pressure, volume (or density)
2. Structural quantities
Root Mean Square Deviation (RMSD)
Distribution functions (e.g., pair and radial)
Conformational analysis (e.g., Ramachandran plots, rotamers)
3. Complex structure determination
Determination of protein-ligand and protein-protein complexes using
docking programs.
Refinement of complex structures using MD simulations.
Fundamental quantities
• Total energy: strictly conserved but due to numerical errors it may drift.
• Temperature: would be constant in a macroscopic system (fluctuations
are proportional to 1/N, hence negligible). But in a small system,
they will fluctuate around a base line. Temperature coupling ensures
that the base line does not drift from the set temperature.
• Pressure: similar to temperature, though fluctuations are much larger
compared to the set value of 1 atm.
• Volume (or density): fixed in the NVT ensemble but varies in NPT.
Therefore, it needs to be monitored during the initial equilibration to
make sure that the system has converged to the correct volume
(density). Relatively quick for globular proteins in water but could take
a long time (~1 ns) for systems involving lipids (membrane proteins).
Root Mean Square Deviation
For an N-atom system, variance and RMSD at time t are defined as
1 N
 (t )   ri (t )  ri (0)2 ,
N i 1
RMSD  
Where ri(0) are the reference coordinates. Usually they are taken from the
first frame in an MD simulation, though they can also be taken from the
PDB or a target structure. Because side chains in proteins are different,
it is common to use a restricted set of atoms such as Ca or the backbone
(N, Ca, C,O) atoms.
RMSD is very useful in monitoring the approach to equilibrium (typically,
signalled by the appearance of a broad shoulder).
It is a good practice to keep monitoring RMSD during production runs
to ensure that the system stays near equilibrium.
Evolution of the RMSD for the backbone atoms of ubiquitin
Distribution functions
Pair distribution function gij(r) gives the probability of finding a pair of
atoms (i, j) at a distance r. It is obtained by sampling the distance rij in
MD simulations and placing each distance in an appropriate bin, [r, r+r].
Pair distribution functions are used in characterizing correlations between
pair of atoms, e.g., hydrogen bonds, cation-carbonyl oxygen. The peak
gives the average distance and the width, the strength of the interaction.
When the distribution is isotropic (e.g., ion-water in an electrolyte), one
samples all the atoms in the spherical shell, [r, r+r]. Thus to obtain the
radial distribution function (RDF), the sampled number needs to be
normalized by the volume ~4r2r (as well as density)
gion water (r ) 
N (r )
4r 2 r
Radial distribution functions exhibit oscillations, where each maxima is
identified with a coordination number. They are obtained by integrating
g(r) between two neighbouring minima, e.g. denoting the first minimum by
rmin1, the first coordination number is given by
N1 
rmin1
2
4

r
 g (r )dr
0
Similarly, the second coordination number can be found from
N2 
rmin 2
2
4

r
 g (r )dr
rmin1
where rmin2 denotes the second minimum in the RDF.
RDF for a Lennard-Jones liquid
Example: Binding of charybdotoxin to KcsA potassium channel
Important charge int’s:
K27 – Y78 (ABCD)
K11
R34
R34 – D80 (D)
R25 – D64, D80 (C)
K11 – D64 (B)
K27
Pair distribution functions for charge interactions
10
RMSD of charybdotoxin (ChTX)
11
Configurational changes in ChTx
during umbrella sampling simul.
The N-terminal moves away
and the helix is distorted
(L16-V20 H-bond is broken)
Changes are induced by the
tidal forces arising from the
competition between the
applied harmonic forces pulling
the toxin out and the Coulomb
forces pulling it in. Such
simulation artefacts has to be
avoided using restraints.
Conformational analysis
In proteins, the bond lengths and bond angles are more or less fixed.
Thus we are mainly interested in conformations of the torsional angles.
As the shape of a protein is determined by the backbone atoms, the
torsional angles, f (N−Ca) and y (C−Ca), are of particular interest.
These are conveniently analyzed using the Ramachandran plots.
Conformational changes in a protein during MD simulations can be most
easily revealed by plotting these torsional angles as a function of time.
Also of interest are the torsional angles of the side chains, c1, c2, etc.
Each side chain has several energy minima (called rotamers), which are
separated by low-energy barriers (~ kT). Thus transitions between
different rotamers is readily achievable, and they could play functional
roles, especially for charged side chains.
(Sasisekharan)
y
a-helix
f ~ 57o
y ~ 47o
f
Bound atoms or groups of atoms fluctuate around a mean value.
Most of the high-frequency fluctuations (e.g. H atoms), do not directly
contribute to the protein function. Nevertheless they serve as “lubricant”
that enables large scale motions in proteins (e.g. domain motions) that
do play a significant role in their function. As mentioned earlier, large
scale motions occur in a ~ ms to s time scale, hence are beyond the
present MD simulations. (Normal mode analysis provides an alternative.)
But torsional fluctuations occur in the ns time domain, and can be studied
in MD simulations using time correlation functions, e.g.
Cf (t ) 
f (0)  f (t )
f (0)
2
These typically decay exponentially and such fluctuations can be
described using the Langevin equation.
Time correlation functions & transport coefficients
At room temperature, all the atoms in the simulation system are in a
constant motion characterized by their average kinetic energy: (3/2)kT.
Free atoms or molecules diffuse according to the relationship
r 2 (t )  6 Dt
While this relationship can be used to determine the diffusion coefficient,
more robust results can be obtained using correlation functions, e.g.,
D is related to the velocity autocorrelation function as

1
D   v (0)  v (t ) dt
30
Similarly, conductance of charged particles is given by the Kubo formula
1
 
3VkT

 J (0)  J (t ) dt,
0
J   qi vi
i
(current)
Complex structure determination
Few proteins perform their function alone. In most cases, they form
complexes with other proteins to become functional, or binding of a ligand
to a protein triggers its function.
• Hemoglobin (oxygen carrying protein) is a complex of four proteins
• Ligand-gated ion channels are opened by binding of a ligand
• Most drugs act by binding to a receptor protein, moderating its function
Structures of many biomolecules have been determined using x-ray
diffraction or nuclear magnetic resonance (NMR). However, only in a
handful of cases, structures of complexes have been determined
experimentally. Accurate determination of complex structures is essential
for understanding the function of proteins, and will have many
applications in pharmacology (rational drug design) and biotechnology.
Basic methods for complex structure determination
1. Docking and scoring: The simplest and the fastest method. Typically
an analytical function is used to estimate the interaction energy of the
protein-ligand complex for various positions, orientations and
conformations of the ligand, and its minimum value is used to find the
binding pose of the ligand. Limited accuracy, especially in scoring.
2. Brownian dynamics (BD) simulations: Trajectory of the ligand is
followed in BD simulations (implicit water) until it binds to the protein.
Slower than docking and provides no apparent advantages.
3. MD simulations: Trajectory of the ligand is followed in MD simulations
until it binds to the protein. Accurate but too slow to be useful in practice.
4. Docking combined with MD: Poses predicted by docking are refined in
MD simulations. Currently the most optimal method for this purpose.
Docking methods
AutoDock: The most popular docking method. Calculates the proteinligand interaction energies over a grid.
Scoring function: electrostatics + Lennard-Jones + hydrogen bond +
solvation + entropic term.
Protein is taken as rigid while ligand is allowed torsional flexibility.
Works well for small molecules (e.g. not good for > 30 AA peptide)
ZDOCK: Treats both protein and ligand as rigid, and hence can handle
much larger ligands. Searches for the binding position by optimizing
shape complementarity, electrostatics, and dehydration energies using a
Fast Fourier Transform (FFT) algorithm on a grid.
HADDOCK: Retains ligand flexibility and works for larger peptides. Much
more sophisticated (e.g. uses of exp. restraints, allows ensemble
docking, etc) but also much slower than the others.
Dynamical quantities – Free energy calculations
• Free energy of a simulation system
• Free energy difference between two states
• Calculation of free energy differences using free energy perturbation
(FEP) and thermodynamic integration (TI) methods; alchemical
transformation between two states
• Path dependent methods; potential of mean force along a reaction
coordinate; umbrella sampling with weighted histogram analysis
method (WHAM); steered MD with Jarzysnki’s equation.
• Calculation of binding free energies using path independent and path
dependent methods.
Free energy of a system
Free energy is the most important quantity that characterizes a dynamical
process. One can use either the Helmholtz or Gibbs free energy for this
purpose, which are given by
A = U – TS
(Helmholtz)
G = H – TS = U + pV – TS
(Gibbs)
In a typical biomolecular process, the volume and pressure do not
change, hence it is appropriate to use the Helmholtz free energy A.
Unfortunately, G has been used instead of A in the literature although the
calculated quantity is the Helmholtz free energy.
Calculation of the absolute free energies is difficult in MD simulations.
However, free energy differences can be estimated more easily and
several methods have been developed for this purpose.
Free energy perturbation (FEP)
The starting point for most approaches is Zwanzig’s perturbation formula for
the free energy difference between two states A and B, which are described
by the Hamiltonians HA and HB . Assuming that A and B differ by a small
perturbation, the free energy difference is given by:
G ( A  B )  GB  G A   kT ln exp ( H B  H A ) / kT
G ( B  A)  G A  GB   kT ln exp ( H A  H B ) / kT
A
B
G ( A  B )   G ( B  A)
The equality should hold if there is sufficient sampling.
However, if the two states are not similar enough, this is difficult to achieve
in finite simulations, and there will be a large hysteresis effect.
(i.e. the forward and backward results will be very different)
Derivation of the perturbation formula
From statistical mechanics, the Helmholtz free energy is given by
G  kT ln Z ,
Z
1
h
3N
N!
 H / kT
e
dqdp

(Z: partition function)
Free energy difference between two states A and B can be written as
G ( A  B)  GB  G A  kT ln
 H B / kT
e
dqdp

ZB
 kT ln  H / kT
ZA
 e A dqdp
Inserting e H A / kTeH A / kT in the in top integral and using the probability
e  H A / kT
PA (q, p)   H / kT
A
e
dqdp

G ( A  B)  kT ln  exp ( H B  H A ) / kT PA (q, p ) dqdp
 kT ln exp ( H B  H A ) / kT
A
FEP with alchemical transformation
To obtain accurate results with the perturbation formula in finite time, the
energy difference between the states should be a few kT, which is not
satisfied for most biomolecular processes. To deal with this problem, one
introduces a hybrid Hamiltonian
H (l )  (1  l ) H A  lH B
and performs the transformation from A to B gradually by changing the
parameter l from 0 to 1 in small steps. That is, one divides [0,1] into n
subintervals with {li, i = 0, n}, and for each li value, calculates the free
energy difference from the ensemble average
G(li  li 1 )  kT ln exp[( H (li 1 )  H (li ))]/ kT
li
24
The total free energy change is then obtained by summing the contributions
from each subinterval
n 1
G (0  1)   G (li  li 1 )
i 0
The number of subintervals is chosen such that the free energy change at
each step is < a few kT, otherwise the method may lose its validity. Points
to be aware of:
1. Most codes use equal subintervals for li. But the changes in Gi
are usually highly non-linear. One should try to choose li such that
Gi remains around a few kT for all values, e.g., for charge
interactions, using exponential spacing gives better results.
2. The simulation times (equilibration + production) have to be chosen
carefully. It is not possible to extend them in case of non-convergence
25
(have to start over).
Thermodynamic integration (TI)
Another way to obtain the free energy difference is to integrate the
derivative of the hybrid Hamiltonian H(l):
H  H / kT
e
dqdp

dG
H
 l  H / kT

dl
l
dqdp
e
1

l
G  
0
H (l )
l
l
dl
This integral is evaluated most efficiently using a Gaussian quadrature.
In typical calculations for ions, 7-point quadrature is found to be sufficient.
(But one should check that other quadratures give the same result)
The advantage of TI over FEP is that the production run can be extended
as long as necessary and the convergence of the free energy can be
monitored (when the cumulative G flattens, it has converged).
26
Example : Binding of a K+ ion to a protein
Initial state (A): K+ ion in bulk, a water molecule at the binding site.
Final state (B): K+ ion at the binding site, water in place of the ion.
The free energy difference between A and B gives free energy gain in
translocating a K+ ion from bulk to the binding site of the protein. Note that
this is not the binding free energy – that includes entropic changes as well.
K+
W
protein
A
+ K+ (bulk)
protein
B
G( A  B)  G( B)  G( A)
+ W (bulk)
Example: Free energy change due to mutation of a ligand
A very common question is how a mutation in a ligand (or protein) changes
the free energy of the protein-ligand complex.
GB  G A  Gbs ( A  B )  Gbulk ( A  B )
+
+
GA
Gbulk(AB)
+
GB
Thermodynamic cycle
wild type
Gbs(AB)
mutant
Standard binding free energy from FEP/TI methods
The total binding free energy of a ligand can be expressed as
Gb  Gtr  Grot  Gres  Gint
 (2e) 3 / 2  x y z 
Gtr  kT ln 
, V0  1660  3
V0


Grot
 (2e) 3 / 2  f1 f 2 f 3 
 kT ln 

2
8


bulk
site
Gres  Gres
 Gres
The various sigma’s are the translational and rotational rmsd’s of the ligand
in the binding site. The last term is the interaction (or translocation) energy,
which is calculated using FEP or TI.
29
Path dependent methods
Methods such as FEP and TI are independent of the physical path
connecting the two states. As seen in the previous example, calculation
of free energy difference for binding of a ligand involves annihilation of
the ligand in bulk and its creation in the binding site. Such calculations
can be done fairly accurately for small, uncharged ligands, but FEP/TI
methods fail for large or charged ligands. The reason is that the hydration
energies of such ligands are quite large and the errors committed in each
(bulk and binding site) calculation is usually larger than the free energy
difference between the two states.
To calculate the binding free energy of such ligands, one has to use path
dependent methods, where the ligand is physically moved from the
binding site to bulk along a reaction coordinate, and its free energy profile
along this path is calculated using various methods.
Potential of mean force (PMF)
Interaction of two molecules in solution is described by their PMF, W(r),
whose gradient gives the average force acting between them.
The PMF can be determined from MD simulations by sampling the relative
positions of the two molecules. Assuming one is at the origin, the density of
the second will be given by (r). From Boltzmann equation, the density is
given by
 (r)   (r0 ) e[W (r )W (r0 )] / kT
Here r0 is a reference point in bulk where W is assumed to vanish.
Inverting the Boltzmann equation yields
  (r )
W (r )  W (r0 )  kT ln
  (r0 )



Thus by sampling the density along a reaction coordinate, one can
determine the PMF between two molecules.
Umbrella sampling simulations
In general, position of a molecule cannot be adequately sampled at
high-energy points in finite simulations. To counter that, one introduces
harmonic potentials, which restrain the particle at desired points, and then
unbias its effect (this can be done easily for a harmonic biasing potential).
For convenience, one introduces umbrella potential at regular intervals
along a reaction coordinate (e.g. ~0.5 Å). After adequate sampling of each
window, the raw density data are unbiased, and the PMFs obtained from
all the windows are optimally combined using the Weighted Histogram
Analysis Method (WHAM).
For a simple introduction to the WHAM method and how it is implemented
in practice, see Alan Grossfield’s web page (search for “wham method” in
google and download the pdf file).
Points to consider in umbrella sampling
Two main parameters in umbrella sampling are the force constant, k
and the distance between windows, d. In bulk, the position of the
ligand will have a Gaussian distribution given by
1
 ( z  z0 ) 2
P( z ) 
e
2 
2 2
,
z  z 0 ,   k BT / k
The overlap between two Gaussian distributions separated by d
% overlap  1  erf (d / 8 )
The parameters should be chosen such that the percentage overlap
satisfies, 10-20% > % overlap > 5%
If the overlap is too small, PMF will have discontinuities.
If it is too large, simulations are not very efficient.
Standard binding free energy from PMF
Experimentally, the standard binding free energy is determined by
measuring the dissociation constant Kd (or IC50), which is defined as the
ligand concentration at which half the ligands are bound. The standard
binding free energy is related to the binding constant Keq = 1/Kd via
Gb  kT lnC0 K d   kT lnC0 K eq 
Where C0 is the standard concentration of 1 M (i.e., 1/ C0 = 1660 Å3)
The binding constant is determined by integrating the 3-D PMF over the
binding site
K eq 
W (r )
e

kT 3
d r
site
Because it virtually impossible to determine the PMF in a 3-D grid, it is
34
necessary to invoke a 1-D approximation to the PMF.
1-D approximation of PMF
In most cases, there is a well-defined path from the binding site to bulk,
along which the PMF is determined. Thus the simplest approximation is to
perform the integral along this path assuming a flat-bottomed potential in the
transverse directions which will make a constant contribution. Taking the
path along the z-axis and assuming a cross sectional area of R2 for the
flat-bottomed potential, we obtain
K eq  R 2
bulk
W ( z )
e

kT
dz
bs
The most sensible choice for R is to determine it from the transverse
fluctuations of the COM of the ligand in the binding pocket. In fact, this
choice can be rigorously justified using the simulation data without assuming
a flat-bottomed potential.
35
Justification of the 1-D approximation of PMF
Distribution of the x and y components of the peptide COM follows a
Gaussian distribution in all of the PMF windows (Kv1.3-HsTx1 complex)
Justification of the 1-D approximation of PMF
Gaussian distribution in x-y directions:
 ( x, y )   0 e
( x 2  y 2 ) / 2 2
Comparing with the Boltzmann distribution due to W(x,y)
 ( x, y)  0 eW ( x, y ) / kBT
shows that the transverse PMF can be represented by a harmonic pot.
1
k T
W ( x, y )  k ( x 2  y 2 ) with k  B2
2

Assuming W(x,y,z) = W(x,y) + W(z), the x-y integrals separate from z
e
W ( x , y ) k BT
dxdy   e
 k ( x 2  y 2 ) 2 k BT
1-D formula for Keq follows upon using
dxdy  2
k BT
 2 2
k
R 2   x2   y2  2 2
37
Justification of the 1-D approximation of PMF
For the validity of the 1-D
approximation, the average
transverse position of the ligand
and its fluctuations should not
deviate much from the binding
site.
B) The average transverse
position of the ligand in each
umbrella window.
C) The R values (fluctuations)
in each umbrella window.
38
Steered MD (SMD) simulations and Jarzynski’s equation
Steered MD is a more recent method where a harmonic force is applied
to the COM r of a peptide and the reference point r0 of this force is
pulled with a constant velocity v. It has been used to study binding of
ligands and unfolding of proteins. The discovery of Jarzynski’s equation
in 1997, enabled determination of PMF from SMD, which boosted its
applications.
f
Work done by
the harmonic force
Jarzynski’s equation:
W   F.ds , F  k[r  (r0  vt )]
i
e G / kT  e W / kT
This method seems to work well in simple systems and when G is large
but beware of its applications in complex systems!