Structure Prediction and Modeling of Biological Macromolecules
Jürgen Sühnel
[email protected]
Leibniz Institute for Age Research, Fritz Lipmann Institute (FLI)
Jena Centre for Bioinformatics (JCB)
Jena Centre for Systems Biology of Ageing (JenAge)
Jena / Germany
http://www.fli-leibniz.de/groups/suehnel_3D_en.php
Outline
Proteins
– Secondary structure
– 3D structure
  • Modeling by homology (Comparative modeling)
  • Fold recognition (Threading)
  • Ab initio prediction
    – Rule-based approaches
    – Lattice models
    – Simulating the time dependence of folding (Molecular Dynamics)
  • Refinement
  • Exploring the effect of single amino acid substitutions
  • Ligand effects on protein structure and dynamics (induced fit)
Nucleic Acids
– Secondary structure
– 3D structure
PDB Content Growth
Year    Yearly    Total
1980        16       70   structures
1993       695     1582   structures (~ 2 new structures per day)
2003      4167    23597   structures (~ 11 new structures per day)
2009      7396    62191   structures (~ 20 new structures per day)
2010      7923    70114   structures (~ 22 new structures per day)
2011      8123    78237   structures (~ 22 new structures per day)
(experimental structures only)
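The per-day figures can be checked from the yearly deposition counts. A minimal sketch, assuming a 365-day year and rounding to the nearest whole structure; the function name is illustrative only:

```python
# Yearly PDB depositions as listed on the slide.
YEARLY = {1980: 16, 1993: 695, 2003: 4167, 2009: 7396, 2010: 7923, 2011: 8123}
# Cumulative totals as listed on the slide.
TOTAL = {2009: 62191, 2010: 70114, 2011: 78237}


def structures_per_day(yearly_count, days_per_year=365):
    """Average number of new structures per day (assumes a 365-day year)."""
    return yearly_count / days_per_year


rates = {year: round(structures_per_day(n)) for year, n in YEARLY.items()}

# Consistency check: each total equals the previous total plus the yearly count.
totals_consistent = all(
    TOTAL[year] == TOTAL[year - 1] + YEARLY[year] for year in (2010, 2011)
)
```

The rounded rates agree with the "~ 2", "~ 11", "~ 20" and "~ 22" annotations for 1993 to 2011.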
PDB Content Growth
May 29, 2012
UniProt/SwissProt: Growth Rate
29.05.2012
Release 2012_05 of 16-May-12 of UniProtKB/Swiss-Prot contains
536029 sequence entries, comprising 190235160 amino acids abstracted from
209686 references.
UniProt/TrEMBL: Growth Rate
29.05.2012
Release 2012_05 of 16-May-2012 of UniProtKB/TrEMBL contains
22128511 sequence entries, comprising 7226807757 amino acids.
Swiss-Prot/TrEMBL: Amino Acid Composition
Swiss-Prot
TrEMBL
15-Jan-2008
Protein Structure Prediction
Structural Genomics
Structural genomics aims to determine the three-dimensional structures of all proteins of a
given organism, by experimental methods such as X-ray crystallography and NMR spectroscopy
or by computational approaches such as homology modelling.
As opposed to traditional structural biology, the determination of a protein structure through
a structural genomics effort often (but not always) comes before anything is known regarding
the protein function. This raises new challenges in structural bioinformatics, i.e. determining protein function
from its 3D structure.
One of the important aspects of structural genomics is the emphasis on high-throughput determination of
protein structures. This is performed in dedicated centers of structural genomics.
While most structural biologists pursue structures of individual proteins or protein groups, specialists in
structural genomics pursue structures of proteins on a genome-wide scale. This implies large-scale
cloning, expression and purification. One main advantage of this approach is economy of scale.
On the other hand, the scientific value of some resultant structures is at times questioned.
Protein Structure Prediction
A Good Protein Structure
• Minimizes disallowed torsion angles
• Maximizes number of hydrogen bonds
• Minimizes interstitial cavities or spaces
• Minimizes number of “bad” contacts
• Minimizes number of buried charges
Protein Structure Prediction – CAFASP Contest
http://www.cs.bgu.ac.il/~dfischer/CAFASP5/
Protein Structure Prediction – CASP Contest
http://predictioncenter.gc.ucdavis.edu/
Lysozyme
Lysozyme – 5lyz
Lysozyme – 5lyz: Information from the JenaLib Atlas Page
Lysozyme – 5lyz: Information from the JenaLib Atlas Page - ProSite
Lysozyme – 5lyz: PROSITE Signature
PROMOTIF Secondary Structure Analysis – 5lyz
Protein Backbone Torsion Angles
D. W. Mount: Bioinformatics, Cold Spring Harbor Laboratory Press, 2001.
Sidechain Torsion/Dihedral Angles
PROMOTIF Secondary Structure Analysis – 5lyz
Chou-Fasman Secondary Structure Prediction
Amino Acid Propensities
From a database of experimental 3D structures, calculate the
propensity for a given amino acid to adopt a certain type of
secondary structure

Example:
N(Ala) = 2,000; N(tot) = 20,000; N(Ala, helix) = 568; N(helix) = 4,000.
P(Ala,helix) = [N(Ala,helix)/N(helix)] / [N(Ala)/N(tot)]
P(Ala,helix) = [568/4,000] / [2,000/20,000] = 1.42
Used in Chou-Fasman algorithm
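The propensity formula is a ratio of two frequencies, which a few lines of code make explicit. A minimal sketch using the counts from the example above; the function and argument names are illustrative only:

```python
def propensity(n_aa_in_state, n_state, n_aa, n_total):
    """Propensity of an amino acid for a secondary structure state.

    Ratio of the amino acid's frequency within the state to its
    overall frequency in the database.
    """
    freq_in_state = n_aa_in_state / n_state
    freq_overall = n_aa / n_total
    return freq_in_state / freq_overall


# Alanine / helix example from the slide.
p_ala_helix = propensity(n_aa_in_state=568, n_state=4000, n_aa=2000, n_total=20000)
```

A value above 1 means the residue occurs in that state more often than expected from its overall abundance.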
Chou-Fasman Secondary Structure Prediction
• Assign all of the residues in the peptide the appropriate set of parameters.
• Scan through the peptide and identify regions where 4 out of 6 contiguous residues have P(a-helix) > 100.
• That region is declared an alpha-helix. Extend the helix in both directions until a set of four contiguous
residues that have an average P(a-helix) < 100 is reached. That is declared the end of the helix.
If the segment defined by this procedure is longer than 5 residues and the average
P(a-helix) > P(b-sheet) for that segment, the segment can be assigned as a helix.
• Repeat this procedure to locate all of the helical regions in the sequence.
• Scan through the peptide and identify a region where 3 out of 5 of the residues have a value of
P(b-sheet) > 100. That region is declared as a beta-sheet. Extend the sheet in both directions
until a set of four contiguous residues that have an average P(b-sheet) < 100 is reached.
That is declared the end of the beta-sheet. Any segment of the region located by this procedure
is assigned as a beta-sheet if the average P(b-sheet) > 105 and the average P(b-sheet) > P(a-helix)
for that region.
• Any region containing overlapping alpha-helical and beta-sheet assignments is taken to be helical if the
average P(a-helix) > P(b-sheet) for that region. It is a beta-sheet if the average
P(b-sheet) > P(a-helix) for that region.
• To identify a bend at residue number j, calculate the following value
p(t) = f(j)f(j+1)f(j+2)f(j+3)
where the f(j) value is used for residue j, the f(j+1) value for residue j+1, the f(j+2) value for residue j+2 and
the f(j+3) value for residue j+3. If: (1) p(t) > 0.000075; (2) the average value for
P(turn) > 1.00 in the tetrapeptide; and (3) the averages for the tetrapeptide obey the inequality
P(a-helix) < P(turn) > P(b-sheet), then a beta-turn is predicted at that location.
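Two of the rules above, helix nucleation (4 out of 6 residues with P(a-helix) > 100) and the three bend conditions, can be sketched as follows. This is not a full Chou-Fasman implementation: helix extension, sheet detection and overlap resolution are left out. The six propensities are the values quoted for the lysozyme example in these slides, multiplied by 100 as an assumed scaling, and the bend inputs are made up purely to exercise the rule.

```python
def helix_nuclei(seq, p_helix, window=6, min_formers=4, threshold=100):
    """Return start indices of windows that satisfy the nucleation rule."""
    starts = []
    for i in range(len(seq) - window + 1):
        formers = sum(1 for aa in seq[i:i + window] if p_helix[aa] > threshold)
        if formers >= min_formers:
            starts.append(i)
    return starts


def is_turn(f_values, avg_turn, avg_helix, avg_sheet, cutoff=0.000075):
    """Apply the three bend conditions to one tetrapeptide."""
    p_t = 1.0
    for f in f_values:  # f(j), f(j+1), f(j+2), f(j+3)
        p_t *= f
    return p_t > cutoff and avg_turn > 1.00 and avg_helix < avg_turn > avg_sheet


# Helix propensities for six residue types (slide values x 100, assumed scaling).
P_HELIX = {"G": 57, "R": 98, "C": 70, "E": 139, "L": 141, "A": 142}

# Fragment of the lysozyme sequence used in the slides.
nuclei = helix_nuclei("GRCELAA", P_HELIX)

# Hypothetical bend frequencies and averages, not taken from any table.
turn = is_turn((0.1, 0.1, 0.1, 0.1), avg_turn=1.2, avg_helix=0.9, avg_sheet=0.8)
```

In the fragment, the window starting at G has only three helix formers, while the window starting at R has four and therefore nucleates a helix.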
Lysozyme – 5lyz: Chou-Fasman Secondary Structure Prediction
http://fasta.bioch.virginia.edu/fasta_www/chofas.htm
Lysozyme – 5lyz: Chou-Fasman Secondary Structure Prediction
Window   P(a-helix) per residue    Average
GRCE     (0.57|0.98|0.70|1.39)     0.91
RCEL     (0.98|0.70|1.39|1.41)     1.12
CELA     (0.70|1.39|1.41|1.42)     1.23
ELAA     (1.39|1.41|1.42|1.42)     1.41
http://fasta.bioch.virginia.edu/fasta_www/chofas.htm
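The tetrapeptide averages listed for GRCE, RCEL, CELA and ELAA come from a sliding window over the sequence. A minimal sketch, assuming the per-residue values are the helix propensities shown on the slide; only the six residue types that occur in the fragment are included:

```python
# Per-residue values as shown on the slide (assumed to be P(a-helix)).
P_HELIX = {"G": 0.57, "R": 0.98, "C": 0.70, "E": 1.39, "L": 1.41, "A": 1.42}


def window_averages(seq, table, size=4):
    """Average propensity for every window of `size` contiguous residues."""
    result = []
    for i in range(len(seq) - size + 1):
        window = seq[i:i + size]
        average = sum(table[aa] for aa in window) / size
        result.append((window, round(average, 2)))
    return result


averages = window_averages("GRCELAA", P_HELIX)
```

The averages rise steadily along the fragment, which is the signal that a helix-forming region begins here.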
Lysozyme – 5lyz: PhD/PROF Structure Prediction
PROF_sec: PROF predicted secondary structure: H=helix, E=extended (sheet), blank=other (loop)
          PROF = Profile network prediction Heidelberg
Rel_sec:  reliability index for PROF_sec prediction (0=low to 9=high)
SUB_sec:  subset of the PROFsec prediction, for all residues with an expected average accuracy > 82% (tables in header)
          NOTE: for this subset the following symbols are used:
          L: is loop (for which above ' ' is used)
          .: means that no prediction is made for this residue, as the reliability is: Rel < 5
O3_acc:   observed relative solvent accessibility (acc) in 3 states: b = 0-9%, i = 9-36%, e = 36-100%.
P3_acc:   PROF predicted relative solvent accessibility (acc) in 3 states: b = 0-9%, i = 9-36%, e = 36-100%.
Rel_acc:  reliability index for PROFacc prediction (0=low to 9=high)
SUB_acc:  subset of the PROFacc prediction, for all residues with an expected average correlation > 0.69 (tables in header)
          NOTE: for this subset the following symbols are used:
          I: is intermediate (for which above ' ' is used)
          .: means that no prediction is made for this residue, as the reliability is: Rel < 4
http://cubic.bioc.columbia.edu/predictprotein/submit_def.html#top
Lysozyme – 5lyz: PhD/PROF Structure Prediction, BLAST
http://cubic.bioc.columbia.edu/predictprotein/submit_def.html#top
Lysozyme – 5lyz: PhD/PROF Structure Prediction
• Perform BLAST search to find local alignments
• Remove alignments that are “too close”
• Perform multiple alignments of sequences
• Construct a profile (PSSM) of amino-acid frequencies at each residue
• Use this profile as input to the neural network
• A second network performs “smoothing”
• The third level computes jury decision of several different instantiations of the first two levels.
http://cubic.bioc.columbia.edu/predictprotein/submit_def.html#top
PSSM
A PSSM, or Position-Specific Scoring Matrix, is a type of scoring matrix
used in protein BLAST searches in which amino acid substitution scores
are given separately for each position in a protein multiple sequence
alignment. Thus, a Tyr-Trp substitution at position A of an alignment may
receive a very different score than the same substitution at position B.
This is in contrast to position-independent matrices such as the PAM
and BLOSUM matrices, in which the Tyr-Trp substitution receives the
same score no matter at what position it occurs.
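The position dependence can be illustrated with a toy PSSM built from a small alignment. A minimal sketch under stated assumptions: the three-column alignment is invented, the background frequencies are uniform (1/20), and a simple pseudocount is added. Real tools such as PSI-BLAST use sequence weighting and more elaborate pseudocount schemes.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Log-odds score per position and residue (uniform background assumed)."""
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for pos in range(len(alignment[0])):
        # Count residues in this column; gap characters are ignored.
        counts = Counter(seq[pos] for seq in alignment if seq[pos] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        column = {
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        }
        pssm.append(column)
    return pssm


def score(seq, pssm):
    """Sum of position-specific scores for an ungapped sequence."""
    return sum(pssm[i][aa] for i, aa in enumerate(seq))


# Hypothetical alignment: column 0 is mostly Tyr, column 1 mostly Lys.
toy_alignment = ["YKW", "YRW", "YKF", "WKW"]
pssm = build_pssm(toy_alignment)
```

Here Trp scores positively in column 0, where it was observed once, and negatively in column 1, where it never occurs, whereas a PAM or BLOSUM matrix would give one fixed score for the same substitution at every position.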
PSI-BLAST
Position specific iterative BLAST (PSI-BLAST) refers to a feature of BLAST 2.0
in which a profile (or position specific scoring matrix, PSSM) is constructed
(automatically) from a multiple alignment of the highest scoring hits in an initial
BLAST search.
The PSSM is generated by calculating position-specific scores for each position in
the alignment. Highly conserved positions receive high scores and weakly conserved
positions receive scores near zero.
The profile is used to perform a second (etc.) BLAST search and the results of each
"iteration" are used to refine the profile.
This iterative searching strategy results in increased sensitivity.
Conserved Domain Database
http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
PsiPred
PSIPRED is a simple and reliable secondary structure prediction method, incorporating
two feed-forward neural networks which perform an analysis on output obtained from
PSI-BLAST (Position Specific Iterated - BLAST).
Version 2.0 of PSIPRED includes a new algorithm which averages the output from up to
4 separate neural networks in the prediction process to further increase
prediction accuracy.
Using a very stringent cross validation method to evaluate the method's performance,
PSIPRED 2.0 is capable of achieving an average Q3 score of nearly 78%.
Predictions produced by PSIPRED were also submitted to the CASP4 server and
assessed during the CASP4 meeting, which took place in December 2000 at Asilomar.
PSIPRED 2.0 achieved an average Q3 score of 80.6% across all 40 submitted target
domains with no obvious sequence similarity to structures present in PDB,
which placed PSIPRED in first place out of 20 evaluated methods
(an earlier version of PSIPRED was also ranked first in CASP3 held in 1998).
http://bioinf.cs.ucl.ac.uk/psipred/psiform.html
Comparing Secondary Structure Prediction Results
PsiPred
Chou-Fasman
Phd/PROF
Protein Secondary Structure Prediction - Summary
1st Generation - 1970s
• Chou & Fasman, Q3 = 50-55%
2nd Generation - 1980s
• Qian & Sejnowski, Q3 = 60-65%
3rd Generation - 1990s
• PHD, PSIPRED, Q3 = 70-80%
Features of the new methods:
• Taking into account evolutionary information
• Neural networks
Failures:
• Nonlocal sequence interactions
• Wrong prediction at the ends of H/E
Q3 – Percentage of correctly assigned amino acids in a test set
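Q3 is a per-residue accuracy over the three states helix (H), strand (E) and other (C). A minimal sketch, assuming the predicted and observed states are given as equal-length strings; the two example strings are invented:

```python
def q3(predicted, observed):
    """Percentage of residues whose predicted state matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Hypothetical 10-residue example with two wrong assignments.
accuracy = q3("HHHHEEEECC", "HHHCEEEECH")
```

Both errors in the example sit at the ends of secondary structure elements, the failure mode listed above.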
Protein Structure Prediction
http://speedy.embl-heidelberg.de/gtsp/flowchart2.html
Modeling by Homology (Comparative Modeling)
http://salilab.org/modeller/
Modeling by Homology (Comparative Modeling)
http://modbase.compbio.ucsf.edu/modbase-cgi-new/search_form.cgi
Modeling by Homology (Comparative Modeling)
http://swissmodel.expasy.org/
Modeling by Homology (Comparative Modeling)
Comparative modeling predicts the three-dimensional structure of a given
protein sequence (target) based primarily on its alignment to one or more proteins
of known structure (templates).
The prediction process consists of
• fold assignment,
• target-template alignment,
• model building, and
• model evaluation and refinement.
The number of protein sequences that can be modeled and the accuracy of
the predictions are increasing steadily because of the growth in the number of
known protein structures and because of the improvements in the modeling
software.
Further advances are necessary in recognizing weak sequence-structure
similarities, aligning sequences with structures, modeling of rigid body shifts,
distortions, loops and side chains, as well as detecting errors in a model.
Despite these problems, it is currently possible to model with useful accuracy
significant parts of approximately one third of all known protein sequences.
http://salilab.org/modeller/
Threading – Sequence Structure Alignment
Methods of protein fold recognition or threading or sequence-structure alignment
attempt to detect similarities between protein 3D structures that are not accompanied
by any significant sequence similarity.
The unifying theme of these approaches is to try and find folds that are
compatible with a particular sequence. Unlike sequence-only comparison,
these methods take advantage of the extra information made available by
3D structure information.
Rather than predicting how a sequence will fold, they predict how well a fold
will fit a sequence.
Fold Recognition (Threading) – Why ?
• Secondary structure is more conserved than primary structure
• Tertiary structure is more conserved than secondary structure
• Therefore very remote relationships can be better detected through 2° or 3° structural homology instead of
sequence homology
Threading
• Use protein sequence alignment (modeling by homology)
• Use 3D profiles
• How buried, partly buried, exposed are amino acids?
• What fraction of the surrounding environment is polar or apolar?
• Use contact potentials
Fold Recognition
• Database of 3D structures and sequences
– Protein Data Bank (or non-redundant subset)
• Query sequence
– Sequence < 25% identity to known structures
• Alignment protocol
– Dynamic programming
• Evaluation protocol
– Distance-based potential or secondary structure
• Ranking protocol
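The alignment, evaluation and ranking steps can be sketched with a toy threading scheme. Everything specific here is an assumption made for illustration: each template position is reduced to a buried (b) or exposed (e) environment, the compatibility score is +1 or -1, the gap penalty is -2, and the template names are invented. Real fold recognition uses 3D profiles or distance-based potentials.

```python
HYDROPHOBIC = set("AVLIMFWC")  # crude, assumed classification


def compatibility(residue, environment):
    """+1 if a hydrophobic residue is buried or a polar one exposed, else -1."""
    fits = (environment == "b") == (residue in HYDROPHOBIC)
    return 1 if fits else -1


def thread_score(seq, environments, gap=-2):
    """Best global alignment score of a sequence to a template (dynamic programming)."""
    n, m = len(seq), len(environments)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + compatibility(seq[i - 1], environments[j - 1]),
                dp[i - 1][j] + gap,   # residue aligned to a gap
                dp[i][j - 1] + gap,   # template position left unfilled
            )
    return dp[n][m]


# Hypothetical templates described only by their burial pattern.
templates = {"fold_A": "bebe", "fold_B": "ebeb"}
ranking = sorted(templates, key=lambda t: thread_score("LKLK", templates[t]), reverse=True)
```

The ranking orders folds by how well they fit the query sequence, not by how the sequence would fold on its own.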
Fold Recognition
http://www.sbg.bio.ic.ac.uk/~3dpssm/index2.html
Ab Initio Prediction
• Predicting the 3D structure without any “prior knowledge”
• Used when homology modelling or threading have failed
(no homologues are evident)
• Equivalent to solving the “Protein Folding Problem”
• Still a research problem
Ab Initio Prediction
http://robetta.bakerlab.org
Ab Initio Prediction
Simons, Strauss, Baker. J. Mol. Biol. 2001, 306, 1191-1199.
Ab Initio Prediction – Lysozyme (5lyz)
http://rosettadesign.med.unc.edu/
Protein Model Portal
http://www.proteinmodelportal.org/
Simulation of Protein Folding
Thousand trillion FLOPs
IBM Blue Gene Project | System-on-a-Chip Approach
~ 65,000 processors
teraflop – a trillion floating point operations per second
Quantum Chemistry
Quantum-chemical Calculations: Telomeric DNA
Molecular Dynamics
Simulation of Protein Folding – Molecular Dynamics
AMBER
GROMOS
CHARMM
TINKER
Molecular Mechanics (Force Field)
http://cmm.info.nih.gov/modeling/guide_documents/molecular_mechanics_document.html
How Do We Get the Parameters ?
Experimental Data
(Examples: Geometrical Parameters)
Quantum-chemical Calculations
(Examples: Charges)
Geometry Optimization
Molecular Dynamics Simulation
Protein Capsid Of Filamentous Bacteriophage Ph75 From Thermus Thermophilus
1HGV, extended structure
1HGV, actual structure
1HGV, 61% helix, 1.928 ns
1HGV, 75% helix, 3.428 ns
Images created using VMD (Visual Molecular Dynamics) (Humphrey, W., Dalke, A. and
Schulten, K., 1996. VMD - Visual Molecular Dynamics. Journal of Molecular Graphics, 14,
pp. 33-38).
Optimization Methods – Newton-Raphson Methods
g - gradient
h - Hessian
Optimization Methods – Steepest Descent
Steepest descent
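The difference between the two update rules can be seen on a one-dimensional harmonic bond term. A minimal sketch, assuming the energy E(r) = 0.5 k (r - r0)^2 with made-up parameters k and r0; a steepest-descent step moves against the gradient g, while a Newton-Raphson step divides g by the Hessian h:

```python
def gradient_and_hessian(r, k=300.0, r0=1.5):
    """First and second derivative of E(r) = 0.5 * k * (r - r0)**2."""
    g = k * (r - r0)
    h = k
    return g, h


def steepest_descent(r, step=0.001, n_steps=200):
    """Repeatedly move against the gradient with a fixed step size."""
    for _ in range(n_steps):
        g, _ = gradient_and_hessian(r)
        r -= step * g
    return r


def newton_raphson_step(r):
    """One Newton-Raphson update: r_new = r - g / h."""
    g, h = gradient_and_hessian(r)
    return r - g / h


r_sd = steepest_descent(2.0)
r_nr = newton_raphson_step(2.0)
```

For a purely quadratic energy a single Newton-Raphson step reaches the minimum, whereas steepest descent approaches it gradually. On real force-field surfaces the Hessian is expensive to compute, which is one reason gradient-only methods are widely used.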
Optimization Methods – Conjugate Gradients Method
Molecular Dynamics Simulation
amber.scripps.edu
Molecular Dynamics Simulation
Molecular Dynamics Simulation – GROMOS Package
www.gromos.net
Molecular Dynamics Packages
www.charmm.org
Molecular Dynamics Packages
dasher.wustl.edu/ffe/
Visualizing and Analyzing Molecular Dynamics Simulations
www.ks.uiuc.edu/Research/vmd/
Folding Surface for Lysozyme
Dobson, Sali, Karplus, Angew. Chem. Int. Ed. 1998, 37, 868.
Protein Folding States
Dobson, Sali, Karplus, Angew. Chem. Int. Ed. 1998, 37, 868.
Monitoring Protein Folding by Experimental Methods
Dobson, Sali, Karplus, Angew. Chem. Int. Ed. 1998, 37, 868.
Monitoring Protein Folding by Experimental Methods
Plaxco, Dobson, Curr. Opin. Struct. Biol. 1996, 6, 630.
Protein Folding by Molecular Dynamics
Villin headpiece domain
(PDB code: 1vii)
Actin binding site highlighted
36 amino acids
Protein Folding by Molecular Dynamics
Radius of Gyration
In a globular protein the radius of gyration Rg can be predicted with reasonable
accuracy from the relationship
$R_g(\mathrm{pred}) = 2.2\,N^{0.588}$
where N is the number of amino acids.
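The predicted value can be compared with the radius of gyration computed from a set of coordinates, for example along a folding trajectory. A minimal sketch: the prefactor and exponent are taken from the slide as-is, the coordinate-based calculation assumes equal masses for all points, and the two-point test structure is invented. The slide does not state units; Ångström is presumably intended.

```python
import math


def predicted_rg(n_residues, prefactor=2.2, exponent=0.588):
    """Rg from chain length, using the constants given on the slide."""
    return prefactor * n_residues ** exponent


def radius_of_gyration(coords):
    """Root-mean-square distance of points from their centroid (equal masses assumed)."""
    n = len(coords)
    cx = sum(x for x, _, _ in coords) / n
    cy = sum(y for _, y, _ in coords) / n
    cz = sum(z for _, _, z in coords) / n
    squared = sum((x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2 for x, y, z in coords)
    return math.sqrt(squared / n)


# Villin headpiece has 36 amino acids.
villin_rg_pred = predicted_rg(36)

# Invented two-point structure, one unit on either side of the origin.
toy_rg = radius_of_gyration([(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)])
```

In a folding simulation, the computed Rg is typically monitored over time as a measure of compaction.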
Protein Folding by Molecular Dynamics
Statistical Potentials
A statistical potential or knowledge-based potential is an energy function derived from an analysis of
known protein structures.
They are mostly applied to pairwise amino acid interactions. The statistical potential assigns to each
possible pair of amino acids a weight or score or energy.
Statistical potentials are applied to protein structure prediction and to protein folding.
Their physical interpretation is highly disputed. Nevertheless, they have been applied with great success,
and do have a rigorous probabilistic justification.
Thomas, Dill, J. Mol. Biol. 1996, 257, 457-469
Statistical Potentials
Boltzmann distribution:
The Boltzmann distribution, applied to a specific pair of amino acids, is given by:
$$P(r) = \frac{1}{Z} e^{-F(r)/kT}$$
where r is the distance, k is the Boltzmann constant, T is the temperature and Z is the partition function, with
$$Z = \int e^{-F(r)/kT} \, dr$$
The quantity F(r) is the free energy assigned to the pairwise system. Simple rearrangement results in the inverse
Boltzmann formula, which expresses the free energy F(r) as a function of P(r):
$$F(r) = -kT \ln P(r) - kT \ln Z$$
To construct a so-called Potential of Mean Force (PMF), one then introduces a so-called reference state with a
corresponding distribution QR and partition function ZR, and calculates the following free energy difference:
$$\Delta F(r) = -kT \ln \frac{P(r)}{Q_R(r)} - kT \ln \frac{Z}{Z_R}$$
The reference state typically results from a hypothetical system in which the specific interactions between the
amino acids are absent. The second term involving Z and ZR can be ignored, as it is a constant.
Statistical Potentials
In practice, P(r) is estimated from the database of known protein structures, while QR(r) typically results from
calculations or simulations. For example, P(r) could be the conditional probability of finding the Cβ atoms of a valine
and a serine at a given distance r from each other, giving rise to the free energy difference ΔF. The total free energy
difference of a protein, ΔFT, is then claimed to be the sum of all the pairwise free energies:
$$\Delta F_T = \sum_{i<j} \Delta F(r_{ij} \mid a_i, a_j) = -kT \sum_{i<j} \ln \frac{P(r_{ij} \mid a_i, a_j)}{Q_R(r_{ij} \mid a_i, a_j)}$$
where the sum runs over all amino acid pairs ai,aj (with i < j) and rij is their corresponding distance. It should be noted
that in many studies QR does not depend on the amino acid sequence.
Intuitively, it is clear that a low free energy difference indicates that the set of distances in a structure is more likely in
proteins than in the reference state. However, the physical meaning of these PMFs has been widely disputed since
their introduction. The main issues are the interpretation of this "potential" as a true, physically valid potential of mean
force, the nature of the reference state and its optimal formulation, and the validity of generalizations beyond pairwise
distances.
Statistical Potentials
w_ij(r) – interaction free energy
ρ_ij(r) – pair density
ρ* – reference pair density at infinite separation
Statistical potentials can be determined by
simply counting interactions of a specific type
in a dataset of experimental structures.
The distance dependence may or may not be taken
into account. If not, the interaction free energy is usually
called a contact potential. It represents an average over
distances shorter than some cutoff distance rc.
Thomas, Dill, J. Mol. Biol. 1996, 257, 457-469
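Counting contacts and converting them to energies with the inverse Boltzmann relation, ΔF = -kT ln(P/Q_R), can be sketched as follows. The observed and reference counts are invented for illustration, kT is set to 1 so energies are in units of kT, and the reference state is simply passed in as a second set of counts rather than derived from any particular model.

```python
import math


def inverse_boltzmann(p_observed, q_reference, kT=1.0):
    """Free energy difference from observed and reference probabilities."""
    return -kT * math.log(p_observed / q_reference)


def contact_potential(observed_counts, reference_counts, kT=1.0):
    """Contact energy per residue pair from raw contact counts."""
    n_obs = sum(observed_counts.values())
    n_ref = sum(reference_counts.values())
    return {
        pair: inverse_boltzmann(
            observed_counts[pair] / n_obs, reference_counts[pair] / n_ref, kT
        )
        for pair in observed_counts
    }


# Hypothetical contact counts within a cutoff distance rc.
observed = {("V", "V"): 70, ("V", "S"): 30}
reference = {("V", "V"): 50, ("V", "S"): 50}
potential = contact_potential(observed, reference)
```

Pairs seen more often than in the reference state receive a negative (favourable) energy, and pairs seen less often receive a positive one.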
Lattice Folding
Lattice Algorithm
• Red = hydrophobic, Blue = hydrophilic
• If Red is near empty space E = E+1
• If Blue is near empty space E = E-1
• If Red is near another Red E = E-1
• If Blue is near another Blue E = E+0
• If Blue is near Red E = E+0
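These scoring rules can be applied to a chain placed on a two-dimensional square lattice. A minimal sketch under stated assumptions: "near" means one of the four adjacent lattice sites, each Red-Red contact is counted once per pair, and residues that are direct chain neighbours do not count as contacts. The slide does not specify these details. H stands for Red (hydrophobic) and P for Blue (hydrophilic).

```python
NEIGHBOURS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def lattice_energy(sequence, coords):
    """Energy of a 2D lattice conformation under the rules listed above."""
    occupied = {pos: i for i, pos in enumerate(coords)}
    if len(occupied) != len(coords):
        raise ValueError("two residues occupy the same lattice site")
    energy = 0
    for i, (x, y) in enumerate(coords):
        for dx, dy in NEIGHBOURS:
            j = occupied.get((x + dx, y + dy))
            if j is None:
                # Empty neighbouring site: penalise Red, reward Blue.
                energy += 1 if sequence[i] == "H" else -1
            elif j > i + 1 and sequence[i] == "H" and sequence[j] == "H":
                # Non-bonded Red-Red contact, counted once per pair.
                energy -= 1
            # Blue-Blue and Blue-Red contacts contribute 0.
    return energy


# The same four-residue chain, extended and folded into a square.
extended = lattice_energy("HPPH", [(0, 0), (1, 0), (2, 0), (3, 0)])
folded = lattice_energy("HPPH", [(0, 0), (1, 0), (1, 1), (0, 1)])
```

Folding the chain so that the two hydrophobic ends touch lowers the energy, which is the driving force that a lattice search algorithm exploits.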