CSCE590/822 Data Mining Principles and Applications

CSCE555 Bioinformatics

Lecture 18: Protein Tertiary Structure Prediction
Meeting: MW 4:00PM-5:15PM SWGN2A21
Instructor: Dr. Jianjun Hu
Course page: http://www.scigen.org/csce555
University of South Carolina
Department of Computer Science and Engineering
2008
www.cse.sc.edu.
Outline
Experimental limitations of protein structure determination
Tertiary Structure Prediction

◦ Ab initio
◦ Homology modeling
◦ Threading
Experimental Protein Structure Determination

High-resolution structure determination
◦ X-ray crystallography (<1 Å)
◦ Nuclear magnetic resonance (NMR) (~1-2.5 Å)

Lower-resolution structure determination
◦ Cryo-EM (electron microscopy) (~10-15 Å)

Theoretical Models?
◦ Highly variable, but a few are equivalent to X-ray!
Tertiary Structure Prediction

The fold (tertiary structure) prediction problem can be
formulated as a search for the minimum-energy conformation
◦ Search space is defined by the psi/phi angles of the backbone and the side-chain rotamers
◦ Search space is enormous even for small proteins!
◦ Number of local minima increases exponentially with the
number of residues
Computationally it is an exceedingly difficult problem!
Levinthal Paradox of Protein Folding:
How does nature search?
We assume that there are three conformations for each amino acid
(e.g., α-helix, β-sheet and random coil). If a protein is made up of 100
amino acid residues, the total number of conformations is
3^100 = 515377520732011331036461129765621272702107522001
≈ 5 x 10^47.
If 100 psec (10^-10 sec) were required to convert from one conformation
to another, a random search of all conformations would require
5 x 10^47 x 10^-10 sec ≈ 1.6 x 10^30 years.
However, folding of proteins takes place on the order of msec to sec.
Therefore, proteins fold not via a random search but via a more
sophisticated search process.
We want to watch the folding process of a protein using molecular
simulation techniques.
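The Levinthal arithmetic can be checked in a few lines. This is only a sketch using the slide's own assumptions (3 conformations per residue, 100 residues, 10^-10 sec per conformational change); the length of a year in seconds is an added assumption.

```python
# Levinthal estimate under the slide's assumptions.
STATES_PER_RESIDUE = 3
N_RESIDUES = 100
SECONDS_PER_STEP = 1e-10                # 100 psec per conformational change
SECONDS_PER_YEAR = 365.25 * 24 * 3600   # assumption: Julian year

# Python integers are exact, so 3**100 is computed without rounding.
n_conformations = STATES_PER_RESIDUE ** N_RESIDUES
search_seconds = n_conformations * SECONDS_PER_STEP
search_years = search_seconds / SECONDS_PER_YEAR

print(n_conformations)
print(f"{search_years:.1e} years for an exhaustive random search")
```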
Steps in Protein Folding
1- "Collapse" - driving force is burial of hydrophobic amino acids
(fast - msecs)
2- Molten globule - helices & sheets form, but "loose"
(slow - secs)
3- "Final" native folded state - compaction, some
secondary structures rearranged
Native state? - assumed to be lowest free energy
- may be an ensemble of structures
Protein Folding Funnel
[Figure: folding funnel energy landscape with local minima and the global minimum (native structure)]
Protein Structure Prediction

Ab initio
◦ Use just first principles: energy,
geometry, and kinematics

Homology (knowledge-based approach)
◦ Find the best match to a database of
sequences with known 3D-structure

Threading (knowledge-based approach)

Combinations

Meta-servers and other methods
Ab Initio Prediction

Basic idea
Anfinsen’s theory: Protein native structure corresponds to the state
with the lowest free energy of the protein-solvent system.

General procedures
◦ Develop a potential/energy function
  - Evaluate the energy of a protein conformation
  - Select the native structure
◦ Conformational search algorithm
  - To produce new conformations
  - Search the potential energy surface and locate the global minimum (native
conformation)
Provides both folding pathway & folded structure
Can only apply to very small proteins
Potential Functions for PSP

Potential function
◦ Physics-based energy function
  - Empirical all-atom force fields: CHARMM, AMBER, ECEPP-3, GROMOS, OPLS
  - Parameterization: quantum mechanical calculations, experimental data
  - Simplified potential: UNRES (united residue)
◦ Solvation energy
  - Implicit solvation model: Generalized Born (GB) model, surface-area-based model
  - Explicit solvation model: TIP3P (computationally expensive)
General Form of All-atom Forcefields

$$
V_{\text{total}} =
  \sum_{\text{bonds}} K_b \,(r - r_0)^2
+ \sum_{\text{angles}} K_\theta \,(\theta - \theta_0)^2
+ \sum_{\text{dihedrals}} K_\phi \,[1 + \cos(n\phi - \gamma)]
+ \sum_{\text{H-bonds}} \left( \frac{C_{ij}}{r_{ij}^{12}} - \frac{D_{ij}}{r_{ij}^{10}} \right)
+ \sum_{i,j\ \text{pairs}} \left( \frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^{6}} \right)
+ \sum_{i,j\ \text{pairs}} \frac{q_i q_j}{\varepsilon\, r_{ij}}
$$

Terms, in order: bond stretching, angle bending, dihedral, H-bonding,
van der Waals, electrostatic.
The non-bonded pair terms (van der Waals, electrostatic) are the most
time-demanding part.
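To see why the pairwise terms dominate the cost, here is a minimal sketch of the van der Waals (12-6) and electrostatic sums over all atom pairs. The single constants `A`, `B` and `eps`, the coordinates, and the charges are illustrative assumptions; real force fields use per-atom-type parameters, unit conversion factors, and cutoffs.

```python
import math
from itertools import combinations

def nonbonded_energy(coords, charges, A=1.0, B=1.0, eps=1.0):
    """Sum the 12-6 van der Waals and Coulomb terms over all i<j pairs.

    A, B, eps are single illustrative constants (assumption); a real
    force field would look up A_ij and B_ij per pair of atom types.
    """
    e_vdw = 0.0
    e_elec = 0.0
    for i, j in combinations(range(len(coords)), 2):
        r = math.dist(coords[i], coords[j])          # interatomic distance r_ij
        e_vdw += A / r**12 - B / r**6                # van der Waals term
        e_elec += charges[i] * charges[j] / (eps * r)  # electrostatic term
    return e_vdw, e_elec

# Made-up three-atom system.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
charges = [1.0, -1.0, 0.0]
e_vdw, e_elec = nonbonded_energy(coords, charges)

# The loop visits N(N-1)/2 pairs, so the cost grows quadratically with N.
n_pairs = len(coords) * (len(coords) - 1) // 2
print(e_vdw, e_elec, n_pairs)
```

The bonded terms scale roughly linearly with the number of atoms, while this double loop scales quadratically, which is consistent with the note above.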
Search Potential Energy Surface
We are interested in the minimum points on the Potential Energy Surface (PES)
Conformational search techniques
◦ Energy Minimization
◦ Monte Carlo
◦ Molecular Dynamics
◦ Others: Genetic Algorithm, Simulated Annealing
Energy Minimization
Local minimum

Energy minimization

Methods
◦ First-order minimization: steepest descent, conjugate gradient minimization
◦ Second-derivative methods: Newton-Raphson method
◦ Quasi-Newton methods: L-BFGS
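A minimal sketch of first-order minimization (steepest descent) shows why energy minimization only finds the nearest local minimum. The one-dimensional double-well `energy` function, the step size, and the starting points are hypothetical choices for illustration, not a real force field.

```python
def energy(x):
    # Hypothetical 1D "energy surface" with two wells (assumption).
    return (x * x - 1.0) ** 2 + 0.3 * x

def gradient(x):
    # Analytical derivative of energy(x).
    return 4.0 * x * (x * x - 1.0) + 0.3

def steepest_descent(x0, step=0.01, tol=1e-8, max_steps=100_000):
    """Follow the negative gradient until it (nearly) vanishes."""
    x = x0
    for _ in range(max_steps):
        g = gradient(x)
        if abs(g) < tol:
            break
        x -= step * g
    return x

x_from_right = steepest_descent(1.5)    # ends in the shallower well
x_from_left = steepest_descent(-1.5)    # ends in the deeper well
print(x_from_right, energy(x_from_right))
print(x_from_left, energy(x_from_left))
```

Both runs converge, but to different minima: the result depends entirely on the starting conformation, which is why minimization is usually combined with a broader search.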
Monte Carlo

In molecular simulations, 'Monte Carlo' is an importance sampling
technique.
1. Make a random move and produce a new conformation
2. Calculate the energy change ΔE for the new conformation
3. Accept or reject the move based on the Metropolis criterion

$$P = \exp\left(-\frac{\Delta E}{kT}\right)$$ (Boltzmann factor)

If ΔE < 0, then P > 1: accept the new conformation;
otherwise: accept if P > rand(0,1), else reject.
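The three Monte Carlo steps above can be sketched as follows. The one-dimensional `energy` function, the move size, `kT`, and the seed are all hypothetical illustration values; a real simulation would move torsion angles or atom coordinates and use a real potential.

```python
import math
import random

def energy(x):
    # Hypothetical 1D double-well "energy surface" (assumption).
    return (x * x - 1.0) ** 2 + 0.3 * x

def metropolis_accept(delta_e, kT, rng):
    """Metropolis criterion: always accept downhill moves, accept
    uphill moves with probability exp(-delta_e / kT)."""
    if delta_e < 0:
        return True
    p = math.exp(-delta_e / kT)   # Boltzmann factor
    return p > rng.random()

def monte_carlo(x0, n_steps=5000, max_move=0.5, kT=0.3, seed=0):
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    for _ in range(n_steps):
        x_new = x + rng.uniform(-max_move, max_move)   # 1. random move
        e_new = energy(x_new)
        delta_e = e_new - e                            # 2. energy change
        if metropolis_accept(delta_e, kT, rng):        # 3. accept or reject
            x, e = x_new, e_new
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e

best_x, best_e = monte_carlo(x0=1.5)
print(best_x, best_e)
```

Because uphill moves are sometimes accepted, the walk can cross the barrier between wells, which plain energy minimization cannot do.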
Ab initio Prediction – CASP results
Comparative Modeling (Knowledge-based approach)
Two primary methods
1) Homology modeling
2) Threading (fold recognition)
Both rely on the availability of experimentally determined
structures that are "homologous" or at least structurally
very similar to the target
Provides folded structure only
Homology Modeling
1. Identify homologous protein sequences (PSI-BLAST)
2. Among available structures, choose the one with the closest
sequence match to the target as template
(can combine steps 1 & 2 by using PDB-BLAST)
3. Build model by placing residues in corresponding
positions of homologous structure & refine by
"tweaking"

Homology modeling - works "well"
• Computationally? Not very expensive
• Accuracy? Higher sequence identity → better model
• Requires ~30% sequence identity with a sequence for
which the structure is known
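As a rough illustration of the ~30% rule of thumb, the sketch below computes percent identity from an already-aligned target/template pair. The two sequences are made up, and counting only gap-free columns in the denominator is just one of several conventions in use.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over aligned columns where neither sequence
    has a gap (one common convention; others exist)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue            # skip gapped columns
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0

IDENTITY_THRESHOLD = 30.0   # the rough ~30% figure from the slide

# Hypothetical pre-aligned pair (made-up sequences).
target = "MKTAYIAK-QR"
template = "MKSAYLAKGQ-"

pid = percent_identity(target, template)
usable = pid >= IDENTITY_THRESHOLD
print(f"{pid:.1f}% identity, usable as template: {usable}")
```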
Homology-based Prediction
Raw model
Loop modeling
Side chain placement
Refinement
Threading - Fold Recognition
Identify “best” fit between target sequence & template structure
Threading - works "sometimes"
• Computationally? Can be expensive or cheap,
depends on energy function & whether "all atom"
or "backbone only" threading
• Accuracy? In theory, should not depend on
sequence identity (should depend on quality of
template library & "luck")
• Usually, higher sequence identity to a protein of
known structure → better model
Threading Algorithm for PSP
Database of 3D structures and sequences
◦ Protein Data Bank (or non-redundant subset)
Query sequence
◦ Sequence < 25% identity to known structures
Alignment protocol
◦ Dynamic programming
Evaluation protocol
◦ Distance-based potential or secondary structure
Ranking protocol
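The dynamic programming alignment step can be sketched with a Needleman-Wunsch style recurrence that aligns sequence residues to template positions. The `burial_fit` score, the hydrophobic residue set, and the 'B'/'E' (buried/exposed) labels are hypothetical stand-ins for a real evaluation function; note that this simple recurrence handles only per-position scores, not pairwise contact terms.

```python
def dp_align_score(seq, template, fit, gap=-1.0):
    """Best global alignment score of seq against template positions
    (Needleman-Wunsch recurrence with a linear gap penalty)."""
    n, m = len(seq), len(template)
    # S[i][j] = best score for aligning seq[:i] with template[:j]
    S = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(
                S[i - 1][j - 1] + fit(seq[i - 1], template[j - 1]),
                S[i - 1][j] + gap,   # residue left unaligned
                S[i][j - 1] + gap,   # template position skipped
            )
    return S[n][m]

HYDROPHOBIC = set("AVLIMFWC")   # assumption: crude hydrophobic set

def burial_fit(residue, environment):
    # Hypothetical score: +1 if hydrophobic residues land on buried ('B')
    # positions or polar residues on exposed ('E') positions, else -1.
    buried = environment == "B"
    return 1.0 if (residue in HYDROPHOBIC) == buried else -1.0

score_exact = dp_align_score("LKVE", "BEBE", burial_fit)
score_gapped = dp_align_score("LKVE", "BBEBE", burial_fit)
print(score_exact, score_gapped)
```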
Threading

Basic premise:
The number of unique structural folds in nature is fairly small
(probably 2000-3000)

Statistics from Protein Data Bank (~40,000 structures)
Until very recently, 90% of new structures submitted to PDB
had similar structural folds in PDB

Thus, chances for a protein to have a native-like
structural fold in PDB are quite good
◦ Note: Proteins with similar structural folds could be either
homologs or analogs
Steps in Threading
Target sequence: ALKKGF…HFDTSE
Structure templates
1. Align target sequence with template structures
(fold library) from the Protein Data Bank (PDB)
2. Calculate energy score to evaluate goodness of fit between
target sequence & template structure
3. Rank models based on energy scores
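The three steps above can be sketched as a tiny pipeline. Everything here is a toy assumption: the fold library, the template names, the burial strings, the ungapped placement, and the scoring rule are invented for illustration and are far simpler than a real threading program.

```python
HYDROPHOBIC = set("AVLIMFWC")   # assumption: crude hydrophobic residue set

def thread_energy(seq, environments):
    """Toy ungapped threading energy: -1 for each residue whose
    hydrophobicity matches the burial state ('B' buried, 'E' exposed)
    of the template position it lands on, +1 otherwise."""
    return sum(
        -1.0 if (res in HYDROPHOBIC) == (env == "B") else 1.0
        for res, env in zip(seq, environments)
    )

def rank_templates(seq, fold_library):
    # Step 1: place the sequence on each template (ungapped here).
    # Step 2: compute an energy score for each placement.
    # Step 3: rank templates from lowest (best) to highest energy.
    scored = [
        (thread_energy(seq, envs), name)
        for name, envs in fold_library.items()
        if len(envs) == len(seq)
    ]
    return sorted(scored)

# Hypothetical fold library: template name -> per-position burial string.
fold_library = {"foldA": "BEBEEB", "foldB": "EEBBEE", "foldC": "BBBBBB"}
ranking = rank_templates("LKVEDF", fold_library)
for e, name in ranking:
    print(name, e)
```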
Threading Issues
Find “correct” sequence-structure alignment of a target sequence with its
native-like fold in PDB

Structure database - must be complete: no decent model if no
good template in library!

Sequence-structure alignment algorithm:
Bad alignment → Bad score!

Energy function (scoring scheme):
◦ must distinguish correct sequence-fold alignment from incorrect
sequence-fold alignments
◦ must distinguish “correct” fold from close decoys

Prediction reliability assessment - How do we determine whether
the predicted structure is correct (or even close)?
Threading: Template database

Build a database of structural templates
(e.g., ASTRAL domain library derived from the PDB)
Supplement with additional decoys, e.g., generated using an
ab initio approach such as Rosetta (Baker)
Threading: Energy function

Two main methods (and combinations of these)
◦ Structural profile (environmental)
physico-chemical properties of amino acids
◦ Contact potential (statistical)
based on contact statistics from PDB
Miyazawa & Jernigan (ISU)
Protein Threading: Typical energy function
◦ Ep: What is the "probability" that two specific residues are in contact?
◦ Es: How well does a specific residue fit its structural environment?
◦ Eg: Alignment gap penalty?
Total energy: Ep + Es + Eg
Goal: Find a sequence-structure alignment that minimizes
the energy function
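A minimal sketch of evaluating Ep + Es + Eg for one fixed sequence-structure alignment is given below. The energy tables, the contact list, the environment labels, and the gap cost are made-up illustration values, not a published potential; finding the alignment that minimizes this total is the hard part and is not attempted here.

```python
def threading_energy(alignment, env, contacts, pair_e, single_e, gap_cost=1.0):
    """Return (Ep, Es, Eg) for one alignment.

    alignment: list of (residue, template_position); None marks a gap
    env:       per-position environment labels of the template
    contacts:  template position pairs that are in contact
    pair_e:    {frozenset of two residues: energy}   (hypothetical table)
    single_e:  {(residue, environment): energy}      (hypothetical table)
    """
    placed = {pos: res for res, pos in alignment
              if res is not None and pos is not None}
    e_gap = gap_cost * sum(1 for res, pos in alignment
                           if res is None or pos is None)
    e_single = sum(single_e.get((res, env[pos]), 0.0)
                   for pos, res in placed.items())
    e_pair = sum(pair_e.get(frozenset((placed[a], placed[b])), 0.0)
                 for a, b in contacts if a in placed and b in placed)
    return e_pair, e_single, e_gap

# Made-up template: three positions, positions 0 and 2 in contact.
env = ["B", "E", "B"]
contacts = [(0, 2)]
pair_e = {frozenset("LV"): -1.0, frozenset("LK"): 0.5}
single_e = {("L", "B"): -1.0, ("V", "B"): -1.0, ("K", "E"): -1.0,
            ("L", "E"): 1.0, ("V", "E"): 1.0, ("K", "B"): 1.0}

# Residue D is left unaligned (one gap).
alignment = [("L", 0), ("K", 1), ("D", None), ("V", 2)]
e_pair, e_single, e_gap = threading_energy(alignment, env, contacts,
                                           pair_e, single_e)
total = e_pair + e_single + e_gap
print(e_pair, e_single, e_gap, total)
```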
CAFASP
GOAL
The goal of CAFASP is to evaluate the performance of fully automatic
structure prediction servers available to the community. In contrast
to the normal CASP procedure, CAFASP aims to answer the
question of how well servers do without any intervention of
experts, i.e. how well ANY user using only automated methods can
predict protein structure. CAFASP assesses the performance of
methods without the user intervention allowed in CASP.
Performance Evaluation in CAFASP3
Servers with names in italics are meta servers.
MaxSub score ranges from 0 to 1; therefore, the maximum total score is 30.

Servers (54 in total)   Sum MaxSub Score   # correct (30 FR targets)
3ds5 robetta            5.17-5.25          15-17
pmod 3ds3 pmode3        4.21-4.36          13-14
RAPTOR                  3.98               13
shgu                    3.93               13
3dsn                    3.64-3.90          12-13
pcons3                  3.75               12
fugu3 orf_c             3.38-3.67          11-12
…                       …                  …
pdbblast                0.00               0
(http://ww.cs.bgu.ac.il/~dfischer/CAFASP3, released in December, 2002.)
One structure where RAPTOR did best
Red: true structure
Blue: correct part of prediction
Green: wrong part of prediction
• Target size: 144
• Superimposable size within 5 Å: 118
• RMSD: 1.9
Some more results by other
programs
Summary of current state of the art
Automated Web-Based Homology Modeling

SWISS Model : http://www.expasy.org/swissmod/SWISSMODEL.html

WHAT IF : http://www.cmbi.kun.nl/swift/servers/

The CPHModels Server : http://www.cbs.dtu.dk/services/CPHmodels/

3D Jigsaw : http://www.bmm.icnet.uk/~3djigsaw/

SDSC1 : http://cl.sdsc.edu/hm.html

EsyPred3D : http://www.fundp.ac.be/urbm/bioinfo/esypred/
Comparative Modeling Server & Program

COMPOSER http://www.tripos.com/sciTech/inSilicoDisc/bioInformatics/matchmaker.html

MODELLER http://salilab.org/modeler

InsightII http://www.msi.com/

SYBYL http://www.tripos.com/
