Structure prediction

Prediction of protein structure: aim

Structure prediction tries to build models of
3D structures of proteins that could be useful
for understanding structure-function
relationships.
Database sizes (entries):
  Genbank/EMBL   105,000,000
  Uniprot          5,200,000
  PDB                 47,000
DNA sequence -> Protein sequence -> 3D structure -> Molecular recognition
The protein folding problem

The information for 3D structures is coded in
the protein sequence

Proteins fold into their native structure within
seconds

Native structures are both thermodynamically
stable and kinetically accessible
ab-initio prediction

Prediction from sequence using first
principles
AVVTW...GTTWVR
Ab-initio prediction

“In theory”, we should be able to build native
structures from first principles using
sequence information and molecular
dynamics simulations: “Ab-initio prediction of
structure”



1 µs simulation of the “folding” of a model protein
(Duan-Kollman: Science, 277, 1793, 1998).
Reversible folding simulations of peptides (20-200 ns)
(Daura et al., Angew. Chem., 38, 236, 1999).
Distributed folding simulations of villin (36 residues)
(Zagrovic et al., JMB, 323, 927, 2002).
... the bad news ...


Simulations cannot be extended to the
“seconds” range
Simulations are limited to small systems and
fast folding/unfolding events in known
structures
  steered dynamics
  biased molecular dynamics
  simplified systems
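To get a feel for the gap between a microsecond-scale run and a real folding time of seconds, the sketch below simply counts integration steps. The 2 fs time step is an assumption (a common choice in atomistic molecular dynamics, not stated in the slides), and `steps_needed` is a hypothetical helper name.

```python
# Count molecular dynamics integration steps for a given simulated time.
# Assumption (not from the slides): a 2 fs integration time step.

FS_PER_SECOND = 10**15  # femtoseconds in one second


def steps_needed(sim_time_fs: int, timestep_fs: int = 2) -> int:
    """Number of integration steps needed to cover sim_time_fs femtoseconds."""
    return sim_time_fs // timestep_fs


one_microsecond_fs = FS_PER_SECOND // 10**6
one_second_fs = FS_PER_SECOND

print(steps_needed(one_microsecond_fs))  # steps for 1 microsecond
print(steps_needed(one_second_fs))       # steps for 1 second
# How many times longer a 1 s run is than a 1 microsecond run:
print(steps_needed(one_second_fs) // steps_needed(one_microsecond_fs))
```

Under this assumption a one-second trajectory needs a million times more steps than a one-microsecond one, which is why the shortcuts listed next are used.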
typical shortcuts

Reduce conformational space
  1 or 2 atoms per residue
  Fixed lattices
Statistical force fields obtained from known structures
  Average distances between residues
  Interactions
Use building blocks: 3-9 residues from PDB
structures

“lattice” folding
Example: PROSA potential
[Figure: PROSA energy profile with hydrophobic, Cb-Cb and total curves, on a scale from very stable to low stability]
http://lore.came.sbg.ac.at:8080/CAME/CAME_EXTERN/ProsaII/index_html
Results from ab-initio



Average error: 5-10 Å
Function cannot be predicted
Long simulations
A protein from E. coli was predicted at 7.6 Å
(CASP3, H. Scheraga)
comparative modelling

The most efficient way to predict protein
structure is to compare with known 3D
structures
Protein folds
Basic concept

For a given protein, 3D structure is more
conserved than sequence

Some amino acids are “equivalent” to each other
Evolutionary pressure allows only amino acid
substitutions that keep the 3D structure largely
unaltered
Two proteins with “similar” sequences must
have the “same” 3D structure
Possible scenarios
1. Homology can be recognized using sequence comparison tools or
protein family databases (blast, clustal, pfam, ...): 3D structure prediction.
Structural and functional predictions are feasible.
2. Homology exists but cannot be recognized easily (psi-blast, threading): fold prediction.
Low-resolution fold predictions are possible. No functional
information.
3. No homology: 1D prediction.
1D predictions. Sequence motifs. Limited functional prediction. Ab initio prediction.

Prediction is based on averaging amino acid
properties
AGGCFHIKLAAGIHLLVILVVKLGFSTRDEEASS
Average over a
window
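The window averaging described above can be sketched as follows. This is a minimal illustration, not the method of any specific server: the Kyte-Doolittle hydropathy scale is used as the property (an assumption, the slides do not name a scale), and `window_average` and the window length of 7 are hypothetical choices.

```python
# Sliding-window average of a per-residue property along a sequence.
# Assumption: Kyte-Doolittle hydropathy values serve as the property.

KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_average(sequence, scale, window=7):
    """Return one averaged value per full window along the sequence."""
    if window < 1 or window > len(sequence):
        raise ValueError("window must be between 1 and the sequence length")
    values = [scale[res] for res in sequence]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


seq = "AGGCFHIKLAAGIHLLVILVVKLGFSTRDEEASS"  # sequence shown on the slide
profile = window_average(seq, KYTE_DOOLITTLE, window=7)

# Locate the window with the highest average hydropathy.
best = max(range(len(profile)), key=profile.__getitem__)
print(f"most hydrophobic window starts at residue {best + 1}: {seq[best:best + 7]}")
```

Longer windows smooth the profile more; values around 19-21 residues are commonly tried when looking for transmembrane segments.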
1D prediction. Properties




Secondary structure propensities
Hydrophobicity (transmembrane)
Accessibility
...
Propensities Chou-Fasman
Biochemistry 17, 4277 1978

Amino acid   P(a)   P(b)   P(turn)   Class
Ala          1.29   0.9    0.78      a
Cys          1.11   0.74   0.8       a
Leu          1.3    1.02   0.59      a
Met          1.47   0.97   0.39      a
Glu          1.44   0.75   1         a
Gln          1.27   0.8    0.97      a
His          1.22   1.08   0.69      a
Lys          1.23   0.77   0.96      a
Val          0.91   1.49   0.47      b
Ile          0.97   1.45   0.51      b
Phe          1.07   1.32   0.58      b
Tyr          0.72   1.25   1.05      b
Trp          0.99   1.14   0.75      b
Thr          0.82   1.21   1.03      b
Gly          0.56   0.92   1.64      turn
Ser          0.82   0.95   1.33      turn
Asp          1.04   0.72   1.41      turn
Asn          0.9    0.76   1.23      turn
Pro          0.52   0.64   1.91      turn
Arg          0.96   0.99   0.88
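As a rough illustration of how such propensities can be used, the sketch below averages P(a), P(b) and P(turn) over a short window around each residue and assigns the state with the highest average. This is a deliberately simplified scheme and not the full Chou-Fasman procedure, which uses nucleation and extension rules; the function name `predict_states` and the window of 5 are assumptions. The values are the ones listed in the table.

```python
# Simplified propensity-based 1D prediction (NOT the full Chou-Fasman rules).
# Tuple order: (P(a), P(b), P(turn)), values taken from the table.

PROPENSITY = {
    "A": (1.29, 0.90, 0.78), "C": (1.11, 0.74, 0.80), "L": (1.30, 1.02, 0.59),
    "M": (1.47, 0.97, 0.39), "E": (1.44, 0.75, 1.00), "Q": (1.27, 0.80, 0.97),
    "H": (1.22, 1.08, 0.69), "K": (1.23, 0.77, 0.96),
    "V": (0.91, 1.49, 0.47), "I": (0.97, 1.45, 0.51), "F": (1.07, 1.32, 0.58),
    "Y": (0.72, 1.25, 1.05), "W": (0.99, 1.14, 0.75), "T": (0.82, 1.21, 1.03),
    "G": (0.56, 0.92, 1.64), "S": (0.82, 0.95, 1.33), "D": (1.04, 0.72, 1.41),
    "N": (0.90, 0.76, 1.23), "P": (0.52, 0.64, 1.91),
    "R": (0.96, 0.99, 0.88),
}
STATES = "abt"  # a = helix, b = strand, t = turn


def predict_states(sequence, window=5):
    """Assign to each residue the state with the highest window-averaged propensity."""
    half = window // 2
    states = []
    for i in range(len(sequence)):
        # The window is truncated at the sequence ends.
        segment = sequence[max(0, i - half): i + half + 1]
        means = [
            sum(PROPENSITY[res][k] for res in segment) / len(segment)
            for k in range(3)
        ]
        states.append(STATES[means.index(max(means))])
    return "".join(states)


print(predict_states("AGGCFHIKLAAGIHLLVILVVKLGFSTRDEEASS"))
```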
Some programs (www.expasy.org)














BCM PSSP - Baylor College of Medicine
Prof - Cascaded Multiple Classifiers for Secondary Structure Prediction
GOR I (Garnier et al, 1978) [At PBIL or at SBDS]
GOR II (Gibrat et al, 1987)
GOR IV (Garnier et al, 1996)
HNN - Hierarchical Neural Network method (Guermeur, 1997)
Jpred - A consensus method for protein secondary structure prediction at
University of Dundee
nnPredict - University of California at San Francisco (UCSF)
PredictProtein - PHDsec, PHDacc, PHDhtm, PHDtopology, PHDthreader,
MaxHom, EvalSec from Columbia University
PSA - BioMolecular Engineering Research Center (BMERC) / Boston
PSIpred - Various protein structure prediction methods at Brunel University
SOPM (Geourjon and Deléage, 1994)
SOPMA (Geourjon and Deléage, 1995)
AGADIR - An algorithm to predict the helical content of peptides
1D Prediction

Original methods: 1 sequence and uniform
parameters (25-30%)

Early improvements: parameters specific to
protein classes

Present methods use sequence profiles obtained
from multiple alignments and neural networks to
extract parameters (70-75%, 98% for
transmembrane helices)
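A sequence profile of the kind mentioned above can be sketched as per-column residue frequencies of a multiple alignment. The toy alignment and the function name `column_profile` are hypothetical, and the neural network that would consume the profile is not shown.

```python
# Build a simple sequence profile (per-column residue frequencies)
# from a multiple alignment. Gap characters ('-') are ignored.
from collections import Counter


def column_profile(alignment):
    """Return a list with one {residue: frequency} dict per alignment column."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != "-")
        total = sum(counts.values())
        profile.append({res: n / total for res, n in counts.items()} if total else {})
    return profile


# Hypothetical toy alignment, for illustration only.
toy_alignment = ["AVLK-G", "AVIKEG", "SVLREG", "AVLK-G"]
profile = column_profile(toy_alignment)

for position, frequencies in enumerate(profile, start=1):
    print(position, frequencies)
```

Fully conserved columns give a single residue with frequency 1.0, while variable columns spread the weight over several residues, which is the extra information a single sequence cannot provide.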
PredictProtein (PHD)
1. Building of a multiple alignment using
Swissprot, prosite, and domain databases
2. 1D prediction from the generated profile
using neural networks
3. Fold recognition
4. Confidence evaluation
PredictProtein
Available information









Multiple alignments: MaxHom
Motifs: PROSITE
Composition bias: SEG
Threading: TOPITS
Secondary structure: PHDsec, PROFsec
Transmembrane helices: PHDhtm, PHDtop
Globularity: GLOBE
Coiled-coil: COILS
Disulfide bridges: CYSPRED
Result
PredictProtein
Available information






Signal peptides: SignalP
O-glycosylation: NetOglyc
Chloroplast import signal: ChloroP
Consensus secondary structure: JPRED
Transmembrane: TMHMM, TOPPRED
SwissModel
Methods for remote homology

Homology can be recognized using PSI-Blast

Fold prediction is possible using threading
methods

Accurate 3D prediction is not possible: no
structure-function relationship can be inferred
from models
Threading

The unknown sequence is “folded” onto a number of
known structures

Scoring functions evaluate the fit between
sequence and structure according to
statistical functions and sequence
comparison
Query:
  Sequence              ATTWV....PRKSCT
  Pred. sec. struc.     HHHHH....CCBBBB
  Pred. accessibility   eeebb....eeebeb
Library of known structures:
  Sequence    GGTV....ATTW     ...   ATTVL....FFRK
  Obs. SS     BBBB....CCHH     ...   HHHB.....CBCB
  Obs. acc.   EEBE.....BBEB    ...   BBEBB....EBBE
Example scores: 10.5 > 5.2, giving the SELECTED HIT
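The comparison drawn in the scheme above can be sketched as a toy scoring function: predicted secondary structure and accessibility of the query are compared position by position with the observed values of each library structure. This is only an illustration under strong assumptions (no gaps, equal lengths, one point per matching position); real threading programs use statistical potentials and gapped alignments. The strings, fold names and function names are hypothetical.

```python
# Toy threading score: count positions where predicted and observed
# secondary structure and accessibility agree. No gaps are allowed.


def threading_score(pred_ss, pred_acc, obs_ss, obs_acc):
    """One point per matching secondary-structure or accessibility position."""
    if not (len(pred_ss) == len(pred_acc) == len(obs_ss) == len(obs_acc)):
        raise ValueError("all strings must have the same length")
    ss_matches = sum(p == o for p, o in zip(pred_ss, obs_ss))
    # Accessibility letters are compared case-insensitively (e/E, b/B).
    acc_matches = sum(p.lower() == o.lower() for p, o in zip(pred_acc, obs_acc))
    return ss_matches + acc_matches


def best_template(pred_ss, pred_acc, library):
    """Score the query against every library entry and return the top hit."""
    scores = {
        name: threading_score(pred_ss, pred_acc, obs_ss, obs_acc)
        for name, (obs_ss, obs_acc) in library.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical query predictions and library entries.
query_ss = "HHHHHCCBBBB"
query_acc = "eeebbeeebeb"
library = {
    "fold_A": ("BBBBCCCCHHH", "EEBEBBBEBEB"),
    "fold_B": ("HHHHCCCBBBB", "EEBBBEEEBEB"),
}

hit, scores = best_template(query_ss, query_acc, library)
print(hit, scores)
```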
Threading accuracy
[Figure: % correct hits (y-axis, 0 to 0.35) versus % sequence identity (x-axis, 5 to 25)]
Comparative modelling

Good for sequence identity > 30%

Accuracy is very high for sequence identity > 60%

Reminder


The model must be USEFUL
Only the “interesting” regions of the protein need
to be modelled
Expected accuracy

Strongly dependent on the quality of the sequence
alignment

Strongly dependent on the identity with “template”
structures. Very good structures if identity > 60-70%.

Quality of the model is better in the backbone than
in the side chains

Quality of the model is better in conserved regions
Steps
1. Choose templates: proteins with
experimental 3D structure and significant
homology (BLAST, PFAM, PDB)
2. Build a multiple alignment of the templates.
   Alignment quality is critical for accuracy.
   Always use structure-based alignment.
   Reduce redundancies
Template alignment
Steps
1. Alignment of template structures
2. Alignment of unknown sequence against
template alignment
   Structural alignment may not coincide with
   evolution-based alignment.
   Gaps must be chosen to minimize structure
   distortion
PHE ASP ILE CYS ARG LEU PRO GLY SER ALA GLU ALA VAL CYS (green)
PHE ASN VAL CYS ARG THR PRO --- --- --- GLU ALA ILE CYS (red)
PHE ASN VAL CYS ARG --- --- --- THR PRO GLU ALA ILE CYS (blue)
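The effect of gap placement in the example above can be checked by counting identical residues for each alternative. The sketch below does only that; it says nothing about which placement distorts the structure less, which requires the 3D coordinates. The helper name `identity_stats` is hypothetical.

```python
# Compare two alternative gap placements of the same sequence against a
# reference, using the three-letter alignment rows from the example.

GAP = "---"


def identity_stats(reference, candidate):
    """Return (identical positions, aligned non-gap positions)."""
    ref = reference.split()
    cand = candidate.split()
    if len(ref) != len(cand):
        raise ValueError("aligned rows must have the same number of columns")
    aligned = [(r, c) for r, c in zip(ref, cand) if GAP not in (r, c)]
    identical = sum(r == c for r, c in aligned)
    return identical, len(aligned)


green = "PHE ASP ILE CYS ARG LEU PRO GLY SER ALA GLU ALA VAL CYS"
red = "PHE ASN VAL CYS ARG THR PRO --- --- --- GLU ALA ILE CYS"
blue = "PHE ASN VAL CYS ARG --- --- --- THR PRO GLU ALA ILE CYS"

for name, row in (("red", red), ("blue", blue)):
    identical, aligned = identity_stats(green, row)
    print(f"{name}: {identical} identical of {aligned} aligned residues")
```

The two placements differ by one identity (the PRO/PRO pair), so a purely sequence-based criterion and a structure-based criterion can favour different alignments.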
Steps
1. Alignment of template structures
2. Alignment of unknown sequence against template
alignment
3. Build structure of conserved regions (SCR)
   Coordinates come from either a single structure or
   averages.
   Side chains are adapted to the original or placed in
   standard conformations
Steps
1. Alignment of template structures
2. Alignment of unknown sequence against
template alignment
3. Build structure of conserved regions (SCR)
4. Build unconserved regions (usually “loops”)
“loops”
   Sources: ab initio or PDB
   Chosen manually or energy-based
Optimization
1. Optimize side chain conformation
   Energy minimization restricted to standard conformers
   and VdW energy
2. Optimize everything
   Global energy minimization with restraints
   Molecular dynamics
Quality test

There are no energy differences between a correct and a
wrong model

The structure must be “chemically correct” to
be used in quantitative predictions
Alignment quality



Global test: compare the sequence with versions randomized by N
residue exchanges (N=1000).
Calculate Z-score
For alignments of 100-200 residues:
  Z > 15: ideal
  5 < Z <= 15: 70% of core residues right
  Z <= 5: problems
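A shuffle-based Z-score of this kind can be sketched as below. Several points are assumptions rather than the slide's exact protocol: the "residue exchanges" are interpreted as random shuffling of the query, the score is a plain identity count over an ungapped alignment (real tests use substitution-matrix alignment scores), and the names `shuffle_z_score` and `interpret` are hypothetical. The thresholds are the ones quoted for alignments of 100-200 residues.

```python
# Shuffle-based Z-score for an ungapped alignment, with the slide's thresholds.
import random
import statistics


def identity_score(a, b):
    """Number of identical positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b))


def shuffle_z_score(query, template, n_shuffles=1000, seed=0):
    """Z-score of the observed score against scores of shuffled queries."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    observed = identity_score(query, template)
    residues = list(query)
    background = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        background.append(identity_score(residues, template))
    spread = statistics.pstdev(background)
    if spread == 0:
        raise ValueError("shuffled scores do not vary; Z-score is undefined")
    return (observed - statistics.mean(background)) / spread


def interpret(z):
    """Map a Z-score to the categories given for 100-200 residue alignments."""
    if z > 15:
        return "ideal"
    if z > 5:
        return "70% of core residues right"
    return "problems"


seq = "AGGCFHIKLAAGIHLLVILVVKLGFSTRDEEASS"
z = shuffle_z_score(seq, seq)  # a sequence aligned to itself, as a sanity check
print(round(z, 1), interpret(z))
```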
Analysis software




PROCHECK
WHATCHECK
Suite Biotech
PROSA
Sources of information

300 best structures in PDB

Molecular geometry from CSD database

Theoretical data (Ramachandran, etc.)
Procheck







Covalent geometry
Planarity
Dihedral angles
Chirality
Non-bonded interactions
Satisfied/unsatisfied hydrogen bonds
Disulfide bonds
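As an illustration of one of the checks listed above, a dihedral angle can be computed from four atom positions as sketched below. This is not PROCHECK code; `dihedral_degrees` is a hypothetical helper and the coordinates are made up so that the expected angles are obvious.

```python
# Dihedral (torsion) angle defined by four points, in degrees.
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral_degrees(p0, p1, p2, p3):
    """Signed angle between the planes (p0, p1, p2) and (p1, p2, p3)."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal of the first plane
    n2 = _cross(b2, b3)  # normal of the second plane
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Made-up coordinates: the fourth atom is rotated around the central bond.
print(dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1)))   # cis
print(dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))   # quarter turn
print(dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1)))  # trans
```

Applied to consecutive backbone atoms, the same function gives the phi and psi angles that are plotted in a Ramachandran diagram.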
Whatcheck
Prediction software

SwissModel (automatic)
  http://www.expasy.org/swissmod/
SwissModel Repository
  http://swissmodel.expasy.org/repository/
3D-JIGSAW (M. Sternberg)
  http://www.bmm.icnet.uk/servers/3djigsaw/
Modeller (A. Sali)
  http://salilab.org/modeller/modeller.html
MODBASE (A. Sali)
  http://alto.compbio.ucsf.edu/modbase-cgi/index.cgi
spdbv Result
Final test

The model must justify experimental data (i.e.
differences between unknown sequence and
templates) and be useful to understand
function.
