LSM2104/CZ2251
Essential Bioinformatics and Biocomputing
Protein Structure and
Visualization (2)
Chen Yu Zong
[email protected]
6874-6877
Lecture 10
Protein structure databases; visualization;
and classifications
1. Introduction to Protein Data Bank (PDB)
2. Free graphic software for 3D structure
visualization
3. Hierarchical classification of protein domains:
SCOP & CATH & DALI
1. Protein Data Bank (PDB)
• Protein Data Bank: maintained by the Research
Collaboratory for Structural Bioinformatics (RCSB)
• http://www.rcsb.org/pdb/
– 30060 Structures 15-Mar-2005
– 27570 Structures 05-Oct-2004
– 23997 Structures 20-Jan-2004
• Also contains structures of other bio-macromolecules:
DNA, carbohydrates and protein-DNA complexes.
PDB Content Growth
PDB Presentation of Selected Molecules
Deficiencies in our structural knowledge
Only deposited data is actually available
Many structures not deposited in PDB, why?
Structures available for soluble proteins
A few dozen entries for membrane protein domains, why?
X-ray data only for those proteins that crystallize
well or diffract properly.
Why?
NMR structures are usually for small proteins
How to survey the size of NMR-determined proteins?
Estimated that structural data available
for only 10-15% of all known proteins.
Alternative Source of Structure: NCBI
Protein Structure in PDB
• Text files
• Each entry is specified by a unique 4-letter
code (PDB code): say 1HUY for a variant of
GFP; 1BGK for a 37-residue toxin protein
isolated from sea anemone
• 1HUY and 1BGK
– Header information
– Atomic coordinates in Å (1 Ångstrom = 1.0e-10 m)
Header Details
• Identifies the molecule, modifications, date of
release
• Host organism, keywords, method of study
• Authors, reference, resolution for X-ray structure
– The smaller the number, the better the structure.
• Sequence, reference
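The header fields listed above can be pulled out of the text file with a few lines of code. The following is a minimal sketch, assuming the fixed-column PDB text format (HEADER, EXPDTA and REMARK 2 records); the function name `parse_header` and the sample lines (ID code 1ABC, the date, the resolution) are made up for illustration and are not taken from a real entry.

```python
import re

def parse_header(lines):
    """Collect a few header fields from PDB-format text lines."""
    info = {"id_code": None, "classification": None,
            "method": None, "resolution": None}
    for line in lines:
        record = line[:6].strip()
        if record == "HEADER":
            # Fixed columns (1-based): 11-50 classification, 63-66 ID code.
            info["classification"] = line[10:50].strip()
            info["id_code"] = line[62:66].strip()
        elif record == "EXPDTA":
            # Method of study, e.g. X-RAY DIFFRACTION or SOLUTION NMR.
            info["method"] = line[10:].strip()
        elif record == "REMARK":
            m = re.match(r"REMARK\s+2\s+RESOLUTION\.\s+([0-9.]+)\s+ANGSTROMS", line)
            if m:
                info["resolution"] = float(m.group(1))
    return info

# Made-up header lines, padded so that the columns line up as assumed above.
sample = [
    f"{'HEADER':<10}{'EXAMPLE PROTEIN':<40}{'01-JAN-00':<9}   1ABC",
    f"{'EXPDTA':<10}X-RAY DIFFRACTION",
    "REMARK   2 RESOLUTION.    2.20 ANGSTROMS.",
]
header = parse_header(sample)
print(header)
```

For an NMR entry no resolution line of this form is expected, so `resolution` would stay `None` in this sketch.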
The Atomic Coordinates
• XYZ coordinates for each atom (records starting with ATOM; only heavy atoms
for an X-ray structure) from the first residue to the last
• XYZ coordinates for any ligands (starting with HETATM) complexed to
the bio-macromolecule
• O atoms of water molecules (starting with HETATM, normally at the end
of the coordinate section)
• Usually, for X-ray structure, resolution is not high enough to locate H
atoms: hence only heavy atoms are shown in the PDB file.
• For NMR structure, all atoms (including hydrogen atoms) are specified in
the PDB file.
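The ATOM and HETATM records described above are fixed-width text, so they can be read by slicing columns. This is a minimal sketch, assuming the standard PDB column layout (atom name in columns 13-16, residue name in 18-20, x/y/z in 31-54, element in 77-78); `format_record` is a hypothetical helper used only to build made-up sample lines, and the coordinates are invented.

```python
def format_record(record, serial, name, res_name, chain, res_seq, x, y, z, element):
    """Build one fixed-width coordinate line (hypothetical helper for the sample)."""
    return (
        f"{record:<6}{serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}{'':4}"
        f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{0.0:6.2f}{'':10}{element:>2}"
    )

def parse_atoms(lines):
    """Return one dict per ATOM/HETATM line, read by column position."""
    atoms = []
    for line in lines:
        record = line[:6].strip()
        if record not in ("ATOM", "HETATM"):
            continue
        atoms.append({
            "record": record,
            "name": line[12:16].strip(),
            "res_name": line[17:20].strip(),
            "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
            "element": line[76:78].strip(),
        })
    return atoms

# Made-up coordinates: two protein atoms and one water oxygen.
sample = [
    format_record("ATOM", 1, " N  ", "MET", "A", 1, 1.0, 2.0, 3.0, "N"),
    format_record("ATOM", 2, " CA ", "MET", "A", 1, 2.5, 2.0, 3.0, "C"),
    format_record("HETATM", 3, " O  ", "HOH", "A", 101, 9.0, 9.0, 9.0, "O"),
]
atoms = parse_atoms(sample)
heavy = [a for a in atoms if a["element"] != "H"]          # non-hydrogen atoms
waters = [a for a in atoms if a["record"] == "HETATM" and a["res_name"] == "HOH"]
print(len(atoms), len(heavy), len(waters))
```

Counting atoms whose element is H is one quick way to tell whether a file carries hydrogens, as NMR entries usually do.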
X-ray structure 1HUY
NMR structure 1BGK
2. Free Software for Protein Structure
Visualization
• RASMOL: available for all platforms
http://www.openrasmol.org
• Swiss PDB Viewer: from Swiss-Prot
http://www.expasy.ch/spdbv/
• Chemscape Chime Plug-in: for PC and Mac
http://www.mdl.com/downloads/downloadable/index.jsp
• YASARA: http://www.yasara.org/
• MOLMOL: MOLecule analysis and MOLecule display
http://129.132.45.141/wuthrich/software/molmol/index.html
Ribbon representation by RasMol
1HUY
An Improved Yellow
Variant Of Green
Fluorescent Protein
From Tsien’s group
J.Biol.Chem. 276 29188
(2001)
Ribbon representation by YASARA
Ribbon representation by MOLMOL
An ensemble of 15 structures (NMR, toxin Bgk);
hydrogen atoms are also included
15 backbone structures of the
sea anemone toxin Bgk
15 all-atom structures of the
sea anemone toxin Bgk
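An NMR entry such as the 15-structure ensemble above stores each structure as a separate block in the same file. The sketch below assumes the usual PDB convention that each block is bracketed by MODEL and ENDMDL records; the function name `split_models` and the shortened sample lines are illustrative only.

```python
def split_models(lines):
    """Group coordinate lines by MODEL/ENDMDL block.

    Files without MODEL records (typical for X-ray entries) yield one model.
    """
    models, current = [], []
    for line in lines:
        record = line[:6].strip()
        if record == "MODEL":
            current = []
        elif record == "ENDMDL":
            models.append(current)
            current = []
        elif record in ("ATOM", "HETATM"):
            current.append(line)
    if current:  # coordinates that were not inside any MODEL block
        models.append(current)
    return models

# Shortened, made-up lines: only the record name matters for this sketch.
nmr_like = ["MODEL        1", "ATOM      1  N", "ENDMDL",
            "MODEL        2", "ATOM      1  N", "ENDMDL"]
xray_like = ["ATOM      1  N", "ATOM      2  CA"]

nmr_models = split_models(nmr_like)
xray_models = split_models(xray_like)
print(len(nmr_models), len(xray_models))
```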
Line representation
Ribbon representation
Space-filling representation
3. Hierarchical classification of protein
domains: SCOP & CATH
• SCOP: Structural Classification of Proteins
University of Cambridge, UK
http://scop.mrc-lmb.cam.ac.uk/scop/
Hyperlink in Singapore: http://scop.bic.nus.edu.sg/
• CATH: Class, Architecture, Topology,
Homologous Superfamily, Sequence family
University College London, UK
http://www.biochem.ucl.ac.uk/bsm/cath/
Basis for protein classification
Proteins adopt a limited number of topologies
More than 50,000 sequences fold into ~1000 unique folds.
Homologous sequences have similar structures
Usually, when sequence identity>30%, proteins adopt the
same fold. Even in the absence of sequence homology,
some folds are preferred by vastly different sequences.
The “active site” is highly conserved
A subset of functionally critical residues is found to be
conserved even when the folds vary.
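The 30% rule of thumb above depends on how sequence identity is counted. Here is a minimal sketch, assuming the two sequences are already aligned (same length, "-" for gaps) and that identity is taken over columns where neither sequence has a gap; other conventions exist, and the sequences below are made up.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over aligned, gap-free columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)

# Made-up aligned fragments: 9 gap-free columns, 8 of them identical.
identity = percent_identity("ACDEFGHIKL", "ACDQFGHI-L")
likely_same_fold = identity > 30.0   # rule of thumb from the slide
print(round(identity, 1), likely_same_fold)
```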
How many unique folds do organisms
use to express functions?
Sequence space
> 50,000
Conformational
space
Many sequences to form
one unique fold
~1,000 ???????
Growth of Protein Databases
[Figure: number of sequences (axis 0 to 90,000) and number of structures and folds (axis 0 to 12,000) by year, 1986 to 2000; series: Sequences, Structures, Folds]
Structural Classification of Proteins
SCOP
• University of Cambridge, UK: http://scop.mrc-lmb.cam.ac.uk/scop/
– mirrored at Singapore: http://scop.bic.nus.edu.sg/
– contains PDB entries grouped hierarchically by:
• Structural class,
• Fold,
• Superfamily,
• Family,
• Individual member
(domain-based)
Structural Classification of Proteins
SCOP
• Family
• Proteins are clustered together into families on the basis
of one of two criteria that imply their having a common
evolutionary origin:
• All proteins that have residue identities of 30% and
greater;
• Proteins with lower sequence identities but whose
functions and structures are very similar
Example: globins with sequence identities of 15%.
Structural Classification of Proteins
SCOP
• Superfamily
• Families, whose proteins have low sequence identities
but whose structures and, in many cases, functional
features suggest that a common evolutionary origin is
probable, are placed together in superfamilies
• Example: actin, the ATPase domain of the heat shock protein, and hexokinase
Structural Classification of Proteins
SCOP
• Fold
• Superfamilies and families are defined as having a
common fold if their proteins have the same major
secondary structures in the same arrangement with the
same topological connections.
Structural Classification of Proteins
SCOP
• Class
– For the convenience of users, the different folds have been grouped into
classes. Most of the folds are assigned to one of a few structural classes
on the basis of the secondary structures of which they are composed
SCOP Class: All-α topologies
cytochrome b-562
ferritin
SCOP Class: All-β topologies
β sandwiches
β-barrels
SCOP Class: α/β topologies
α/β horseshoe
α/β barrels
SCOP Class: Alpha+Beta topologies
Ubiquitin (1ubi)
CATH database
http://www.biochem.ucl.ac.uk/bsm/cath/
CATH: Class, Architecture, Topology, Homologous Superfamily, Sequence family
Orengo et al. (1997) CATH: a hierarchical classification of protein domain
structures. Structure 5, 1093-1108
Sequence identity >30%: the same overall fold
Sequence identity >70%: the same overall fold + similar function
CATH database
Class
Derived from secondary structure content; assigned automatically for more than 90% of protein
structures.
Architecture
Describes the gross orientation of secondary structures, independent of connectivities; currently
assigned manually.
Topology
Clusters structures according to their topological connections and numbers of secondary structures.
Homologous superfamilies
Cluster proteins with highly similar structures and functions. The assignments of structures to
topology families and homologous superfamilies are made by sequence and structure comparisons.
Sequence families
Structures within each H-level are further clustered on sequence identity. Domains clustered in the
same sequence families have sequence identities >35%.
Non-identical sequence domains,
Identical sequence domains,
Domains
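The C, A, T and H levels above are commonly written as a dotted four-number code. The sketch below assumes that notation (for example 3.40.50.720); the names `CathCode`, `parse_cath` and `deepest_shared_level` are hypothetical, and the two codes are used only as examples of the format.

```python
from typing import NamedTuple, Optional

class CathCode(NamedTuple):
    cath_class: int
    architecture: int
    topology: int
    homologous_superfamily: int

def parse_cath(code: str) -> CathCode:
    """Split a dotted C.A.T.H code into its four integer levels."""
    parts = code.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated numbers, got {code!r}")
    return CathCode(*(int(p) for p in parts))

def deepest_shared_level(a: CathCode, b: CathCode) -> Optional[str]:
    """Return 'C', 'A', 'T' or 'H' for the deepest level two codes share."""
    levels = "CATH"
    depth = 0
    for x, y in zip(a, b):
        if x != y:
            break
        depth += 1
    return levels[depth - 1] if depth else None

first = parse_cath("3.40.50.720")
second = parse_cath("3.40.50.300")
shared = deepest_shared_level(first, second)
print(shared)
```

Two domains that agree down to T but differ at H would share a fold family without being placed in the same homologous superfamily.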
CATH database
The class (C), architecture (A) and
topology (T) levels in the CATH database
Class
Architecture
Topology
Homologous
Superfamily
CATH – architectures
CATH – architectures (cont.)
The protein structure universe in
the PDB (1997) by a CATH wheel
The distribution of non-homologous structures (i.e. a single
representative from each homologous superfamily at the
H-level in CATH) amongst the different classes (C),
architectures (A) and fold families (T) in the CATH database.
SCOP / CATH -> DALI
SCOP & CATH
• Hierarchical and based on abstractions
• Include some manual aspects and are curated by experts in
the field of protein structure
Dali
Presents the results of an automated computer classification; the
methods that underlie the classification remain internal
Structure comparison
DALI
Comparing protein structures in 3D
[Figure labels: α/β, β, α, antiparallel β barrel, α-β meander]
More information about DALI
Touring protein fold space with Dali/FSSP: Liisa Holm and Chris Sander
Compare 3D protein structures by Dali
http://www.ebi.ac.uk/dali/
• The FSSP database (Fold classification based on Structure-Structure
alignment of Proteins) is based on exhaustive all-against-all 3D
structure comparison of protein structures currently in the Protein Data
Bank (PDB).
• The classification and alignments are automatically maintained and
continuously updated using the Dali search engine.
Dali Domain Dictionary
• Structural domains are delineated automatically using the criteria of
recurrence and compactness. Each domain is assigned a Domain
Classification number DC_l_m_n_p, where:
– l: fold space attractor region
– m: globular folding topology
– n: functional family
– p: sequence family
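The DC_l_m_n_p label above can be split into its four levels mechanically. This is a small sketch, assuming each of l, m, n and p is written as an integer; the function name `parse_dc` and the label DC_3_12_5_1 are made up for illustration.

```python
def parse_dc(label):
    """Split a Domain Classification label DC_l_m_n_p into named levels."""
    prefix, *fields = label.split("_")
    if prefix != "DC" or len(fields) != 4:
        raise ValueError(f"not a DC_l_m_n_p label: {label!r}")
    l, m, n, p = (int(f) for f in fields)  # assumption: all four are integers
    return {
        "attractor_region": l,     # l: fold space attractor region
        "folding_topology": m,     # m: globular folding topology
        "functional_family": n,    # n: functional family
        "sequence_family": p,      # p: sequence family
    }

dc = parse_dc("DC_3_12_5_1")
print(dc)
```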
Compare 3D protein structures by Dali
http://www.ebi.ac.uk/dali/
Functional families
• Evolutionary relationships are inferred from strong structural similarities
that are accompanied by functional or sequence similarities.
• Functional families are branches of the fold dendrogram
where all pairs have a high average neural network prediction
for being homologous.
Sequence families
• Representative subset of the Protein Data Bank extracted using
a 25% sequence identity threshold.
• All-against-all structure comparison was carried out within the
set of representatives.
• Homologues are only shown aligned to their representative.
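One simple way to picture how a representative subset at a 25% identity threshold could be extracted is a greedy pass over the sequences. The sketch below is an illustration under that assumption, not the procedure actually used by Dali/FSSP; `pick_representatives`, the protein names and the identity values are all made up.

```python
def pick_representatives(names, identity, threshold=25.0):
    """Greedy selection: a sequence joins the first representative it matches
    at or above the threshold, otherwise it becomes a new representative."""
    representatives = []
    assigned = {}
    for name in names:
        for rep in representatives:
            if identity(name, rep) >= threshold:
                assigned[name] = rep
                break
        else:
            representatives.append(name)
            assigned[name] = name
    return representatives, assigned

# Made-up pairwise identities (percent), stored once per unordered pair.
table = {
    frozenset(("p1", "p2")): 60.0,
    frozenset(("p1", "p3")): 10.0,
    frozenset(("p2", "p3")): 12.0,
}

def lookup(a, b):
    return table[frozenset((a, b))]

reps, assigned = pick_representatives(["p1", "p2", "p3"], lookup)
print(reps, assigned)
```

In this toy case p2 is a homologue of p1 and would only be shown aligned to it, while p3 becomes a second representative.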
Compare 3D protein structures by Dali
http://www.ebi.ac.uk/dali/
Fold types
• Fold types are defined as clusters
of structural neighbors in fold
space with average pairwise Z-scores (by Dali) above 2.
Structural neighbours of 1urnA (top
left). 1mli (bottom right) has the
same topology even though there
are shifts in the relative orientation
of secondary structure elements
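The fold-type criterion above (average pairwise Z-score above 2) can be checked for a candidate cluster as follows. This is a minimal sketch, assuming one Z-score per unordered pair of domains is already available; the domain names and Z-scores are made up, and building the clusters themselves is not shown.

```python
from itertools import combinations

def average_pairwise_z(members, z_scores):
    """Mean Z-score over all unordered pairs of cluster members."""
    pairs = list(combinations(sorted(members), 2))
    if not pairs:
        raise ValueError("need at least two members")
    total = sum(z_scores[frozenset(pair)] for pair in pairs)
    return total / len(pairs)

def is_fold_type(members, z_scores, threshold=2.0):
    """True if the cluster meets the 'average pairwise Z above 2' criterion."""
    return average_pairwise_z(members, z_scores) > threshold

# Made-up Z-scores for four hypothetical domains.
z = {
    frozenset(("domA", "domB")): 8.0,
    frozenset(("domA", "domC")): 3.0,
    frozenset(("domB", "domC")): 4.0,
    frozenset(("domA", "domD")): 1.0,
}
tight = average_pairwise_z(["domA", "domB", "domC"], z)
print(tight, is_fold_type(["domA", "domB", "domC"], z), is_fold_type(["domA", "domD"], z))
```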
Summary

Protein structure database (PDB)

Protein structure visualization software

Structural classification, databases and
servers