Transcript Slide 1
LSM2104/CZ2251 Essential Bioinformatics and Biocomputing
Protein Structure and Visualization (2)
Chen Yu Zong
6874-6877

Lecture 10
Protein structure databases, visualization, and classifications
1. Introduction to Protein Data Bank (PDB)
2. Free graphic software for 3D structure visualization
3. Hierarchical classification of protein domains: SCOP & CATH & DALI

1. Protein Data Bank (PDB)
• Protein Data Bank: maintained by the Research Collaboratory for Structural Bioinformatics (RCSB)
• http://www.rcsb.org/pdb/
– 30060 Structures 15-Mar-2005
– 27570 Structures 05-Oct-2004
– 23997 Structures 20-Jan-2004
• Also contains structures of other bio-macromolecules: DNA, carbohydrates and protein-DNA complexes.

PDB Content Growth

PDB Presentation of Selected Molecules

Deficiencies in our structural knowledge
• Only deposited data is actually available
– Many structures are not deposited in PDB. Why?
• Structures are available mainly for soluble proteins
– Only a few dozen entries for membrane protein domains. Why?
• X-ray data exist only for those proteins that crystallize well or diffract properly. Why?
• NMR structures are usually for small proteins
– How to survey the size of NMR-determined proteins?
• It is estimated that structural data are available for only 10-15% of all known proteins.

Alternative Source of Structure: NCBI

Protein Structure in PDB
• Text files
• Each entry is specified by a unique 4-character code (PDB code): say 1HUY for a variant of GFP; 1BGK for a 37-residue toxin protein isolated from sea anemone
• 1HUY and 1BGK
– Header information
– Atomic coordinates in Å (1 ångström = 1.0e-10 m)

Header Details
• Identifies the molecule, modifications, date of release
• Host organism, keywords, method of study
• Authors, reference, resolution for X-ray structure
– The smaller the number, the better the structure.
• Sequence, reference

The Atomic Coordinates
• XYZ coordinates for each atom (lines starting with ATOM; heavy atoms only for X-ray structures) from the first residue to the last
• XYZ coordinates for any ligands (lines starting with HETATM) complexed to the bio-macromolecule
• O atoms of water molecules (lines starting with HETATM, normally at the end of the coordinate section)
• Usually, for an X-ray structure, the resolution is not high enough to locate H atoms: hence only heavy atoms are shown in the PDB file.
• For an NMR structure, all atoms (including hydrogen atoms) are specified in the PDB file.

X-ray structure 1HUY

NMR structure 1BGK

2. Free Software for Protein Structure Visualization
• RASMOL: available for all platforms http://www.openrasmol.org
• Swiss PDB Viewer: from Swiss-Prot http://www.expasy.ch/spdbv/
• Chemscape Chime Plug-in: for PC and Mac http://www.mdl.com/downloads/downloadable/index.jsp
• YASARA: http://www.yasara.org/
• MOLMOL: MOLecule analysis and MOLecule display http://129.132.45.141/wuthrich/software/molmol/index.html

Ribbon representation by RasMol
1HUY: an improved yellow variant of green fluorescent protein, from Tsien's group, J. Biol. Chem. 276, 29188 (2001)

Ribbon representation by YASARA

Ribbon representation by MOLMOL
An ensemble of 15 structures (NMR, toxin Bgk); hydrogen atoms (protons) also included
• 15 backbone structures of the sea anemone toxin Bgk
• 15 all-atom structures of the sea anemone toxin Bgk

Line representation
Ribbon representation
Space-filling representation

3. Hierarchical classification of protein domains: SCOP & CATH
• SCOP: Structural Classification of Proteins
– University of Cambridge, UK http://scop.mrc-lmb.cam.ac.uk/scop/
– Mirror in Singapore: http://scop.bic.nus.edu.sg/
• CATH: Class, Architecture, Topology, Homologous superfamily, Sequence family
– University College London, UK http://www.biochem.ucl.ac.uk/bsm/cath/

Basis for protein classification
• Proteins adopt a limited number of topologies
– More than 50,000 sequences fold into ~1,000 unique folds.
• Homologous sequences have similar structures
– Usually, when sequence identity is above 30%, proteins adopt the same fold.
– Even in the absence of sequence homology, some folds are preferred by vastly different sequences.
• The "active site" is highly conserved
– A subset of functionally critical residues is found to be conserved even when the folds vary.

How many unique folds do organisms use to express functions?
[Diagram: sequence space (> 50,000 sequences) maps onto conformational space (~1,000 folds?); many sequences form one unique fold.]

Growth of Protein Databases
[Figure: growth by year, 1986-2000; series: Sequences, Structures, Folds; axes: No. of Sequences, No. of Structures and Folds.]

Structural Classification of Proteins: SCOP
• University of Cambridge, UK: http://scop.mrc-lmb.cam.ac.uk/scop/
– mirrored at Singapore: http://scop.bic.nus.edu.sg/
– contains PDB entries grouped hierarchically by:
• Structural class,
• Fold,
• Superfamily,
• Family,
• Individual member (domain-based)

SCOP: Family
• Proteins are clustered together into families on the basis of one of two criteria that imply a common evolutionary origin:
– All proteins that have residue identities of 30% and greater;
– Proteins with lower sequence identities but whose functions and structures are very similar. Example: globins with sequence identities of 15%.
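The first SCOP family criterion (residue identity of 30% or greater) can be sketched in a few lines of Python. This is a minimal illustration, not SCOP's actual procedure: it assumes the two sequences are already aligned, that gaps are written as "-", and that identity is counted only over columns where neither sequence has a gap (other definitions of percent identity exist). The function names are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over alignment columns where neither sequence has a gap.

    Assumption: the inputs are rows of an existing pairwise alignment,
    so they must have the same length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap or y == gap:
            continue  # skip columns that contain a gap
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


def meets_family_identity_criterion(aligned_a: str, aligned_b: str,
                                    threshold: float = 30.0) -> bool:
    """True if identity is at or above the 30% threshold quoted on the slide."""
    return percent_identity(aligned_a, aligned_b) >= threshold


if __name__ == "__main__":
    # Toy sequences, invented for illustration only.
    print(percent_identity("MKTAYIAKQR", "MKTAHIAKQK"))   # 80.0
    print(meets_family_identity_criterion("AAAA", "ACGT"))  # False (25%)
```

A pair that fails this test may still belong to the same family under the second criterion (very similar structure and function, as with the globins at 15% identity); that judgment cannot be reduced to a sequence threshold.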
SCOP: Superfamily
• Families whose proteins have low sequence identities, but whose structures and, in many cases, functional features suggest that a common evolutionary origin is probable, are placed together in superfamilies
• Example: actin, the ATPase domain of the heat shock protein, and hexokinase

SCOP: Fold
• Superfamilies and families are defined as having a common fold if their proteins have the same major secondary structures in the same arrangement with the same topological connections.

SCOP: Class
– For the convenience of users, the different folds have been grouped into classes. Most of the folds are assigned to one of a few structural classes on the basis of the secondary structures of which they are composed

SCOP Class: All-α topologies
[Figures: cytochrome b-562, ferritin]

SCOP Class: All-β topologies
[Figures: β sandwiches, β-barrels]

SCOP Class: α/β topologies
[Figures: α/β horseshoe, α/β barrels]

SCOP Class: α+β topologies
[Figure: Ubiquitin 1ubi]

CATH database
http://www.biochem.ucl.ac.uk/bsm/cath/
CATH: Class, Architecture, Topology, Homologous superfamily, Sequence family
Orengo et al. (1997) CATH - a hierarchical classification of protein domain structures. Structure 5, 1093-1108
• Sequence identity > 30%: the same overall fold
• Sequence identity > 70%: the same overall fold plus similar function

CATH database
Class: derived from secondary structure content; assigned automatically for more than 90% of protein structures.
Architecture: describes the gross orientation of secondary structures, independent of connectivities; currently assigned manually.
Topology: clusters structures according to their topological connections and numbers of secondary structures.
Homologous superfamilies: cluster proteins with highly similar structures and functions. The assignments of structures to topology families and homologous superfamilies are made by sequence and structure comparisons.
Sequence families: structures within each H-level are further clustered on sequence identity. Domains clustered in the same sequence family have sequence identities > 35%.
Lower levels: non-identical sequence domains, identical sequence domains, domains.

CATH database
The class (C), architecture (A) and topology (T) levels in the CATH database
[Figure: Class, Architecture, Topology, Homologous Superfamily]

CATH – architectures
CATH – architectures (cont.)

The protein structure universe in the PDB (1997) by a CATH wheel
The distribution of nonhomologous structures (i.e. a single representative from each homologous superfamily at the H-level in CATH) amongst the different classes (C), architectures (A) and fold families (T) in the CATH database.

SCOP / CATH -> DALI
SCOP & CATH
• Hierarchical and based on abstractions
• Include some manual aspects and are curated by experts in the field of protein structure
Dali
• Presentation of results of computer classification, where the methods that underlie the classification remain internal

Structure comparison: DALI
Comparing protein structures in 3D
[Figure: example folds; legible labels include α/β, antiparallel β-barrel, and meander]

More information about DALI
Touring protein fold space with Dali/FSSP: Liisa Holm and Chris Sander

Compare 3D protein structures by Dali
http://www.ebi.ac.uk/dali/
• The FSSP database (Fold classification based on Structure-Structure alignment of Proteins) is based on exhaustive all-against-all 3D structure comparison of protein structures currently in the Protein Data Bank (PDB).
• The classification and alignments are automatically maintained and continuously updated using the Dali search engine.
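The phrase "exhaustive all-against-all comparison" means that every unordered pair of structures is scored once, so N structures require N(N-1)/2 comparisons. The Python sketch below shows only that bookkeeping. The scoring function is a deliberately simple stand-in and is not the Dali algorithm: it assumes both structures have the same number of points (for example Cα positions) in the same order, and reports the mean absolute difference between their intramolecular distance matrices. All names and coordinates are hypothetical.

```python
from itertools import combinations
from math import dist


def distance_matrix(coords):
    """Matrix of pairwise distances between the points of one structure."""
    return [[dist(p, q) for q in coords] for p in coords]


def distance_matrix_difference(coords_a, coords_b):
    """Toy dissimilarity: mean |d_A(i,j) - d_B(i,j)| over all pairs i < j.

    Stand-in score only. It requires equal-length, pre-matched point lists,
    which a real structure comparison method does not.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("toy score needs structures of equal length")
    n = len(coords_a)
    if n < 2:
        return 0.0
    da = distance_matrix(coords_a)
    db = distance_matrix(coords_b)
    total = sum(abs(da[i][j] - db[i][j])
                for i in range(n) for j in range(i + 1, n))
    return total / (n * (n - 1) / 2)


def all_against_all(structures, score=distance_matrix_difference):
    """Score every unordered pair of named structures exactly once."""
    return {(a, b): score(structures[a], structures[b])
            for a, b in combinations(sorted(structures), 2)}


if __name__ == "__main__":
    # Invented coordinates: B is A shifted in space, C is A scaled by 2.
    toy = {
        "A": [(0, 0, 0), (1, 0, 0), (1, 1, 0)],
        "B": [(5, 5, 5), (6, 5, 5), (6, 6, 5)],
        "C": [(0, 0, 0), (2, 0, 0), (2, 2, 0)],
    }
    for pair, value in all_against_all(toy).items():
        print(pair, round(value, 3))
```

Because the score depends only on internal distances, a structure and its translated copy (A and B above) come out as identical, which is the property that makes distance-based comparison independent of how each entry is positioned in space.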
Dali Domain Dictionary
• Structural domains are delineated automatically using the criteria of recurrence and compactness.
• Each domain is assigned a Domain Classification number DC_l_m_n_p, where:
– l: fold space attractor region
– m: globular folding topology
– n: functional family
– p: sequence family

Functional families
• Evolutionary relationships are inferred from strong structural similarities that are accompanied by functional or sequence similarities.
• Functional families are branches of the fold dendrogram where all pairs have a high average neural network prediction for being homologous.

Sequence families
• A representative subset of the Protein Data Bank is extracted using a 25% sequence identity threshold.
• All-against-all structure comparison is carried out within the set of representatives.
• Homologues are only shown aligned to their representative.

Fold types
• Fold types are defined as clusters of structural neighbours in fold space with average pairwise Z-scores (by Dali) above 2.
[Figure: structural neighbours of 1urnA (top left). 1mli (bottom right) has the same topology even though there are shifts in the relative orientation of secondary structure elements.]

Summary
• Protein structure database (PDB)
• Protein structure visualization software
• Structural classification, databases and servers
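The Domain Classification number DC_l_m_n_p from the Dali Domain Dictionary slide can be unpacked into its four levels with a short Python sketch. It rests on assumptions not stated in the lecture: that the four fields are integers separated by underscores, and that the levels are nested in the order listed, so two domains share a folding topology only when both l and m agree. The identifiers in the example are invented for illustration.

```python
from typing import NamedTuple


class DomainClassification(NamedTuple):
    attractor_region: int    # l: fold space attractor region
    folding_topology: int    # m: globular folding topology
    functional_family: int   # n: functional family
    sequence_family: int     # p: sequence family


def parse_dc(identifier: str) -> DomainClassification:
    """Split a 'DC_l_m_n_p' string into its four levels.

    Assumption: each level is written as an integer.
    """
    parts = identifier.strip().split("_")
    if len(parts) != 5 or parts[0] != "DC":
        raise ValueError(f"not a DC_l_m_n_p identifier: {identifier!r}")
    try:
        l, m, n, p = (int(x) for x in parts[1:])
    except ValueError:
        raise ValueError(f"non-integer level in {identifier!r}") from None
    return DomainClassification(l, m, n, p)


def same_folding_topology(a: str, b: str) -> bool:
    """True if two identifiers agree on l and m (assumes nested levels)."""
    return parse_dc(a)[:2] == parse_dc(b)[:2]


if __name__ == "__main__":
    # Hypothetical identifiers, not taken from the Dali database.
    print(parse_dc("DC_3_12_5_1"))
    print(same_folding_topology("DC_3_12_5_1", "DC_3_12_9_4"))
```

A named tuple keeps the four levels in their hierarchical order, so comparing a prefix of two tuples answers "do these domains agree down to this level?" without further bookkeeping.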