Transcript Document

Worldwide Protein Data Bank
www.wwpdb.org
Agenda
 Welcome and Introductions
 Overview of recent wwPDB progress
 Introduction to the BMRB
 Theoretical model policy
 Issues for discussion and advice
Break
 wwPDB group interactions
 wwPDB plans for 2007
 Long term aims, funding, and stability
 Executive session
 Feedback to wwPDB
 Set next meeting date (July 2007; Salt Lake City, UT?)
wwPDB Achievements
August 2005-October 2006
 Continued growth of archive
 Website updates
 Publications and presentations
 Time stamped archive
 wwPDB team building
 Annotation document
 Remediation
 BMRB formally a member of wwPDB
Deposition issues
The never ending story
Deposition since establishment of
3 sites
PDB entry processing
 1-1-2000
– 10,997 entries in PDB
 Today 1-Oct-2006
– 39,323 entries in PDB
– Total size is 3.6 times what it was when the 3 sites started
 In 1999 2361 entries deposited
 In 2005 6678 entries deposited
We handle 2.8 times as many entries per year with fewer staff, and all 3 sites produce high-quality annotated PDB entries
NO CURRENT BACKLOG OF UN-PROCESSED ENTRIES
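The two growth factors quoted on this slide can be reproduced from the counts it gives. A quick arithmetic check, using only the slide's own numbers (the variable names are illustrative, not from the presentation):

```python
# Counts taken from the slide; variable names are illustrative only.
entries_2000 = 10_997    # entries in PDB on 1-1-2000
entries_2006 = 39_323    # entries in PDB on 1-Oct-2006
deposited_1999 = 2_361   # entries deposited in 1999
deposited_2005 = 6_678   # entries deposited in 2005

# Ratio of archive size now to archive size when the 3 sites started.
archive_growth = entries_2006 / entries_2000

# Ratio of yearly depositions in 2005 to yearly depositions in 1999.
deposition_growth = deposited_2005 / deposited_1999

print(f"archive size ratio:    {archive_growth:.1f}")
print(f"deposition rate ratio: {deposition_growth:.1f}")
```

Rounded to one decimal place, these are the 3.6 and 2.8 quoted on the slide.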
Time-stamped copies of the
archive
 24 Gbytes of data for 2005, released January 3, 2006
 Includes:
– PDB format entries
– mmCIF format entries
– PDBML format entries
– Experimental data
– Dictionary, schema and format documentation
Outreach
 wwPDB website
 Publications and meetings
Joint publications and presentations
 Nucleic Acids Research 2007 Database Issue
– Ensuring a single, uniform archive of PDB data
 Methods in Molecular Biology 2007
– Data deposition and annotation at the wwPDB
 Nature Structural & Molecular Biology, 2006
– Is one solution good enough? (response)
 CODATA (October 23-25, 2006; Beijing, China)
– The Worldwide Protein Data Bank
 Encyclopedia of Genomics, Proteomics, and
Bioinformatics, 2005
– The Protein Data Bank and the wwPDB
The wwPDB Team
wwPDB interactions this year
 Exchange visits
– MSD/RCSB (6) (thanks to WT)
– PDBj/RCSB (1)
– BMRB/RCSB-PDB (3)
 Phone conferences with site directors, twice a year
 VTCs among staff
– BMRB/RCSB: twice a month (ADIT-NMR)
– MSD/RCSB: twice a week (annotation procedures, remediation)
 Email among staff
– MSD/RCSB: ~2 per day
– PDBj/RCSB: ~2 per day
What is the PDB?
 Content
 Processes to ensure quality (annotation
project)
Annotation project
Annotation project
GOALS
 Standardize annotation rules and policies among
wwPDB sites
 Document annotation rules and policies
 Create venue to update annotation rules and
policies as necessary
Annotation project
How did we get there?
 Review and discuss each PDB field by email
and VTC
 Write document and review by all staff
 Final review by site directors
 Implement software compliant to new
annotation procedures
 Test software and train annotators
 Publish document on Web
Annotation project
Resultant document
 Specification of ALL fields in PDB file
 Clarification of policies
– Assignment of PDB IDs
– Release of files and information
– Changes to entries
 Clarification of data representation
– Chain ID for all atoms in the file
– Multi-model representation for alternate conformation or disorder
– Chimeras
– Microheterogeneity
Remediation
Remediation: scope
34,528 Entries Checked
 Primary citations
 Sequences & taxonomy
 Ligand stereochemistry and nomenclature
 Symmetry and coordinate transformations for virus entries
 Diffraction source & beamline
 Miscellaneous uniformity issues
Remediation: statistics
 Citations:
– All primary citations checked
– 8508 citations manually examined
– 7037 citations confirmed and updated
 Sequence and taxonomy:
– 47917 sequences checked
– 20068 updated sequence data references
– 11087 taxonomic references updated
 Virus entries
– 250 entries checked and revised
 Diffraction source
– 10985 entries revised
 Miscellaneous uniformity corrections
– 1041 entries revised
Remediation: statistics
 Ligand stereochemistry and nomenclature
– 7568 ligand definitions checked
– 1758 new ligand definitions added
– 185 ligand definitions obsoleted
– 152,000 ligand instances checked
– 138,230 ligand instances OK
– 6815 ligand instances renamed
Remediation process
 Corrections contributed and reviewed by wwPDB
members
 Corrections on the archival mmCIF data files tracked
in a version tracking system, CVS
 New PDB exchange, PDBML and PDB format data
files being produced now
 Each wwPDB group will validate and load the
resulting files into their database systems
 Invited public testing will begin January 2007
 General availability will start April 2007
Remediation: Ligand dictionary rewrite
 Model and idealized coordinates provided
 Stereochemical configuration assignments
 Aromatic atoms and bonds flagged
 Definitions provided for “Chemistry Catalog” state with leaving atom candidates flagged
 Nonstandard atom names revised (e.g. dinucleotides)
 Duplicate ligand definitions marked as obsolete
 Metal hydrate definitions obsoleted
 Alternate atom name added to store legacy atom names
 SMILES and InChI descriptors provided
Remediation: major entry level
corrections
 Citations:
– PubMed identifiers provided where available
– Unpublished citations checked and flagged
 Sequence and taxonomy:
– UniProt sequence database references
– Taxonomies from NCBI Taxonomy database
 Diffraction source
– Synchrotron facility and beamlines names consistently
specified in coordination with BioSync
Remediation: major ATOM
record changes
 Nomenclature changes
– IUPAC H-atom names for standard amino acids and
nucleotides
– DNA and RNA differentiated (AD (DNA) & A (RNA))
– Modified nucleotides expressed as 3-letter codes
(removed +’s)
– PDB asterisks replaced by single quotes in atom names
– Noncompliant ligands flagged in data files
Remediation: Major REMARK
changes
 Virus entries
– Transformations from deposited frame to point
symmetry and crystallographic frame provided
– NCS and point symmetry transformations properly
differentiated
EM standards
 New dictionary for electron microscopy
 MAP orientation conventions
BMRB
John Markley
Introduction to the BMRB
 BMRB is the worldwide archival site for biomolecular NMR data
 NMR data related to structures are cross referenced to PDB
entries
 PDBj mirrors BMRB and supports external BMRB depositions
 As RCSB members, BMRB and PDB have worked closely to
capture and annotate NMR data associated with deposited
coordinate sets
 Recognizing that the biomolecular NMR community would be best
served by having a “one stop” deposition system for NMR
structures, BMRB has been pursuing this goal in collaboration with
the RCSB-PDB
 BMRB plans to institute the same policy with MSD EBI
wwPDB NMR experimental data flow
[Figure: data-flow diagram. Nodes: BMRB (deposition/processing/export; ADIT-NMR; central archive), RCSB-PDB (deposition; ADIT-NMR), PDBj-BMRB (deposition/processing/export; ADIT-NMR; mirror site), MSD/EBI (deposition/export; CCPN), CERM-BMRB (export; mirror site). Flow labels: deposited data, raw NMR-STAR, processed NMR-STAR.]
Major developments related to BMRB’s role in the wwPDB
 “One-stop” BMRB-PDB ADIT-NMR deposition site for structures and
NMR data developed in collaboration with PDB is operational, with
BMRB assigning PDB accession codes
 Restraints database for legacy structures is nearing completion as part
of the wwPDB “clean-up”; new tools to automate this process were
developed in collaboration with MSD EBI
 NMR-STAR v3 dictionary has been extended and released
 Graphical interface with Jmol displays integrates PDB coordinate data
with associated NMR parameters
 BMRB is working with SG groups to improve efficiency of capturing
protein NMR data
 BMRB participates in the “PDB-BMRB Task Group on NMR”
New “one-stop” deposition of NMR structures/ data
Deposition interface features
 BMRB and RCSB-PDB depositions are
now generated from a joint interface
 BMRB interface has been streamlined
 RCSB-PDB interface for NMR has been
extended with optional fields for conformer
and constraint statistics
 Files in PDB format, mmCIF, and NMR-STAR can be uploaded to
pre-populate a deposition
 Many fields (e.g., experiment name, software name, software author) have
pull-down lists to choose from, for convenience and to improve uniformity
 Fields common to multiple forms (e.g., uploaded data file names, author names,
molecule names) are linked to eliminate the need to retype information
 Help and examples have been improved
Restraints grid is keyed to NMR structural entries
Coordinated displays of NMR data and structures
Theoretical Models Policy
Haruki Nakamura
Models
 Define line between “pure” models and models based
on data
 Large experimental spectrum e.g. X-ray, NMR, EM,
SAX, FRET models
 Homology models especially as derived from
structural genomics
 Need a way to archive models that is totally
compatible with PDB
Defining a policy for models
Workshop at Rutgers (November 19-20, 2005)
 Attended by modelers, structural genomicists,
electron microscopists
 Policies and suggested implementations developed
 Outcome published in Structure
– “Outcome of a Workshop on Archiving Structural Models of Biological
Macromolecules”, Helen M. Berman, Stephen K. Burley, Wah Chiu, Andrej
Sali, Alexei Adzhubei, Philip E. Bourne, Stephen H. Bryant, Roland L.
Dunbrack, Jr., Krzysztof Fidelis, Joachim Frank, Adam Godzik, Kim Henrick,
Andrzej Joachimiak, Bernard Heymann, David Jones, John L. Markley, John
Moult, Gaetano T. Montelione, Christine Orengo, Michael G. Rossmann,
Burkhard Rost, Helen Saibil, Torsten Schwede, Daron M. Standley, John D.
Westbrook, Structure, 2006 14/8:1211-1217.
Models Recommendations
 PDB depositions will be restricted to atomic coordinates that are substantially determined by experimental measurements on specimens containing biological macromolecules.
 A central, publicly available archive or portal should be established for models that are the explicit subject of peer review.
 Methods for assessing model quality are essential for the integrity and long-term success of any publicly available model portal, either from a central repository or a set of linked resources. There was no consensus as to which single method or group of methods should be applied.
Proposed Portal for Multiple Databases for Protein Structures
Berman, H. et al. (2006) Structure, 14, 1211-1217.
[Figure: portal diagram linking theoretical model databases (Theoretical Model DB 1, Theoretical Model DB 2)]
Characteristics of portal
 Data Standards for Models
 Access Models for a Central Portal of Models
– The minimum contents for this portal require a unique identifier for each
model registered with the system, each model's polypeptide chain
sequence, and quality assessment information.
– Additional information should be available, including: keywords, structural
motifs, standard test sets of data, bound ligands, domains, flexibility,
surface electrostatic properties, coding & noncoding SNPs, alternative
splicing, oligomeric state, macromolecular interactions, literature
references, subcellular localization, pathways, transcript profiling, &
drugability.
– Access to these data should be free and constantly available to a diverse
worldwide user community of both model producers and users. Several
levels of access are required for the different levels of users of the portal.
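The minimum contents listed above (a unique identifier, the polypeptide chain sequence, and quality assessment information) could be captured in a record along the following lines. This is only a sketch: the class name, field names, and the representation of quality assessment as a name-to-score mapping are all assumptions, since the slide does not specify a schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ModelRecord:
    """Hypothetical portal entry; field names are illustrative, not a defined schema."""
    model_id: str                          # unique identifier for the model
    chain_sequences: List[str]             # one sequence per polypeptide chain
    quality_assessment: Dict[str, float]   # assumed form: method name -> score
    # Optional additional information (keywords, ligands, references, ...).
    annotations: Dict[str, str] = field(default_factory=dict)

    def has_minimum_contents(self) -> bool:
        """True if identifier, at least one non-empty sequence, and some
        quality assessment information are all present."""
        return (
            bool(self.model_id)
            and bool(self.chain_sequences)
            and all(self.chain_sequences)
            and bool(self.quality_assessment)
        )


record = ModelRecord("MODEL-0001", ["MKTAYIAK"], {"example_method": 0.8})
print(record.has_minimum_contents())
```

The identifier, sequence, and score here are invented placeholders, used only to exercise the check.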
Implementation of models policy
 August 15, 2006: Policy announced with 60 day
period of review
 August 15-October 15, 2006: Transition Plan
– All existing un-processed theoretical model entries as well as
entries deposited during this time were not validated or
processed. Entries will be released as-is without author
review or corrections.
– Authors had the choice of correcting their entries by
withdrawing the original entry and then re-submitting the
corrected version before October 15, 2006.
 October 15, 2006: Theoretical model depositions no
longer accepted
Discussion Issues
Kim Henrick
SAX - New EXP TYPE
 Hamburg to provide templates for consideration
4-letter code?
 Use of the PDB 4-letter code can be extended by allowing alphanumeric
characters in the 1st position, giving
35 x 36 x 36 x 36 = 1,632,960 combinations
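The product on this slide can be checked by counting. The sketch below assumes that the factor 35 means "any alphanumeric character except 0" in the first position, and that the first character is currently restricted to the digits 1-9. Neither is stated on the slide, and the helper name `id_space` is hypothetical.

```python
import string

# 36 alphanumeric characters: 0-9 and A-Z.
ALPHANUMERIC = string.digits + string.ascii_uppercase

# Assumption: the slide's factor of 35 excludes a single character,
# taken here to be the digit 0.
FIRST_CHAR_EXTENDED = ALPHANUMERIC.replace("0", "")

# Assumption: the first character is currently limited to the digits 1-9.
FIRST_CHAR_CURRENT = "123456789"


def id_space(first_chars: str, other_chars: str = ALPHANUMERIC, length: int = 4) -> int:
    """Number of distinct codes of the given length."""
    return len(first_chars) * len(other_chars) ** (length - 1)


extended = id_space(FIRST_CHAR_EXTENDED)
current = id_space(FIRST_CHAR_CURRENT)
print(f"extended: {extended:,}")
print(f"current (under the stated assumption): {current:,}")
```

The extended figure is the 1,632,960 given on the slide. The comparison with the current scheme holds only under the assumption above.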
Patent Office
 The structures in the patent office may not represent
a major loss of structures: current investigations
indicate most patent structures are in the PDB.
 A much larger set of structures, mainly ligand-bound,
is held within Pharma.
wwPDB SAC input request
What is a PDB Entry?
Rules for the smallest structure that can be submitted
– Carbohydrate chains?
– How long is a peptide? (24)
– Non-gene product macromolecular biological ligands (e.g.
antibiotics)?
Particular request from NMR depositors
Issues Annotation: EXP details
Experimental Details
 Twinning: twin factor in REMARK 3 requested, along with
original un-twinned structure factors
 TLS and conventional atomic B factor
 Author-derived validation software and
procedures/results: no longer accepted as in
REMARK 42; now a REMARK to carry software
used and function
Policy: pre-Release Details
 Entries on HOLD or HPUB: currently some details are usually
made public (AUTHOR, TITLE, STATUS)
 Authors request all details suppressed; however,
journals need to check validity of IDCODE ... Yes/No?
Deposition policy
 HPUB/HOLD limit of one year
 Current problem: after one year, no response. Do we release?
 ... fixed rules?
– No problems: release it
– Problems: withdraw it
Major changes after remediation
dictionary changes
 Could have major effects on software
 New dictionary will be announced to many
software developers in early November, 2006
Major changes after remediation
nucleic acids
– DNA and RNA differentiated (AD (DNA) & A (RNA));
residue names are now A = RNA, AD = DNA
– Modified nucleotides expressed as 3-letter codes
(removed +’s), e.g. no longer treated as +C etc.;
C31 is 2'-O-3-AMINOPROPYL CYTIDINE-5'MONOPHOSPHATE
– PDB asterisks replaced by single quotes in atom
names: O2* is back to O2’ as in refinement dictionaries
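The asterisk-to-quote change is simple enough to sketch. The function below covers only that one rule and does not touch residue names or modified-nucleotide codes. Its name is hypothetical, and of the example atom names only O2* appears on the slide; the others are illustrative.

```python
def remediate_atom_name(name: str) -> str:
    """Replace the legacy PDB asterisk with a single quote (O2* -> O2')."""
    return name.replace("*", "'")


# O2* is the slide's example; C1*, O5* and N9 are illustrative only.
legacy_names = ["O2*", "C1*", "O5*", "N9"]
remediated = [remediate_atom_name(n) for n in legacy_names]
print(remediated)
```

Names without an asterisk, such as N9, pass through unchanged.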
Major changes after remediation
H-atom names
IUPAC H-atom names for standard amino acids and nucleotides, as in the BMRB file
http://www.bmrb.wisc.edu/ref_info/atom_nom.tbl
as recommended by the NMR Task Force
Example
New     PDB
H       H
HA      HA
HG12    1HG1    (pro-R)
HG13    2HG1    (pro-S)
HD11    1HD1
HD12    2HD1
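The example rows on this slide amount to a lookup from old PDB hydrogen names to the new IUPAC names. A minimal sketch restricted to those rows follows. The dictionary and function names are hypothetical, and a complete mapping would have to come from the BMRB nomenclature table cited on the slide, not from this excerpt.

```python
from typing import Optional

# Old PDB name -> new IUPAC name, limited to the rows shown on the slide.
OLD_TO_NEW = {
    "H": "H",
    "HA": "HA",
    "1HG1": "HG12",  # pro-R
    "2HG1": "HG13",  # pro-S
    "1HD1": "HD11",
    "2HD1": "HD12",
}


def to_new_name(old_name: str) -> Optional[str]:
    """Return the new name, or None if the atom is not in this excerpt."""
    return OLD_TO_NEW.get(old_name)


print(to_new_name("1HG1"))
print(to_new_name("3HD1"))  # not on the slide, so no answer is guessed
```

Returning None for unlisted atoms is deliberate: the excerpt is too small to infer a general renaming rule.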
Major changes after remediation
other atom names
 “Strange” atom names in co-factors such as
FAD (e.g. AC1*, AN9, AC8) to be replaced by
C1'A, N9A, C8A
 In HEM, atom names change from ‘N A’ to 'NA'
Other issues
Issues Annotation: Disorder/MODEL
Use of MODEL records for disorder involving
alternate conformations of large portions of
structures, e.g. statistical disorder
... in progress
Issues Annotation: ATOM/SEQRES Mismatch
Fitting species-specific ATOM records to a related X-ray
or EM data set from a different species, for example
a large complex such as an ATPase
... needs new tokens; in progress
Very large structures in PDB
 A proposed solution
Representing large complexes in the PDB
PART and ENDPRT
 These records will act much like the existing MODEL/ENDMDL
records, providing a sectioning mechanism within the PDB file.
 PART sections will include records which describe the different
constituent parts of a large molecular system.
Representing large complexes in the PDB
 A PART/ENDPRT section will include all of the PDB record types which
reference specific structural elements of the molecule.
 PDB records that do not define or reference specific elements of molecular
structure will be at the beginning of the multipart PDB file.
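To make the proposal concrete, here is a rough sketch of how a reader might split a multipart file into its leading records and its PART sections. PART/ENDPRT is only a proposal on these slides and no column layout is given, so the serial number after PART, the function name `split_parts`, and the example records are all assumptions. The only format fact relied on is that a PDB record name occupies the first six columns.

```python
from typing import List, Optional, Tuple


def split_parts(lines: List[str]) -> Tuple[List[str], List[List[str]]]:
    """Split records into (records outside any PART, list of PART sections)."""
    outside: List[str] = []
    parts: List[List[str]] = []
    current: Optional[List[str]] = None  # records of the PART being read

    for line in lines:
        record = line[:6].strip()  # PDB record name lives in columns 1-6
        if record == "PART":
            if current is not None:
                raise ValueError("PART found inside an open PART section")
            current = []
        elif record == "ENDPRT":
            if current is None:
                raise ValueError("ENDPRT without a matching PART")
            parts.append(current)
            current = None
        elif current is not None:
            current.append(line)
        else:
            outside.append(line)

    if current is not None:
        raise ValueError("PART section not closed by ENDPRT")
    return outside, parts


# Hypothetical file contents; the PART line layout is an assumption.
example = [
    "HEADER    HYPOTHETICAL MULTIPART ENTRY",
    "TITLE     EXAMPLE ONLY",
    "PART     1",
    "ATOM      1  N   MET A   1",
    "ENDPRT",
    "PART     2",
    "ATOM      1  N   MET A   1",
    "ENDPRT",
]
outside, parts = split_parts(example)
print(len(outside), "leading records;", len(parts), "parts")
```

Keeping each section's records together mirrors the MODEL/ENDMDL analogy drawn on the slide, so chain and atom identifiers only need to be unique within a part.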
Coffee
wwPDB in 2007
 Same again .. but more of it (deposition and
processing)
IN ADDITION
 Rollout new files
 Implement new annotation procedures
 Discuss feasibility of a single
deposition/processing system
 Further team exchanges
 Gather Pharma structures
Long term goals, funding and
stability
We would really like to be the world wide
PDB with regular stable funding
Acknowledgements
E-MSD is supported by grants from the Wellcome Trust, the EU (TEMBLOR,
NMRQUAL and IIMS), CCP4, the BBSRC, the MRC and EMBL.
PDBj is supported by grant-in-aid from the Institute for Bioinformatics Research
and Development, Japan Science and Technology Agency (BIRD-JST), and the
Ministry of Education, Culture, Sports, Science and Technology (MEXT).
The BMRB is supported by NIH grant LM05799 from the National
Library of Medicine.
The RCSB PDB is supported by grants from the National Science Foundation, National
Institute of General Medical Sciences, the Office of Science-Department of Energy, the
National Library of Medicine, the National Cancer Institute, the National Center for
Research Resources, the National Institute of Biomedical Imaging and Bioengineering,
the National Institute of Neurological Disorders and Stroke, and the National Institute of
Diabetes & Digestive & Kidney Diseases.