Transcript Slide 1

The BARCODE Data Standard:
CBOL’s Partnership with the
International Nucleotide Sequence
Database Collaboration (INSDC)
David E. Schindel, Executive Secretary
National Museum of Natural History
Smithsonian Institution
[email protected]; http://www.barcoding.si.edu
202/633-0812; fax 202/633-2938
Infrastructure of Taxonomy:
Fragmented, Disconnected
• Collections and databases of specimens
• Seedbanks, culture/cell line collections
• Compilations of taxonomic names
• Floristic and faunistic surveys/inventories
• Monographs, Taxonomic revisions
• Data repositories (gene sequences,
characters, images, trees)
• The (undigitized) Taxonomic Literature
Linking Logical Categories (1):
Specimens, Names, Opinions
Voucher Specimen
??
Journal Publication
Species Name
Linking Logical Categories (2):
Naming and defining species
Voucher Specimen
Journal Publication
Holotype specimens
Species Name
Linking Logical Categories (3):
Establishing species boundaries
Voucher Specimen
??
Journal Publication
Species concept beyond holotype
- Paratype series
- Typological versus population thinking
- Genetic lineages
- BSC (hard to apply)
Species Name
Linking Logical Categories (4):
Interpreting species boundaries
Voucher Specimen
Other assigned specimens:
• Species philosophy of original author
• Interpretation of user
??
Journal Publication
Species Name
Databases of Names, Specimens,
Species Distributions
Voucher Specimen
Museum databases of associated data
Databases of species occurrences and distribution (OBIS)
Journal Publication
Species Name
Authority files of taxonomic names
DNA Barcodes:
A Key Variable for Biodiversity
Informatics
Voucher Specimen
Museum databases of associated data
Databases of species occurrences and distribution (OBIS)
Barcode Sequence
Journal Publication
Species Name
Authority files of taxonomic names
CBOL’s Working Groups
• Database: Designing/constructing the
Barcode Section of GenBank
• DNA: Protocols for formalin-fixed and old
museum specimens; Producing LIMS for
dissemination
• Data Analysis: Beyond phenetic methods;
population genetics perspective
• (Plants: Initiated discussions of plant
barcode gene region(s))
BARCODE Data Standards
• Consultations with GenBank, ITIS, museum
database developers, GBIF, ISIS, from 2004
• Consensus results of Front Royal meeting
– GBIF
– NBII
– ICZN
– ITIS
– Species2000
– ZooRecord
– GRIN
– IPNI
– OBIS
• GenBank proposed to International Nucleotide
Sequence Database Collaboration (EMBL,
DDBJ)
• Approved by CBOL and INSDC mid-2005
Reserved Keyword “BARCODE”
• GenBank reviews records against standard
• Adds keyword “BARCODE” in annotation field
• Can be removed by CBOL
Requirements
• Species name selected from authority
• Sequence from COI or other barcode region
approved by CBOL
• Structured link to voucher specimen
• Online access to metadata
• Trace files and quality scores
• Primer sequences and names
• Minimum sequence length (500bp for COI)
• Geographic locality
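The requirements above can be read as a checklist applied to each submitted record. The following is a minimal sketch in Python, not GenBank's actual review procedure: the `BarcodeRecord` class, its field names, and the `missing_requirements` helper are all hypothetical, and only the 500 bp minimum for COI comes from the slide.

```python
from dataclasses import dataclass, field
from typing import List, Set

MIN_COI_LENGTH = 500  # base pairs; the minimum stated on the slide for COI


@dataclass
class BarcodeRecord:
    # All field names are hypothetical; they are not INSDC field names.
    species_name: str = ""
    gene_region: str = ""
    sequence: str = ""
    voucher_id: str = ""
    metadata_online: bool = False
    has_trace_files: bool = False
    has_quality_scores: bool = False
    primers: List[str] = field(default_factory=list)  # e.g. "name:sequence"
    locality: str = ""


def missing_requirements(record: BarcodeRecord,
                         authority_names: Set[str],
                         approved_regions: Set[str]) -> List[str]:
    """Return a list of unmet requirements (empty if none are missing)."""
    problems = []
    if record.species_name not in authority_names:
        problems.append("species name not found in authority list")
    if record.gene_region not in approved_regions:
        problems.append("gene region not approved")
    if not record.voucher_id:
        problems.append("no structured link to voucher specimen")
    if not record.metadata_online:
        problems.append("metadata not accessible online")
    if not (record.has_trace_files and record.has_quality_scores):
        problems.append("trace files or quality scores missing")
    if not record.primers:
        problems.append("primer names and sequences missing")
    # The slide gives a minimum length only for COI, so only COI is checked.
    if record.gene_region == "COI" and len(record.sequence) < MIN_COI_LENGTH:
        problems.append(f"sequence shorter than {MIN_COI_LENGTH} bp")
    if not record.locality:
        problems.append("geographic locality missing")
    return problems
```

Returning a list of problems rather than a single pass/fail value lets a submitter see every gap at once.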
Recommended fields,
added to INSDC at CBOL’s request
• Latitude and longitude
• Name of the identifier
• Name of the collector
• Date of collection
New Data Fields
Latitude/Longitude
Collection date
Collector’s name
Identifier’s name
BARCODE Keyword in GenBank
BARCODE Records in INSDC
Specimen Metadata: Georeference, Habitat, Character sets, Images, Behavior, Other genes
Other Databases: Phylogenetic, Pop’n Genetics, Ecological
Voucher Specimen
Barcode Sequence
Trace files
Primers
Literature (link to content or citation)
Species Name
Indices: Catalogue of Life, GBIF/ECAT
Nomenclators: Zoo Record, IPNI, NameBank
Publication links: New species
Databases: Provisional sp.
Structured link to Vouchers
What constitutes a voucher?
• Long-term reference tied to BARCODE
• Corroborates the species identification
• Provides additional tissue
• CBOL relies on community decisions:
– Full specimen?
– Parts for morphologic features (e.g., feather?)
– Frozen tissue?
– E-Vouchers for large specimens, destructive
samples, catch-and-release?
Where’s the voucher?
Structured Voucher IDs
Linking to Vouchers
Voucher Specimen ID
• Based on Darwin Core
• Eventually will be replaced by GUID
• Triplet:
Institution Acronym : Collection : Specimen #
NMNH : FISH : 123456
• CBOL, GBIF and NCBI discussing global
registry of:
– Institutional acronyms
– Collection codes
– “Pre-accession” specimen IDs
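The triplet format can be illustrated with a small parser. This is a sketch only: the `VoucherID` type and both function names are hypothetical, and the colon separator is taken from the example on the slide rather than from any published specification.

```python
from typing import NamedTuple


class VoucherID(NamedTuple):
    institution: str  # institution acronym, e.g. "NMNH"
    collection: str   # collection code, e.g. "FISH"
    specimen: str     # specimen number within the collection


def parse_voucher_id(text: str) -> VoucherID:
    """Split 'Institution : Collection : Specimen #' into its three parts."""
    parts = [part.strip() for part in text.split(":")]
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected three non-empty fields, got: {text!r}")
    return VoucherID(*parts)


def format_voucher_id(voucher: VoucherID) -> str:
    """Render the triplet in the spacing used on the slide."""
    return " : ".join(voucher)
```

For example, `parse_voucher_id("NMNH : FISH : 123456")` yields the institution `NMNH`, the collection `FISH`, and the specimen number `123456`.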
Link to Species Names
Species names in INSDC
NCBI Taxonomy Browser
The good, the bad, and the ugly
• Species names provided by submitters
• Checked against compilations
• Linkout to Catalogue of Life, other sources
• Names not found added to Taxonomy
Browser
• Submitters informed of errors but not
forced to make corrections
NCBI Taxonomy Browser
Some names have no other source
Other names linked to GBIF and
Catalogue of Life…
…and primary data source
Authoritative Species Lists
• Catalogue of Life
• Species lists compiled by barcoding projects
– FISH-BOL from FishBase, CoF
– MBI mosquito catalog
• Nomenclators
• NameBank
• New names in publications
• Eventually, central registries (e.g., ZooBank)
Provisional Species ID
• Uncertain identifications
• Species complexes
• Newly discovered variants
• Ecogenomic samples
• Need general guidelines to ensure:
– Globally unique,
– Stable, retrievable
– Can’t be confused with valid species name
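Two of these goals, uniqueness and not being mistaken for a valid species name, can be checked mechanically. The sketch below is one hypothetical way to do so and is not a CBOL guideline: the label format, the binomial pattern, and the in-memory registry are all assumptions made for illustration.

```python
import re

# Assumption: a label shaped like "Genus epithet" could be mistaken
# for a valid Latin binomial.
BINOMIAL_PATTERN = re.compile(r"[A-Z][a-z]+ [a-z-]+")


def looks_like_binomial(label: str) -> bool:
    """True if the whole label has the shape of a Latin binomial."""
    return BINOMIAL_PATTERN.fullmatch(label.strip()) is not None


class ProvisionalRegistry:
    """Hypothetical in-memory registry enforcing unique provisional IDs."""

    def __init__(self) -> None:
        self._labels = set()

    def register(self, label: str) -> str:
        label = label.strip()
        if looks_like_binomial(label):
            raise ValueError(f"could be confused with a species name: {label!r}")
        if label in self._labels:
            raise ValueError(f"already registered: {label!r}")
        self._labels.add(label)
        return label
```

Stability and retrievability depend on how such a registry is hosted and curated, which a code sketch cannot show.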
Connecting taxonomic articles
Improving links to
taxonomic journals
Links to Taxonomic Literature
• Library-Laboratory meeting in London,
2005, on electronic access to taxonomic
literature
• Led to formation of Biodiversity Heritage
Library initiative
• Proactive steps with PubMed to add
taxonomic journals to online abstracts
• Aggressive negotiation with publishers of
barcoding papers
• Involvement in Encyclopedia of Life
Long-term data curation
of BARCODE records
Data records assembled
Community feedback
Data records released on INSDC
Update records (audit trail of species names retained)
IDs consistent with other records?
GenBank adds BARCODE flag
CBOL control of BARCODE flag
Compliant with BARCODE standards?
Data records published in BOLD
Acknowledgements
Robert Hanner, University of Guelph,
Chair of CBOL’s Database Working Group
Scott Federhen, NCBI Taxonomy Browser
Donald Hobern, Head of Informatics, GBIF