Bioinformatics Overview, NCBI & GenBank
What is Bioinformatics?
• Find three different definitions of the word
• How is “bioinformatics different from
• What areas of biological research are dependent on
What is Bioinformatics Used For?
• Database searching
• Genome annotation
• Sequence analysis
• Molecular evolution
• Gene expression
• Genome assembly
Introduction to NCBI
• NCBI, EMBL & DDBJ
• What role do these organizations play in global society?
• How do their missions differ?
• NCBI Training and Tutorials page
• The NCBI Handbook
• NCBI How-To page
• NCBI Help Manual
GenBank
• Annotated collection of all publicly available
nucleotide sequences and their protein translations.
• Receives sequences produced in laboratories
throughout the world from more than 100,000
• Grows exponentially, doubling every 10 months
• Initially built and maintained at Los Alamos
• Transferred to NCBI in early 1990s by congressional
• Most journal publishers require deposition of
sequence data into GenBank prior to publication so
that an accession number may be cited.
• Submitters may keep their data confidential for a
specified period of time prior to publication.
• A typical GenBank submission consists of a single,
contiguous stretch of DNA or RNA sequence
(a contig) with annotations (metadata).
• If part of a nucleotide sequence encodes a protein, a
conceptual translation, called a CDS (coding
sequence), is annotated and its span is mapped.
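The idea of a conceptual translation can be sketched in a few lines of Python. This is only an illustration, not GenBank's own procedure: the sequence, the span 3..14, and the function name translate_cds are made up for the example, and the sketch assumes the standard genetic code, a plus-strand CDS, and a single unspliced span given as 1-based inclusive coordinates.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(sequence, start, end):
    """Translate the CDS at 1-based inclusive positions start..end."""
    # Accept RNA input by mapping U to T, then cut out the annotated span.
    cds = sequence.upper().replace("U", "T")[start - 1:end]
    if len(cds) % 3 != 0:
        raise ValueError("CDS span length is not a multiple of 3")
    protein = []
    for i in range(0, len(cds), 3):
        # Codons with ambiguous bases are reported as X.
        amino_acid = CODON_TABLE.get(cds[i:i + 3], "X")
        if amino_acid == "*":  # stop codon ends the translation
            break
        protein.append(amino_acid)
    return "".join(protein)


record_sequence = "GGATGGCCAAGTAACC"  # made-up nucleotide sequence
print(translate_cds(record_sequence, 3, 14))  # prints MAK
```

Here the span 3..14 covers ATG GCC AAG TAA, so the annotated translation would be MAK.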
• HTGS entries are submitted in bulk by genome
centers, processed by an automated system, and then
released to GenBank.
• Currently, about 30 genome centers are submitting
data for a number of organisms, including human,
mouse, rat, rice, and Plasmodium falciparum.
• Data are submitted in four phases.
• Phase 0: Sequences are one-to-few reads of a single clone
and are not usually assembled into contigs. They are
low-quality sequences that are often used to check whether
another center is already sequencing a particular clone.
• Phase 1: Entries are assembled into contigs that are separated
by sequence gaps, the relative order and orientation of which
are not known.
• Phase 2: Entries are also unfinished sequences that may or
may not contain sequence gaps. If there are gaps, then the
contigs are in the correct order and orientation.
• Phase 3: Sequences are of finished quality and have no gaps.
For each organism, the group overseeing the sequencing
effort determines the definition of finished quality.
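The four phases can be read as a small decision rule. The sketch below is one possible encoding in Python, assuming a record can be summarized by four yes/no properties; the enum, the function name classify_htgs_phase, and its parameters are hypothetical and do not correspond to any NCBI tool.

```python
from enum import IntEnum


class HtgsPhase(IntEnum):
    UNASSEMBLED_READS = 0  # one-to-few reads, usually not assembled
    UNORDERED_CONTIGS = 1  # contigs with gaps, order and orientation unknown
    ORDERED_CONTIGS = 2    # unfinished; any gaps separate ordered contigs
    FINISHED = 3           # finished quality, no gaps


def classify_htgs_phase(assembled, has_gaps, contigs_ordered, finished):
    """Map four yes/no properties of an entry to a phase (hypothetical helper)."""
    if not assembled:
        return HtgsPhase.UNASSEMBLED_READS
    if finished and not has_gaps:
        return HtgsPhase.FINISHED
    if has_gaps and not contigs_ordered:
        return HtgsPhase.UNORDERED_CONTIGS
    # Everything else is unfinished sequence that is gap-free or correctly ordered.
    return HtgsPhase.ORDERED_CONTIGS


phase = classify_htgs_phase(
    assembled=True, has_gaps=True, contigs_ordered=True, finished=False
)
print(phase.name, int(phase))  # prints ORDERED_CONTIGS 2
```

What counts as finished is left as an input because, as noted above, each sequencing group defines finished quality for its own organism.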
Whole Genome Shotgun
• Shotgun sequence reads are assembled into contigs,
submitted, and updated as the sequencing project progresses
and new assemblies are computed.
EST, STS, and GSS
• EST = Expressed Sequence Tags (dbEST): Short (< 1
kb), single-pass cDNA sequences from a particular tissue
and/or developmental stage. They lack annotation.
• STS = Sequence Tagged Sites (dbSTS): Short genomic
landmark sequences. They are operationally unique in
that they are specifically amplified from the genome by
PCR amplification. They define a specific location on the
genome and are thus useful for mapping.
• GSS = Genome Survey Sequences (dbGSS): Short
sequences derived from genomic DNA, about which little
HTC and FLIC
• HTC = High-Throughput cDNA/mRNA: Similar to
ESTs, but often contain more information. May have a
systematic gene name that is related to the lab or center
that submitted them, and the longest ORF is often
annotated as a coding region.
• FLIC = Full-Length Insert cDNA: Contains the entire
sequence of a cloned cDNA/mRNA. Generally longer,
and sometimes full-length mRNAs. Usually annotated
with genes and coding regions. Gene names may be systematic
rather than functional.
• BankIt: Web-based form for submission of a small
number of sequences with minimal annotation to
• Sequin: More appropriate for complicated
submissions containing a significant amount of
annotation or many sequences. Stand-alone
application available on NCBI’s FTP site.
Sequence Data Flow and
• Triage: Within 48 hours of direct submission with BankIt or
Sequin, the database staff reviews the submission to determine
whether it meets the minimal criteria and then assigns an
• All sequences must be > 50 bp in length and be sequenced by, or
on behalf of, the group submitting the sequence.
• GenBank will not accept sequences constructed in silico
• GenBank will not accept noncontiguous sequences containing
internal, unsequenced spacers.
• GenBank will not accept sequences for which there is not a
physical counterpart, such as those derived from a mix of
genomic DNA and mRNA.
• Submissions are checked to determine whether they are new or
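Two of the minimal criteria above lend themselves to a quick self-check before submitting. The following Python sketch is not the GenBank staff's procedure: the function name triage_problems and the max_n_run parameter are hypothetical, and it assumes that an internal unsequenced spacer shows up as a long run of N characters inside the sequence.

```python
import re

MIN_LENGTH_BP = 50  # from the criteria above: sequences must be > 50 bp


def triage_problems(sequence, max_n_run=10):
    """Return a list of problems found; an empty list means none detected.

    max_n_run is an assumed threshold: longer internal runs of N are
    treated as unsequenced spacers.
    """
    seq = sequence.upper()
    problems = []
    if len(seq) <= MIN_LENGTH_BP:
        problems.append("sequence is not longer than 50 bp")
    # Ignore Ns at the ends; only internal runs count as spacers here.
    internal = seq.strip("N")
    if re.search("N{%d,}" % (max_n_run + 1), internal):
        problems.append("possible internal unsequenced spacer")
    return problems


print(triage_problems("ACGT" * 20))  # prints []
print(triage_problems("ACGT" * 10 + "N" * 20 + "ACGT" * 10))
```

The other criteria, such as whether the sequence has a physical counterpart, cannot be checked from the sequence string alone.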
Sequence Data Flow and
• Biological validity: Translation, organism lineage, BLAST searches
• Vector contamination: Is there any vector DNA present in the
• Publication status: If published, citation is included in annotation and
linked to Entrez
• Formatting and spelling
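To make the vector contamination check concrete, here is a deliberately naive Python sketch that flags positions where a submitted sequence shares an exact k-mer with a known vector sequence. It is an assumption-laden illustration only: real screening relies on alignment-based searches rather than exact matching, and the function name find_vector_kmer_hits, the value of k, and both sequences are made up for the example.

```python
def find_vector_kmer_hits(query, vector, k=12):
    """Return 0-based start positions in query whose k-mer also occurs in vector."""
    query = query.upper()
    vector = vector.upper()
    # Index every k-mer of the vector once for fast lookup.
    vector_kmers = {vector[i:i + k] for i in range(len(vector) - k + 1)}
    return [
        i
        for i in range(len(query) - k + 1)
        if query[i:i + k] in vector_kmers
    ]


vector_fragment = "GAATTCGAGCTCGGTACC"      # illustrative vector fragment
submission = "CCCCC" + "CGAGCTCGG" + "AAAAA"  # made-up insert with vector carry-over
print(find_vector_kmer_hits(submission, vector_fragment, k=8))  # prints [5, 6]
```

An empty list only means that no exact k-mer is shared; it does not rule out contamination with mismatches or from a vector that is not in the comparison set.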
Sequences are sent to submitter for final review before release into
the public database.
Sequences must become publicly available once the accession
number or the sequence has been published.
GenBank annotation staff process about 1900
submissions/month, or about 20,000 sequences.
RefSeq
• A curated collection of DNA, RNA, and protein sequences
built by NCBI.
• Unlike GenBank, RefSeq provides only one example of each
natural biological molecule for major organisms ranging from
viruses to bacteria to eukaryotes.
• May include separate linked records for genomic DNA, the
gene transcripts, and the proteins arising from those transcripts.
• Limited to major organisms for which sufficient data are
available (only 4,000 as of Jan 2007), while GenBank includes
sequences for any organism submitted (~250k different
Third Party Annotation (TPA)
• Contains nucleotide sequences built from existing
primary data with new annotation that has been
published in a peer-reviewed scientific journal.
• Two types of records:
• Experimental: Annotation supported by wet-lab evidence
• Inferential: Annotation inferred only
• Bridges the gap between GenBank and RefSeq by
permitting authors who publish new experimental evidence
to re-annotate sequences in a public database as they
think best, even if they are not the primary sequencer or
the curator of a model organism database.
Universal Protein Resource (UniProt)
• Protein sequence database that was formed through
the merger of three protein databases:
1. Swiss-Prot, from the Swiss Institute of Bioinformatics
and the European Bioinformatics Institute
2. The Translated EMBL Nucleotide Sequence Data
Library (TrEMBL), from the same two institutes
3. Georgetown University’s Protein Information
Resource Protein Sequence Database (PIR-PSD)
• Linked on today’s web page