Bioinformatics Overview, NCBI & GenBank

JanPlan 2012
What is Bioinformatics?
• Find three different definitions of the word
“bioinformatics”
• How is “bioinformatics” different from
“computational biology”?
• What areas of biological research are dependent on
bioinformatics?
What is Bioinformatics Used
For?
• Database searching
• Genome annotation
• Sequence analysis
• Metagenomics
• Phylogenetic
reconstruction
• Molecular evolution
• Gene expression
• Genome assembly
Introduction to NCBI
• NCBI, EMBL & DDBJ
• What role do these organizations play in global society?
• How do their missions differ?
• NCBI Training and Tutorials page
• The NCBI Handbook
• NCBI How-To page
• NCBI Help Manual
GenBank
• Annotated collection of all publicly available
nucleotide sequences and their protein translations.
• Receives sequences produced in laboratories
throughout the world from more than 100,000
distinct organisms.
• Grows exponentially, doubling every 10 months
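The doubling claim can be turned into simple arithmetic. The following is a minimal sketch, assuming a constant 10-month doubling time as quoted above (actual growth rates vary over time); the function name `growth_factor` is illustrative only.

```python
def growth_factor(months: float, doubling_time: float = 10.0) -> float:
    """Factor by which the database grows over `months`, assuming a
    constant doubling time (10 months is the figure quoted on the slide)."""
    return 2 ** (months / doubling_time)


# One doubling period doubles the size; one year gives roughly 2.3x.
print(growth_factor(10))            # 2.0
print(round(growth_factor(12), 2))  # 2.3
```

Under this assumption, two doubling periods (20 months) give a fourfold increase.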
GenBank
• Initially built and maintained at Los Alamos
National Laboratory.
• Transferred to NCBI in the early 1990s by congressional
mandate.
• Most journal publishers require deposition of
sequence data into GenBank prior to publication so
an accession number may be cited.
• Submitters may keep their data confidential for a
specified period of time prior to publication.
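A cited accession number can be used to retrieve the corresponding record. The sketch below only builds a request URL for NCBI's E-utilities `efetch` service and does not contact the network; the helper name `genbank_record_url` is hypothetical, and the accession in the usage comment is just an example value.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities efetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def genbank_record_url(accession: str) -> str:
    """Build an efetch URL that requests a nucleotide record in
    GenBank flat-file format as plain text."""
    params = {
        "db": "nuccore",    # nucleotide database
        "id": accession,    # accession number cited in the publication
        "rettype": "gb",    # GenBank flat-file format
        "retmode": "text",
    }
    return f"{EFETCH}?{urlencode(params)}"


# Example accession; substitute the one cited in a paper.
print(genbank_record_url("U49845"))
```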
Direct Submission
• A typical GenBank submission consists of a single,
contiguous stretch of DNA or RNA sequence
(a contig) with annotations (metadata).
• If part of a nucleotide sequence encodes a protein, a
conceptual translation, called a CDS (coding
sequence), is annotated and its span is mapped.
• Example
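To illustrate what a conceptual translation of an annotated span looks like, here is a minimal sketch that translates a 1-based, inclusive span of a nucleotide sequence using the standard genetic code. The function name `translate_cds` and the toy sequence are assumptions made for illustration; real CDS features can also be on the reverse strand, be split across several intervals, or use other genetic codes, none of which is handled here.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(seq: str, start: int, end: int) -> str:
    """Translate the span start..end (1-based, inclusive) of `seq`.

    Unknown codons (for example, containing N) become 'X', and a single
    trailing stop codon is dropped from the returned protein.
    """
    cds = seq[start - 1:end].upper().replace("U", "T")
    if len(cds) % 3 != 0:
        raise ValueError("CDS span length is not a multiple of 3")
    protein = "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, len(cds), 3)
    )
    return protein[:-1] if protein.endswith("*") else protein


# Toy record: the CDS spans positions 3..14 (ATG GCC AAG TAA).
print(translate_cds("GGATGGCCAAGTAACC", 3, 14))  # MAK
```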
High-Throughput Genomic
Sequence (HTGS)
• HTGS entries are submitted in bulk by genome
centers, processed by an automated system, and then
released to GenBank.
• Currently, about 30 genome centers are submitting
data for a number of organisms, including human,
mouse, rat, rice, and Plasmodium falciparum.
High-Throughput Genomic
Sequence (HTGS)
• Data submitted in 4 phases.
• Phase 0: Sequences are one-to-few reads of a single clone
and are not usually assembled into contigs. They are
low-quality sequences that are often used to check whether
another center is already sequencing a particular clone.
• Phase 1: Entries are assembled into contigs that are separated
by sequence gaps, the relative order and orientation of which
are not known.
• Phase 2: Entries are also unfinished sequences that may or
may not contain sequence gaps. If there are gaps, then the
contigs are in the correct order and orientation.
• Phase 3: Sequences are of finished quality and have no gaps.
For each organism, the group overseeing the sequencing
effort determines the definition of finished quality.
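The four phases can be read as a small decision procedure. The sketch below is one possible encoding of the definitions above, not an NCBI tool; the `HtgsEntry` fields and the `htgs_phase` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class HtgsEntry:
    assembled: bool         # reads assembled into contigs?
    ordered_oriented: bool  # contig order and orientation known?
    finished: bool          # finished quality, as defined by the sequencing group
    gaps: int               # number of sequence gaps


def htgs_phase(entry: HtgsEntry) -> int:
    """Map an entry's properties onto HTGS phases 0-3 as described above."""
    if not entry.assembled:
        return 0  # one-to-few unassembled reads of a single clone
    if entry.finished and entry.gaps == 0:
        return 3  # finished quality, no gaps
    if entry.gaps == 0 or entry.ordered_oriented:
        return 2  # unfinished; any gaps separate ordered, oriented contigs
    return 1      # contigs whose order and orientation are unknown


print(htgs_phase(HtgsEntry(True, False, False, 3)))  # 1
print(htgs_phase(HtgsEntry(True, True, False, 2)))   # 2
```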
Whole Genome Shotgun
Sequences (WGS)
• Shotgun sequence reads are assembled into contigs,
submitted, and updated as the sequencing project progresses
and new assemblies are computed.
EST, STS, and GSS
• EST = Expressed Sequence Tags (dbEST): Short (< 1
kb), single-pass cDNA sequences from a particular tissue
and/or developmental stage. They lack annotation.
• STS = Sequence Tagged Sites (dbSTS): Short genomic
landmark sequences. They are operationally unique in
that they are specifically amplified from the genome by
PCR amplification. They define a specific location on the
genome and are thus useful for mapping.
• GSS = Genome Survey Sequences (dbGSS): Short
sequences derived from genomic DNA, about which little
is known.
HTC and FLIC
• HTC = High-Throughput cDNA/mRNA: Similar to
ESTs, but often contain more information. May have a
systematic gene name that is related to the lab or center
that submitted them, and the longest ORF is often
annotated as a coding region.
• FLIC = Full-Length Insert cDNA: Contains the entire
sequence of a cloned cDNA/mRNA. Generally longer,
and sometimes full-length, mRNAs. Usually annotated
with genes and coding regions. May carry systematic gene
names rather than functional names.
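Annotating the longest ORF as the coding region, as described for HTC entries, can be sketched as follows. This is a simplified illustration and not the procedure any particular center uses: it scans only the forward strand, requires an ATG start and an in-frame stop codon, and the name `longest_orf` is an assumption.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq: str):
    """Return (start, end), 1-based inclusive, of the longest ATG..stop
    open reading frame on the forward strand, or None if there is none."""
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None  # 0-based index of the current ORF's ATG, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                length = i + 3 - start
                if best is None or length > best[1] - best[0] + 1:
                    best = (start + 1, i + 3)
                start = None
    return best


# Toy sequence with a 9-nt ORF and a 15-nt ORF in different frames.
print(longest_orf("CCATGAAATGACATGGCTGCTAAATAGGG"))  # (13, 27)
```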
Submission Tools
• BankIt: Web-based form for submission of a small
number of sequences with minimal annotation to
GenBank.
• Sequin: More appropriate for complicated
submissions containing a significant amount of
annotation or many sequences. Stand-alone
application available on NCBI’s FTP site.
Sequence Data Flow and
Processing
• Triage: Within 48 hours of direct submission with BankIt or
Sequin, the database staff reviews the submission to determine
whether it meets the minimal criteria and then assigns an
Accession number.
• All sequences must be > 50 bp in length and be sequenced by, or
on behalf of, the group submitting the sequence.
• GenBank will not accept sequences constructed in silico.
• GenBank will not accept noncontiguous sequences containing
internal, unsequenced spacers.
• GenBank will not accept sequences for which there is not a
physical counterpart, such as those derived from a mix of
genomic DNA and mRNA.
• Submissions are checked to determine whether they are new or
updates.
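Some of the minimal criteria above lend themselves to an automated pre-check before submission. The sketch below is not GenBank's actual triage code; it checks only the length rule and flags long runs of N as a possible internal unsequenced spacer. The function name and the `max_n_run` threshold are assumptions, and the remaining criteria (who sequenced it, physical counterpart) require human review.

```python
import re


def triage_problems(seq: str, min_length: int = 50, max_n_run: int = 10):
    """Return a list of problems found in `seq`; an empty list means
    none of these automated checks failed.

    `max_n_run` is a hypothetical threshold: longer runs of N are
    treated as a possible internal, unsequenced spacer.
    """
    seq = seq.upper()
    problems = []
    if len(seq) <= min_length:
        problems.append(f"sequence must be longer than {min_length} bp")
    if re.search("N{%d,}" % (max_n_run + 1), seq):
        problems.append("possible internal unsequenced spacer")
    return problems


print(triage_problems("ACGT" * 20))  # []
print(triage_problems("ACGT" * 10))  # ['sequence must be longer than 50 bp']
```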
Sequence Data Flow and
Processing
• Indexing:
• Biological validity: Translation, organism lineage, BLAST searches
• Vector contamination: Is there any vector DNA present in the
sequence?
• Publication status: If published, citation is included in annotation and
linked to Entrez
• Formatting and spelling
• Sequences are sent to submitter for final review before release into
the public database.
• Sequences must become publicly available once the accession
number or the sequence has been published.
• GenBank annotation staff process about 1900
submissions/month, or about 20,000 sequences.
RefSeq
• A curated collection of DNA, RNA, and protein sequences
built by NCBI.
• Unlike GenBank, RefSeq provides only one example of each
natural biological molecule for major organisms ranging from
viruses to bacteria to eukaryotes.
• May include separate linked records for genomic DNA, the
gene transcripts, and the proteins arising from those transcripts.
• Limited to major organisms for which sufficient data is
available (only 4000 as of Jan 2007), while GenBank includes
sequences for any organism submitted (~250k different
organisms).
Third Party Annotation (TPA)
database
• Contains nucleotide sequences built from existing
primary data with new annotation that has been
published in a peer-reviewed scientific journal.
• Two types of records:
• Experimental: Annotation supported by wet-lab evidence
• Inferential: Annotation inferred by analysis, without direct wet-lab evidence
• Bridges the gap between GenBank and RefSeq by
permitting authors who publish new experimental evidence
to re-annotate sequences in a public database as they
think best, even if they are not the primary sequencer or
the curator of a model organism database.
Universal Protein Resource
(UniProt)
• Protein sequence database that was formed through
the merger of three protein databases:
1. Swiss-Prot, from the Swiss Institute of Bioinformatics
and the European Bioinformatics Institute
2. Translated EMBL Nucleotide Sequence Data
Library (TrEMBL), from the same two institutes
3. Georgetown University’s Protein Information
Resource Protein Sequence Database (PIR-PSD)
Problem Set
• ftp://ftp.ncbi.nih.gov/pub/education/tutorials/genbank.pdf
• Linked on today’s web page
