Transcript Document
A Field Guide to GenBank
and NCBI’s Molecular Biology Resources
August 30, 2005
University of Colorado Health Sciences Center
NCBI FieldGuide
National Center for Biotechnology Information
About NCBI
GenBank overview
Primary vs derivative databases
The Reference Sequence (RefSeq) project
Entrez databases
Genome resources
Bookshelf
-break-
Entrez text searching
BLAST sequence searching
VAST structure searching
An integrated example
Topics
Bethesda, MD
The National Institutes of Health
Accepts submissions of primary data
Develops tools to analyze these data
Creates derivative databases based on the primary
data
Provides free search, link, and retrieval of these
data, primarily through the Entrez system
The National Center for
Biotechnology Information
[Figure: NCBI WWW Users per Day, 1997 to 2003. Y-axis: Number of Users, 0 to 450,000. Annotation: Christmas & New Year.]
Number of Users Per Day
all[filter]
Homepage - accessing the data
all[filter]
3/15/2005
8/15/2005
1/11/2005
                         # records
Primary Data
  GenBank / DDBJ / EMBL  57.3 million (97.4 %)
Derivative Data
  RefSeq                 1.47 million (2.5 %)
  RefSeq reviewed        60,000
  PDB (structures)       5,973
“Total”                  59 million
Entrez Nucleotide
NCBI’s Primary Sequence Database
Release 149, August 2005
Records: 47 x 10^6
Nucleotides: 52 x 10^9
195 Gigabytes, 816 files
Over 100 billion bases!
• full release every two months
• incremental and cumulative updates daily
• available only through internet
• release notes: gbrel.txt
ftp://ftp.ncbi.nih.gov/genbank/
ftp://genbank.sdsc.edu/pub
ftp://bio-mirror.net/biomirror/genbank
GenBank:
Nucleotide only sequence database
Archival in nature
GenBank Data
Direct submissions (traditional records)
Batch submissions (EST, GSS, STS)
ftp accounts (genome data)
Three collaborating databases
GenBank
DNA Database of Japan (DDBJ)
European Molecular Biology Laboratory (EMBL) Database
What is GenBank?
“Organismal”
PRI (28)   Primate
ROD (15)   Rodent
PLN (13)   Plant and Fungal
BCT (11)   Bacterial/Archaeal
INV (7)    Invertebrate
VRT (7)    Other Vertebrate
VRL (4)    Viral
MAM (2)    Mammalian
PHG (1)    Phage
SYN (1)    Synthetic
UNA (1)    Unannotated
• Organized by taxonomy (sort of)
• Direct submissions (Sequin/Bankit)
• Accurate (~1 error per 10,000 bp)
• Well characterized
“Functional”
EST (377)  Expressed Sequence Tag
GSS (138)  Genome Survey Sequence
HTG (63)   High Throughput Genomic
PAT (17)   Patent
STS (9)    Sequence Tagged Site
CON (1)    Contigs, virtual
• Organized by sequence type
• Batch submissions (ftp/email)
• Inaccurate
• Poorly characterized
GenBank Divisions
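The division codes on this slide can be kept in a small lookup table. The following is a minimal sketch, assuming only the codes and names shown on the slide; the helper name describe_division is hypothetical and is not part of any NCBI tool.

```python
# Division codes and names as listed on the slide.
ORGANISMAL = {
    "PRI": "Primate", "ROD": "Rodent", "PLN": "Plant and Fungal",
    "BCT": "Bacterial/Archaeal", "INV": "Invertebrate",
    "VRT": "Other Vertebrate", "VRL": "Viral", "MAM": "Mammalian",
    "PHG": "Phage", "SYN": "Synthetic", "UNA": "Unannotated",
}
FUNCTIONAL = {
    "EST": "Expressed Sequence Tag", "GSS": "Genome Survey Sequence",
    "HTG": "High Throughput Genomic", "PAT": "Patent",
    "STS": "Sequence Tagged Site", "CON": "Contigs, virtual",
}


def describe_division(code):
    """Return (group, name) for a division code (hypothetical helper)."""
    code = code.upper()
    if code in ORGANISMAL:
        return ("organismal", ORGANISMAL[code])
    if code in FUNCTIONAL:
        return ("functional", FUNCTIONAL[code])
    raise KeyError(f"unknown division code: {code}")


print(describe_division("HTG"))
print(describe_division("pri"))
```

Keeping the two groups in separate dictionaries mirrors the slide's distinction between taxonomy-based and sequence-type-based divisions.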
GenBank
EST  Expressed Sequence Tag: 1st pass single read cDNA
GSS  Genome Survey Sequence: 1st pass single read gDNA
HTG  High Throughput Genomic: incomplete sequences of genomic clones
STS  Sequence Tagged Site: PCR-based mapping reagents
Whole Genome Shotgun
GenBank Functional (Bulk) Divisions
nucleus: 30,000 genes
RNA: gene products
make cDNA library: 80-100,000 unique cDNA clones in library
- isolate unique clones
- sequence once from each end (5’ and 3’)
>IMAGE:275615 5' mRNA sequence
GACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTTTCTGG
TGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAGCAGAGAATGGAAAGTCAA
TTCCTGAATTGCTATGTGTCTGGGTTTCATCCATCCGACATTGAAGTTGACTTACTGAAGAATGGAGA
GAATTGAAAAAGTGGAGCATTCAGACTTGTCTTTCAGCAAGGACTGGTCTTTCTATCTCTTGTACTAC
TGAATTCACCCCCACTGAAAAAGATGAGTATGCCTGCCGTGTTGAACCATGTNGACTTTGTCACAGNC
AAGTTNAGTTTAAGTGGGNATCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCN
TTGGATTGGGATGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATATGCTTTTG
>IMAGE:275615 3', mRNA sequence
NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTTA
TTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTTCATTCATTATAACAAATTT
AATAATCCTGTCAATNATATTTCTAAATTTTCCCCCAAATTCTAAGCAGAGTATGTAAATTGGAAGTT
CTTATGCACGCTTAACTATCTTAACAAGCTTTGAGTGCAAGAGATTGANGAGTTCAAATCTGACCAAG
GTTGATGTTGGATAAGAGAATTCTCTGCTCCCCACCTCTANGTTGCCAGCCCTC
EST Division: Expressed Sequence Tags
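EST reads are single-pass sequences, so they often contain ambiguous base calls written as N. The sketch below shows one way to read FASTA-formatted text and report the fraction of N characters per read. The two records are short toy sequences invented for illustration, not real ESTs, and the function names are hypothetical.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def n_fraction(seq):
    """Fraction of positions called as N (ambiguous base)."""
    return seq.count("N") / len(seq) if seq else 0.0


# Toy records for illustration only.
TOY = """>toy_read_5prime
ACGTNNAC
GTNA
>toy_read_3prime
NNTCAAGT
"""

for name, seq in parse_fasta(TOY):
    print(name, len(seq), n_fraction(seq))
```

The same two functions could be pointed at the IMAGE:275615 reads shown on the slide to see how many positions were left uncalled.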
GSS, WGS, HTG
Whole BAC insert (or genome)
shred
sequence
GSS division
or trace archive assembly
isolate clones
whole genome shotgun assemblies
(traditional division)
Draft sequence (HTG division)
Honeybee Draft Sequences
LOCUS       AC141845   147720 bp    DNA     linear   HTG 19-MAR-2004
DEFINITION  Apis mellifera clone CH224-4A2, WORKING DRAFT SEQUENCE,
            14 unordered pieces.
ACCESSION   AC141845
VERSION     AC141845.1  GI:29124029
KEYWORDS    HTG; HTGS_PHASE1; HTGS_DRAFT.
• Unfinished sequences of BACs
• Gaps and unordered pieces
• Finished sequences (Phase 3) move to
traditional GenBank division
HTG Example:
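The LOCUS line of the honeybee record carries the accession, length, molecule type, topology, division and date. As a rough sketch, it can be split on whitespace as below. This is a simplification: it assumes the eight-field layout of this particular record, and the function name parse_locus is hypothetical.

```python
from datetime import date

# LOCUS line from the honeybee HTG record shown on the slide.
LOCUS_LINE = "LOCUS       AC141845   147720 bp    DNA     linear   HTG 19-MAR-2004"

MONTHS = {m: i + 1 for i, m in enumerate(
    ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
     "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"])}


def parse_locus(line):
    """Split a LOCUS line into named fields (assumes 8 whitespace-separated tokens)."""
    parts = line.split()
    if len(parts) != 8 or parts[0] != "LOCUS":
        raise ValueError("unexpected LOCUS layout")
    _, name, length, unit, molecule, topology, division, stamp = parts
    day, month, year = stamp.split("-")
    return {
        "name": name,
        "length": int(length),
        "unit": unit,
        "molecule": molecule,
        "topology": topology,
        "division": division,
        "date": date(int(year), MONTHS[month.upper()], int(day)),
    }


record = parse_locus(LOCUS_LINE)
print(record["name"], record["length"], record["division"], record["date"])
```

The division field is what distinguishes this draft record (HTG) from a finished sequence, which would appear in a traditional division.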
351 projects
Bacteria (251)
Environmental sequences (6)
Archaea (6)
Eukaryotes (88), including:
Chicken, Rat, Mouse, Dog (2), Chimpanzee, Human
Pufferfish (2)
Honeybee, Anopheles, Fruit Flies (3), Silkworm
Nematode (2)
Yeasts (8), Aspergillus (2)
Rice (2)
Whole Genome Shotgun Projects
wgs master[properties]
Whole Genome Shotgun (WGS) Projects
Sequencing Centers, Labs
GenBank: Updated ONLY by submitters
  PRI ROD PLN MAM BCT INV VRT PHG VRL
  EST STS HTG GSS
Updated by NCBI:
  UniGene
  UniSTS
  RefSeq: Entrez Gene and annotation pipelines
Derivative Databases
Entrez Nucleotide query:
human[organism] AND lipase[title]
Why Make Reference Sequences?
human[organism] AND lipase[title]
human[organism] AND lipase[title] AND endothelial[title]
4150 bp
2323 bp
3927 bp
261 bp
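The fielded queries above can also be sent programmatically. The sketch below only builds the request URLs; it assumes NCBI's E-utilities esearch endpoint and its db, term and retmax parameters, none of which appear on the slides. The helper names and_query and esearch_url are hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities esearch (not mentioned on the slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def and_query(*terms):
    """Join fielded Entrez terms with the AND operator."""
    return " AND ".join(terms)


def esearch_url(term, db="nucleotide", retmax=20):
    """Build an esearch URL for a query term; no request is sent."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


broad = and_query("human[organism]", "lipase[title]")
narrow = and_query(broad, "endothelial[title]")

print(esearch_url(broad))
print(esearch_url(narrow))
```

Adding the extra endothelial[title] term narrows the result set, which is the refinement the slide walks through by hand.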
3927 bp
genomes, transcripts, proteins
• non-redundant; best representative
• updates to reflect current sequence data and biology
• distinct, stable accession series
RefSeq Benefits
Accession         Sequence Type
NM_123456789      mRNA
NP_123456789      protein
NR_123456         non-coding RNA
XM_123456         predicted mRNA
XP_123456         predicted protein
XR_123456         predicted non-coding RNA
ZP_12345678       predicted from NZ_
NC_123456         genomic, e.g., chromosomes
NG_123455         genomic, incomplete region
NT_123456         genomic, BAC assembly
NW_123456         genomic, WGS assembly
NZ_ABCD12345678   genomic, WGS collection
blue=curated
Reference Sequence: RefSeq
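Because RefSeq uses a distinct accession series, the two-letter prefix alone indicates the sequence type. A minimal sketch of that lookup follows, using only the prefixes and descriptions listed on the slide; the function name refseq_type is hypothetical.

```python
# Prefix -> sequence type, as listed on the slide.
REFSEQ_PREFIXES = {
    "NM": "mRNA",
    "NP": "protein",
    "NR": "non-coding RNA",
    "XM": "predicted mRNA",
    "XP": "predicted protein",
    "XR": "predicted non-coding RNA",
    "ZP": "predicted from NZ_",
    "NC": "genomic, e.g., chromosomes",
    "NG": "genomic, incomplete region",
    "NT": "genomic, BAC assembly",
    "NW": "genomic, WGS assembly",
    "NZ": "genomic, WGS collection",
}


def refseq_type(accession):
    """Return the sequence type for a RefSeq-style accession."""
    prefix, sep, rest = accession.strip().partition("_")
    if not sep or not rest:
        raise ValueError(f"not a RefSeq-style accession: {accession!r}")
    try:
        return REFSEQ_PREFIXES[prefix.upper()]
    except KeyError:
        raise ValueError(f"unknown RefSeq prefix: {prefix!r}") from None


for acc in ("NM_123456789", "XP_123456", "NZ_ABCD12345678"):
    print(acc, "->", refseq_type(acc))
```

A GenBank accession such as AC141845 has no underscore, so the function rejects it rather than guessing.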
Annotation Process
Genomic DNA
(NC, NT, NW)
Scanning....
Model mRNA (XM)
Model protein (XP)
(XR)
Curated mRNA (NM)
(NR)
RefSeq
GenBank Sequences
Curated Protein (NP)
Genome annotation
NM’s must have
cDNA support
transcript variant 1
transcript variant 2
transcript variant 3
Longest mRNA
Creating NM_ Records
Where is RefSeq?
CancerChromosomes
Gene
UniGene
UniSTS
Homologene
SNP
Genome
PopSet
Nucleotide
GEO
Books
MeSH
PubMed
OMIM
Entrez
Protein
Taxonomy
GENSAT
PubChem
PMC
Journals
Domains
Structure
3D Domains
The Entrez System
UniGene
Clusters of ESTs, mRNAs
dbSNP
Single Nucleotide Polymorphisms
GEO
Gene Expression Omnibus
microarray and other expression data
CDD
Conserved Domain Database
protein families (COGs and KOGs)
single domains (PFAM, SMART, CD)
A Few Entrez Databases
Gene-oriented clusters of expressed sequences
• Automatic clustering using MegaBlast
• Each cluster represents a unique gene
• Informed by genome hits
• Information on tissue types and map locations
• Useful for gene discovery and selection of mapping
reagents
UniGene
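The idea of grouping expressed sequences into gene-oriented clusters can be illustrated with a toy single-linkage grouping: any two sequences joined by a similarity hit end up in the same cluster. This is only a sketch of the general idea, not NCBI's UniGene pipeline; the hit pairs below are invented, and in practice they would come from a similarity search such as MegaBlast.

```python
def cluster(hits):
    """Group sequence IDs into clusters from pairwise hits (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for seq_id in list(parent):
        groups.setdefault(find(seq_id), set()).add(seq_id)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical pairwise hits between ESTs and mRNAs.
HITS = [("est1", "est2"), ("est2", "mrna1"), ("est3", "est4")]

for members in cluster(HITS):
    print(members)
```

Here est1, est2 and mrna1 fall into one cluster because est2 links the other two, while est3 and est4 form a second cluster.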
A Cluster of ESTs
query
5’ EST hits
3’ EST hits
UniGene Collections
Example UniGene Cluster
Histogram of cluster sizes for UniGene Hs Build 177
(Now at Build #186)
UniGene Cluster Hs.95351
SELECTED PROTEIN SIMILARITIES
UniGene Cluster Hs.95351
GENE EXPRESSION
UniGene Cluster Hs.95351: expression
UniGene Cluster Hs.95351: seqs
web page
ftp://ftp.ncbi.nih.gov/repository/UniGene/Homo_sapiens/
Download sequences
Entrez GEO
Primary and derivative (RefSNP)
Single nucleotide polymorphisms
Repeat polymorphisms
Insertion-deletion polymorphisms
Over 19 million refSNPs (rsXXXXXXX)
(August, 2005)
NCBI’s SNP Database
Searching dbSNP
RefSNP
Search Mouse SNP between strains
RefSNP
MapView GeneView SeqView
No 3D
OMIM
RefSNP
Entrez GEO
GPL: Platform descriptions. Submitted by Manufacturer*
GSM: GEO SaMple: experimental conditions; raw/processed spot intensities from a single slide/chip. Submitted by Experimentalists
GSE: GEO SEries: set of related samples; grouping of slide/chip data (“a single experiment”). Submitted by Experimentalists
GDS: Grouping of experiments. Curated by NCBI
Entrez GEO Datasets
Supplied by submitter:
  Platform (GPL)  array definition
  Sample (GSM)    hyb. measurements
  Series (GSE)    related Samples
Assembled by GEO staff:
  DataSet (GDS)
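The four GEO record types can be told apart by their accession prefix. The sketch below assumes accessions consist of one of the four prefixes followed by digits; the example accession numbers are made up for illustration, and the function name geo_type is hypothetical.

```python
import re

# Prefix -> record type, as listed above.
GEO_TYPES = {
    "GPL": "Platform",
    "GSM": "Sample",
    "GSE": "Series",
    "GDS": "DataSet",
}

_GEO_PATTERN = re.compile(r"^(GPL|GSM|GSE|GDS)(\d+)$")


def geo_type(accession):
    """Return (record type, number) for a GEO-style accession."""
    match = _GEO_PATTERN.match(accession.strip().upper())
    if match is None:
        raise ValueError(f"not a GEO-style accession: {accession!r}")
    return GEO_TYPES[match.group(1)], int(match.group(2))


# Made-up accession numbers, for illustration only.
for acc in ("GPL96", "GSM1000", "GSE200", "GDS30"):
    print(acc, "->", geo_type(acc))
```

Only the DataSet type is assembled by GEO staff; the other three are supplied by submitters.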
• A collection of experimentally-related samples
processed using the same platform.
• Samples within DataSets are organized into
subgroups based on experimental variables.
• Form the basis of GEO’s query, analysis and data
display tools.
What’s a DataSet?
Gene Expression Omnibus (GEO)
Dataset browser
GEO Dataset Browser
GEO Dataset Report
… of 12625
GEO Profiles
Entrez CDD
Multiple sequence alignments
Position-specific scoring matrices (PSSM)
Sources: SMART, PFAM, COGs, KOGs, and NCBI curated domains
(structure-informed alignments)
Conserved Domain Database
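A position-specific scoring matrix (PSSM) summarizes a multiple sequence alignment column by column. The toy sketch below builds log-odds scores from residue counts with a pseudocount and a uniform background. These are simplifying assumptions for illustration and not the method CDD itself uses; the four-residue alignment and the reduced alphabet are invented.

```python
import math
from collections import Counter


def build_pssm(alignment, alphabet, pseudocount=1.0):
    """Return one {residue: log2-odds score} dict per alignment column."""
    background = 1.0 / len(alphabet)  # assumed uniform background
    total = len(alignment) + pseudocount * len(alphabet)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        pssm.append({
            res: math.log2(((counts[res] + pseudocount) / total) / background)
            for res in alphabet
        })
    return pssm


def score(pssm, seq):
    """Sum the column scores for a sequence of the same length as the PSSM."""
    return sum(col[res] for col, res in zip(pssm, seq))


# Toy ungapped alignment and reduced alphabet, for illustration only.
ALIGNMENT = ["GMTC", "GMHC", "GMTC"]
ALPHABET = "CGHMT"

pssm = build_pssm(ALIGNMENT, ALPHABET)
print(score(pssm, "GMTC"), score(pssm, "CCCC"))
```

A sequence that matches the conserved columns scores higher than one that does not, which is the basis for detecting a domain in a query protein.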
>gi|45549418|gb|AAS67634.1| ATP7A [Solenodon paradoxus]
IVYQPHLITVEEIKKQIKAVGFPAFIKKQPKYLKLGAIDIERLKNIPVKSSEGSQQMSPS
STNDSKVTLTIDGMHCNSCVSNIESALSTLHYVSSIVVSLQNKSAIIKYNANSVTPEIL
KKAIEAISPGQYRVSITSEVESTSNSPSSSSQKAPLNVVSQPLTQVTVININGMTCNS
CVQSIEGVMSKKAGVKSIQVSLANRNGTVEYDPLLTSPEILRE
CDD
Click on a colored bar to align your sequence to
the CD
CD
Pfam
COG
Conserved Domain Database: cd00371.1, HMA
CDD
CDART: Conserved Domain Architecture Retrieval Tool
Linking from Entrez Protein
cdd
Genomic Biology
Gene database
Homologene
Map Viewer
Trace Archive
Genome Resources
Genomic Biology
Gen Biol: Gen Resources
Genome Projects: microb
Gen Biol: Gen Resources
Genomic Biology
Gene database
Homologene
Map Viewer
Trace Archive
Genome Resources
A single query interface to …
• Sequences
- RefSeqs
- GenBank
- Homologene
• Maps – MapViewer
• Entrez links
• Linkouts
More organisms, ~ 3000
Entrez integration
Entrez Gene
Global Entrez: NADH2
Entrez Gene: NADH2
Gene Record for Pongo NADH2
Not found with “nadh2”
Homo sapiens
A Record With More Data: Human HFE
Transcripts with
experimental evidence
Human HFE: Transcripts
Gene Table
Introns/Exons: Gene Table
links to sequence
Human HFE: Links
Genotype
Human HFE: Links
GeneView in dbSNP
SNP in Structure
H41
S43
C260
Another Variation Source: OMIM
Variants in OMIM
Genomic Biology
Gene database
Homologene
Map Viewer
Trace Archive
Genome Resources
Automated detection of homologs among the annotated genes of
completely sequenced eukaryotic genomes.
No longer UniGene based
Protein similarities first
Guided by taxonomic tree
Includes orthologs and
paralogs
The New Homologene
Homologene Build 43.1 (8/23/05)
[Table: Species; Number of genes (input, grouped); groups]
The New Homologene
RAG1 → Homologene
RAG1
RAG1
RING-finger
RAG1 → Homologene
RAG1
RAG1
Sugar_tr
Homologene: alignment scores
BLASTP
bl2seq
LocusLink
Gene database
UniGene
Homologene
Map Viewer
Trace Archive
Genome Resources
List View
Human MapViewer
adar
MapViewer: Human ADAR
5’ UTR
MV Hs ADAR
3’ UTR
--Sequence maps--
Ab initio
Assembly
Repeats
BES_Clone
Clone
NCI_Clone
Contig
Component
CpG island
dbSNP haplotype
Fosmid
GenBank_DNA
Gene
Phenotype
SAGE_Tag
STS
TCAG_RNA
Transcript (RNA)
Hs_UniGene
Hs_EST
Mm_UniGene
Mm_EST
Rn_UniGene
Rn_EST
Ssc_UniGene
Ssc_EST
Bt_UniGene
Bt_EST
Gga_UniGene
Gga_EST
Variation
--Cytogenetic maps--
Ideogram
FISH Clone
Gene_Cytogenetic
Mitelman Breakpoint
Morbid/Disease
--Genetic Maps--
deCODE
Genethon
Marshfield
--RH maps--
= SNP
GeneMap99-G3
GeneMap99-GB4
NCBI RH
Stanford-G3
TNG
Whitehead-RH
Whitehead-YAC
Maps & Options
MapViewer
UniGene
Repeats
Gene
Component
Phenotype
Gene
Variation
Maps & Options
LocusLink
Gene database
UniGene
Homologene
Map Viewer
Trace Archive
Genome Resources
Trace Archive Page
Macaca Mulatta Traces
Access to sequences NOT in GenBank
Trace Archive BLAST Page
Literature Links
BOOKS Database
BOOKS Database: hyperlinked
BOOKS Database
Genes & Dis
For More Information…
Intermission