A Field Guide to GenBank and NCBI’s Molecular Biology Resources
August 30, 2005
University of Colorado Health Sciences Center
NCBI FieldGuide
National Center for Biotechnology Information

Topics
About NCBI
GenBank overview
Primary vs derivative databases
The Reference Sequence (RefSeq) project
Entrez databases
Genome resources
Bookshelf
-break-
Entrez text searching
BLAST sequence searching
VAST structure searching
An integrated example

The National Institutes of Health
Bethesda, MD

The National Center for Biotechnology Information
• Accepts submissions of primary data
• Develops tools to analyze these data
• Creates derivative databases based on the primary data
• Provides free search, link, and retrieval of these data, primarily through the Entrez system

Number of Users Per Day
[Chart: NCBI WWW Users per Day, 1997 to 2003; y-axis: Number of Users, 0 to 450,000; annotation: Christmas & New Year]

Homepage - accessing the data
all[filter]
3/15/2005, 8/15/2005

Entrez Nucleotide
1/11/2005
                                    # records
Primary Data
  GenBank (GenBank / DDBJ / EMBL)   57.3 million (97.4 %)
Derivative Data
  RefSeq                            1.47 million (2.5 %)
  RefSeq reviewed                   60,000
  PDB (structures)                  5,973
“Total”                             59 million

GenBank: NCBI’s Primary Sequence Database
Release 149, August 2005
Records: 47 x 10^6
Nucleotides: 52 x 10^9
195 Gigabytes
Over 100 billion bases!
816 files
• full release every two months
• incremental and cumulative updates daily
• available only through internet
• release notes: gbrel.txt
ftp://ftp.ncbi.nih.gov/genbank/
ftp://genbank.sdsc.edu/pub
ftp://bio-mirror.net/biomirror/genbank

What is GenBank?
• Nucleotide only sequence database
• Archival in nature
• GenBank Data
  - Direct submissions (traditional records)
  - Batch submissions (EST, GSS, STS)
  - ftp accounts (genome data)
• Three collaborating databases
  - GenBank
  - DNA Database of Japan (DDBJ)
  - European Molecular Biology Laboratory (EMBL) Database

GenBank Divisions
“Organismal”
PRI  (28)   Primate
ROD  (15)   Rodent
PLN  (13)   Plant and Fungal
BCT  (11)   Bacterial/Archaeal
INV  (7)    Invertebrate
VRT  (7)    Other Vertebrate
VRL  (4)    Viral
MAM  (2)    Mammalian
PHG  (1)    Phage
SYN  (1)    Synthetic
UNA  (1)    Unannotated
• Organized by taxonomy (sort of)
• Direct submissions (Sequin/Bankit)
• Accurate (~1 error per 10,000 bp)
• Well characterized
“Functional”
EST  (377)  Expressed Sequence Tag
GSS  (138)  Genome Survey Sequence
HTG  (63)   High Throughput Genomic
PAT  (17)   Patent
STS  (9)    Sequence Tagged Site
CON  (1)    Contigs, virtual
• Organized by sequence type
• Batch submissions (ftp/email)
• Inaccurate
• Poorly characterized

GenBank Functional (Bulk) Divisions
GenBank
EST  Expressed Sequence Tag: 1st pass single read cDNA
GSS  Genome Survey Sequence: 1st pass single read gDNA
HTG  High Throughput Genomic: incomplete sequences of genomic clones
STS  Sequence Tagged Site: PCR-based mapping reagents
Whole Genome Shotgun

EST Division: Expressed Sequence Tags
[Diagram labels: nucleus; 30,000 genes; RNA gene products; make cDNA library; 80-100,000 unique cDNA clones in library; 5’; 3’]
- isolate unique clones
- sequence once from each end
>IMAGE:275615 5' mRNA sequence
GACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTTTCTGG
TGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAGCAGAGAATGGAAAGTCAA
TTCCTGAATTGCTATGTGTCTGGGTTTCATCCATCCGACATTGAAGTTGACTTACTGAAGAATGGAGA
GAATTGAAAAAGTGGAGCATTCAGACTTGTCTTTCAGCAAGGACTGGTCTTTCTATCTCTTGTACTAC
TGAATTCACCCCCACTGAAAAAGATGAGTATGCCTGCCGTGTTGAACCATGTNGACTTTGTCACAGNC
AAGTTNAGTTTAAGTGGGNATCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCN
TTGGATTGGGATGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATATGCTTTTG
>IMAGE:275615 3', mRNA sequence
NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTTA
TTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTTCATTCATTATAACAAATTT
AATAATCCTGTCAATNATATTTCTAAATTTTCCCCCAAATTCTAAGCAGAGTATGTAAATTGGAAGTT
CTTATGCACGCTTAACTATCTTAACAAGCTTTGAGTGCAAGAGATTGANGAGTTCAAATCTGACCAAG
GTTGATGTTGGATAAGAGAATTCTCTGCTCCCCACCTCTANGTTGCCAGCCCTC

GSS, WGS, HTG
[Diagram labels: Whole BAC insert (or genome); shred; sequence; GSS division or trace archive; assembly; isolate clones; whole genome shotgun assemblies (traditional division); Draft sequence (HTG division)]

HTG Example: Honeybee Draft Sequences
LOCUS       AC141845    147720 bp    DNA    linear    HTG 19-MAR-2004
DEFINITION  Apis mellifera clone CH224-4A2, WORKING DRAFT SEQUENCE, 14 unordered pieces.
ACCESSION   AC141845
VERSION     AC141845.1 GI:29124029
KEYWORDS    HTG; HTGS_PHASE1; HTGS_DRAFT.
• Unfinished sequences of BACs
• Gaps and unordered pieces
• Finished sequences (Phase 3) move to traditional GenBank division

Whole Genome Shotgun Projects
351 projects
Bacteria (251)
Environmental sequences (6)
Archaea (6)
Eukaryotes (88), including:
  Chicken, Rat, Mouse, Dog (2), Chimpanzee, Human
  Pufferfish (2)
  Honeybee, Anopheles, Fruit Flies (3), Silkworm
  Nematode (2)
  Yeasts (8), Aspergillus (2)
  Rice (2)

Whole Genome Shotgun (WGS) Projects
wgs master[properties]

Derivative Databases
[Diagram labels: Sequencing Centers; Labs; GenBank, Updated ONLY by submitters (INV VRT PHG VRL EST STS HTG GSS PRI ROD PLN MAM BCT); UniGene; UniSTS; RefSeq, Updated by NCBI]
RefSeq: Entrez Gene and annotation pipelines

Why Make Reference Sequences?
Entrez Nucleotide query: human[organism] AND lipase[title]

Why Make Reference Sequences?
Entrez Nucleotide query: human[organism] AND lipase[title] AND endothelial[title]
4150 bp, 2323 bp, 3927 bp, 261 bp
3927 bp

RefSeq Benefits
• genomes, transcripts, proteins
• non-redundant; best representative
• updates to reflect current sequence data and biology
• distinct, stable accession series

Reference Sequence: RefSeq
Accession          Sequence Type
NM_123456789       mRNA
NP_123456789       protein, from NM_
NR_123456          non-coding RNA
XM_123456          predicted mRNA
XP_123456          predicted protein
XR_123456          predicted non-coding RNA
ZP_12345678        predicted from NZ_
NC_123456          genomic, e.g., chromosomes
NG_123455          genomic, incomplete region
NT_123456          genomic, BAC assembly
NW_123456          genomic, WGS assembly
NZ_ABCD12345678    genomic, WGS collection
blue=curated

Annotation Process
Genomic DNA (NC, NT, NW)
Scanning....
Model mRNA (XM)
Model protein (XP)
(XR)
Curated mRNA (NM)
(NR)
RefSeq
GenBank Sequences
Curated Protein (NP)
Genome annotation

Creating NM_ Records
NM’s must have cDNA support
transcript variant 1
transcript variant 2
transcript variant 3
Longest mRNA

Where is RefSeq?

The Entrez System
[Diagram labels: Entrez at the center; CancerChromosomes, Gene, UniGene, UniSTS, Homologene, SNP, Genome, PopSet, Nucleotide, GEO, Books, MeSH, PubMed, OMIM, Protein, Taxonomy, GENSAT, PubChem, PMC, Journals, Domains, Structure, 3D Domains]

A Few Entrez Databases
UniGene  Clusters of ESTs, mRNAs
dbSNP    Single Nucleotide Polymorphisms
GEO      Gene Expression Omnibus: microarray and other expression data
CDD      Conserved Domain Database: protein families (COGs and KOGs), single domains (PFAM, SMART, CD)

UniGene
Gene-oriented clusters of expressed sequences
• Automatic clustering using MegaBlast
• Each cluster represents a unique gene
• Informed by genome hits
• Information on tissue types and map locations
• Useful for gene discovery and selection of mapping reagents

A Cluster of ESTs
[Diagram labels: query; 5’ EST hits; 3’ EST hits]

UniGene Collections

Example UniGene Cluster

Histogram of cluster sizes for UniGene Hs Build 177
(Now at Build #186)

UniGene Cluster Hs.95351
SELECTED PROTEIN SIMILARITIES

UniGene Cluster Hs.95351
GENE EXPRESSION

UniGene Cluster Hs.95351: expression

UniGene Cluster Hs.95351: seqs

Download sequences
web page
ftp://ftp.ncbi.nih.gov/repository/UniGene/Homo_sapiens/

Entrez GEO

NCBI’s SNP Database
• Primary and derivative (RefSNP)
• Single nucleotide polymorphisms
• Repeat polymorphisms
• Insertion-deletion polymorphisms
Over 19 million refSNPs (rsXXXXXXX) (August, 2005)

Searching dbSNP

RefSNP
Search Mouse SNP between strains

RefSNP
MapView, GeneView, SeqView, No 3D, OMIM

RefSNP

Entrez GEO
GPL  Platform descriptions (Submitted by Manufacturer*)
GSM  Raw/processed spot intensities from a single slide/chip (Submitted by Experimentalists)
GSE  Grouping of slide/chip data, “a single experiment” (Submitted by Experimentalists)
GDS  Grouping of experiments (Curated by NCBI)
Entrez GEO: GEO SaMple (experimental conditions), GEO SEries (set of related samples)
Entrez GEO Datasets: GDS

What’s a DataSet?
Supplied by submitter:
  Platform (GPL): array definition
  Sample (GSM): hyb. measurements
  Series (GSE): related Samples
Assembled by GEO staff:
  DataSet (GDS)
• A collection of experimentally-related samples processed using the same platform.
• Samples within DataSets are organized into subgroups based on experimental variables.
• Form the basis of GEO’s query, analysis and data display tools.

Gene Expression Omnibus (GEO)
Dataset browser

GEO Dataset Browser

GEO Dataset Report
… of 12625

GEO Profiles

Entrez CDD

Conserved Domain Database
• Multiple sequence alignments
• Position-specific scoring matrices (PSSM)
• Sources: SMART, PFAM, COGs, KOGs, and NCBI curated domains (structure-informed alignments)

CDD
>gi|45549418|gb|AAS67634.1| ATP7A [Solenodon paradoxus]
IVYQPHLITVEEIKKQIKAVGFPAFIKKQPKYLKLGAIDIERLKNIPVKSSEGSQQMSPS
STNDSKVTLTIDGMHCNSCVSNIESALSTLHYVSSIVVSLQNKSAIIKYNANSVTPEIL
KKAIEAISPGQYRVSITSEVESTSNSPSSSSQKAPLNVVSQPLTQVTVININGMTCNS
CVQSIEGVMSKKAGVKSIQVSLANRNGTVEYDP LLTSPEILRE

CDD
Click on a colored bar to align your sequence to the CD
CD, Pfam, COG

Conserved Domain Database: cd00371.1, HMA

CDD
CDART: Conserved Domain Architecture Retrieval Tool

Linking from Entrez Protein
cdd

Genome Resources
Genomic Biology
Gene database
Homologene
Map Viewer
Trace Archive

Genomic Biology

Gen Biol: Gen Resources

Genome Projects: microb

Gen Biol: Gen Resources

Genome Resources
Genomic Biology
Gene database
Homologene
Map Viewer
Trace Archive

Entrez Gene
A single query interface to …
• Sequences
  - RefSeqs
  - GenBank
  - Homologene
• Maps
  - MapViewer
• Entrez links
• Linkouts
More organisms, ~ 3000
Entrez integration

Global Entrez: NADH2

Entrez Gene: NADH2

Gene Record for Pongo NADH2
Not found with “nadh2”
Homo sapiens

A Record With More Data: Human HFE

Human HFE: Transcripts
Transcripts with experimental evidence

Introns/Exons: Gene Table
Gene Table
links to sequence

Human HFE: Links
Genotype

Human HFE: Links

GeneView in dbSNP

SNP in Structure
H41, S43, C260

Another Variation Source: OMIM

Variants in OMIM

Genome Resources
Genomic Biology
Gene database
Homologene
Map Viewer
Trace Archive

The New Homologene
Automated detection of homologs among the annotated genes of completely sequenced eukaryotic genomes.
• No longer UniGene based
• Protein similarities first
• Guided by taxonomic tree
• Includes orthologs and paralogs

The New Homologene
Homologene Build 43.1 (8/23/05)
[Table columns: Species; Number of genes (input, grouped); groups]

RAG1 → Homologene

RAG1 → Homologene
RAG1
RING-finger

RAG1 → Homologene
RAG1
Sugar_tr

Homologene: alignment scores
BLASTP bl2seq

Genome Resources
LocusLink
Gene database
UniGene
Homologene
Map Viewer
Trace Archive

Human MapViewer
List View

MapViewer: Human ADAR
adar

MV Hs ADAR
5’ UTR
3’ UTR

Maps & Options
--Sequence maps--
Ab initio, Assembly, Repeats, BES_Clone, Clone, NCI_Clone, Contig, Component, CpG island, dbSNP haplotype, Fosmid, GenBank_DNA, Gene, Phenotype, SAGE_Tag, STS, TCAG_RNA, Transcript (RNA), Hs_UniGene, Hs_EST, Mm_UniGene, Mm_EST, Rn_UniGene, Rn_EST, Ssc_UniGene, Ssc_EST, Bt_UniGene, Bt_EST, Gga_UniGene, Gga_EST, Variation
--Cytogenetic maps--
Ideogram, FISH Clone, Gene_Cytogenetic, Mitelman Breakpoint, Morbid/Disease
--Genetic Maps--
deCODE, Genethon, Marshfield
--RH maps--
GeneMap99-G3, GeneMap99-GB4, NCBI RH, Stanford-G3, TNG, Whitehead-RH, Whitehead-YAC
= SNP

Maps & Options
MapViewer
UniGene, Repeats, Gene
Component, Phenotype
Gene, Variation

Genome Resources
LocusLink
Gene database
UniGene
Homologene
Map Viewer
Trace Archive

Trace Archive Page

Macaca Mulatta Traces

Trace Archive
Access to sequences NOT in GenBank

Trace Archive BLAST Page

Literature Links

BOOKS Database

BOOKS Database: hyperlinked

BOOKS Database

Genes & Dis

For More Information…

Intermission
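
The field-tagged Entrez Nucleotide queries shown on the slides, such as human[organism] AND lipase[title], can also be composed programmatically. The sketch below is a minimal illustration and is not part of the original slides: it assumes NCBI's E-utilities ESearch endpoint with its db, term, and retmax parameters, and the helper names (field_term, and_query, esearch_url) are hypothetical. It only builds the request URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Assumption: the public E-utilities ESearch endpoint (not mentioned in the slides).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def field_term(value, field):
    """Return an Entrez field-tagged term, e.g. human[organism]."""
    return f"{value}[{field}]"


def and_query(*terms):
    """Join terms with the Entrez boolean operator AND."""
    return " AND ".join(terms)


def esearch_url(db, query, retmax=20):
    """Build (but do not send) an ESearch request URL for the given database."""
    params = {"db": db, "term": query, "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


# The query from the "Why Make Reference Sequences?" slide.
query = and_query(field_term("human", "organism"), field_term("lipase", "title"))
url = esearch_url("nucleotide", query)

# The refined query from the same slide adds a third field-tagged term.
narrowed = and_query(query, field_term("endothelial", "title"))

print(query)
print(url)
print(narrowed)
```

Adding endothelial[title] as a third term narrows the search in the same way as the refined query on the slide; other slide queries such as all[filter] or wgs master[properties] can be passed to esearch_url unchanged.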