Transcript Document
NCBI FieldGuide
NCBI Molecular Biology Resources
A Field Guide

NCBI Resources
• About NCBI
• NCBI Sequence Databases
• Other NCBI Databases
• Entrez Databases and Text Searching
• Genomic Resources
• BLAST Services

The National Center for Biotechnology Information
Bethesda, MD
Created in 1988 as a part of the National Library of Medicine at NIH
– Establish public databases
– Research in computational biology
– Develop software tools for sequence analysis
– Disseminate biomedical information

Types of Databases
• Primary Databases
  – Original submissions by experimentalists
  – Content controlled by the submitter
  – Examples: GenBank, SNP, GEO
• Derivative Databases
  – Built from primary data
  – Content controlled by third party (NCBI)
  – Examples: RefSeq, TPA, RefSNP, UniGene, NCBI Protein, Structure, Conserved Domain

The Entrez System

Entrez Nucleotides
Primary
• GenBank / EMBL / DDBJ       35,116,960
Derivative
• RefSeq                         259,219
• Third Party Annotation           3,182
• PDB                              4,703
Total                         35,384,248

Entrez Protein
• GenPept (GB, EMBL, DDBJ)     3,178,346
• RefSeq                         933,905
• Third Party Annotation           4,338
• Swiss Prot                     146,978
• PIR                            282,821
• PRF                             12,079
Total                          4,314,705
BLAST nr                       2,724,717

What is GenBank?
NCBI’s Primary Sequence Database
• Nucleotide only sequence database
• Archival in nature
• GenBank Data
  – Direct submissions (traditional records)
  – Batch submissions (EST, GSS, STS)
  – ftp accounts (genome data)
• Three collaborating databases
  – GenBank
  – DNA Database of Japan (DDBJ)
  – European Molecular Biology Laboratory (EMBL) Database

International Sequence Database Collaboration
[Diagram: three collaborating databases, each receiving submissions and updates]
• GenBank (NCBI, NIH), searched through Entrez
• EMBL (EBI), searched through SRS
• DDBJ (CIB, NIG), searched through getentry

The Growth of GenBank
[Chart: sequence records (millions) and total base pairs (billions) by year, 1983 to 2004]
Release 140: 32.5 million records, 37.9 billion nucleotides
Average doubling time ≈ 12 months

Organization of GenBank: GenBank Divisions
Records are divided into 17 Divisions: 1 Patent (11 files), 5 Bulk, 11 Traditional.

Traditional Divisions:
• Direct Submissions (Sequin and BankIt)
• Accurate
• Well characterized

BULK Divisions:
• Batch Submission (Email and FTP)
• Inaccurate
• Poorly characterized

Bulk:
  EST (288)  Expressed Sequence Tag
  GSS (98)   Genome Survey Sequence
  HTG (61)   High Throughput Genomic
  STS (3)    Sequence Tagged Site
  HTC (3)    High Throughput cDNA

Traditional:
  PRI (27)   Primate
  PLN (10)   Plant and Fungal
  BCT (8)    Bacterial and Archaeal
  INV (6)    Invertebrate
  ROD (11)   Rodent
  VRL (3)    Viral
  VRT (4)    Other Vertebrate
  MAM (1)    Mammalian (ex. ROD and PRI)
  PHG (1)    Phage
  SYN (1)    Synthetic (cloning vectors)
  UNA (1)    Unannotated

Entrez query: gbdiv_xxx[Properties]

Traditional GenBank Divisions
• Direct Submissions (Sequin and BankIt)
• Accurate
• Well characterized

  BCT  Bacterial and Archaeal
  INV  Invertebrate
  MAM  Mammalian (ex. ROD and PRI)
  PHG  Phage
  PLN  Plant and Fungal
  PRI  Primate
  ROD  Rodent
  SYN  Synthetic (vectors, synth. genes)
  VRL  Viral
  VRT  Other Vertebrate

A Traditional GenBank Record

LOCUS       AF062069   3808 bp   mRNA   linear   INV   23-OCT-2002
DEFINITION  Limulus polyphemus myosin III mRNA, complete cds.
ACCESSION   AF062069
VERSION     AF062069.2  GI:7144484
KEYWORDS    .
SOURCE      Limulus polyphemus (Atlantic horseshoe crab)
  ORGANISM  Limulus polyphemus
            Eukaryota; Metazoa; Arthropoda; Chelicerata; Merostomata;
            Xiphosura; Limulidae; Limulus.
REFERENCE   1  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     A myosin III from Limulus eyes is a clock-regulated
            phosphoprotein
  JOURNAL   J. Neurosci. 18 (12), 4548-4559 (1998)
  MEDLINE   98279067
  PUBMED    9614231
REFERENCE   2  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (29-APR-1998) Whitney Laboratory, University of
            Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
REFERENCE   3  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (02-MAR-2000) Whitney Laboratory, University of
            Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
  REMARK    Sequence update by submitter
COMMENT     On Mar 2, 2000 this sequence version replaced gi:3132700.

GenBank Record: Locus
(Same record as above, with the LOCUS line highlighted.)

LOCUS       AF062069   3808 bp   mRNA   linear   INV   23-OCT-2002

  Locus name:         AF062069
  Length:             3808 bp
  Molecule type:      mRNA
  Division:           INV
  Modification Date:  23-OCT-2002

GenBank Record: Identifiers
(Same record as above, with the identifier lines highlighted.)

ACCESSION   AF062069
VERSION     AF062069.2  GI:7144484

GenBank Record: Organism
(Same record as above, with the source organism lines highlighted.)

SOURCE      Limulus polyphemus (Atlantic horseshoe crab)
  ORGANISM  Limulus polyphemus
            Eukaryota; Metazoa; Arthropoda; Chelicerata; Merostomata;
            Xiphosura; Limulidae; Limulus.
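The labeled parts of the LOCUS line and the identifiers on the VERSION line can be pulled apart with a few lines of Python. This is only a minimal sketch: it assumes whitespace-separated fields in exactly the order shown in the example record, and the function names parse_locus and parse_version are hypothetical helpers, not part of any NCBI tool.

```python
def parse_locus(line):
    """Split a LOCUS line into the fields labeled on the Locus slide.

    Assumed layout (taken from the example record):
    LOCUS <name> <length> bp <molecule> <topology> <division> <date>
    """
    parts = line.split()
    if len(parts) != 8 or parts[0] != "LOCUS" or parts[3] != "bp":
        raise ValueError("unexpected LOCUS layout: " + line)
    return {
        "locus_name": parts[1],
        "length": int(parts[2]),
        "molecule_type": parts[4],
        "topology": parts[5],
        "division": parts[6],
        "modification_date": parts[7],
    }


def parse_version(line):
    """Split a VERSION line into accession, version number, and GI."""
    parts = line.split()
    if len(parts) != 3 or parts[0] != "VERSION" or not parts[2].startswith("GI:"):
        raise ValueError("unexpected VERSION layout: " + line)
    # "AF062069.2" -> accession "AF062069", version 2
    accession, version = parts[1].rsplit(".", 1)
    return {
        "accession": accession,
        "version": int(version),
        "gi": int(parts[2][len("GI:"):]),
    }


locus = parse_locus("LOCUS AF062069 3808 bp mRNA linear INV 23-OCT-2002")
version = parse_version("VERSION AF062069.2 GI:7144484")
print(locus["division"], locus["length"], version["accession"], version["version"])
# prints: INV 3808 AF062069 2
```

Records whose LOCUS or VERSION lines deviate from this layout are rejected with a ValueError rather than guessed at, so the sketch should not be read as a general GenBank parser.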
Callout (Organism slide): NCBI’s Taxonomy

GenBank Record: Feature Table

FEATURES             Location/Qualifiers
     source          1..3808
                     /organism="Limulus polyphemus"
                     /db_xref="taxon:6850"
                     /tissue_type="lateral eye"
     CDS             258..3302
                     /note="N-terminal protein kinase domain; C-terminal
                     myosin heavy chain head; substrate for PKA"
                     /codon_start=1
                     /product="myosin III"
                     /protein_id="AAC16332.2"
                     /db_xref="GI:7144485"
                     /translation="MEYKCISEHLPFETLPDPGDRFEVQELVGTGTYATVYSAIDKQA
                     NKKVALKIIGHIAENLLDIETEYRIYKAVNGIQFFPEFRGAFFKRGERESDNEVWLGI
                     EFLEEGTAADLLATHRRFGIHLKEDLIALIIKEVVRAVQYLHENSIIHRDIRAANIMF
                     SKEGYVKLIDFGLSASVKNTNGKAQSSVGSPYWMAPEVISCDCLQEPYNYTCDVWSIG
                     ITAIELADTVPSLSDIHALRAMFRINRNPPPSVKRETRWSETLKDFISECLVKNPEYR
                     PCIQEIPQHPFLAQVEGKEDQLRSELVDILKKNPGEKLRNKPYNVTFKNGHLKTISGQ
BASE COUNT     1201 a    689 c    782 g   1136 t
ORIGIN
        1 tcgacatctg tggtcgcttt ttttagtaat aaaaaattgt attatgacgt cctatctgtt
     3781 aagatacagt aactagggaa aaaaaaaa
//

Callout: GenPept IDs
                     /protein_id="AAC16332.2"
                     /db_xref="GI:7144485"

What is UniGene?
A gene-oriented view of sequence entries
• MegaBlast based automated sequence clustering
• Now informed by genome hits (New!)
• Nonredundant set of gene oriented clusters
• Each cluster a unique gene
• Information on tissue types and map locations
• Includes well-characterized genes and novel ESTs
• Useful for gene discovery and selection of mapping reagents

UniGene

Human UniGene
UniGene Build 168, Feb. 24, 2004

      132,990  mRNAs
        6,327  models
        7,235  HTC
    1,408,949  EST, 3' reads
    2,082,199  EST, 5' reads
  +   774,927  EST, other/unknown
  -----------
    4,412,627  total sequences in clusters

Final Number of Clusters (sets)
===============================
      105,651  total
       27,511  contain at least one mRNA
        5,613  contain at least one HTC
      104,397  contain at least one EST
       26,291  contain both mRNAs and ESTs

Callouts: 3,000,000,000 bp; 30 K expected genes; transcripts; 75% uncharacterized

Genome Sequencing - HTG, GSS, (WGS)
[Diagram: whole BAC insert (or genome); steps labeled shredding, cloning, isolating, sequencing, assembly]
• GSS division or trace archive
• whole genome shotgun assemblies (traditional division)
• Draft Sequence (HTG division)

Other Genome Sequencing Products
• Trace Archive
• Whole Genome Shotgun

Trace Archive
• Primary reads from WGS and EST projects
• Many not available in GenBank
• Earliest access to genome data

Derivative Sequence Databases

NCBI Derivative Sequence Data
[Diagram labels: Labs, GenBank, Curators, RefSeq, TPA, Algorithms, UniGene, Genome Assembly]

RefSeq: NCBI’s Derivative Sequence Database
• Curated transcripts and proteins
  – reviewed
  – human, mouse, rat, fruit fly, zebrafish, arabidopsis
• Model transcripts and proteins
• Assembled Genomic Regions (contigs)
  – human genome
  – mouse genome
• Chromosome records
  – Human genome
  – microbial
  – organelle
srcdb_refseq[Properties]
ftp://ftp.ncbi.nih.gov/refseq/release/

RefSeq Benefits
• non-redundancy
• explicitly linked nucleotide and protein sequences
• updates to reflect current sequence data and biology
• data validation
• format consistency
• distinct accession series
• stewardship by NCBI staff and collaborators

RefSeq Accession Numbers
mRNAs and Proteins
  NM_123456  Curated mRNA
  NP_123456  Curated Protein
  NR_123456  Curated non-coding RNA
  XM_123456  Predicted mRNA
  XP_123456  Predicted Protein
  XR_123456  Predicted non-coding RNA
Gene Records
  NG_123456  Reference Genomic Sequence
Chromosome
  NC_123455  Microbial replicons, organelle
Assemblies
  NT_123456  Contig
  NW_123456  WGS Supercontig

Third Party Annotation (TPA) Database
• Annotations of existing GenBank sequences
• Allows for community annotation of genomes
• Direct submissions
  – BankIt
  – Sequin
tpa[Properties]

Other NCBI Databases
• dbSNP: nucleotide polymorphism
• GEO: Gene Expression Omnibus
  – microarray and other expression data
• Gene: gene records
  – Unifies LocusLink and Microbial Genomes
• Structure: imported structures (PDB)
  – Cn3D viewer, NCBI curation
• CDD: conserved domain database
  – Protein families (COGs and KOGs)
  – Single domains (PFAM, SMART, CD)

NCBI Structures and Domains

MMDB: Molecular Modeling Data Base
• Derived from experimentally determined PDB records
• Value added to PDB records including:
  – Addition of explicit chemical graph information
  – Validation
  – Inclusion of Taxonomy, Citation,
  – Conversion to ASN.1 data description language
• Structure neighbors determined by Vector Alignment Search Tool (VAST)

Structure Summary
Callouts: Cn3D viewer; Structure Neighbors; Conserved Domains; 3D Domain Neighbors

NCBI’s Conserved Domain Database
• Multiple sequence alignments
• PSI-BLAST-based score matrices
• Sources: SMART, PFAM, COGs, KOGs
• New NCBI curated domains
  – structure informed alignments
• Stats:
  – COGS      4,873
  – KOGS      4,852
  – Pfam      5,193
  – Smart       653
  – NCBI CDD    316

WWW Access
Entrez & BLAST

NCBI Web Traffic
[Chart: NCBI web traffic by year, 1997 to 2001; vertical axis 0 to 250,000; annotation: Christmas Day]

Using Entrez
An integrated database search and retrieval system

Entrez: Database Integration
[Diagram labels: PubMed abstracts (word weight); 3-D Structure (VAST); Nucleotide sequences (BLAST); Protein sequences (BLAST); Taxonomy (phylogeny); Genomes]

Database Searching with Entrez
• Using limits and field restriction to find human MutL homolog
• Linking and neighboring with MutL
• Mapping SNPs onto structure and the genome

Global Entrez Search
MutL[All Fields]

Document Summaries: Limits & Preview/Index

Entrez Nucleotides: MutL

Entrez Nucleotides: Limits
Field Restriction menu: Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter, Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title, Uid, Volume
Exclude bulk sequences

Entrez Nucleotides: Limits
• MutL
• Title == Definition
• Exclude Bulk Sequences

Document Summaries: Limits
(Same field menu as on the Limits slide.)

Adding Terms: Preview/Index

Human MutL Search Results
Callout: GenBank Records

Human MutL RefSeq

NM_000249: Links

Literature Links
• PubMed
• OMIM
• Books

NM_000249: PubMed Books Link

Conserved Domain

OMIM: Human Disease Genes

Sequence Links
• Nucleotide
• Protein
• Genome Project

NM_000249: Related Sequences
Callouts: BAC similarity; Original GenBank mRNAs; Original GenBank genomic

Taxonomy Link
Callouts: The Tax Browser; NCBI’s Taxonomy

NCBI Protein Databases
• GenPept: GenBank, EMBL, DDBJ CDS translations
• RefSeq: mRNA based (NP_) and genome based (XP_)
• Swiss-Prot: curated high quality protein reviews
• PIR: protein information resource, Georgetown University
• PRF: protein resource foundation
• PDB: Protein Databank sequences from structures

Protein Link
Callouts: BLAST Link; Conserved Domains

Related Proteins: Redundancy
Callout: Redundant Sequences

Related Proteins: Links
Callouts: Sequence from MutL structure; Arabidopsis homolog; Conserved Domain

BLink: non-redundant relatives

NM_000249: Genome Links
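
The field-restricted searches in the MutL walkthrough are typed into the Entrez search box as term[Field] strings, such as MutL[All Fields] or gbdiv_xxx[Properties]. The Python sketch below only composes such strings; it does not contact NCBI. The helper names field_term and build_query are hypothetical, the choice of MutL[Title] with human[Organism] is an illustrative assumption rather than the exact query from the slides, and the code assumes that Entrez accepts AND and NOT between terms and evaluates them left to right.

```python
def field_term(term, field):
    """Format one Entrez term restricted to a search field, e.g. MutL[Title]."""
    return f"{term}[{field}]"


def build_query(required, excluded=()):
    """Join required terms with AND, then append each excluded term with NOT."""
    if not required:
        raise ValueError("at least one required term is needed")
    query = " AND ".join(required)
    for term in excluded:
        query += " NOT " + term
    return query


# Bulk division codes from the GenBank Divisions slide, lowercased to
# match the gbdiv_xxx[Properties] pattern shown there.
BULK_DIVISIONS = ["est", "gss", "htg", "sts", "htc"]

required = [field_term("MutL", "Title"), field_term("human", "Organism")]
excluded = [field_term("gbdiv_" + code, "Properties") for code in BULK_DIVISIONS]

# Human MutL records with the bulk divisions excluded.
no_bulk_query = build_query(required, excluded)
# Human MutL records restricted to RefSeq.
refseq_query = build_query(required + [field_term("srcdb_refseq", "Properties")])

print(no_bulk_query)
print(refseq_query)
```

Keeping the terms in lists makes it easy to swap the organism or add tpa[Properties] without retyping the whole query.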