A Field Guide to GenBank and NCBI Resources


NCBI Molecular Biology Resources: Entrez
Mar. 2005
NCBI
王禄山
NCBI Resources

About NCBI

NCBI Sequence Databases
• Primary Database – GenBank
• Derivative Databases – RefSeq
Entrez Databases and Text Searching

BLAST

The National Institutes of Health
Bethesda, MD
The National Center for Biotechnology Information

Accepts submissions of primary data

Develops tools to analyze these data

Creates derivative databases based on the primary data

Provides free search, link, and retrieval of these data, primarily through the Entrez system
The National Center for Biotechnology Information (NCBI)

Created as a part of the National Library of Medicine in 1988
• Establish public databases
• Research in computational biology
• Develop software tools for sequence analysis
• Disseminate biomedical information
Tools: Entrez (1992), BLAST (1990), GenBank (1992), Free MEDLINE (PubMed, 1997)

Other databases: dbEST, dbGSS, dbSTS, MMDB, OMIM, UniGene, GeneMap, Taxonomy, CGAP, SAGE, LocusLink, RefSeq

NCBI WWW Users per Day
[Figure: Number of Users and Hits Per Day, 1997 to 2003. Y-axis: number of users, 0 to 450,000. Annotated dips: Christmas & New Year.]
Homepage - accessing the data
Entrez Nucleotide (query: all[filter], 1/11/2005)

Primary Data
GenBank / DDBJ / EMBL           46,974,918 (98.86 %)

Derivative Data
RefSeq                          533,236 (1.12 %)
PDB (structures)                5,484
Third Party Annotation (TPA)    4,516

"Total"                         47,518,338
GenBank: NCBI's Primary Sequence Database
Release 145, Dec 2004
• Records: 40.6 x 10^6
• Nucleotides: 44.5 x 10^9
• 153 Gigabytes, 705 files
• full release every two months
• incremental and cumulative updates daily
• available only through internet
• release notes: gbrel.txt
ftp://ftp.ncbi.nih.gov/genbank/
ftp://genbank.sdsc.edu/pub
ftp://bio-mirror.net/biomirror/genbank
Molecular Databases

Primary Databases
• Original submissions by experimentalists
• Database staff organize but don't add additional information
• Example: GenBank

Derivative Databases
• Human curated
  • compilation and correction of data
  • Example: SWISS-PROT, NCBI RefSeq mRNA
• Computationally Derived
  • Example: UniGene
• Combinations
  • Example: NCBI Genome Assembly
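Primary vs. Derivative Databases
[Figure: data flow from submitters into GenBank and on to the derivative databases.
 Primary: GenBank (divisions EST, STS, GSS, HTG, PRI, ROD, MAM, VRT, INV, PLN, BCT, VRL, PHG), fed by sequencing centers and labs. Updated ONLY by submitters.
 Derivative: UniGene, UniSTS, and RefSeq (annotation pipeline; curators; Entrez Gene and Genomes pipelines). Updated by NCBI.]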
The GenBank Record
GenBank Records: The Flatfile Format
• Header
• Feature Table
• Sequence
A Typical GenBank Record
LOCUS       NM_019570 4279 bp mRNA linear INV 28-OCT-2004
DEFINITION  Mus musculus REV1-like (S. cerevisiae) (Rev1l), mRNA
ACCESSION   NM_019570
VERSION     NM_019570.3 GI:50811869
KEYWORDS    .
(DEFINITION = Title)
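The header fields of the record above can be pulled apart with a few lines of Python. This is only a minimal sketch: the function names `parse_header` and `parse_version` are illustrative, not part of any NCBI toolkit, and the parser assumes the simple layout of one keyword followed by spaces and a value, with indented continuation lines.

```python
# Header lines copied from the slide's example record.
RECORD = """\
LOCUS       NM_019570 4279 bp mRNA linear INV 28-OCT-2004
DEFINITION  Mus musculus REV1-like (S. cerevisiae) (Rev1l), mRNA
ACCESSION   NM_019570
VERSION     NM_019570.3 GI:50811869
KEYWORDS    .
"""


def parse_header(text):
    """Map each flatfile keyword (LOCUS, DEFINITION, ...) to its value."""
    fields = {}
    last_key = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0].isspace() and last_key is not None:
            # Indented line: continuation of the previous field.
            fields[last_key] += " " + line.strip()
            continue
        key, _, value = line.partition(" ")
        fields[key] = value.strip()
        last_key = key
    return fields


def parse_version(value):
    """Split 'NM_019570.3 GI:50811869' into (accession, version, gi)."""
    acc_ver, gi_part = value.split()
    accession, version = acc_ver.split(".")
    return accession, int(version), int(gi_part.split(":")[1])


header = parse_header(RECORD)
accession, version, gi = parse_version(header["VERSION"])
print(header["DEFINITION"])
print(accession, version, gi)
```

The VERSION line carries two identifiers for the same sequence: the accession.version pair and the GI number.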
GenBank Record: Feature Table
• GenPept identifier
GenBank Record: sequence
NCBI Homepage: http://www.ncbi.nlm.nih.gov/
• Entrez
• BLAST
• Mendelian Inheritance in Man
• Online Help
Using Entrez
An integrated database search and retrieval system
Entrez: Neighboring and Hard Links
[Figure: Entrez databases joined by hard links, each labeled with its neighboring method.
 PubMed abstracts: word weight
 Nucleotide sequences: BLAST
 Protein sequences: BLAST
 3-D Structure (MMDB): VAST
 Taxonomy: phylogeny
 Genomes]
GEO (Gene Expression Omnibus): a database that collects and stores microarray gene expression data.
UniGene
Database Searching with Entrez
Using limits and field restriction to find mouse GAPD

Linking and neighboring with mouse GAPD

Entrez Nucleotides
Mouse
Document Summaries: Mouse[All Fields]
7 million records
Data Rich, Knowledge Poor
Do not drown yourself in the "ocean of data and information"; go and find the "islands of knowledge".
What are data, information, and knowledge?
Keep in mind that what bioinformatics uses for storage today is called a DATABASE.
Entrez Nucleotides: Limits: Preview/Index
Mouse
Entrez Nucleotides: Limits
Field Restriction
• Accession
• All Fields
• Author Name
• EC/RN Number
• Feature key
• Filter
• Gene Name
• Issue
• Journal Name
• Keyword
• Modification Date
• Organism
• Page Number
• Primary Accession
• Properties
• Protein Name
• Publication Date
• SeqID String
• Sequence Length
• Substance Name
• Text Word
• Title Word
• Uid
Exclude unwanted categories of sequences
Gene Location
• Genomic DNA/RNA
• Mitochondrion
• Chloroplast
Molecule
• Genomic DNA/RNA
• mRNA
• rRNA
Only From
• RefSeq
• GenBank
• EMBL
• DDBJ
Entrez Nucleotides: Limits: Organism
Mouse
Document Summaries: Mouse[Organism]
  7,247,131  Mouse[All Fields]
- 6,850,905  Mouse[Organism]
=   396,226  records that match "Mouse" in some field but not in the Organism field
Exclude Bulk Sequences, mRNA
502,497
Preview/Index
Adding Terms: Preview/Index
Search History
glyceraldehyde 3 phosphate dehydrogenase
mouse AND glyceraldehyde 3 phosphate dehydrogenase[Title]
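A field-restricted query like the one above can also be assembled programmatically. The sketch below is an assumption layered on the slides, which only show the web interface: it builds the query string and a URL for NCBI's E-utilities `esearch` endpoint, with `nucleotide` as the assumed database name. The helper names `tagged`, `build_query`, and `esearch_url` are hypothetical, and nothing is sent over the network.

```python
from urllib.parse import urlencode

# E-utilities search endpoint (not mentioned in the slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def tagged(term, field=None):
    """Attach an Entrez field tag, e.g. tagged('mouse', 'Organism')."""
    return f"{term}[{field}]" if field else term


def build_query(*clauses):
    """Join clauses with the Boolean AND operator (uppercase in Entrez)."""
    return " AND ".join(clauses)


def esearch_url(db, term):
    """Return the URL that would run `term` against database `db`."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})


query = build_query(
    tagged("mouse", "Organism"),
    tagged("glyceraldehyde 3 phosphate dehydrogenase", "Title"),
)
url = esearch_url("nucleotide", query)
print(query)
print(url)
```

Restricting "mouse" to the Organism field and the enzyme name to the Title field mirrors the Limits and Preview/Index steps shown on the slides.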
Mouse GAPD Records: 161
History
#18 AND #6
Displaying Mouse GAPD Records
Formats
• Summary
• Brief
• GenBank
• ASN.1
• FASTA
• GI list
Links and neighbors (related records)
• LinkOut
• PubMed Links
• Protein Links
• Nucleotide Neighbors
• PopSet Links
• Structure Links
• Genome Links
• Taxonomy Links
• OMIM Links
Entrez GenBank / GenPept
FASTA Format
>gi|193425|gb|M60978.1|MUSGAPDS Mus musculus testis-specific isoform of glycerald
GGCAGCCAGGCCATGAGATCTTAGGCCATGTCGAGACGTGACGTGGTCCTTACCAATGTTACTGTTGTCC
AGCTACGGCGGGACCGATGCCCATGCCCATGCCCATGCCCATGTCCATGCCCATGCCCTGTGATCAGACC
ACCTCCACCCAAGCTTGAGGATCCACCACCCACGGTTGAAGAACAGCCACCGCCACCGCCGCCGCCACCT
CCACCTCCACCACCACCTCCTCCTCCTCCTCCACCCCAGATAGAGCCAGACAAGTTTGAAGAGGCTCCCC
CTCCCCCTCCCCCTCCTCCTCCTCCTCCCCCTCCCCCTCCTCCACCACTCCAAAAGCCAGCTAGAGAGCT
GACAGTGGGTATCAATGGATTTGGACGCATTGGTCGTCTGGTGCTGCGAGTCTGCATGGAGAAGGGCATT
AGGGTGGTAGCAGTGAATGACCCATTCATTGATCCAGAATACATGGTTTACATGTTCAAATATGACTCCA
CACATGGTAGATACAAAGGAAACGTGGAACATAAGAATGGACAACTAGTTGTGGACAACCTTGAGATCAA
CACGTACCAGTGCAAAGACCCTAAAGAAATCCCCTGGAGCTCTATAGGGAATCCCTACGTGGTGGAGTGT
ACAGGCGTCTATCTGTCCATCGAGGCAGCTTCGGCACATATTTCATCTGGTGCCAGGCGTGTGGTGGTCA
CTGCACCCTCCCCCGATGCACCCATGTTTGTCATGGGAGTGAACGAGAAGGACTATAACCCTGGCTCTAT
GACCATTGTCAGCAATGCATCCTGTACCACCAACTGCCTGGCTCCTCTCGCCAAGGTTATTCATGAAAAC
TTCGGGATCGTGGAAGGGCTAATGACCACAGTCCATTCCTACACAGCCACTCAGAAGACAGTGGATGGGC
CATCAAAGAAGGACTGGCGAGGTGGCCGCGGCGCTCACCAAAACATCATCCCATCGTCCACTGGGGCTGC
CAAGGCTGTAGGCAAAGTCATCCCAGAGCTCAAAGGGAAGCTAACAGGAATGGCATTCCGGGTGCCAACC
CCAAACGTGTCAGTTGTGGACCTGACCTGCCGCCTGGCCAAGCCTGCTTCTTACTCGGCTATCACGGAGG
CTGTGAAAGCTGCAGCCAAGGGACCTTTGGCTGGCATCCTTGCTTACACAGAGGACCAGGTGGTCTCCAC
GGACTTTAACGGCAATCCCCATTCTTCCATCTTTGATGCTAAGGCTGGAATTGCCCTCAATGACAACTTC
GTGAAGCTTGTTGCCTGGTACGACAACGAATATGGCTACAGTAACCGAGTGGTCGACCTCCTCCGCTACA
TGTTTAGCCGAGAGAAGTAACACAAAAGGCCCCTCCTTGCTCCCCTGCGCACCTCGCGTTCCTGACTTCG
GCTTCCACTCAAAGGCGCCGCCACCGGGTCAACAATGAAATAAAAACGAGAATGCGC

FASTA Definition Line
>gi|193425|gb|M60978.1|MUSGAPDS
• gi number: 193425
• Database identifier: gb
• Accession number: M60978.1
• Locus Name: MUSGAPDS

Database Identifiers
gb    GenBank
emb   EMBL
dbj   DDBJ
sp    SWISS-PROT
pdb   Protein Databank
pir   PIR
prf   PRF
ref   RefSeq
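The pipe-delimited definition line can be split into its parts mechanically. The following is a small sketch, assuming the `gi|number|db|accession|locus` layout shown on the slide; `parse_defline` is an illustrative name, and the database table is the one listed on the slide.

```python
# Database identifiers as listed on the slide.
DB_NAMES = {
    "gb": "GenBank",
    "emb": "EMBL",
    "dbj": "DDBJ",
    "sp": "SWISS-PROT",
    "pdb": "Protein Databank",
    "pir": "PIR",
    "prf": "PRF",
    "ref": "RefSeq",
}


def parse_defline(line):
    """Split a '>gi|...|db|accession|locus description' line into fields."""
    if not line.startswith(">"):
        raise ValueError("a FASTA definition line must start with '>'")
    identifier, _, description = line[1:].partition(" ")
    parts = identifier.split("|")
    if len(parts) < 5 or parts[0] != "gi":
        raise ValueError("expected gi|number|db|accession|locus")
    return {
        "gi": int(parts[1]),
        "db": parts[2],
        "db_name": DB_NAMES.get(parts[2], "unknown"),
        "accession": parts[3],
        "locus": parts[4],
        "description": description,
    }


info = parse_defline(
    ">gi|193425|gb|M60978.1|MUSGAPDS Mus musculus testis-specific isoform"
)
print(info)
```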
Abstract Syntax Notation: ASN.1
Seq-entry ::= set {
  level 1 ,
  class nuc-prot ,
  descr {
    title "Mus musculus testis-specific isoform of glyceraldehyde 3-phosphate dehydrogenase (Gapd-S) mRNA, and translated products" ,
    update-date
      std {
        year 1994 ,
        month 11 ,
        day 9 } ,
    source {
      org {
        taxname "Mus musculus" ,
        common "house mouse" ,
        db {
          {
            db "taxon" ,
            tag
              id 10090 } } ,
[Figure: ASN.1 shown together with the display formats GenBank, GenPept, FASTA Nucleotide, and FASTA Protein]
NCBI Toolbox
/*****************************************************************************
*
*   asn2ff.c
*     convert an ASN.1 entry to flat file format, using the FFPrintArrayPtrs.
*
*****************************************************************************/
#include <accentr.h>
#include "asn2ff.h"
#include "asn2ffp.h"
#include "ffprint.h"
#include <subutil.h>
#include <objall.h>
#include <objcode.h>
#include <lsqfetch.h>
#include <explore.h>
FILE *fpl;
Args myargs[] = {
  {"Filename for asn.1 input","stdin",NULL,NULL,TRUE,'a',ARG_FILE_IN,0.0,0,NULL},
  {"Input is a Seq-entry","F", NULL ,NULL ,TRUE,'e',ARG_BOOLEAN,0.0,0,NULL},
  {"Input asnfile in binary mode","F",NULL,NULL,TRUE,'b',ARG_BOOLEAN,0.0,0,NULL},
  {"Output Filename","stdout", NULL,NULL,TRUE,'o',ARG_FILE_OUT,0.0,0,NULL},
  {"Show Sequence?","T", NULL ,NULL ,TRUE,'h',ARG_BOOLEAN,0.0,0,NULL},
.
.
#ifdef ENABLE_ID1
#include <accid1.h>
#endif

Toolbox Sources
ftp://ncbi.nlm.nih.gov/toolbox/ncbi_tools
ftp> open ncbi.nlm.nih.gov
ftp> cd toolbox
ftp> cd ncbi_tools
Protein Neighbors-Structure Links
Related Proteins
Cn3D GAPD Structure
Structure Links
Advanced Neighbors: BLink
Online Books
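Advice
Never let yourself become a mere collector of data, and do not let yourself become just a database (that is the computer's job). Become a processor of this information, and make yourself a person with knowledge!

Hua Luogeng (华罗庚)
• When you read, go from thin to thick, then from thick to thin.
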
Entrez Structures
Molecular Modeling Database (MMDB) and Cn3D
MMDB: Molecular Modeling Data Base

Derived from experimentally determined PDB records

Value added to PDB records including:
• Addition of explicit chemical graph information
• Validation
• Inclusion of Taxonomy, Citation, and other information
• Conversion to parseable ASN.1 data description language

Structure neighbors determined by Vector Alignment Search Tool (VAST)
Searching MMDB
1CET
Structure Summary
• BLAST neighbors
• VAST neighbors
• Cn3D viewer
Cn3D: Displaying Structures
Chloroquine
Structure Neighbors
Structural Alignments
Chloroquine
NADH
Why do we need similarity searching?
Identification and annotation
• Incomplete or no annotations (GenBank)
• Incorrectly annotated sequences
Evolutionary relationships
• Homologous molecules may have similar functions, but it ain't necessarily so!
Basic Local Alignment Search Tool
• Widely used similarity search tool
• Heuristic approach based on the Smith-Waterman algorithm
• Finds best local alignments
• Provides statistical significance
• All combinations (DNA/Protein) of query and database:
  • DNA vs DNA
  • DNA translation vs Protein
  • Protein vs Protein
  • Protein vs DNA translation
  • DNA translation vs DNA translation
• www, email server, standalone, and network clients
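For orientation, here is a minimal sketch of the exact Smith-Waterman recurrence that BLAST approximates heuristically. It is not BLAST itself: it fills the full dynamic-programming table, returns only the best local score, and uses a simple linear gap cost. The match, mismatch, and gap values are illustrative assumptions.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (linear gap cost)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                  # start a new local alignment here
                prev[j - 1] + sub,  # extend with a match or mismatch
                prev[j] + gap,      # gap in b
                curr[j - 1] + gap,  # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

The `0` option in the recurrence is what makes the alignment local: a poorly scoring prefix is dropped rather than carried along.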
Local Alignment Statistics
High scores of local alignments between two random sequences follow the Extreme Value Distribution.
For ungapped alignments, the expected number with score S or greater is
$E = K m n \, e^{-\lambda S}$
or
$E = m n \, 2^{-S'}$
where
K = scale for search space
λ = scale for scoring system
S' = bit score: $S' = (\lambda S - \ln K) / \ln 2$
m, n = lengths of the two sequences being compared
http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
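The two forms of the E-value formula are equivalent, as a short numerical check shows. This is a sketch only: the values chosen for K, λ (`lam`), the raw score, and the lengths m and n are hypothetical placeholders, not parameters of any particular scoring matrix.

```python
import math


def evalue(score, m, n, K, lam):
    """E = K * m * n * exp(-lambda * S), for raw score S."""
    return K * m * n * math.exp(-lam * score)


def bit_score(score, K, lam):
    """S' = (lambda * S - ln K) / ln 2."""
    return (lam * score - math.log(K)) / math.log(2)


def evalue_from_bits(bits, m, n):
    """E = m * n * 2^(-S'), for bit score S'."""
    return m * n * 2.0 ** (-bits)


# Hypothetical illustration values.
K, lam = 0.13, 0.32
m, n = 250, 1_000_000
raw = 80

e_raw = evalue(raw, m, n, K, lam)
e_bits = evalue_from_bits(bit_score(raw, K, lam), m, n)
print(e_raw, e_bits)
```

Because the bit score absorbs K and λ, bit scores from searches with different scoring systems can be compared directly.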
Scoring Systems
• Nucleic acids
  • identity matrix
• Proteins
  • Position Independent Matrices
    • PAM Matrices (Percent Accepted Mutation)
      • Implicit model of evolution
      • Higher PAM numbers all calculated from PAM1
      • PAM250 widely used
    • BLOSUM Matrices (BLOck SUbstitution Matrices)
      • Empirically determined from alignment of conserved blocks
      • Each includes information up to a certain level of identity
      • BLOSUM62 widely used
  • Position Specific Score Matrices (PSSM)
    • PSI and RPS BLAST
BLOSUM62
A  4
R -1  5
N -2  0  6
D -2 -2  1  6
C  0 -3 -3 -3  9
Q -1  1  0  0 -3  5
E -1  0  0  2 -4  2  5
G  0 -2  0 -1 -3 -2 -2  6
H -2  0  1 -1 -3  0  0 -2  8
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  X
• Common amino acids have low weights
• Rare amino acids have high weights
• Negative for less likely substitutions
• Positive for more likely substitutions
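To make the matrix lookup concrete, the sketch below scores an ungapped alignment with a handful of BLOSUM62 entries copied from the matrix above. Only a small excerpt is included, so `score_pair` raises `KeyError` for any pair outside it; the function names are illustrative.

```python
# A few BLOSUM62 entries taken from the matrix above (lower triangle).
BLOSUM62_EXCERPT = {
    ("A", "A"): 4,
    ("C", "C"): 9,
    ("L", "L"): 4,
    ("W", "W"): 11,
    ("L", "I"): 2,
    ("V", "I"): 3,
    ("V", "L"): 1,
    ("W", "A"): -3,
}


def score_pair(x, y):
    """Look up a substitution score; the matrix is symmetric."""
    if (x, y) in BLOSUM62_EXCERPT:
        return BLOSUM62_EXCERPT[(x, y)]
    return BLOSUM62_EXCERPT[(y, x)]


def score_ungapped(a, b):
    """Sum of pair scores for two aligned sequences of equal length."""
    if len(a) != len(b):
        raise ValueError("ungapped alignment needs equal lengths")
    return sum(score_pair(x, y) for x, y in zip(a, b))


print(score_ungapped("WIL", "WVL"))  # W/W + I/V + L/L
```

The rare tryptophan contributes far more to the total than the common leucine, which is the point of the "rare amino acids have high weights" note.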
Position Specific Substitution Rates
• Typical serine
• Active site serine
Position Specific Score Matrix (PSSM)
Pos Res  A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
206 D    0 -2  0  2 -4  2  4 -4 -3 -5 -4  0 -2 -6  1  0 -1 -6 -4 -1
207 G   -2 -1  0 -2 -4 -3 -3  6 -4 -5 -5  0 -2 -3 -2 -2 -1  0 -6 -5
208 V   -1  1 -3 -3 -5 -1 -2  6 -1 -4 -5  1 -5 -6 -4  0 -2 -6 -4 -2
209 I   -3  3 -3 -4 -6  0 -1 -4 -1  2 -4  6 -2 -5 -5 -3  0 -1 -4  0
210 S   -2 -5  0  8 -5 -3 -2 -1 -4 -7 -6 -4 -6 -7 -5  1 -3 -7 -5 -6
211 S    4 -4 -4 -4 -4 -1 -4 -2 -3 -3 -5 -4 -4 -5 -1  4  3 -6 -5 -3
212 C   -4 -7 -6 -7 12 -7 -7 -5 -6 -5 -5 -7 -5  0 -7 -4 -4 -5  0 -4
213 N   -2  0  2 -1 -6  7  0 -2  0 -6 -4  2  0 -2 -5 -1 -3 -3 -4 -3
214 G   -2 -3 -3 -4 -4 -4 -5  7 -4 -7 -7 -5 -4 -4 -6 -3 -5 -6 -6 -6
215 D   -5 -5 -2  9 -7 -4 -1 -5 -5 -7 -7 -4 -7 -7 -5 -4 -4 -8 -7 -7
216 S   -2 -4 -2 -4 -4 -3 -3 -3 -4 -6 -6 -3 -5 -6 -4  7 -2 -6 -5 -5
217 G   -3 -6 -4 -5 -6 -5 -6  8 -6 -8 -7 -5 -6 -7 -6 -4 -5 -6 -7 -7
218 G   -3 -6 -4 -5 -6 -5 -6  8 -6 -7 -7 -5 -6 -7 -6 -2 -4 -6 -7 -7
219 P   -2 -6 -6 -5 -6 -5 -5 -6 -6 -6 -7 -4 -6 -7  9 -4 -4 -7 -7 -6
220 L   -4 -6 -7 -7 -5 -5 -6 -7  0 -1  6 -6  1  0 -6 -6 -5 -5 -4  0
221 N   -1 -6  0 -6 -4 -4 -6 -6 -1  3  0 -5  4 -3 -6 -2 -1 -6 -1  6
222 C    0 -4 -5 -5 10 -2 -5 -5  1 -1 -1 -5  0 -1 -4 -1  0 -5  0  0
223 Q    0  1  4  2 -5  2  0  0  0 -4 -2  1  0  0  0 -1 -1 -3 -3 -4
224 A   -1 -1  1  3 -4 -1  1  4 -3 -4 -3 -1 -2 -2 -3  0 -2 -2 -2 -3
• Serine scored differently in these two positions (211 and 216)
• Active site nucleophile (216)
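A position-specific matrix is indexed by alignment position as well as by residue, which a tiny lookup illustrates. The sketch below uses only the A, S, and T columns for positions 211 and 216 as read from the PSSM on this slide; the dictionary layout and the name `pssm_score` are illustrative choices.

```python
# Excerpt of the slide's PSSM: {position: {residue: score}}.
PSSM_EXCERPT = {
    211: {"A": 4, "S": 4, "T": 3},    # typical serine
    216: {"A": -2, "S": 7, "T": -2},  # active site serine
}


def pssm_score(position, residue):
    """Score for placing `residue` at alignment `position`."""
    return PSSM_EXCERPT[position][residue]


for pos in (211, 216):
    row = PSSM_EXCERPT[pos]
    print(pos, {aa: pssm_score(pos, aa) for aa in sorted(row)})
```

At position 211 alanine or threonine score about as well as serine, while at position 216 only serine scores positively. A position-independent matrix such as BLOSUM62 cannot express this, since it assigns one score per residue pair regardless of position.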
Gapped Alignments
• Gapping provides more biologically realistic alignments
• Statistical behavior not completely understood for gapped alignments
• Gapped BLAST parameters must be found by simulations for each matrix
• Affine gap costs = -(a + bk)
  a = gap open penalty, b = gap extend penalty, k = gap length
  A gap of length 1 receives the score -(a + b)
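The affine gap formula is easy to check numerically. In this sketch the penalties a = 11 and b = 1 are illustrative values chosen for the example, and `affine_gap_score` is a hypothetical helper name.

```python
def affine_gap_score(k, a, b):
    """Score of a single gap of length k: -(a + b*k)."""
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return -(a + b * k)


# Illustrative penalties: open = 11, extend = 1.
for k in (1, 2, 5):
    print(k, affine_gap_score(k, a=11, b=1))
```

Under an affine model one long gap costs less than several short gaps of the same total length, because the opening penalty a is paid only once per gap.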
Intermission