Genome, Protein
and
Model Organism Databases
Anne Estreicher
Swiss-Prot Group
Swiss Institute of Bioinformatics
Geneva – Switzerland
Bioinformatic and Comparative Genome Analysis Course
HKU-Pasteur Research Centre - Hong Kong, China
August 17 - August 29, 2009
Outline
1. Introduction (definitions, history…)
2. From DNA sequence to genomic tools
3. The flow of information: from DNA to proteins
4. Protein sequence databases
5. MODs at a glance
What is a database ?
• A collection of related data, which are
– structured
– searchable
– updated periodically
– cross-referenced
• Also includes the associated tools needed for access/query, download, etc.
Why do we need databases ?
• Data need to be stored, curated and made available for analysis and knowledge discovery
• Efficient way of sharing data, independently of regular publications
• Essential resources for both experimental and computational biologists
Databases in biology: not a new issue…
• 1954: First protein sequence (insulin, by F. Sanger)
• 1965: Atlas of Protein Sequence and Structure (65 proteins)
The first protein sequence "database", by Margaret Dayhoff (1965), contained 65 proteins
Databases: not a new issue…
• 1954: First protein sequence (insulin, by F. Sanger)
• 1965: Atlas of Protein Sequence and Structure (65 proteins)
• Mid 70s: Improvements in DNA sequencing
• 1979: Los Alamos Sequence Library (Walter Goad)
• 1980: ~ 80 genes fully sequenced
-> Need to store the data and to make them available for analysis (in a format acceptable for human eyes and machines)
-> ARCHIVE
-> RACE for the central position in life sciences…
And the winner is…
Databases: not a new issue…
• EMBL-Bank - Europe (1980)
• GenBank - USA (1982)
• DDBJ - Asia (1986)
leading to the establishment of the INSDC (International Nucleotide Sequence Database Collaboration) -> daily exchanges of data
www.insdc.org
EMBL-Bank - GenBank - DDBJ
• Main resources for DNA and RNA sequences;
• Sequences used to be retrieved from publications -> now direct submissions from individual researchers, genome sequencing projects and patent applications:
“Journal publishers generally require sequence deposition prior to publication so that an accession number can be included in the paper.”
1. True for nucleic acid, not for protein sequences;
2. Not always put into practice
=> Sequences that are not submitted are LOST!!!
• Archives (primary databases)
• Data belong to submitters
EMBL-Bank - GenBank - DDBJ
Archive (primary databases) => data belong to the submitter
• Minimal checks, such as vector contamination
• Annotation by the submitters
Databases: not a new issue…
1954
1965
1979
1982
1984
1986
First protein sequence (insulin by F. Sanger)
Atlas of Protein Sequence and Structure (65 proteins)
Los Alamos Sequence Library (Walter Goad) – DNA
EMBL-Bank - DNA
GenBank – DNA
DDBJ - DNA
-> ARCHIVES (primary databases) may not be sufficient
-> need to annotate the data to produce KNOWLEDGE
• 1986: Swiss-Prot – protein sequences – a paradigm for annotated (secondary) databases
The Swiss-Prot concept
• Non-redundant: protein products of 1 gene / 1 species -> 1 entry;
• Manually annotated (=> curator judgement on data !);
• Highly cross-referenced (1st life-science database to provide cross-references; links to > 130 databases from www.uniprot.org).
Databases: not a new issue…
• 1954: First protein sequence (insulin, by F. Sanger)
• 1965: Atlas of Protein Sequence and Structure (65 proteins)
• 1979: Los Alamos Sequence Library (Walter Goad) – DNA
• 1982: EMBL-Bank – DNA; GenBank – DNA
• 1984: Protein Information Resource (PIR) – protein sequences
• 1986: DDBJ – DNA; Swiss-Prot – protein sequences
• 1996: TrEMBL (Translated EMBL) – protein sequences. Complement of Swiss-Prot to cope with the increasing amount of new sequences; AUTOMATIC ANNOTATION !
UniProtKB/Swiss-Prot growth
[Figure: number of Swiss-Prot entries (0 to 500'000) by release number (2 to 57).]
• 1986: 3’939 entries
• 1996 (creation of TrEMBL): Swiss-Prot 52’205 entries; TrEMBL 61’137 entries
• Swiss-Prot rel. 57.5 (07-Jul-2009): 470’369 entries
UniProtKB growth
[Figure: number of entries (0 to 9'000'000) by release, 1986 to 2009; TrEMBL (automated curation) vs. Swiss-Prot (manual curation).]
• TrEMBL rel. 40.5 (07-Jul-2009): 8’594’382 entries
• Swiss-Prot rel. 57.5 (07-Jul-2009): 470’369 entries
TrEMBL growth (sequences/day):
• 2004: 1’500
• 2006-2007: 3’500
• 2008: >5’000
• 2009: ~8’000
New challenge
• Flood of data -> need to be stored, curated and made available for analysis and knowledge discovery
(R)evolution of these last 20 years
• Life sciences used to be rich in hypotheses, well-off in knowledge and poor in data;
• Today they are very rich in data, not so well-off in knowledge and very poor in hypotheses.
?
List of parts
Complex
system
Science (1993) 262, 502
EMBL Database Growth
http://www.ebi.ac.uk/embl/Services/DBStats/
Danger !
http://www.ncbi.nlm.nih.gov/genomes/static/gpstat.html
http://www.ncbi.nlm.nih.gov/genomes/GenomesHome.cgi?taxid=10239&hopt=stat
In 4 months, 374 new genomes
and 77 were completed
~ 100 genomes/month
(in 2008 -> ~50 genomes/month)
+ ~2’360 viral (& viroid) genomes
=> Total ~ 5’600 genomes
http://genomesonline.org/index2.htm
http://www.genomesonline.org/gold.cgi
Metagenomics:
study of genetic material recovered directly from
environmental samples
• Global Ocean Sampling (C. Venter)
• Whale fall
• Soil, sand beach, New-York air, …
• Human fluids, mouse gut
• …
Venter’s Sorcerer II
Flood in the world of proteins…
• 1965: first protein sequence "database" by Margaret Dayhoff (65 proteins)
• July 2009: ~ 20 million unique protein sequences (source: UniParc - http://www.uniprot.org/uniparc/)
UniParc: non-redundant database that contains most of the publicly available protein sequences in the world. It includes sequences from the EMBL-Bank/DDBJ/GenBank nucleotide sequence databases, Ensembl, FlyBase, H-Invitational Database (H-Inv), International Protein Index (IPI), Patent Offices (EPO, JPO and USPTO), PIR-PSD, Protein Data Bank (PDB), Protein Research Foundation (PRF), RefSeq, Saccharomyces Genome Database (SGD), TAIR Arabidopsis thaliana Information Resource, TROME, UniProtKB/Swiss-Prot and TrEMBL, Vertebrate Genome Annotation database (VEGA) and WormBase.
New challenge
• Flood of data
• Flood of databases…
The 1st NAR issue of the year is always dedicated to databases, and a "clean" list of databases is provided (! not exhaustive !)
The NAR Online Molecular Biology Database collection in 2009
A total of 1’170 databases (19 obsolete removed)
NAR "clean" list of databases
Most recent NAR paper about the database (not available for all databases; some are described in other journals)
A "clean" list of databases can be found in the NAR online molecular biology database collection
http://www.oxfordjournals.org/nar/database/a/
BIOLOGICAL DATABASE CATEGORIES
• Databases of nucleic acid sequences (RNA, DNA)
• Databases of protein sequences
• Databases of protein motifs and protein domains
• Databases of structures
• Databases of genomes
• Databases of genes
• Databases of expression profiles
• Databases of SNPs and mutations
• Databases of metabolic pathways
• Databases of protein interactions
• Databases of taxonomy
• …
(Callout on the slide: databases containing sequences or data directly derived from sequences.)
DNA sequences: What ? Where ? How ?
& genomic tools (NCBI, UCSC)
Stable accession number (should always be cited in publications)
Possible molecule types: genomic DNA and RNA; mRNA; other DNA and RNA; rRNA; transcribed RNA; tRNA; unassigned DNA and RNA; viral cRNA
GenBank entry AF415175
http://www.ncbi.nlm.nih.gov/nuccore/16589063
Fields of the entry:
• Accession number
• Molecule type
• Date of submission
• Definition
• Taxonomy
• References
• Features: information provided by the submitter; may include annotation of the sequence (organism, molecule type, chromosomal location, tissue type, gene name)
• CDS annotation => protein sequence + Protein IDentifier (PID: stable identifier & version number); gives access to the nucleic acid sequence of the CDS (not of the entire mRNA)
• Protein sequence
• Nucleotide sequence
"Features" may provide much more information depending upon the sequence and the submitter…
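The header fields listed for a GenBank entry can be pulled out of the flat-file text with a few lines of code. The following is a minimal sketch, assuming the usual whitespace-separated LOCUS line layout; the record below is invented for illustration and is not the real AF415175 entry, and `parse_header` is a hypothetical helper name.

```python
# Made-up record for illustration only; NOT the content of AF415175.
SAMPLE_RECORD = """\
LOCUS       XX000001                 120 bp    mRNA    linear   PRI 01-JAN-2009
DEFINITION  Hypothetical example mRNA, partial cds.
ACCESSION   XX000001
VERSION     XX000001.1
SOURCE      Homo sapiens (human)
"""


def parse_header(record: str) -> dict:
    """Extract a few header fields from a GenBank-style flat-file record."""
    fields = {}
    for line in record.splitlines():
        # Keywords start in column 1; continuation lines start with spaces
        # and therefore yield an empty keyword, which is ignored here.
        keyword, _, value = line.partition(" ")
        value = value.strip()
        if keyword == "LOCUS":
            parts = value.split()
            fields["length_bp"] = int(parts[1])
            fields["molecule_type"] = parts[3]
            fields["date"] = parts[-1]  # date shown on the LOCUS line
        elif keyword in ("DEFINITION", "ACCESSION", "VERSION", "SOURCE"):
            fields[keyword.lower()] = value
    return fields


info = parse_header(SAMPLE_RECORD)
print(info["accession"], info["molecule_type"], info["length_bp"])
```

For real work an established parser is usually a safer choice than a hand-written one, since real entries contain multi-line fields and a feature table that this sketch skips.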
3’end of chromosome Y
EMBL #AJ271736
Very similar view, links and
options from the 3 sites:
EMBL-Bank – GenBank - DDBJ
http://www.ebi.ac.uk/embl/
http://www.ncbi.nlm.nih.gov/
http://www.ddbj.nig.ac.jp/
How to find a DNA sequence at
the NCBI…
http://www.ncbi.nlm.nih.gov/
Databases @ NCBI
http://www.ncbi.nlm.nih.gov/Database/datamodel/index.html
The Entrez system:
integrated, text-based search and retrieval system used at NCBI for
the major databases, including PubMed, Nucleotide and Protein
Sequences, Protein Structures, Complete Genomes, Taxonomy, and others
=> Maximal interconnectivity
Databases @ NCBI
http://www.ncbi.nlm.nih.gov/Database/datamodel/index.html
Simple search with a
EMBL-Bank/GenBank/DDBJ
accession number
Searching from
a bibliographic reference…
Search results 2 and 3
-> accession numbers provided by the authors in the article
-> GenBank records
Search result 1
-> corresponds to the RefSeq database…
RefSeq (Reference Sequence)
• Provides a comprehensive, integrated, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts, and proteins;
• Most data extracted from GenBank -> choice of a reference sequence and annotation (no documented comparison between sequences);
• Some entries based on predictions (accession prefixes: XM_, XR_, XP_, ZP_);
• Currently, 8'665 species represented;
• Annotation:
– Manual annotation (only in entries tagged as "reviewed");
– Collaboration;
– Propagation from other sources;
– Computation.
RefSeq (Reference Sequence)
Status                Curation
GENOME ANNOTATION     No
INFERRED              No
MODEL                 No
PREDICTED             No
PROVISIONAL           No
REVIEWED              Yes (sequence + functional information and features)
VALIDATED             Yes (initial sequence)
WGS                   No
RefSeq entry NM_015595: SGEF mRNA
• Accession number
• Definition
• Taxonomy
• List of references
• Gene name
• Exon annotation
• CDS annotation and sequence
• Sequence
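RefSeq accession prefixes such as NM_ (as in NM_015595) and the prediction prefixes XM_, XR_, XP_ and ZP_ can be checked mechanically. Below is a small sketch of such a check. The prediction flags follow the slide; the molecule types and the NR_ and NP_ prefixes come from general RefSeq conventions rather than from this lecture, and `classify_refseq` is a hypothetical helper name.

```python
# Prefix -> (molecule type, based on prediction?).
# Assumption: only a handful of common prefixes are covered here.
REFSEQ_PREFIXES = {
    "NM_": ("mRNA", False),
    "NR_": ("non-coding RNA", False),
    "NP_": ("protein", False),
    "XM_": ("mRNA", True),
    "XR_": ("non-coding RNA", True),
    "XP_": ("protein", True),
    "ZP_": ("protein", True),
}


def classify_refseq(accession: str):
    """Return molecule type and prediction flag, or None if the prefix is unknown."""
    prefix = accession[:3].upper()
    if prefix not in REFSEQ_PREFIXES:
        return None
    molecule, predicted = REFSEQ_PREFIXES[prefix]
    return {"accession": accession, "molecule": molecule, "predicted": predicted}


print(classify_refseq("NM_015595"))
print(classify_refseq("AF415175"))  # a GenBank accession, so no RefSeq prefix
```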
Searching with the gene name…
Results from GenBank, RefSeq, etc.
NCBI Entrez system
• Looks for the request in all NCBI databases
• Cannot be ignored -> no simple way to search only in your favourite NCBI database
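Outside the web interface described above, NCBI also exposes Entrez programmatically through the E-utilities, where the target database is an explicit parameter. The sketch below only builds a search URL and does not send any request. It is not part of the lecture, and it assumes that the `esearch.fcgi` endpoint and the `[Accession]` field tag behave as described in NCBI's documentation; `esearch_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumed base URL of the E-utilities ESearch service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL restricted to a single Entrez database."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return EUTILS_ESEARCH + "?" + query


# Search the nucleotide database for the GenBank accession used in the slides.
url = esearch_url("nucleotide", "AF415175[Accession]")
print(url)
```

Restricting `db` to one database is the design point here: it is the scripted counterpart of choosing a single database in the Entrez search box.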
Searching using BLAST…
RefSeq
UniGene:
Clusters of transcript sequences
that appear to come from the
same transcription locus
UniSTS:62643 maps to multiple loci in Homo sapiens !?
Information on tissue expression
Map viewer tracks:
• UniGene
• Mapping of known genes
• Mapping of RNA (EMBL/GenBank/DDBJ & RefSeq)
• Mapping of RefSeq RNA
This default view can be customized:
1. Choose the desired option;
2. Add it (and/or remove the undesired ones);
3. Apply the new display
Zoom out -> a better view
of the genomic context of
the sequence of interest
Original view
Map viewer
~ 110 organisms represented
in Genome database.
(www.ncbi.nlm.nih.gov/sites/entrez?db=genome)
Genomic tools on the UCSC server: BLAT search
A total of 47 organisms, and: A. gambiae, A. mellifera, S. cerevisiae
Genome browser @ UCSC
cDNA sequence
Feb. 2009 assembly: not all data implemented ! May be better to use the former assembly for the time being.
http://genome.ucsc.edu/cgi-bin/hgBlat
Chromosomal location
gDNA sequence
Consensus CDS
& other sequences from
reliable resources
Annotation of genes is provided by multiple public resources, using
different methods, and resulting in information that is similar but not
always identical.
CCDS database goal: provide a standard set of gene annotations.
Collaborative project involving teams (manual and automated annotation):
* European Bioinformatics Institute (EBI)
* National Center for Biotechnology Information (NCBI)
* Wellcome Trust Sanger Institute (WTSI)
* University of California, Santa Cruz (UCSC)
Currently available only for human and mouse genomes (July 2009):
20'159 human CCDS (including isoforms) -> 17'054 CCDS genes
17'707 mouse CCDS (including isoforms) -> 16'889 CCDS genes
http://www.ncbi.nlm.nih.gov/projects/CCDS/CcdsBrowse.cgi
Chromosomal location
gDNA sequence
All sequences can be retrieved
Consensus CDS
& other sequences from
reliable resources
(Human) mRNAs
(Human) spliced ESTs
(Human) ESTs
(including unspliced)
The view can be
completely
customized…
…including with
various tools allowing
comparative genomics
http://genome.ucsc.edu/
…and including your own data !
Back to the Blat viewer
Arrows >>>> show the direction of transcription
2 transcripts from the same locus:
BDNF (Brain-Derived Neurotrophic Factor)
BDNFOS (BDNF Opposite Strand)
Exons
View of alternative exons
Constitutive exons
Alternative exons
Interested in this exon ?
Just zoom in…
Genome browser @ UCSC has many
great options, give it a try!
http://genome.ucsc.edu/
Typical problems
or
Why wonderful tools will never
replace the brain of a life scientist !
… Once upon a time, there was a gene on chromosome 11…
2 essential genome resources are missing
from this lecture:
Ensembl (http://www.ensembl.org/index.html):
automated annotation of many genomes;
Vega (http://vega.sanger.ac.uk/index.html):
High quality manual annotation of genomes
(currently Homo sapiens, Mus musculus, Danio rerio,
Gorilla gorilla, Macropus eugenii, Sus scrofa, Canis
familiaris).
Please go and visit them!
The flow of information
From DNA sequences
to protein sequences:
A little biology
and
A few databases
From genome to proteome: the example of human
• Genome: ~ 20’500 human protein-encoding genes
-> Alternative promoter usage, alternative splicing, trans-splicing, mRNA editing… (increase in complexity: 2-5 x)
• Transcriptome: ~ 100’000 human transcripts
-> Post-translational modifications (PTMs) (increase in complexity: 5-10 x)
• Proteome: ~ 1'000'000 human proteins
Most PTMs cannot be predicted from DNA sequences
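The orders of magnitude quoted for the transcriptome and the proteome follow from multiplying the gene count by the two complexity factors. A short worked calculation, using only the slide's numbers (the function name `complexity_range` is hypothetical):

```python
def complexity_range(genes, transcript_factor=(2, 5), ptm_factor=(5, 10)):
    """Apply the slide's 2-5x and 5-10x factors to a gene count."""
    transcripts = (genes * transcript_factor[0], genes * transcript_factor[1])
    proteins = (transcripts[0] * ptm_factor[0], transcripts[1] * ptm_factor[1])
    return transcripts, proteins


transcripts, proteins = complexity_range(20_500)
print("transcripts:", transcripts)
print("proteins:", proteins)
```

The upper bounds (about 100'000 transcripts and about 1'000'000 proteins) are the rounded figures shown on the slide.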
The hectic life of a protein sequence…
• cDNAs, ESTs, genomes, … -> nucleic acid databases (EMBL, GenBank, DDBJ: International Nucleotide Sequence Database Collaboration, www.insdc.org)
• …if a Coding Sequence (CDS) is submitted -> protein sequence databases
• no CDS -> gene prediction (RefSeq, Ensembl + some MODs) -> protein sequence databases
• Sequences from publications -> journal scan -> protein sequence databases
• Direct submissions -> protein sequence databases
• Data not submitted to public databases, delayed or cancelled…
!!!!
99% of the protein sequences found in databases come from the translation of nucleotide sequences
=> Experimental evidence may be lacking!
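Since most database protein sequences are conceptual translations of a CDS, it helps to see how little is involved in producing one. The sketch below translates a CDS with the standard genetic code only; it makes no attempt to handle alternative genetic codes, frame offsets or ambiguity codes, and `translate_cds` is a hypothetical helper name.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds: str) -> str:
    """Translate a CDS in frame 1, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")  # accept RNA as well as DNA
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X for unknown codons
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate_cds("ATGGCCTAA"))
```

Nothing in this step checks that the protein is actually made, which is the point of the warning above.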
[Figure: an EMBL (DNA) entry and the derived TrEMBL (Translated EMBL) entry, side by side. Labels: product name / protein name, reference, tissue, translated CDS.]
Automated extraction of protein sequence (translated CDS), gene name and references + automated annotation.
A similar pipeline is used at the NCBI to go from GenBank to GenPept.
!!!!
The quality of UniProtKB/TrEMBL (& GenPept) entries depends upon the quality of the submissions in the original EMBL-Bank/GenBank/DDBJ entry.
[Figure: EMBL (DNA), TrEMBL and Swiss-Prot entries compared. EMBL and TrEMBL labels: product name / protein name, reference, tissue, translated CDS. Swiss-Prot labels: protein nameS, many more references, full annotation, translated CDS + SAPs + isoforms + …]
TrEMBL: automated extraction of protein sequence (translated CDS), gene name and references. Automated annotation.
Swiss-Prot: manual annotation of the sequence and review of associated biological information.
Splice variants
Sequence
Annotations
Nomenclature
Sequence
features
Ontologies
References
Evidence for protein existence:
Annotation in UniProtKB
5 levels of evidence:
1. evidence at protein level,
2. evidence at transcript level,
3. inferred by homology,
4. predicted,
5. uncertain.
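In UniProtKB flat-file entries, the protein existence level is carried by a line starting with `PE`. The sketch below reads such a line and maps the level number to the five labels listed above. It assumes the `PE   <level>: <text>;` layout; the example line is written by hand, not copied from P35613 or Q9Y471, and `parse_pe_line` is a hypothetical helper name.

```python
# The five levels of evidence, using the slide's wording.
PROTEIN_EXISTENCE = {
    1: "evidence at protein level",
    2: "evidence at transcript level",
    3: "inferred by homology",
    4: "predicted",
    5: "uncertain",
}


def parse_pe_line(line: str):
    """Return (level, label) from a line such as 'PE   1: Evidence at protein level;'."""
    if not line.startswith("PE "):
        raise ValueError("not a PE line: " + repr(line))
    level = int(line[2:].split(":", 1)[0])  # int() tolerates the padding spaces
    return level, PROTEIN_EXISTENCE[level]


print(parse_pe_line("PE   1: Evidence at protein level;"))
```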
http://www.uniprot.org/uniprot/P35613
http://www.uniprot.org/uniprot/Q9Y471
UniProtKB/Swiss-Prot: 115 explicit links and 19 implicit links!
Organism-specific dbs: AGD, BuruList, CGD, CTD, CYGD, DictyBase, EchoBASE, EcoGene, euHCVdb, FlyBase, GenAtlas, GeneCards, GeneDB_Spombe, GeneFarm, Gramene, H-InvDB, HGNC, HPA, LegioList, Leproma, ListiList, MaizeGDB, MGI, MIM, MypuList, Orphanet, PharmGKB, PhotoList, PseudoCAP, RGD, SagaList, SGD, SubtiList, TAIR, TubercuList, WormBase, WormPep, Xenbase, ZFIN
Sequence dbs: EMBL, IPI, PIR, UniGene, RefSeq
Proteomic dbs: PeptideAtlas, PRIDE, ProMEX
Genome annotation dbs: Ensembl, GeneID, GenomeReviews, KEGG, NMPDR, TIGR, UCSC, VectorBase
Family and domain dbs: Gene3D, HAMAP, InterPro, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, TIGRFAMs
Phylogenomic dbs: HOGENOM, HOVERGEN, OMA
Gene expression dbs: ArrayExpress, Bgee, CleanEx, GermOnline
Polymorphism dbs: dbSNP
Ontologies: GO
2D-gel dbs: 2DBase-Ecoli, ANU-2DPAGE, Aarhus/Ghent-2DPAGE (no server), COMPLUYEAST-2DPAGE, Cornea-2DPAGE, DOSAC-COBS-2DPAGE, ECO2DBASE (no server), HSC-2DPAGE, OGP, PHCI-2DPAGE, PMMA-2DPAGE, Rat-heart-2DPAGE, REPRODUCTION-2DPAGE, Siena-2DPAGE, SWISS-2DPAGE, World-2DPAGE
Enzyme and pathway dbs: BioCyc, BRENDA, Pathway_Interaction_DB, Reactome
PTM dbs: GlycoSuiteDB, PhosphoSite, PhosSite
Protein-protein interaction dbs: DIP, IntAct
3D structure dbs: DisProt, HSSP, PDB, PDBsum, SMR
Protein family/group dbs: CAZy, MEROPS, PeroxiBase, PptaseDB, REBASE, TCDB
Others: BindingDB, PMAP-CutDB, DrugBank, NextBio
The UniProt consortium
• European Bioinformatics Institute, European Molecular Biology Laboratory
• Swiss Institute of Bioinformatics
• Protein Information Resource
UniProt mission: provide a comprehensive, high-quality and freely accessible resource of protein sequence and functional annotation.
New release every 3 weeks
Update frequency
A crucial issue !!
• Sometimes very difficult, or even impossible, to
find;
• Crucial not only for the database itself, but also
for tools using databases.
Update frequency
http://www.matrixscience.com/search_intro.html
Mascot MS/MS identification tool is fine,
but it cannot be used from this website !
Solution: Download the database of interest
and make sure you work with an up-to-date
version.
Never hesitate to ask for an update
UniProtKB: protein sequence knowledgebase, 2 sections: UniProtKB/Swiss-Prot and UniProtKB/TrEMBL (query, Blast, download) (9’232’223 entries)
UniParc: protein sequence archive (equivalent to EMBL-Bank/GenBank/DDBJ at the protein level). Each entry contains a protein sequence with cross-links to the other databases where the sequence is found (active or not). Not annotated. (query, no Blast on www.uniprot.org, Blast @ EBI, not downloadable) (20’070’606 entries)
UniParc entry contains all
records for a unique
sequence in major publicly
available databases.
TrEMBL entry merged
into Swiss-Prot =>
does not exist anymore
UniRef: 3 sets of clusters of protein sequences at 100, 90 and 50 % identity; useful to speed up sequence similarity searches (BLAST) (query, Blast, download) (UniRef100: 8’474’689 entries; UniRef90: 5’668'669 entries; UniRef50: 2'729'565 entries)
UniRef100, 90 and 50
One UniRef100 entry -> merge of identical sequences (including subfragments, splice variants). Based on UniProtKB sequences and selected UniParc records (such as Ensembl & RefSeq).
One UniRef90 entry -> sequences that are at least 90% identical. Built from UniRef100.
One UniRef50 entry -> sequences that are at least 50% identical. Built from UniRef100.
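The UniRef100 idea of merging identical sequences and subfragments into one entry can be illustrated with a toy procedure. This is a sketch under simplifying assumptions, not the method used to build UniRef: a sequence joins the first longer representative that contains it as an exact substring. The accessions and sequences are invented, and `merge_identical_and_subfragments` is a hypothetical helper name.

```python
def merge_identical_and_subfragments(sequences: dict) -> list:
    """Group identical sequences and exact subfragments under the longest sequence.

    `sequences` maps an accession to its sequence. Returns a list of
    (representative_sequence, [member accessions]) tuples.
    """
    clusters = []
    # Longest first, so that a representative is seen before its fragments.
    ordered = sorted(sequences.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for accession, seq in ordered:
        for rep_seq, members in clusters:
            if seq in rep_seq:  # identical or exact subfragment
                members.append(accession)
                break
        else:
            clusters.append((seq, [accession]))
    return clusters


toy = {"A": "MKTAYIAKQR", "B": "MKTAYIAKQR", "C": "TAYIA", "D": "MSSSS"}
clusters = merge_identical_and_subfragments(toy)
for rep, members in clusters:
    print(rep, members)
```

UniRef90 and UniRef50 need real sequence comparison at an identity threshold, which is beyond this toy example.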
UniMES: protein sequences derived from metagenomic projects (Global Ocean Sampling (GOS)) (Blast, download) (6'028'191 entries)
What is "Non-Redundancy" ?
• UniParc
– One UniParc entry for all entries corresponding to 100%
identical sequences (100% identity over the entire length)
(from many different databases).
• UniRef
– One UniRef100 entry for all entries corresponding to 100% identical sequences (including fragments) from UniProtKB, Ensembl, RefSeq, PDB.
• UniProtKB/Swiss-Prot
– One Swiss-Prot entry for all the protein products of one
gene, including fragments, variations/polymorphisms, splice
variants, sequencing errors…
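The UniParc notion of non-redundancy (one entry per sequence that is 100% identical over its entire length, whatever the source database) amounts to grouping records by sequence. A minimal sketch follows; the use of an MD5 checksum as the grouping key is an assumption made for illustration, the records are invented, and `group_by_sequence` is a hypothetical helper name.

```python
import hashlib
from collections import defaultdict


def group_by_sequence(records):
    """Group (database, accession, sequence) records by exact sequence identity."""
    groups = defaultdict(list)
    for database, accession, sequence in records:
        # Checksum of the full-length sequence; case is normalised first.
        key = hashlib.md5(sequence.upper().encode("ascii")).hexdigest()
        groups[key].append((database, accession))
    return dict(groups)


# Invented records: two databases hold the same sequence, a third holds a longer one.
records = [
    ("DB1", "X1", "MKT"),
    ("DB2", "X2", "mkt"),
    ("DB1", "X3", "MKTA"),
]
groups = group_by_sequence(records)
for key, members in groups.items():
    print(key[:8], members)
```

Unlike the UniRef100 and Swiss-Prot definitions, a fragment here stays in its own group, because only full-length identity counts.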
Comparing searches: NCBI and UniProt
Search for the human Toll-like receptor 4 in Entrez Protein (NCBI)
Entries come from GenPept, Swiss-Prot and RefSeq
• Identical sequences: AAC34135, CAH72619
• Identical sequences: AAF05316, BAG55035, CAH72618, AAI17423, AAF89753, NP_612564, O00206
Search for the human Toll-like receptor 4 in UniProtKB
Swiss-Prot
Sequences retrieved in Entrez Protein: O00206, AAF05316, CAH72618, CAH72619, BAG55035, AAI17423, AAF89753, NP_612564*, AAC34135
*Based on A126770, BC117422, AL160272 and AA598398
Major protein sequence resources
PIR, PDB, PRF
UniProtKB: Swiss-Prot + TrEMBL (resources integrated in the entries)
Entrez Protein: Swiss-Prot + GenPept + PIR + PDB + PRF + RefSeq (resources integrated in the search engine)
UniProtKB/Swiss-Prot: manually annotated protein sequences (~12’000 species)
UniProtKB/TrEMBL: submitted CDS (EMBL); automated annotation (~202’000 species)
GenPept: submitted CDS (GenBank)
PIR: Protein Information Resource; archive since 2003; integrated into UniProtKB
PDB: Protein Data Bank; 3D data and associated sequences
PRF: journal scan of ‘published’ peptide sequences
RefSeq: Reference Sequence for DNA, RNA, protein + gene prediction + some manual annotation
Model Organism Databases
(MODs) at a glance
Model organism
Species extensively studied to understand particular biological phenomena,
with the expectation that discoveries made in the organism model will
provide insight into the workings of other organisms.
Model organisms and their MODs
Mus musculus: MGI http://www.informatics.jax.org/
Rattus norvegicus: RGD http://rgd.mcw.edu/
Oryza sativa: RAP-DB http://rapdb.dna.affrc.go.jp/
Arabidopsis thaliana: TAIR http://www.arabidopsis.org/
Drosophila melanogaster: FlyBase http://flybase.org/
Schizosaccharomyces pombe: S. pombe GeneDB http://www.genedb.org/genedb/pombe/
Saccharomyces cerevisiae: SGD http://www.yeastgenome.org/
Caenorhabditis elegans: WormBase http://www.wormbase.org/
Dictyostelium discoideum: dictyBase http://dictybase.org/
Bacillus subtilis: SubtiList http://genolist.pasteur.fr/SubtiList/
Escherichia coli: EcoGene http://ecogene.org/
Danio rerio (zebrafish): ZFIN http://zfin.org/
Just a few examples, not an exhaustive list!
Methanocaldococcus jannaschii -> no MOD
Model organism databases (MODs)
MODs do not necessarily store sequences, but give access to them
• Genome annotation;
• Gene models;
• Gene mapping;
• Official nomenclature;
• Gene expression;
• Functional annotation;
• Interactions;
• Information about mutants/knockout/transgenic animals;
• Phenotypes;
• (Cross-)references;
• Species-specific reagents…
Key resources for information on a given organism
Service provided to/from a given community
Link to cDNA sequences
http://gmod.org/wiki/Main_Page
The world of
databases is a
jungle
A few points to remember
when using databases
- Content ;
- Primary / secondary / meta-databases ;
- Curated / non-curated ;
- manual / automated curation ;
- Redundant / non-redundant.
- Update frequency;
- Stable identifiers ;
- Strategy ;
- Dataflow ;
- Collaborations between databases.
Test a few genomic databases
and tools
Genomes and genomic tools: a few sites
NCBI:
http://www.ncbi.nlm.nih.gov/sites/entrez?db=genome
EBI:
http://www.ebi.ac.uk/genomes/
TIGR:
http://cmr.jcvi.org/tigr-scripts/CMR/shared/Genomes.cgi
Genome annotation and analysis tools:
http://www.ensembl.org/index.html
http://vega.sanger.ac.uk/index.html
http://genome.ucsc.edu/ -> BLAT, Galaxy, Custom tracks, …
http://www.jgi.doe.gov/software/ -> Genome portal, Integrated
Microbial Genomes (IMG) and other tools
Generic Model Organism Database
http://gmod.org/wiki/Main_Page
Genomes and genomic tools:
Hands-on
Find your favorite (completely sequenced) organism in a genome db;
Follow the links to see the options on different sites;
Find the sequences;
Look at the annotation of your favorite gene;
Compare the entries corresponding to this gene across sites;
Test search engines (restrict searches, compare results, …)
Whenever possible use on-line tutorials, such as:
http://www.ensembl.org/info/website/tutorials/index.html
Visit GMOD, see the tools (http://gmod.org/wiki/GMOD_Components)
Play around with the BLAT search, customize display, follow the links, …
Genomes and genomic tools:
Hands-on
Go and visit databases cited in this lecture;
The databases/tools that should be "familiar" to all are:
http://genome.ucsc.edu/cgi-bin/hgBlat
http://www.ensembl.org/index.html
gene/genome databases/tools on http://www.ncbi.nlm.nih.gov/
If none of these databases is of interest to you, go to the NAR database collection (http://www.oxfordjournals.org/nar/database/a/) and find the databases that are closest to your interests;
Play around…
Hands on protein sequence databases and UniProt:
http://education.expasy.org/cours/HK09/Protein_database_TP.html
(corrections: http://education.expasy.org/cours/HK09/Protein_database_TP_correction.html)
Thank You !