UniProtKB/Swiss-Prot and ExPASy: Protein sequence databases and proteomics tools developed at the Swiss Institute of Bioinformatics
Andrea Auchincloss
Tunis, March 19, 2007
A. Auchincloss UniProtKB and ExPASy Tunis, March 2007
Outline
• The Swiss Institute of Bioinformatics
• What is UniProt?
• UniProt Knowledgebase: Swiss-Prot and TrEMBL
• HPI, post-translational modifications, HAMAP
• UniRef and UniParc
• Databases for protein function and domains: PROSITE, InterPro etc.
• ExPASy; other tools
Swiss Institute of Bioinformatics (SIB)
• Non-profit foundation created in 1998;
• Groups in Geneva, Lausanne and Basel;
• Federation of several groups (some of which existed and collaborated long before the foundation of the institute), about 170 researchers in 2006.
www.isb-sib.ch
SIB missions
• Development of databases and software tools;
• High-quality bioinformatics research program;
• Courses and seminars for the training of bioinformatics research scientists. This includes a master’s degree in proteomics and bioinformatics, several weekly courses and a doctoral school;
• Services to the Swiss Life Sciences community (EMBnet node).
Swiss Institute of Bioinformatics: 20 research and service groups
Proteins are organic compounds made of amino acids arranged in a linear chain and joined by peptide bonds… (Wikipedia)
Proteins are composed of 20 "standard" amino acids, symbolised by a LETTER.
Different ‘views’ of a protein
Proteins can also work together to perform a particular function, and they often associate to form complexes.
Proteins are essential parts of all living organisms and participate in every process within cells.
-> enzymes
-> structural or mechanical functions
-> important in cell signaling, immune response, cell adhesion, cell cycle, toxins…
Proteins are a necessary component in our diet, since animals cannot synthesize all the amino acids and must obtain essential amino acids from food.
Protein/Gene number
Organism          Number
Bacteria          182-8,591
S. cerevisiae     6,127
C. elegans        17,947
Drosophila        13,849
A. thaliana       ∼25,674
Human             ∼21,000
The universe in which protein databases evolve
1953: 1st sequence (bovine insulin)
1986: 4,000 sequences
2006: 3.5 million sequences
Where will it stop?
AMB, SP20
179,000,021,000
1st estimate: ~30 million species (1.5 million named)
2nd estimate:
20 million bacteria/archaea x 4,000 genes
5 million protists x 6,000 genes
3 million insects x 14,000 genes
1 million fungi x 6,000 genes
0.6 million plants x 20,000 genes
0.2 million molluscs, worms, arachnids, etc. x 20,000 genes
0.2 million vertebrates x 21,000 genes
The calculation:
$2\times10^{7}\times4000 + 5\times10^{6}\times6000 + 3\times10^{6}\times14000 + 10^{6}\times6000 + 6\times10^{5}\times20000 + 2\times10^{5}\times20000 + 2\times10^{5}\times21000 + 21000$ (you!)
Caveat: this is an estimate of the number of potential sequence entries, but not that of the number of distinct protein entities in the biosphere.
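The arithmetic of the second estimate can be checked with a few lines of Python. This is only a sketch that multiplies out the figures exactly as listed; the variable and function names are illustrative. With these figures the sum comes to 178,200,021,000, slightly below the 179,000,021,000 shown on the slide. The 0.8 billion difference may come from a term that was revised after the total was computed, but that is a guess.

```python
# (group, number of species, genes per species), copied from the second estimate.
ESTIMATE = [
    ("bacteria/archaea", 20_000_000, 4_000),
    ("protists", 5_000_000, 6_000),
    ("insects", 3_000_000, 14_000),
    ("fungi", 1_000_000, 6_000),
    ("plants", 600_000, 20_000),
    ("molluscs, worms, arachnids, etc.", 200_000, 20_000),
    ("vertebrates", 200_000, 21_000),
]
HUMAN_GENES = 21_000  # the final "+ 21000 (you!)" term


def potential_entries(groups, extra=0):
    """Sum species x genes over all groups, plus an optional extra term."""
    return sum(species * genes for _, species, genes in groups) + extra


total = potential_entries(ESTIMATE, HUMAN_GENES)
print(f"{total:,}")
```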
AMB, SP20
What sequencing is underway right now?
Many eukaryotic & bacterial genomes (varying sizes)
Metagenomics (environmental samples)
~ 6 million sequences submitted/published in December 2006
~ 17 million sequences being generated at the Venter Institute
6 million proteins are being submitted from the GOS (Global Ocean Sampling) trip
Protein sequences; what is sequenced?
Currently about 3.5 to 4.0 million ‘known’ protein sequences
More than 99% of these are derived by translation of nucleotide sequences
Less than 1%: direct protein sequencing (Edman, MS/MS…)
-> It is important that users know where the protein sequence comes from (sequence & gene prediction quality)!
Level of DNA/RNA sequence quality
- DNA/RNA sequencing quality (genome or WGS, cDNA or EST…)
- Gene prediction quality; programs used, is there manual intervention afterwards?
For example: authors can specify the nature of the CDS in the nucleotide databases by using qualifiers: "/evidence=experimental" or "/evidence=not_experimental".
Very rarely done…
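As an illustration of how a user could look for this qualifier, here is a minimal Python sketch. It assumes the CDS feature is available as plain text; the feature-table fragment, the gene name and the function name are invented for the example. Because the qualifier is rarely supplied, the function returns None when it is absent.

```python
import re

# Matches the two qualifier values quoted on the slide.
EVIDENCE_RE = re.compile(r"/evidence=(experimental|not_experimental)\b")


def cds_evidence(feature_text):
    """Return 'experimental', 'not_experimental', or None if no qualifier is given."""
    match = EVIDENCE_RE.search(feature_text)
    return match.group(1) if match else None


# Hypothetical feature-table fragment, for illustration only.
sample_feature = "\n".join([
    "FT   CDS             1..300",
    'FT                   /gene="exmA"',
    "FT                   /evidence=experimental",
])

print(cds_evidence(sample_feature))
print(cds_evidence("FT   CDS             1..300"))
```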
The hectic life of a sequence …
cDNAs, ESTs, genomes, …
-> Data not submitted to public databases, delayed or cancelled…
-> Public nucleic acid databases: EMBL, GenBank, DDBJ
-> Public protein sequence databases …if the submitters provide an annotated Coding Sequence (CDS)
CDS: CoDing Sequence
CDS provided by the submitters
The first Met!
CDS translation provided by EMBL
Data not submitted
Complete genome (submitted): only ~ 1,858 CDS available!
Issue for the users: the protein database jungle
The hectic life of a sequence …
cDNAs, ESTs, genomes, …
Data not submitted to public databases, delayed or cancelled…
Public nucleic acid databases: EMBL, GenBank, DDBJ
Sources of protein sequences: CoDing Sequences provided by submitters; sequences derived from scientific publications
Protein sequence databases: UniProtKB (TrEMBL; Swiss-Prot, manually annotated), PIR, GenPept, RefSeq*, EnsEMBL*, IPI, CCDS, UniParc, PRF, PDB + species-specific databases (EcoGene, TubercuList, TIGR…)
* Also gene prediction
Major public protein sequence database ‘sources’
Diagram: PIR, PDB, PRF; UniProtKB: Swiss-Prot + TrEMBL; NCBI-nr: Swiss-Prot + GenPept + PIR + PDB + PRF + RefSeq (labels: integrated resources, ‘cross-references’; separated resources)
• UniProtKB/Swiss-Prot: manually annotated protein sequences (11,000 species)
• UniProtKB/TrEMBL: submitted CDS (EMBL) + automated annotation; non redundant with Swiss-Prot (127,000 species)
• GenPept: submitted CDS (GenBank); redundant with UniProtKB (about 130,000 species)
• PIR: Protein Information Resource; archive since 2003; integrated into UniProtKB
• PDB: Protein Databank: 3D data and associated sequences
• PRF: journal scan of ‘published’ peptide sequences
• RefSeq: Reference Sequence for DNA, RNA, protein + gene prediction (4,000 species)
Other protein sequence databases
• CCDS: EBI + NCBI + Wellcome Trust Sanger + UC Santa Cruz (2 species). Consensus human and mouse sequences between 4 institutions… Combining different approaches (ab initio, by similarity) and taking advantage of the expertise acquired by different institutes, including manual annotation…
• EnsEMBL: UniProtKB + RefSeq + gene prediction (31 species). Aligns some eukaryotic genomic sequences with all the sequences found in EMBL, UniProtKB/Swiss-Prot, RefSeq and UniProtKB/TrEMBL (→ known genes). Also does some gene prediction (→ novel genes)
• IPI: UniProtKB + RefSeq + EnsEMBL + (H-InvDB, TAIR, VEGA) (7 species). Provides a guide to the main databases that describe the human, mouse, rat, zebrafish, Arabidopsis, chicken, and cow proteomes.
The UniProt consortium
• European Bioinformatics Institute (European Molecular Biology Laboratory)
• Swiss Institute of Bioinformatics
• Protein Information Resource
The UniProt Consortium
UniProt (Universal Protein Resource): the world's most comprehensive catalogue of protein information
www.uniprot.org, Wu et al. Nucleic Acids Res. 34:D187-191(2006).
Provides 3 databases:
- UniProtKB (Swiss-Prot + TrEMBL)
- UniRef
- UniParc
and soon UniMES (for Metagenomic and Environmental Sequences)
The Universal Protein resource components
UniProtKB Release 9.7 consists of:
• UniProtKB/TrEMBL: computer-annotated protein sequences; 3’600’000 entries; ~100’000 species
• UniProtKB/Swiss-Prot: manually annotated protein sequences; 260’000 entries; ~10’000 species
Produced by SIB and EBI
UniRef100, UniRef90, UniRef50
• One UniRef100 entry = all identical sequences (including fragments).
• One UniRef90 entry = sequences that have at least 90% identity.
• One UniRef50 entry = sequences that have at least 50% identity.
Independent of species.
Allows comprehensive BLAST similarity searches by providing sets of representative sequences.
Produced by PIR
UniProt Archives: ~8’000’000 entries
Archived raw protein sequences, found in publicly accessible databases: Swiss-Prot, TrEMBL, PIR, EMBL, Ensembl, IPI, PDB, RefSeq, FlyBase, WormBase, Patent Offices.
Use with extreme caution: contains pseudogenes, incorrect CDS predictions, etc…
Produced by EBI
The Universal Protein resource components
• UniProtKB/TrEMBL: computer-annotated protein sequences; 3,900,000 entries; ~127,000 species
• UniProtKB/Swiss-Prot: manually annotated protein sequences; 260,000 entries; ~11,000 species
Produced by SIB and EBI
UniRef100, UniRef90, UniRef50
• One UniRef100 entry = all identical sequences (including fragments).
• One UniRef90 entry = sequences that have at least 90% identity.
• One UniRef50 entry = sequences that have at least 50% identity.
Independent of species.
Allows comprehensive BLAST similarity searches by providing sets of representative sequences.
Produced by PIR
UniProt Archives: ~8,800,000 entries
Archived raw protein sequences, found in publicly accessible databases: Swiss-Prot, TrEMBL, PIR, EMBL, Ensembl, IPI, PDB, RefSeq, FlyBase, WormBase, Patent Offices.
Use with extreme caution: contains pseudogenes, incorrect CDS predictions, etc…
Produced by EBI
UniProt web sites…
http://www.expasy.org/sprot/
http://www.pir.uniprot.org/
http://www.ebi.ac.uk/uniprot/
http://www.uniprot.org/
Soon, a new unified web site, with a very powerful search engine…
http://beta.uniprot.org/
Test it! Logon: guest Password: amazing
The UniProt groups from SIB, EBI and PIR
(Antibes, September 2004)
In Geneva (SIB):
2 Group Leaders
44 Annotators
4 Prosite annotators
22 Programmers and Researchers
5 Administrators, science communicators
3 System Administrators
4 Students
1 GISAID
-----------------
85 people
At EBI (Swiss-Prot + EMBL + TrEMBL):
75 people (29 Annotators)
At PIR:
1 Group Leader
13 Protein Science Team
12 Informatics Team
-----------------
26 people
UniProtKB has biweekly releases; available from about 100 servers, the main sources being ExPASy and www.uniprot.org
UniProtKB From EMBL (DNA) to TrEMBL (protein)
EMBL entry (gene/protein name, taxonomy, reference, CDS) -> TrEMBL
Automated extraction of the protein sequence (CDS), gene name, taxonomy and references.
Automated annotation (keywords and protein family).
! TrEMBL does not translate DNA sequences, nor does it use gene prediction programs: it only takes the existing CDS proposed by the submitting authors in the EMBL/GenBank/DDBJ entry.
In particular, the proposed CDS and derived protein sequences can be experimentally proven or derived from gene prediction programs (this is not obvious from the TrEMBL entry).
TrEMBL does not validate any sequences!
The quality of UniProtKB/TrEMBL data is directly dependent on the information provided by the submitter of the original nucleotide entry.
UniProtKB From TrEMBL to Swiss-Prot
EMBL (CDS) -> TrEMBL: automated extraction of the protein sequence (CDS), gene name and references. Automated annotation.
TrEMBL -> Swiss-Prot: manual annotation of the sequence and associated biological information (derived from literature, external experts, databases…). Annotation of sequence differences (conflicts, variants, splicing…).
Average of 6 independent sequence reports for each human protein
Distinguishing Swiss-Prot and TrEMBL
– A TrEMBL entry is a computer-annotated record derived from a coding sequence (CDS) in the nucleotide sequence databases that is not in Swiss-Prot; it is produced after some redundancy removal and automated annotation.
– A Swiss-Prot entry is a manually annotated record for a given protein.
UniProtKB
From TrEMBL to Swiss-Prot Step 1: Sequence check
UniProtKB/Swiss-Prot
Non-redundant: 1 entry -> 1 gene (1 species)
i) Merge all known protein sequences (CDS and amino acid) derived from the same gene -> decreases redundancy and improves sequence reliability
ii) Annotation of the sequence differences (including conflicts, polymorphisms, splice variants, etc.) -> annotation of protein diversity
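To make the idea of annotating sequence differences concrete, here is a toy Python sketch. It is not the Swiss-Prot procedure, which relies on curated alignments and manual review. It only compares several reports of the same protein position by position, assuming they are already aligned and of equal length. The sequences and report names are invented.

```python
def sequence_differences(reference, report):
    """Return (1-based position, reference residue, report residue) for each mismatch."""
    if len(reference) != len(report):
        raise ValueError("this sketch only compares sequences of equal length")
    return [
        (pos, ref_aa, rep_aa)
        for pos, (ref_aa, rep_aa) in enumerate(zip(reference, report), start=1)
        if ref_aa != rep_aa
    ]


# Hypothetical sequence reports for one gene.
reference = "MKTAYIAKQR"
reports = {"report_A": "MKTAYIAKQR", "report_B": "MKTSYIAKQR"}

for name, seq in reports.items():
    print(name, sequence_differences(reference, seq))
```

Each tuple in the result corresponds to one candidate conflict or variant that an annotator would then have to evaluate.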
Redundancy…
UniProtKB/Swiss-Prot ~11,000 species
UniProtKB/TrEMBL ~127,000 species
260,000 + 3,800,000
3,600,000
Redundancy in TrEMBL & redundancy between TrEMBL and Swiss-Prot
In the future, redundancy is going to decrease: "new" genome sequencing → "new" proteins
- 13 sequences (complete or partial)
- derived from mRNA (n=6) or genomic DNA (n=7)
All alternatively spliced sequences are available for BLAST searches and protein identification tools, and are downloadable…
Human: ~2/3 of the human genes are alternatively spliced
- 6 genomic sequences (complete or partial)
- 1 protein sequence from PIR
Multiple alignment of the available clpB sequences
Within Swiss-Prot?
• A snapshot of the situation (December 2006):
– 28,200 entries with 82,000 sequence conflicts;
– 2,600 entries with corrected frameshifts;
– 15,100 entries with corrected initiation sites;
– 4,300 entries with other sequence ‘problems’.
• At least 43,000 entries (19% of Swiss-Prot) required a minimal amount of annotation effort to obtain the “correct” sequence.
Quality of protein information from genome projects
• Proteins originating from different genome projects:
– Drosophila: what a curated (thanks to FlyBase) genome effort should look like: only 1.8% of the gene models conflict with what we have in UniProtKB/Swiss-Prot;
– Arabidopsis: a genome where lots of work was done to annotate it when it was sequenced, but where nothing has been done since (at least in the public view): 19.5% of the gene models are erroneous;
– Tetraodon nigroviridis: a quick and dirty automatic run through a genome with no manual intervention: >90% of the gene models produce incorrect proteins.
– Bacteria and Archaea have almost no splicing, so prediction is “easier”; however, errors are still made…
• Producing a clean set of sequences is not a trivial task;
• It is not getting easier as more and more types of sequence data are submitted;
• It is important to pursue our efforts in making sure we provide to our users the most correct set of sequences for a given organism.
New ‘Protein existence evidence’ tag
As most protein sequences are derived from translation of nucleotide sequences and are only predictions, the new PE line indicates whether there is any evidence that proves the existence of a protein.
The ‘Protein existence evidence’ will have 5 different qualifiers:
1. Evidence at protein level
2. Evidence at transcript level
3. Inferred from homology
4. Predicted
5. Unassigned (used mostly in TrEMBL)
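A short Python sketch of how a program might use these qualifiers follows. The slide does not give the exact layout of the PE line, so the format `PE   1: Evidence at protein level;` is an assumption here. So is treating "Unassigned" as level 5, as the fifth item in the list. The function names are illustrative.

```python
# Qualifiers as listed on the slide; "Unassigned" is taken as level 5 (assumption).
PROTEIN_EXISTENCE = {
    1: "Evidence at protein level",
    2: "Evidence at transcript level",
    3: "Inferred from homology",
    4: "Predicted",
    5: "Unassigned",
}


def parse_pe_line(line):
    """Parse a line assumed to look like 'PE   1: Evidence at protein level;'."""
    code, _, rest = line.partition("   ")
    if code != "PE":
        raise ValueError("not a PE line")
    level = int(rest.strip().partition(":")[0])
    return level, PROTEIN_EXISTENCE[level]


def has_protein_level_evidence(level):
    """Only level 1 means the protein itself has been observed."""
    return level == 1


level, label = parse_pe_line("PE   1: Evidence at protein level;")
print(level, label, has_protein_level_evidence(level))
```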
Righting the wrongs
“Sequences are rarely deposited in a “mature” state; as with all scientific research, DNA and protein annotation is a continual process of learning, revision and corrections.”
“Sequencing error rates: ~1 base in 10’000”
“Making people aware of errors is good and great; making people aware that they’re responsible also for correcting errors is even greater”
C. Hadley, EMBO reports, 4(9), 2003.
UniProtKB
From TrEMBL to Swiss-Prot Step 2: Annotation:
literature, controlled vocabulary
Annotation
• The focal point of the efforts to maintain and develop UniProtKB/Swiss-Prot;
• It is becoming more and more important as it:
– provides a summary of what is known about a protein;
– creates templates for automatic annotation for the many organisms whose genome sequence is/will be available but whose proteins will not be characterized;
– provides well-annotated entries (a corpus) to train literature mining tools (text mining).
Source of data
- publications (> 1,700 journals cited)
- also external scientific expertise & other databases
Comments: “structured free text”, 27 defined topics
Manually annotated
Information from papers, specialized databases, computer prediction, external experts, brainstorming
Distinction between data obtained experimentally and computerized inferences
UniProtKB
From TrEMBL to Swiss-Prot Step 3: Sequence analysis (bioinformatics tools)
The annotation platform
Annotators could not work without the help of our software developers.
Anabelle: much more than a domain annotation platform
We manually check the results !
What else is in a UniProtKB/Swiss-Prot entry?
Cross-references; a central hub
Gasteiger E. et al, Curr. Issues Mol. Biol. 3:47-55(2001) www.expasy.org/cgi-bin/lists?dbxref.txt
• Swiss-Prot was the first database with X-references;
• Explicitly X-referenced to 85 databases:
– DNA (EMBL/GenBank/DDBJ)
– 3D-structure (PDB)
– Family and domain (InterPro, HAMAP, PROSITE, Pfam, etc.)
– genomic (OMIM, MGI, FlyBase, SGD, SubtiList, etc.)
– 2D-gel (e.g. SWISS-2DPAGE)
– specialized db (e.g. GlycoSuiteDB, PhosSite, MEROPS)
– literature (PubMed)
• Each UniProtKB/Swiss-Prot entry can be seen as a central hub for the data available about the protein it describes
UniProtKB/Swiss-Prot explicit links
Organism-specific databases: AGD, CYGD, DictyBase, EchoBASE, EcoGene, euHCVdb, FlyBase, GeneDB_Spombe, GeneFarm, Gramene, H-InvDB, HGNC, HIV, HPA, LegioList, Leproma, ListiList, MaizeGDB, MGI, MIM, MypuList, PhotoList, RGD, SagaList, SGD, StyGene, SubtiList, TAIR, TubercuList, WormBase, WormPep, ZFIN
Genome annotation databases: Ensembl, GenomeReviews, KEGG, TIGR
Sequence databases: EMBL, PIR, UniGene
3D structure databases: HSSP, PDB, SMR
PTM databases: GlycoSuiteDB, PhosSite
Enzyme and pathway databases: BioCyc, Reactome
Miscellaneous: ArrayExpress, dbSNP, DIP, DrugBank, GO, IntAct, LinkHub, RZPD-ProtExp
Family and domain databases: Gene3D, HAMAP, InterPro, PANTHER, PIRSF, Pfam, PRINTS, ProDom, PROSITE, SMART, TIGRFAMs
2D-gel databases: ANU-2DPAGE, Aarhus/Ghent-2DPAGE, COMPLUYEAST-2DPAGE, Cornea-2DPAGE, DOSAC-COBS-2DPAGE, ECO2DBASE, HSC-2DPAGE, OGP, PHCI-2DPAGE, PMMA-2DPAGE, Rat-heart-2DPAGE, REPRODUCTION-2DPAGE, Siena-2DPAGE, SWISS-2DPAGE
Protein family/group databases: GermOnline, MEROPS, PeroxiBase, PptaseDB, REBASE, TRANSFAC
Implicit cross-references on new web server and ExPASy
Implicit X-references to 26 additional db are added by the ExPASy server on the www (e.g. GeneCards, ModBase, etc.).
These X-refs are not present as hard-coded DR lines in the Swiss-Prot entry as it can be downloaded by ftp, but are added on the fly when someone views an entry on ExPASy.
This can be done because enough information is present in the UniProtKB entry to access the related information in another db.
Example: all Swiss-Prot/TrEMBL entries are linked to the BLOCKS domain db via the Swiss-Prot/TrEMBL accession number.
Keyword definition and usage in Swiss-Prot
Linked to Gene Ontology to further facilitate information retrieval via controlled vocabularies
In a UniProtKB/Swiss-Prot entry, you can expect to find:
• All the names of a given protein (and of its gene);
• Its biological origin with links to the taxonomic databases;
• A selection of references;
• A summary of what is known about the protein: function, alternative products, PTM, tissue expression, disease, 3D structures, etc.…;
• Numerous cross-references;
• Selected keywords;
• A description of important sequence features: domains, PTMs, variations, etc.;
• A (often corrected) protein sequence and the description of various isoforms/variants.
Monitoring entry history: The UniProtKB Sequence/Annotation Version archive
… and many useful links:
And on the new website
other tools are not yet available…
UniProt Knowledgebase
• Swiss-Prot: Manually annotated section • TrEMBL: Automatically annotated section
Distinguishing Swiss-Prot and TrEMBL
Accession number: to be used when you cite a UniProt entry anywhere (never cite the entry name (ID) alone)
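A small Python sketch can help tell the two kinds of identifier apart before citing one. The accession pattern below is the regular expression published in current UniProt documentation, which is an assumption relative to these 2007 slides. The entry-name pattern is a deliberately loose guess (an uppercase alphanumeric prefix, an underscore, a species code of at most five characters). The function name is illustrative.

```python
import re

# Accession pattern as given in current UniProt documentation (assumption for 2007 data).
ACCESSION_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"
)
# Loose, illustrative pattern for entry names such as CYC_HUMAN.
ENTRY_NAME_RE = re.compile(r"[A-Z0-9]{1,10}_[A-Z0-9]{1,5}")


def classify_identifier(identifier):
    """Return 'accession', 'entry name' or 'unknown' for a UniProtKB identifier."""
    if ACCESSION_RE.fullmatch(identifier):
        return "accession"
    if ENTRY_NAME_RE.fullmatch(identifier):
        return "entry name"
    return "unknown"


for ident in ("P99999", "CYC_HUMAN", "hello"):
    print(ident, "->", classify_identifier(ident))
```

Only identifiers classified as accessions should be used in citations; an entry name on its own would be flagged for replacement.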
Non-Redundant Complete Proteome Sets
• Text search UniProtKB keyword “Complete proteome”, combined with an organism name
• Or download precomputed sets (bacteria, archaea, some eukaryotes): ftp://ftp.expasy.org/databases/complete_proteomes/entries
• Or EBI Integr8: http://www.ebi.ac.uk/integr8/
Swiss-Prot annotation priorities
The main annotation programs:
• HAMAP (High quality Automated and Manual Annotation of microbial Proteomes; bacteria, archaea, plastids);
• HPI (Human Proteomics Initiative);
• PPAP (Plant Proteome Annotation Project);
• FPAP (Fungal Proteome Annotation Project);
• Viral proteins;
• Tox-Prot (Toxin Annotation Project);
• ENZYMES (proteins with EC numbers);
• PTMs
• 3D-structure
• Protein-protein interactions
• Quality assurance, includes controlled vocabularies
Model organisms
• Organisms for which we want to have a more in-depth coverage;
• Completeness, links with specialized databases, specific documents;
• Examples: fruitfly, A. thaliana, E. coli, B. subtilis, human, mouse, C. elegans, yeast, S. pombe.
Human Proteomics Initiative (HPI)
From genome to proteome
~ 21,000 human genes
-> alternative splicing of mRNA (2-5 fold increase)
~ 100,000 human transcripts
-> post-translational modifications of proteins (PTMs) (5-10 fold increase)
~ 1,000,000 human proteins
Considerable increase in complexity
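The orders of magnitude can be reproduced with a short Python sketch. It simply applies the fold increases quoted above to the rounded counts; the function name is illustrative and the results are ranges, not measurements.

```python
def fold_range(count, low_fold, high_fold):
    """Return the (low, high) range obtained by applying a fold increase to a count."""
    return count * low_fold, count * high_fold


genes = 21_000
transcripts = fold_range(genes, 2, 5)    # alternative splicing: 2-5 fold
proteins = fold_range(100_000, 5, 10)    # PTMs applied to ~100,000 transcripts: 5-10 fold

print("transcripts:", transcripts)
print("proteins:", proteins)
```

The upper bounds of both ranges (105,000 and 1,000,000) correspond to the rounded figures of ~100,000 transcripts and ~1,000,000 proteins.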
In the case of human genes, the Swiss-Prot/TrEMBL redundancy is still very high:
15,803 + 53,100
about 20,000*
* human gene number estimation: 21,000-35,000
MS proteomics has verified more than 10% of human gene products, but has not identified significant numbers of unpredicted proteins
What is missing:
• Sequences not submitted to EMBL/GenBank/DDBJ (and PIR)
• Not yet predicted or known genes ("no CDS provided by the submitters" or no DNA sequence)
• Confidential data (patent application sequences)
• Immunoglobulins, T-cell receptors (-> UniParc)
• …
Post-translational modifications (PTMs)
PTM definition
A post-translational modification (PTM) is a modification of a polypeptide chain, involving the making or the breaking of covalent bond(s), that occurs during (co-translational class) or after translation.
PTMs influence or even define protein function
• Phosphorylation and possibly GlcNAcylation and S-nitrosylation are a means of transducing extracellular signals to the inside of the cells.
• Methylation has a role in nuclear protein import.
• Lipid addition allows protein association with membranes (e.g. GPI anchor, myristate, palmitate).
• Intrachain disulfide bonds and N-glycosylation influence protein folding.
• Interchain disulfide bonds bind subunits together.
• Other PTMs are directly involved in the protein function, for example the binding of cofactors (e.g. pyridoxal phosphate), or the synthesis of a cofactor by the modification of amino acids present in the protein (e.g. quinones).
PTM variety
[Figure: PTM types mapped onto the amino acids (Gly, Ala, Val, Leu, Ile, Lys, Arg, His, Asp, Glu, Asn, Gln, Cys, Ser, Thr, Met, Pro, Phe, Tyr, Trp), grouped as N-terminal, side-chain and C-terminal modifications]
Modifications shown: acetylation, methylation, acylation, phosphorylation, oxidation, crosslinks, GPI, amidation, hydroxylation, cofactor binding, sulfation, C-linked sugar, N-linked sugar, O-linked sugar, S-linked sugar
Legend: in black: cytoplasmic modifications; in dark grey: both cytoplasmic and extracellular modifications, depending on the exact type; in light grey: extracellular modifications
Large scale experiments (LSE) for PTMs!
• PTM information can now be obtained from results of proteomics large scale experiments (LSE);
• In the past 12 months we have added about 6’000 experimental PTMs using data originating from some of these projects.
AMB, SP20
Proteomic studies have led to the updating of 2767 human Swiss-Prot entries, mainly with PTM information (UniProt release 10.0, March 2007):
Phosphorylation (83%)
Glycosylation (9%)
Subcellular location (4%)
Other PTMs (4%)
Bacteria and Archaea (HAMAP)
In 2006, ≈130 new bacterial and archaeal genomes (not WGS) were submitted to the DNA databases.
If on "average" 4,000 proteins/genome => 500,000 proteins!
How to cope????
High quality Automated and Manual Annotation of microbial Proteomes (HAMAP)
Lots of microbial genomes, lots of proteins. What should we do with them in UniProt?
http://www.expasy.org/unirule/MF_00319
Automatic annotation of proteins belonging to specified families
• This program requires the continuous development and adaptation of software tools as well as the development of a database of annotation rules for each family (so far about 1,400).
• Allows us to annotate automatically, yet with a very high level of quality, proteins that belong to well-defined protein families;
• Can be applied to both characterized proteins and to some UPFs (Uncharacterized Protein Families);
• The families are based on UniProtKB/Swiss-Prot entries, so we first do all the annotation steps described earlier!
www.expasy.ch/sprot/hamap/
Using HAMAP, we can currently annotate to Swiss-Prot quality level between 10% and 50% of a complete microbial proteome (next step: HAMAP for Fungi…)
Updates
• DNA sequence archives
– EMBL/GenBank/DDBJ is an archive
• All submitted data goes into the archive
• Submitters are responsible for the submitted sequences and the accompanying annotation
• Nobody else can change them (including the curators at EMBL/GenBank/DDBJ)
• Protein sequence databases
– UniProtKB/Swiss-Prot is NOT an archive
• Swiss-Prot chooses what goes into the database and where to place it
• Swiss-Prot updates annotation and sequences when necessary
**ZB SYP, 28-NOV-2003; ALB, 16-NOV-2004; MIM, 31-Jan-2006;
**ZB BER, 13-FEB-2006; LYG, 14-JUN-2006; LYG, 21-SEP-2006;
**ZB CHH, 05-DEC-2006;
User updates or annotation requests
Accessing & Searching UniProtKB
Direct access (keyword search)
• New search tool – we’ll use it later
• Sequence Retrieval System (SRS, Europe), will disappear
• Entrez (NCBI, USA) – UniProtKB/Swiss-Prot (not TrEMBL) is integrated in GenPept, but with a changed format, and some information (e.g. implicit cross-references) is missing
• Query tools on ExPASy & UniProt (http://www.expasy.org/sprot/, http://www.uniprot.org)
Indirect access (sequence search)
• Bioinformatics & sequence analysis tools (Blast, Fasta, GCG, Emboss, MS Identification tools…)
Downloading the UniProt Knowledgebase
http://www.expasy.org/sprot/download.html
• Swiss-Prot and TrEMBL form a complete, non-redundant database, the UniProt Knowledgebase
• Can be downloaded from ftp://ftp.expasy.org/databases/uniprot/current_release/knowledgebase
• In “Swiss-Prot” format, fasta or xml format
• Complemented by sequences of alternative splice isoforms
• “everything” about “all” proteins! (at least all CDS submitted to the public nucleotide sequence databases)
If you want to develop tools to work with your local copy of UniProtKB:
Swissknife – a Perl parser for UniProtKB
• Constantly updated according to latest format changes
• Advantage: you do not need to know how exactly the information is stored in the flat file
http://swissknife.sourceforge.net/
ftp://ftp.ebi.ac.uk/pub/software/swissprot/Swissknife/
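For readers who want a feel for what such a parser deals with, here is a minimal Python sketch of reading the flat file. It is not Swissknife and does not reproduce its API. It only assumes the general layout of the “Swiss-Prot” format: a two-letter line code, data starting at column 6, sequence lines that start with spaces, and `//` ending each entry. The sample entry, its accession and its cross-reference are invented and heavily abridged.

```python
def parse_flat_file(text):
    """Collect entry name, accessions, cross-referenced databases and sequence per entry."""
    entries, current = [], None
    for line in text.splitlines():
        if line.startswith("//"):          # end of entry
            if current is not None:
                entries.append(current)
            current = None
            continue
        if not line.strip():
            continue
        if current is None:
            current = {"id": None, "accessions": [], "xrefs": [], "sequence": ""}
        code, data = line[:2], line[5:].strip()
        if code == "ID":
            current["id"] = data.split()[0]
        elif code == "AC":
            current["accessions"] += [ac.strip() for ac in data.split(";") if ac.strip()]
        elif code == "DR":                 # keep only the database name
            current["xrefs"].append(data.split(";")[0].strip())
        elif code == "  ":                 # sequence lines have no line code
            current["sequence"] += data.replace(" ", "")
    return entries


# Hypothetical, abridged entry for illustration only.
SAMPLE = "\n".join([
    "ID   EXMP_HUMAN              Reviewed;          10 AA.",
    "AC   P00000;",
    "DE   Hypothetical example protein.",
    "DR   EMBL; X00000; CAA00000.1; -; mRNA.",
    "SQ   SEQUENCE   10 AA;  1111 MW;  0000000000000000 CRC64;",
    "     MKTAYIAKQR",
    "//",
])

entries = parse_flat_file(SAMPLE)
print(entries[0])
```

A real parser has to handle many more line types and format changes, which is the advantage the slide attributes to a maintained library.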
Take home message
• Swiss-Prot is the non-redundant, manually annotated and highly cross-referenced section of the UniProt Knowledgebase
• Be aware of the differences between UniProtKB/TrEMBL and UniProtKB/Swiss-Prot
– Computer vs. Human
– Redundant vs. Non-redundant
• Always cite the accession number (AC), not the entry name
– The AC is stable
– The entry name might change
We need your feedback and your expertise!
http://www.expasy.org/sprot/update.html
(and from every UniProtKB entry page on our servers)
The UniProt Consortium
UniProt (Universal Protein Resource): the world's most comprehensive catalogue of protein information
www.uniprot.org, Wu et al. Nucleic Acids Res. 34:D187-191(2006).
Provides 3 databases:
- UniProtKB (Swiss-Prot + TrEMBL)
- UniRef
- UniParc
and soon UniMES (for Metagenomic and Environmental Sequences)
UniRef100, 90 and 50 clusters
One UniRef100 entry -> all identical sequences from UniProtKB and some sections of UniParc (including fragments, Swiss-Prot splice variants).
One UniRef90 entry -> sequences that are at least 90% identical.
One UniRef50 entry -> sequences that are at least 50% identical.
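The effect of the three identity thresholds can be illustrated with a toy Python sketch. This is not the procedure UniRef actually uses (the slides do not describe it): it assumes the sequences are already aligned to the same length, measures identity position by position, and assigns each sequence greedily to the first cluster whose representative is similar enough. The sequences are made up for the example.

```python
def identity(a, b):
    """Fraction of identical positions between two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, threshold):
    """Group sequences; the first member of each cluster acts as its representative."""
    clusters = []
    for s in seqs:
        for cluster in clusters:
            if identity(cluster[0], s) >= threshold:
                cluster.append(s)
                break
        else:
            # No representative is similar enough: start a new cluster.
            clusters.append([s])
    return clusters


# Made-up sequences: the 2nd is 90% and the 3rd 50% identical to the 1st.
SEQS = ["MKTAYIAKQR", "MKTAYIAKQK", "MKTAYLLLLL", "GGGGGGGGGG"]

for level in (1.0, 0.9, 0.5):
    print(f"threshold {level:.0%}: {len(greedy_cluster(SEQS, level))} clusters")
```

Lowering the threshold merges more sequences into fewer clusters, which is why the 50% set is the smallest of the three.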
UniRef100, 90 and 50 clusters
• One cluster can contain sequences of several species; clustering is done independently of the organism
• Each cluster has a “representative” or “reference” sequence, preferably that of the best-annotated Swiss-Prot entry
• UniRef identifiers are of the form UniRef100_P99999, UniRef50_P00414 – they are not stable, as clusters are recomputed with every biweekly release and cluster representatives can change!
UniRef is useful for comprehensive BLAST sequence searches by providing sets of representative sequences.
Implicit cross-link from UniProtKB to UniRef: new web view
UniParc – the UniProt Archive
• 8.8 million sequences • Sequences and cross-references (AC numbers) • A comprehensive collection of the raw protein sequences in public databases (including those not submitted to the DNA databases): Swiss-Prot, TrEMBL, PIR, EMBL, Ensembl, IPI, PDB, RefSeq, FlyBase, WormBase, Patent Offices.
• UniParc can be used to track sequence versions
Use with extreme caution: it also contains pseudogenes, incorrect CDS predictions, etc., and is highly redundant!
UniParc tracks a protein sequence and its integration in various databases
http://www.pir.uniprot.org/cgi-bin/textSearch_AR
Patent data
UniParc entry UPI0000033477 part 2
TrEMBL entry probably to be merged into Swiss-Prot
TrEMBL entry was merged into Swiss-Prot
www.expasy.ch/prosite
PROSITE
A database of protein families and domains using two kinds of motif descriptors:
Patterns or regular expressions:
• User friendly (easy to understand and to use)
• Well designed for the detection of biologically meaningful sites such as residues playing a structural or functional role
• Can be used to scan a protein database in reasonable time on any computer
Generalized profiles or weight matrices:
• Well adapted to cover the full length of the protein or domain
• Able to detect highly divergent families or domains with only a few well conserved positions
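Because patterns are essentially regular expressions, they translate almost directly into the regex syntax of a programming language. The Python sketch below handles only the core of the PROSITE pattern syntax: elements separated by "-", "x" for any residue, "[...]" for allowed residues, "{...}" for excluded residues, "(n)" or "(n,m)" for repeats, and "<" / ">" anchors at the ends of the pattern. It is an illustration, not a replacement for ScanProsite; the function name and the test sequence are assumptions made for this example. The pattern used is the N-glycosylation site pattern, N-{P}-[ST]-{P}.

```python
import re

# One pattern element: a residue, x, [allowed] or {excluded}, plus an optional repeat.
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern into a Python regular expression string."""
    pattern = pattern.strip().rstrip(".")
    prefix = suffix = ""
    if pattern.startswith("<"):   # pattern anchored at the N-terminus
        prefix, pattern = "^", pattern[1:]
    if pattern.endswith(">"):     # pattern anchored at the C-terminus
        suffix, pattern = "$", pattern[:-1]
    parts = []
    for element in pattern.split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            core = "."                          # any amino acid
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # any amino acid except these
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return prefix + "".join(parts) + suffix


regex = prosite_to_regex("N-{P}-[ST]-{P}.")
print(regex)

# Made-up sequence: "NGSA" fits the pattern, "NPSP" does not (P after N).
hits = [(m.start() + 1, m.group()) for m in re.finditer(regex, "MKNGSAANPSP")]
print(hits)
```

Note that `re.finditer` reports non-overlapping matches only, so overlapping sites would need extra handling.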
Identification of protein domains and families
• There are two non-exclusive approaches for the determination of the function of an uncharacterized protein:
– Comparison with a complete sequence database (BLAST)
– Scanning a database of patterns and profiles
• Most proteins can be grouped into families. Proteins belonging to a particular family share functional attributes and are derived from a common ancestor;
• Some regions in the sequence are more conserved than others during evolution because they are important for the function or the structure of the protein;
• Like fingerprints for police identification, signatures built out of sequence patterns or profiles can be used to formulate hypotheses about the function of uncharacterized proteins.
Definitions of conserved regions
Conserved regions can be classified into 5 different groups:
• Families: proteins that have the same domain arrangement, whether one or many domains.
• Domains: specific combinations of secondary structures that assume characteristic three-dimensional structures or folds.
• Repeats: structural units always found in two or more copies that assemble into a specific fold. Assemblies of repeats might also be thought of as domains.
• Motifs: short regions with conserved active or binding sites that usually adopt a folded conformation only in association with their ligands.
• Sites: functional residues (active sites, disulfide bridges, post-translationally modified residues)
Conserved regions (2)
PPID family: 1 CSA_PPIASE domain + 3 TPR repeats
CSA_PPIASE
Cys 181: active site residue
Binding cleft (motif)
http://www.expasy.org/tools/scanprosite/
Functionally and structurally relevant residues in PROSITE motif descriptors
A new concept to extract more information from profiles
Principle:
• Combining the advantages of profiles (high sensitivity) and patterns (position-specific information)
• Tagging of amino acids at precise positions in the profile and checking their presence in the matched sequence
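The tagging principle can be sketched in a few lines of Python. This is a simplified illustration, not how ProRule is implemented: it assumes a gap-free profile match, so that profile position p maps to sequence position match_start + p, whereas a real profile alignment may contain insertions and deletions. The function name, the rule contents and the sequences are hypothetical.

```python
def check_tagged_residues(sequence, match_start, tags):
    """Check tagged profile positions against a matched sequence.

    sequence    -- protein sequence (one-letter code)
    match_start -- 0-based index in the sequence where the profile match begins
    tags        -- {profile position (1-based): (allowed residues, annotation)}
    Returns (annotation, 1-based sequence position, residue found, condition met).
    """
    results = []
    for position, (allowed, annotation) in sorted(tags.items()):
        index = match_start + position - 1
        residue = sequence[index] if 0 <= index < len(sequence) else None
        met = residue is not None and residue in allowed
        results.append((annotation, index + 1, residue, met))
    return results


# Hypothetical rule: profile position 2 must be Cys, position 5 must be His.
TAGS = {2: ("C", "active site"), 5: ("H", "metal binding")}

for seq in ("AAACGGHKK", "AAASGGHKK"):
    print(seq, check_tagged_residues(seq, 2, TAGS))
```

Only when the tagged residue is present would the corresponding annotation (here an active site or a metal-binding site) be proposed for the matched protein; in the second sequence the Cys is replaced by Ser, so the active-site condition fails.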
ProRule
Aim: • Provide users with biologically meaningful functional and structural information: active sites, post-translational modification sites, binding sites, disulfide bonds, transmembrane regions.
• Help the UniProtKB/Swiss-Prot annotation and provide enhanced homogeneity: domain name and boundaries, keywords and linked GO terms, EC numbers, false negative PROSITE patterns.
www.expasy.ch/prosite/prorule.html
Sigrist et al.: Bioinformatics 21:4060-4066(2005)
Other methods for protein/domain identification
Pfam, TIGRFAMs, SMART, Gene3D, PANTHER, CDD: Hidden Markov Models (HMM), probabilistic models;
PRINTS: “unweighted” matrices; protein fingerprints;
BLOCKS: weight matrix derived from ungapped alignments;
PIRSF, SUPERFAMILY: classification system based on evolutionary relationship of whole proteins;
ProDom: automatic compilation of homologous domains based on recursive PSI-BLAST searches.
The InterPro project
www.ebi.ac.uk/interpro
Integrated Documentation Resource of Protein Families, Domains and Functional Sites
The InterPro project
www.ebi.ac.uk/interpro
• Unification of PROSITE, PRINTS, Pfam and ProDom into an integrated resource of protein families, domains and functional sites in 2000;
• Joint effort in creating a unified yet methodologically diverse system for protein family/domain identification;
• Single set of “documents” linked to the various methods;
• Distributed with tools by anonymous FTP and through www servers;
• Used to enhance the functional annotation of UniProtKB (Swiss-Prot and TrEMBL);
• Has progressively incorporated other databases
Current status of InterPro
Release 14.1 (February 2007) was built from Pfam, PRINTS, PROSITE, ProDom, SMART, TIGRFAMs, PIRSF, SCOP-based SUPERFAMILY, Gene3D and PANTHER, and the current UniProt/Swiss-Prot + TrEMBL data (for details see http://www.ebi.ac.uk/interpro/release_notes.html).
InterPro release 14.1 contains 13,953 entries, representing 3,911 domains, 9,610 families, 232 repeats, 34 active sites, 20 binding sites and 19 post-translational modification sites. Overall, there are 15,880,845 InterPro hits from 3,100,874 UniProtKB protein sequences. 92.4% of Swiss-Prot and 76.4% of TrEMBL protein sequences have one or more InterPro hits.
http://www.ebi.ac.uk/interpro/
http://www.ebi.ac.uk/interpro/IEntry?ac=IPR001304
InterPro: Graphical domain representation
http://www.ebi.ac.uk/integr8/ProteomeAnalysisAction.do?orgProteomeID=25
http://www.ebi.ac.uk/integr8/ProteomeAnalysisAction.do?orgProteomeId=18
The ExPASy www server
• First molecular biology server on the Web (August 1993); ~500 million accesses since;
• Dedicated to proteomics:
– Databases: UniProtKB, PROSITE, Swiss-2DPAGE, etc.;
– Many 2D/MS protein identification/characterization and sequence analysis tools;
• Mirror sites in Australia, Brazil, Canada, China and Korea: http://{au|br|ca|cn|kr|www}.expasy.org
ExPASy software tools
• Tools for the display and management of databases (NiceProt, Swiss-Shop sequence alerting system, etc.);
• Tools for sequence analysis (ScanProsite, ProtParam, ProtScale, RandSeq, Translate, etc.);
• Proteomics tools (AACompIdent, FindMod, FindPept, Aldente, PeptideMass, TagIdent, etc.);
• 3D-structure analysis and display tools (Swiss Model, Swiss-PDBviewer)
http://www.expasy.org/tools/
Identification: Aldente, TagIdent, AACompIdent, MultiIdent
Characterization: FindMod, GlycoMod, FindPept
- Use annotation in Swiss-Prot and TrEMBL (preprocessing, PTMs, etc.)
- Hyper-links between tools and databases
PeptideMass, GlycanMass, BioGraph, PeptideCutter
Analysis: ProtParam, ProtScale
http://www.expasy.org/links.html
Finding out about recent developments:
UniProtKB/Swiss-Prot recent format changes: http://www.expasy.org/sprot/relnotes/sp_news.html
UniProtKB/Swiss-Prot planned format changes: http://www.expasy.org/sprot/relnotes/sp_soon.html
Subscribe to the electronic Swiss-Flash bulletins: http://www.expasy.org/swiss-flash/
What’s new on ExPASy: http://www.expasy.org/history.html
References (1)
UniProtKB/Swiss-Prot: http://www.expasy.org/sprot/sprot-ref.html
Wu C. et al. The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res. 34:D187-191(2006).
Boeckmann B. et al. Protein variety and functional diversity: Swiss-Prot annotation in its biological context. Comptes Rendus Biologies 328:882-99(2005).
Bairoch A. Swiss-Prot: Juggling between evolution and stability. Brief. Bioinform. 5:39-55(2004).
Farriol-Mathis N. et al. Annotation of post-translational modifications in the Swiss-Prot knowledgebase. Proteomics 4:1537-1550(2004).
Gasteiger E. et al. Swiss-Prot: Connecting biological knowledge via a protein database. Curr. Issues Mol. Biol. 3:47-55(2001).
References (2)
PROSITE:
Hulo N., et al., The PROSITE database. Nucleic Acids Res. 34:D227-D230(2006).
Sigrist C.J.A., et al., PROSITE: a documented database using patterns and profiles as motif descriptors. Brief Bioinform. 3:265-274(2002).
Gattiker A., et al., ScanProsite: a reference implementation of a PROSITE scanning tool. Applied Bioinformatics 1:107-108(2002).
Sigrist C.J.A., et al., ProRule: a new database containing functional and structural information on PROSITE profiles. Bioinformatics 21:4060-4066(2005).
ExPASy:
Gasteiger E. et al. ExPASy: the proteomics server for in-depth protein knowledge and analysis. Nucleic Acids Res. 31:3784-3788(2003).
Useful general publications
• Nucleic Acids Res. Database issue 2006, vol. 34, supplement 1: http://nar.oupjournals.org/content/vol34/suppl_1/
• Nucleic Acids Res. Web server issue 2005, vol. 33, supplement 2: http://nar.oupjournals.org/content/vol33/suppl_2/
• Book: Bioinformatics for Dummies, by J.-M. Claverie and C. Notredame. Publisher: For Dummies; 2nd edition (December, 2006). ISBN: 0764516965
Take home message
• We need your feedback!
Or via the website
Before the introduction to Swiss-Prot/ExPASy…
After the introduction to Swiss-Prot/ExPASy…
Some practical exercises: http://education.expasy.org/cours/Tunis/
1. Finding databases
2. Comparing protein databases
3. Comparing BLAST programs
4. BLAST output
5. Bacterial start sites
6. UniRef
7. Different views of UniProtKB
8. Environmental sequences
9. Inter-database links & PROSITE
10. InterPro
11. Using UniProtKB/Swiss-Prot to create datasets