UniProtKB/Swiss-Prot and ExPASy: Protein sequence databases and proteomics tools developed at the Swiss Institute of Bioinformatics

A. Auchincloss, UniProtKB and ExPASy, Tunis, March 19, 2007

Outline

• The Swiss Institute of Bioinformatics
• What is UniProt?
• UniProt Knowledgebase: Swiss-Prot and TrEMBL
• HPI, post-translational modifications, HAMAP
• UniRef and UniParc
• Databases for protein function and domains: PROSITE, InterPro etc.
• ExPASy; other tools

Swiss Institute of Bioinformatics (SIB)

• Non-profit foundation created in 1998;
• Groups in Geneva, Lausanne and Basel;
• Federation of several groups (some of which existed and collaborated long before the foundation of the institute), about 170 researchers in 2006.

www.isb-sib.ch

SIB missions

• Development of databases and software tools;
• High-quality bioinformatics research program;
• Courses and seminars for the training of bioinformatics research scientists. This includes a master’s degree in proteomics and bioinformatics, several weekly courses and a doctoral school;
• Services to the Swiss Life Sciences community (EMBnet node).

Swiss Institute of Bioinformatics: 20 research and service groups

Proteins are organic compounds made of amino acids arranged in a linear chain and joined by peptide bonds… (Wikipedia)

Proteins are composed of 20 "standard" amino acids, symbolised by a LETTER.

Different ‘views’ of a protein

Proteins can also work together to perform a particular function, and they often associate to form complexes.


Proteins are essential parts of all living organisms and participate in every process within cells:
-> enzymes
-> structural or mechanical functions
-> important in cell signaling, immune response, cell adhesion, cell cycle, toxins…

Proteins are a necessary component in our diet, since animals cannot synthesize all the amino acids and must obtain essential amino acids from food.

Protein/Gene number

Organism        Number
Bacteria        182-8,591
S. cerevisiae   6,127
C. elegans      17,947
Drosophila      13,849
A. thaliana     ∼25,674
Human           ∼21,000

The universe in which protein databases evolve

1953: 1st sequence (bovine insulin)
1986: 4,000 sequences
2006: 3.5 million sequences
Where will it stop?

AMB, SP20

1st estimate: ~30 million species (1.5 million named)

2nd estimate:
20 million bacteria/archaea x 4,000 genes
5 million protists x 6,000 genes
3 million insects x 14,000 genes
1 million fungi x 6,000 genes
0.6 million plants x 20,000 genes
0.2 million molluscs, worms, arachnids, etc. x 20,000 genes
0.2 million vertebrates x 21,000 genes

The calculation: 2x10^7 x 4,000 + 5x10^6 x 6,000 + 3x10^6 x 14,000 + 10^6 x 6,000 + 6x10^5 x 20,000 + 2x10^5 x 20,000 + 2x10^5 x 21,000 + 21,000 (you!)

Figure shown on the slide: 179,000,021,000

Caveat: this is an estimate of the number of potential sequence entries, but not that of the number of distinct protein entities in the biosphere.
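The second estimate can be checked with a few lines of Python. This is only a sketch that re-adds the terms as they appear in this transcript; the group labels and numbers are copied from the list above, and the variable names are our own. As transcribed, the terms sum to 178,200,021,000, slightly below the 179,000,021,000 shown on the slide, so one of the original terms may have differed from what survives here.

```python
# (number of species, genes per species) for each group, as listed in the transcript.
GROUPS = {
    "bacteria/archaea": (20_000_000, 4_000),
    "protists": (5_000_000, 6_000),
    "insects": (3_000_000, 14_000),
    "fungi": (1_000_000, 6_000),
    "plants": (600_000, 20_000),
    "molluscs, worms, arachnids, etc.": (200_000, 20_000),
    "vertebrates": (200_000, 21_000),
}
HUMAN_GENES = 21_000  # the final "+ 21,000 (you!)" term


def potential_entries(groups=GROUPS, extra=HUMAN_GENES):
    """Sum species x genes over all groups, plus the extra human term."""
    return sum(species * genes for species, genes in groups.values()) + extra


total_species = sum(species for species, _ in GROUPS.values())
total_entries = potential_entries()
print(f"{total_species:,} species -> {total_entries:,} potential sequence entries")
```

The species counts add up to the 30 million of the first estimate, which is a useful sanity check on the transcription.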

AMB, SP20

What sequencing is underway right now?

• Many eukaryotic & bacterial genomes (varying sizes)
• Metagenomics (environmental samples): ~6 million sequences submitted/published in December 2006, ~17 million sequences being generated at the Venter Institute, 6 million proteins being submitted from the GOS (Global Ocean Sampling) trip

Protein sequences; what is sequenced?

• Currently about 3.5 to 4.0 million ‘known’ protein sequences
• More than 99% of these are derived by translation of nucleotide sequences
• Less than 1%: direct protein sequencing (Edman, MS/MS…)
-> It is important that users know where the protein sequence comes from (sequence & gene prediction quality)!

Level of DNA/RNA sequence quality
- DNA/RNA sequencing quality (genome or WGS, cDNA or EST…)
- Gene prediction quality: which programs were used, and was there manual intervention afterwards?

For example: authors can specify the nature of the CDS in the nucleotide databases by using qualifiers: "/evidence=experimental" or "/evidence=not_experimental". This is very rarely done…

The hectic life of a sequence …

Public nucleic acid databases Data not submitted to public databases, delayed or cancelled… cDNAs, ESTs, genomes, … EMBL , GenBank , DDBJ …if the submitters provide an annotated Coding Sequence (CDS) Public protein sequence databases

CDS: CoDing Sequence (CDS) CDS provided by the submitters The first Met !

CDS translation provided by EMBL

Data not submitted Complete genome (submitted) only ~ 1,858 CDS available!

Issue for the users: the protein database jungle


The hectic life of a sequence …

Data not submitted to public databases, delayed or cancelled… cDNAs, ESTs, genomes, … EMBL , GenBank , DDBJ CoDing Sequences provided by submitters Scientific publications derived sequences TrEMBL UniProtKB PIR Swiss-Prot Manually annotated GenPept RefSeq* EnsEMBL* IPI CCDS * Also gene prediction UniParc PRF PDB + species-specific databases (EcoGene, TubercuList, TIGR…)


Major public protein sequence database ‘sources’

PIR, PDB, PRF
UniProtKB: Swiss-Prot + TrEMBL
Integrated resources (‘cross-references’)
Separated resources
NCBI-nr: Swiss-Prot + GenPept + PIR + PDB + PRF + RefSeq

• UniProtKB/Swiss-Prot: manually annotated protein sequences (11,000 species)
• UniProtKB/TrEMBL: submitted CDS (EMBL) + automated annotation; non-redundant with Swiss-Prot (127,000 species)
• GenPept: submitted CDS (GenBank); redundant with UniProtKB (about 130,000 species)
• PIR: Protein Information Resource; archive since 2003; integrated into UniProtKB
• PDB: Protein Data Bank; 3D data and associated sequences
• PRF: journal scan of ‘published’ peptide sequences
• RefSeq: Reference Sequence for DNA, RNA, protein + gene prediction (4,000 species)

Other protein sequence databases

• CCDS: EBI + NCBI + Wellcome Trust Sanger + UC Santa Cruz (2 species). Consensus human and mouse sequences between 4 institutions, combining different approaches (ab initio, by similarity) and taking advantage of the expertise acquired by different institutes, including manual annotation…
• EnsEMBL: UniProtKB + RefSeq + gene prediction (31 species). Aligns some eukaryotic genomic sequences with all the sequences found in EMBL, UniProtKB/Swiss-Prot, RefSeq and UniProtKB/TrEMBL (→ known genes). Also does some gene prediction (→ novel genes).
• IPI: UniProtKB + RefSeq + EnsEMBL + (H-InvDB, TAIR, VEGA) (7 species). Provides a guide to the main databases that describe the human, mouse, rat, zebrafish, Arabidopsis, chicken, and cow proteomes.

The UniProt consortium

• European Bioinformatics Institute, European Molecular Biology Laboratory
• Swiss Institute of Bioinformatics
• Protein Information Resource

The UniProt Consortium

UniProt (Universal Protein Resource): the world's most comprehensive catalogue of protein information
www.uniprot.org; Wu et al. Nucleic Acids Res. 34:D187-191(2006).

Provides 3 databases:
- UniProtKB (Swiss-Prot + TrEMBL)
- UniRef
- UniParc
and soon UniMES (for Metagenomic and Environmental Sequences)

The Universal Protein resource components

UniProtKB Release 9.7 consists of:
• UniProtKB/TrEMBL: computer-annotated protein sequences; 3’600’000 entries; ~100’000 species
• UniProtKB/Swiss-Prot: manually annotated protein sequences; 260’000 entries; ~10’000 species
Produced by SIB and EBI.

UniRef100, UniRef90, UniRef50 (produced by PIR):
• One UniRef100 entry = all identical sequences (including fragments).
• One UniRef90 entry = sequences that have at least 90% identity.
• One UniRef50 entry = sequences that have at least 50% identity.
Independent of species. Allows comprehensive BLAST similarity searches by providing sets of representative sequences.

UniProt Archives (produced by EBI): ~8’000’000 entries
Archived raw protein sequences, found in publicly accessible databases: Swiss-Prot, TrEMBL, PIR, EMBL, Ensembl, IPI, PDB, RefSeq, FlyBase, WormBase, Patent Offices.
Use with extreme caution: contains pseudogenes, incorrect CDS predictions, etc…

The Universal Protein resource components

• UniProtKB/TrEMBL: computer-annotated protein sequences; 3,900,000 entries; ~127,000 species
• UniProtKB/Swiss-Prot: manually annotated protein sequences; 260,000 entries; ~11,000 species
Produced by SIB and EBI.

UniRef100, UniRef90, UniRef50 (produced by PIR):
• One UniRef100 entry = all identical sequences (including fragments).
• One UniRef90 entry = sequences that have at least 90% identity.
• One UniRef50 entry = sequences that have at least 50% identity.
Independent of species. Allows comprehensive BLAST similarity searches by providing sets of representative sequences.

UniProt Archives (produced by EBI): ~8,800,000 entries
Archived raw protein sequences, found in publicly accessible databases: Swiss-Prot, TrEMBL, PIR, EMBL, Ensembl, IPI, PDB, RefSeq, FlyBase, WormBase, Patent Offices.
Use with extreme caution: contains pseudogenes, incorrect CDS predictions, etc…

UniProt web sites…

http://www.expasy.org/sprot/
http://www.pir.uniprot.org/
http://www.ebi.ac.uk/uniprot/
http://www.uniprot.org/

Soon, a new unified web site, with a very powerful search engine: http://beta.uniprot.org/
Test it! Logon: guest; Password: amazing

The UniProt groups from SIB, EBI and PIR

(Antibes, September 2004)

In Geneva (SIB):
2 Group Leaders
44 Annotators
4 Prosite annotators
22 Programmers and Researchers
5 Administrators, science communicators
3 System Administrators
4 Students
1 GISAID
-----------------
85 people

At EBI (Swiss-Prot + EMBL + TrEMBL): 75 people (29 Annotators)

At PIR:
1 Group Leader
13 Protein Science Team
12 Informatics Team
-----------------
26 people

UniProtKB has biweekly releases; it is available from about 100 servers, the main sources being ExPASy and www.uniprot.org

UniProtKB From EMBL (DNA) to TrEMBL (protein)


EMBL -> TrEMBL: automated extraction of the protein sequence (CDS), gene/protein name, taxonomy and references, followed by automated annotation (keywords and protein family).

! TrEMBL does not translate DNA sequences, nor does it use gene prediction programs: it only takes the existing CDS proposed by the submitting authors in the EMBL/GenBank/DDBJ entry.
! In particular, the proposed CDS and derived protein sequences can be experimentally proven or derived from gene prediction programs (this is not obvious from the TrEMBL entry).
! TrEMBL does not validate any sequences.

The quality of UniProtKB/TrEMBL data is directly dependent on the information provided by the submitter of the original nucleotide entry.
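To make the "take the submitter's CDS as it is" idea concrete, here is a minimal Python sketch that pulls the qualifiers of each CDS feature out of an EMBL-style feature table. It is not the TrEMBL pipeline: the sample feature table, the gene name and the protein identifier are invented for illustration, and the parser only handles the simple layout shown (FT lines, one qualifier per line, continuation lines for long values).

```python
import re

# Invented example of an EMBL-style feature table (not a real database entry).
SAMPLE_FT = """\
FT   CDS             1..63
FT                   /gene="abcD"
FT                   /product="hypothetical protein"
FT                   /evidence=not_experimental
FT                   /protein_id="AAA00001.1"
FT                   /translation="MKTAYIAKQR
FT                   QISFVKSHFS"
"""


def cds_qualifiers(ft_text):
    """Return one dict of qualifiers per CDS feature found in the FT lines."""
    features, current, key, in_cds = [], None, None, False
    for line in ft_text.splitlines():
        if not line.startswith("FT"):
            continue
        if line[5:6].strip():  # a feature key starts in column 6
            parts = line[5:].split(None, 1)
            in_cds, key = parts[0] == "CDS", None
            if in_cds:
                current = {"location": parts[1] if len(parts) > 1 else ""}
                features.append(current)
            continue
        if not in_cds:
            continue
        body = line[2:].strip()
        match = re.match(r"/(\w+)(?:=(.*))?$", body)
        if match:
            key, value = match.group(1), match.group(2)
            if value is None:  # qualifier without a value
                current[key], key = True, None
            else:
                current[key] = value.strip('"')
        elif key is not None:  # continuation of the previous qualifier value
            joiner = "" if key == "translation" else " "
            current[key] += joiner + body.strip('"')
    return features


for feature in cds_qualifiers(SAMPLE_FT):
    print(feature["gene"], feature["evidence"], feature["translation"])
```

Note that the sketch never translates the nucleotide sequence itself: like the process described above, it only reports what the submitter supplied, including whether the CDS was flagged as experimental.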


UniProtKB From TrEMBL to Swiss-Prot


EMBL (CDS) -> TrEMBL: automated extraction of the protein sequence (CDS), gene name and references; automated annotation.

TrEMBL -> Swiss-Prot:
• Manual annotation of the sequence and associated biological information (derived from literature, external experts, databases…)
• Annotation of sequence differences (conflicts, variants, splicing…)
• Average of 6 independent sequence reports for each human protein

Distinguishing Swiss-Prot and TrEMBL

– A TrEMBL entry is a computer-annotated record derived from a coding sequence (CDS) in the nucleotide sequence databases that is not in Swiss-Prot, produced after some redundancy removal and automated annotation.

– A Swiss-Prot entry is a manually annotated record for a given protein.

UniProtKB

From TrEMBL to Swiss-Prot Step 1: Sequence check


UniProtKB/Swiss-Prot

Non-redundant: 1 entry -> 1 gene (1 species)
i) Merge all known protein sequences (CDS and amino acid) derived from the same gene -> decreases redundancy and improves sequence reliability
ii) Annotation of the sequence differences (including conflicts, polymorphisms, splice variants etc.) -> annotation of protein diversity

Redundancy…

UniProtKB/Swiss-Prot ~11,000 species UniProtKB/TrEMBL ~127,000 species

260,000 + 3,800,000



3,600,000

Redundancy in TrEMBL & Redundancy between TrEMBL and Swiss-Prot In the future: redundancy is going to decrease: "new" genome sequencing → "new" proteins

- 13 sequences (complete or partial) - derived from mRNA (n=6) or genomic DNA (n=7)

All alternatively spliced sequences are available for BLAST searches, protein identification tools and are downloadable… Human: ~2/3 of the human genes are alternatively spliced

- 6 genomic sequences (complete or partial) - 1 protein sequence from PIR

Multiple alignment of the available clpB sequences

Within Swiss-Prot?

• A snapshot of the situation (December 2006):
– 28,200 entries with 82,000 sequence conflicts;
– 2,600 entries with corrected frameshifts;
– 15,100 entries with corrected initiation sites;
– 4,300 entries with other sequence ‘problems’.
• At least 43,000 entries (19% of Swiss-Prot) required a minimal amount of annotation effort to obtain the “correct” sequence.

Quality of protein information from genome projects

• Proteins originating from different genome projects:
– Drosophila: what a curated (thanks to FlyBase) genome effort should look like: only 1.8% of the gene models conflict with what we have in UniProtKB/Swiss-Prot;
– Arabidopsis: a genome where lots of work was done to annotate it when it was sequenced, but where nothing has been done since (at least in the public view): 19.5% of the gene models are erroneous;
– Tetraodon nigroviridis: a quick and dirty automatic run through a genome with no manual intervention: >90% of the gene models produce incorrect proteins;
– Bacteria and Archaea have almost no splicing, so prediction is “easier”; however, errors are still made…
• Producing a clean set of sequences is not a trivial task;
• It is not getting easier as more and more types of sequence data are submitted;
• It is important to pursue our efforts in making sure we provide to our users the most correct set of sequences for a given organism.

New ‘Protein existence evidence’ tag

• As most protein sequences are derived from translation of nucleotide sequences and are only predictions, the new PE line indicates whether there is any evidence that proves the existence of a protein;
• The ‘Protein existence evidence’ will have 5 different qualifiers:
1. Evidence at protein level
2. Evidence at transcript level
3. Inferred from homology
4. Predicted
5. Unassigned (used mostly in TrEMBL)

Righting the wrongs

“Sequences are rarely deposited in a “mature” state; as with all scientific research, DNA and protein annotation is a continual process of learning, revision and corrections.”
“Sequencing error rates: ~1 base in 10’000”
“Making people aware of errors is good and great; making people aware that they’re responsible also for correcting errors is even greater”
C. Hardley, EMBO reports, 4(9), 2003.

UniProtKB

From TrEMBL to Swiss-Prot Step 2: Annotation:

literature, controlled vocabulary

Annotation

• The focal point of the efforts to maintain and develop UniProtKB/Swiss-Prot;
• It is becoming more and more important as it:
– provides a summary of what is known about a protein;
– creates templates for automatic annotation for the many organisms whose genome sequence is/will be available but whose proteins will not be characterized;
– provides well-annotated entries (a corpus) to train literature mining tools (text mining).

Source of data
- publications (> 1,700 journals cited)
- also external scientific expertise & other databases

Comments: “structured free text”, 27 defined topics
• Manually annotated
• Information from papers, specialized databases, computer prediction, external experts, brain storming
• Distinction between data obtained experimentally and computerized inferences

UniProtKB

From TrEMBL to Swiss-Prot Step 3: Sequence analysis (bioinformatics tools)


The annotation platform

Annotators could not work without the help of our software developers.

Anabelle: much more than a domain annotation platform

We manually check the results !

What else is in a UniProtKB/Swiss-Prot entry?


Cross-references; a central hub

Gasteiger E. et al, Curr. Issues Mol. Biol. 3:47-55(2001)
www.expasy.org/cgi-bin/lists?dbxref.txt

• Swiss-Prot was the first database with X-references;
• Explicitly X-referenced to 85 databases:
– DNA (EMBL/GenBank/DDBJ)
– 3D-structure (PDB)
– Family and domain (InterPro, HAMAP, PROSITE, Pfam, etc.)
– genomic (OMIM, MGI, FlyBase, SGD, SubtiList, etc.)
– 2D-gel (e.g. SWISS-2DPAGE)
– specialized db (e.g. GlycoSuiteDB, PhosSite, MEROPS)
– literature (PubMed)
• Each UniProtKB/Swiss-Prot entry can be seen as a central hub for the data available about the protein it describes

UniProtKB/Swiss-Prot explicit links

Organism-specific databases

AGD CYGD DictyBase EchoBASE EcoGene euHCVdb FlyBase GeneDB_Spombe GeneFarm Gramene H-InvDB HGNC HIV HPA LegioList Leproma ListiList MaizeGDB MGI MIM MypuList PhotoList RGD SagaList SGD StyGene SubtiList TAIR TubercuList WormBase WormPep ZFIN

Genome annotation databases

Ensembl GenomeReviews KEGG TIGR

Sequence databases

EMBL PIR UniGene

3D structure databases

HSSP PDB SMR

Enzyme and pathway databases

BioCyc Reactome

PTM databases

GlycoSuiteDB PhosSite

Miscellaneous

ArrayExpress dbSNP DIP DrugBank GO IntAct LinkHub RZPD-ProtExp

Family and domain databases

Gene3D HAMAP InterPro PANTHER PIRSF Pfam PRINTS ProDom PROSITE SMART TIGRFAMs

2D-gel databases

ANU-2DPAGE Aarhus/Ghent-2DPAGE COMPLUYEAST-2DPAGE Cornea-2DPAGE DOSAC-COBS-2DPAGE ECO2DBASE HSC-2DPAGE OGP PHCI-2DPAGE PMMA-2DPAGE Rat-heart-2DPAGE REPRODUCTION-2DPAGE Siena-2DPAGE SWISS-2DPAGE

Protein family/group databases

GermOnline MEROPS PeroxiBase PptaseDB REBASE TRANSFAC

Implicit cross-references on new web server and ExPASy

Implicit X-references to 26 additional db (e.g. GeneCards, ModBase, etc.) are added by the ExPASy server on the www. These X-refs are not present as hard-coded DR lines in the Swiss-Prot entry as it can be downloaded by ftp, but are added on the fly when someone views an entry on ExPASy. This can be done because enough information is present in the UniProtKB entry to access the related information in another db. Example: all Swiss-Prot/TrEMBL entries are linked to the BLOCKS domain db via the Swiss-Prot/TrEMBL accession number.
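The principle of an implicit cross-reference can be sketched in a few lines of Python: the link is not stored in the entry but computed from the accession number at display time. The URL templates below are placeholders invented for this illustration (the slides do not give the real link patterns used by ExPASy), and the function name is our own.

```python
# Hypothetical URL templates: placeholders only, NOT the real link patterns.
IMPLICIT_XREF_TEMPLATES = {
    "BLOCKS": "https://blocks.example.org/entry?ac={ac}",
    "GeneCards": "https://genecards.example.org/lookup?uniprot={ac}",
}


def implicit_xrefs(accession, templates=IMPLICIT_XREF_TEMPLATES):
    """Build links on the fly from an accession number; nothing is stored in the entry."""
    return {db: url.format(ac=accession) for db, url in templates.items()}


for database, link in implicit_xrefs("P99999").items():
    print(database, link)
```

Because the links are derived rather than stored, adding a new target database only requires a new template, not a change to every entry.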

Keyword definition and usage in Swiss-Prot

Linked to Gene Ontology to further facilitate information retrieval via controlled vocabularies

In a UniProtKB/Swiss-Prot entry, you can expect to find:

• All the names of a given protein (and of its gene);
• Its biological origin with links to the taxonomic databases;
• A selection of references;
• A summary of what is known about the protein: function, alternative products, PTM, tissue expression, disease, 3D structures, etc.…;
• Selected keywords;
• Numerous cross-references;
• A description of important sequence features: domains, PTMs, variations, etc.;
• A (often corrected) protein sequence and the description of various isoforms/variants.

Monitoring entry history: The UniProtKB Sequence/Annotation Version archive

… and many useful links:

And on the new website

other tools are not yet available…

UniProt Knowledgebase

• Swiss-Prot: Manually annotated section • TrEMBL: Automatically annotated section


Accession number: to be used whenever you cite a UniProt entry (never cite the entry name (ID) alone)
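As a small illustration, the sketch below reads the entry name and the accession numbers from the ID and AC lines of a flat-file entry. The two sample lines are illustrative only: the accession P99999 appears elsewhere in these slides, but the entry name and the secondary accessions shown here were not copied from an actual release. The first accession on the AC line is the primary one, which is the identifier to cite.

```python
# Illustrative ID/AC lines in Swiss-Prot flat-file layout (not copied from a release).
SAMPLE_ENTRY = """\
ID   CYC_HUMAN               Reviewed;         105 AA.
AC   P99999; A4D166; B2R4I1;
"""


def entry_identifiers(flat_text):
    """Return (entry name, primary accession, secondary accessions)."""
    entry_name, accessions = None, []
    for line in flat_text.splitlines():
        code, content = line[:2], line[5:]
        if code == "ID":
            entry_name = content.split()[0]
        elif code == "AC":
            accessions += [ac.strip() for ac in content.split(";") if ac.strip()]
    primary = accessions[0] if accessions else None
    return entry_name, primary, accessions[1:]


name, primary_ac, secondary_acs = entry_identifiers(SAMPLE_ENTRY)
print(f"cite {primary_ac} (entry name {name} may change)")
```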

Non-Redundant Complete Proteome Sets

• Text search: UniProtKB keyword “Complete proteome”, combined with an organism name
• Or download precomputed sets (bacteria, archaea, some eukaryotes): ftp://ftp.expasy.org/databases/complete_proteomes/entries
• Or EBI Integr8: http://www.ebi.ac.uk/integr8/

Swiss-Prot annotation priorities

The main annotation programs:
• HAMAP (High quality Automated and Manual Annotation of microbial Proteomes; bacteria, archaea, plastids);
• HPI (Human Proteomics Initiative);
• PPAP (Plant Proteome Annotation Project);
• FPAP (Fungal Proteome Annotation Project);
• Viral proteins;
• Tox-Prot (Toxin Annotation Project);
• ENZYMES (proteins with EC numbers);
• PTMs
• 3D-structure
• Protein-protein interactions
• Quality assurance, includes controlled vocabularies

Model organisms

• Organisms for which we want to have a more in-depth coverage;
• Completeness, links with specialized databases, specific documents;
• Examples: fruitfly, A.thaliana, E.coli, B.subtilis, human, mouse, C.elegans, yeast, S.pombe

Human Proteomics Initiative (HPI)


From genome to proteome

~ 21,000 human genes
-> alternative splicing of mRNA: 2-5 fold increase
~ 100,000 human transcripts
-> post-translational modifications of proteins (PTMs): 5-10 fold increase
~ 1,000,000 human proteins

Considerable increase in complexity
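The orders of magnitude on this slide follow from simple multiplication. The short sketch below applies the two fold-increase ranges to the gene count; the variable names are our own, and the result is only a back-of-the-envelope range, whose upper ends are close to the rounded ~100,000 transcripts and ~1,000,000 proteins quoted above.

```python
GENES = 21_000
SPLICING_FOLD = (2, 5)  # alternative splicing: 2-5 fold increase
PTM_FOLD = (5, 10)      # post-translational modifications: 5-10 fold increase

# Lower and upper bounds for transcripts, then for distinct protein forms.
transcripts = tuple(GENES * fold for fold in SPLICING_FOLD)
proteins = (transcripts[0] * PTM_FOLD[0], transcripts[1] * PTM_FOLD[1])

print(f"transcripts: {transcripts[0]:,} to {transcripts[1]:,}")
print(f"proteins:    {proteins[0]:,} to {proteins[1]:,}")
```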

In the case of human genes, the Swiss-Prot/TrEMBL redundancy is still very high:

15,803 + 53,100



about 20,000*

* human gene number estimation: 21,000-35,000

MS proteomics has verified more than 10% of human gene products, but has not identified significant numbers of unpredicted proteins.

What is missing:
• Sequences not submitted to EMBL/GenBank/DDBJ (and PIR)
• Not yet predicted or known genes ("no CDS provided by the submitters" or no DNA sequence)
• Confidential data (patent application sequences)
• Immunoglobulins, T-cell receptors (-> UniParc)
• …

Post-translational modifications (PTMs)


PTM definition

A post-translational modification (PTM) is a modification of a polypeptide chain involving the making or the breaking of covalent bond(s) that occurs during translation (co-translational class) or after translation.

PTMs influence or even define protein function

phosphorylation and possibly GlcNAcylation and S-nitrosylation are a means of transducing extracellular signals to the inside of the cells.

methylation has a role in nuclear protein import.

lipid addition allows the association of proteins with membranes (e.g. GPI anchor, myristate, palmitate).

intrachain disulfide bonds and N-glycosylation influence protein folding.

interchain disulfide bonds bind subunits together.

other PTMs are directly involved in the protein function, as for example the binding of cofactors (e.g. pyridoxal phosphate), or the synthesis of a cofactor by the modification of amino acids present in the protein (e.g. quinones).


PTM variety

[Figure: PTM variety. The 20 amino acids (Gly, Ala, Val, Leu, Ile, Lys, Arg, His, Asp, Glu, Asn, Gln, Cys, Ser, Thr, Met, Pro, Phe, Tyr, Trp) shown with N-terminal, C-terminal and side-chain modifications. Labels: acetylation, methylation, acylation, phosphorylation, oxidation, crosslinks, GPI, amidation, hydroxylation, cofactor binding, sulfation, C-linked sugar, N-linked sugar, O-linked sugar, S-linked sugar. In black: cytoplasmic modifications; in dark grey: both cytoplasmic and extracellular modifications, depending on the exact type; in light grey: extracellular modifications.]

Large scale experiments (LSE) for PTMs!

• PTM information can now be obtained from results of proteomics large scale experiments (LSE); • In the past 12 months we have added about 6’000 experimental PTMs using data originating from some of these projects.

AMB, SP20

Proteomic studies have led to the updating of 2,767 human Swiss-Prot entries, mainly with PTM information (UniProt release 10.0, March 2007):
Phosphorylation (83%)
Glycosylation (9%)
Subcellular location (4%)
Other PTMs (4%)

Bacteria and Archaea (HAMAP)


In 2006, ≈130 new bacterial and archaeal genomes (not WGS) were submitted to the DNA databases; If on "average" 4,000 proteins/genome=>500,000 proteins!

How to cope????


HAMAP: High quality Automated and Manual Annotation of microbial Proteomes

Lots of microbial genomes, lots of proteins. What should we do with them in UniProt?

http://www.expasy.org/unirule/MF_00319

Automatic annotation of proteins belonging to specified families

• This program requires the continuous development and adaptation of software tools as well as the development of a database of annotation rules for each family (so far about 1,400).

• Allows us to annotate automatically, yet with a very high level of quality, proteins that belong to well defined protein families;
• Can be applied to both characterized proteins and to some UPFs (Uncharacterized Protein Families);
• The families are based on UniProtKB/Swiss-Prot entries, so we first do all the annotation steps described earlier!

www.expasy.ch/sprot/hamap/

Using HAMAP, we can currently annotate to Swiss-Prot quality level between 10% and 50% of a complete microbial proteome (next step: HAMAP for Fungi…)

Updates

• DNA sequence archives: EMBL/GenBank/DDBJ is an archive
– All submitted data goes into the archive
– Submitters are responsible for the submitted sequences and the accompanying annotation
– Nobody else can change them (including the curators at EMBL/GenBank/DDBJ)
• Protein sequence databases: UniProtKB/Swiss-Prot is NOT an archive
– Swiss-Prot chooses what goes into the database and where to place it
– Swiss-Prot updates annotation and sequences when necessary

**ZB SYP, 28-NOV-2003; ALB, 16-NOV-2004; MIM, 31-Jan-2006; **ZB BER, 13-FEB-2006; LYG, 14-JUN-2006; LYG, 21-SEP-2006; **ZB CHH, 05-DEC-2006;


User updates or annotation requests


Accessing & Searching UniProtKB Direct access (keyword search)

• New search tool – we’ll use it later
• Sequence Retrieval System (SRS, Europe) (will disappear)
• Entrez (NCBI, USA) – UniProtKB/Swiss-Prot (not TrEMBL) is integrated in GenPept, but with a changed format, and some information (e.g. implicit cross-references) is missing
• Query tools on ExPASy & UniProt (http://www.expasy.org/sprot/, http://www.uniprot.org)

Indirect access (sequence search)

• Bioinformatics & sequence analysis tools (Blast, Fasta, GCG, Emboss, MS Identification tools…)

Downloading the UniProt Knowledgebase

http://www.expasy.org/sprot/download.html

• Swiss-Prot and TrEMBL form a complete, non-redundant database, the UniProt Knowledgebase
• Can be downloaded from ftp://ftp.expasy.org/databases/uniprot/current_release/knowledgebase
• In “Swiss-Prot” format, fasta or xml format
• Complemented by sequences of alternative splice isoforms
• “everything” about “all” proteins! (at least all CDS submitted to the public nucleotide sequence databases)

If you want to develop tools to work with your local copy of UniProtKB:

Swissknife – a PERL parser for UniProtKB
• Constantly updated according to latest format changes
• Advantage: you do not need to know how exactly the information is stored in the flat file
http://swissknife.sourceforge.net/
ftp://ftp.ebi.ac.uk/pub/software/swissprot/Swissknife/

Take home message

• Swiss-Prot is the non-redundant, manually annotated and highly cross-referenced section of the UniProt Knowledgebase
• Be aware of the differences between UniProtKB/TrEMBL and UniProtKB/Swiss-Prot
– Computer vs. Human
– Redundant vs. Non-redundant
• Always cite the Accession number, not the entry name
– The AC is stable
– The entry name might change

We need your feedback and your expertise!

[email protected]

http://www.expasy.org/sprot/update.html

(and from every UniProtKB entry page on our servers)

The UniProt Consortium

UniRef100, 90 and 50 clusters

One UniRef100 entry -> all identical sequences from UniProtKB and some sections of UniParc (including fragments, Swiss-Prot splice variants).

One UniRef90 entry -> sequences that have at least 90% identity.

One UniRef50 entry -> sequences that are at least 50% identical.
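The idea behind a UniRef100-style cluster, grouping identical sequences regardless of species, can be sketched as follows. This is a toy illustration, not the UniRef procedure: it only groups exactly identical full-length sequences, it does not handle fragments, and the 90% and 50% identity levels would need a sequence comparison step that is not shown. The accession numbers and sequences are invented.

```python
from collections import defaultdict


def group_identical(records):
    """Group (accession, organism, sequence) records by identical sequence."""
    clusters = defaultdict(list)
    for accession, organism, sequence in records:
        # Clustering is independent of the organism: only the sequence is the key.
        clusters[sequence.upper()].append((accession, organism))
    return dict(clusters)


# Invented records for illustration only.
RECORDS = [
    ("X00001", "Homo sapiens", "MKTAYIAK"),
    ("X00002", "Mus musculus", "mktayiak"),
    ("X00003", "Homo sapiens", "MKTAYIAR"),
]

for sequence, members in group_identical(RECORDS).items():
    print(sequence, members)
```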


UniRef100, 90 and 50 clusters

• One cluster can contain sequences from several species; clustering is done independently of the organism
• Each cluster has a “representative” (“reference”) sequence, preferably that of the best-annotated Swiss-Prot entry
• UniRef identifiers are of the form UniRef100_P99999, UniRef50_P00414. They are not stable: clusters are recomputed with every biweekly release, and cluster representatives can change!
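The identifier layout described above can be taken apart with a few lines of Python. This is a small sketch that assumes only the form shown on the slide (`UniRef`, the identity level, an underscore, then the identifier of the representative sequence). The names `UniRefId` and `parse_uniref_id` are made up for this example.

```python
import re
from typing import NamedTuple


class UniRefId(NamedTuple):
    identity: int        # clustering level: 100, 90 or 50
    representative: str  # identifier of the representative sequence


# Assumed layout: "UniRef" + level + "_" + representative identifier.
_UNIREF_ID = re.compile(r"^UniRef(100|90|50)_(\S+)$")


def parse_uniref_id(identifier: str) -> UniRefId:
    """Split a UniRef identifier into its identity level and representative."""
    match = _UNIREF_ID.match(identifier)
    if match is None:
        raise ValueError(f"not a UniRef identifier: {identifier!r}")
    return UniRefId(int(match.group(1)), match.group(2))


if __name__ == "__main__":
    for uid in ("UniRef100_P99999", "UniRef50_P00414"):
        print(uid, "->", parse_uniref_id(uid))
```

Because representatives can change between releases, a parsed identifier is best stored together with the release it came from.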

UniRef is useful for comprehensive BLAST sequence searches by providing sets of representative sequences.

Implicit cross-link from UniProtKB to UniRef: new web view

The UniProt Consortium

UniParc – the UniProt Archive

• 8.8 million sequences
• Sequences and cross-references (AC numbers)
• A comprehensive collection of the raw protein sequences in public databases (including those not submitted to the DNA databases): Swiss-Prot, TrEMBL, PIR, EMBL, Ensembl, IPI, PDB, RefSeq, FlyBase, WormBase, Patent Offices
• UniParc can be used to track sequence versions

Use with extreme caution: UniParc also contains pseudogenes, incorrect CDS predictions, etc., and is highly redundant!

UniParc tracks a protein sequence and its integration in various databases
http://www.pir.uniprot.org/cgi-bin/textSearch_AR
(Screenshot annotation: Patent data)

UniParc entry UPI0000033477, part 2
(Screenshot annotations: “TrEMBL entry probably to be merged into Swiss-Prot”; “TrEMBL entry was merged into Swiss-Prot”)

www.expasy.ch/prosite

PROSITE

A database of protein families and domains using two kinds of motif descriptors:

Patterns or regular expressions:
• User friendly (easy to understand and to use)
• Well designed for the detection of biologically meaningful sites such as residues playing a structural or functional role
• Can be used to scan a protein database in reasonable time on any computer

Generalized profiles or weight matrices:
• Well adapted to cover the full length of the protein or domain
• Are able to detect highly divergent families or domains with only a few well conserved positions
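Because PROSITE patterns are essentially regular expressions, they are easy to translate for a quick local scan. The Python sketch below assumes the usual PROSITE pattern conventions: elements separated by `-`, `x` for any residue, `[...]` for allowed residues, `{...}` for forbidden residues, `(n)` or `(n,m)` for repetition, and `<` / `>` for the N- and C-terminal ends. It does not handle rarer constructs such as `>` inside square brackets, and it is not the ScanProsite implementation. The example pattern and test sequence are illustrative.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")  # patterns may end with a period
    parts = []
    for element in pattern.split("-"):
        # Forbidden residues: {ED} -> [^ED] (must run before the next step).
        element = element.replace("{", "[^").replace("}", "]")
        # Repetition: x(2,4) -> x{2,4}
        element = element.replace("(", "{").replace(")", "}")
        # Any residue: x -> .
        element = element.replace("x", ".")
        # Sequence ends: < -> ^ and > -> $
        element = element.replace("<", "^").replace(">", "$")
        parts.append(element)
    return "".join(parts)


if __name__ == "__main__":
    # Illustrative pattern written in PROSITE syntax.
    regex = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H")
    print(regex)
    # Synthetic sequence built to contain one match.
    sequence = "AA" + "CAACAAAL" + "A" * 8 + "HAAAH"
    for hit in re.finditer(regex, sequence):
        # Report 1-based positions, as sequence databases do.
        print(hit.start() + 1, hit.end(), hit.group())
```

This also shows why patterns scan quickly: each one compiles to a single regular expression that is run once over every sequence.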

Identification of protein domains and families

• There are two non-exclusive approaches for determining the function of an uncharacterized protein:
– Comparison with a complete sequence database (BLAST)
– Scanning a database of patterns and profiles
• Most proteins can be grouped into families. Proteins belonging to a particular family share functional attributes and are derived from a common ancestor;
• Some regions in the sequence are more conserved than others during evolution because they are important for the function or the structure of the protein;
• Like fingerprints for police identification, signatures built out of sequence patterns or profiles can be used to formulate hypotheses about the function of uncharacterized proteins.

Definitions of conserved regions

Conserved regions can be classified into 5 different groups:
• Families: proteins that have the same domain arrangement, whether one or many domains.
• Domains: specific combinations of secondary structures that assume characteristic three-dimensional structures or folds.
• Repeats: structural units always found in two or more copies that assemble into a specific fold. Assemblies of repeats might also be thought of as domains.
• Motifs: short regions with conserved active or binding sites that usually adopt a folded conformation only in association with their ligands.
• Sites: functional residues (active sites, disulfide bridges, post-translationally modified residues).

Conserved regions (2)

(Figure labels)
• PPID family: 1 CSA_PPIASE domain + 3 TPR repeats
• CSA_PPIASE domain
• Cys 181: active site residue
• Binding cleft (motif)

http://www.expasy.org/tools/scanprosite/

Functionally and structurally relevant residues in PROSITE motif descriptors

A new concept to extract more information from profiles

Principle:
• Combining the advantages of profiles (high sensitivity) and patterns (position-specific information)
• Tagging of amino acids at precise positions in the profile and checking their presence in the matched sequence
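The principle of tagging positions and then checking the matched sequence can be sketched in Python as follows. This is a strong simplification and not the actual ProRule machinery: it assumes an ungapped match, so a tagged profile position maps to the sequence by a simple offset from the match start. The function name, the rule layout and the example data are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical rule layout: offset within the match -> (allowed residues, feature).
TaggedPositions = Dict[int, Tuple[str, str]]


def check_tagged_residues(
    sequence: str, match_start: int, tagged: TaggedPositions
) -> List[Tuple[int, Optional[str], str, bool]]:
    """Check each tagged position of a profile match against the sequence.

    Returns (1-based sequence position, residue found, feature label, satisfied).
    Assumes an ungapped match starting at the 0-based index `match_start`.
    """
    results = []
    for offset, (allowed, label) in sorted(tagged.items()):
        pos = match_start + offset
        residue = sequence[pos] if 0 <= pos < len(sequence) else None
        satisfied = residue is not None and residue in allowed
        results.append((pos + 1, residue, label, satisfied))
    return results


if __name__ == "__main__":
    # Made-up sequence and rule, for illustration only.
    rule = {2: ("C", "active site"), 5: ("ST", "binding site")}
    for row in check_tagged_residues("MKTACDEFGH", match_start=2, tagged=rule):
        print(row)
```

A feature would only be annotated when its condition is satisfied, which is how position-specific information gets added on top of a sensitive profile match.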

ProRule

Aim:
• Provide users with biologically meaningful functional and structural information: active sites, post-translational modification sites, binding sites, disulfide bonds, transmembrane regions.

• Help the UniProtKB/Swiss-Prot annotation and provide enhanced homogeneity: domain name and boundaries, keywords and linked GO terms, EC numbers, false negative PROSITE patterns.

www.expasy.ch/prosite/prorule.html

Sigrist et al.: Bioinformatics 21:4060-4066(2005)

Other methods for protein/domain identification

• Pfam, TIGRFAMs, SMART, Gene3D, PANTHER, CDD: hidden Markov models (HMMs), probabilistic models;
• PRINTS: “unweighted” matrices; protein fingerprints;
• BLOCKS: weight matrices derived from ungapped alignments;
• PIRSF, SUPERFAMILY: classification systems based on the evolutionary relationship of whole proteins;
• ProDom: automatic compilation of homologous domains based on recursive PSI-BLAST searches.

The InterPro project

www.ebi.ac.uk/interpro

Integrated Documentation Resource of Protein Families, Domains and Functional Sites

The InterPro project

www.ebi.ac.uk/interpro

• Unification of PROSITE, PRINTS, Pfam and ProDom into an integrated resource of protein families, domains and functional sites in 2000;
• Joint effort in creating a unified yet methodologically diverse system for protein family/domain identification;
• Single set of “documents” linked to the various methods;
• Distributed with tools by anonymous FTP and through www servers;
• Used to enhance the functional annotation of UniProtKB (Swiss-Prot and TrEMBL);
• Has progressively incorporated other databases.

Current status of InterPro

Release 14.1 (February 2007) was built from Pfam, PRINTS, PROSITE, ProDom, SMART, TIGRFAMs, PIRSF, SCOP-based SUPERFAMILY, Gene3D and PANTHER, and the current UniProtKB/Swiss-Prot + TrEMBL data (for details see http://www.ebi.ac.uk/interpro/release_notes.html).

InterPro release 14.1 contains 13,953 entries, representing 3,911 domains, 9,610 families, 232 repeats, 34 active sites, 20 binding sites and 19 post-translational modification sites. Overall, there are 15,880,845 InterPro hits from 3,100,874 UniProtKB protein sequences. 92.4% of Swiss-Prot and 76.4% of TrEMBL protein sequences have one or more InterPro hits.

http://www.ebi.ac.uk/interpro/

http://www.ebi.ac.uk/interpro/IEntry?ac=IPR001304

InterPro: Graphical domain representation

http://www.ebi.ac.uk/integr8/ProteomeAnalysisAction.do?orgProteomeID=25

http://www.ebi.ac.uk/integr8/ProteomeAnalysisAction.do?orgProteomeId=18

The ExPASy www server

• First molecular biology server on the Web (August 1993); ~500 million accesses since;
• Dedicated to proteomics:
– Databases: UniProtKB, PROSITE, Swiss-2DPAGE, etc.;
– Many 2D/MS protein identification/characterization and sequence analysis tools;
• Mirror sites in Australia, Brazil, Canada, China and Korea: http://{au|br|ca|cn|kr|www}.expasy.org

ExPASy software tools

• Tools for the display and management of databases (NiceProt, Swiss-Shop sequence alerting system, etc.);
• Tools for sequence analysis (ScanProsite, ProtParam, ProtScale, RandSeq, Translate, etc.);
• Proteomics tools (AACompIdent, FindMod, FindPept, Aldente, PeptideMass, TagIdent, etc.);
• 3D-structure analysis and display tools (Swiss Model, Swiss-PDBviewer).

http://www.expasy.org/tools/

Identification: Aldente, TagIdent, AACompIdent, MultiIdent
Characterization: FindMod, GlycoMod, FindPept
Analysis: ProtParam, ProtScale, PeptideMass, GlycanMass, BioGraph, PeptideCutter
- Use annotation in Swiss-Prot and TrEMBL (preprocessing, PTMs, etc.)
- Hyperlinks between tools and databases

http://www.expasy.org/links.html

Finding out about recent developments:

UniProtKB/Swiss-Prot recent format changes: http://www.expasy.org/sprot/relnotes/sp_news.html

UniProtKB/Swiss-Prot planned format changes: http://www.expasy.org/sprot/relnotes/sp_soon.html

Subscribe to the electronic Swiss-Flash bulletins: http://www.expasy.org/swiss-flash/

What’s new on ExPASy: http://www.expasy.org/history.html

References (1)

UniProtKB/Swiss-Prot: http://www.expasy.org/sprot/sprot-ref.html

Wu C. et al. The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res. 34:D187-191(2006).

Boeckmann B. et al. Protein variety and functional diversity: Swiss-Prot annotation in its biological context. Comptes Rendus Biologies 328:882-99(2005).

Bairoch A. Swiss-Prot: Juggling between evolution and stability. Brief. Bioinform. 5:39-55(2004).

Farriol-Mathis N. et al. Annotation of post-translational modifications in the Swiss-Prot knowledgebase. Proteomics 4:1537-1550(2004).

Gasteiger E. et al. Swiss-Prot: Connecting biological knowledge via a protein database. Curr. Issues Mol. Biol. 3:47-55(2001).

References (2)

PROSITE:

Hulo N. et al. The PROSITE database. Nucleic Acids Res. 34:D227-D230(2006).

Sigrist C.J.A. et al. PROSITE: a documented database using patterns and profiles as motif descriptors. Brief. Bioinform. 3:265-274(2002).

Gattiker A. et al. ScanProsite: a reference implementation of a PROSITE scanning tool. Applied Bioinformatics 1:107-108(2002).

Sigrist C.J.A. et al. ProRule: a new database containing functional and structural information on PROSITE profiles. Bioinformatics 21:4060-4066(2005).

ExPASy:

Gasteiger E. et al. ExPASy: the proteomics server for in-depth protein knowledge and analysis. Nucleic Acids Res. 31:3784-3788(2003).

Useful general publications

• Nucleic Acids Res. Database issue 2006, vol. 34, supplement 1: http://nar.oupjournals.org/content/vol34/suppl_1/
• Nucleic Acids Res. Web server issue 2005, vol. 33, supplement 2: http://nar.oupjournals.org/content/vol33/suppl_2/
• Book: Bioinformatics for Dummies, by J.-M. Claverie and C. Notredame. Publisher: For Dummies; 2nd edition (December 2006). ISBN: 0764516965

Take home message

• We need your feedback!

[email protected]

Or via the website

Before the introduction to Swiss-Prot/ExPASy…

After the introduction to Swiss-Prot/ExPASy…

Some practical exercises: http://education.expasy.org/cours/Tunis/
1. Finding databases
2. Comparing protein databases
3. Comparing BLAST programs
4. BLAST output
5. Bacterial start sites
6. UniRef
7. Different views of UniProtKB
8. Environmental sequences
9. Inter-database links & PROSITE
10. InterPro
11. Using UniProtKB/Swiss-Prot to create datasets