"Plus ça change, plus
c'est la même chose…"
the next 20 years
This will not be a talk on
the (pre)-history of Swiss-Prot
The universe in which Swiss-Prot
1953: 1st sequence (bovine insulin)
1986: 4’000 sequences
2006: 3.5 million sequences
Where will it stop?
179'000'025'042 (179 billion)
1st estimate: ~30 million species (1.5 million named)
2nd estimate:
million bacteria/archaea: 4'000 genes
million protists: 6'000 genes
million insects: 14'000 genes
million fungi: 6'000 genes
0.6 million plants: 20'000 genes
0.2 million molluscs, worms, arachnids, etc.: 20'000 genes
0.2 million vertebrates: 25'000 genes
The calculation:
10^5 x 20000 + 2 x 10^5 x 25000 + 25000 (Craig Venter) + 42 (Douglas Adams)
Caveat: this is an estimate of the number of potential sequence entries,
but not that of the number of distinct protein entities in the biosphere.
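A quick check of the arithmetic, as a sketch only: it uses the three groups whose species counts are legible on this slide (plants, molluscs/worms/arachnids, vertebrates) plus the two joke terms, and shows how much of the 179 billion the other four groups must account for. The variable names are ours, not from the talk.

```python
# Figures are taken from the slide; names are illustrative only.
TOTAL = 179_000_025_042

# (number of species, genes per genome) for groups with a legible species count
legible_groups = {
    "plants": (600_000, 20_000),
    "molluscs, worms, arachnids, etc.": (200_000, 20_000),
    "vertebrates": (200_000, 25_000),
}
joke_terms = 25_000 + 42  # Craig Venter + Douglas Adams

legible_sum = sum(species * genes for species, genes in legible_groups.values())
remainder = TOTAL - legible_sum - joke_terms

print(f"legible groups: {legible_sum:,}")
print(f"left for bacteria/archaea, protists, insects and fungi: {remainder:,}")
```

The legible groups contribute 21 billion, so the remaining 158 billion comes from the four groups whose counts are not shown here.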
When will UniProtKB be complete?
• Swiss-Prot:
In July 2009: 500’000 entries;
In 2013: 1 million entries;
In 2026 (40th anniversary): 10 million entries;
In 2036 (50th anniversary): 100 million entries.
– In May 2080 TrEMBL will have reached 10 billion entries;
– Excel cannot compute when we will reach 179 billion;
– But we are confident that these dates are worthless, as new
sequencing techniques will have turned all of these projections
into a very futile exercise!
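For what it is worth, the Swiss-Prot milestones above imply a tenfold increase per decade between 2026 and 2036. The sketch below assumes that this rate simply continues (our assumption, not a claim of the talk) and asks when 179 billion would be reached. The function name and its defaults are hypothetical.

```python
import math

def year_reaching(target, ref_year=2036, ref_entries=100e6, tenfold_years=10):
    """Year at which `target` entries is reached under constant exponential growth.

    Defaults encode the 2036 milestone (100 million entries) and a tenfold
    increase every 10 years, as implied by the 2026 and 2036 milestones.
    """
    decades = math.log10(target / ref_entries)
    return ref_year + decades * tenfold_years

print(round(year_reaching(179_000_025_042), 1))
```

Under that assumption the answer falls in the late 2060s. It applies the Swiss-Prot rate only, so it is not comparable with the TrEMBL date, and it is exactly as futile as the other projections.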
• Sequences are the bread of Swiss-Prot. And yes: annotations
are the butter!;
• >99% of the protein sequences originate from
translation of mRNA or genomic sequences;
• Do we still need manual intervention to cater for
sequences or can we just build smart filters to
obtain those we want from TrEMBL?
So what is the current status?
• A snapshot of the situation:
28’200 entries with 82’000 sequence conflicts;
2’600 entries with corrected frameshifts;
15’100 entries with corrected initiation sites;
4’300 entries with other sequence ‘problems’.
• At least 43’000 entries (19% of Swiss-Prot)
required a minimal amount of curation effort so
as to obtain the “correct” sequence.
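A small consistency check on the snapshot, as a sketch: the per-category counts add up to more than 43'000, presumably because one entry can appear in several categories (our reading; the slide does not say so). The 19% figure also gives the implied size of Swiss-Prot at the time.

```python
# Counts from the snapshot above (entries per category of sequence correction)
categories = {
    "sequence conflicts": 28_200,
    "corrected frameshifts": 2_600,
    "corrected initiation sites": 15_100,
    "other sequence problems": 4_300,
}
curated_entries = 43_000   # "at least 43'000 entries"
curated_fraction = 0.19    # "19% of Swiss-Prot"

category_sum = sum(categories.values())
implied_total = curated_entries / curated_fraction

print(f"sum over categories: {category_sum:,}")
print(f"excess over 43'000 entries: {category_sum - curated_entries:,}")
print(f"implied size of Swiss-Prot: about {implied_total:,.0f} entries")
```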
Quality of protein information from genome projects
• Let’s look at proteins originating from 3 different
genome projects:
– Drosophila: the example of what a curated (thanks to
FlyBase) genome effort should look like: only 1.8% of
the gene models conflict with what we have in Swiss-Prot;
– Arabidopsis: a typical example of a genome where lots
of work was spent on annotation at the time it was
sequenced, but where nothing has been done since (at
least in the public view): 19.5% of the gene models are
– Tetraodon nigroviridis: the typical example of a quick
and dirty automatic run through a genome with no
manual intervention: >90% of the gene models produce
incorrect proteins.
Human sequence entries as an example
• We have about 14’500 human entries in
– 4’300 entries contain information about 8’000
splice variants;
– 4’600 entries contain information about 27’000
sequence variants;
– 7’500 entries contain information about 22’000
sequence conflicts;
– On average, each human entry is produced by
merging together sequence information from 6.2
different nucleotide sequence entries.
Take home message
• Producing a clean set of sequences is not a
trivial task;
• It is not getting easier as more and more
types of sequence data get submitted;
• It is important to pursue our efforts in making
sure we provide to our users the most
correct set of sequences for a given
Post-translational modifications (PTMs)
• While sequences are important, they are generally not fully
representative of the final ‘biological entity’: most proteins
are the target of PTMs;
• PTMs are important at various levels, including the 3D
structure, interactions, subcellular location and also the
• The story of the integration of PTMs in Swiss-Prot consists
of 3 distinct parts;
• 1st part: a long time ago in a distant proteogalaxy:
The 2nd phase: 2000 to 2005
• Complete overhaul and significant extension of a
controlled vocabulary for PTMs;
• Creation of a PTM annotation program within
the Swiss-Prot groups at SIB and EBI;
• Development of new tools (Sulfinator, DGPI) for
the prediction of some PTMs;
• Massive clean up and re-annotation of many
classes of PTMs.
The expanding world of PTMs
• We now have 283 different PTM descriptions
(excluding processing, disulfide bonds and
glycosylation events).
The new document listing post-translational modifications
Contains many information items and is available in html format
or by ftp in tab-delimited format.
Finally LSEs for PTMs!
• Finally «Proteoman» has arrived! And PTM
information can now be obtained from results of
proteomics large scale experiments (LSE);
• In the past 12 months we have added about 6’000
experimental PTMs using data originating from
some of these projects.
But LSEs are not so easy to deal with
• Mundane issues in the incorporation of LSE PTM data:
– Quality:
• Trying to assess whether the methodology really allows the detection of in vivo modifications;
• How many false positives are expected (this figure is often absent or very well hidden!);
– Accessing the data:
• Often in supplementary material tables and in a variety of formats (HTML
tables, Excel spreadsheets, etc.);
• With a variety of identifiers (UniProtKB, NCBI gi, pID, etc.);
– Sanity checking:
• Making sure that the right sequence position is modified;
• Does it make sense in the biological context;
– Propagating the information to orthologs.
• So the big issue is how will we be able to scale up and deal
with the expected increase in the number of such projects!
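The "right sequence position" check above lends itself to automation. A minimal sketch, assuming a deliberately simplified table of the residues on which each modification is most commonly reported: the table, the function name and the example sequence are all hypothetical, and real curation would need a far richer rule set.

```python
# Simplified, illustrative table: residues on which each modification is
# most commonly reported. Not a complete or authoritative vocabulary.
EXPECTED_RESIDUES = {
    "phosphorylation": {"S", "T", "Y"},
    "N-linked glycosylation": {"N"},
    "sulfation": {"Y"},
}

def check_ptm_site(sequence, position, ptm):
    """Return (ok, message) for a reported PTM at a 1-based sequence position."""
    if not 1 <= position <= len(sequence):
        return False, f"position {position} is outside the sequence (length {len(sequence)})"
    residue = sequence[position - 1]
    expected = EXPECTED_RESIDUES.get(ptm)
    if expected is None:
        return False, f"no rule for {ptm!r}; needs manual review"
    if residue not in expected:
        return False, f"{ptm} reported on {residue}{position}, expected one of {sorted(expected)}"
    return True, f"{ptm} on {residue}{position} is plausible"

# Made-up sequence and reported sites, for illustration only
seq = "MKSAYTNG"
for pos, ptm in [(3, "phosphorylation"), (4, "phosphorylation"), (12, "sulfation")]:
    print(pos, ptm, check_ptm_site(seq, pos, ptm))
```

A check like this catches off-by-one positions and identifier mix-ups. Whether a site makes sense in its biological context still needs an annotator.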
Cross-references: then
• The ‘DR’ lines were introduced in release 4 in
April 1987; they first linked Swiss-Prot to
• They were instrumental in the development of
SRS by Thure Etzold in the early 90’s;
• And also for ExPASy, the first web server in the
life sciences in 1993.
[Figure: Swiss-Prot explicit links to other resources. Labels: organism-specific gene; genome annotation; sequence databases; enzyme and pathway; family and domain; 3D structure; 2D-gel databases; protein family/group; PTM databases]
Cross-references: now
• There are now cross-references from Swiss-Prot to
74 different databases (6 more are in the pipeline);
• Almost 3 million DR lines: an average of 12 per
• Many other links to external resources are also
available through the OX (NCBI taxonomy), RX
(PubMed, DOI), CC («Web resource» topic) and
FT lines (dbSNP);
• Cross-references are not only a means to help
navigate between resources, they sometimes add
information to the entries.
Examples of cross-references that provide information
• The cross-references to the Gene Ontology (GO):
GO; GO:0005634; C:nucleus; ISS.
GO; GO:0005515; F:protein binding; IPI.
GO; GO:0007165; P:signal transduction; TAS.
• The PDB cross-references include information on the
mapping of the structure on the sequence:
PDB; 1QQG; X-ray; A/B=4-267.
• The cross-references to domain databases include
information on the name/acronyms of the domains and the
number of occurrences of these domains:
PROSITE; PS50026; EGF_3; 2.
PROSITE; PS50092; TSP1; 3.
PROSITE; PS01208; VWFC_1; 1.
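Because the extra fields follow a regular pattern, the information in these cross-references can be extracted mechanically. A minimal sketch, assuming the lines are written exactly as on this slide (without the leading "DR" line code) and that fields are separated by semicolons and terminated by a period. The function names are ours.

```python
def parse_xref(line):
    """Split one cross-reference line into database, identifier and extra fields."""
    fields = [f.strip() for f in line.strip().rstrip(".").split(";")]
    return {"db": fields[0], "id": fields[1], "extra": fields[2:]}

def parse_pdb_range(chain_range):
    """Parse a PDB mapping such as 'A/B=4-267' into (chains, start, end)."""
    chains, _, span = chain_range.partition("=")
    start, end = span.split("-")
    return chains.split("/"), int(start), int(end)

examples = [
    "GO; GO:0005634; C:nucleus; ISS.",
    "PDB; 1QQG; X-ray; A/B=4-267.",
    "PROSITE; PS50092; TSP1; 3.",
]
for line in examples:
    print(parse_xref(line))

# The last field of the PDB line maps the structure onto the sequence
pdb = parse_xref(examples[1])
print(parse_pdb_range(pdb["extra"][-1]))
```

For the GO line the extra fields are the term (with its C/F/P aspect prefix) and the evidence code. For PROSITE they are the domain name and the number of occurrences.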
From sequences to structures... and back!
• Efficient bidirectional links between UniProtKB
and PDB/MSD are very important;
• Currently 10’000 Swiss-Prot entries are linked to
30’200 PDB entries;
• These links are constantly updated and verified; the
converse is unfortunately not yet true;
• We have always made use of 3D structure
information to help in the annotation process;
• But we are only now starting to systematically mine
3D structures to extract various information such as
disulfide bonds, metal-binding sites, active sites,
So what is the future of cross-references?
• Will we really need hard-coded cross-references in the
• Can we gradually replace some of them by computed
«on the fly» links using referenceable objects?
• Will we make more use of client-server systems such
as the distributed annotation system (DAS)?
• The answer is obviously dependent on
• But the Life Sciences are still living in the
dark ages of the tower of Babel
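To make the idea of computed «on the fly» links concrete: instead of storing a link in every entry, a resolver can build it from a database name and a stable identifier when needed. A minimal sketch only; the URL templates are example.org placeholders, not real endpoints, and the function name is hypothetical.

```python
# Hypothetical URL templates (example.org placeholders, not real endpoints).
LINK_TEMPLATES = {
    "PDB": "https://example.org/pdb/{id}",
    "GO": "https://example.org/go/{id}",
}

def compute_link(db, identifier):
    """Build a link from a database name and identifier, or None if the database is unknown."""
    template = LINK_TEMPLATES.get(db)
    return template.format(id=identifier) if template else None

print(compute_link("PDB", "1QQG"))
print(compute_link("GO", "GO:0005634"))
print(compute_link("UnknownDB", "X1"))
```

This only works if every resource exposes stable, referenceable identifiers.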
CVs and ontologies
• Since the very beginning of Swiss-Prot we have been
building a growing set of controlled vocabularies;
• Species, strains, plasmids, journals, tissues, PTMs,
domain names and, of course, keywords are all
«under control» (see posters SP117 and SP120);
• We are very well advanced in the process of having a
CV for pathways (see the UniPathway poster);
• We are now tackling the problems of protein and
gene names (see poster SP118). But this is of course
not very easy!
Do we need annotations?
• Annotators spend a big part of their time capturing and
synthesizing a huge amount of «functional» information;
• For example we populate Swiss-Prot with data relevant to
Role and function of the proteins;
Subcellular location;
Interactions (binary and “complex”);
Tissue specificity, developmental stage;
Involvement in diseases.
• We have much «anecdotal» evidence that users find this
very important and that this is one of the important
hallmarks of Swiss-Prot. Yet is this really true?
Do we need annotations? – part 2
• This is a time consuming process and we will
never be complete and up-to-date;
• Many users want quick answers that are easy to «summarize»,
yet the more detailed an entry becomes,
the harder it is to transform it into a
summarizable entity;
• We are often the victims of the «FASTA format
syndrome»: users expect everything important
about a protein to be available in the header of a
FASTA format entry!;
• So should we continue?
Yes we need annotation!
• Because (among many other reasons):
– Automated annotation is the only way to transfer
knowledge from a model organism to a less studied
– To apply such techniques safely one needs template
entries that are representative of the state of the
– While literature mining tools could be conceived as a
way to automatically build a summary view of the
knowledge around a given protein, these techniques are
not yet powerful enough to create a coherent synthetic
– Literature mining tools also require the existence of
well annotated (corpus) entries.
From pull to push...
• For more than 20 years now we have
been «pulling» information and
knowledge from various sources, but
mainly from the literature;
• It is now time to make sure that the
next 20 years will be defined by the
fact that researchers «push» their
results and the interpretation of their
results into the knowledgebase.
• An attempt to get the community to directly
submit information on the proteins that they are
• Using a Wikipedia-type model/interface;
• Will first be «field-tested» in the yeast community;
• We are hopeful, yet realistic: only a small
percentage of life science researchers will take the time and
are altruistic enough to fully participate in such a
Grey grey matter
• Many life scientists who know the
molecular world and are computer-proficient are reaching retirement age;
• Some want to continue to play a role in the
advancement of research, yet they will not be
able to do lab work anymore;
• We should offer them the tools necessary for
them to contribute to the annotation process.
Anabelle and Asterix
• Two important tools could contribute to the
democratization of Swiss-Prot style annotation:
– Anabelle: a web based protein sequence analysis
– Asterix: the new Swiss-Prot editor.
Anabelle selection module
[Screenshot: Anabelle viewer layout. Callouts: link to entry NiceProt view; Blast (full) entry; link to InterPro; link to most similar entry NiceProt view; align most similar entry with entry; Blast uncharted region; link to domain original database; more links!]
And here is what the user gets
But what about the rest of the life
• We saw how we could get parents (adopt a
protein) and grandparents (grey matter
count) involved, but what about the
• …the young researchers, those who are
active in producing new knowledge?
Two carrots, a stick and lots of
• The carrots:
– Making sure that granting agencies see favorably
the involvement of researchers in the process of
submitting information to databases;
– The same criteria should be considered by any
hiring or promotion committee;
• The stick: getting journal editors to refuse to
publish a paper if the results have
not been submitted to the relevant knowledge
• Everyone should feel concerned;
• Awareness of the content and usage of
knowledge resources is a prerequisite for doing any
type of «serious» research in the field of
molecular life sciences;
• Organizations such as EMBNet, EBI, SIB,
NCBI, NIG should continue and strengthen their
«outreach» efforts;
• We (database providers) should do more in terms
of providing tutorials (on-line and on-site).
An important issue…
• The process of developing a data resource for
the Life Sciences is akin to the work of medieval
copyists, Renaissance encyclopedists or the
19th century development of the OED: it is a very
tedious, manually intensive, long-term job…
How to get funding for knowledge
infrastructures in the life sciences?
• Funding knowledge resources is difficult:
– It’s a very long term process;
– It’s not prestigious;
– And it’s not cheap!
And it’s not only databases that are
Service groups are
also at risk
Proposition for a new tax
• Each grant proposal for a high-throughput data-producing project would be obliged to set aside a
predefined percentage of the grant money to help
cover the cost of storing and managing the
produced data;
• How this money would be redistributed is not
trivial to define and even less to implement;
• The priority would be to use this tax as a financial
tool to help fund the data repositories.
The tax for Biomolecular data archival
The 6 observations of a « databaser »
1. Your task will be much more complex and far bigger than
you ever thought it could be;
2. If your database is successful and useful to the user
community, then you will have to dedicate all your efforts to
developing it for a much longer period of time than you would
have thought possible;
3. You will always wonder why life scientists abhor complying
with nomenclature guidelines or standardization efforts that
would simplify your life and theirs;
4. You will have to continually fight to obtain a minimal amount
of funding;
5. As with any service effort, you will be told far more about what
you do wrong than about what you do right;
6. But when you see how useful your efforts are to your
users, all the above drawbacks will lose their importance!!
