Seminar MIG, INRA, Jouy-en

PIR Seminar, Georgetown University,
Washington DC, 12.11.2003
Information integration
for
Swiss-Prot annotation
Anne-Lise Veuthey
Swiss Institute of Bioinformatics
The simplified story of
a Swiss-Prot entry
Some data are not submitted to the public databases !!
Delayed or cancelled…
cDNAs, genomes, … -> EMBLnew -> EMBL -> CDS -> TrEMBLnew -> TrEMBL -> SWISS-PROT
TrEMBLnew -> TrEMBL: « Automated »
• Redundancy check (merge)
• Family attribution (InterPro)
• Annotation (computer)
TrEMBL -> SWISS-PROT: « Manual »
• Redundancy (merge, conflicts)
• Annotation (manual)
• SWISS-PROT tools (macros…)
• SWISS-PROT documentation
• Medline
• Databases (MIM, MGD….)
• Brainstorming
Once in SWISS-PROT, an entry is no longer in TrEMBL, but it remains in EMBL (archive).
CDS: proposed and submitted to EMBL by authors or by genome projects (can be experimentally
proven or derived from gene prediction programs). TrEMBL neither translates DNA sequences nor
uses gene prediction programs: it only takes the CDS proposed by the submitting authors in the EMBL entry.
Annotation projects
• Human Proteomics Initiative (HPI)
• HAMAP: microbial annotation
• Plant Proteome Annotation Project (PPAP)
• Fungi annotation
• Tox-Prot: annotation of toxin proteins
PIR seminar, Washington
12/11/2003
HPI Goals
• Annotation of all known human proteins;
• Annotation of all mammalian orthologs.
With a particular emphasis on:
– Alternative splicing;
– Polymorphisms;
– PTMs;
– Structural information.
http://www.expasy.org/sprot/hpi/hpi_stat.html
Swiss-Prot / TrEMBL
Chromosome by chromosome
http://www.ebi.ac.uk/proteome/HUMAN/
[Figure: bar chart of % in Swiss-Prot (0 to 100) per chromosome (1 to 22, X, Y), with a 49% level marked]
Total number of entries (SP+Tr) per chromosome:
1: 1'640
2: 1'044
3: 879
4: 630
5: 745
6: 969
7: 752
8: 549
9: 617
10: 594
11: 976
12: 852
13: 275
14: 508
15: 469
16: 706
17: 538
18: 229
19: 1'121
20: 554
21: 179
22: 405
X: 645
Y: 61
Problem of information overflow
• 136 356 entries in Swiss-Prot
• But over 1 million sequences in TrEMBL are waiting for manual
curation before they can be incorporated into Swiss-Prot
• The number of TrEMBL entries is increasing exponentially
Solutions
• Development of quality automated annotation (HAMAP)
• Improvement of manual annotation efficiency by:
– Bioinformatic tools integration (Anabelle)
– Text mining tools to help literature screening (BioMinT)
HAMAP
High-quality Automated and Manual Annotation
of microbial Proteomes.
More than 150 complete genomes are available in
public databases. Collectively they encode more than
350’000 protein sequences. Such a large amount of
sequences makes classical manual annotation an
intractable task.
HAMAP handles annotation of:
• Microbial proteomes (eubacteria and archaea)
• Plastid proteomes (cyanelle and chloroplast)
• Soon extended to mitochondrial and viral proteomes
HAMAP goals
Annotate automatically, with a high
throughput but no decrease in quality,
proteins from complete microbial genomes
which either belong to a family or have no
similarities
Improve human-machine interaction for the
manual annotation of remaining proteins
Provide integrated information at the
level of a microbial genome
HAMAP
pipeline
125 genomes
360'000 proteins
914 families
965 profiles
32'000 hits
HAMAP modules
• Upstream interface: Cleanup & Redundancy Module (Operational)
  Fix errors in TrEMBLnew entries. Merge redundancy with Swiss-Prot.
• FAM: Family Assignment Module (Operational)
  Maintain HAMAP profiles, assign new family members. Detect ORFans.
• SAM: Sequence Annotation Module (Operational)
  Given a family rule and a set of entries, produce annotated entries based on the rule. Also annotate ORFans.
• CAM: Complex Annotation Module (Prototype)
  Reliably annotate complex families.
• GAM: Genome Awareness (herbs) (In development)
  Warn about missing proteins, potential problems with paralogs and inconsistent pathways.
• Downstream interface: External communication (Operational)
  Build an interactive web site to present HAMAP data and allow retrieval of complete proteomes.
FAM : The family assignment module
Similarity searches
• Profile searches against db of families
• Searches in PROSITE, Pfam, TIGRFAM,…
• BLAST against the protein db
Outcomes
• Belongs to a family (high sequence similarity, correct length, characteristic features) -> Family attribution -> SAM module
• May belong to a family (sub-threshold score, incorrect length, problems with features) -> Family attribution -> Manual correction -> SAM module
• Other similarities: no similarities to known families, but significant BLAST matches or matches in PROSITE, Pfam,… -> New family/function? -> Manual annotation
• No similarities (except to ORFans in close species) -> ORFan -> SAM module
Family Assignment Module
Profile generation
Profiles are automatically generated from the
manually curated family alignments
Profile calibration
Check for profile accuracy
Set cutoff = lowest score among family members
No false positives allowed in Swiss-Prot
Run against TrEMBL and report results
Identify trusted / member / weak matches
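The calibration steps above can be sketched as follows. This is a minimal illustration, not the HAMAP implementation: the `Hit` type, the function names and the two margins used to separate trusted, member and weak matches are assumptions made for the example; only the rule "cutoff = lowest score among members" and the "no false positives" check come from the slide.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    accession: str
    score: float


def calibrate_cutoff(member_scores):
    """Cutoff = lowest profile score among the curated family members."""
    if not member_scores:
        raise ValueError("a family needs at least one curated member")
    return min(member_scores)


def false_positives(non_member_scores, cutoff):
    """Scores of known non-members that would pass the cutoff (should be empty)."""
    return [s for s in non_member_scores if s >= cutoff]


def classify_hit(hit, cutoff, trusted_margin=1.2, weak_margin=0.8):
    """Label a hit relative to the cutoff.

    Both margins are hypothetical and assume positive scores.
    """
    if hit.score >= cutoff * trusted_margin:
        return "trusted"
    if hit.score >= cutoff:
        return "member"
    if hit.score >= cutoff * weak_margin:
        return "weak"
    return "none"


cutoff = calibrate_cutoff([31.5, 28.0, 40.2])
labels = [classify_hit(Hit(f"Q{i}", s), cutoff) for i, s in enumerate([35.0, 30.0, 23.0, 10.0])]
```

A profile would only be accepted when `false_positives` returns an empty list for the known non-members in Swiss-Prot.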
SAM : The sequence annotation module
FAM module (similarity searches)
• Match to a defined family -> Family rule-based annotation
• No matches -> Prediction program-based annotation of hypothetical proteins (ORFans)
Sequence Annotation Module
SAM annotates a protein entry based on
A family rule and a sequence, or
An ORFan sequence
Conditions in family rules are fully handled
External programs are called for computed features :
ProfileScan, TMHMM, SignalP, REP, COILS, as
instructed in the family rule
Computed features are combined using exclusion rules
Warnings are generated in case of problems
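How computed features might be combined with exclusion rules can be sketched as below. The exclusion table is hypothetical (here, a predicted signal peptide suppresses an overlapping transmembrane prediction); the actual rules used by SAM are not given in the slides, and the `Feature` type and `combine` function are names invented for this example.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    kind: str    # e.g. "SIGNAL", "TRANSMEM"
    start: int   # 1-based, inclusive
    end: int
    source: str  # program that produced the prediction


# Hypothetical exclusion table: a feature of the key kind suppresses
# overlapping features of the listed kinds.
EXCLUDES = {"SIGNAL": {"TRANSMEM"}}


def overlaps(a, b):
    return a.start <= b.end and b.start <= a.end


def combine(features):
    """Return (kept features, warnings) after applying the exclusion table."""
    kept, warnings = [], []
    for f in features:
        blockers = [g for g in features
                    if f.kind in EXCLUDES.get(g.kind, ()) and overlaps(f, g)]
        if blockers:
            warnings.append(
                f"{f.kind} {f.start}-{f.end} ({f.source}) dropped: "
                f"overlaps {blockers[0].kind}")
        else:
            kept.append(f)
    return kept, warnings


predictions = [
    Feature("SIGNAL", 1, 21, "SignalP"),
    Feature("TRANSMEM", 5, 27, "TMHMM"),
    Feature("TRANSMEM", 1157, 1179, "TMHMM"),
]
kept, warnings = combine(predictions)
```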
HAMAP family database
HAMAP families have the following
properties:
Single function, or function that can be
deduced from taxonomy or
presence/absence of metabolic
pathways
Family members (except divergent
sequences) have more similarity to one
another than to other proteins
HAMAP family database
All the annotation can be inferred from:
The rules given in the family (including
conditions and dependencies)
The alignment of the new entry with the existing
alignment in the family (for feature propagation)
The entry taxonomy
Additional information about metabolic
pathways and number of membranes
Results of ad hoc analysis tools
Computed features
• Inteins (upstream of identification step): 3 PROSITE profiles: C+N-terminal junctions + homing endonuclease
• Signal type 1: SignalP neural network (von Heijne)
• Signal type 2 (lipoprotein): PROSITE rule; later, Pedro Gonnet's patoseq program ?
• Signal type 4 (pilin): PROSITE pattern
• Transmembrane regions: TMHMM (Krogh)
• Coiled coils: Modified COILS (Lupas)
• ATP/GTP binding sites: Walker A profile (to do)
• LPxTG cell-wall anchor: PROSITE profile
• Repeats (ANK, LRR, TPR, WD, Kelch): REP (Bork & Andrade) profiles + program
Complex Annotation Module
Extension of SAM to allow:
Hierarchical assignment
ABC transporter profile => generic annotation
+ CysA subfamily profile => specific annotation
Modular annotation
Domain rules
Metamotifs to express domain arrangements
Is now part of the Anabelle project
Not limited to bacteria
Also usable for semi-automated annotation
Genome Awareness Module
Runs after a complete proteome has been
automatically annotated
Warn about missing proteins, potential
problems with paralogs and inconsistent
pathways
Implemented in an expert system:
HERBS project
HERBS: Hamap Expert Rule-Based System
• A Sibelius project in collaboration with INRIA,
Grenoble.
• Expert system using JESS, a rule engine written in Java
• Description of metabolic pathways using DAGs (directed acyclic graphs), e.g. pyrimidine
synthesis, step4 and step5, with HAMAP families MF_00225, MF_01211, MF_01208 and MF_00224
http://www.expasy.org/sprot/hamap/
Annotation
of eukaryotic proteomes
Increase in complexity:
• Multi-domain proteins
• Molecular complexes
• Multigenic families
• Multiple subcellular locations
• Alternative splicing
• Post-translational modifications
• Pathways complexity
In conclusion, automated annotation procedures are
far harder to implement
Swiss-Prot Entry Creation Flowchart
Step 1: Selecting a new protein to annotate
Step 2: Searching PubMed
Step 3: Reading papers
Step 4: Analysing sequence with bioinformatic tools
Step 5: Data integration
Step 6: Creation of a new/updated Swiss-Prot entry
Where is the computer helping?
Anabelle
• Rule-based system for manual, semi-automated
and automatic annotation
• Handles proteins with a complex domain architecture,
as well as simple ones.
• Unlike HAMAP, it also annotates proteins
which are not yet characterized, but which
share defined domains.
• It consists of 3 modules
Anabelle
• Module 1
• Runs protein sequence analysis tools (PC,
internal server, external servers)
• General feature format (gff)
AEGP_RAT  TMHMM|2.0  Transmembrane  1157  1179  .  .  .  Level 0 ; NterLocation "O" ; Category "TOPOLOGY"
AEGP_RAT  ps_scan|v1.5  PS50068  230  268  10.938  .  .  Name "LDLRA_2" ; Level 0 ; RawScore 993 ; FeatureFrom 1 ; FeatureTo -1 ; Sequence "RCPLGHHHCQNKACVEPHQLCDGEDNCGDSSDEdpLICS" ; KnownFalsePos 1 ; InterProID "IPR002172" ; Category "DOMAIN"
AEGP_RAT  SignalP-NN|v2.0|euk  Signal  1  21  .  .  .  Level 0 ; C-max "0.761,22,Y" ; Y-max "0.783,22,Y" ; S-max "0.992,12,Y" ; S-mean "0.934,Y" ; Category "TOPOLOGY"
www.sanger.ac.uk/Software/formats/GFF
• Module 2
• Module 3
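Reading such GFF records into a uniform structure can be sketched as below. The sketch assumes the standard nine tab-separated GFF columns and the `key value ; key value` attribute style seen in the examples above; the function name and the returned dictionary layout are illustrative, not part of Anabelle. Splitting attributes on `;` is a simplification that would break on quoted values containing a semicolon.

```python
def parse_gff_line(line):
    """Parse one GFF record into a dictionary (simplified attribute handling)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    seqname, source, feature, start, end, score, strand, frame, attrs = cols

    attributes = {}
    for item in attrs.split(";"):
        item = item.strip()
        if not item:
            continue
        # First token is the key, the rest is the (optionally quoted) value.
        key, _, value = item.partition(" ")
        attributes[key] = value.strip().strip('"')

    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),
        "end": int(end),
        "score": None if score == "." else float(score),
        "strand": strand,
        "frame": frame,
        "attributes": attributes,
    }


record = parse_gff_line(
    'AEGP_RAT\tTMHMM|2.0\tTransmembrane\t1157\t1179\t.\t.\t.\t'
    'Level 0 ; NterLocation "O" ; Category "TOPOLOGY"'
)
```

Because every analysis tool reports in the same format, the later modules can filter and display results by `source`, `feature` or `Category` without tool-specific parsers.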
Anabelle
• Module 1
• Module 2
• automatic pre-selection
• visualizer
• post-processing tool
• Module 3
Anabelle
• Module 1
• Module 2
• Module 3
• applies annotation rules
Selection of methods
Annotation section of an annotation rule for the
serine protease domain
CC   -!- SIMILARITY: Belongs to peptidase family S1.
case <FTGroup:1>
KW   Hydrolase
KW   Serine protease
end case
FT   From: PS50240
FT   DOMAIN     from     to       Serine protease #.
FT   Group: 1
FT   ACT_SITE     42     42       Charge relay system (Potential).
FT   Group: 1; Condition: H
FT   ACT_SITE     91     91       Charge relay system (Potential).
FT   Group: 1; Condition: D
FT   ACT_SITE    205    205       Charge relay system (Potential).
FT   Group: 1; Condition: S
FT   DISULFID     27     43       Potential.
FT   DISULFID    111    192       Potential.
FT   DISULFID    156    171       Potential.
FT   DISULFID    182    210       Potential.
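How the conditional part of such a rule might be evaluated can be sketched as follows. This is an interpretation, not the documented semantics of the rule format: it assumes that `Condition: H` requires a histidine at the given position, that `case <FTGroup:1>` holds only when every site of group 1 is satisfied, and that rule positions have already been mapped onto the target sequence (in practice they would be transferred through the profile alignment). All names are hypothetical.

```python
# Group 1 sites of the rule, reduced to (position, required residue).
GROUP_1_SITES = [(42, "H"), (91, "D"), (205, "S")]
KEYWORDS = ["Hydrolase", "Serine protease"]


def satisfied_sites(sequence, sites):
    """Sites whose required residue is found at the 1-based position."""
    return [(pos, res) for pos, res in sites
            if 1 <= pos <= len(sequence) and sequence[pos - 1] == res]


def apply_rule(sequence):
    """Return the annotation lines the rule would contribute for a sequence."""
    lines = ["CC   -!- SIMILARITY: Belongs to peptidase family S1."]
    ok = satisfied_sites(sequence, GROUP_1_SITES)
    if len(ok) == len(GROUP_1_SITES):  # assumed meaning of: case <FTGroup:1>
        lines += [f"KW   {kw}" for kw in KEYWORDS]
        lines += [
            f"FT   ACT_SITE {pos:>6} {pos:>6}       Charge relay system (Potential)."
            for pos, _ in ok
        ]
    return lines
```

Under these assumptions, a domain that has lost one residue of the catalytic triad still receives the similarity comment but neither the keywords nor the active-site features.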
UniRule format and repository
• We are currently developing a general
rule format (UniRule format), which we
propose that all partners use for
annotation rules
• We are creating a central CVS
repository accessible to all rule curators
in which we will store all rules in the
UniRule format.
Advantages (1/2)
Common types of rules …
Protein family, Protein, Domain, Site
… for safe rule interactions
hierarchy of rules (a rule can supersede
another one)
triggering of a rule from another rule
etc…
Advantages (2/2)
Common tools, …
rule creation, update and maintenance
syntax checking
non-redundancy checking
etc…
… storage and access
Various groups of rules
• HAMAP rules (microbes and plastids)
• MitoRules: mitochondrial proteins
• ProRules (complex protein families)
• AnaRules (rules for domains from many
programs such as SignalP, TMHMM)
• Rulebase
• PIR rules
• … and maybe groups of rules for plants,
yeast, viruses, etc.
UniRule format and central CVS storage will
be discussed during the 2nd AAM at the EBI in
December
Where is the computer helping?
Medical Annotation
Annotation of genetic diseases and
polymorphisms in human
Particularities:
• Annotation has to be as complete as
possible, implying large number of
retrieved documents
• Only mutations that do not drastically
modify the sequence are kept (no stop or
frameshift mutations)
Medical Annotation Tool:
Specifications
• Query interface adapted to search
mutation-related articles
• Classification of retrieved documents
• Information extraction from documents
• Mutation position control and Swiss-Prot line generation
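The last specification, mutation position control followed by line generation, can be sketched as below. The `A34E` input notation, the function names and the toy sequence are assumptions; the column layout of the generated line approximates the classic Swiss-Prot FT line and is not taken from the tool itself. Restricting the pattern to the 20 standard amino-acid letters also rejects stop-codon substitutions.

```python
import re

AA = "ACDEFGHIKLMNPQRSTVWY"
# Hypothetical input notation: wild-type residue, position, variant residue.
MUTATION = re.compile(rf"^([{AA}])(\d+)([{AA}])$")


def check_mutation(sequence, mutation):
    """Return (wild, pos, variant) if the mutation fits the sequence, else None."""
    m = MUTATION.match(mutation)
    if not m:
        return None  # not a simple amino-acid substitution
    wild, pos, variant = m.group(1), int(m.group(2)), m.group(3)
    if pos < 1 or pos > len(sequence) or sequence[pos - 1] != wild:
        return None  # reported position does not match the Swiss-Prot sequence
    return wild, pos, variant


def variant_line(wild, pos, variant, note):
    """Format an FT VARIANT line (approximate classic column layout)."""
    return f"FT   {'VARIANT':<8} {pos:>6} {pos:>6}       {wild}->{variant} ({note})."


toy_sequence = "G" * 33 + "A" + "G" * 10  # made-up sequence with Ala at position 34
checked = check_mutation(toy_sequence, "A34E")
line = variant_line(*checked, "IN LDHB DEFICIENCY") if checked else None
```

A mismatch between the residue reported in the paper and the residue in the entry usually signals a numbering offset (signal peptide, isoform), which is why the check returns `None` rather than generating a line.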
Query interface
Classifier Training Corpus
Dataset:
• 2192 abstracts
• from 32 genes in
• three categories
• “Good” - relevant for annotation (14%)
• “Bad” - irrelevant to annotation (70%)
• “Unclear” - no decision could be made about
abstract’s relevance (16%)
• Used to train a hierarchical probabilistic classifier (in
collaboration with XRCE)
[Figure: pie chart of the total distribution: Bad 70%, Unclear 16%, Good 14%, Unclassified 0%]
Classifier Architecture
Input: retrieved documents
Processing stages shown in the diagram:
• morpho-syntactic analysis
• term extraction
• feature selection
• normalization: mutation points, gene & protein synonyms
• cascade of categorizers
Output: reordered documents
Classifier Performance
Classified list evaluation
• “Good”: Precision = 49%, Recall = 84%
• “Bad”: Precision = 96%, Recall = 82%
[Figure: precision (0% to 100%) against recall point (0% to 100%) for the probabilistic 2 stage classifier and for pubmed]
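For readers less familiar with these measures, the short sketch below states the usual definitions of precision and recall and derives the F1 score (their harmonic mean) from the figures reported above. F1 does not appear in the slides; it is added here only as a convenient single-number summary, and the function names are illustrative.

```python
def precision(true_pos, false_pos):
    """Fraction of documents labelled with a class that really belong to it."""
    total = true_pos + false_pos
    return true_pos / total if total else 0.0


def recall(true_pos, false_neg):
    """Fraction of documents of a class that the classifier actually found."""
    total = true_pos + false_neg
    return true_pos / total if total else 0.0


def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Precision and recall as reported on the slide.
reported = {"Good": (0.49, 0.84), "Bad": (0.96, 0.82)}
f1_scores = {label: f1(p, r) for label, (p, r) in reported.items()}
```

The asymmetry is what one would expect for a screening aid: a high recall on “Good” abstracts matters more to a curator than precision, since missing a relevant paper costs more than skimming an irrelevant one.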
The BioMinT project
• 3-year FP5 European Project, started January 2003
• Official web site: www.biomint.org
• 5 teams involved:
– University of Manchester (UK, coordinator)
– PharmaDM (Belgium)
– Austrian Research Institute for Artificial Intelligence (Austria)
– University of Geneva, AI Lab (Switzerland)
– University of Antwerp, CNTS (Belgium)
– Swiss Institute of Bioinformatics (Switzerland)
The goals of BioMinT
To develop a generic text mining tool that:
– interprets different types of queries
– retrieves relevant documents from the biological literature
– extracts the required information
– outputs the result as a database slot filler or as a structured
report
The tool will thus provide two essential research support
services:
1. A curator's assistant: it will accelerate, by partially
automating, the annotation and update of bio-databases;
2. A researcher's assistant: it will generate readable reports in
response to queries from biological researchers.
Architecture of the prototype
A curator’s assistant for proteomic
databases
• Swiss-Prot protein sequence knowledgebase
• PRINTS protein « fingerprints » database
Both hand-annotated by trained biologists.
Information retrieval and
query management
A semantic meta-query engine built around the legacy
search engines of servers such as PubMed. It
operates in two steps:
1) An expansion of the initial query with synonyms or
related terms derived either from domain ontologies
or from existing database entries.
2) A filtering and ranking of documents retrieved from
these servers using task-specific heuristics.
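The two steps can be sketched as follows. The synonym table, the `&` query syntax and the term-frequency ranking are stand-ins chosen for illustration; the real engine derives related terms from ontologies and database entries and uses task-specific heuristics that the slides do not detail.

```python
from itertools import product


def expand_query(terms, synonyms):
    """Step 1: one AND-query per combination of synonyms of the query terms."""
    options = [synonyms.get(term.lower(), [term]) for term in terms]
    return [" & ".join(combo) for combo in product(*options)]


def rank(documents, query_terms):
    """Step 2: drop documents without any query term, order the rest by term frequency."""
    def score(doc):
        text = doc.lower()
        return sum(text.count(term.lower()) for term in query_terms)

    return sorted((d for d in documents if score(d) > 0), key=score, reverse=True)


# Hypothetical synonym table.
SYNONYMS = {"catabolism": ["catabolism", "degradation", "breakdown"]}

queries = expand_query(["Acetoin", "catabolism"], SYNONYMS)
ranked = rank(
    ["acetoin breakdown and acetoin export", "unrelated abstract", "acetoin"],
    ["acetoin", "breakdown"],
)
```

Each expanded query would be sent to the underlying search engine separately, and the pooled results filtered and ranked before being shown to the curator.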
Query interface prototype
Keyword -> expanded queries
• Acetoin catabolism -> Acetoin & catabolism; Acetoin & degradation; Acetoin & breakdown
• Acetylation -> Acetylat*
• Acetylcholine receptor inhibitor -> Acetylcholine & receptor & inhibitor
• Actin-binding -> Actin binding; Actin & bind*
• Acute phase -> Acute-phase
• Acyltransferase -> Acyl & transfer*
• ADP-ribosylation -> ADP-ribosylat*; ADP & ribosylat*
• Alginate biosynthesis -> Alginate & biosynthesis; Alginate & synthesis
• Alkaloid metabolism -> Alkaloid & metabolism; caffeine & metabolism; nicotine & metabolism; morphine & metabolism
Species: Human, Mouse, Rat, Drosophila, Yeast, Escherichia coli, Bacillus subtilis, A. thaliana, C. elegans
BLAST
Developed by Pavel Dobrokhotov in the framework of SwissProt medical annotation:
Bioinformatics 19(suppl. 1): i91-i94 (ISMB 2003)
Query Result organisation
• Filtering, classification and clustering according to
different rules:
- Selecting the articles with target protein as primary
subject:
  • Key phrases: cloned <X>, <X> was characterised, isolate(d) <X>, <X>
(is/,) a new protein, identify(ied) <X>
  • Frequency of gene/protein name and synonyms in the abstract
- Journal/Journal type/Publication date
- Same lab/authors
- Species
- Articles on key annotation topics: PTMs, mutations,
function, diseases
• Results presented to the curator by information type
Information extraction
• Based on data mining software using Inductive Logic
Programming (ILP).
• Using basic text analysis tools developed by CNTS
and Tilburg University, namely a Memory-Based
Shallow Parser (MBSP) based on a Machine
Learning package (TiMBL)
• Using biomedical terminology resources, publicly
available or provided by end users.
Topics for information extraction
• Function(s) and role(s); enzymes:
a. Catalytic activity (if EC number)
b. Cofactor
c. Enzyme regulation
d. Pathway
• Subunit (Protein/protein interactions)
• Subcellular location
• Alternative products (alt. splicing, alt. initiation, RNA editing)
• Tissue specificity (Northern and Western results)
• Developmental stage
• Induction
• Domain
• Post-translational modifications (PTM)
• Mass spectrometry
• Polymorphisms
• Disease
• Biotechnology
• Pharmaceutical
• Miscellaneous
• Similarities
• Caution
• Database (specialized cross-references)
Benchmark environment
for training and evaluation
We need a corpus of supervised abstracts
• To train the text-mining tools
• To elaborate rules for specific information
extraction
What do we need to tag ?
• Fragments of, or whole sentences describing
information useful for protein annotation
• Specific words describing a specific type of
information
Where is the computer helping?
Textual annotation edition
• Currently CRISP: text editor enhanced
with in-house macros
• Development of Spedit: an XML editor
dedicated to Swiss-Prot annotation
• Complete integration of sequence
analysis and text-mining tools in a
single graphical user interface
Technical issues
• Conversion to a relational database on Oracle;
• Production of an XML format;
• Links to GO terms;
• Update of PDB links;
• Standardisation of a number of topics of CC lines;
• Reformatting of GN line and ‘ALTERNATIVE
PRODUCTS’ CC topic;
• Controlled vocabulary for PTMs in FT lines;
• Ongoing conversion to mixed case lines;
• Enhancement of Swiss-Prot entry views.
VARIANT 34 34 A->E (IN LDHB DEFICIENCY). /FTId=VAR_004174
Effects of Variations in Human Proteins
on Local Substructure and Functionality
Database
Modelling
Populate
Assess
Evaluate & Analyse Effects of Point mutations on 3D-Structures
THE MODSNP DATABASE
[Schema diagram: tables linked by 1-to-n relations]
• sp_entries: AC, ID, 3D, flag.-dis
• sp_sequences: isoid, AC (FK), Sequence, Checksum
• sp_variants: Ftid, Type, seqpos, Fromaa, Toaa, Dis_status, isoflag
• diseases: ftid, omimid, Full-dis, Abbrev-dis
• relseqft: Isoid (FK), Ftid (FK)
• relseqpdb: isoid (FK), chainid (FK), Alid, Evalue, Seqfrom, Seqto, Is_selected
• models: mid, filename, Type, Isoid (FK), Ftid (FK), Chainid (FK), Builtversion, Method, Created_date
• analysis: Local substructure analysis and many others!
• chsaw-chains: Pdb Code, Residues, Amino acids, Heterogens, Solvent, sequence, organism, Compound, eccode, chainid
• pdb struct_info: Code, deposited, experiment, resolution, header, title, revdat, Tempid, Scop-id, Cath-id
Acknowledgements
HAMAP:
Alexandre Gattiker, Karine Michoud, Virginie Lesaux, Corinne Lachaize,
Anne Morgat, Isabelle Phan
ANABELLE:
Brigitte Boeckmann, Alexandre Gattiker, Xavier Martin, Nicolas Hulo,
Christian Sigrist, Silvia Braconi
TEXT MINING:
Pavel Dobrokhotov, Eric Gaussier (XRCE), Cyril Goutte (XRCE),
Violaine Pillet, Marc Zehnder
EDITOR:
Alain Gateau, Stéphanie Federico, Brigitte Boeckmann
VARIANT MAPPING:
Holger Scheib, Lina Yip
SWISS-PROT group
at ISB and EBI
Barcelona 2002
Amos Bairoch
ISB
http://www.expasy.org/people/swissprot.html#people