PRO and IntAct protein complexes

Sandra Orchard
PRO Meeting, June 19, 2014
Project aims
IntAct
• Reference resource for macromolecular complexes
• Create species-specific stable complex identifiers
• Central reference resource to link all related efforts (UniProt for protein complexes)
• Dedicated online Portal to search and visualise (text and graphics), can also export to Cytoscape
• Emphasis on major model organisms
• Stored in relational database (IntAct) with existing update mechanisms
• Download format – PSI-MI XML, can write user-defined format from database
PRO
• Reference ontology for protein complexes
• PRO terms can span across species or be species-specific
• Stable ontology term identifier
• Searched and viewed (text-only) in existing PRO website – can export to Cytoscape
• Emphasis on major model organisms
• Stored in an internal database with existing update mechanism
• Download – ontology (OBO, OWL), annotation file
Complex definition
IntAct
• A stable set (2 or more, to include homodimers) of interacting protein molecules which
– can be co-purified and
– have been shown to exist as a functional unit in vivo.
• Non-protein molecules (e.g. small molecules, nucleic acids) may also be present in the complex.
PRO
• Protein complexes, including homo complexes (e.g. homodimers)
• Complexes may include non-protein components
Does not include
• Molecules associated in a pulldown / coimmunoprecipitation but with no functional link
• Enzyme/substrate, receptor/ligand or similar transient interactions (except when required for stable complex formation)
Data capture
IntAct
• Participants – proteins (UniProt), small molecules (ChEBI), nucleic acids (ChEBI, (RNAcentral))
• Participant features – binding domains, required PTMs
• Species
• Stoichiometry – when known
• Topology (linked binding domains) – when known
• Function – free text
• Assembly, e.g. homodimer, heterotetramer…
• Physical properties, e.g. MW, size, topology/assembly
• Ligands
• Disease
PRO
• Participants – proteins (PR identifiers), small molecules (ChEBI), nucleic acids (?)
• PTMs – implicit in PR term
• Species
• Cardinality – indicates stoichiometry
• Definition – free text “composed of x number of subunits of various components”
• Disease and functional properties are added as an annotation in PAF if known
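The fields captured on this slide can be pictured as a simple record. The following is a minimal sketch only: the class names, field names and the placeholder accession are hypothetical and do not reflect the actual IntAct or PRO schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Participant:
    identifier: str                      # e.g. a UniProt accession, ChEBI ID or PR identifier
    source: str                          # which resource the identifier comes from
    stoichiometry: Optional[int] = None  # None when not known
    features: List[str] = field(default_factory=list)  # binding domains, required PTMs


@dataclass
class ComplexRecord:
    name: str
    species_taxid: int
    participants: List[Participant]
    function: str = ""                   # free text
    assembly: Optional[str] = None       # e.g. "homodimer", "heterotetramer"
    ligands: List[str] = field(default_factory=list)
    diseases: List[str] = field(default_factory=list)

    def total_subunits(self) -> Optional[int]:
        """Sum of stoichiometries, or None if any participant's count is unknown."""
        counts = [p.stoichiometry for p in self.participants]
        if any(c is None for c in counts):
            return None
        return sum(counts)


# Illustrative values only: a made-up homodimer, not a real database entry.
example = ComplexRecord(
    name="Example homodimer",
    species_taxid=9606,  # NCBI taxonomy ID for human
    participants=[Participant("P00000", "UniProt", stoichiometry=2)],
    assembly="homodimer",
)
print(example.total_subunits())  # 2
```

Keeping stoichiometry optional mirrors the "when known" qualifier on the slide: a record with an unknown count reports no total rather than a misleading one.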
Data capture - nomenclature
IntAct
Recommended name:
- most recognisable name from literature, use GO component if specific complex exists in GO
Systematic name:
- based on Reactome’s new CV names – ‘string of (species-specific) gene names with stoichiometry’
Synonyms:
- all other names the complex may be known as
PRO
Name:
- most recognisable name from literature, use GO component if specific complex exists in GO
Systematic name:
- based on Reactome’s new CV names (stoichiometry not incorporated)
Synonyms:
- all other names the complex may be known as
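The difference between the two systematic-name styles (gene names with or without stoichiometry) can be illustrated with a short sketch. The `NAME(n)` notation, the `:` separator, the function name and the gene names below are all assumptions made for illustration; the slides do not specify the exact Reactome-based format.

```python
def systematic_name(subunits, with_stoichiometry=True, sep=":"):
    """Join gene names into one string, in the order given.

    subunits: list of (gene_name, copies) pairs; copies may be None if unknown.
    The "NAME(n)" notation and the separator are hypothetical.
    """
    parts = []
    for gene, copies in subunits:
        if with_stoichiometry and copies is not None:
            parts.append(f"{gene}({copies})")
        else:
            parts.append(gene)
    return sep.join(parts)


# Placeholder gene names, not a real complex.
subunits = [("GENEA", 2), ("GENEB", 1)]
print(systematic_name(subunits))                            # GENEA(2):GENEB(1)
print(systematic_name(subunits, with_stoichiometry=False))  # GENEA:GENEB
```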
Data capture - xrefs
IntAct
• GO (BP, MF, CC) – manually curated to complex, not just imported from proteins
• Cross references to experimental evidence: IMEx (+ non-IMEx IntAct, MINT & DIP, MatrixDB)
• Reactome (human)
• PDB, EMDB
• ChEMBL
• PubMed (for further information)
• IntEnz (enzyme EC numbers)
• OMIM/EFO (disease)
• TaxID
PRO
• GO – used as parent term
• Reactome (human)
• PubMed
• TaxID
Data capture - evidence
IntAct
ECO codes
• ECO:0000353 (physical interaction evidence used in manual assertion) – full experimental evidence for the complex added to the entry
• ECO:0000266 (sequence orthology evidence used in manual assertion) + inferred from “complex ID” – across species
• ECO:0000250 (sequence similarity evidence used in manual assertion) + inferred from “complex ID” – within species
• ECO:0000306 (inference from background scientific knowledge used in manual assertion) – modelled
PRO
ECO codes
• EXP experimentally verified → ECO:0000269 (experimental evidence used in manual assertion)
• ECO:0000088 (biological system reconstruction) – modelled
Linked binding domains
PTMs annotated using MOD
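The ECO codes listed for the two resources can be held in a small lookup table. This is a sketch only: the codes and labels are those given on the evidence slide, while the category keys (`"experimental"`, `"orthology"`, `"similarity"`, `"modelled"`) and the function name are shorthand invented here.

```python
# ECO identifiers and labels as listed on the evidence slide.
ECO_LABELS = {
    "ECO:0000353": "physical interaction evidence used in manual assertion",
    "ECO:0000266": "sequence orthology evidence used in manual assertion",
    "ECO:0000250": "sequence similarity evidence used in manual assertion",
    "ECO:0000306": "inference from background scientific knowledge used in manual assertion",
    "ECO:0000269": "experimental evidence used in manual assertion",
    "ECO:0000088": "biological system reconstruction",
}

# Which code each resource uses for which kind of evidence (keys are hypothetical shorthand).
USAGE = {
    "IntAct": {
        "experimental": "ECO:0000353",
        "orthology": "ECO:0000266",   # inferred across species
        "similarity": "ECO:0000250",  # inferred within species
        "modelled": "ECO:0000306",
    },
    "PRO": {
        "experimental": "ECO:0000269",
        "modelled": "ECO:0000088",
    },
}


def evidence_code(resource, kind):
    """Return (ECO id, label) for a resource and evidence kind, or None if not listed."""
    code = USAGE.get(resource, {}).get(kind)
    if code is None:
        return None
    return code, ECO_LABELS[code]


print(evidence_code("IntAct", "modelled"))
print(evidence_code("PRO", "orthology"))  # None: not listed for PRO on the slide
```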
Protein Complex Statistics
Species                     IntAct   PRO
Human                       226      215
Mouse                       173      93
Rat                         49       0
Cow                         3        0
Drosophila melanogaster     12       0
C. elegans                  (2)      0
Xenopus laevis              3        0
Arabidopsis thaliana        0        8
Saccharomyces cerevisiae    301
S. pombe                    16
E. coli                     87       0
Total                       870      352 (+215 protein agnostic parent terms)
17
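As a quick consistency check on the statistics, the per-species counts can be summed and compared with the stated totals. The sketch below uses only the numbers given on the slide; treating the bracketed C. elegans value "(2)" as excluded from the IntAct total is an assumption, chosen because the remaining values then add up to 870. No PRO values are given for the two yeast rows, so they are left out rather than guessed.

```python
# Per-species IntAct counts from the slide; the bracketed C. elegans "(2)" is left out.
intact = {
    "Human": 226, "Mouse": 173, "Rat": 49, "Cow": 3,
    "Drosophila melanogaster": 12, "Xenopus laevis": 3,
    "Arabidopsis thaliana": 0, "Saccharomyces cerevisiae": 301,
    "S. pombe": 16, "E. coli": 87,
}

# PRO counts that appear on the slide (species listed as 0 omitted, yeast rows not given).
pro_listed = {"Human": 215, "Mouse": 93, "Arabidopsis thaliana": 8}

intact_total = sum(intact.values())
pro_listed_total = sum(pro_listed.values())
pro_stated_total = 352

print(intact_total)                         # 870, matches the stated IntAct total
print(pro_listed_total)                     # 316
print(pro_stated_total - pro_listed_total)  # 36 not accounted for by the listed rows
```

The 36 complexes not covered by the listed PRO rows may correspond to the yeast entries, but the slide text as extracted does not confirm this.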
IntAct - Parallel Annotation of complexes in GO
• At project start: > 400 complex terms in GO Cellular Component branch, mostly children of GO:0043234 protein complex – lacks hierarchical structure
• Collaboration agreed with GO to provide more structured annotation whilst also adding new terms
• Parent terms mainly based on complex function e.g. enzyme complexes, transcription factor complexes
– TermGenie (TG) Standard Form <protein_complex_by_activity>
– Otherwise use TG Free Form
– Some complexes still direct children of GO:0043234 protein complex
• Adding “logical definitions” / “cross-products” / “extensions”
– e.g. “capable of x activity”
IntAct Data Sources/Curation priorities
• PDBe – almost 1000 complexes imported, more planned. Experimental data can be imported at same time (N.B. many of these have proven to be partial/sub-complexes so will not directly translate into 1000 finished products. Also many from non-model organisms) – curation ongoing, PDB collaborating and may add curation effort
• ChEMBL – 81 drug-target complexes imported – curation complete, more to come with each release (mostly human/mouse/rat)
• MatrixDB (Sylvie Ricard-Blum, Univ. of Lyon) – list of extracellular complexes – curation complete (human/mouse)
• Reactome – mapping into PSI-MI XML → direct import into IntAct ongoing, issue with sets has now been resolved (human)
• Mining UniProt (Bernd Roechert, SIB – manually) – curation ongoing (yeast)
• Manual curation from IMEx DBs & the literature
• SGD yeast complex list – SGD contributing curation effort
• EcoCyc – complex list has been dumped into Excel sheet, useful as ‘to do’ list but not suitable for import – curation ongoing (E. coli)
PRO data sources/Curation priorities
• Toll-like receptor pathway. Curation of both human and mouse
(Anna Maria Masci at Duke and Veronica Shamovsky/Peter
D’Eustachio from Reactome)
• Complexes for the Brassinosteroid signaling pathway in Arabidopsis
(Mengxi Lv and Cecilia Arighi at University of Delaware)
• Complexes in TGF-beta signaling pathway (Cecilia, human
complexes aligned with Reactome data)
• Complexes in cell cycle spindle checkpoint for human and yeast
(Karen Ross, University of Delaware)
• Beta catenin related complexes (Irem Celen, University of Delaware)
What else has IntAct to offer?
1. Web-based editorial tool
- Institution/curator management system enables attribution of effort to institute
- APIs to UniProt, ChEBI (RNAcentral when available) allow immediate import of interactors plus selected xrefs
- OLS enables enrichment of CV terms, e.g. GO names when AC not used for import
- Pull-down menus restrict CV usage to appropriate fields
- Intelligent ‘syntax checker’ limits curator error
What else has IntAct to offer?
2. JIRA issue tracker
- enables tracking of complexes requiring QC by 2nd curator
- used to request addition of new complex GO terms or hierarchy re-organization, which is then undertaken via TermGenie
- could additionally be used to request IntAct curation of experimental evidence papers not already in database(s)
What else has IntAct to offer?
3. Automated update process
- protein update system. Tracks changes to
underlying sequence with every release of
UniProt and remaps features (binding domains,
PTMs) accordingly. Withdrawn proteins
(TrEMBL) remapped.
- CV update system.
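One way to picture the feature remapping step of the protein update system is the sketch below. It is not IntAct's actual algorithm, which the slides do not describe: the function name, the exact-match strategy and the toy sequences are all assumptions. The idea is to keep a feature's coordinates if the residues are unchanged at the same position, otherwise relocate the feature when its subsequence occurs exactly once in the new sequence, and otherwise leave it for a curator.

```python
def remap_feature(old_seq, new_seq, start, end):
    """Return the (start, end) of a feature in new_seq, or None if it cannot be placed.

    Positions are 1-based and inclusive. Hypothetical strategy: exact match only.
    """
    segment = old_seq[start - 1:end]
    if new_seq[start - 1:end] == segment:
        return (start, end)  # unchanged at the same position
    first = new_seq.find(segment)
    if first == -1 or new_seq.find(segment, first + 1) != -1:
        return None          # lost or ambiguous: needs a curator
    return (first + 1, first + len(segment))


# Toy sequences: two residues are inserted at the N-terminus in the new version.
old = "MKTAYIAKQR"
new = "MAMKTAYIAKQR"
print(remap_feature(old, new, 4, 7))  # (6, 9)
```

Returning `None` instead of a best guess keeps ambiguous cases visible for manual review.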
Proposal for joint curation
1. IntAct/PRO to align curation rules – discussions
ongoing
2. IntAct to import PRO complexes & update all
existing to joint rule set
3. IntAct to produce script to write complexes to
flat file format
4. PRO curators to train on IntAct editor – all new
complexes curated in IntAct
5. IntAct responsible for long-term data
maintenance
Proposal for joint curation
6. IntAct to write flat files for new/updated
complexes with every release
7. PRO to map UniProt + MOD → PR IDs
8. PRO to create ontology, including addition of
parent ‘species-agnostic’ terms
(IntAct will have a “super-complex (Reactome ‘set’ equivalent)/complex/subcomplex” relationship – OK for PRO?)
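Steps 3 and 6 of the proposal involve a script that writes complexes to a flat file. The slides do not define that format, so the following is only a sketch under stated assumptions: the tab-delimited layout, the column names, the `|`-separated participant notation and the example identifiers are all hypothetical.

```python
import csv
import io


def write_flat_file(complexes, handle):
    """Write one tab-delimited row per complex (hypothetical column layout)."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["complex_id", "name", "taxid", "participants"])
    for c in complexes:
        # Participants as "accession(stoichiometry)" joined by "|" (assumed notation).
        participants = "|".join(f"{acc}({n})" for acc, n in c["participants"])
        writer.writerow([c["complex_id"], c["name"], c["taxid"], participants])


# Placeholder record, not a real database entry.
complexes = [
    {
        "complex_id": "EXAMPLE-1",
        "name": "Example homodimer",
        "taxid": 9606,
        "participants": [("P00000", 2)],
    }
]

buffer = io.StringIO()
write_flat_file(complexes, buffer)
print(buffer.getvalue())
```

Writing to any file-like handle lets the same function serve both a full export and a per-release file of new or updated complexes.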