Transcript Ontologies for biological annotation
Weaving and untangling the GO
• is_a completeness ~9 slides • granularity & BP ~3 slides • Linking MF to BP ~15 slides • Sensu ~13 slides – linguistic qualifiers vs relations • Linking GO to other ontologies ~40 slides – GO+Cell
Tangled DAGs and complexity
• paths increasing • GO process in general has multiple axes of classification – qualifier (-ve, +ve) – anatomy • structural • spatial – chemical • structural • functional
is_a completeness
GO and is_a completeness
• Why? • What’s wrong with every term having at least one is_a or part_of parent? – this is the way we’ve always done things
Ontologies should be complete
• No errors of omission • is_a completeness is the ontologically correct thing to do – every entity type is a subtype of some other thing • Accurate ontologies = accurate queries – currently a query for “find all kinds of development” does not return “ovarian follicle development” • this is wrong
Missing is_as hinder common tool use
• We should play nicely with the others in the playground • Most (non-GOC) tools expect is_a completeness – GO looks funny when viewed in other tools • the standard is to show only is_a relations in default tree view – missing is_as break reasoners
Filling is_a gaps brings practical benefits
• Easier for tools to find inconsistencies in GO • We can start to untangle displays
Example: current displays mix relations
• it’s a mess
untangling is_a and part_of
• difficult if is_a hierarchy is incomplete – is_a orphans show up at root node in pure is_a display • not everything must have an asserted part_of parent – can infer from is_a parents
The new complete cellular component
• Current CC: – 277 is_a orphans / 1688 terms – avg is-a-paths-to-root 1.4 – avg mixed-paths-to-root 6.97
• Jane’s fixed CC: – 0 is_a orphans – avg is-a-paths-to-root 3.36 – avg mixed-paths-to-root 38.6
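The orphan and paths-to-root figures above are graph statistics, and the idea can be sketched on a toy DAG. This is a minimal sketch, not the script used to produce the slide's numbers: the four-term ontology, the choice of root, and the helper names `count_paths`, `is_a_orphans` and `average_paths` are all assumptions made for illustration.

```python
# Toy ontology (hypothetical data): child -> list of (relation, parent).
ROOT = "cellular_component"
EDGES = {
    "cell": [("is_a", ROOT)],
    "organelle": [("is_a", ROOT), ("part_of", "cell")],
    "nucleus": [("is_a", "organelle"), ("part_of", "cell")],
    "chromosome": [("part_of", "nucleus")],  # no is_a parent: an is_a orphan
}


def count_paths(term, relations, edges=EDGES, root=ROOT):
    """Count distinct paths from term up to root using only the given relations."""
    if term == root:
        return 1
    return sum(
        count_paths(parent, relations, edges, root)
        for relation, parent in edges.get(term, [])
        if relation in relations
    )


def is_a_orphans(edges=EDGES):
    """Non-root terms with no asserted is_a parent."""
    return sorted(
        term
        for term, parents in edges.items()
        if not any(relation == "is_a" for relation, _ in parents)
    )


def average_paths(relations, edges=EDGES):
    """Mean number of paths to root over all non-root terms."""
    counts = [count_paths(term, relations, edges) for term in edges]
    return sum(counts) / len(counts)


if __name__ == "__main__":
    print("is_a orphans:", is_a_orphans())
    print("avg is_a paths to root:", average_paths({"is_a"}))
    print("avg mixed paths to root:", average_paths({"is_a", "part_of"}))
```

In this toy graph the orphan contributes zero is_a paths, which is one way an incomplete is_a hierarchy can pull the is_a average down while the mixed average stays high.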
Granularity and the organisation of GO:BP
Fixing the upper levels of BP
• The upper portion of any ontology is very important for organisation • Design decisions percolate down • Many users exploring GO top-down see this first • Diamonds are particularly bad in the upper level – they significantly increase tangledness
[Diagram: upper levels of BP, showing biological process, cellular process, physiological process, cellular physiological process, organismal physiological process, others]
• biological process – A phenomenon marked by changes that lead to a particular result, mediated by one or more gene products
• cellular process – Processes that are carried out at the cellular level, but are not necessarily restricted to a single cell. For example, cell communication occurs among more than one cell, but occurs at the cellular level
• physiological process – Those processes specifically pertinent to the functioning of integrated living units: cells, tissues, organs, and organisms
• cellular physiological process – The processes pertinent to the integrated function of a cell
• organismal physiological process – The processes pertinent to the function of an organism above the cellular level; includes the integrated processes of tissues and organs
Consider… (long term view)
• Making top division by granularity of the process itself – biological process • molecular level process? • cellular level process • (multi-cellular) level process • These types are disjoint • But what about physiological process? – this is not disjoint from the granularity of the process itself
Relations between GO ontologies
Outline
• We focus on MF & BP • biological example from David • the types and relations in reality – maintaining the ALL-SOME definition of relations • how should this be implemented in the GO? – what links should be manifested – retain some level of redundancy, or eliminate it?
[Figure: histidine catabolism pathway annotated with GO terms]
GO:0006548 Histidine catabolism
GO:0019557 Histidine catabolism to glutamate and formate
GO:0004397 Histidine ammonia
GO:0050480 imidazolopropionase activity
GO:0050416 Formimidoylglutamate deiminase activity
Other labels in figure: formamide, Formimidoyl, formiminotetrahydrofolate
Overbeek, et al. The Subsystems Approach to Genome Annotation and its Use in the Project to Annotate 1000 Genomes. NAR 2005, 33-17:5691-5702
Ontological Representation
• I will try and be clear when I am talking about – types in reality – types we wish to manifest as terms in the GO (or in other ontologies) • all GO terms should be types • not all types need to have terms created - we limit for practical reasons
What are the relations in reality?
• Between types in the same ontology, different levels of granularity – part_of • Between functions and processes (at the same level of granularity) – functioning_of • Between component and function – has_function • Between process and component – located_in
What are the instances and relations in reality?
[Diagram, instance level]
• some gene product instance has_function some molecular function instance (function)
• some molecular functionING instance (process) functioning_of that function instance
• some molecular functionING instance part_of some multistep process instance
What are the types and type level relations in reality?
[Diagram, type level]
• some type of gene product has_function some type of molecular function (function)
• some type of molecular functionING (process) functioning_of that type of function
• some type of multistep process part (direction?) some type of molecular functionING
types example
issues: – ALL-SOME structure
[Diagram]
• histidine catabolism (process, coarse) part? histidine ammonia lyase reaction (process, fine)
• histidine ammonia lyase reaction functioning_of histidine ammonia lyase function (function)
What are the types and relations in reality?
issues: – ALL-SOME structure
[Diagram]
• histidine catabolism to glutamate and formate (process, coarse) has part? Formimidoylglutamate deiminase reaction (process, fine)
• Formimidoylglutamate deiminase reaction functioning_of Formimidoylglutamate deiminase function (function)
We want to capture these real relationships between biological types
• Between granular levels • Between orthogonal ontologies • But first we must be clear on the definitions of these types, and which types should be manifested as GO terms
Can we just manifest this in the GO?
issues: – not all function terms have a corresponding functionING term – even if they do, redundancy is generally to be avoided
[Diagram]
• some type of multistep process (coarse) has part(?) some type of molecular functionING (process, fine)
• some type of molecular functionING functioning_of some type of molecular function (function)
We already have some redundancy
• function & process redundancy • iron transport (BP) • iron transporter (MF) • function & component redundancy • voltage-gated ion channel function • voltage-gated ion channel complex • If we retain this redundancy, these relations can be trivially added • But we don’t always have this redundancy – not all functions have a corresponding functioning term
Manifest shortcut relationships
• one relation standing for two
[Diagram]
• shortcut: some type of process (coarse) has part(?) some type of molecular function (function)
• bypassed: some type of molecular functionING (process, fine) functioning_of that function
most functionings are implicit
• current paradigm
[Diagram]
• histidine catabolism (process, coarse) has part(?) histidine ammonia lyase function (function)
• histidine ammonia lyase REACTION (process, fine) functioning_of histidine ammonia lyase function
When do we manifest functions and processes?
• Need consistent stable policy • Nothing in function ontology should have activity suffix – even though to a biochemist activity==potential, this is still confusing • Beyond this, do we retain current policy – some redundancy • Or take a more extreme approach – eliminate redundancy – eliminate current ‘activity’ MF terms and manifest corresponding reaction terms in BP (Amelia)
‘purist process’ approach
[Diagram]
• some type of gene product has_function histidine ammonia lyase function (function)
• histidine ammonia lyase reaction (process) functioning_of histidine ammonia lyase function
• histidine catabolism part histidine ammonia lyase reaction
When is it safe to eliminate redundancy?
• Does functioning always imply function?
– iron transport does not imply iron transporter – but we could still extend annotation to allow for specification of functioning-as-function • Reactions and other ‘single-step’ processes involving no helper – function and corresponding functioning imply one another • Redundancy between function and component should be retained • Any obsoletion obviously causes disruption
Difficult functionings
• Structural constituents • functionING happens at lower level of granularity than is covered by GO • these will not be linked to process - for now
Implementation
• Still need to curate the actual links – trivial links can be computed automatically • Can proceed independently of resolving ontological issues – most likely retain current policy re: manifesting terms – need to maintain 3 kinds of links • granular (part, same ontology) • functioning_of (function and functioning) • ‘diagonal’ – ALL-SOME definition
Sensu
Sensu - outline
• Original use – A linguistic qualifier – denote differing community usage of a terminological entity (a term) • Perverted use – A type qualifier – Used for when the part_of structure is specific to an organism type • The fix – provide separate mechanisms for each
Terms vs kinds
• The term ‘term’ is confusing – Term (sensu GO) – Term (sensu normal usage) • strings, tokens • GO is not a terminology • A GO ID identifies a type – a kind of entity – a universal (as opposed to instance) – more specific than a class – but not a concept
Sensu - original usage
• Sometimes the same string refers to different types – nucleus (sensu particle physicist) – nucleus (sensu astrophysicist) – nucleus (sensu biologist) • Canonical GO example: – bud • no longer relevant, terms obsoleted – trichome
Linguistic qualifiers are about language, not biological reality
• No ontological requirement for linguistically related terms to be ontologically related – current GO docs are not correct • trichome, sensu plant community – should not state that there is some biological relation between an instance of a trichome and the plant community
The original usage has been conflated
• Organism type specificity is a genuine challenge for the GO – ‘contextual’ part_ofs – e.g. X part_of Y in species Z • Sensu has been wrongly recruited to fix this – standard pattern: • X, sensu Z part_of Y • X, sensu Z is_a Z • Two problems – conflation of meaning of sensu – conflation results in lack of precision • “as in, but not restricted to taxon” not rigorous enough
Two problems, two solutions
• Retain sensu as a linguistic qualifier only – re-interpret as: sensu S community – no requirement for taxon IDs – no ontology structure requirements • Introduce a new relation for genuine organism-type specific terms – in_organism – standard inference rules can be used • e.g. – X in_organism X’, Y in_organism Y’, X is_a Y <=> X’ is_a Y’
Contextual synonyms
[Term]
name: trichome (sensu insecta)
synonym: EXACT “hair” []
synonym: EXACT “trichome” [] {context=insecta}
def: “A polarized cellular extension that covers much of the insect epidermis”

[Term]
name: trichome (sensu plant)
synonym: EXACT “trichome” [] {context=plant}
def: “An outgrowth from the epidermis. Trichomes vary in size and complexity and include hairs, scales, and other structures and may be glandular. In Arabidopsis, patterning of trichome development is not random but does not appear to be lineage-based like stomata”
Advantages
• Lexical qualifiers are dealt with using lexical oboedit tags • No need to be as specific as a taxon – only as specific as is needed to decontextualise • No false reasoning is done over synonyms – cellular component types and cell types should not be siblings • Big user-friendliness win?
– Displays customised for particular users may choose to display contextual exact synonyms in place of the wordier sensu name
in_organism
• Standard ALL-SOME definition: • Type level definition: – P in_organism O • for all instances p of P, there exists some organism o of type O, and some time t, such that p in_organism o at time t • More specific relation than located_in in OBO relations ontology • Standard logical rules can be applied
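[Diagram]
• thylakoid, in cyanobacteria is_a thylakoid
• thylakoid, in cyanobacteria in_organism cyanobacteria
• photosystem I, in cyanobacteria is_a photosystem I
• photosystem I, in cyanobacteria part_of thylakoid, in cyanobacteria
• photosystem I, in cyanobacteria in_organism cyanobacteria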
Open question
• Sometimes the relation between two types is largely lexical – eg trichome • Sometimes it isn’t so clear • Can we have both a relation to a taxon, and contextual synonyms? • Is ‘eye’ an exact contextual synonym for ‘compound eye’ for the arthropod community?
Practical considerations
• Use NCBI Taxonomy as our organism ontology • xref or relationship tags? – xrefs are more lightweight – relationship tags are more accurate – relationship tags would be ‘dangling’ unless organism ontology is loaded • See next section…
Composite terms in GO finally…
Composite terms - outline
• The problems inherent in composite terms and diamonds - brief review • Actively managing composite terms in GO – big change: parseable logical definitions • Implementation plan • Progress so far: logical definitions referring to cell types • Pre vs post composition – composite terms in ontologies and annotations
[Diagram: the two axes behind the cysteine biosynthesis diamond]
• process axis: biosynthesis is_a metabolism
• chemical axis: cysteine is_a serine family amino acid is_a amino acid is_a amine
• also shown: serine
Composed terms currently cause problems
– No link to external ontology term – Redundancy – Inconsistency – Extra work – Annotation bottleneck – Tangled DAGs and confusing displays • we have no way to disentangle • Solution so far: – fix errors based on results of term name parsing (Obol) • reactive, not proactive
Solution: actively manage composed terms
• Composed terms should now/soon be generated using oboedit plugin – building block terms are recorded in ontology along with composite term • Correct DAG structure can be inferred from external ontologies – placement & consistency checking automated – additional work can be automated • synonyms, text definitions
How will composite terms be recorded by oboedit?
• How do we record a definition for a composite term? – using a logical definition (computational essence) • A logical definition consists of: – a generic term (aka genus) – relationships to other terms which serve to discriminate this specific term from other is_a children of the generic term (aka differentiae) • Can be written in natural language as: – A <generic term> which <discriminating characteristics>
Example of composite term record
• cysteine biosynthesis – generic term: • biosynthesis – discriminating characteristics: • outputs cysteine – a biosynthesis process which outputs cysteine
id: GO:0019344 ! cysteine biosynthesis
intersection_of: GO:0009058 ! biosynthesis
intersection_of: outputs CHEBI:15356 ! cysteine
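The record above is meant to be parseable, so it may help to see how a tool could split it into genus and differentiae and turn it back into the natural-language pattern "A <generic term> which <discriminating characteristics>". This is a minimal sketch, assuming the stanza layout shown on the slide (one tag per line, labels after " ! "); `parse_stanza` and `to_sentence` are hypothetical helpers, not oboedit or Obol functions.

```python
STANZA = """
id: GO:0019344 ! cysteine biosynthesis
intersection_of: GO:0009058 ! biosynthesis
intersection_of: outputs CHEBI:15356 ! cysteine
"""


def parse_stanza(text):
    """Split a stanza into id, genus and differentiae (relation, id, label)."""
    record = {"id": None, "genus": None, "differentiae": []}
    for raw in text.strip().splitlines():
        line = raw.strip()
        if ": " not in line:
            continue  # skip stanza headers such as [Term] and blank lines
        body, _, comment = line.partition(" ! ")
        tag, _, value = body.partition(": ")
        parts = value.split()
        label = comment.strip() or parts[-1]  # fall back to the ID if unlabeled
        if tag == "id":
            record["id"] = (parts[0], label)
        elif tag == "intersection_of":
            if len(parts) == 1:
                # A bare term ID is the generic term (genus).
                record["genus"] = (parts[0], label)
            else:
                # relation + term ID is a discriminating characteristic.
                record["differentiae"].append((parts[0], parts[1], label))
    return record


def to_sentence(record):
    """Render the logical definition in the genus-differentia pattern."""
    clauses = " and ".join(
        f"{relation} {label}" for relation, _, label in record["differentiae"]
    )
    return f"A {record['genus'][1]} which {clauses}"


if __name__ == "__main__":
    print(to_sentence(parse_stanza(STANZA)))
```

With more than one discriminating characteristic the clauses are simply joined with "and"; a real generator would presumably need per-relation phrasing.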
Now we have the ability to untangle
• Process axis view (primary is_as, via generic term): – biological_process • metabolism – biosynthesis » cysteine biosynthesis • Process participant axis view: – amine • amino acid – serine family amino acid » cysteine • Combined view – (same as current tangled diamond lattice)
Recording the relationship is important
• Why not just a simple cross-product? – e.g. biosynthesis x cysteine • Relationships are important for reasoning and querying – Consider: • cysteine biosynthesis from serine • mRNA export from nucleus during heat stress • Without the relations, the logical definition is not specific enough – the essence is not captured
Multiple discriminating characteristics are allowed
• Cysteine biosynthesis from serine – Generic term: • biosynthesis – Discriminating characteristics: • output cysteine • input serine
intersection_of: GO:0009058
intersection_of: outputs CHEBI:15356
intersection_of: input CHEBI:17822
Composite terms can be nested
• regulation of cysteine biosynthesis
intersection_of: GO:0050789 ! regulation of biological process
intersection_of: regulates GO:0019344 ! cysteine biosynthesis

id: GO:0019344 ! cysteine biosynthesis
intersection_of: GO:0009058
intersection_of: outputs CHEBI:15356
Composite terms can optionally be manufactured in bulk
• Generic term: {metabolism, biosynthesis} • Differentia: has_output {serine, cysteine, …} • With caution… – Sparse vs dense matrices – not all combinations are types
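Bulk manufacture amounts to a cross-product of generic terms and differentia fillers, followed by a check that each combination is a real type. A minimal sketch, assuming a hypothetical `is_real_type` predicate supplied by a curator; the term names and the `manufacture` helper are illustrative only.

```python
from itertools import product

GENERIC_TERMS = ["metabolism", "biosynthesis"]
FILLERS = ["serine", "cysteine"]


def manufacture(generic_terms, fillers, relation="has_output",
                is_real_type=lambda genus, filler: True):
    """Yield candidate composite terms, skipping combinations judged not to be types."""
    for genus, filler in product(generic_terms, fillers):
        if not is_real_type(genus, filler):
            continue  # sparse matrix: not every cell corresponds to a real type
        yield {
            "name": f"{filler} {genus}",
            "genus": genus,
            "differentia": (relation, filler),
        }


if __name__ == "__main__":
    for term in manufacture(GENERIC_TERMS, FILLERS):
        print(term["name"], "=", term["genus"], "which", *term["differentia"])
```

The predicate defaults to accepting everything, which gives the dense matrix; passing a stricter one is where the "with caution" step would live.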
On the importance of necessary and sufficient conditions
• Why intersection_of ? • Why not just make normal links in the GO DAG? – normal relationships are for necessary conditions only – we want both necessary and sufficient conditions • captures the essence of the term
Normal DAG links only capture necessary conditions, not essence
[Diagram] macrophage activation, linked to immune cell activation and part_of inflammatory response
text def: A change in morphology and behavior of a macrophage resulting from exposure to a cytokine, chemokine, cellular ligand, pathogen, or soluble factor
Normal DAG links only capture necessary conditions, not essence
[Diagram] macrophage activation is_a immune cell activation • macrophage activation part_of inflammatory response • macrophage activation activates macrophage
essence captured by genus differentia
[Diagram] macrophage activation is_a immune cell activation • macrophage activation part_of inflammatory response
id: GO:macrophage_activation
intersection_of: GO:cell_activation
intersection_of: activates CL:macrophage
essence captured by genus differentia
text def: A change in morphology and behavior of a macrophage resulting from exposure to a cytokine, chemokine, cellular ligand, pathogen, or soluble factor
[Diagram] macrophage activation is_a immune cell activation • macrophage activation part_of inflammatory response
id: GO:macrophage_activation
intersection_of: GO:cell_activation
intersection_of: activates CL:macrophage
essence captured by genus differentia
[Diagram] cell activation (genus) • activates macrophage • macrophage activation is_a immune cell activation • macrophage activation part_of inflammatory response
The power of reason
• with genus-differentia definitions that are computationally parseable, we can do a lot more consistency checking
Pre- vs post- composition
• It makes sense to pre-compose terms and maintain them as part of GO • Annotations can post-compose terms if they choose to do so – MGI, DictyBase are doing this already • results remain local to MOD – AmiGO-NG will allow querying of these • The two approaches are complementary and compatible – proviso: if done properly
SO already contains composite terms
• A silenced gene is a gene which has the quality of being silenced
Plan: outline
• We want all new composite terms to be created using appropriate oboedit plugin – logical definitions automatically recorded – term management automated • Changes: – editors must now be ‘OBO-aware’ – annotators and end-users can remain unaware of changes if they choose to do so • but using the logical defs can bring benefits • But first we need to find logical definitions for all the existing composite terms
Where we were at, 2005
• Lots of terms to be retrofitted – Where to start? • Previous strategy: – Obol guesses logical def for each term – Obol uses logical def to reason • errors of omission • inconsistencies – Batch reports to curators
[Workflow diagram. Components: GO editor, oboedit (OBO editor), go.obo, cjm, obol config, obol name parser, go + ldefs, obol reasoner, ‘fixed’ go, obol report]
Obol produces genus-differentia logical definitions
[Workflow diagram. Components: GO editor, oboedit (OBO editor), go.obo, cjm, obol config, obol name parser, Ego.obo, obol reasoner, ‘fixed’ go, obol report]
Limitations of this approach
• Good as proof-of-principle • But.. – only the end results are evaluated – Obol makes the identical mistakes in guessing logical definitions each iteration – we want to evaluate and preserve the logical definitions that are generated by Obol
What we’ve been doing since then
• Focused on OBO Cell ontology • Used Obol to infer logical defs • Manually curate logical defs • Feed back results to improve Obol • Iterate and refine • Use oboedit reasoner to check consistency between GO & CellO • Next: incorporate into curation process
[Workflow diagram. Components: GO editor, oboedit (OBO editor), go.obo, cjm, obol config, obol name parser, ego-cell.obo]
Results so far
• Test set of 337 logical definitions curated – only a fraction of the composite terms in GO • Relations not finalised • Composite terms involving CellO present some interesting challenges • …but first, here’s a demo
Open issues: what relations do we use?
• We are concerned for now with relations between processes and cells – neuroblast activation & neuroblast – T cell differentiation & T cell – T cell homeostasis & T cell – cell homeostasis & homeostasis – sperm incapacitation & sperm – sperm motility & sperm
OBO Relations ontology
• OBO Relations ontology has – has_participant • sub-relations: – has_agent (active participant) – has_patient (inactive participant) » (not in obo-rel yet) – between a process and a continuant – follows standard ALL-SOME structure
has_participant
• P has_participant C if and only if: given any process p that instantiates P there is some continuant c, and some time t, such that: c instantiates C at t and c participates in p at t
• has_participant is a primitive instance-level relation between a process, a continuant, and a time at which the continuant participates in some way in the process. The relation obtains, for example, when this particular process of oxygen exchange across this particular alveolar membrane has_participant this particular sample of hemoglobin at this particular time
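The ALL-SOME reading of the type-level definition can be checked mechanically over a set of instance-level facts. The following is a minimal sketch on made-up data: the instance names, the flat tuples used for the facts, and the simplification that each process instance has exactly one type are all assumptions of this example, not part of the OBO Relations ontology.

```python
# Instance-level facts (hypothetical).
PROCESS_TYPE = {"p1": "oxygen exchange", "p2": "oxygen exchange"}
PARTICIPATES = {("hb1", "p1", 1), ("hb2", "p2", 5)}      # (continuant, process, time)
INSTANTIATES = {("hb1", "hemoglobin", 1), ("hb2", "hemoglobin", 5)}  # (continuant, type, time)


def type_level_has_participant(P, C, process_type, participates, instantiates):
    """True if every instance p of P has some c and t with c instance of C at t
    and c participating in p at t (the ALL-SOME pattern).

    Note: vacuously True when P has no recorded instances.
    """
    for p, p_type in process_type.items():
        if p_type != P:
            continue
        witnessed = any(
            process == p and (c, C, t) in instantiates
            for (c, process, t) in participates
        )
        if not witnessed:
            return False  # one counterexample instance breaks the type-level claim
    return True


if __name__ == "__main__":
    print(type_level_has_participant(
        "oxygen exchange", "hemoglobin",
        PROCESS_TYPE, PARTICIPATES, INSTANTIATES))
```

The time index matters: a continuant that participates at one time but instantiates C only at another does not count as a witness.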
Is this the appropriate relation?
neuroblast activation has_participant neuroblast
T cell differentiation has_participant T cell
T cell homeostasis has_participant T cell
cell homeostasis has_participant homeostasis
sperm incapacitation has_participant sperm
sperm motility has_participant sperm
these are all correct… …but are they too general?
more specific kinds of participation
• has_agent (has_active_participant) – As for has_participant, but with the additional condition that the component instance is causally active in the relevant process • has_patient (has_inactive_participant) – Yes, this is a daft name – The component instance is acted upon • (not yet in OBO REL)
Cell differentiation
• T cell differentiation – A cell differentiation instance in which a cell acquires_features_of T cell • problem: – not a simple relation between the process (T cell differentiation) and the cell (T cell) • 3-place relation: process, instance, type
Cell differentiation, attempt 2
• T cell differentiation has_output T cell – Compare to: • cysteine biosynthesis has_output cysteine • We should distinguish between participation relations in which the continuant relations are – transformation_of – derives_from • e.g. something made (biosynthesis) vs something transformed (differentiation)
Cell differentiation, attempt 3
• T cell differentiation has_transformed_output_participant T cell – …not exactly catchy…
has_primary_participant
• T cell differentiation has_primary_participant T cell – aka has_theme • ontologically a good relation? • Meaning partly resides in the process term • Can be migrated to other relations later
To decompose or not to decompose
• We could have a logical definition for sperm incapacitation – genus: incapacitation – differentia: has_participant sperm • Requires creating a new term – incapacitation • Not used in any other logical def • Logical def does not capture full essence – this term is a little more complex • involves at least three continuants • Instead just use a relationship to capture necessary conditions only
‘Anonymous’ terms
• border follicle cell delamination – The splitting off of border cells from the anterior epithelium • genus: delamination – no such term • we can create as ‘anonymous’ term – exists only in order to make logical definitions • ..or we can just create a normal term
Implementation
• We have 337 logical definitions (nearly) ready • When can we merge them into the GO?
adding logical defs to the GO
• Will this cause disruption to users? • gene_ontology.obo file exactly the same as before, but will have – fewer inconsistencies! – new intersection_of tags • specified in obo v1.2 • can easily be ignored by parsers • oboedit users must either: – load cell.obo, relationship.obo at same time as go.obo – OR select “allow dangling terms” • may still confuse some users – ‘anonymous’ terms
[Workflow diagram. Sources in cvs: gene_ontology_edit.obo, rel.obo, cell.obo, edited with oboedit by the GO editor and CellO editor and used directly by power users & advanced applications. A filter step produces gene_ontology.obo; normal downstream stuff (website, amigo, users) unaffected]
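The diagram shows a filter between gene_ontology_edit.obo and gene_ontology.obo without saying what it does. One plausible reading, given that intersection_of tags "can easily be ignored by parsers", is a step that drops logical-definition lines so the public file contains no references to external ontologies. The sketch below assumes exactly that and nothing more; `strip_logical_definitions` is a hypothetical helper and the stanza is made up.

```python
EDIT_FILE = """\
[Term]
id: GO:macrophage_activation
name: macrophage activation
intersection_of: GO:cell_activation
intersection_of: activates CL:macrophage
is_a: GO:immune_cell_activation
"""


def strip_logical_definitions(obo_text):
    """Return the OBO text with every intersection_of line removed.

    All other tags (id, name, is_a, relationship, ...) pass through unchanged,
    so downstream consumers see the same DAG as before.
    """
    kept = [
        line for line in obo_text.splitlines()
        if not line.startswith("intersection_of:")
    ]
    return "\n".join(kept) + "\n"


if __name__ == "__main__":
    print(strip_logical_definitions(EDIT_FILE), end="")
```

A real filter might also need to remove anonymous terms that exist only to support logical definitions; that is not handled here.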
Applications may want to take advantage of enhanced GO
• enhanced GO isn’t just to help curation • queries possible with ego: – find genes associated with blood cells • annotations to microglial cell activation – differentiation of any microglial precursor • annotations to monocyte differentiation
Post-composition
• This approach is highly compatible with post composition • We should extend the annotation format to allow denoting more specific classes – e.g. • cholesterol transport in liver – advanced applications can query this – standard applications suffer no loss – extended annotations can be used to help seed new terms in the ontology • This is already being done (MGI, Dicty) – we just want to capture this in an interoperable way
Post-composition in gene association files
• New column in file format
Gene Product | Term ID | … | Slots
AABC1 | GO:0030301 (cholesterol transport) | … | OBOREL:located_in[MA:liver]
AABC2 | GO:0048663 (neuron fate development) | … | OBOREL:has_primary_participant[FBbt:Y_neuron]
AABC3 | GO:000003 | … |
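The slot values in the new column follow a relation[filler] pattern, which a tool can parse or ignore. A minimal sketch, assuming that pattern holds for every slot; the regular expression and the `parse_slot` helper are illustrative and not part of any gene association file specification.

```python
import re

# relation (may contain a namespace prefix) followed by a bracketed filler ID.
SLOT_PATTERN = re.compile(r"^(?P<relation>[\w:]+)\[(?P<filler>[^\]]+)\]$")


def parse_slot(slot):
    """Split 'OBOREL:located_in[MA:liver]' into (relation, filler)."""
    match = SLOT_PATTERN.match(slot.strip())
    if match is None:
        raise ValueError(f"not a relation[filler] slot: {slot!r}")
    return match.group("relation"), match.group("filler")


def post_composed(term_id, slot=""):
    """Describe an annotation; tools that ignore slots just use term_id."""
    if not slot.strip():
        return {"term": term_id, "refinements": []}
    return {"term": term_id, "refinements": [parse_slot(slot)]}


if __name__ == "__main__":
    print(post_composed("GO:0030301", "OBOREL:located_in[MA:liver]"))
    print(post_composed("GO:0030301"))
```

Dropping the slot leaves a valid, less specific annotation to the pre-composed term.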
Important note on post composition
• This is not an either-or situation • We will retain pre-composed terms – terms will continue to be created for real biological types • Annotation post-composition can be used to further refine existing pre-composed terms – if the post-composed term is later created in the GO, the annotation can be automatically migrated • Tools can ignore post-composition for small loss in specificity – defaults to the current paradigm
Avoiding diamonds
• Surely larval locomotory behavior involves a diamond?
• yes, but we can disentangle the two axes of classification
Solution
• Curator asserts:
id: GO:larval_locomotory_behavior
intersection_of: GO:locomotory_behavor
intersection_of: occurs_in FBbt:larval_stage
• Oboedit infers diamond:
id: GO:larval_locomotory_behavior
intersection_of: GO:locomotory_behavor
intersection_of: occurs_in FBbt:larval_stage
is_a: GO:locomotory_behavor ! genus
is_a: GO:larval_behavior ! inferred
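The inference step can be sketched as a subsumption check between logical definitions: a term falls under another if its genus is a descendant of the other's genus and it satisfies each of the other's differentiae. This is a minimal sketch, not the oboedit reasoner: the logical definition given here for GO:larval_behavior, the asserted link from locomotory behavior to behavior, and the normalised placeholder IDs are assumptions made for the example.

```python
# Asserted is_a links between building-block terms (assumed).
IS_A = {"GO:locomotory_behavior": {"GO:behavior"}}

# Logical definitions: term -> (genus, {(relation, filler), ...}).
LDEFS = {
    "GO:larval_behavior": ("GO:behavior", {("occurs_in", "FBbt:larval_stage")}),
    "GO:larval_locomotory_behavior": (
        "GO:locomotory_behavior", {("occurs_in", "FBbt:larval_stage")}),
}


def ancestors_or_self(term, is_a):
    """The term plus everything reachable by following is_a upwards."""
    seen, stack = {term}, [term]
    while stack:
        for parent in is_a.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def subsumed_by(a, b, ldefs, is_a):
    """True if a's logical definition entails b's."""
    genus_a, diffs_a = ldefs[a]
    genus_b, diffs_b = ldefs[b]
    if genus_b not in ancestors_or_self(genus_a, is_a):
        return False
    return all(
        any(rel_a == rel_b and filler_b in ancestors_or_self(filler_a, is_a)
            for rel_a, filler_a in diffs_a)
        for rel_b, filler_b in diffs_b
    )


def infer_parents(term, ldefs, is_a):
    """Genus parent plus every other defined term that subsumes this one."""
    parents = {ldefs[term][0]}
    parents.update(
        other for other in ldefs
        if other != term and subsumed_by(term, other, ldefs, is_a)
    )
    return parents


if __name__ == "__main__":
    print(sorted(infer_parents("GO:larval_locomotory_behavior", LDEFS, IS_A)))
```

The curator asserts only the two intersection_of lines; both arms of the diamond fall out of the check. Redundant-parent pruning and equivalent terms are not handled in this sketch.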
Next Steps
• Tidy up cell logical definitions • Integrate them into curation process • Look at composite terms within GO – larval locomotory behaviour – regulation • Chemicals • Anatomical entities