Transcript NCBIO-Berkeley
Some thoughts on PATO Chris Mungall BBOP Hinxton May 2006
Outline
Motivation revisited The Ontology: PATO OBD & using PATO for annotation
Who should use PATO?
Originally: model organism mutant phenotypes
But also:
  ontology-based evolutionary systematics
  neuroscience; BIRN
  clinical uses: OMIM, clinical records
  to define terms in other ontologies, e.g. diploid cell; condensed chromosome; invasive tumor; engineered gene
Unifying goal: integration
Integrating data within and across these domains, across levels of granularity, across different perspectives
Requires rigorous formal definitions in both ontologies and annotation schemas
Some thoughts on the ontology itself
Outline
Definitions
  how do we define PATO terms?
  what exactly is it we’re defining?
is_a hierarchy
  what are the top-level distinctions?
  what are the finer grained distinctions?
  shapes and colors
It’s all about the definitions
Everything is doomed to failure without rigorous definitions
  even more so with PATO than other ontologies
OBO Foundry Principle: definitions should describe things in reality, not how terms are used
  a def should not use the word ‘describing’
Should we come up with a policy for definitions in PATO?
  currently: 19 defs (2.5 are circular)
  proposed breakout session: examine all these
consistency: the property of holding together and retaining shape
amplitude: The size of the maximum displacement from the 'normal' position, when periodic motion is taking place
placement: The spatial property of the way in which something is placed
pointed value: A sharp or tapered end
epinastic value: A downward bending of leaves or other plant parts
oblong value: Having a somewhat elongated form with approximately parallel sides
elliptic value: Elliptic shape
hearted value: Heart shaped
fasciated value: Abnormally flattened or coalesced
opacity: The property of not permitting the passage of electromagnetic radiation
opaque value: Not clear; not transmitting or reflecting light or radiant energy
undulate value: Having a sinuate margin and rippled surface
permeability: The property of something that can be pervaded by a liquid (as by osmosis or diffusion)
porosity: The property of being porous; being able to absorb fluids
porous value: able to absorb fluids
viscosity: a property of fluids describing their internal resistance to flow
viscous value: a relatively high resistance to flow
latency: The time that elapses between a stimulus and the response to it
power: The rate at which work is done
Proposal: genus-differentia definitions
An S is a G which D
Each def should refine the is_a parent
  Single is_a parent
Example (non-PATO):
  binucleate cell def= a cell which has two nuclei
Example (proposed PATO def):
  convex shape def= a shape which has no indentations
  opacity def= an optical quality which exists by virtue of the bearer’s capacity to block the passage of electromagnetic radiation
    very similar to existing def
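The "An S is a G which D" pattern can be sketched as a small data structure. This is only an illustration: the class name, the `is_circular` heuristic and the deliberately circular 'porous' entry are assumptions made for the example, not part of PATO.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenusDifferentiaDef:
    """A definition of the form 'An S is a G which D'."""
    term: str         # S: the term being defined
    genus: str        # G: the single is_a parent
    differentia: str  # D: what distinguishes S from other kinds of G

    def text(self) -> str:
        # Render the definition body as "a G which D"
        article = "an" if self.genus[0].lower() in "aeiou" else "a"
        return f"{article} {self.genus} which {self.differentia}"


def is_circular(d: GenusDifferentiaDef) -> bool:
    # Crude heuristic (an assumption, not a PATO rule):
    # the differentia must not reuse the term being defined.
    return d.term.lower() in d.differentia.lower()


convex = GenusDifferentiaDef("convex shape", "shape", "has no indentations")
opacity = GenusDifferentiaDef(
    "opacity",
    "optical quality",
    "exists by virtue of the bearer's capacity to block the passage of "
    "electromagnetic radiation",
)
# Made-up circular definition, for contrast only
porous = GenusDifferentiaDef("porous", "quality", "is porous")

print(convex.text())
print(opacity.text())
print(is_circular(porous))
```

Keeping the genus as a separate field means the is_a parent and the wording of the definition cannot drift apart.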
This policy will reap benefits
Advantages:
  Helps avoid circularity
  Ensures precision
  Consistency in wording
  user-friendly
Considerations:
  Sometimes leads to awkward phrasing
    -ity suffix: “an opacity which…”
  Solution: allow shortened gerund form
    having…, being…, …
    most of the existing defs conform already
    implicit prefix “A G which exists by virtue of the bearer…”
From the top down
First, the fake term ‘pato’ must be removed
How do we define ‘attribute’?
Note: I prefer the term ‘quality’ or ‘property’
  ‘attribute’ implies attribution: length_in_centimetres is an attribute
  we can of course continue to say ‘attribute’, but I use ‘quality’ in these slides
  most of the new PATO defs are phrased as ‘a property of…’, which I like, but this is inconsistent with calling the root ‘attribute’
Well then, what is a quality/property?
What a quality is NOT
Qualities are not measurements
  Instances of qualities exist independently of their measurements
  Qualities can have zero or more measurements
These are not the names of qualities: percentage, process, abnormal, high
Some examples of qualities
The particular redness of the left eye of a single individual fly
  An instance of a quality type
The color ‘red’
  A quality type
Note: the eye does not instantiate ‘red’
PATO represents quality types
PATO definitions can be used to classify quality instances by the types they instantiate
[Diagram: the particular case of redness (of a particular fly eye) instantiates the type “red”; an instance of an eye (in a particular fly) instantiates the type “eye”; the particular redness inheres in (is a quality of, has_bearer) the eye instance]
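The instance/type distinction above can be sketched in a few lines. The class and field names (`Type`, `Instance`, `instance_of`, `inheres_in`) are assumptions chosen for this illustration, not names from PATO or OBD.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Type:
    name: str


@dataclass(frozen=True)
class Instance:
    label: str
    instance_of: Type
    # The bearer; only set for quality instances
    inheres_in: Optional["Instance"] = None


RED = Type("red")
EYE = Type("eye")

# A particular eye of a particular fly instantiates the type 'eye'
left_eye = Instance("left eye of fly #1", EYE)
# The particular redness instantiates 'red' and inheres in that eye
redness = Instance("redness of left eye of fly #1", RED, inheres_in=left_eye)

print(redness.instance_of.name, "inheres in", redness.inheres_in.label)
```

Note that `left_eye` never instantiates `RED`; only the quality instance does, which is the point made on the slide.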
Qualities are dependent entities
Qualities require bearers
  Bearers can be physical objects or processes
Example: A shape requires a physical object to bear it
  If the physical object ceases to exist (e.g. it decomposes), then the shape ceases to exist
Some qualities are relational
  they relate a bearer with other entities, e.g. sensitivity (to)
Compare with: functions
The PATO hierarchy
Proposal for a new top level division Proposal for granular divisions
Proposal 1: top level division
Spatial quality
  Definition: A quality which has a physical object as bearer
  Examples: color, shape, temperature, velocity, ploidy, furriness, composition, texture
Spatiotemporal quality
  Definition: A quality which has a process as bearer
  Examples: rate, periodicity, regularity, duration
Proposal 2: subsequent divisions
Based on granularity (i.e. size scale)
  a good account of granularity is vital for inferences from molecular (gene) level to organismal (disease) level
How do we partition the levels?
Some qualities are realised at certain levels of granularity
Others can be realised across levels: shape, porosity
Sum-of-parts vs emergent
Scale | Bearer | Quality | Definition (proposed)
Physical | Cont. | Mass | Equivalent to the sum of the mass of the parts of the bearer (mass at the particle level is primitive/outwith PATO)
Physical | Cont. | Opacity | An optical quality manifest by the capacity of the bearer to block light
Phys/Chem | Liquid | Concentration | A compositional relational quality manifest by the relative quantity of some chemical type contained by the bearer
Molecular | Gene | splicing quality | manifest by the splicing processes undergone by the bearer
Cellular | Cell | ploidy | A cellular quality manifest by the number of genomes that are part of the bearer
Cellular | Cell | transformative potency?? | A cellular quality manifest by the capacity of the bearer cell to differentiate to different cell types
Organism | Tissue | tone |
Scale | Bearer | Quality | Definition (proposed)
 | Cont. | morphology > shape > 2D shape, 3D shape | A morphological quality which is manifest
Granular hierarchy
quality
  spatial quality
    spatial physical and physico-chemical quality: mass, concentration
    spatial biological quality
      spatial molecular quality
      spatial cellular quality
      spatial organismal quality
    spatial quality, multiple scales
      morphology/form
      optical quality: color, opacity, fluorescence
Advantages of dividing by granularity
Modular
  strategic question: should we focus on biological qualities and work with others on morphology, physics-based qualities etc?
Good for annotation
  easy to constrain at high level, e.g. organismal qualities cannot be borne by molecules
Mirrors GO and OBO Foundry divisions
Easier to find terms
  to be proved, but I believe so
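The annotation constraint mentioned above (a quality tied to one level of granularity cannot be borne by an entity at another level) might be checked roughly as follows. The scale labels and the two lookup tables are assumptions for illustration, loosely based on the scale/bearer examples in these slides; they are not PATO content.

```python
# Hypothetical scale assignments; "any" marks qualities realised across levels.
QUALITY_SCALE = {
    "ploidy": "cellular",
    "splicing quality": "molecular",
    "tone": "organismal",
    "mass": "any",
    "shape": "any",
}

BEARER_SCALE = {
    "gene": "molecular",
    "cell": "cellular",
    "tissue": "organismal",
}


def can_bear(bearer: str, quality: str) -> bool:
    """Return True if the bearer's scale is compatible with the quality's scale."""
    quality_scale = QUALITY_SCALE[quality]
    return quality_scale == "any" or quality_scale == BEARER_SCALE[bearer]


print(can_bear("cell", "ploidy"))  # cellular quality, cellular bearer
print(can_bear("gene", "tone"))    # organismal quality, molecular bearer
print(can_bear("gene", "mass"))    # cross-granular quality
```

An annotation tool could run a check like this at entry time, which is one way the granular top level helps curation.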
Considerations
Possible objection: The upper level of an ontology is what the user sees first terms such as “cross-granular quality” may be perceived as undesirable and/or abstruse by some users Counter-argument Solvable using ontology views aka subsets, slims
Relative and absolute
Currently PATO terms often come in 3s: e.g. mass, relative mass, absolute mass Why do we need these?
PATO: One or two hierarchies?
Currently two hierarchies: attribute, value
My position: there should be one hierarchy of qualities
My compromise: it should be possible to transform PATO automatically into a single hierarchy
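Current PATO
[Diagram: two parallel is_a hierarchies. Attribute side: attribute > color > hue, sat., var., … Value side: value > colorV > hueV, sat.V, var.V, …, with blackV, blueV, darkV, paleV as leaf values. ‘range’ links connect attributes to values.]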
Proposed change
[Diagram: a single is_a hierarchy. attribute > color > hue, sat., var., …, with black, blue, dark, pale as leaf terms.]
Arguments for a single hierarchy
Practical
  elimination of redundancy
  no clear line for deciding what should be A and what should be V: shape, bumpy vs bumpiness
Ontological
  what kind of thing is a ‘value’?
Diederich 1997: [quote here]
Arguments against
Two hierarchies reflect cognitive and linguistic structures
  e.g. the color of the rose changed from red to brown
    3 cognitive artifacts
  we want to present data in a way that is natural to users
    …but this can be solved with a single collapsed hierarchy
Two are useful for cross-products
  see later: distinguish modifiers from values
EAV is common database pattern
  so…?
Compromise: transformations
The Two Hierarchies approach is workable if they can be automatically collapsed
Prerequisite: univocity
  Each ‘value’ must be defined to mean exactly one thing only
  i.e. each ‘value’ must be the ‘range’ of a single attribute
Example: having a value ‘fast’ that could be applied to both the spatial quality ‘velocity’ and the process quality ‘duration’ would be forbidden
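Collapse on ‘ranges’
[Diagram: the attribute hierarchy (attribute > color > hue, sat., var., …) and the value hierarchy (value > colorV > hueV, sat.V, var.V, …; blackV, blueV, darkV, paleV) are merged along their ‘range’ links.]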
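The collapse on ‘ranges’, together with the univocity prerequisite, can be sketched as below. This is a toy under stated assumptions: the dictionary layout, the function names and the placement of blueV and blackV under hueV are invented for illustration and do not reflect the actual PATO file.

```python
from collections import defaultdict

# Toy fragment of the color example (placement of leaf values is assumed)
ATTR_PARENT = {"color": "attribute", "hue": "color"}                  # attribute is_a
VALUE_PARENT = {"hueV": "colorV", "blueV": "hueV", "blackV": "hueV"}  # value is_a
RANGES = [("color", "colorV"), ("hue", "hueV")]                       # attribute 'range' value


def strip_v(name: str) -> str:
    # Drop the trailing 'V' used to mark value terms in the slides
    return name[:-1] if name.endswith("V") else name


def check_univocity(ranges):
    # Each value may be the range of exactly one attribute
    owners = defaultdict(set)
    for attr, value in ranges:
        owners[value].add(attr)
    clashes = {v: sorted(a) for v, a in owners.items() if len(a) > 1}
    if clashes:
        raise ValueError(f"values that are the range of several attributes: {clashes}")


def collapse(attr_parent, value_parent, ranges):
    """Merge the value hierarchy into the attribute hierarchy along 'range' links."""
    check_univocity(ranges)
    merged = {value: attr for attr, value in ranges}
    single = dict(attr_parent)
    for value, parent in value_parent.items():
        if value in merged:
            continue  # a range root is absorbed into its attribute
        single[strip_v(value)] = merged.get(parent, strip_v(parent))
    return single


print(collapse(ATTR_PARENT, VALUE_PARENT, RANGES))
```

With the forbidden ‘fast’ example, `check_univocity([("velocity", "fast"), ("duration", "fast")])` raises `ValueError`, so the collapse is refused until the value is split.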
Shapes and colors
How many types of shape are there?
notched, T-shaped, Y-shaped, branched, unbranched, antrorse, retrorse, curled, curved, wiggly, squiggly, round, flat, square, oblong, elliptical, ovoid, cuboid, spherical, egg-shaped, rod-shaped, heart-shaped, …
How do we define them?
How do we compare them?
Is it worth the effort?
Shape types need precise definitions to be useful
Real shapes are not mathematical entities
  but mathematical definitions can help
Axes of classification:
  Dimensionality: 2-4D (process “shapes”)
  concave vs convex
  angular vs non-angular
  number of sides, corners
Primitive and composed shapes
Work with morphometrics community?
Shape likeness
We can post-coordinate some shape types: egg-shaped, head-shaped, A2-segment-shaped
Dangers of circularity
Only for genuine likeness (e.g. homeotic transformation)
  not “heart-shaped leaf”
See annotation section of this presentation
Color
Keep PATO HSV model
  but is black a color hue?
We should allow overlapping partitions of color space
  different domains have ‘sub-terminologies’ of color
Is color relational?
  Humans vs tetrachromatic UV-seeing animals
Composition using has_part
Color hierarchy
Physical quality
  Optical quality: a physical quality which exists in virtue of the bearer interacting with visible electromagnetic radiation
    Chromatic quality: an optical quality which exists in virtue of the bearer emitting, transmitting or reflecting visible electromagnetic radiation
      Color hue
      Color saturation
      Color variation
      Color
    Opacity: an optical quality which exists in virtue of the bearer absorbing visible electromagnetic radiation
      opaque
      translucent
      transparent
Part 2: Annotation using PATO
Annotation scheme desiderata OBD Dataflow Proposed annotation scheme
Annotation scheme desiderata
Rigour
There is a subset of the scheme which is simple
The entire scheme is expressive
It should have an unambiguous mapping to real world entities
Even if PATO is completely unambiguous, an ill-defined annotation scheme may leave room for ambiguity
Example: Annotation: E=eye, Q=red
What does this mean?
  both eyes are red in this one fly instance
  at least one eye is red in this one fly instance
  a typical eye is red in this many-eyed spider
  both eyes are red in this one fly at some point in time
  both eyes are red in this one fly at all times
  all eyes are red in all flies in this experiment
  some eyes are red in some flies in this experiment
There should be a certain usable subset that is simple
Rationale - MODs have limited resources:
  building entry tools for simple subsets is easier
  building databases and query/search engines is easier
  curating with a less expressive formalism is easier, faster and requires less training
  MODs’ primary use case is search, for which expressivity is less useful
Specifics
  Tools should have an (optional) simple facade
  Simple annotations should be expressible in a simple syntax that is understood by users with relatively little training
  There should be an exchange format and/or database schemas that use traditional technology as might be used in a MOD, e.g. XML, relational tables
The scheme must be highly expressive
Rationale
  May be required by other NCBCs (BIRN)
  May be required for cBio 200 gene list
  Will be required in future
Specifics
  Expressive superset will be optional
  MODs can ‘pick and choose’ their subset
  Native exchange and storage format will be logic based
  Details outwith scope of this presentation
Dataflow
How will various kinds of phenotypic data get into OBD?
what kinds of data suppliers will use different formalisms?
3 scenarios… (more possible)
Example dataflow I
Generic MOD curator annotates phenotypes using Phenote
Annotations stored directly in MOD’s central DB
MOD periodically submits to OBD
  e.g. using Phenote to create pheno-xml
OBD converts pheno-xml to native logic-based formalism
Users can query MOD directly, or OBD
  OBD will allow more expressive queries and have more data integrated
Example dataflow 2
Non-MOD generates complex annotations and stores them locally
  e.g. BIRN group?
Periodic submissions to OBD
  e.g. as OWL or Obo-format instance data
OBD converts to native logic-based formalism
Users can query OBD using more complex queries
Example dataflow 3
cBio MOD curates 200 genes using Phenote
Annotations may be stored outside normal MOD schema
  schema may not be expressive enough for complicated phenotypes
  TBD - up to MOD
Periodic submissions to OBD
  Phenote can be used to submit pheno-xml, OWL or OBO
  MOD doesn’t have to worry about format
OBD converts to native formalism
Users can query OBD using relatively complex queries
Is this (should it be) different from #1?
[Dataflow diagram. Labels: MOD A, MOD B, MOD C, Non-MOD, pheno-detailed, XML file, OBD]
Proposed annotation schema
The schema will be described informally using a simple syntax
  I use ‘E’ for entity and ‘Q’ for quality
  Pretend it is EAV if you like, with implicit superfluous ‘A’
The schema has (will have) a formal interpretation
  aim: database exchange and removal of ambiguities
  can be expressed using logical language
  OBD will use an internal logic-based representation
Outline of annotation schema
‘EAV’ or ‘EQ’ is not enough
  Fine for (very) simple subset
Extensions:
  time
  relational qualities
  post-coordination of entity types
  count qualities
  measurements
  …
Standard case: monadic qualities
Examples
  E=kidney, Q=hypertrophied
  autodef: a kidney which is hypertrophied
We assume that there is more contextual data (not shown)
  e.g. genotype, environment, number of organisms in study that showed phenotype
Interpretation (with the rest of the database record):
  all fish in this experiment with a particular genotype had a hypertrophied kidney at some point in time
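The monadic E/Q pair and its automatically generated definition (autodef) can be sketched as follows. The `EQ` class and its `autodef` method are assumptions for illustration; they only reproduce the "an E which is Q" wording used in these slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EQ:
    """A minimal entity-quality annotation (contextual data not shown)."""
    entity: str
    quality: str

    def autodef(self) -> str:
        # Build the textual definition "a/an E which is Q"
        article = "an" if self.entity[0].lower() in "aeiou" else "a"
        return f"{article} {self.entity} which is {self.quality}"


kidney = EQ("kidney", "hypertrophied")
print(kidney.autodef())
```

Generating the autodef mechanically is a cheap sanity check on an annotation: if the sentence reads oddly, the E/Q split is probably wrong.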
Quantification
long thick thoracic bristles
  2 statements:
    E=thoracic bristle, Q=long
    E=thoracic bristle, Q=thick
Default interpretation
  A typical thoracic bristle is long and thick
Optional entity quantifiers
  EQuant={some,all,most,
  in this one individual fly
OBD internal representation
Time
Example: E=brain,Q=small,during=stage
A E which has quality that instantiates Q during T
  E has the quality Q for some extent of time, and that extent overlaps T
during and other temporal relations will come from the OBO Relations ontology
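The interpretation of during given above (the extent of time over which E has Q overlaps T) reduces to an interval-overlap test. The sketch below assumes time is modelled as numeric intervals; the `Interval` class, the function name and the example numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    start: float
    end: float

    def overlaps(self, other: "Interval") -> bool:
        # Two intervals overlap if each starts before the other ends
        return self.start < other.end and other.start < self.end


def holds_during(quality_extent: Interval, stage: Interval) -> bool:
    """True if the quality's temporal extent overlaps the stage T."""
    return quality_extent.overlaps(stage)


# Made-up numbers: a stage spanning hours 24-48,
# and a small-brain phenotype observed over hours 10-30.
stage = Interval(24, 48)
small_brain = Interval(10, 30)
print(holds_during(small_brain, stage))
```

Overlap is deliberately weaker than containment: the quality need not hold for the whole stage.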
Relational qualities
E.g. sensitivity E=eye, Q=sensitive, E2=red light
Post-coordinating entity types
E=blood in head, Q=pooled
Problem: The E may not be pre-defined (pre-coordinated, pre-composed) in the anatomy ontology
We can post-compose a type representation (aka make a cross-product)
  E=(blood has_location(head))
The ability to post-coordinate may not be available in the ‘simple subset’
  can be expressed easily in pheno-xml, obo, owl, phenote (soon)
  OBD will handle all required reasoning
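A post-composed entity type such as (blood has_location(head)) can be held as a genus plus a list of relation/filler pairs. This is a sketch only: the `PostComposed` class is an assumption, and the rendering simply mirrors the informal syntax used in these slides rather than any pheno-xml, OBO or OWL serialisation.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PostComposed:
    """A cross-product type: a genus refined by relation(filler) differentia."""
    genus: str
    differentia: Tuple[Tuple[str, str], ...] = ()

    def __str__(self) -> str:
        if not self.differentia:
            return self.genus  # a plain pre-coordinated term
        parts = " ".join(f"{rel}({filler})" for rel, filler in self.differentia)
        return f"({self.genus} {parts})"


blood_in_head = PostComposed("blood", (("has_location", "head"),))
print(f"E={blood_in_head}, Q=pooled")
```

Because the class is frozen and hashable, post-composed expressions can be used as dictionary keys, for example when mapping them to pre-coordinated terms.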
Pre-coordinating phenotypes
Mammalian phenotype ontology has pre-coordinated phenotype terms
  osteoporosis
  pink fur
OBD will be able to translate
  post-coordinated queries to annotations on pre-defined terms
  queries on pre-defined terms to post-coordinated phenotypes
Requirement
  computable logical definitions are added to MP
Count qualities
wingless
polydactyly
spermatocytes devoid of asters
Absence can never be instantiated
wingless
  E=wing, Q=absent
  autodef: “an instance of wing which is absent”
Proposal: restate as:
  E=mesothoracic segment, Q=missing part, E2=wing
This has other advantages
  works better for “spermatocyte devoid of asters”
The quality of ‘being many’ does not inhere in a finger
Polydactyly
  E=finger, Q=supernumerary
  autodef: “a finger which is supernumerary”
Restate as:
  E=hand, Q=supernumerary parts, E2=finger
  “a hand which has more fingers as parts than is typical”
With count extension
  E=hand, Q=supernumerary parts, E2=finger, Count=6
  could also say +1
  “a hand with 6 fingers, which is more than normal”
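The restatement of count phenotypes (moving the quality from the part to the whole that has or lacks it) can be sketched as a small rewrite function. The part-of table and the quality mapping below contain only the two examples from these slides; the function name and data layout are assumptions.

```python
# Part -> whole, taken from the two examples in the slides
PART_OF = {"wing": "mesothoracic segment", "finger": "hand"}

# Part-level quality -> whole-level count quality
RESTATED_QUALITY = {
    "absent": "missing part",
    "supernumerary": "supernumerary parts",
}


def restate(entity, quality, count=None):
    """Rewrite E=part, Q=count-like quality as an annotation on the whole."""
    annotation = {
        "E": PART_OF[entity],
        "Q": RESTATED_QUALITY[quality],
        "E2": entity,
    }
    if count is not None:
        annotation["Count"] = count  # optional count extension
    return annotation


print(restate("wing", "absent"))
print(restate("finger", "supernumerary", count=6))
```

In a real system the part-of lookup would come from the anatomy ontology rather than a hard-coded table.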
Proposed PATO sub-hierarchy
part count quality
  lacking parts
    lacking all
    lacking some
  having normal part count
  having extra parts
Mass count qualities
furriness, porosity
Bearers possess these qualities by virtue of the number and qualities of their granular parts
  hairiness by virtue of: number, width, length, spacing, orientation of hair-parts
What is the essence of hairy?
Attempt 1: E=skin,Q=hairy
  but what if we do not have ‘hairy’ pre-coordinated in PATO?
Alternate representation: E=skin,Q=excess fine-grained parts,E2=hair
  open Q: is this equivalent to, subsumed by, or related to representation 1?
Another representation: E=hair, Q=long
  this is something different
increased brown fat cells
“increased brown fat cells”
Attempt 1: E=brown fat cell, Q=increased
  autodef: a brown fat cell which is increased
Restate as: E=organism, Q=increased (granular) parts, E2=brown fat cell
  works better for “increased brown fat cells in upper body”
OBD handles reasoning
  should annotations to above be returned for queries of PATO term “fatty”?
Relativity
PATO has terms like: large, increased
Context is implicit: strain, species, genus/order
Extension to make explicit
In_comparison_to
Bigger than average for species/genus/etc E=brain,Q=large,In_comparison_to=
Ratio & relative_to
Use cases:
  Size of brain relative to size of skull
  Size of brain relative to size of skull in an individual, when compared to size of brain relative to size of skull in a typical individual of that species
E=brain,Q=large,relative_to=skull, in_comparison_to=
Modifiers
E=bone,Q=notched,Mod=mild
Standardised qualitative modifiers
  Meaning dependent on E and Q
Can have multiple, cross-cutting scales
  qualitative and numeric/score based
[Figure: word scale (absent, mildly realised, normal, strong, extreme) against score scale (0, 0.001, 0.01, 0.1, 1, 10, 100)]
Modifiers modify meaning of Q
Influence of Mod on Q is subjective, but the direction is objective
Example: E=adult_human_body, during=sleep
  Q={low,high} temperature, Mod=mild,normal,moderate,extreme
[Figure: word scale (absent/NOT, mildly realised, normal, strong, extreme; the outer values marked abnormal and abn+) against score scale (0.001, 0.01, 0.1, 1, 10, 100) and temperature scales (low temperature: 37, 36.5, 36, 35; high temperature: 37, 37.5, 38, 39)]
Modifiers and PATO
Modifiers are not qualities
  Modifiers should not be in a true ontology
  But we can still give these PATO IDs
    kept separate from core PATO ontology
Modifiers can be relational
  relatum may be implicit
  e.g. abnormal_with_respect_to
Modifiers serve similar purposes as Values in tripartite EAV model
Difference: absent, low, high are not treated in the same way as genuine quality types like ‘notched’, ‘large’, ‘diploid’, ‘pink’
  they are ingredients in the representation language, and not types in an ontology
Heterozygous flies have very short and highly branched arista laterals.
E=arista lateral, EQuant=all, Q=short, Mod=extreme, in_comparison_to=Dmel
E=arista lateral, EQuant=all, Q=branched, Mod=extreme, in_comparison_to=Dmel
Measurements
Measurements are not qualities
In the schema, representations of measurements are attached to the representations of qualities
Separate measurement schema
  don’t need to discuss fine grained details here
  some data providers will require more detail than others here, e.g. averages, error bars, …
E=tail, Q=length, Measurement=2cm
E=tail, Q=length, Measurement=+.1cm, in_comparison_to=
Likeness
Shape likeness
Homeotic transformations
  E=A2 segment,Q=morphology,Similar_to=A3 segment
  Interp: An A2 segment with the morphological features of an A3 segment
but not “heart-shaped leaves”
Conditionals
Some phenotypes are only realised under certain conditions
  environment, including chemical interactions, RNA interference etc
we should separate
  conditionals (this phenotype only seen in this envirotype with this genotype)
  from data (on this occasion this phenotype seen in this envirotype with this genotype)
Schema elements
Phenotype character: E, Q, EQuant, E2, Count, Mod, Relative_to, In_comparison_to, Similar_to, Measurement, Temporal
Most of these elements are optional
  data providers pick and choose their level of expressivity
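The schema elements listed above can be gathered into one record type in which only E and Q are required. This is a sketch, not the OBD or pheno-xml schema: the class name, the field types and the `render` method are assumptions, and the output just follows the informal E=…, Q=… syntax of these slides.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class PhenotypeCharacter:
    # Required core of the simple subset
    E: str
    Q: str
    # Optional extensions; providers pick their level of expressivity
    EQuant: Optional[str] = None
    E2: Optional[str] = None
    Count: Optional[int] = None
    Mod: Optional[str] = None
    Relative_to: Optional[str] = None
    In_comparison_to: Optional[str] = None
    Similar_to: Optional[str] = None
    Measurement: Optional[str] = None
    Temporal: Optional[str] = None

    def render(self) -> str:
        # Emit only the elements that were actually supplied, in field order
        return ", ".join(
            f"{f.name}={getattr(self, f.name)}"
            for f in fields(self)
            if getattr(self, f.name) is not None
        )


# "very short arista laterals", from the heterozygous fly example
arista = PhenotypeCharacter(
    E="arista lateral", Q="short", EQuant="all",
    Mod="extreme", In_comparison_to="Dmel",
)
print(arista.render())
```

A MOD using only the simple subset would fill in E and Q and leave everything else unset.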
future extensions
boolean combinations
conditional statements, e.g. environment
[Figure residue: modifier scale with labels ++, +, ., --]