NCBIO-Berkeley


Some thoughts on PATO
Chris Mungall, BBOP
Hinxton, May 2006

Outline

  Motivation revisited
  The Ontology: PATO
  OBD & using PATO for annotation

Who should use PATO?

  Originally:
    model organism mutant phenotypes
  But also:
    ontology-based evolutionary systematics
    neuroscience; BIRN
    clinical uses
      OMIM
      clinical records
    to define terms in other ontologies
      e.g. diploid cell; condensed chromosome; invasive tumor; engineered gene

Unifying goal: integration

  Integrating data
    within and across these domains
    across levels of granularity
    across different perspectives
  Requires
    rigorous formal definitions in both ontologies and annotation schemas

Some thoughts on the ontology itself

  Outline
    Definitions
      how do we define PATO terms?
      what exactly is it we’re defining?
    is_a hierarchy
      what are the top-level distinctions?
      what are the finer grained distinctions?
    shapes and colors

It’s all about the definitions

  Everything is doomed to failure without rigorous definitions
    even more so with PATO than other ontologies
  OBO Foundry Principle
    Definitions should describe things in reality, not how terms are used
      def should not use the word ‘describing’
  Should we come up with a policy for definitions in PATO?
    currently: 19 defs (2.5 are circular)
    proposed breakout session: examine all these

  consistency: the property of holding together and retaining shape
  amplitude: The size of the maximum displacement from the 'normal' position, when periodic motion is taking place
  placement: The spatial property of the way in which something is placed
  pointed value: A sharp or tapered end
  epinastic value: A downward bending of leaves or other plant parts
  oblong value: Having a somewhat elongated form with approximately parallel sides
  elliptic value: Elliptic shape
  hearted value: Heart shaped
  fasciated value: Abnormally flattened or coalesced
  opacity: The property of not permitting the passage of electromagnetic radiation
  opaque value: Not clear; not transmitting or reflecting light or radiant energy
  undulate value: Having a sinuate margin and rippled surface
  permeability: The property of something that can be pervaded by a liquid (as by osmosis or diffusion)
  porosity: The property of being porous; being able to absorb fluids
  porous value: able to absorb fluids
  viscosity: a property of fluids describing their internal resistance to flow
  viscous value: a relatively high resistance to flow.
  latency: The time that elapses between a stimulus and the response to it
  power: The rate at which work is done

Proposal: genus-differentia definitions

  An S is a G which D
  Each def should refine the is_a parent
    Single is_a parent
  Example (non-PATO):
    binucleate cell def= a cell which has two nuclei
  Example (proposed PATO def):
    convex shape def= a shape which has no indentations
    opacity def= an optical quality which exists by virtue of the bearer’s capacity to block the passage of electromagnetic radiation
      very similar to existing def
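The "An S is a G which D" pattern lends itself to a mechanical check. The following is a minimal sketch only: the class name `GenusDifferentiaDef`, its fields and the crude circularity test are hypothetical illustrations, not part of PATO or of any OBO tool. The two example definitions are the ones given on this slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenusDifferentiaDef:
    """Hypothetical record for a definition of the form 'An S is a G which D'."""
    term: str         # S, the term being defined
    genus: str        # G, the single is_a parent
    differentia: str  # D, what distinguishes S from other kinds of G

    def text(self) -> str:
        # Choose the indefinite article from the first letter of the genus.
        article = "an" if self.genus[0].lower() in "aeiou" else "a"
        return f"{article} {self.genus} which {self.differentia}"

    def is_circular(self) -> bool:
        # Crude check (an assumption): the term reappears in its own differentia.
        return self.term.lower() in self.differentia.lower()


binucleate = GenusDifferentiaDef("binucleate cell", "cell", "has two nuclei")
convex = GenusDifferentiaDef("convex shape", "shape", "has no indentations")

print(binucleate.text())  # a cell which has two nuclei
print(convex.text())      # a shape which has no indentations
```

Storing genus and differentia separately means the is_a parent and the definition text cannot drift apart, which is one way to read "each def should refine the is_a parent".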

This policy will reap benefits

  Advantages:
    Helps avoid circularity
    Ensures precision
    Consistency in wording
    user-friendly
  Considerations:
    Sometimes leads to awkward phrasing
      -ity suffix: “an opacity which…”
    Solution:
      allow shortened gerund form
        having…, being…, …
      most of the existing defs conform already
      implicit prefix “A G which exists by virtue of the bearer…”

From the top down

  First, the fake term ‘pato’ must be removed
  How do we define ‘attribute’?
  Note: I prefer the term ‘quality’ or ‘property’
    attribute implies attribution
      length_in_centimetres is an attribute
    we can of course continue to say ‘attribute’ but I use ‘quality’ in these slides
    most of the new pato defs are phrased as ‘a property of…’
      which I like, but it is inconsistent with calling the root ‘attribute’
  Well then, what is a quality/property?

What a quality is NOT

  Qualities are not measurements
    Instances of qualities exist independently of their measurements
    Qualities can have zero or more measurements
  These are not the names of qualities:
    percentage
    process
    abnormal
    high

Some examples of qualities

  The particular redness of the left eye of a single individual fly
    An instance of a quality type
  The color ‘red’
    A quality type
  Note: the eye does not instantiate ‘red’
  PATO represents quality types
    PATO definitions can be used to classify quality instances by the types they instantiate

  [Figure: the particular case of redness (of a particular fly eye) instantiates the type “red” and inheres in (is a quality of, has_bearer) an instance of an eye (in a particular fly), which instantiates the type “eye”.]
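The type/instance distinction on this slide can be sketched as a small data model. This is an illustration under stated assumptions: the class names `Type` and `Instance` and the field names are hypothetical, chosen to mirror the relations named on the slide (instantiates, inheres_in).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Type:
    """A type, e.g. the quality type 'red' or the entity type 'eye'."""
    name: str


@dataclass(frozen=True)
class Instance:
    """A particular. A quality instance also records the bearer it inheres in."""
    label: str
    instantiates: Type
    inheres_in: Optional["Instance"] = None  # set only for quality instances


red = Type("red")
eye = Type("eye")

this_eye = Instance("left eye of one particular fly", instantiates=eye)
this_redness = Instance("redness of that eye", instantiates=red, inheres_in=this_eye)

# The eye instantiates 'eye', never 'red'; only the redness instantiates 'red'.
print(this_eye.instantiates.name)      # eye
print(this_redness.instantiates.name)  # red
print(this_redness.inheres_in.label)   # left eye of one particular fly
```

Keeping `inheres_in` on the quality instance rather than on the eye reflects the point that qualities depend on their bearers, not the other way round.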

Qualities are dependent entities

  Qualities require bearers
    Bearers can be physical objects or processes
  Example:
    A shape requires a physical object to bear it
    If the physical object ceases to exist (e.g. it decomposes), then the shape ceases to exist
  Some qualities are relational
    they relate a bearer with other entities
      e.g. sensitivity (to)
  Compare with: functions

The PATO hierarchy

  Proposal for a new top level division
  Proposal for granular divisions

Proposal 1: top level division

  Spatial quality
    Definition: A quality which has a physical object as bearer
    Examples: color, shape, temperature, velocity, ploidy, furriness, composition, texture
  Spatiotemporal quality
    Definition: A quality which has a process as bearer
    Examples: rate, periodicity, regularity, duration

Proposal 2: subsequent divisions

  Based on granularity (i.e. size scale)
    a good account of granularity is vital for inferences from molecular (gene) level to organismal (disease) level
  How do we partition the levels?
  Some qualities are realised at certain levels of granularity
  Others can be realised across levels
    shape, porosity
  Sum-of-parts vs emergent

Scale     | Bearer | Quality                 | Definition (proposed)
Physical  | Cont.  | Mass                    | Equivalent to the sum of the mass of the parts of the bearer (mass at the particle level is primitive/outwith PATO)
Physical  | Cont.  | Opacity                 | An optical quality manifest by the capacity of the bearer to block light
Phys/Chem | Liquid | Concentration           | A compositional relational quality manifest by the relative quantity of some chemical type contained by the bearer
Molecular | Gene   | splicing quality        | manifest by the splicing processes undergone by the bearer
Cellular  | Cell   | ploidy                  | A cellular quality manifest by the number of genomes that are part of the bearer
Cellular  | Cell   | transformative potency?? | A cellular quality manifest by the capacity of the bearer cell to differentiate to different cell types
Organism  | Tissue | tone                    |

Bearer: Cont.
Quality:
  morphology
    shape
      2D shape
      3D shape
Definition (proposed): A morphological quality which is manifest

Granular hierarchy

  quality
    spatial quality
      spatial physical and physico-chemical quality
        mass, concentration
      spatial biological quality
        spatial molecular quality
        spatial cellular quality
        spatial organismal quality
      spatial quality, multiple scales
        morphology/form
        optical quality
          color, opacity, fluorescence

Advantages of dividing by granularity

  Modular
    strategic question
      should we focus on biological qualities and work with others on morphology, physics-based qualities etc?
  Good for annotation
    easy to constrain at high level
      e.g. organismal qualities cannot be borne by molecules
  Mirrors GO and OBO Foundry divisions
  Easier to find terms
    to be proved, but I believe so
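The claim that annotations are "easy to constrain at high level" can be made concrete with a lookup from qualities and bearers to their scale. This is a sketch under assumptions: the two dictionaries, their entries and the function name `bearer_allowed` are hypothetical, and cross-granular qualities such as shape are deliberately left out.

```python
# Hypothetical lookup tables, for illustration only.
QUALITY_SCALE = {
    "ploidy": "cellular",
    "splicing quality": "molecular",
    "tone": "organismal",
}
BEARER_SCALE = {
    "cell": "cellular",
    "gene": "molecular",
    "tissue": "organismal",
}


def bearer_allowed(quality: str, bearer: str) -> bool:
    """True if the bearer sits at the scale the quality is realised at."""
    return QUALITY_SCALE[quality] == BEARER_SCALE[bearer]


print(bearer_allowed("ploidy", "cell"))  # True
print(bearer_allowed("tone", "gene"))    # False: an organismal quality on a molecule
```

A curation tool could run a check like this at entry time and reject the annotation before it reaches the database.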

Considerations

  Possible objection:
    The upper level of an ontology is what the user sees first
    terms such as “cross-granular quality” may be perceived as undesirable and/or abstruse by some users
  Counter-argument
    Solvable using ontology views
      aka subsets, slims

Relative and absolute

  Currently PATO terms often come in 3s:
    e.g. mass, relative mass, absolute mass
  Why do we need these?

PATO: One or two hierarchies?

  Currently two hierarchies
    attribute
    value
  My position:
    there should be one hierarchy of qualities
  My compromise:
    it should be possible to transform PATO automatically into a single hierarchy

[Figure: Current PATO. Two parallel is_a hierarchies linked by ‘range’: attribute (color; hue, sat., var.) and value (colorV; hueV, sat.V, var.V; blackV, blueV, darkV, paleV).]

[Figure: Proposed change. A single hierarchy under attribute: color; hue, sat., var.; black, blue, dark, pale.]

Arguments for a single hierarchy

  Practical
    elimination of redundancy
    no clear line for deciding what should be A and what should be V
      shape, bumpy vs bumpiness
  Ontological
    what kind of thing is a ‘value’?
    Diederich 1997: [quote here]

Arguments against

  Two hierarchies reflect cognitive and linguistic structures
    e.g. the color of the rose changed from red to brown
      3 cognitive artifacts
    we want to present data in a way that is natural to users
      …but this can be solved with a single collapsed hierarchy
  Two are useful for cross-products
    see later - distinguish modifiers from values
  EAV is common database pattern
    so…?

Compromise: transformations

  The Two Hierarchies approach is workable if they can be automatically collapsed
  Prerequisite: univocity
    Each ‘value’ must be defined to mean exactly one thing only
      i.e. each ‘value’ must be the ‘range’ of a single attribute
    Example
      having a value ‘fast’ that could be applied to both the spatial quality ‘velocity’ and the process quality ‘duration’ would be forbidden

[Figure: Collapse on ‘ranges’. The attribute hierarchy (color; hue) and the value hierarchy (colorV; hueV, sat.V, var.V; blackV, blueV, darkV, paleV), joined by ‘range’ links.]
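Collapsing the two hierarchies on their ‘range’ links, with univocity enforced, could look roughly like the sketch below. The `RANGES` table and the `collapse` function are hypothetical, and the assignment of black/blue to hue and dark/pale to saturation is assumed purely for illustration.

```python
from typing import Dict, List

# Hypothetical 'range' links: each attribute lists the values it ranges over.
RANGES: Dict[str, List[str]] = {
    "hue": ["black", "blue"],
    "saturation": ["dark", "pale"],
}


def collapse(ranges: Dict[str, List[str]]) -> Dict[str, str]:
    """Return a child -> is_a parent map, enforcing univocity.

    Raises ValueError if any value is in the range of more than one attribute.
    """
    parent: Dict[str, str] = {}
    for attribute, values in ranges.items():
        for value in values:
            if value in parent:
                raise ValueError(
                    f"'{value}' is in the range of both "
                    f"'{parent[value]}' and '{attribute}'"
                )
            parent[value] = attribute
    return parent


print(collapse(RANGES)["blue"])  # hue

# 'fast' applied to both velocity and duration violates univocity.
try:
    collapse({"velocity": ["fast"], "duration": ["fast"]})
except ValueError as err:
    print(err)
```

The failure case is the forbidden ‘fast’ example from the slide: the collapse cannot pick a single is_a parent, so it refuses rather than guessing.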

 Shapes and colors

How many types of shape are there?

  notched, T-shaped, Y-shaped, branched, unbranched, antrorse, retrorse, curled, curved, wiggly, squiggly, round, flat, square, oblong, elliptical, ovoid, cuboid, spherical, egg-shaped, rod shaped, heart shaped, …
  How do we define them?
  How do we compare them?
  Is it worth the effort?

Shape types need precise definitions to be useful

  Real shapes are not mathematical entities
    but mathematical definitions can help
  Axes of classification:
    Dimensionality
      2D to 4D (process “shapes”)
    concave vs convex
    angular vs non-angular
    number of
      sides
      corners
  Primitive and composed shapes
  Work with morphometrics community?

Shape likeness

  We can post-coordinate some shape types
    egg-shaped
    head-shaped
    A2-segment-shaped
  Dangers of circularity
  Only for genuine likeness (e.g. homeotic transformation)
    not “heart-shaped leaf”
  See annotation section of this presentation

Color

  Keep PATO HSV model
    but is black a color hue?
  We should allow overlapping partitions of color space
    different domains have ‘sub-terminologies’ of color
  Is color relational?
    Humans vs tetrachromatic UV-seeing animals
  Composition
    using has_part

Color hierarchy

  Physical quality
    Optical quality: a physical quality which exists in virtue of the bearer interacting with visible electromagnetic radiation
      Chromatic quality: an optical quality which exists in virtue of the bearer emitting, transmitting or reflecting visible electromagnetic radiation
        Color hue
        Color saturation
        Color variation
        Color
      Opacity: an optical quality which exists in virtue of the bearer absorbing visible electromagnetic radiation
        opaque
        translucent
        transparent

Part 2: Annotation using PATO

  Annotation scheme desiderata
  OBD Dataflow
  Proposed annotation scheme

Annotation scheme desiderata

  Rigour
  There is a subset of the scheme which is simple
  The entire scheme is expressive

It should have an unambiguous mapping to real world entities

  Even if PATO is completely unambiguous, an ill defined annotation scheme may leave room for ambiguity
  Example:
    Annotation:
      E=eye, Q=red
    What does this mean?
      both eyes are red in this one fly instance
      at least one eye is red in this one fly instance
      a typical eye is red in this many-eyed spider
      both eyes are red in this one fly at some point in time
      both eyes are red in this one fly at all times
      all eyes are red in all flies in this experiment
      some eyes are red in some flies in this experiment

There should be a certain usable subset that is simple

  Rationale - MODs have limited resources:
    building entry tools for simple subsets is easier
    building databases and query/search engines is easier
    curating with a less expressive formalism is easier, faster and requires less training
    MODs’ primary use case is search, for which expressivity is less useful
  Specifics
    Tools should have an (optional) simple facade
    Simple annotations should be expressible in a simple syntax that is understood by users with relatively little training
    There should be an exchange format and/or database schemas that use traditional technology as might be used in a MOD
      e.g. XML, relational tables

The scheme must be highly expressive

  Rationale
    May be required by other NCBCs (BIRN)
    May be required for cbio 200 gene list
    Will be required in future
  Specifics
    Expressive superset will be optional
      MODs can ‘pick and choose’ their subset
    Native exchange and storage format will be logic based
      Details outwith scope of this presentation

Dataflow

  How will various kinds of phenotypic data get into OBD?
    what kinds of data suppliers will use different formalisms?
  3 scenarios… (more possible)

Example dataflow I

  Generic MOD curators annotate phenotypes using Phenote
  Annotations stored directly in MOD’s central DB
  MOD periodically submits to OBD
    e.g. using Phenote to create pheno-xml
  OBD converts pheno-xml to native logic-based formalism
  Users can query MOD directly, or OBD
    OBD will allow more expressive queries and have more data integrated

Example dataflow 2

  Non-MOD generates complex annotations and stores them locally
    e.g. BIRN group?
  Periodic submissions to OBD
    e.g. as OWL or Obo-format instance data
  OBD converts to native logic-based formalism
  Users can query OBD using more complex queries

Example dataflow 3

  cBio MOD curates 200 genes using Phenote
  Annotations may be stored outside normal MOD schema
    schema may not be expressive enough for complicated phenotypes
    TBD - up to MOD
  Periodic submissions to OBD
    Phenote can be used to submit pheno-xml, OWL or OBO
      MOD doesn’t have to worry about format
  OBD converts to native formalism
  Users can query OBD using relatively complex queries
  Is this (should it be) different from #1?

[Figure: dataflow diagram. MOD A, MOD B, MOD C and a Non-MOD source feed OBD; arrow labels include pheno-xml and detailed XML file.]

Proposed annotation schema

  The schema will be described informally using a simple syntax
    I use ‘E’ for entity and ‘Q’ for quality
    Pretend it is EAV if you like
      with implicit superfluous ‘A’
  The schema has (will have) a formal interpretation
    aim: database exchange and removal of ambiguities
    can be expressed using logical language
    OBD will use an internal logic-based representation

Outline of annotation schema

  ‘EAV’ or ‘EQ’ is not enough
    Fine for (very) simple subset
  Extensions:
    time
    relational qualities
    post-coordination of entity types
    count qualities
    measurements
    …

Standard case: monadic qualities

  Examples
    E=kidney, Q=hypertrophied
      autodef: a kidney which is hypertrophied
  We assume that there is more contextual data (not shown)
    e.g. genotype, environment, number of organisms in study that showed phenotype
  Interpretation (with the rest of the database record):
    all fish in this experiment with a particular genotype had a hypertrophied kidney at some point in time
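The autodef for a monadic annotation is mechanical enough to sketch in a few lines. The class name `EQ` and the method `autodef` are hypothetical; only the output format "a kidney which is hypertrophied" comes from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EQ:
    """Hypothetical minimal annotation: an entity type and a monadic quality type."""
    entity: str
    quality: str

    def autodef(self) -> str:
        # Choose the indefinite article from the first letter of the entity name.
        article = "an" if self.entity[0].lower() in "aeiou" else "a"
        return f"{article} {self.entity} which is {self.quality}"


print(EQ("kidney", "hypertrophied").autodef())  # a kidney which is hypertrophied
```

Reading the autodef back is a cheap sanity check on an annotation: if the generated sentence is nonsense, the E and Q were probably combined wrongly.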

Quantification

  long thick thoracic bristles
    2 statements
      E=thoracic bristle, Q=long
      E=thoracic bristle, Q=thick
  Default interpretation
    A typical thoracic bristle is long and thick
  Optional entity quantifiers
    EQuant={some,all,most,,}
    E=thoracic bristle, Q=long, EQuant=80%
      80% of the thoracic bristles in this one individual fly

OBD internal representation

Time

  Example:
    E=brain, Q=small, during=stage
      An E which has a quality that instantiates Q during T
      E has the quality Q for some extent of time, and that extent overlaps T
  during and other temporal relations will come from the OBO Relations ontology
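The "overlaps" reading of during can be sketched as an interval test. This is an illustration only: representing time extents as numeric (start, end) pairs and the function name `overlaps` are assumptions, and the numbers below are made up.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (start, end) on an arbitrary shared time axis


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two closed intervals share at least one time point."""
    return a[0] <= b[1] and b[0] <= a[1]


stage = (10.0, 20.0)           # hypothetical extent of the stage T
brain_is_small = (15.0, 30.0)  # hypothetical extent during which E has quality Q

print(overlaps(brain_is_small, stage))  # True
```

Under this reading the quality does not have to hold for the whole of T, only for some extent that intersects it.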

Relational qualities

  E.g. sensitivity
    E=eye, Q=sensitive, E2=red light

Post-coordinating entity types

  E=blood in head, Q=pooled
  Problem:
    The E may not be pre-defined (pre-coordinated, pre-composed) in the anatomy ontology
  We can post-compose a type representation (aka make a cross-product)
    E=(blood  has_location(head))
  The ability to post-coordinate may not be available in the ‘simple-subset’
    can be expressed easily in pheno-xml, obo, owl, phenote (soon)
  OBD will handle all required reasoning
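A post-composed entity type can be held as a genus plus a list of relational differentiae. The sketch below is hypothetical throughout: the class name `PostComposed` and its string rendering are illustrative and are not the pheno-xml, OBO or OWL serialisation.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PostComposed:
    """Hypothetical cross-product: a genus type plus relational differentiae."""
    genus: str
    differentiae: Tuple[Tuple[str, str], ...]  # (relation, filler type) pairs

    def __str__(self) -> str:
        parts = " ".join(f"{rel}({filler})" for rel, filler in self.differentiae)
        return f"({self.genus} {parts})"


blood_in_head = PostComposed("blood", (("has_location", "head"),))
print(f"E={blood_in_head}, Q=pooled")  # E=(blood has_location(head)), Q=pooled
```

Because the expression is structured rather than free text, a reasoner can still relate it to the pre-defined terms `blood` and `head`.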

Pre-coordinating phenotypes

  Mammalian phenotype ontology has pre-coordinated phenotype terms
    osteoporosis
    pink fur
  OBD will be able to translate
    post-coordinated queries to annotations on pre-defined terms
    queries on pre-defined terms to post-coordinated phenotypes
  Requirement
    computable logical definitions are added to MP

Count qualities

  wingless
  polydactyly
  spermatocytes devoid of asters

Absence can never be instantiated

  wingless
    E=wing, Q=absent
    autodef “an instance of wing which is absent”
  Proposal: restate as:
    E=mesothoracic segment, Q=missing part, E2=wing
  This has other advantages
    works better for “spermatocyte devoid of asters”

The quality of ‘being many’ does not inhere in a finger

  Polydactyly
    E=finger, Q=supernumerary
    autodef: “a finger which is supernumerary”
  Restate as:
    E=hand, Q=supernumerary parts, E2=finger
      “a hand which has more fingers as parts than is typical”
  With count extension
    E=hand, Q=supernumerary parts, E2=finger, Count=6
      could also say +1
      “a hand with 6 fingers, which is more than normal”

Proposed PATO sub-hierarchy

  part count quality
    lacking parts
      lacking all
      lacking some
    having normal part count
    having extra parts
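Given an observed count and a typical count, choosing a term from this sub-hierarchy is a simple comparison. In the sketch below the function name `part_count_quality` is hypothetical, and treating ‘lacking all’ and ‘lacking some’ as the two kinds of ‘lacking parts’ is an assumption about how the terms nest.

```python
def part_count_quality(count: int, typical: int) -> str:
    """Pick a term from the proposed sub-hierarchy for an observed part count."""
    if count == 0 and typical > 0:
        return "lacking all"
    if count < typical:
        return "lacking some"
    if count > typical:
        return "having extra parts"
    return "having normal part count"


# E=hand, E2=finger, Count=6, assuming five fingers is typical
print(part_count_quality(6, typical=5))  # having extra parts
# E=mesothoracic segment, E2=wing; the typical count of 2 is an assumption
print(part_count_quality(0, typical=2))  # lacking all
```

The quality is assigned to the whole (hand, segment) rather than to the part, which is the point of restating ‘wingless’ and ‘polydactyly’ this way.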

Mass count qualities

  furriness
  porosity
  Bearers possess these qualities by virtue of the number and qualities of their granular parts
    hairiness by virtue of: number, width, length, spacing, orientation of hair-parts

What is the essence of hairy?

  Attempt 1:
    E=skin, Q=hairy
    but what if we do not have ‘hairy’ pre-coordinated in PATO?
  Alternate representation:
    E=skin, Q=excess fine-grained parts, E2=hair
    open Q: is this equivalent to, subsumed by, or related to representation 1?
  Another representation:
    E=hair, Q=long
    this is something different

increased brown fat cells

  “increased brown fat cells”
  Attempt 1:
    E=brown fat cell, Q=increased
      autodef: a brown fat cell which is increased
  Restate as:
    E=organism, Q=increased (granular) parts, E2=brown fat cell
    works better for “increased brown fat cells in upper body”
  OBD handles reasoning
    should annotations to above be returned for queries of PATO term “fatty”?

Relativity

  PATO has terms like
    large
    increased
  Context is implicit
    strain
    species
    genus/order
  Extension to make explicit

In_comparison_to

  Bigger than average for species/genus/etc
    E=brain,Q=large,In_comparison_to=
    default is same species as specified by genotype
  Comparative phenotypes
    E=brain,Q=large,In_comparison_to=
      requires recording phenotype IDs
    e.g. two experiments, same genotype, different environment, phenotype stronger in one

Ratio & relative_to

  Use cases:
    Size of brain relative to size of skull
    Size of brain relative to size of skull in an individual when compared to size of brain relative to size of skull in a typical individual of that species
  E=brain,Q=large,relative_to=skull, in_comparison_to=
    defaults to: whole organism

Modifiers

  E=bone, Q=notched, Mod=mild
  Standardised qualitative modifiers
    Meaning dependent on E and Q
  Can have multiple, cross-cutting scales
    qualitative and numeric/score based

  [Figure: a word scale (absent, mildly realised, normal, strong, extreme) shown against a score scale (0, 0.001, 0.01, 0.1, 1, 10, 100).]

Modifiers modify meaning of Q

  Influence of Mod on Q is subjective but the direction is objective
  Example: E=adult_human_body, during=sleep
    Q={low,high} temperature, Mod=mild,normal,moderate,extreme

  [Figure: word scale (labels: NOT, absent, mildly realised, normal, strong, extreme, abnormal, abn+, N/A) and score scale (0, 0.001, 0.01, 0.1, 1, 10, 100) shown against temperature; low temperature: 37, 36.5, 36, 35; high temperature: 37, 37.5, 38, 39.]

Modifiers and PATO

  Modifiers are not qualities
  Modifiers should not be in a true ontology
    But we can still give these PATO IDs
      kept separate from core PATO ontology
  Modifiers can be relational
    relatum may be implicit
    e.g. abnormal_with_respect_to
  Modifiers serve similar purposes as Values in tripartite EAV model
    Difference:
      absent, low, high are not treated in the same way as genuine quality types like ‘notched’, ‘large’, ‘diploid’, ‘pink’
      they are ingredients in the representation language, and not types in an ontology

  Heterozygous flies have very short and highly branched arista laterals.
    E=arista lateral, EQuant=all, Q=short, Mod=extreme, in_comparison_to=Dmel
    E=arista lateral, EQuant=all, Q=branched, Mod=extreme, in_comparison_to=Dmel

Measurements

  Measurements are not qualities
    In the schema, representations of measurements are attached to the representations of qualities
  Separate measurement schema
    don’t need to discuss fine grained details here
    some data providers will require more detail than others here
      e.g. averages, error bars, …
  E=tail, Q=length, Measurement=2cm
  E=tail, Q=length, Measurement=+.1cm, in_comparison_to=

Likeness

  Shape likeness
  Homeotic transformations
    E=A2 segment, Q=morphology, Similar_to=A3 segment
    Interp:
      An A2 segment with the morphological features of an A3 segment
  but not “heart-shaped leaves”

Conditionals

  Some phenotypes are only realised under certain conditions
    environment
      including chemical interactions, RNA interference etc
  we should separate conditionals (this phenotype only seen in this envirotype with this genotype) from data (on this occasion this phenotype seen in this envirotype with this genotype)

Schema elements

  Phenotype character:
    E
    Q
    EQuant
    E2
    Count
    Mod
    Relative_to
    In_comparison_to
    Similar_to
    Measurement
    Temporal
  Most of these elements are optional
    data providers pick and choose their level of expressivity
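The schema elements map naturally onto a record with optional fields. The following is a sketch, not the pheno-xml or OBD representation: the class name `PhenotypeCharacter`, the field types and the choice to make only E and Q mandatory are all assumptions made for illustration.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class PhenotypeCharacter:
    """Hypothetical record with one field per schema element (E and Q assumed required)."""
    E: str
    Q: str
    EQuant: Optional[str] = None
    E2: Optional[str] = None
    Count: Optional[int] = None
    Mod: Optional[str] = None
    Relative_to: Optional[str] = None
    In_comparison_to: Optional[str] = None
    Similar_to: Optional[str] = None
    Measurement: Optional[str] = None
    Temporal: Optional[str] = None

    def to_text(self) -> str:
        """Render only the elements that were supplied, as 'Name=value' pairs."""
        pairs = [(f.name, getattr(self, f.name)) for f in fields(self)]
        return ", ".join(f"{name}={value}" for name, value in pairs if value is not None)


simple = PhenotypeCharacter(E="kidney", Q="hypertrophied")
richer = PhenotypeCharacter(E="hand", Q="supernumerary parts", E2="finger", Count=6)

print(simple.to_text())  # E=kidney, Q=hypertrophied
print(richer.to_text())  # E=hand, Q=supernumerary parts, E2=finger, Count=6
```

A data provider using only the simple subset fills in E and Q; one needing more expressivity fills in further fields without changing the record type.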

future extensions

  boolean combinations
  conditional statements
    e.g. environment

  [Figure: modifier scale with labels ++, +, ., --]