The Joy of Ontology
Suzanna Lewis
SMI Colloquium
April 20th, 2006
Sections
Why make an ontology
What is an ontology
How to create an ontology
Logically
Technically
Organizationally
National Center for Biomedical Ontology
Case study: Phenotypes, our current work
on OBD
Why make an ontology?
What is the motivation?
The Problem(s) with data
Inaccessibility of widely distributed
data
Over abundance of information
Speed and performance
Interpreting the data syntactically
Interpreting the data semantically
We started the GO
To develop a shared language adequate for the
annotation of molecular characteristics across
organisms.
To agree on a mutual understanding of the
definition and meaning of any word used, and
thus to support cross-database queries.
To provide database access via these common
terms to gene product annotations and
associated sequences.
Annotation of Yeast Microarray Clusters Using GO
GENE   PROCESS                    FUNCTION                           CELLULAR COMPONENT
SDH1   tricarboxylic acid cycle   succinate dehydrogenase            mitochondrial inner membrane
NDI1   oxidative phosphorylation  NADH dehydrogenase                 mitochondrial inner membrane
QCR7   electron transport         ubiquinol--cytochrome-c reductase  mitochondrial inner membrane
COX6   oxidative phosphorylation  cytochrome-c oxidase               mitochondrial inner membrane
RIP1   electron transport         Rieske Fe-S protein                mitochondrial inner membrane
COX15  oxidative phosphorylation  cytochrome-c oxidase               mitochondrial inner membrane
CYT1   electron transport         cytochrome-c1                      mitochondrial inner membrane
COR1   electron transport         ubiquinol--cytochrome-c reductase  mitochondrial inner membrane
SDH3   tricarboxylic acid cycle   succinate dehydrogenase subunit    mitochondrial inner membrane
QCR6   electron transport         ubiquinol--cytochrome-c reductase  mitochondrial inner membrane
CYT1   electron transport         cytochrome-c1                      mitochondrial inner membrane
SDH4   tricarboxylic acid cycle   succinate dehydrogenase subunit    mitochondrial inner membrane
SDH2   tricarboxylic acid cycle   succinate dehydrogenase subunit    mitochondrial inner membrane
MDH1   tricarboxylic acid cycle   malate dehydrogenase               mitochondrial matrix
Microarray data from Figure 2K of Eisen et al. (1998). Cluster analysis and
display of genome-wide expression patterns, Proc. Natl. Acad. Sci. 95
(25): 14863-14868.
The Challenge of Communication
Ontologies are essential to
make sense of biomedical data
A Portion of the OBO Library
Motivation: to capture
biological reality
Inferences and decisions we make are
based upon what we know of the
biological reality.
An ontology is a computable
representation of this underlying
biological reality.
Enables a computer to reason over the
data in (some of) the ways that we do.
What is an ontology
Ontology (as a branch of
philosophy)
The science of what exists in every area of
reality
The classification of entities: what kinds of
things exist
The relations between these entities
Defines a scientific field's vocabulary and
the canonical formulations of its theories.
Seeks to solve problems which arise in
these domains.
A biological ontology is:
A machine interpretable representation
of some aspect of biological reality
what kinds
of things
exist?
what are the
relationships
between
these things?
eye disc
develops
from
sense organ
is_a
eye
part_of
ommatidium
Entity: a definition
anything which exists, including
things and processes, functions and
qualities, beliefs and actions,
software and images
Representation: a definition
An image, idea, map, picture,
name, description ... which refers to,
or is intended to refer to, some entity
or entities in reality
this ‘or is intended to refer to’ should
always be assumed
Ontologies represent types in
reality
ontology
reality
Scientific data represent
instances in reality
Two kinds of representational
artifact
Databases, inventories, images:
represent what is particular in reality =
instances
Ontologies, terminologies, catalogs:
represent what is general in reality
(exists in multiple instances) = types
(universals, kinds)
Ontologies are not for representing
concepts in people’s heads
reality
ontology
The researcher has a cognitive
representation of what is general,
based on his knowledge of the
science
Cognitive
representation
ontology
types
substance
organism
animal
mammal
cat
siamese
instances
frog
An ontology is like a scientific text;
it is a representation of types in reality
Cognitive
representation
Ontology = a
representation of types
reality
Atomic representational
unit: a definition
terms, icons, bar codes, alphanumeric
identifiers ... which
1. refer, or are intended to refer, to entities in
reality, and
2. are not built out of further subrepresentations
Representational units are the atoms in
the domain of representations
Modular representational
unit: a definition
A representation which is built out of
other representational units, which
together form a structure that mirrors
a corresponding structure in reality
Periodic Table
The Periodic Table
Ontology: a definition
A modular, representational
artifact whose representational
units are intended to represent
1. types in reality
2. the relations between these types
which are true universally (i.e. for all
instances)
lung is_a anatomical structure
lobe of lung part_of lung
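A minimal sketch of how such type-level assertions could be stored and queried, assuming a simple (subject, relation, object) representation. The `Ontology` class and its method names are illustrative only and do not correspond to any OBO tool. It loads the two example assertions above plus the substance-to-siamese hierarchy from the earlier slide.

```python
from collections import defaultdict


class Ontology:
    """Toy store of type-level assertions (subject, relation, object)."""

    def __init__(self):
        # relation name -> subject type -> set of object types
        self._edges = defaultdict(lambda: defaultdict(set))

    def add(self, subject, relation, obj):
        self._edges[relation][subject].add(obj)

    def ancestors(self, term, relation="is_a"):
        """All types reachable from `term` by following `relation` transitively."""
        seen = set()
        stack = [term]
        while stack:
            current = stack.pop()
            for parent in self._edges[relation][current]:
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


onto = Ontology()
onto.add("lung", "is_a", "anatomical structure")
onto.add("lobe of lung", "part_of", "lung")
for child, parent in [("siamese", "cat"), ("cat", "mammal"),
                      ("mammal", "animal"), ("animal", "organism"),
                      ("organism", "substance")]:
    onto.add(child, "is_a", parent)

print(sorted(onto.ancestors("siamese")))
```

Because the assertions hold for all instances of a type, anything classified as a siamese can also be retrieved by a query for mammal or animal.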
How to create an ontology
Part 1
The logic and science
In computer science, there is an
information handling problem
Different groups of data-gatherers
develop their own idiosyncratic terms in
which they represent information.
To put this information together, methods
must be found to resolve terminological
and conceptual incompatibilities.
Again, and again, and again…
The Reality
Do not assume that data integration
can be brought about by somehow
‘mapping’ incompatible, low quality
ontologies built for different
purposes
Two flavors of ontology
1. Application ontology
2. Reference ontology
Application Ontology
An application ontology is
comparable to an engineering
artifact such as a software tool. It is
constructed for specific practical
purposes.
Reference Ontology
A reference ontology is analogous
to a scientific theory; it seeks to
optimize representational adequacy
to its subject matter
Assumptions
There are best practices in ontology
development which should be followed
to create stable high-quality ontologies
Shared high quality ontologies foster
cross-disciplinary and cross-domain re-use
of data, and create larger communities
Why do we need rules/standards
for good ontology?
Ontologies must be intelligible both to humans
(for annotation) and to machines (for reasoning
and error-checking)
Unintuitive rules lead to errors in classification
Simple, intuitive rules facilitate training of
curators and annotators
Common rules allow alignment with other
ontologies (and thus cross-domain exploitation
of data)
Logically coherent rules enhance harvesting of
content through automatic reasoning systems
Ontologies built according to
common logically coherent rules
Will make entry easier and yield a safer
growth path
You can start small, annotating your data
with initial fragments of a well-founded
ontology, confident that the results will still
be usable when the ontology grows
larger and richer
The OBO Foundry
A subset of OBO ontologies whose developers
agree in advance to accept a common set of
principles designed to assure
intelligibility to biologist curators, annotators, users
formal robustness
stability
compatibility
interoperability
support for logic-based reasoning
The OBO Foundry
1. The ontology is open and available to be used by
all.
2. The developers of the ontology agree in
advance to collaborate with developers of other
OBO Foundry ontologies where domains overlap.
3. The ontology is in, or can be instantiated in, a
common formal language.
4. The ontology possesses a unique identifier space
within OBO.
5. The ontology provider has procedures for
identifying distinct successive versions.
The OBO Foundry
6. The ontology has a clearly specified and clearly
delineated content.
7. The ontology includes textual definitions for all
terms.
8. The ontology is well-documented.
9. The ontology has a plurality of independent
users.
10. The ontology uses relations which are
unambiguously defined following the pattern of
definitions laid down in the OBO Relation
Ontology.
Orthogonality
Ontology groups who choose to be
part of the OBO Foundry thereby
commit themselves to collaborating
to resolve disagreements which arise
where their respective domains
overlap
agreed on relations
The success of ontology alignment demands
that ontological relations (is_a, part_of, ...)
have the same meanings in the different
ontologies to be aligned.
See “Relations in Biomedical Ontologies”,
Genome Biology, May 2005: Barry Smith,
Werner Ceusters, Bert Klagges, Jacob
Köhler, Anand Kumar, Jane Lomax, Chris
Mungall, Fabian Neuhaus, Alan L. Rector,
and Cornelius Rosse
Three fundamental dichotomies
continuants vs. occurrents
dependent vs. independent
types vs. instances
ONTOLOGIES ARE
REPRESENTATIONS OF TYPES
IN REALITY
For example in the GO
Molecules, cell components, and organisms are
independent continuants which have functions
Functions are dependent continuants which
become realized through special sorts of
processes we call functionings
Processes are occurrents; they include functionings,
side-effects, and stochastic processes
Continuants (aka endurants)
have continuous existence in time
preserve their identity through
change
exist in toto whenever they exist at
all
Occurrents (aka processes)
have temporal parts
unfold themselves in successive
phases
exist only in their phases
Continuants vs. Occurrents
Anatomy vs. Physiology
Snapshot vs. Video
Stocks vs. Flows
Commodities vs. Services
Products vs. Processes
Dependent entities
require independent continuants as
their bearers
There is no grin without a cat
Dependent vs.
independent continuants
Independent continuants
(organisms, cells, molecules,
environments)
Dependent continuants (qualities,
shapes, roles, propensities, functions)
E.g. the acidity of this gut
All occurrents are dependent
entities
They are dependent on those
independent continuants which are
their participants (agents, patients,
media ...)
GO’s three ontologies
molecular function: dependent continuant
biological process: occurrent
cellular component: independent continuant
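[Diagram: at each level of granularity, a function is realized in a functioning, which is a process]
molecule: molecular function, realized in a molecular process
cellular component: cellular function, realized in a cellular process
organism: organism-level biological function, realized in an organism-level biological process
Example: heart; function: to pump blood; functioning: pumping blood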
UBO
Upper Biomedical Ontology
Continuant: 3D
Independent
Continuant
Quality
Dependent
Continuant
Function
Occurrent: 4D
Functioning
Side-Effect,
Stochastic
Process, ...
Spatial
Region
instances (in space and time)
How to create an ontology
Part 2
The technical aspects
Separate the Database from
the Ontology
For extensibility
For generality
For reasonability
For interoperability
Why ontologies are worth it
Minimize database maintenance costs
Communication between researchers
As well as
Better query facilities
Ability to draw inferences
Detect correlations
Facilitate computational interpretation of text
And more…
Before: domain knowledge is
embedded in the db schema
Gene
table
Exon
table
RNA
table
Protein
table
Embedding domain knowledge
in the db schema is expensive
The logical description and the physical
database description of the biology are commingled
Therefore new biological knowledge will force:
Schema changes: e.g. new tables
Query changes: that explicitly refer to tables
Middleware changes: to retrieve and format
GUI changes: to display
After: domain knowledge is
embedded in the ontology
feature
table
Ontology driven db schema is
less expensive to maintain
The logical description and the physical
database description of the biology are
developed independently
Therefore new biological knowledge will
only require:
Ontology changes: e.g. new terms
GUI changes: display
No schema changes
No query changes
No middleware changes
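A minimal sketch of the "after" design, assuming a single generic feature table whose rows are typed by ontology terms. The table names, column names, and sample features below are hypothetical and chosen only for illustration; SQLite is used because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE term (
        term_id INTEGER PRIMARY KEY,
        name    TEXT UNIQUE NOT NULL
    );
    CREATE TABLE feature (
        feature_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        type_id    INTEGER NOT NULL REFERENCES term(term_id)
    );
""")


def add_term(name):
    """Register a new ontology term; no schema change is needed."""
    conn.execute("INSERT OR IGNORE INTO term (name) VALUES (?)", (name,))


def add_feature(name, type_name):
    """Store a feature typed by an existing ontology term."""
    row = conn.execute(
        "SELECT term_id FROM term WHERE name = ?", (type_name,)
    ).fetchone()
    if row is None:
        raise ValueError(f"unknown ontology term: {type_name}")
    conn.execute(
        "INSERT INTO feature (name, type_id) VALUES (?, ?)", (name, row[0])
    )


def features_of_type(type_name):
    """One query serves every feature type, old or new."""
    rows = conn.execute(
        "SELECT f.name FROM feature f JOIN term t ON f.type_id = t.term_id "
        "WHERE t.name = ? ORDER BY f.name",
        (type_name,),
    ).fetchall()
    return [r[0] for r in rows]


for term in ("gene", "exon", "RNA", "protein"):
    add_term(term)
add_feature("SDH1", "gene")
add_feature("SDH1 exon 1", "exon")  # illustrative feature name

# New biological knowledge arrives: only the ontology grows.
add_term("pseudogene")
add_feature("example_pseudogene_1", "pseudogene")  # placeholder name

print(features_of_type("pseudogene"))
```

Adding "pseudogene" required one new row in the term table; the schema, the query, and the retrieval function were left unchanged.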
Reality:
ultimately
this is what
the ontology
must reflect
Step 1:
Build an ontology
that reflects reality
Observations
and
experimentation
Step 2: Data capture
Database:
UIDs serving
as proxies for
instances
Patient IDs
Sequence accessions
Genetic strain IDs
…
Step 3:
Classify data
using the
ontology
Ontology is a contract
between researchers
A common language that allows us to share
knowledge
Researchers have a stake in it:
Every individual will benefit by being able to
accurately interpret someone else’s data
No more pre-scrubbing
No more time spent translating
Rules on types
Don’t confuse types with instances
Don’t confuse instances with leaf nodes
Don’t confuse types with ideas
Don’t confuse types with ways of getting
to know types
Don’t confuse types with ways of talking
about types
Don’t confuse types with data about
types
An astronomy ontology
should not include
'Buzz Aldrin'
Rules on terms
Terms should be in the singular
Avoid abbreviations even when it is
clear in context what they mean
(‘breast’ for ‘breast tumor’)
Think of each term ‘A’ in an
ontology as shorthand for a term of
the form:
‘the type A’
Rules on Definitions
The terms used in a definition should be
simpler (more intelligible) than the term to
be defined; otherwise the definition
provides no assistance
Definitions should be intelligible to both
machines and humans
to human understanding
Humans need clarity and modularity
to machine processing
Machines can cope with the full formal
representation
Confusing non-useful
definitions
Swimming
Swimming is healthy and has 8 letters
Poland
The name of Poland
When defining terms use
Aristotelian definitions
The definition of ‘A’ takes the form:
an A =def. a B which ...
where B is A’s parent in the hierarchy
A human being =def. an animal which is
rational
A helicase =def. an enzyme which
catalyzes the hydrolysis of ATP to unwind
the DNA helix
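A small sketch of the same pattern as a data structure, assuming each term stores its parent (B) and the distinguishing clause. The `Term` class, its field names, and the `with_article` helper are hypothetical; the two example definitions are the ones given above.

```python
from dataclasses import dataclass
from typing import List, Optional


def with_article(noun: str) -> str:
    """Prefix 'a' or 'an' using a simple first-letter rule."""
    return ("an " if noun[0].lower() in "aeiou" else "a ") + noun


@dataclass(frozen=True)
class Term:
    name: str
    genus: Optional["Term"] = None  # B: the parent type in the hierarchy
    differentia: str = ""           # what sets A apart from other Bs

    def definition(self) -> str:
        """Render the definition in the 'A =def. a B which ...' pattern."""
        if self.genus is None:
            return f"{with_article(self.name)} (no parent given in this sketch)"
        return (f"{with_article(self.name)} =def. "
                f"{with_article(self.genus.name)} which {self.differentia}")

    def lineage(self) -> List[str]:
        """Names of all ancestors, nearest parent first."""
        names = []
        parent = self.genus
        while parent is not None:
            names.append(parent.name)
            parent = parent.genus
        return names


animal = Term("animal")
enzyme = Term("enzyme")
human = Term("human being", animal, "is rational")
helicase = Term(
    "helicase", enzyme,
    "catalyzes the hydrolysis of ATP to unwind the DNA helix",
)

print(human.definition())
print(helicase.definition())
```

Because every definition names its parent, walking `genus` links recovers the full parentage of a term from its definition alone.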
Use of Aristotelian
definitions
Makes defining terms easier
Each definition encapsulates in modular
form the entire parentage of the defined
term
The entire information content of the
FMA’s term hierarchy and definitions can
be translated very cleanly into a
computer representation
Now accepted by GO
Summary: How to build your
ontology
Keep It Simple:
lowest possible barrier to entry
Technology independence
Clarity of definitions and scope
“With new data, we change our minds”
An ontology must adapt to reflect current
understanding of reality
Mechanisms to support change
How to build an ontology
Part 3
The sociology and
organizational aspects
Elements for Success: GO
A Community with a common vision
A pool of talented and motivated
developers/scientists
A mix of academic and commercial
An organized, lightweight approach to
product development
A leadership structure
Communication
A well-defined scope (our “business”)
Adopted from “Open Source Menu for Success”
Gene Ontology
Community annotation production
“Extreme Programming” techniques
to distributed ontology generation
Revision control
Nightly conflict resolution
Users are integral to the team
Rapid iterations
Why
[Flowchart: Survey. Is the domain covered? Is it public? Is it active? Is it applied? Does it have a community? If yes, improve (collaborate and learn); if no, salvage or develop.]
Is your domain covered?
Due diligence & background
research
Step 1: Learn what is out there
The most comprehensive list is on the
OBO site. http://obo.sourceforge.net
Assess ontologies critically and
realistically.
Is it privately held?
Ontologies must be shared
Communities form scientific theories
that seek to explain all of the existing evidence
and can be used for prediction
These communities are all directed to the same
biological reality, but have their own perspective
The computable representation must be shared
Ontology development is inherently
collaborative
Open ontologies become connected to
instance data & this feeds back on ontology
development
Is it active?
Pragmatic assessment of an
ontology
Is there access to help, e.g.
a help e-mail address?
Does a warm body answer help mail
within a ‘reasonable’ time—say 2
working days ?
Is it applied?
Toy ontologies are not useful
Every ontology improves when it is applied to
actual instances of data
It improves even more when these data are
used to answer research questions
There will be fewer problems in the ontology and
more commitment to fixing remaining problems
when important research data is involved that
scientists depend upon
Be very wary of ontologies that have never been
applied
Work with that community
To improve (if you found one)
To develop (if you did not)
How?
Improve
Collaborate
and Learn
Community vs. Committee
?
Many people have (understandably)
reservations about collaborative
development, because it can easily be
confused with design-by-committee
projects.
Members of a committee represent
themselves.
Members of a community represent their
community.
Design for purpose - not in
abstract
Who will use it?
If no one is interested, then go back to
bed
What will they use it for?
Define the domain
Who will maintain it?
Be pragmatic and modest
Start with a concrete
proposal —not a blank slate.
But do not commit your ego to it.
Distribute to a small group you respect:
With a shared commitment.
With broad domain knowledge.
Who will engage in vigorous debate without
engaging their egos (or, at least not too
much).
Who will do concrete work.
Step 1:
Alpha0: the first proposal - broad in breadth but
shallow in depth. By one person with broad
domain knowledge.
Distribute to a small group (<6).
Get together for two days and engage in vigorous
discussion. Be open and frank. Argue, but do not be
dogmatic.
Reiterate over a period of months. Do as much
as possible face-to-face, rather than by
phone/email. Meet for 2 days every 3 months or
so.
Step 2:
Distribute Alpha1 to your group.
All now test this Alpha1 in real life
By classifying representations of
instances with these types
Do not worry if (at this stage) you
do not have tools.
Step 3:
Reconvene as a group for two days.
Share experiences from
implementation:
Can your Alpha1 be implemented in a
useful way ?
What are the conceptual problems ?
What are the structural problems ?
Step 4:
Establish a mechanism for change.
Use CVS or Subversion.
Limit the number of editors with write
permission (ideally to one person).
Unique stable identifiers
History tracking of changes
Release a Beta1.
Seriously implement Beta1 in real life.
Build the ontology in depth.
Step 5:
After about 6 months reconvene
and evaluate.
Is the ontology suited to its purpose ?
Is it, in practice, usable ?
Are we happy about its broad
structure and content ?
Step 6:
Go public.
Release ontology to community.
Release the products of the
annotations.
Invite broad community input and
establish a mechanism for this (e.g.
SourceForge).
Step 7:
Proselytize
Publish in a high profile journal
Engage new user groups
Emphasize openness
Write a grant
Step 8:
Iterate
Improvements come
in two forms
It is impossible to get it
right the 1st (or 2nd, or
3rd, …) time.
What we know about
reality is continually
growing
Improve
Collaborate
and Learn
Step 9:
Bon appetit
Use the power of
combination and collaboration
as far as possible don’t reinvent
ontologies are like telephones: they are
valuable only to the degree that they are
used and networked with other
ontologies
but choose working telephones
most telephones were broken when the
technology was first being developed
The National Center for
Biomedical Ontology
Stanford – Berkeley
Mayo – Victoria – Buffalo
UCSF – Oregon – Cambridge
Bioinformatics and Computational Biology
National Centers for
Biomedical Computing—2005
1. National Center for Integrative
Biomedical Informatics (Michigan)
2. National Center for Multi-Scale
Study of Cellular Networks
(Columbia)
3. National Center for Biomedical
Ontology
Stanford: Tools for ontology alignment,
indexing, and management (Cores 1, 4–7:
Mark Musen)
Lawrence–Berkeley Labs: Tools to use
ontologies for data annotation (Cores 2, 5–7:
Suzanna Lewis)
Mayo Clinic: Tools for access to large
controlled terminologies (Core 1: Chris Chute)
Victoria: Tools for ontology and data
visualization (Cores 1 and 2: Margaret-Anne
Storey)
University at Buffalo: Dissemination of best
practices for ontology engineering (Core 6:
Barry Smith)
Driving Biological Projects
Trial Bank: UCSF, Ida Sim
Flybase: Cambridge, Michael
Ashburner
ZFIN: Oregon, Monte Westerfield
Case study: Phenotypes,
our current work on OBD
Animal disease models
Animal models
Mutant Gene
Mutant or
missing Protein
Mutant Phenotype
Animal disease models
Humans
Mutant Gene
Animal models
Mutant Gene
Mutant or
missing Protein
Mutant or
missing Protein
Mutant Phenotype
(disease)
Mutant Phenotype
(disease model)
SHH-/+
SHH-/-
shh-/+
shh-/-
Phenotype (clinical sign) = entity + quality
P1 = eye + hypoteloric
P2 = midface + hypoplastic
P3 = kidney + hypertrophied
Entities (ZFIN): eye, midface, kidney
Qualities (PATO): hypoteloric, hypoplastic, hypertrophied
Phenotype
(clinical sign) = entity
+ quality
Anatomical ontology
Cell & tissue ontology
Developmental ontology
Gene ontology
biological process
cellular component
+
PATO
(phenotype and trait ontology)
Phenotype (clinical sign) = entity + quality
P1 = eye + hypoteloric
P2 = midface + hypoplastic
P3 = kidney + hypertrophied
Syndrome (disease) = P1 + P2 + P3 = holoprosencephaly
Human holoprosencephaly
Zebrafish
shh
Zebrafish
oep
OMIM
gene
ZFIN
gene
LAMB1
lamb1
FECH
FlyBase
gene
FlyBase
mut pub
ZFIN
mut pub
LanB1
5
15
fech
Ferrochelatase
2
5
GLI2
gli2a
ci
388
SLC4A1
slc4a1
CG8177
MYO7A
myo7a
ALAS2
mouse
rat
SNO
MED
OMIM disease
39
-
29
Protoporphyria, Erythropoietic
41
22
-
7
7
19
Renal Tubular Acidosis, RTADR
ck
84
5
16
Deafness; DFNB2; DFNA11
alas2
Alas
1
7
14
Anemia, Sideroblastic, X-Linked
KCNH2
kcnh2
sei
27
3
12
-
MYH6
myh6
Mhc
166
3
1
12
Cardiomyopathy, Familial
Hypertrophic; CMH
TP53
tp53
p53
64
3
3
19
11
Breast Cancer
ATP2A1
atp2a1
Ca-P60A
32
6
1
11
Brody Myopathy
EYA1
eya1
eya
251
5
4
6
Branchiootorenal Dysplasia
SOX10
sox10
Sox100B
1
17
4
4
Waardenburg-Shah Syndrome
2
9
3
National Center for Biomedical Ontology
Capture and index experimental results
Open
Biomedical
Ontologies
(OBO)
Revise
biomedical
understanding
Open
Biomedical
Data (OBD)
BioPortal
Relate
experimental
data to results
from other
sources
Phenotype as an
observation
context
The class of
thing observed
environment
evidence
publication
figure
assay
genetic
sequence ID
ontology
Review of proposed EAV EQ
model
A phenotype is described using an
Entity-Quality double
Entities are drawn from various OBO
ontologies—cell, anatomies, GO, …
Qualities are drawn from one
ontology—PATO
Separation of concerns
Not phenotypes:
Genotype
Environment
Assay, measurement systems
Images
schema:
Association = Genotype Phenotype Environment Assay
Phenotype = Entity Quality
Entity = OBOClassID
Quality = PATOClassID
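The schema above can be sketched as plain record types. This is an illustration under stated assumptions: readable labels stand in for real OBO and PATO class IDs, "genotype-A" is a placeholder rather than a real allele, and the `matches_syndrome` helper is hypothetical. The phenotypes are the P1, P2, P3 example from the earlier slide, combined following "Syndrome = P1 + P2 + P3".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phenotype:
    entity: str   # stands in for an OBOClassID
    quality: str  # stands in for a PATOClassID


@dataclass(frozen=True)
class Association:
    genotype: str
    phenotype: Phenotype
    environment: str = "unspecified"
    assay: str = "unspecified"


p1 = Phenotype("eye", "hypoteloric")
p2 = Phenotype("midface", "hypoplastic")
p3 = Phenotype("kidney", "hypertrophied")

# A syndrome modelled as a set of phenotypes.
holoprosencephaly = frozenset({p1, p2, p3})

# "genotype-A" is a placeholder, not a real allele identifier.
observed = [Association("genotype-A", p) for p in (p1, p2, p3)]


def matches_syndrome(associations, syndrome):
    """True if the annotated phenotypes cover every phenotype of the syndrome."""
    return syndrome <= {a.phenotype for a in associations}


print(matches_syndrome(observed, holoprosencephaly))
```

Keeping genotype, environment, and assay on the association rather than on the phenotype mirrors the separation of concerns listed above.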
2003 Pilot study
Trial of EAV model on small collection of
genotypes
FlyBase
ZFIN
Genes were non-orthologous
New curations - in progress
orthologous genes with clinical relevance
Use the same data model and exchange format?
Example data records
Genotype  Entity  Value
npo       gut     dysplastic
          gut     small
r210      retina  irregular
          brain   fused
ZFIN schema extension: stages
Genotype  Stage                    Entity  Quality
npo       Hatching:Pec-fin         gut     dysplastic
          Hatching:Pec-fin         gut     small
r210      Hatching:Long-pec        retina  irregular
          Larval:Protruding-mouth  brain   fused
Stages
Association = Genotype Phenotype Environment Assay
Phenotype = Stage* Entity Quality
Entity = OBOClassID
Stage = OBOAnatomicalStageClassID
Quality = PATOClassID
* means zero or more
Monadic and relational
qualities
Monadic:
the quality inheres in a single entity
Relational:
the quality inheres in two or more entities
sensitivity of an organism to a kind of drug
sensitivity of an eye to a wavelength of light
can turn relational qualities into cross-product monadic
qualities
e.g. sensitivityToRedLight
better to use relational qualities
avoids redundancy with existing ontologies
Incorporating relational
qualities
Association = Genotype Phenotype Environment Assay
Phenotype = Stage* Entity Quality Entity*
Entity = OBOClassID
Quality = PATOVersion2ClassID
Example data record:
Phenotype = “organism” sensitiveTo “puromycin”
Measurable qualities
Some qualities are inexact and implicitly relative to a
wild-type or normal quality
relatively short, relatively long, relatively reduced
easier than explicitly representing:
this tail length shorter-than ‘mouse’ wild-type tail length
Some qualities are determinable
use a measure function
unit, value, {time}
this tail has length L
measure(L, cm) = 2
Keep measurements separate from (but linked to)
quality ontology
Incorporating measurements
Association = Genotype Phenotype Environment Assay
Phenotype = Stage* Entity Quality Entity* Measurement*
Measurement = Unit Value (Time)
Entity = OBOClassID
Quality = PATOVersion2ClassID
Example data record:
Phenotype = “gut” “acidic” Measurement = “pH” 5
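A sketch of the extended phenotype record with stages, related entities, and measurements, assuming tuples for the "zero or more" slots. Class and field names are illustrative, labels stand in for class IDs, and the three records reuse the example values shown on the slides.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Measurement:
    unit: str
    value: float
    time: Optional[str] = None  # "(Time)" is optional in the schema


@dataclass(frozen=True)
class Phenotype:
    entity: str                                    # OBOClassID stand-in
    quality: str                                   # PATO class ID stand-in
    stages: Tuple[str, ...] = ()                   # Stage*
    related_entities: Tuple[str, ...] = ()         # Entity*
    measurements: Tuple[Measurement, ...] = ()     # Measurement*

    @property
    def is_relational(self) -> bool:
        """A relational quality inheres in two or more entities."""
        return len(self.related_entities) > 0


staged = Phenotype("gut", "dysplastic", stages=("Hatching:Pec-fin",))
drug_sensitivity = Phenotype(
    "organism", "sensitiveTo", related_entities=("puromycin",)
)
gut_ph = Phenotype("gut", "acidic", measurements=(Measurement("pH", 5),))

for record in (staged, drug_sensitivity, gut_ph):
    print(record.entity, record.quality, record.is_relational)
```

The measurement is a separate record linked from the phenotype, so the quality ontology itself carries no numeric values.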
The Methodology of Annotations
Scientific curators use experimental
observations reported in the biomedical
literature to link gene products with GO
terms in annotations.
The gene annotations taken together
yield a slowly growing computer-interpretable map of biological reality.
The process of annotating literature also
leads to improvements and extensions of
the ontology itself, which institutes a
virtuous cycle of improvement in the
quality and reach of future annotations
and of future versions of the ontology.
When we annotate the
record of an experiment
we use terms representing types to
capture what we learn about the
instances
this experiment as a whole (a process)
these instances experimented upon
the instances are typical
they are representatives of a type
Ontology
A thing of beauty is a joy forever
With acknowledgement and thanks to: Mark
Musen, Barry Smith, Sima Misra, Chris Mungall,
Daniel Rubin, and David Hill
Interpretation of the schema
How should EAV data records be
interpreted by a computer?
What are the instances?
Is the EAV schema just to improve
database searching?
Can it be used for meaningful cross-species comparisons?
What is the Entity slot used for?
Genotype     Entity                        Quality
npo          gut                           dysplastic
             gut                           small
r210         retina                        irregular
             brain                         fused
tm84         d/v pattern formation         abnormal
             blood islands                 number increased
Bsb[2]       elongation of arista lateral  arrested
C-alpha[1D]  adult behaviour               uncoordinated
2003 trial data: FB & ZFIN
What is the Entity slot used for?
In practical terms:
An ID from one of the following ontologies
GO CC, BP, and MF
Species-specific anatomical ontology
OBO Cell
Or a cross-product
e.g. acidification (GO:0045851)
which has_location (OBOREL) midgut (FBbt00005383)
[example from FBal0062296: Acidification in the
midgut of homozygous larvae is often less than in
wild-type larvae]
But what does it mean in the context of
an annotation?
Universals and particulars
An ontology consists of universals (classes)
Fruitfly, wing, flight
Experimental data generally concerns
particulars (instances) that instantiate
universals
this particular wing of this particular fruitfly
this particular fruitfly participating in this
particular flight from here to there
In annotation we often use a class ID as a
proxy for an (unnamed) instance (or
collection of instances)
It is important to always keep this
distinction in mind
What is the Quality slot used
for?
Genotype     Entity                        Quality              Quality
npo          gut                           structure            dysplastic
             gut                           relative size        small
r210         retina                        pattern              irregular
             brain                         structure            fused
tm84         d/v pattern formation         qualitative          abnormal
             blood islands                 relative number      number increased
Bsb[2]       elongation of arista lateral  process              arrested
C-alpha[1D]  adult behaviour               behavioral activity  uncoordinated
2003 trial data: FB & ZFIN
Qualities
Treat as Qualities
a dependent entity
a quality must have independent entity(s) as bearer
the quality inheres_in the bearer
Examples
The particular shape of this ball
The particular structure of this wing
The particular length of this tail
The particular rate of synaptic transmission
between these two neurons
Attribute universals vs
Attribute particulars
In an EAV annotation, a PATO class ID typically serves as a proxy for
an unnamed attribute instance
Universals (classes) must always be defined in terms of their instances
A Formal Theory of Substances, Qualities, and Universals
Fabian NEUHAUS Pierre GRENON Barry SMITH
How is the attribute slot used?
Genotype     Entity                        Attribute            Attribute
npo          gut                           structure            dysplastic
             gut                           relative size        small
r210         retina                        pattern              irregular
             brain                         structure            fused
tm84         d/v pattern formation         qualitative          abnormal
             blood islands                 relative number      number increased
Bsb[2]       elongation of arista lateral  process              arrested
C-alpha[1D]  adult behaviour               behavioral activity  uncoordinated
Current Model
Association = Genotype Phenotype Environment Assay
Phenotype = Stage* Entity Attribute
Entity = OBOClassID
Stage = OBOAnatomicalStageClassID
Attribute = PATOClassID
* means zero or more
Composite phenotype classes
Mammalian phenotype has composite
phenotype classes
e.g. “reduced B cell number”
Compose at annotation time or ontology
curation time?
False dichotomy
Core 2 will help map between composite
class based annotation and EAV
annotation
Interpreting annotations
Annotations are data records
typically use class IDs
implicitly refer to instances
How do we map an annotation to
instances?
Important for using annotations
computationally
Interpreting annotations (1)
What does an EA (or EAV) annotation mean?
Annotation:
Genotype=“FBal00123” E=“brain” A=“fused”
presumed implied meaning:
this organism
has_part x, where
x instance_of “brain”
x has_quality “fused”
or in natural language:
“this organism has a fused brain”
Various built-in assumptions
Interpreting annotations (II)
What does this mean:
annotation:
Genotype=“FBal00123” E=“wing” A=“absent”
using same mapping as annotation I:
fly98 has_part x, where
x instance_of “wing”
x has_quality “absent”
or in natural language:
this fly has a wing which is not there!
What we really intend:
NOT(this organism has_part x, where x
instance_of “wing”)
Interpreting annotations (II)
What does this mean:
annotation:
Genotype=“FBal00123” E=“wing” A=“absent”
using same mapping as annotation I:
this organism has_part x, where
x instance_of “wing”
x has_quality “absent”
or in natural language:
this fly has a wing which is not there!
What we really intend:
this organism has_quality “wingless”
“wingless” = the property of having count(has_part “wing”)=0
Are our computational
representations intended to
capture reality?
Does this matter?
If we simply use the colloquial expression
“absent”
What are the consequences?
Basic search will be fine
e.g. “find all wing phenotypes”
Logical reasoners will compute incorrect results
Computers will not be able to reason correctly
We must explicitly provide specific rules for certain
attributes such as “absent”
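A minimal sketch of what such a rule could look like, assuming annotations are expanded into (subject, relation, object) statements. The function names and the `COUNT_RULES` table are hypothetical; the expansion for ordinary attributes follows the mapping used for the fused-brain example, and the "absent" rule follows the "wingless" reading above.

```python
def naive_interpretation(entity, attribute):
    """Default mapping: assert a part that bears the attribute as a quality."""
    return [
        ("this organism", "has_part", "x"),
        ("x", "instance_of", entity),
        ("x", "has_quality", attribute),
    ]


# Attributes that describe a count of parts rather than a quality of one part.
# This rule table is an assumption made for illustration.
COUNT_RULES = {
    "absent": "count(has_part {entity}) = 0",
}


def interpret(entity, attribute):
    """Expand an E/A annotation, applying a special rule where one exists."""
    rule = COUNT_RULES.get(attribute)
    if rule is None:
        return naive_interpretation(entity, attribute)
    return [("this organism", "has_quality", rule.format(entity=entity))]


print(interpret("brain", "fused"))
print(interpret("wing", "absent"))
```

With the rule in place, the "absent" annotation no longer asserts the existence of a wing instance, so a reasoner is not led to conclude that the organism has a wing.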
Interpreting annotations (III)
What does this mean:
annotation:
E=“digit” A=“supernumerary”
using same interpretation as annotation I:
this organism has_part x, where
x instance_of “digit”
x has_quality “supernumerary”
or in natural language:
this organism has a particular finger which is supernumerary!!
What we really intend:
this person has_quality “supernumerary finger”
“supernumerary finger” = the property of
having count(has_part “digit”) > wild-type
Interpreting annotations (IV)
What does this mean:
annotation:
Gt=“mp001” E=“brown fat cell”
A=“increased quantity”
using same mapping as annotation I:
this organism has_part x, where
x instance_of “brown fat cell”
x has_quality “increased quantity”
or in natural language:
this organism has a particular brown fat cell which is
increased in quantity
What we really intend:
this organism has_part
population_of(“brown fat cell”) which
has_quality increased size
Other use cases
Spermatocyte devoid of asters
Homeotic transformations
Increased distance between wing
veins
Some vs. all
Alternate perspectives
process vs. state
regulatory processes:
acidification of midgut has_quality reduced rate
midgut has_quality low acidity
development vs. behavior
wing development has_quality abnormal
flight has_quality intermittent
granularity (scale)
chemical vs. molecular vs. cell vs. tissue vs. anatomical part