The Joy of Ontology

Suzanna Lewis
SMI Colloquium
April 20th, 2006
Sections
 Why make an ontology
 What is an ontology
 How to create an ontology
 Logically
 Technically
 Organizationally
 National Center for Biomedical Ontology
 Case study: Phenotypes, our current work
on OBD
Why make an ontology?
What is the motivation?
The Problem(s) with data
Inaccessibility of widely distributed
data
Over abundance of information
Speed and performance
Interpreting the data syntactically
Interpreting the data semantically
We started the GO
 To develop a shared language adequate for the
annotation of molecular characteristics across
organisms.
 To agree on a mutual understanding of the
definition and meaning of any word used, and
thus to support cross-database queries.
 To provide database access via these common
terms to gene product annotations and
associated sequences.
Annotation of Yeast Microarray Clusters Using GO

GENE    PROCESS                     FUNCTION                           CELLULAR COMPONENT
SDH1    tricarboxylic acid cycle    succinate dehydrogenase            mitochondrial inner membrane
NDI1    oxidative phosphorylation   NADH dehydrogenase                 mitochondrial inner membrane
QCR7    electron transport          ubiquinol--cytochrome-c reductase  mitochondrial inner membrane
COX6    oxidative phosphorylation   cytochrome-c oxidase               mitochondrial inner membrane
RIP1    electron transport          Rieske Fe-S protein                mitochondrial inner membrane
COX15   oxidative phosphorylation   cytochrome-c oxidase               mitochondrial inner membrane
CYT1    electron transport          cytochrome-c1                      mitochondrial inner membrane
COR1    electron transport          ubiquinol--cytochrome-c reductase  mitochondrial inner membrane
SDH3    tricarboxylic acid cycle    succinate dehydrogenase subunit    mitochondrial inner membrane
QCR6    electron transport          ubiquinol--cytochrome-c reductase  mitochondrial inner membrane
SDH4    tricarboxylic acid cycle    succinate dehydrogenase subunit    mitochondrial inner membrane
SDH2    tricarboxylic acid cycle    succinate dehydrogenase subunit    mitochondrial inner membrane
MDH1    tricarboxylic acid cycle    malate dehydrogenase               mitochondrial matrix

Microarray data from Figure 2K of Eisen et al. (1998). Cluster analysis and
display of genome-wide expression patterns, Proc. Natl. Acad. Sci. 95 (25): 14863-14868.
The Challenge of Communication
Ontologies are essential to
make sense of biomedical data
A Portion of the OBO Library
Motivation: to capture
biological reality
 Inferences and decisions we make are
based upon what we know of the
biological reality.
 An ontology is a computable
representation of this underlying
biological reality.
 Enables a computer to reason over the
data in (some of) the ways that we do.
What is an ontology
Ontology (as a branch of
philosophy)
 The science of what exists in every area of
reality
 The classification of entities: what kinds of
things exist
 The relations between these entities
 Defines a scientific field's vocabulary and
the canonical formulations of its theories.
 Seeks to solve problems which arise in
these domains.
A biological ontology is:
A machine interpretable representation
of some aspect of biological reality
What kinds of things exist?
What are the relationships between these things?
[Diagram: eye is_a sense organ; eye develops from eye disc; ommatidium part_of eye]
Entity: a definition
anything which exists, including
things and processes, functions and
qualities, beliefs and actions,
software and images
Representation: a definition
An image, idea, map, picture,
name, description ... which refers to,
or is intended to refer to, some entity
or entities in reality
this ‘or is intended to refer to’ should
always be assumed
Ontologies represent types in
reality
ontology
reality
Scientific data represent
instances in reality
Two kinds of representational
artifact
Databases, inventories, images:
represent what is particular in reality =
instances
Ontologies, terminologies, catalogs:
represent what is general in reality
(exists in multiple instances) = types
(universals, kinds)
Ontologies are not for representing
concepts in people’s heads
reality
ontology
The researcher has a cognitive
representation of what is general,
based on his knowledge of the
science
Cognitive
representation
ontology
types
substance
organism
animal
mammal
cat
siamese
instances
frog
An ontology is like a scientific text;
it is a representation of types in reality
Cognitive
representation
Ontology = a
representation of types
reality
Atomic representational
unit: a definition
 terms, icons, bar codes, alphanumeric
identifiers ... which
1. refer, or are intended to refer, to entities in
reality, and
2. are not built out of further subrepresentations
 Representational units are the atoms in
the domain of representations
Modular representational
unit: a definition
A representation which is built out of
other representational units, which
together form a structure that mirrors
a corresponding structure in reality
Periodic Table
The Periodic Table
Ontology: a definition
 A modular, representational
artifact whose representational
units are intended to represent
1. types in reality
2. the relations between these types
which are true universally (i.e. for all
instances)
 lung is_a anatomical structure
 lobe of lung part_of lung
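To make "relations which are true universally" concrete, here is a minimal Python sketch of how a program might follow such relations between types. The in-memory triple store, the helper name `ancestors`, and the extra term "upper lobe of lung" are assumptions made for illustration; they are not part of any OBO tool.

```python
# Hypothetical in-memory store of type-level relations (not an OBO API).
RELATIONS = {
    ("lung", "is_a", "anatomical structure"),
    ("lobe of lung", "part_of", "lung"),
    # Assumed extra term, added only so there is something to traverse:
    ("upper lobe of lung", "is_a", "lobe of lung"),
}


def ancestors(term, relation, relations=RELATIONS):
    """Return every type reachable from `term` by following `relation` transitively."""
    found = set()
    frontier = [term]
    while frontier:
        current = frontier.pop()
        for subject, rel, obj in relations:
            if subject == current and rel == relation and obj not in found:
                found.add(obj)
                frontier.append(obj)
    return found


print(ancestors("lung", "is_a"))
print(ancestors("lobe of lung", "part_of"))
```

Because the relations hold for all instances of a type, a traversal like this can be applied to any annotated instance without inspecting the instance itself.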
How to create an ontology
Part 1
The logic and science
In computer science, there is an
information handling problem
 Different groups of data-gatherers
develop their own idiosyncratic terms in
which they represent information.
 To put this information together, methods
must be found to resolve terminological
and conceptual incompatibilities.
 Again, and again, and again…
The Reality
Do not assume that data integration
can be brought about by somehow
‘mapping’ incompatible, low quality
ontologies built for different
purposes
Two flavors of ontology
1. Application ontology
2. Reference ontology
Application Ontology
An application ontology is
comparable to an engineering
artifact such as a software tool. It is
constructed for specific practical
purposes.
Reference Ontology
A reference ontology is analogous
to a scientific theory; it seeks to
optimize representational adequacy
to its subject matter
Assumptions
 There are best practices in ontology
development which should be followed
to create stable high-quality ontologies
 Shared high quality ontologies foster
cross-disciplinary and cross-domain re-use
of data, and create larger communities
Why do we need rules/standards
for good ontology?
 Ontologies must be intelligible both to humans
(for annotation) and to machines (for reasoning
and error-checking)
 Unintuitive rules lead to errors in classification
 Simple, intuitive rules facilitate training of
curators and annotators
 Common rules allow alignment with other
ontologies (and thus cross-domain exploitation
of data)
 Logically coherent rules enhance harvesting of
content through automatic reasoning systems
Ontologies built according to
common logically coherent rules
 Will make entry easier and yield a safer
growth path
 You can start small, annotating your data
with initial fragments of a well-founded
ontology, confident that the results will still
be usable when the ontology grows
larger and richer
The OBO Foundry
A subset of OBO ontologies whose developers
agree in advance to accept a common set of
principles designed to assure
 intelligibility to biologist curators, annotators, users
 formal robustness
 stability
 compatibility
 interoperability
 support for logic-based reasoning
The OBO Foundry
1. The ontology is open and available to be used by
all.
2. The developers of the ontology agree in
advance to collaborate with developers of other
OBO Foundry ontologies where domains overlap.
3. The ontology is in, or can be instantiated in, a
common formal language.
4. The ontology possesses a unique identifier space
within OBO.
5. The ontology provider has procedures for
identifying distinct successive versions.
The OBO Foundry
6. The ontology has a clearly specified and clearly
delineated content.
7. The ontology includes textual definitions for all
terms.
8. The ontology is well-documented.
9. The ontology has a plurality of independent
users.
10. The ontology uses relations which are
unambiguously defined following the pattern of
definitions laid down in the OBO Relation
Ontology.
Orthogonality
Ontology groups who choose to be
part of the OBO Foundry thereby
commit themselves to collaborating
to resolve disagreements which arise
where their respective domains
overlap
agreed on relations
 The success of ontology alignment demands
that ontological relations (is_a, part_of, ...)
have the same meanings in the different
ontologies to be aligned.
 See “Relations in Biomedical Ontologies”,
Genome Biology, May 2005: Barry Smith,
Werner Ceusters, Bert Klagges, Jacob
Köhler, Anand Kumar, Jane Lomax, Chris
Mungall, Fabian Neuhaus, Alan L Rector,
and Cornelius Rosse
Three fundamental dichotomies
continuants vs. occurrents
dependent vs. independent
types vs. instances
ONTOLOGIES ARE
REPRESENTATIVES OF TYPES
IN REALITY
For example in the GO
 Molecules, cell components, and organisms are
independent continuants which have functions
 Functions are dependent continuants which
become realized through special sorts of
processes we call functionings
 Processes are occurrents; they include functionings,
side-effects, and stochastic processes
Continuants (aka endurants)
have continuous existence in time
preserve their identity through
change
exist in toto whenever they exist at
all
Occurrents (aka processes)
have temporal parts
unfold themselves in successive
phases
exist only in their phases
Continuants vs. Occurrents
Anatomy vs. Physiology
Snapshot vs. Video
Stocks vs. Flows
Commodities vs. Services
Products vs. Processes
Dependent entities
require independent continuants as
their bearers
There is no grin without a cat
Dependent vs.
independent continuants
Independent continuants
(organisms, cells, molecules,
environments)
Dependent continuants (qualities,
shapes, roles, propensities, functions)
E.g. the acidity of this gut
All occurrents are dependent
entities
They are dependent on those
independent continuants which are
their participants (agents, patients,
media ...)
GO’s three ontologies
molecular function: dependent continuant
biological process: occurrent
cellular component: independent continuant
Level            Independent continuant   Function                            Functioning
molecular        molecule                 molecular function                  molecular process
cellular         cellular component       cellular function                   cellular process
organism-level   organism                 organism-level biological function  organism-level biological process
example          heart                    to pump blood                       pumping blood
UBO: Upper Biomedical Ontology
Continuant (3D): Independent Continuant; Dependent Continuant (Quality, Function); Spatial Region
Occurrent (4D): Functioning; Side-Effect, Stochastic Process, ...
instances (in space and time)
How to create an ontology
Part 2
The technical aspects
Separate the Database from
the Ontology
For extensibility
For generality
For reasonability
For interoperability
Why ontologies are worth it
 Minimize database maintenance costs
 Communication between researchers
 As well as
 Better query facilities
 Ability to draw inferences
 Detect correlations
 Facilitate computational interpretation of text
 And more…
Before: domain knowledge is
embedded in the db schema
Gene
table
Exon
table
RNA
table
Protein
table
Embedding domain knowledge
in the db schema is expensive
 The logical description and the physical
database description of the biology are commingled
 Therefore new biological knowledge will force:
 Schema changes: e.g. new tables
 Query changes: that explicitly refer to tables
 Middleware changes: to retrieve and format
 GUI changes: to display
After: domain knowledge is
embedded in the ontology
feature
table
Ontology driven db schema is
less expensive to maintain
 The logical description and the physical
database description of the biology are
developed independently
 Therefore new biological knowledge will
only require:
 Ontology changes: e.g. new terms
 GUI changes: display
 No schema changes
 No query changes
 No middleware changes
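The contrast between the "before" and "after" schemas can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the table and column names, the `EX:` identifiers, and the "pseudogene" term are all hypothetical, chosen to show one generic feature table typed by ontology terms.

```python
import sqlite3

# Hypothetical schema: one generic feature table, typed by an ontology term,
# instead of separate Gene / Exon / RNA / Protein tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE term (id TEXT PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE feature (
        uid     TEXT PRIMARY KEY,
        type_id TEXT NOT NULL REFERENCES term(id)
    );
""")
conn.executemany(
    "INSERT INTO term VALUES (?, ?)",
    [("EX:1", "gene"), ("EX:2", "exon"), ("EX:3", "RNA"), ("EX:4", "protein")],
)
conn.executemany(
    "INSERT INTO feature VALUES (?, ?)",
    [("f1", "EX:1"), ("f2", "EX:2")],
)

QUERY = (
    "SELECT f.uid FROM feature f JOIN term t ON t.id = f.type_id "
    "WHERE t.name = ? ORDER BY f.uid"
)


def features_of_type(name):
    """Return the UIDs of all features classified under the named term."""
    return [row[0] for row in conn.execute(QUERY, (name,))]


# New biological knowledge arrives as a new term (assumed here: pseudogene).
# Only data rows are added; the schema and the query stay as they were.
conn.execute("INSERT INTO term VALUES (?, ?)", ("EX:5", "pseudogene"))
conn.execute("INSERT INTO feature VALUES (?, ?)", ("f3", "EX:5"))

print(features_of_type("gene"))
print(features_of_type("pseudogene"))
```

The same query string serves both the original terms and the new one, which is the sense in which the schema, queries, and middleware are left untouched.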
Reality:
ultimately
this is what
the ontology
must reflect
Step 1:
Build an ontology
that reflects reality
Observations
and
experimentation
Step 2: Data capture
Database:
UIDs serving
as proxies for
instances
Patient IDs
Sequence accessions
Genetic strain IDs
…
Step 3:
Classify data
using the
ontology
Ontology is a contract
between researchers
 A common language that allows us to share
knowledge
 Researchers have a stake in it:
 Every individual will benefit by being able to
accurately interpret someone else’s data
 No more pre-scrubbing
 No more time spent translating
Rules on types
 Don’t confuse types with instances
 Don’t confuse instances with leaf nodes
 Don’t confuse types with ideas
 Don’t confuse types with ways of getting
to know types
 Don’t confuse types with ways of talking
about types
 Don’t confuse types with data about
types
An astronomy ontology
should not include
'Buzz Aldrin'
Rules on terms
Terms should be in the singular
Avoid abbreviations even when it is
clear in context what they mean
(‘breast’ for ‘breast tumor’)
Think of each term ‘A’ in an
ontology as shorthand for a term of
the form:
‘the type A’
Rules on Definitions
 The terms used in a definition should be
simpler (more intelligible) than the term to
be defined; otherwise the definition
provides no assistance
 Definitions should be intelligible to both
machines and humans
 to human understanding
 Humans need clarity and modularity
 to machine processing
 Machines can cope with the full formal
representation
Confusing non-useful
definitions
 Swimming
 Swimming is healthy and has 8 letters
 Poland
 The name of Poland
When defining terms use
Aristotelian definitions
The definition of ‘A’ takes the form:
an A =def. is a B which ...
where B is A’s parent in the hierarchy
A human being =def. an animal which is
rational
A helicase =def. an enzyme which
catalyzes the hydrolysis of ATP to unwind
the DNA helix
Use of Aristotelian
definitions
 Makes defining terms easier
 Each definition encapsulates in modular
form the entire parentage of the defined
term
 The entire information content of the
FMA’s term hierarchy and definitions can
be translated very cleanly into a
computer representation
 Now accepted by GO
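The claim that each Aristotelian definition encapsulates the parentage of its term can be sketched in a few lines of Python. The dictionary layout and function names are illustrative assumptions, and the entry for "enzyme" (with parent "protein") is an invented example that does not come from the slides.

```python
# Each entry maps a term A to (parent B, differentia), as in "an A =def. a B which ...".
DEFINITIONS = {
    "human being": ("animal", "is rational"),
    "helicase": (
        "enzyme",
        "catalyzes the hydrolysis of ATP to unwind the DNA helix",
    ),
    # Assumed entry, added only to show a longer chain of parents:
    "enzyme": ("protein", "catalyzes a chemical reaction"),
}


def article(noun):
    """Pick 'a' or 'an' from the first letter (a rough heuristic)."""
    return "an" if noun[0].lower() in "aeiou" else "a"


def define(term):
    """Render the definition in the form 'an A =def. a B which ...'."""
    parent, differentia = DEFINITIONS[term]
    return f"{article(term)} {term} =def. {article(parent)} {parent} which {differentia}"


def parentage(term):
    """Walk up the hierarchy by reading the parent out of each definition."""
    chain = []
    while term in DEFINITIONS:
        term = DEFINITIONS[term][0]
        chain.append(term)
    return chain


print(define("helicase"))
print(parentage("helicase"))
```

Since the parent is part of every definition, the hierarchy can be rebuilt from the definitions alone.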
Summary: How to build your
ontology
 Keep It Simple:
 lowest possible barrier to entry
 Technology independence
 Clarity of definitions and scope
 “With new data, we change our minds”
 An ontology must adapt to reflect current
understanding of reality
 Mechanisms to support change
How to build an ontology
Part 3
The sociology and
organizational aspects
Elements for Success: GO
 A Community with a common vision
 A pool of talented and motivated
developers/scientists
 A mix of academic and commercial
 An organized, light weight approach to
product development
 A leadership structure
 Communication
 A well-defined scope, (our “business”)
Adopted from “Open Source Menu for Success”
Gene Ontology
Community annotation production
“Extreme Programming” techniques
to distributed ontology generation
Revision control
Nightly conflict resolution
Users are integral to the team
Rapid iterations
Why
[Flowchart: Survey; decision points (yes/no): Domain covered? Public? Active? Applied? Community?; outcomes: Salvage, Develop, Improve, Collaborate & Learn]
Is your domain covered?
Due diligence & background
research
Step 1: Learn what is out there
The most comprehensive list is on the
OBO site. http://obo.sourceforge.net
Assess ontologies critically and
realistically.
Is it privately held?
Ontologies must be shared
 Communities form scientific theories
 that seek to explain all of the existing evidence
 and can be used for prediction
 These communities are all directed to the same
biological reality, but have their own perspective
 The computable representation must be shared
 Ontology development is inherently
collaborative
 Open ontologies become connected to
instance data & this feeds back on ontology
development
Is it active?
Pragmatic assessment of an
ontology
Is there access to help, e.g.:
[email protected] ?
Does a warm body answer help mail
within a ‘reasonable’ time—say 2
working days ?
Is it applied?
Toy ontologies are not useful
 Every ontology improves when it is applied to
actual instances of data
 It improves even more when these data are
used to answer research questions
 There will be fewer problems in the ontology and
more commitment to fixing remaining problems
when important research data is involved that
scientists depend upon
 Be very wary of ontologies that have never been
applied
Work with that community
To improve (if you found one)
To develop (if you did not)
How?
Improve
Collaborate
and Learn
Community vs. Committee
?
 Many people have (understandably)
reservations about collaborative
development, because it can easily be
confused with design-by-committee
projects.
 Members of a committee represent
themselves.
 Members of a community represent their
community.
Design for purpose - not in
abstract
Who will use it?
If no one is interested, then go back to
bed
What will they use it for?
Define the domain
Who will maintain it?
Be pragmatic and modest
Start with a concrete
proposal —not a blank slate.
 But do not commit your ego to it.
 Distribute to a small group you respect:
 With a shared commitment.
 With broad domain knowledge.
 Who will engage in vigorous debate without
engaging their egos (or, at least not too
much).
 Who will do concrete work.
Step 1:
 Alpha0: the first proposal - broad in breadth but
shallow in depth. By one person with broad
domain knowledge.
 Distribute to a small group (<6).
 Get together for two days and engage in vigorous
discussion. Be open and frank. Argue, but do not be
dogmatic.
 Reiterate over a period of months. Do as much
as possible face-to-face, rather than by
phone/email. Meet for 2 days every 3 months or
so.
Step 2:
Distribute Alpha1 to your group.
All now test this Alpha1 in real life
By classifying representations of
instances with these types
Do not worry if (at this stage) you
do not have tools.
Step 3:
Reconvene as a group for two days.
Share experiences from
implementation:
Can your Alpha1 be implemented in a
useful way ?
What are the conceptual problems ?
What are the structural problems ?
Step 4:
 Establish a mechanism for change.
 Use CVS or Subversion.
 Limit the number of editors with write
permission (ideally to one person).
 Unique stable identifiers
 History tracking of changes
 Release a Beta1.
 Seriously implement Beta1 in real life.
 Build the ontology in depth.
Step 5:
After about 6 months reconvene
and evaluate.
Is the ontology suited to its purpose ?
Is it, in practice, usable ?
Are we happy about its broad
structure and content ?
Step 6:
Go public.
Release ontology to community.
Release the products of the
annotations.
Invite broad community input and
establish a mechanism for this (e.g.
SourceForge).
Step 7:
Proselytize
Publish in a high profile journal
Engage new user groups
Emphasize openness
Write a grant
Step 8:
 Iterate
 Improvements come
in two forms
 It is impossible to get it
right the 1st (or 2nd, or
3rd, …) time.
 What we know about
reality is continually
growing
Improve
Collaborate
and Learn
Step 9:
Bon appetit
Use the power of
combination and collaboration
 as far as possible don’t reinvent
 ontologies are like telephones: they are
valuable only to the degree that they are
used and networked with other
ontologies
 but choose working telephones
 most telephones were broken when the
technology was first being developed
The National Center for
Biomedical Ontology
Stanford – Berkeley
Mayo – Victoria – Buffalo
UCSF – Oregon – Cambridge
Bioinformatics and Computational Biology
National Centers for
Biomedical Computing—2005
1. National Center for Integrative
Biomedical Informatics (Michigan)
2. National Center for Multi-Scale
Study of Cellular Networks
(Columbia)
3. National Center for Biomedical
Ontology
 Stanford: Tools for ontology alignment,
indexing, and management (Cores 1, 4–7:
Mark Musen)
 Lawrence–Berkeley Labs: Tools to use
ontologies for data annotation (Cores 2, 5–7:
Suzanna Lewis)
 Mayo Clinic: Tools for access to large
controlled terminologies (Core 1: Chris Chute)
 Victoria: Tools for ontology and data
visualization (Cores 1 and 2: Margaret-Anne
Storey)
 University at Buffalo: Dissemination of best
practices for ontology engineering (Core 6:
Barry Smith)
Driving Biological Projects
Trial Bank: UCSF, Ida Sim
Flybase: Cambridge, Michael
Ashburner
ZFIN: Oregon, Monte Westerfield
Case study: Phenotypes,
our current work on OBD
Animal disease models
Animal models
Mutant Gene
Mutant or
missing Protein
Mutant Phenotype
Animal disease models
Humans
Mutant Gene
Animal models
Mutant Gene
Mutant or
missing Protein
Mutant or
missing Protein
Mutant Phenotype
(disease)
Mutant Phenotype
(disease model)
SHH-/+
SHH-/-
shh-/+
shh-/-
Phenotype (clinical sign) = entity + quality
P1 = eye + hypoteloric
P2 = midface + hypoplastic
P3 = kidney + hypertrophied
ZFIN (entity): eye, midface, kidney
PATO (quality): hypoteloric, hypoplastic, hypertrophied
Phenotype (clinical sign) = entity + quality
entity: Anatomical ontology; Cell & tissue ontology; Developmental ontology; Gene ontology (biological process, cellular component)
quality: PATO (phenotype and trait ontology)
Phenotype (clinical sign) = entity + quality
P1 = eye + hypoteloric
P2 = midface + hypoplastic
P3 = kidney + hypertrophied
Syndrome (disease) = P1 + P2 + P3 = holoprosencephaly
Human holoprosencephaly
Zebrafish
shh
Zebrafish
oep
OMIM
gene
ZFIN
gene
LAMB1
lamb1
FECH
FlyBase
gene
FlyBase
mut pub
ZFIN
mut pub
LanB1
5
15
fech
Ferrochelatase
2
5
GLI2
gli2a
ci
388
SLC4A1
slc4a1
CG8177
MYO7A
myo7a
ALAS2
mouse
rat
SNO
MED
OMIM disease
39
-
29
Protoporphyria, Erythropoietic
41
22
-
7
7
19
Renal Tubular Acidosis, RTADR
ck
84
5
16
Deafness; DFNB2; DFNA11
alas2
Alas
1
7
14
Anemia, Sideroblastic, X-Linked
KCNH2 kcnh2 sei 27 3 12 -
MYH6 myh6 Mhc 166 3 1 12 Cardiomyopathy, Familial Hypertrophic; CMH
TP53 tp53 p53 64 3 3 19 11 Breast Cancer
ATP2A1 atp2a1 Ca-P60A 32 6 1 11 Brody Myopathy
EYA1 eya1 eya 251 5 4 6 Branchiootorenal Dysplasia
SOX10 sox10 Sox100B 1 17 4 4 Waardenburg-Shah Syndrome
2
9
3
National Center for Biomedical Ontology
Capture and index experimental results
Open
Biomedical
Ontologies
(OBO)
Revise
biomedical
understanding
Open
Biomedical
Data (OBD)
BioPortal
Relate
experimental
data to results
from other
sources
Phenotype as an
observation
context
The class of
thing observed
environment
evidence
publication
figure
assay
genetic
sequence ID
ontology
Review of proposed EAV EQ
model
A phenotype is described using an
Entity-Quality double
Entities are drawn from various OBO
ontologies—cell, anatomies, GO, …
Qualities are drawn from one
ontology—PATO
Separation of concerns
Not phenotypes:
Genotype
Environment
Assay, measurement systems
Images
schema:
Association = Genotype Phenotype Environment Assay
Phenotype = Entity Quality
Entity = OBOClassID
Quality = PATOClassID
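The schema above can be read as a pair of record types. The following Python sketch is one possible rendering, not the project's actual data model: the class and function names are assumptions, and plain labels such as "gut" stand in for the OBO and PATO class IDs that real records would carry.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Phenotype:
    entity: str   # Entity = OBOClassID (a label is used here as a stand-in)
    quality: str  # Quality = PATOClassID (a label is used here as a stand-in)


@dataclass(frozen=True)
class Association:
    genotype: str
    phenotype: Phenotype
    environment: Optional[str] = None  # kept outside the phenotype
    assay: Optional[str] = None        # kept outside the phenotype


# Two records taken from the example data in the slides.
RECORDS = [
    Association("npo", Phenotype("gut", "dysplastic")),
    Association("r210", Phenotype("retina", "irregular")),
]


def genotypes_with(entity: str, records=RECORDS) -> List[str]:
    """Return the genotypes annotated with any phenotype of the given entity."""
    return sorted({a.genotype for a in records if a.phenotype.entity == entity})


print(genotypes_with("gut"))
```

Keeping genotype, environment, and assay on the association rather than on the phenotype mirrors the separation of concerns listed above.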
2003 Pilot study
 Trial of EAV model on small collection of
genotypes
 FlyBase
 ZFIN
 Genes were non-orthologous
 New curations - in progress
 orthologous genes with clinical relevance
 Use the same data model and exchange format?
Example data records
Genotype   Entity   Value
npo        gut      dysplastic
           gut      small
r210       retina   irregular
           brain    fused
ZFIN schema extension: stages
Genotype   Stage                     Entity   Quality
npo        Hatching:Pec-fin          gut      dysplastic
           Hatching:Pec-fin          gut      small
r210       Hatching:Long-pec         retina   irregular
           Larval:Protruding-mouth   brain    fused
Stages
Association = Genotype Phenotype Environment Assay
Phenotype = Stage* Entity Quality
Entity = OBOClassID
Stage = OBOAnatomicalStageClassID
Quality = PATOClassID
* means zero or more
Monadic and relational
qualities
 Monadic:
 the quality inheres in a single entity
 Relational:
 the quality inheres in two or more entities
 sensitivity of an organism to a kind of drug
 sensitivity of an eye to a wavelength of light
 can turn relational qualities into cross-product monadic
qualities
 e.g. sensitivityToRedLight
 better to use relational qualities
 avoids redundancy with existing ontologies
Incorporating relational
qualities
Association = Genotype Phenotype Environment Assay
Phenotype = Stage* Entity Quality Entity*
Entity = OBOClassID
Quality = PATOVersion2ClassID
Example data record:
Phenotype = “organism” sensitiveTo “puromycin”
Measurable qualities
 Some qualities are inexact and implicitly relative to a
wild-type or normal quality
 relatively short, relatively long, relatively reduced
 easier than explicitly representing:
 this tail length shorter-than ‘mouse’ wild-type tail length
 Some qualities are determinable
 use a measure function
 unit, value, {time}
 this tail has length L
 measure(L, cm) = 2
 Keep measurements separate from (but linked to)
quality ontology
Incorporating measurements
Association = Genotype Phenotype Environment Assay
Phenotype = Stage* Entity Quality Entity* Measurement*
Measurement = Unit Value (Time)
Entity = OBOClassID
Quality = PATOVersion2ClassID
Example data record:
Phenotype = “gut” “acidic” Measurement = “pH” 5
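As a rough illustration of the extended schema, the sketch below adds the optional Stage, related Entity, and Measurement slots to a phenotype record and rebuilds the two example records from the slides. The field names and the `is_relational` helper are assumptions for illustration; labels are used in place of class IDs.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Measurement:
    unit: str
    value: float
    time: Optional[str] = None  # "(Time)" is optional in the schema


@dataclass(frozen=True)
class Phenotype:
    entity: str
    quality: str
    stages: Tuple[str, ...] = ()            # Stage*: zero or more
    related_entities: Tuple[str, ...] = ()  # Entity*: zero or more
    measurements: Tuple[Measurement, ...] = ()  # Measurement*: zero or more


def is_relational(phenotype: Phenotype) -> bool:
    """A quality is used relationally when it points at further entities."""
    return len(phenotype.related_entities) > 0


# Phenotype = "organism" sensitiveTo "puromycin"
sensitive = Phenotype("organism", "sensitiveTo", related_entities=("puromycin",))

# Phenotype = "gut" "acidic" Measurement = "pH" 5
gut_acidic = Phenotype("gut", "acidic", measurements=(Measurement("pH", 5),))

print(is_relational(sensitive), is_relational(gut_acidic))
```

The measurement lives in its own record linked to the quality, which keeps numeric values out of the quality ontology itself.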
The Methodology of Annotations
 Scientific curators use experimental
observations reported in the biomedical
literature to link gene products with GO
terms in annotations.
 The gene annotations taken together
yield a slowly growing computer-interpretable map of biological reality.
 The process of annotating literature also
leads to improvements and extensions of
the ontology itself, which institutes a
virtuous cycle of improvement in the
quality and reach of future annotations
and of future versions of the ontology.
When we annotate the
record of an experiment
we use terms representing types to
capture what we learn about the
instances
this experiment as a whole (a process)
these instances experimented upon
the instances are typical
they are representatives of a type
Ontology
A thing of beauty is a joy forever
With acknowledgement and thanks to: Mark
Musen, Barry Smith, Sima Misra, Chris Mungall,
Daniel Rubin, and David Hill
Interpretation of the schema
How should EAV data records be
interpreted by a computer?
What are the instances?
Is the EAV schema just to improve
database searching?
Can it be used for meaningful cross-species comparisons?
What is the Entity slot used for?
Genotype      Entity                        Quality
npo           gut                           dysplastic
              gut                           small
r210          retina                        irregular
              brain                         fused
tm84          d/v pattern formation         abnormal
              blood islands                 number increased
Bsb[2]        elongation of arista literal  arrested
C-alpha[1D]   adult behaviour               uncoordinated
2003 trial data: FB & ZFIN
What is the Entity slot used for?
 In practical terms:
 An ID from one of the following ontologies
 GO CC, BP, and MF
 Species-specific anatomical ontology
 OBO Cell
 Or a cross-product
 e.g. acidification GO:0045851
 which has_location (OBOREL) midgut FBbt:00005383
 [example from FBal0062296: Acidification in the
midgut of homozygous larvae is often less than in
wild-type larvae]
 But what does it mean in the context of
an annotation?
Universals and particulars
 An ontology consists of universals (classes)
 Fruitfly, wing, flight
 Experimental data generally concerns
particulars (instances) that instantiate
universals
 this particular wing of this particular fruitfly
 this particular fruitfly participating in this
particular flight from here to there
 In annotation we often use a class ID as a
proxy for an (unnamed) instance (or
collection of instances)
 It is important to always keep this
distinction in mind
What is the Quality slot used
for?
Genotype      Entity                        Quality               Quality
npo           gut                           structure             dysplastic
              gut                           relative size         small
r210          retina                        pattern               irregular
              brain                         structure             fused
tm84          d/v pattern formation         qualitative           abnormal
              blood islands                 relative number       number increased
Bsb[2]        elongation of arista literal  process               arrested
C-alpha[1D]   adult behaviour               behavioral activity   uncoordinated
2003 trial data: FB & ZFIN
Qualities
 Treat as Qualities
 a dependent entity
 a quality must have independent entity(s) as bearer
 the quality inheres_in the bearer
 Examples
 The particular shape of this ball
 The particular structure of this wing
 The particular length of this tail
 The particular rate of synaptic transmission
between these two neurons
Attribute universals vs
Attribute particulars
 In an EAV annotation, a PATO class ID typically serves as a proxy for
an unnamed attribute instance
 Universals (classes) must always be defined in terms of their instances
A Formal Theory of Substances, Qualities, and Universals
Fabian NEUHAUS Pierre GRENON Barry SMITH
How is the attribute slot used?
Genotype      Entity                        Attribute             Attribute
npo           gut                           structure             dysplastic
              gut                           relative size         small
r210          retina                        pattern               irregular
              brain                         structure             fused
tm84          d/v pattern formation         qualitative           abnormal
              blood islands                 relative number       number increased
Bsb[2]        elongation of arista literal  process               arrested
C-alpha[1D]   adult behaviour               behavioral activity   uncoordinated
Current Model
Association = Genotype Phenotype Environment Assay
Phenotype = Stage* Entity Attribute
Entity = OBOClassID
Stage = OBOAnatomicalStageClassID
Attribute = PATOClassID
* means zero or more
Composite phenotype classes
 Mammalian phenotype has composite
phenotype classes
 e.g. “reduced B cell number”
 Compose at annotation time or ontology
curation time?
 False dichotomy
 Core 2 will help map between composite
class based annotation and EAV
annotation
Interpreting annotations
Annotations are data records
typically use class IDs
implicitly refer to instances
How do we map an annotation to
instances?
Important for using annotations
computationally
Interpreting annotations (I)
 What does an EA (or EAV) annotation mean?
 Annotation:
 Genotype=“FBal00123” E=“brain” A=“fused”
 presumed implied meaning:
 this organism
 has_part x, where
 x instance_of “brain”
 x has_quality “fused”
 or in natural language:
 “this organism has a fused brain”
 Various built-in assumptions
Interpreting annotations (II)
 What does this mean:
 annotation:
 Genotype=“FBal00123” E=“wing” A=“absent”
 using same mapping as annotation I:
 fly98 has_part x, where
 x instance_of “wing”
 x has_quality “absent”
 or in natural language:
 this fly has a wing which is not there!
 What we really intend:
 NOT(this organism has_part x, where x
instance_of “wing”)
Interpreting annotations (II)
 What does this mean:
 annotation:
 Genotype=“FBal00123” E=“wing” A=“absent”
 using same mapping as annotation I:
 this organism has_part x, where
 x instance_of “wing”
 x has_quality “absent”
 or in natural language:
 this fly has a wing which is not there!
 What we really intend:
 this organism has_quality “wingless”
 “wingless” = the property of having count(has_part “wing”)=0
Are our computational
representations intended to
capture reality?
Does this matter?
 If we simply use the colloquial expression
“absent”
 What are the consequences?
 Basic search will be fine
 e.g. “find all wing phenotypes”
 Logical reasoners will compute incorrect results
 Computers will not be able to reason correctly
 We must explicitly provide specific rules for certain
attributes such as “absent”
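The difference between the naive mapping and an explicit rule for “absent” can be shown with a toy model. In this Python sketch an organism is reduced to a list of part instances; the data layout and both function names are assumptions made purely for illustration.

```python
def holds_naive(parts, entity, attribute):
    """Naive mapping: some part is an instance of `entity` and bears `attribute`."""
    return any(p["type"] == entity and attribute in p["qualities"] for p in parts)


def holds(parts, entity, attribute):
    """Same mapping, plus an explicit rule for the attribute "absent"."""
    if attribute == "absent":
        # NOT(this organism has_part x, where x instance_of entity)
        return not any(p["type"] == entity for p in parts)
    return holds_naive(parts, entity, attribute)


# A toy fly with a fused brain and no wing instances at all.
wingless_fly = [{"type": "brain", "qualities": {"fused"}}]

print(holds_naive(wingless_fly, "wing", "absent"))  # looks for a wing that is "absent"
print(holds(wingless_fly, "wing", "absent"))        # checks that no wing exists
print(holds(wingless_fly, "brain", "fused"))
```

Under the naive mapping the annotation can never be satisfied by a fly that truly lacks wings, which is why a reasoner needs the attribute-specific rule.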
Interpreting annotations (III)
 What does this mean:
 annotation:
 E=“digit” A=“supernumerary”
 using same interpretation as annotation I:
 this organism has_part x, where
 x instance_of “digit”
 x has_quality “supernumerary”
 or in natural language:
 this organism has a particular finger which is supernumerary!!
 What we really intend:
 this person has_quality “supernumerary finger”
 “supernumerary finger” = the property of
having count(has_part “digit”) > wild-type
Interpreting annotations (IV)
 What does this mean:
 annotation:
Gt=“mp001” E=“brown fat cell”
A=“increased quantity”
 using same mapping as annotation I:
 this organism has_part x, where
 x instance_of “brown fat cell”
 x has_quality “increased quantity”
 or in natural language:
 this organism has a particular brown fat cell which is
increased in quantity
 What we really intend:
 this organism has_part
population_of(“brown fat cell”) which
has_quality increased size
Other use cases
Spermatocyte devoid of asters
Homeotic transformations
Increased distance between wing
veins
Some vs. all
Alternate perspectives
 process vs. state
 regulatory processes:
 acidification of midgut has_quality reduced rate
 midgut has_quality low acidity
 development vs. behavior
 wing development has_quality abnormal
 flight has_quality intermittent
 granularity (scale)
 chemical vs. molecular vs. cell vs. tissue vs. anatomical part