NCBIO-Berkeley - bioontology.org
Core 2: Bioinformatics
NCBO-Berkeley
Berkeley Drosophila Genome
Project
Finish the sequence of the euchromatic
genome of Drosophila melanogaster
Annotated biologically important features of this
sequence
Produced gene disruptions using P element-mediated mutagenesis
Full length sequencing and expression
characterization of a cDNA for every gene
Developing informatics tools
Who is here from NCBO-Berkeley
Chris
Shu
Mark
Sima
Chris
GadFly database
schema
GO database schema
Chado database
schema
Perl libraries for all
OBD data architect
Shu
AmiGO, ImaGO &
database
OBD dev & Data
flow
Compute Pipeline
Mark
Apollo Genome
Annotation Editor
Phenote and other
OBD interfaces
Sima
Adh region
annotation
Annotation of entire
Drosophila Genome
Project manager
and coordinator
nonpareil
Associate Director
OBD Outline
Core 2 aims, refresher
Data models for OBD
phenotypes
clinical trials
others
Modeling frameworks
exchange formats
database system
SQL based vs ‘SemWeb’ dbs
Progress
Demo
Core 2 Specific Aims
1. Apply ontologies
Software toolkit for describing and classifying
data
2. Capture, manage, and view data
annotations
Database (OBD) and interfaces to store and view
annotations
3. Investigate and compare implications
Linking human diseases to model systems
4. Maintain
Ongoing reconciliation of ontologies with
annotations
Core 3 Driving Biological
Projects
DBPs
phenotypes: Fly and Zebrafish to human
clinical trials
Core 2 Aims
1. Apply ontologies to describe data
2. Capture, manage, and view data annotations
3. Link disease genes to model systems
4. Reconcile annotation and ontology changes
Apply ontologies to describe
data
Requirements
Data capture tools
phenote
demo tomorrow
no tool requirements from UCSF
Data model
Database (OBD)
--aim 2
data
flow
user’s
view
Data models
Common/shared domain specific models
Aim 3
linking disease genes
model must support this
granularity
comparability
Domain specific data models
FB, ZFIN
genotype to phenotype
‘EAV’
qualities inhere in entities
orthologs
phenotype to disease
core 2 will help define common model
UCSF
clinical trials
existing ontology-friendly schema - trialbank
Phenotype data model
Qualities inhere in entities
Entity term; PATO term
brain FBbt:00005095; fused PATO:0000642
gut MA:0000917; dysplastic PATO:0000640
tail fin ZDB:020702-16; ventralized PATO:0000636
kidney ZDB:020702-16; hypertrophied PATO:0000636
midface ZDB:020702-16; hypoplastic PATO:0000636
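A minimal Python sketch of the "qualities inhere in entities" record, using the first two entity/quality rows above. The class name `Phenotype` and its field names are assumptions made for illustration; they are not part of any OBD schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phenotype:
    """One 'quality inheres in entity' statement (hypothetical structure)."""
    entity_id: str      # anatomy ontology term ID
    entity_label: str
    quality_id: str     # PATO term ID
    quality_label: str

    def describe(self) -> str:
        # Render the pair the way the slide reads it: quality, then entity.
        return (f"{self.quality_label} {self.entity_label} "
                f"({self.quality_id} inheres_in {self.entity_id})")


# Rows copied from the slide.
fused_brain = Phenotype("FBbt:00005095", "brain", "PATO:0000642", "fused")
dysplastic_gut = Phenotype("MA:0000917", "gut", "PATO:0000640", "dysplastic")

print(fused_brain.describe())
print(dysplastic_gut.describe())
```

Keeping the entity and the quality as separate ontology IDs is what lets the same PATO term be reused across species-specific anatomy ontologies.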
Pre-composed phenotype terms
Mammalian Phenotype Ontology
“increased activated B-cell number” MPO:0000319
“pink fur hue” MPO:0000374
Extensions to simple model
What about
Relational attributes
Quantitative vs qualitative
Post-composing entity and attribute terms
Relative states/values
Variation in place, space and time
A better treatment of absence
See CSHL Pheno meeting talk
also, more detailed formal presentation (available)
Not to mention genotypes, environments,
provenance, etc
Modeling clinical trials
Model already described using frame-based schema
Further modeling required?
abstraction
to integrate more with other OBD datatypes
views
to only show parts relevant to OBD/BioPortal
Future DBPs and use cases
OBD will contain a variety of general
types of data
Modeling is expensive
use existing models where appropriate
but whole must be cohesive and integrated
Most of this talk focuses on the pheno
DBPs for illustrative purposes
Modeling frameworks
language
technology
Modeling data: underlying
formalism
Model is expressed with modeling language
Options
Relational/SQL
Semi-structured, XML
Object-centric (UML, frame-based?)
Logic based
description logic: e.g. OWL
first-order logic: e.g. CL
Natural language descriptions
Model should be independent of language it
is expressed in
Data exchange language:
XML
Simple
XML is suited for data exchange
XML can drive software spec
constrains programmatic data model
XSD can generate UML
closed world assumption is useful
cf Ruttenberg et al
Mature technology
well understood by developers, MODs
standards
How OBD uses XML
obd-geno-pheno-xml (aka pheno-xml)
actually multiple modular components
genotype schema
phenotype schema: ‘EAV’
environment schema
provenance schema
used as
exchange format
cf: gene ontology association files
no need for ClinicalTrials-XML
Example pheno-xml
<genotype id="ZFIN:tm84">
  <name>ZFIN:tm84</name>
  <genotype_phenotype_association>
    <phenotype>
      <entity type="ZDB-ANAT-010921-528">
        <quality type="PATO:……">
          <state type="PATO:0000636">
            <time_range type="ZDB-STAGE-010723-12"/>
          </state>
        </quality>
      </entity>
    </phenotype>
  </genotype_phenotype_association>
</genotype>
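As a rough illustration of pheno-xml used as an exchange format, the sketch below reads the example with Python's standard `xml.etree.ElementTree` and flattens it into one row per entity/state pair. Several things here are assumptions: the function name `extract_annotations`, the row layout, the closing `</genotype>` tag (added so the fragment parses), and the string `PATO:...`, which stands in for the quality ID elided on the slide.

```python
import xml.etree.ElementTree as ET

# Copy of the slide's example. "PATO:..." is a stand-in for the elided ID,
# and the closing </genotype> tag is assumed.
PHENO_XML = """
<genotype id="ZFIN:tm84">
  <name>ZFIN:tm84</name>
  <genotype_phenotype_association>
    <phenotype>
      <entity type="ZDB-ANAT-010921-528">
        <quality type="PATO:...">
          <state type="PATO:0000636">
            <time_range type="ZDB-STAGE-010723-12"/>
          </state>
        </quality>
      </entity>
    </phenotype>
  </genotype_phenotype_association>
</genotype>
"""


def extract_annotations(xml_text):
    """Flatten one genotype document into a list of annotation dicts."""
    genotype = ET.fromstring(xml_text.strip())
    rows = []
    for entity in genotype.iter("entity"):
        for state in entity.iter("state"):
            time_range = state.find("time_range")
            rows.append({
                "genotype": genotype.get("id"),
                "entity": entity.get("type"),
                "state": state.get("type"),
                "stage": time_range.get("type") if time_range is not None else None,
            })
    return rows


ROWS = extract_annotations(PHENO_XML)
print(ROWS)
```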
SQL Databases
Data storage, management and
querying
all MODs use SQL dbs
Lots of advantages
scalable, standard QL, mature, APIs, etc
pure relational model is reasonably formal
XML/SQL more or less compatible
low impedance mismatch
Schemas for geno-pheno data
We already have schema: Chado
Used by many MODs (eg FB)
others are ‘chado compliant’ (eg ZFIN)
Modular
ontologies
genomic
genotype
phenotype
phylogenies
…etc
Phenotype module needs updating
will be driven by pheno-xml
Problem solved?
We have two mature, complementary
technologies, and can define schemas
for our model in an appropriate
formalism for each
Is this enough to work with?
Issues
OBD will be much more than geno-pheno
clinical trials
future DBPs, other NCBCs
any data expressed in an ontology language
Software and schema development
expensive
fragility in face of schema evolution
development gets bogged down in data exchange
issues
Major issue
SQL and XPath work great for
‘traditional’ data…
…but are too low level for ontology-centric data
lack of inference
no way to directly express ontology
constraints
Use cases from previous
experience: AmiGO
GO
“find all TF genes” (is_a closure)
“find all gene products localised to endoplasmic
reticulum” (part_of closure, over is_a)
Our solution (AmiGO & go-sqldb)
pre-compute transitive closure over all relations in
db
(sort of) works for GO (for now)
refresh problem
explosive for tangled DAGs
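The pre-computed closure idea can be sketched in a few lines of Python. This is a toy under stated assumptions, not the go-sqldb implementation: the term labels, gene names, and the helpers `build_closure` and `genes_for` are all made up for illustration, and only the is_a relation is handled.

```python
from collections import defaultdict

# Toy is_a edges (child, parent); labels are illustrative, not real GO IDs.
IS_A = [
    ("zinc finger TF activity", "TF activity"),
    ("TF activity", "molecular function"),
    ("kinase activity", "molecular function"),
]

# Toy direct annotations: gene -> most specific term.
ANNOTATIONS = {
    "geneA": "zinc finger TF activity",
    "geneB": "TF activity",
    "geneC": "kinase activity",
}


def build_closure(edges):
    """Pre-compute all (term, ancestor) pairs, including (term, term)."""
    parents = defaultdict(set)
    terms = set()
    for child, parent in edges:
        parents[child].add(parent)
        terms.update((child, parent))
    closure = set()
    for term in terms:
        stack, seen = [term], {term}
        while stack:
            for parent in parents[stack.pop()]:
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        closure.update((term, ancestor) for ancestor in seen)
    return closure


CLOSURE = build_closure(IS_A)


def genes_for(term):
    """'Find all TF genes': genes annotated to the term or any descendant."""
    return {gene for gene, t in ANNOTATIONS.items() if (t, term) in CLOSURE}


print(sorted(genes_for("TF activity")))
print(len(IS_A), "edges ->", len(CLOSURE), "closure pairs")
```

Even in this toy the closure holds more pairs than the ontology has edges, and the whole set has to be rebuilt whenever the ontology changes, which is the refresh problem noted above.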
OBD requires more
ontological awareness
Other relations
ontogenic (eg derives_from)
transitive_over
Other types of data
Pre- versus post- composed terms
E.g. MPO versus AO+PATO
E.g. Entity+Spatial qualifier
queries over either should be interchangeable
Solution: more expressive
formalisms
QLs and APIs should provide and
abstract away common ontology
operations
ease of programming, optimisation
Choices
‘Semweb’ databases
RDF + RDFS + OWL [ Lite + DL ] + extra
lots to choose from, emerging standards
compatible with Obo v1.2 spec
Deductive databases
superset of relational databases
Modeling phenotypes as
RDF/OWL or Obo instances
classes/
terms
instances
entity
quality
Example query in SeRQL
find mutations affecting the shape of the wing vein:
SELECT DISTINCT
EI, ET, OrgI, QI, QT, QN
FROM {EI} rdf:type {ET} rdfs:label {EN},
{EI} OBO_REL_part_of {OrgI} rdf:type {Tax} rdfs:label {TaxN},
{EI} OBO_REL_has_quality {QI} rdf:type {QT} rdfs:label {QN}
WHERE
label(EN) = "wing vein" AND
label(TaxN) = "Arthropoda" AND
label(QN) = "ShapeValue"
results of query on OBD-sesame:
one annotation to “wing vein L2”, “branched”
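To make the shape of the query concrete, here is a small Python sketch of conjunctive triple-pattern matching over an in-memory list of triples. It is not Sesame or SeRQL: the instance names (`ent1`, `fly1`, `q1`, ...), the class names, and the helpers `match` and `query` are invented for illustration. The toy also asserts the type triples directly, whereas a real store would infer that a "wing vein L2" instance is also a "wing vein".

```python
# Toy triples; all instance and class names are hypothetical.
TRIPLES = [
    ("ent1", "rdf:type", "WingVein"),
    ("WingVein", "rdfs:label", "wing vein"),
    ("ent1", "part_of", "fly1"),
    ("fly1", "rdf:type", "Arthropoda"),
    ("Arthropoda", "rdfs:label", "Arthropoda"),
    ("ent1", "has_quality", "q1"),
    ("q1", "rdf:type", "ShapeValue"),
    ("ShapeValue", "rdfs:label", "ShapeValue"),
    ("ent2", "rdf:type", "Eye"),
    ("Eye", "rdfs:label", "eye"),
    ("ent2", "has_quality", "q2"),
    ("q2", "rdf:type", "ColorValue"),
    ("ColorValue", "rdfs:label", "ColorValue"),
]

# Same structure as the SeRQL path expressions; variables start with '?'.
PATTERNS = [
    ("?EI", "rdf:type", "?ET"), ("?ET", "rdfs:label", "wing vein"),
    ("?EI", "part_of", "?OrgI"), ("?OrgI", "rdf:type", "?Tax"),
    ("?Tax", "rdfs:label", "Arthropoda"),
    ("?EI", "has_quality", "?QI"), ("?QI", "rdf:type", "?QT"),
    ("?QT", "rdfs:label", "ShapeValue"),
]


def match(pattern, triple, binding):
    """Unify one pattern with one triple; return extended binding or None."""
    extended = dict(binding)
    for pat, value in zip(pattern, triple):
        if pat.startswith("?"):
            if extended.setdefault(pat, value) != value:
                return None   # variable already bound to something else
        elif pat != value:
            return None       # constant does not match
    return extended


def query(patterns, triples):
    """Join the patterns left to right, keeping only consistent bindings."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = match(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


RESULTS = query(PATTERNS, TRIPLES)
print([(r["?EI"], r["?QI"]) for r in RESULTS])
```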
Advantages of ‘SemWeb’ dbs
Advantages over pure SQL
The ontology is the model
constraints encoded in ontology
e.g. certain quality types only applicable to certain entity
types
agile development - fast database integration
Rich modeling constructs
transitivity, subsumption, intersection, etc
powerful QLs and APIs
More (technical) interoperation ‘for free’
URIs
proven?
Open World Assumption (maybe a hindrance?)
Disadvantages of ‘SemWeb’
dbs
Disadvantages
speed
may be slower than SQL
..but in-memory execution is fast
lack of maturity
new technology.. but has a LOT of momentum
foundations
are RDF triples appropriate?
inherent difficulties modeling time
SQL allows n-ary relations/predicates
Hybrid model
SemWeb dbs are commonly layered over
SQL DBs
We can have the best of both worlds
Data View layers
mapping between Obo/OWL model and
domain-specific relational schema
(optionally) materialized for speed
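One way to picture a data view layer is a generic triple table with a domain-specific SQL view on top, optionally copied into a plain table for speed. The sketch below uses Python's built-in `sqlite3` module. The table and view names (`triple`, `phenotype_eav`, `phenotype_eav_mat`) and the instance IDs `q1` and `e1` are assumptions for illustration and do not reflect the actual OBD or Chado schemas.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE triple (subject TEXT, predicate TEXT, object TEXT);

INSERT INTO triple VALUES
  ('q1', 'inheres_in', 'e1'),
  ('q1', 'rdf:type',   'PATO:0000642'),
  ('e1', 'rdf:type',   'FBbt:00005095');

-- Domain-specific view: one row per entity/quality annotation.
CREATE VIEW phenotype_eav AS
SELECT et.object AS entity_term,
       qt.object AS quality_term
FROM triple AS inh
JOIN triple AS qt ON qt.subject = inh.subject AND qt.predicate = 'rdf:type'
JOIN triple AS et ON et.subject = inh.object  AND et.predicate = 'rdf:type'
WHERE inh.predicate = 'inheres_in';

-- Optional materialization: snapshot the view into an ordinary table.
CREATE TABLE phenotype_eav_mat AS SELECT * FROM phenotype_eav;
""")

VIEW_ROWS = conn.execute(
    "SELECT entity_term, quality_term FROM phenotype_eav").fetchall()
MAT_ROWS = conn.execute(
    "SELECT entity_term, quality_term FROM phenotype_eav_mat").fetchall()
print(VIEW_ROWS, MAT_ROWS)
```

An application that only needs entity/quality pairs can read the view or its snapshot, while ontology-aware tools keep working against the triples. The snapshot goes stale when the triples change, so it would need refreshing.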
different applications use appropriate layer
Current progress: OBDSesame
Sesame
open source ‘triple store’
based on Jena
also used in Protégé-OWL
storage layer options
mysql/postgresql generic schema
in-memory
disk-based
OBD in Sesame: current
datasets
Pheno
ZFIN & FB : EAV trial 2003 data
Test ortholog set
FB ‘simple phenotype’ alleles
ZFIN legacy phenotype data, automatically parsed to
EAV
Ontologies: AOs, PATO, Cell, GO
Method
excel & flatfiles->pheno-xml->owl
OWL from http://www.fruitfly.org/~cjm/obo-download
Trialbank
Method: ocelot->obo-xml->owl
Soon
Technology Evaluation:
Sesame
Use case query set
Benchmarks
preliminary conclusions
SQL layering is terrible
in-memory is fast
optimisations?
other triple stores?
up to date results on wiki
http://smi.stanford.edu/projects/cbio/mwiki-internal/index.php/RDF_Sesame_Demo_Benchmark
Need to test OWL-DL entailment
Bigger dataset required for full evaluations
Community effort: pub-semweb-lifesci list
Parallel development: an OBD
Prototype
Initiated prior to OBD-Sesame
Simple deductive database
prolog-based
chado-like schema
can be views on Obo/OWL predicates
amigo-clone user interface
Rapid prototyping
Current dataset
as obd-sesame, plus CT
trivial to drop in more
Example logic query
find mutations affecting the shape of some part
of the head capsule
inheres(QI,EI) &
inst(QI,QT) &
label(QT,shape) &
inst(EI,ETP) &
part_of*(ETP,ET) &
label(ET,'head capsule')
results of query on OBD-prolog:
one annotation to “arista lateral”, “irregular shape”
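The logic query reads as a conjunction of facts plus a reflexive-transitive walk over part_of. A rough Python rendering follows. The fact tables, the part_of chain, the instance names, and the helpers `reachable` and `shape_phenotypes_in` are toy assumptions, not the OBD-prolog data or code. The quality instance is typed directly as "shape" here, where the real system would reach that through is_a inference.

```python
# Toy facts; all identifiers are hypothetical.
INHERES = [("q1", "e1"), ("q2", "e2")]           # inheres(QI, EI)
INST = {"q1": "Shape", "q2": "Color",            # inst(instance, type)
        "e1": "AristaLateral", "e2": "Wing"}
LABEL = {"Shape": "shape", "Color": "color",
         "AristaLateral": "arista lateral", "Arista": "arista",
         "Antenna": "antenna", "HeadCapsule": "head capsule",
         "Wing": "wing", "Thorax": "thorax"}
PART_OF = {"AristaLateral": ["Arista"], "Arista": ["Antenna"],
           "Antenna": ["HeadCapsule"], "Wing": ["Thorax"]}


def reachable(start, edges):
    """Nodes reached from start by zero or more steps (the '*' in part_of*)."""
    reached = {start}
    frontier = [start]
    while frontier:
        node = frontier.pop()
        for nxt in edges.get(node, ()):
            if nxt not in reached:
                reached.add(nxt)
                frontier.append(nxt)
    return reached


def shape_phenotypes_in(whole_label):
    """Shape qualities inhering in some part of the labelled structure."""
    hits = []
    for quality_inst, entity_inst in INHERES:
        if LABEL[INST[quality_inst]] != "shape":
            continue
        entity_type = INST[entity_inst]
        wholes = reachable(entity_type, PART_OF)
        if any(LABEL[w] == whole_label for w in wholes):
            hits.append((LABEL[entity_type], quality_inst))
    return hits


print(shape_phenotypes_in("head capsule"))
```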
OBD TODO
Pheno-xml
finalise release version
finalise Obo/OWL mapping
logic specification
Data
orthologies
OBD - BioPortal integration
how will it work?
Versioning and reconciling changes
decide on ontology versioning first
OBD dependencies
PATO development
UMLS into OBO-site
Ontologies
FMA accessibility?
species-centric AO alignments (XSPAN?)
Sept meeting on AO development
Nov meeting on disease ontologies
Data
MOD pheno annotation
OMIM annotation
Bioportal
Misc
NLP for phenote
Obol
trial on evolutionary phenotype characters
cambridge NLP project
can be used to ‘prime’ phenote
Decomposing MPO
pink fur def= fur, has_quality: pink
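The decomposition above suggests how queries over pre-composed and post-composed annotations could be made interchangeable: expand each pre-composed term through its logical definition before matching. A minimal Python sketch follows. The `LOGICAL_DEFS` table, the allele names, and the helpers `normalise` and `find_alleles` are assumptions for illustration only.

```python
# Hypothetical logical definitions: pre-composed label -> (entity, quality).
LOGICAL_DEFS = {
    "pink fur hue": ("fur", "pink"),
}

# Toy annotations: some use a pre-composed term, some an (entity, quality) pair.
ANNOTATIONS = {
    "allele_1": "pink fur hue",      # pre-composed (MPO style)
    "allele_2": ("fur", "pink"),     # post-composed (AO + PATO style)
    "allele_3": ("tail", "short"),
}


def normalise(annotation):
    """Return an (entity, quality) pair for either style of annotation."""
    if isinstance(annotation, tuple):
        return annotation
    return LOGICAL_DEFS[annotation]


def find_alleles(entity, quality):
    """Match on the decomposed form so both styles answer the same query."""
    return sorted(allele for allele, annotation in ANNOTATIONS.items()
                  if normalise(annotation) == (entity, quality))


print(find_alleles("fur", "pink"))
```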
Discussion
Will SemWeb dbs work?
experiment
Ontology-based modeling
the ontology is the model
importance of
relations ontology
upper ontology
Demos
http://yuri.lbl.gov/amigo/ct
http://yuri.lbl.gov/amigo/obd
http://spade.lbl.gov:8080/sesame/actionFrameset.jsp?repository=mem-rdfs-db