NCBIO-Berkeley - bioontology.org

Core 2: Bioinformatics
NCBO-Berkeley
Berkeley Drosophila Genome
Project
 Finish the sequence of the euchromatic
genome of Drosophila melanogaster
 Annotate biologically important features of this
sequence
 Produce gene disruptions using P element-mediated mutagenesis
 Full-length sequencing and expression
characterization of a cDNA for every gene
 Develop informatics tools
Who is here from NCBO-Berkeley
Chris
Shu
Mark
Sima
Chris
 GadFly database
schema
 GO database schema
 Chado database
schema
 Perl libraries for all
 OBD data architect
Shu
 AmiGO, ImaGO & database
 OBD dev & Data flow
 Compute Pipeline
Mark
 Apollo Genome
Annotation Editor
 Phenote and other
OBD interfaces
Sima
 Adh region
annotation
 Annotation of entire
Drosophila Genome
 Project manager
and coordinator
nonpareil
 Associate Director
OBD Outline
 Core 2 aims, refresher
 Data models for OBD
 phenotypes
 clinical trials
 others
 Modeling frameworks
 exchange formats
 database system
 SQL based vs ‘SemWeb’ dbs
 Progress
 Demo
Core 2 Specific Aims
1. Apply ontologies

Software toolkit for describing and classifying
data
2. Capture, manage, and view data
annotations

Database (OBD) and interfaces to store and view
annotations
3. Investigate and compare implications

Linking human diseases to model systems
4. Maintain

Ongoing reconciliation of ontologies with
annotations
Core 3 Driving Biological
Projects

DBPs



phenotypes: Fly and Zebrafish to human
clinical trials
Core 2 Aims
1. Apply ontologies to describe data
2. Capture, manage, and view data annotations
3. Link disease genes to model systems
4. Reconcile annotation and ontology changes
Apply ontologies to describe
data
 Requirements
 Data capture tools
 phenote
 demo tomorrow
 no tool requirements from UCSF
 Data model
 Database (OBD)
 --aim 2
[Figure: data flow and user's view]
Data models
 Common/shared domain specific models
 Aim 3
 linking disease genes
 model must support this
 granularity
 comparability
Domain specific data models
 FB, ZFIN
 genotype to phenotype
 ‘EAV’
 qualities inhere in entities
 orthologs
 phenotype to disease
 core 2 will help define common model
 UCSF
 clinical trials
 existing ontology-friendly schema - trialbank
Phenotype data model
 Qualities inhere in entities
 Entity term; PATO term





brain FBbt:00005095; fused PATO:0000642
gut MA:0000917; dysplastic PATO:0000640
tail fin ZDB:020702-16; ventralized PATO:0000636
kidney ZDB:020702-16; hypertrophied PATO:0000636
midface ZDB:020702-16; hypoplastic PATO:0000636
 Pre-composed phenotype terms
 Mammalian Phenotype Ontology
 “increased activated B-cell number” MPO:0000319
 “pink fur hue” MPO:0000374
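The entity/quality pairing on this slide can be sketched as a small data structure. This is a minimal illustration, not the OBD implementation: the class names `Term` and `Phenotype` are assumptions made here, while the IDs are copied from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    """An ontology term: an identifier plus a human-readable label."""
    id: str
    label: str


@dataclass(frozen=True)
class Phenotype:
    """A post-composed phenotype: a quality that inheres in an entity."""
    entity: Term
    quality: Term

    def describe(self) -> str:
        # Render as "<quality> <entity>", e.g. "fused brain".
        return f"{self.quality.label} {self.entity.label}"


# IDs taken from the slide example "brain FBbt:00005095; fused PATO:0000642".
fused_brain = Phenotype(
    entity=Term("FBbt:00005095", "brain"),
    quality=Term("PATO:0000642", "fused"),
)
print(fused_brain.describe())  # fused brain
```

A pre-composed term such as an MPO ID would instead be a single `Term` that carries both the entity and the quality in one label.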
Extensions to simple model
 What about
 Relational attributes
 Quantitative vs qualitative
 Post-composing entity and attribute terms
 Relative states/values
 Variation in place, space and time
 A better treatment of absence
 See CSHL Pheno meeting talk
 also, more detailed formal presentation (available)
 Not to mention genotypes, environments,
provenance, etc
Modeling clinical trials
 Model already described using frame-based schema
 Further modeling required?
 abstraction
 to integrate more with other OBD datatypes
 views
 to only show parts relevant to OBD/BioPortal
Future DBPs and use cases
 OBD will contain a variety of general
types of data
 Modeling is expensive
 use existing models where appropriate
 but whole must be cohesive and integrated
 Most of this talk focuses on the pheno
DBPs for illustrative purposes
Modeling frameworks
 language
 technology
Modeling data: underlying
formalism
 Model is expressed with modeling language
 Options




Relational/SQL
Semi-structured, XML
Object-centric (UML, frame-based?)
Logic based
 description logic: e.g. OWL
 first-order logic: e.g. CL
 Natural language descriptions
 Model should be independent of language it
is expressed in
Data exchange language:
XML
 Simple
 XML is suited for data exchange
 XML can drive software spec
 constrains programmatic data model
 XSD can generate UML
 closed world assumption is useful
 cf Ruttenberg et al
 Mature technology
 well understood by developers, MODs
 standards
How OBD uses XML
 obd-geno-pheno-xml (aka pheno-xml)
 actually multiple modular components




genotype schema
phenotype schema: ‘EAV’
environment schema
provenance schema
 used as
 exchange format
 cf: gene ontology association files
 no need for ClinicalTrials-XML
Example pheno-xml
<genotype id="ZFIN:tm84">
  <name>ZFIN:tm84</name>
  <genotype_phenotype_association>
    <phenotype>
      <entity type="ZDB-ANAT-010921-528">
        <quality type="PATO:……">
          <state type="PATO:0000636">
            <time_range type="ZDB-STAGE-010723-12"/>
          </state>
        </quality>
      </entity>
    </phenotype>
  </genotype_phenotype_association>
</genotype>
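As a rough sketch of how an exchange file like this might be consumed, the following reads the example with Python's standard `xml.etree.ElementTree` and flattens it into rows. The function name `extract_eav` and the row layout are assumptions for illustration; the elided quality ID is left as it appears on the slide.

```python
import xml.etree.ElementTree as ET

# The slide's example, closed with </genotype> so that it parses.
PHENO_XML = """
<genotype id="ZFIN:tm84">
  <name>ZFIN:tm84</name>
  <genotype_phenotype_association>
    <phenotype>
      <entity type="ZDB-ANAT-010921-528">
        <quality type="PATO:……">
          <state type="PATO:0000636">
            <time_range type="ZDB-STAGE-010723-12"/>
          </state>
        </quality>
      </entity>
    </phenotype>
  </genotype_phenotype_association>
</genotype>
"""


def extract_eav(xml_text):
    """Return (genotype, entity, quality, state, stage) rows from pheno-xml."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for entity in root.iter("entity"):
        for quality in entity.iter("quality"):
            for state in quality.iter("state"):
                time_range = state.find("time_range")
                stage = time_range.get("type") if time_range is not None else None
                rows.append((root.get("id"), entity.get("type"),
                             quality.get("type"), state.get("type"), stage))
    return rows


rows = extract_eav(PHENO_XML)
print(rows)
```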
SQL Databases
 Data storage, management and
querying
 all MODs use SQL dbs
 Lots of advantages
 scalable, standard QL, mature, APIs, etc
 pure relational model is reasonably formal
 XML/SQL more or less compatible
 low impedance mismatch
Schemas for geno-pheno data
 We already have schema: Chado
 Used by many MODs (eg FB)
 others are ‘chado compliant’ (eg ZFIN)
 Modular






ontologies
genomic
genotype
phenotype
phylogenies
…etc
 Phenotype module needs updating
 will be driven by pheno-xml
Problem solved?
 We have two mature, complementary
technologies, and can define schemas
for our model in an appropriate
formalism for each
 Is this enough to work with?
Issues
 OBD will be much more than geno-pheno
 clinical trials
 future DBPs, other NCBCs
 any data expressed in an ontology language
 Software and schema development
expensive
 fragility in face of schema evolution
 development gets bogged down in data exchange
issues
Major issue
 SQL and XPath work great for
‘traditional’ data…
 …but are too low level for ontology-centric data
 lack of inference
 no way to directly express ontology
constraints
Use cases from previous
experience: AmiGO
 GO
 “find all TF genes” (is_a closure)
 “find all gene products localised to endoplasmic
reticulum” (part_of closure, over is_a)
 Our solution (AmiGO & go-sqldb)
 pre-compute transitive closure over all relations in
db
 (sort of) works for GO (for now)
 refresh problem
 explosive for tangled DAGs
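The pre-computed closure approach can be sketched on a toy graph. The edges and gene annotations below are hypothetical, and the sketch ignores relation types when composing paths, which is a simplification of what a real GO database would do.

```python
from collections import defaultdict

# Hypothetical toy ontology: (child, relation, parent) edges.
EDGES = [
    ("rough ER", "is_a", "endoplasmic reticulum"),
    ("ER membrane", "part_of", "endoplasmic reticulum"),
    ("rough ER membrane", "is_a", "ER membrane"),
    ("endoplasmic reticulum", "part_of", "cell"),
]

# Hypothetical annotations: gene product -> term it is annotated to.
ANNOTATIONS = {"geneA": "rough ER membrane", "geneB": "cell"}


def closure(edges):
    """Return every (descendant, ancestor) pair reachable through the edges."""
    parents = defaultdict(set)
    for child, _relation, parent in edges:
        parents[child].add(parent)
    pairs = set()
    for start in list(parents):
        stack = list(parents[start])
        while stack:
            node = stack.pop()
            if (start, node) not in pairs:
                pairs.add((start, node))
                stack.extend(parents.get(node, ()))
    return pairs


# Computed once up front; must be recomputed whenever the ontology changes.
CLOSURE = closure(EDGES)


def genes_annotated_under(term):
    """Genes annotated to the term itself or to any of its descendants."""
    return sorted(gene for gene, annotated in ANNOTATIONS.items()
                  if annotated == term or (annotated, term) in CLOSURE)


print(genes_annotated_under("endoplasmic reticulum"))  # ['geneA']
print(len(EDGES), "edges ->", len(CLOSURE), "closure pairs")  # 4 edges -> 8 closure pairs
```

Even in this tiny graph the closure is twice the size of the edge list, which hints at why the table grows quickly for tangled DAGs and why it has to be refreshed after every ontology update.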
OBD requires more
ontological awareness
 Other relations
 ontogenic (eg derives_from)
 transitive_over
 Other types of data
 Pre- versus post- composed terms
 E.g. MPO versus AO+PATO
 E.g. Entity+Spatial qualifier
 queries over either should be interchangeable
Solution: more expressive
formalisms
 QLs and APIs should provide and
abstract away common ontology
operations
 ease of programming, optimisation
 Choices
 ‘Semweb’ databases
 RDF + RDFS + Owl [ lite + DL ] + extra
 lots to choose from, emerging standards
 compatible with Obo v1.2 spec
 Deductive databases
 superset of relational databases
Modeling phenotypes as
RDF/OWL or Obo instances
classes/
terms
instances
entity
quality
Example query in SeRQL
find mutations affecting the shape of the wing vein:
SELECT DISTINCT
  EI, ET, OrgI, QI, QT, QN
FROM {EI} rdf:type {ET} rdfs:label {EN},
     {EI} OBO_REL_part_of {OrgI} rdf:type {Tax} rdfs:label {TaxN},
     {EI} OBO_REL_has_quality {QI} rdf:type {QT} rdfs:label {QN}
WHERE
  label(EN) = "wing vein" AND
  label(TaxN) = "Arthropoda" AND
  label(QN) = "ShapeValue"
results of query on OBD-sesame:
one annotation to “wing vein L2”, “branched”
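To make the shape of this query concrete, here is a minimal in-memory triple-pattern matcher in Python. It is a sketch, not Sesame's engine: the triples are hypothetical toy data, the taxon path is omitted for brevity, and the inferred `rdf:type` triples that an RDFS-aware store would derive are written out by hand.

```python
# Hypothetical toy triples.
TRIPLES = [
    ("vein1", "rdf:type", "wing_vein_L2"),
    ("vein1", "rdf:type", "wing_vein"),        # inferred via subclass
    ("wing_vein_L2", "rdfs:label", "wing vein L2"),
    ("wing_vein", "rdfs:label", "wing vein"),
    ("vein1", "OBO_REL_has_quality", "q1"),
    ("q1", "rdf:type", "branched"),
    ("q1", "rdf:type", "ShapeValue"),          # inferred via subclass
    ("branched", "rdfs:label", "branched"),
    ("ShapeValue", "rdfs:label", "ShapeValue"),
]


def match(patterns, triples, binding=None):
    """Yield variable bindings that satisfy every (s, p, o) pattern.

    Strings starting with '?' are variables; anything else must match exactly.
    """
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        ok = True
        for pattern_part, value in zip(first, triple):
            if pattern_part.startswith("?"):
                # Bind the variable, or check it against an earlier binding.
                if candidate.setdefault(pattern_part, value) != value:
                    ok = False
                    break
            elif pattern_part != value:
                ok = False
                break
        if ok:
            yield from match(rest, triples, candidate)


QUERY = [
    ("?EI", "rdf:type", "?ET"),
    ("?ET", "rdfs:label", "wing vein"),
    ("?EI", "OBO_REL_has_quality", "?QI"),
    ("?QI", "rdf:type", "?QT"),
    ("?QT", "rdfs:label", "ShapeValue"),
]

results = list(match(QUERY, TRIPLES))
print(results)
```

On this toy data the query binds `?EI` to `vein1` and `?QI` to `q1`, but only because the superclass type triples are present, which is the kind of inference plain SQL does not provide.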
Advantages of ‘SemWeb’ dbs
 Advantages over pure SQL
 The ontology is the model
 constraints encoded in ontology
 e.g. certain quality types only applicable to certain entity
types
 agile development - fast database integration
 Rich modeling constructs
 transitivity, subsumption, intersection, etc
 powerful QLs and APIs
 More (technical) interoperation ‘for free’
 URIs
 proven?
 Open World Assumption (maybe a hindrance?)
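The idea that constraints live in the ontology rather than in application code can be illustrated with a small check. Everything here is hypothetical: the tables `APPLICABLE_TO` and `ENTITY_KIND` stand in for information that would be read from the ontology itself.

```python
# Hypothetical constraint table: quality type -> kinds of entity it may inhere in.
APPLICABLE_TO = {
    "branched": {"anatomical structure"},
    "increased rate": {"biological process"},
}

# Hypothetical classification of entity terms.
ENTITY_KIND = {
    "wing vein": "anatomical structure",
    "cell division": "biological process",
}


def is_valid(entity, quality):
    """True if the quality is allowed to inhere in this kind of entity."""
    return ENTITY_KIND[entity] in APPLICABLE_TO[quality]


print(is_valid("wing vein", "branched"))      # True
print(is_valid("cell division", "branched"))  # False
```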
Disadvantages of ‘SemWeb’
dbs
 Disadvantages
 speed
 may be slower than SQL
 ..but in-memory execution is fast
 lack of maturity
 new technology.. but has a LOT of momentum
 foundations
 are RDF triples appropriate?
 inherent difficulties modeling time
 SQL allows n-ary relations/predicates
Hybrid model
 SemWeb dbs are commonly layered over
SQL DBs
 We can have the best of both worlds
 Data View layers
 mapping between Obo/OWL model and
domain-specific relational schema
 (optionally) materialized for speed
 different applications use appropriate layer
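A minimal sketch of the layering idea, using Python's built-in `sqlite3`: a generic triple table holds the data, a view exposes a domain-specific phenotype table, and the view is optionally materialized. The table and column names are assumptions for illustration, not the OBD or Chado schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE triple (subject TEXT, predicate TEXT, object TEXT);

-- Domain-specific view layered over the generic triple table.
CREATE VIEW phenotype AS
SELECT e.object AS entity_type, q.object AS quality_type
FROM triple AS hq
JOIN triple AS e ON e.subject = hq.subject AND e.predicate = 'rdf:type'
JOIN triple AS q ON q.subject = hq.object  AND q.predicate = 'rdf:type'
WHERE hq.predicate = 'has_quality';
""")

# One annotation: an instance of brain has a quality instance of type fused.
conn.executemany("INSERT INTO triple VALUES (?, ?, ?)", [
    ("brain1", "rdf:type", "FBbt:00005095"),
    ("q1", "rdf:type", "PATO:0000642"),
    ("brain1", "has_quality", "q1"),
])

# Optional materialization: copy the view into a real table for speed.
conn.execute("CREATE TABLE phenotype_mat AS SELECT * FROM phenotype")

rows = conn.execute(
    "SELECT entity_type, quality_type FROM phenotype_mat").fetchall()
print(rows)  # [('FBbt:00005095', 'PATO:0000642')]
```

Applications that want ontology-level access would query the triple layer, while those expecting a conventional relational schema would read the view or its materialized copy.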
Current progress: OBDSesame
 Sesame
 open source ‘triple store’
 based on Jena
 also used in Protégé-OWL
 storage layer options
 mysql/postgresql generic schema
 in-memory
 disk-based
OBD in Sesame: current
datasets
 Pheno
 ZFIN & FB : EAV trial 2003 data
 Test ortholog set
 FB ‘simple phenotype’ alleles
 ZFIN legacy phenotype data, automatically parsed to
EAV
 Ontologies: AOs, PATO, Cell, GO
 Method
 excel & flatfiles->pheno-xml->owl
 OWL from http://www.fruitfly.org/~cjm/obo-download
 Trialbank
 Method: ocelot->obo-xml->owl
 Soon
Technology Evaluation:
Sesame
 Use case query set
 Benchmarks
 preliminary conclusions
 SQL layering performs poorly
 in-memory is fast
 optimisations?
 other triple stores?
 up to date results on wiki

http://smi.stanford.edu/projects/cbio/mwiki-internal/index.php/RDF_Sesame_Demo_Benchmark
 Need to test OWL-DL entailment
 Bigger dataset required for full evaluations
 Community effort: pub-semweb-lifesci list
Parallel development: an OBD
Prototype
 Initiated prior to OBD-Sesame
 Simple deductive database
 prolog-based
 chado-like schema
 can be views on Obo/OWL predicates
 amigo-clone user interface
 Rapid prototyping
 Current dataset
 as obd-sesame, plus CT
 trivial to drop in more
Example logic query
find mutations affecting the shape of some part
of the head capsule
inheres(QI,EI) &
inst(QI,QT) &
label(QT,shape) &
inst(EI,ETP) &
part_of*(ETP,ET) &
label(ET,'head capsule')
results of query on OBD-prolog:
one annotation to “arista lateral”, “irregular shape”
OBD TODO
 Pheno-xml
 finalise release version
 finalise Obo/OWL mapping
 logic specification
 Data
 orthologies
 OBD - BioPortal integration
 how will it work?
 Versioning and reconciling changes
 decide on ontology versioning first
OBD dependencies
 PATO development
 UMLS into OBO-site
 Ontologies




FMA accessibility?
species-centric AO alignments (XSPAN?)
Sept meeting on AO development
Nov meeting on disease ontologies
 Data
 MOD pheno annotation
 OMIM annotation
 Bioportal
Misc
 NLP for phenote




Obol
trial on evolutionary phenotype characters
cambridge NLP project
can be used to ‘prime’ phenote
 Decomposing MPO
 pink fur def= fur, has_quality: pink
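Decomposition is what lets queries treat pre-composed and post-composed annotations interchangeably. A minimal sketch, assuming a hypothetical lookup table and hypothetical allele names; only the MPO ID and the fur/pink definition come from the slides.

```python
# Hypothetical decomposition table: pre-composed term ID -> (entity, quality).
# "pink fur hue" MPO:0000374 decomposes to fur, has_quality: pink.
DECOMPOSITION = {"MPO:0000374": ("fur", "pink")}

# Hypothetical annotations in both styles.
ANNOTATIONS = {
    "mouse_allele_1": "MPO:0000374",     # pre-composed term
    "mouse_allele_2": ("fur", "pink"),   # post-composed entity + quality
    "mouse_allele_3": ("tail", "short"),
}


def normalize(annotation):
    """Return an (entity, quality) pair for either annotation style."""
    if isinstance(annotation, str):
        return DECOMPOSITION[annotation]
    return annotation


def find(entity, quality):
    """Alleles whose phenotype matches, regardless of how it was annotated."""
    return sorted(allele for allele, annotation in ANNOTATIONS.items()
                  if normalize(annotation) == (entity, quality))


print(find("fur", "pink"))  # ['mouse_allele_1', 'mouse_allele_2']
```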
Discussion
 Will SemWeb dbs work?
 experiment
 Ontology-based modeling
 the ontology is the model
 importance of
 relations ontology
 upper ontology
Demos
 http://yuri.lbl.gov/amigo/ct
 http://yuri.lbl.gov/amigo/obd
 http://spade.lbl.gov:8080/sesame/actionFrameset.jsp?repository=mem-rdfs-db