Beyond Federation of Data Collections


Brain Data & Knowledge Grid
(or: Towards Services for Knowledge-Based
Mediation of Neuroscience Information Sources)
Data-Intensive Computing Environments
San Diego Supercomputer Center (SDSC)
National Center for Microscopy and
Imaging Research (NCMIR)
Reagan Moore
Mark Ellisman
Chaitan Baru
Amarnath Gupta
Bertram Ludäscher
Richard Marciano
Arcot Rajasekar
Ilya Zaslavsky
...
Maryann Martone
Steve Peltier
Steve Lamont
...
University of California, San Diego
Infrastructure for Sharing Neuroscience Data
SOURCES:
• NCMIR, U.C. San Diego
• Caltech Neuroimaging
• Center for Imaging Science, Johns Hopkins
• Center for Computational Biology, Montana State
• Laboratory of Neuro Imaging (LONI), UCLA
• Computational Neurobiology Laboratory, Salk Inst.
• Van Essen Laboratory, Washington University
• …
Data Management Infrastructure (DICE/NPACI)
• MIX: Mediation in XML
• MCAT: information discovery
• SRB: data handling
• HPSS: storage
• ...
[Diagram labels: Knowledge-based GRID infrastructure (marked "?"); Data Management Infrastructure ("Data Grid"): GTOMO, Telemicroscopy, Globus, SRB/MCAT, HPSS. Sources shown: surface atlas, Van Essen Lab; stereotaxic atlas, LONI; MCell, CNL, Salk; NCMIR, UCSD; CCB, Montana SU.]
Sharing Resources on the Brain Data Grid
• Scientific groups ...
– create data products (e.g., text data, images, simulation data …)
– put them in collections
– add metadata (who created it, what is the data about …)
– make it available for sharing (on the web, in data caches, in HPSS, …)
• Technical challenges ...
– size & packaging of data
– heterogeneity: data types, storage technologies, transport mechanisms,
authentication, ...
– access levels: collection, object, fragment; data-specific functions
(“data blades”)
• Data Grid technologies can help ...
– distributed data management, e.g., Storage Resource Broker/Metadata
Catalog (SRB/MCAT), computing (Globus), ...
– focus is on resource sharing (data, networks, cycles)
Integration Issue: Semantic Integration/Mediation
??? SEMANTIC INTEGRATION ???
SYNTACTIC/STRUCTURAL Integration (MIX: Mediation of Information using XML)
• Integrated Views (Src-XML => Intgr-XML)
• Schema Integration (DTD => DTD)
• Wrapping, Data Extraction (Text => XML)
SYSTEM INTEGRATION (SRB/MCAT, Distributed Query Processing)
• storage, query capabilities
• protocols & services
• Globus, JDBC, DOM, CORBA, TCP/IP, grid-ftp, HTTP
Standard Mediator/Wrapper Architecture
[Diagram labels:]
Client/User-Query (XML Q/A)
INTEGRATED VIEW
• Integration logic: SRB/MCAT, DOM, X(ML)Query
• domain semantics ???
• GRID federation services ???
Wrappers, one per source: protocol translation (structure, syntax, transport, storage)
(Neuro)Science (Re)Sources: Lab1 (DB), Lab2 (Files), Lab3 (WWW)
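The mediator/wrapper layering on this slide can be sketched in a few lines. This is a minimal Python illustration, not the deck's implementation: the class names (DbWrapper, FileWrapper, Mediator), the toy rows, and the semicolon-delimited file format are all assumptions made up for the example.

```python
from typing import Dict, Iterable, List, Protocol, Tuple

Record = Dict[str, str]


class Wrapper(Protocol):
    """Common interface every source wrapper exposes to the mediator."""

    def query(self, region: str) -> Iterable[Record]: ...


class DbWrapper:
    """Hypothetical wrapper over a relational source (Lab1)."""

    def __init__(self, rows: List[Tuple[str, str]]):
        self._rows = rows  # (protein, region) tuples stand in for a table

    def query(self, region: str) -> List[Record]:
        return [{"protein": p, "region": r, "source": "Lab1"}
                for p, r in self._rows if r == region]


class FileWrapper:
    """Hypothetical wrapper over a delimited text file (Lab2)."""

    def __init__(self, text: str):
        self._lines = text.strip().splitlines()

    def query(self, region: str) -> List[Record]:
        out = []
        for line in self._lines:
            protein, r = line.split(";")
            if r == region:
                out.append({"protein": protein, "region": r, "source": "Lab2"})
        return out


class Mediator:
    """Sends one query to every wrapper and concatenates the answers."""

    def __init__(self, wrappers: Iterable[Wrapper]):
        self._wrappers = list(wrappers)

    def query(self, region: str) -> List[Record]:
        results: List[Record] = []
        for wrapper in self._wrappers:
            results.extend(wrapper.query(region))
        return results


lab1 = DbWrapper([("RyR", "cerebellum"), ("NCS-1", "hippocampus")])
lab2 = FileWrapper("calbindin;cerebellum\nNCS-1;cerebellum")
mediator = Mediator([lab1, lab2])
answer = mediator.query("cerebellum")
```

Note that this sketch only unifies syntax and structure: the two sources are joinable because both happen to use the literal term "cerebellum". Nothing in it captures domain semantics, which is the gap the "???" on the slide points at.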
The Need for Semantic Integration
Cross-source queries
What is the cerebellar distribution of rat proteins with more than 70%
homology with human NCS-1? Any structure specificity?
How about other rodents?
[Diagram labels: Semantic (knowledge-based) mediation services. ???Mediator??? with ??? Integrated View ??? and ??? Integrated View Definition ???: cross-source relationships are modeled. Wrappers (one per source): data, relationships, constraints are modeled (CMs). Sources: protein localization, morphometry, neurotransmission, Web (CaBP, Expasy).]
Hidden Semantics: Protein Localization
<protein_localization>
  <neuron type="purkinje cell" />
  <protein channel="red">
    <name>RyR</>
    ….
  </protein>
  <region h_grid_pos="1" v_grid_pos="A">
    <density>
      <structure fraction="0.8">
        <name>spine</>
        <amount name="RyR">0</>
      </>
      <structure fraction="0.2">
        <name>branchlet</>
        <amount name="RyR">30</>
      </>
Hidden semantics (slide callouts):
• Purkinje Cell layer of Cerebellar Cortex
• Molecular layer of Cerebellar Cortex
• Fragment of dendrite
Hidden Semantics: Morphometry
<neuron name="purkinje cell">
  <branch level="10">
    <shaft>
      …
    </shaft>
    <spine number="1">
      <attachment x="5.3" y="-3.2" z="8.7" />
      <length>12.348</>
      <min_section>1.93</>
      <max_section>4.47</>
      <surface_area>9.884</>
      <volume>7.930</>
      <head>
        <width>4.47</>
        <length>1.79</>
      </head>
    </spine>
    …
Hidden semantics (slide callouts):
• Branch level beyond 4 is a branchlet
• Must be dendritic because Purkinje cells don’t have somatic spines
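The two morphometry callouts ("branch level beyond 4 is a branchlet" and "Purkinje cells don't have somatic spines, so a spine must be dendritic") are exactly the kind of unstated knowledge a mediator has to make explicit. A minimal Python sketch of that idea follows. It assumes a well-formed stand-in for the slide's XML (the `</>` shorthand expanded to full closing tags, most fields omitted), and the function name `annotate` and the label "unclassified" are illustrative choices, not part of the original system.

```python
import xml.etree.ElementTree as ET

# Well-formed stand-in for the slide's morphometry fragment; values are from the slide.
MORPHOMETRY = """
<neuron name="purkinje cell">
  <branch level="10">
    <spine number="1">
      <length>12.348</length>
      <volume>7.930</volume>
    </spine>
  </branch>
</neuron>
"""

BRANCHLET_MIN_LEVEL = 4  # slide callout: "Branch level beyond 4 is a branchlet"


def annotate(xml_text):
    """Return (spine number, inferred structure, inferred compartment) tuples."""
    neuron = ET.fromstring(xml_text)
    is_purkinje = neuron.get("name") == "purkinje cell"
    facts = []
    for branch in neuron.iter("branch"):
        level = int(branch.get("level"))
        # The slide only states the rule for levels beyond 4.
        structure = "branchlet" if level > BRANCHLET_MIN_LEVEL else "unclassified"
        # findall() looks at direct children only, so nested branches
        # do not cause a spine to be reported twice.
        for spine in branch.findall("spine"):
            # Purkinje cells have no somatic spines, so the spine is dendritic.
            compartment = "dendrite" if is_purkinje else "unknown"
            facts.append((spine.get("number"), structure, compartment))
    return facts


facts = annotate(MORPHOMETRY)
```

With these derived facts, the morphometry spine can be matched against the "spine" and "branchlet" structures of the protein localization source, even though neither XML document states the connection.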
Knowledge-Based (Semantic) Mediation
• Multiple Worlds Integration Problem:
– compatible terms not directly joinable
– complex, indirect associations among attributes
– unstated integrity constraints
• Approach:
– a “theory” under which terms can be “semantically joined”
=> lift mediation to the level of conceptual models (CMs)
=> formalize domain knowledge, ICs become rules over CMs
=> Knowledge-Based/Model-Based (Semantic) Mediation
XML-Based vs. Model-Based Mediation
XML-Based Mediation
• Integrated-DTD := XML-QL(Src1-DTD,...)
• No Domain Constraints
• Structural Constraints (DTDs): Parent, Child, Sibling, ...
  A = (B*|C),D
  B = ...
• Layers: Raw Data, XML Elements, XML Models
Model-Based Mediation
• CM ~ {Descr.Logic, ER, UML, RDF/XML(-Schema), …}
• CM-QL ~ {F-Logic, OIL, DAML, …}
• Integrated-CM := CM-QL(Src1-CM,...)
• DOMAIN MAP: Logical Domain Constraints (IF ... THEN ... rules)
• Classes, Relations, is-a, has-a, ... (C1, C2, C3, R, ....)
• Layers: Raw Data, (XML) Objects, Conceptual Models
Knowledge-Based Mediator Prototype
[Diagram labels, top to bottom:]
USER/Client
CM (Integrated View)
Mediator Engine: Domain Map (DM), Integrated View Definition (IVD)
XSB Engine: FL rule proc., LP rule proc., Graph proc.
CM Plug-In: GCM for each of CM S1, CM S2, CM S3
Logic API (capabilities); CM Queries & Results (exchanged in XML)
CM-Wrapper (one per source)
XML-Wrapper (one per source)
S1, S2, S3
Mediation Services:
Source Registration (System Issues)
Source
• Data Type: table, tree, file
• Result Delivery: tuple-at-a-time, set-at-a-time, stream, binary for viewer
• Query Capability: selections, SPJ, QL (ARC, SQL, XML, DOOD)
• Access Protocol: SRB, HTTP, JDBC
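The system-level facts a source registers (data type, result delivery, query capability, access protocol) can be pictured as a small record that the mediator consults before pushing work to a source. The following Python sketch assumes hypothetical names throughout (SourceRegistration, register, can_answer) and an invented example registration; it is not the prototype's actual registration format.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, FrozenSet, Set


class DataType(Enum):
    TABLE = "table"
    TREE = "tree"
    FILE = "file"


class Delivery(Enum):
    TUPLE_AT_A_TIME = "tuple-at-a-time"
    SET_AT_A_TIME = "set-at-a-time"
    STREAM = "stream"
    BINARY_FOR_VIEWER = "binary for viewer"


class AccessProtocol(Enum):
    SRB = "SRB"
    HTTP = "HTTP"
    JDBC = "JDBC"


@dataclass(frozen=True)
class SourceRegistration:
    name: str
    data_type: DataType
    delivery: Delivery
    protocol: AccessProtocol
    query_capabilities: FrozenSet[str]  # e.g. {"selection", "projection", "join"} for SPJ

    def can_answer(self, needed: Set[str]) -> bool:
        """True if the source itself supports every needed operation."""
        return needed <= self.query_capabilities


registry: Dict[str, SourceRegistration] = {}


def register(source: SourceRegistration) -> None:
    """Add a source to the registry, rejecting duplicate names."""
    if source.name in registry:
        raise ValueError(f"source {source.name!r} is already registered")
    registry[source.name] = source


# Invented example: an XML source reachable over HTTP that supports selections only.
register(SourceRegistration(
    name="example_xml_source",
    data_type=DataType.TREE,
    delivery=Delivery.SET_AT_A_TIME,
    protocol=AccessProtocol.HTTP,
    query_capabilities=frozenset({"selection"}),
))
src = registry["example_xml_source"]
```

When `can_answer` returns False for an operation such as a join, the mediator would have to fetch the selected data and perform that step itself, which is where capability-based rewriting comes in.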
Mediation Services:
Source Registration (Semantics Issues)
• Domain Map Registration
– provide concept space/ontology
• … as a private object (“myANATOM”)
• … merge with others (give “semantic bridges”)
• … and check for conflicts
• Conceptual Model Registration
– schema: classes, associations, attributes
– domain constraints
– “put data into context” (linking data to the domain
map)
ANATOM Domain Map
[Figure: ANATOM domain map]
Senselab (Yale) and NCMIR (UCSD) “Semantic Bridge”
anatom_dom(X) :- (ucsd_has_a(X,_) ; ucsd_has_a(_,X) ; ucsd_isa(X,_) ; ucsd_isa(_,X)).
senselab_dom(X) :- (sl_has_a(X,_) ; sl_has_a(_,X) ; sl_isa(X,_) ; sl_isa(_,X)).
% map Senselab anatom terms to equivalent UCSD ANATOM
sl2ucsd(X,X) :- senselab_dom(X), anatom_dom(X).
sl2ucsd('A',axon).
sl2ucsd('AH',axon).
sl2ucsd('Dad',spiny_branchlet). % should map to a PATH not just the end of the path
sl2ucsd('Dam',main_branches). % some of the main_branches based on the branch level
sl2ucsd('Dap',main_branches).
sl2ucsd('Dbd',spiny_branchlet).
sl2ucsd('Dbm',main_branches).
sl2ucsd('Dbp',main_branches).
sl2ucsd('Ded',spiny_branchlet).
sl2ucsd('Dem',main_branches).
sl2ucsd('Dep',main_branches).
sl2ucsd('T',axon).
% keep has_a edge if at least one node is known from UCSD
has_a(X,Y) :- sl2ucsd(_,X), ucsd_has_a(X,Y).
has_a(X,Y) :- sl2ucsd(_,Y), ucsd_has_a(X,Y).
% keep all and only UCSD is_a rels
isa(X,Y) :- ucsd_isa(X,Y).
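For readers less used to Prolog, the bridge rules above can be read as plain set operations. The Python sketch below mirrors them under loud assumptions: the `ucsd_*` and `sl_*` edge sets are toy facts invented for illustration (the real relations are not shown in the deck), and only a subset of the explicit `sl2ucsd` mappings is repeated.

```python
# Toy facts: the real ucsd_* and sl_* relations are not shown in the source.
ucsd_has_a = {
    ("dendrite", "main_branches"),
    ("main_branches", "spiny_branchlet"),
    ("neuron", "axon"),
    ("neuron", "soma"),
    ("soma", "nucleus"),
}
ucsd_isa = {("purkinje_cell", "neuron")}
sl_has_a = {("neuron", "axon")}
sl_isa = set()


def domain(has_a, isa):
    """All terms that occur on either side of a has_a or isa edge."""
    return {term for edge in has_a | isa for term in edge}


anatom_dom = domain(ucsd_has_a, ucsd_isa)
senselab_dom = domain(sl_has_a, sl_isa)

# Explicit Senselab -> UCSD mappings (subset of the facts listed above).
explicit = {
    ("A", "axon"),
    ("AH", "axon"),
    ("Dad", "spiny_branchlet"),
    ("Dam", "main_branches"),
    ("T", "axon"),
}

# sl2ucsd(X,X) holds for terms both vocabularies share; the rest is explicit.
sl2ucsd = {(t, t) for t in senselab_dom & anatom_dom} | explicit

# Keep a UCSD has_a edge if at least one endpoint is the target of a mapping.
mapped = {ucsd_term for _, ucsd_term in sl2ucsd}
has_a = {(x, y) for (x, y) in ucsd_has_a if x in mapped or y in mapped}

# Keep all and only the UCSD is_a relationships.
isa = set(ucsd_isa)
```

In this toy data the edge ("soma", "nucleus") is dropped from the merged map because neither term is reachable from a Senselab term, while ("neuron", "soma") survives because "neuron" is shared by both vocabularies.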
Refinement of a Domain Map (Ontology):
Putting Data in Context via Registration of new Classes & Relationships
[Figure: domain map fragment. Node labels: Neuron, MyNeuron, Spiny Neuron, Medium Spiny Neuron, Compartment, Soma, Axon, Dendrite, MyDendrite, Neostriatum, Neurotransmitter, GABA, Substance P, Dopamine R, Globus Pallidus Int., Globus Pallidus Ext., Substantia Nigra Pc, Substantia Nigra Pr. Edge and operator labels: ALL:has, exp, =, OR, AND.]
Mediation Services:
Integrated View Definition
DERIVE
  protein_distribution(Protein, Organism, Brain_region, Feature_name, Anatom, Value)
FROM
  % from PROLAB
  I:protein_label_image[ proteins ->> {Protein}; organism -> Organism;
      anatomical_structures ->> {AS:anatomical_structure[name->Anatom]}] ,
  % from ANATOM
  NAE:neuro_anatomic_entity[name->Anatom; located_in->>{Brain_region}],
  AS..segments..features[name->Feature_name; value->Value].
• provided by the domain expert and mediation engineer
• declarative language (here: Frame-logic)
Example Query Evaluation (I)
• Example: protein_distribution
– given: organism, protein, brain_region
– Use DOMAIN-KNOWLEDGE-BASE:
• recursively traverse the has_a_star paths under brain_region and collect
all anatomical_entities
– Source PROLAB:
• join with anatomical structures and collect the value of attribute
“image.segments.features.feature.protein_amount” where
“image.segments.features.feature.protein_name” = protein and
“study_db.study.animal.name” = organism
– Mediator:
• aggregate over all parents up to brain_region
• report distribution
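The three evaluation steps above (closure over the domain map, join with PROLAB, aggregation up to brain_region) can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the HAS_A edges and PROLAB rows are invented toy data, the domain map is assumed to be a tree (each entity has one parent), and summation is assumed as the aggregate, which the slide does not specify.

```python
from collections import defaultdict

# Hypothetical has_a edges of the domain map (parent -> children); assumed to form a tree.
HAS_A = {
    "cerebellum": ["cerebellar_cortex"],
    "cerebellar_cortex": ["purkinje_cell_layer", "molecular_layer"],
    "molecular_layer": ["spiny_branchlet"],
}
PARENT = {child: parent for parent, children in HAS_A.items() for child in children}

# Hypothetical PROLAB rows: (organism, protein, anatomical structure, protein amount).
PROLAB = [
    ("rat", "RyR", "spiny_branchlet", 30.0),
    ("rat", "RyR", "purkinje_cell_layer", 12.0),
    ("rat", "NCS-1", "molecular_layer", 5.0),
    ("mouse", "RyR", "molecular_layer", 7.0),
]


def has_a_star(region):
    """Step 1: the region plus every entity reachable through has_a edges."""
    seen, stack = set(), [region]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(HAS_A.get(node, []))
    return seen


def protein_distribution(organism, protein, region):
    """Steps 2 and 3: join PROLAB with the closure, then sum up to the region."""
    entities = has_a_star(region)
    totals = defaultdict(float)
    for org, prot, anatom, amount in PROLAB:
        if org == organism and prot == protein and anatom in entities:
            # Credit the amount to the structure and to each ancestor up to region.
            node = anatom
            while True:
                totals[node] += amount
                if node == region:
                    break
                node = PARENT[node]
    return dict(totals)


dist = protein_distribution("rat", "RyR", "cerebellum")
```

If the domain map were a DAG rather than a tree, walking a single parent chain would undercount and a naive multi-parent walk could double count, so a real mediator would need a more careful aggregation than this sketch.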
Example Query Evaluation (II)
"How does the parallel fiber output (Yale/SENSELAB) relate to the
distribution of Ryanodine Receptors (UCSD/NCMIR)?"
@SENSELAB: X1 := select output from parallel fiber ;
@MEDIATOR: X2 := “hang off” X1 from Domain Map;
@MEDIATOR: X3 := subregion-closure(X2);
@NCMIR: X4 := select PROT-data(X3, Ryanodine Receptors);
@MEDIATOR: X5 := compute aggregate(X4);
Mediation Services:
Client Registration
[Diagram: taxonomy of clients. Labels:]
Client
• Update Client: Check Data; Merge Before Insert; Derive Before Insert
• Query Client: Query on Schema; Context Sensitive; Navigate/Query; Ad-hoc Capability
• Thin Result Viewer
• Fat Result Viewer
• Delivery options: Send Full Data; Server-side Buffer; Server-Push/Client-Pull; Client-side Buffer; Client-side Processing
Example Client:
Query Formulation and Result Display
• combination of ad hoc and navigational queries
• client side visualization (left)
• results are shown in semantic context (right)
Mediation Services: Semantic Annotation Tools
line drawing ==annotation==> (spatial) database for mediation
Mediator Architecture Blueprint
Mediation Services
Mediator Layer
Query formulation:
• user query
• integrated view definition
Source model lifting:
• domain knowledge reconciliation
• model transformation
Query processing:
• view unfolding
• semantic optimization
• capability-based rewriting
Source registration:
• domain knowledge
• model & schema
• query & computation capabilities
Components: Deductive Engine, Model Reasoner, Optimizer
Wrapper Layer
Query interface (down API):
• SDLIP, SOAP, ...
• (subsets of) SQL, X(ML)-Query, CPL,...
• DOM
• SRB-based access
Result delivery interface (up API):
• SDLIP, SOAP, ...
• pull (tuple/set-at-a-time, DOM) vs. push (stream)
• synchronous/asynchronous
• direct data/data reference
Sources: File Sources, HTML Sources, XML Sources, RDB Sources, Spatial Sources (ARC IMS), Digital Libraries (Collections) (SDLIP)
Sites: Montana Univ., Boston Univ., Yale Univ., NCMIR UCSD
Coming up: Knowledge-Based/Semantic Mediation of
Brain Data
[Figure labels: Knowledge-Based Mediation over the sources CCB, Montana SU; Surface atlas, Van Essen Lab; stereotaxic atlas LONI; MCell, CNL, Salk; NCMIR, UCSD. Results shown: PROTLOC, Result (XML/XSLT); ANATOM, Result (VML/SVG).]
Some Open Issues
• Data/Knowledge Modeling
– Extensibility: how to handle a source with new data types and
operations?
• Temporal Data: instrument readings, video microscopy
• Spatial Data: Integrating with spatial database systems
• Image database systems
– Conflict Management
• Grades of certainty
• Alternate Hypothesis
• Integrating Services
– Registration and warping of my image slice to a reference
• Integrating into Larger Applications
– M-Cell simulation
– Telemicroscopy
– Visualization
References
• Model-Based Mediation with Domain Maps, Bertram Ludäscher, Amarnath
Gupta, Maryann Martone, Intl. Conference on Data Engineering (ICDE),
Heidelberg, 2001
• Knowledge-Based Mediation of Heterogeneous Neuroscience Information
Sources, Amarnath Gupta, Bertram Ludäscher, Maryann Martone, Intl. Conference
on Scientific and Statistical Databases (SSDBM), Berlin, 2000.
• Model-Based Information Integration in a Neuroscience Mediator System,
Bertram Ludäscher, Amarnath Gupta, Maryann Martone, Intl. Conference on Very
Large Data Bases (VLDB), Cairo, 2000.