Scientific Data Integration and Mediation

Tutorial #5:
Scientific Data Integration and Mediation
Bertram Ludäscher
Ilkay Altintas
Amarnath Gupta
Kai Lin
San Diego Supercomputer Center
U.C. San Diego
Acknowledgements
• National Science Foundation (NSF)
– www.nsf.gov
• GEOsciences Network (NSF)
– www.geongrid.org
• Biomedical Informatics Research Network (NIH)
– www.nbirn.net
• Science Environment for Ecological Knowledge (NSF)
– seek.ecoinformatics.org
• Scientific Data Management Center (DOE)
– sdm.lbl.gov/sdmcenter/
Scientific Data-Mediation AHM'03
National Partnership for Advanced Computational Infrastructure
Outline
• 8:30 – 10:30am: Tutorial: Data Integration & Mediation
– Introduction to database mediation:
• motivation and architecture
• XML-based data integration
– Database mediation theory primer:
• logic view definitions, view unfolding, computing feasible plans
– From XML-based to Knowledge-based mediation:
• use of ontologies in data integration, ...
• 10:30 – 10:45am: BREAK
• 10:45 – 12:00: Applications and Demos
– 10:45 – 11:05 Mediator Demo
– 11:05 – 11:20 Queries w/ Ontology Support
– 11:20 – 11:40 Scientific Workflows
– 11:40 – 12:00 KNOW-ME Ontology Tool
Information Integration Challenges
(S4 = Semantics, Structure, Syntax, System aspects)
– reconciling S4 heterogeneities
– “gluing” together multiple data sources
– bridging information and knowledge gaps computationally
• System aspects: “Grid” Middleware
– distributed data & computing
– Web Services, WSDL/SOAP, …
– sources = functions, files, databases, …
• Syntax & Structure: XML-Based Mediators
– wrapping, restructuring
– XML queries and views
– sources = XML databases
• Semantics: Model-Based/Semantic Mediators
– conceptual models and declarative views
– Semantic Web/Knowledge Grid stuff: ontologies, description logics (RDF(S), DAML+OIL, OWL, ...)
– sources = knowledge bases (DB+CMs+ICs)
Information Integration from a DB Perspective
• Information Integration Problem
– Given: data sources S1, ..., Sk (DBMS, web sites, ...) and user
questions Q1,..., Qn that can be answered using the Si
– Find: the answers to Q1, ..., Qn
• The Database Perspective: source = “database”
– Si has a schema (relational, XML, OO, ...)
– Si can be queried
– define virtual (or materialized) integrated views V over S1, ..., Sk using database query languages (SQL, XQuery, ...)
– questions become queries Qi against V(S1, ..., Sk)
Standard (XML-Based) Mediator Architecture
[Figure: the USER/Client sends a query Q(G(S1,..., Sk)) to the MEDIATOR. The mediator holds the integrated global (XML) view G, given by an integrated view definition G(..) ← S1(..) … Sk(..), and exchanges XML queries and results with one wrapper per source. Each wrapper exports an (XML) view of its source S1, S2, ..., Sk. Wrappers are implemented as web services.]
Some BIRNing Data Integration Questions
(Biomedical Informatics Research Network, http://nbirn.net)
• Data Integration Approaches:
– Let’s just share data, e.g., link everything from a web page!
– ... or better put everything into a relational or XML database
– ... and do remote access using the Grid
– ... or just use Web services!
• Nice try. But:
– “Find the files where the amygdala was segmented.”
– “Which other structures were segmented in the same files?”
– “Did the volume of any of those structures differ much from normal?”
– “What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents?”
An Online Shopper’s Information Integration Problem
El Cheapo: “Where can I get the cheapest copy (including shipping cost) of Wittgenstein’s Tractatus Logico-Philosophicus within a week?”
[Figure: “One-World” mediation: information integration over WWW sources addall.com, amazon.com, barnes&noble.com, half.com, A1books.com, and the public library]
A Home Buyer’s Information Integration Problem
What houses for sale under $500k have at least 2 bathrooms, 2 bedrooms, a nearby school ranking in the upper third, in a neighborhood with below-average crime rate and diverse population?
[Figure: “Multiple-Worlds” mediation: information integration over Realtor, Crime Stats, School Rankings, and Demographics sources]
A Geoscientist’s Information Integration Problem
What are the distribution and U/Pb zircon ages of A-type plutons in VA?
How about their 3-D geometry?
How does it relate to host rock structures?
[Figure: “Complex Multiple-Worlds” mediation: information integration over Geologic Map (Virginia), GeoChemical, GeoPhysical (gravity contours), GeoChronologic (Concordia), and Foliation Map (structure DB) sources]
A Neuroscientist’s Information Integration Problem
(Biomedical Informatics Research Network, http://nbirn.net)
What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity?
How about other rodents?
[Figure: “Complex Multiple-Worlds” mediation: information integration over protein localization (NCMIR), sequence info (CaPROT), morphometry (SYNAPSE), and neurotransmission (SENSELAB) sources]
Structural / XML-Based Mediation
Abstract XML-Based Mediator Architecture
[Figure: the USER/Client poses a query Q o V(S_1,...,S_k) against the integrated XML view V. The MEDIATOR holds the integrated view definition IVD(S_1,...,S_k) and exchanges XML queries and results with one wrapper per source; each wrapper exports an XML view of its source S_1, S_2, ..., S_k.]
Extensible Markup Language (XML)
Plain text:
... in their wonderful book called SemWeb Tractat by B. Schatz and T.B. Lee, the authors show how ...
Marked-up text:
... in their wonderful book called <title>SemWeb Tractat</title> by <author>B. Schatz</author> and <author>T.B. Lee</author>, the authors show how ...
Data:
<book>
  <title>SemWeb Tractat</title>
  <author>B. Schatz</author>
  <author>T.B. Lee</author>
</book>
Tree view: book → title (“SemWeb Tractat”), author (“B. Schatz”), author (“T.B. Lee”)
Container view: book: [ title: “SemWeb Tractat”, author: “B. Schatz”, author: “T.B. Lee” ]
• (meta)language for marking up text & data with user-definable tags
– (X)HTML, XSLT, XML Schema, ...
– MathML, BioML, GeoML, NeuroML, ...
– XML-RPC, SOAP, ...
• semistructured tree data model
– flexible: marked-up text, web pages, databases, ...
• container model:
– “boxes within boxes”
Example: Relational Data => XML
Relation R:
A   B   C
a1  b1  c1
a2  b2  c2
a3  b3  c3
As a tree: R → tuple, tuple, tuple; each tuple → A, B, C (a1 b1 c1; a2 b2 c2; a3 b3 c3)
As XML:
<R>
  <tuple>
    <A>a1</A>
    <B>b1</B>
    <C>c1</C>
  </tuple>
  <tuple>
    <A>a2</A>
    <B>b2</B>
    <C>c2</C>
  </tuple>
  …
</R>
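The relational-to-XML encoding above can be sketched in a few lines of Python. This is a minimal illustration, not the tutorial's wrapper code: the function name `relation_to_xml` is an assumption, and the `<tuple>` nesting simply follows the slide.

```python
import xml.etree.ElementTree as ET

def relation_to_xml(name, columns, rows):
    """Encode a relation as <name><tuple><col>value</col>...</tuple>...</name>."""
    root = ET.Element(name)
    for row in rows:
        tup = ET.SubElement(root, "tuple")
        for col, value in zip(columns, row):
            ET.SubElement(tup, col).text = str(value)
    return root

# The relation R(A, B, C) from the slide.
R = relation_to_xml("R", ["A", "B", "C"],
                    [("a1", "b1", "c1"), ("a2", "b2", "c2"), ("a3", "b3", "c3")])
xml_text = ET.tostring(R, encoding="unicode")
```

Each row becomes one `tuple` element and each column one child element, so the tree has exactly the "boxes within boxes" shape of the container model.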
Tag Names & Nesting => XML DTDs (Grammars)
Grammar Rules
bibliography → paper*
paper → authors fullPaper? title booktitle
authors → author+
XML DTD
<!ELEMENT bibliography (paper*)>
<!ELEMENT paper (authors,fullPaper?,title,booktitle)>
<!ELEMENT authors (author+)>
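Because each DTD content model is a regular expression over child tag names, conformance can be checked with an ordinary regex engine. The sketch below is an illustration under that assumption; the `CONTENT_MODELS` encoding (space-terminated tag names) and the function `conforms` are hypothetical helpers, not a real DTD validator.

```python
import re
import xml.etree.ElementTree as ET

# The three grammar rules from the slide, written as regular expressions over
# space-terminated child tag names (an encoding chosen for this sketch).
CONTENT_MODELS = {
    "bibliography": r"(paper )*",
    "paper": r"authors (fullPaper )?title booktitle ",
    "authors": r"(author )+",
}

def conforms(elem):
    """Check elem and all descendants; elements without a rule are unconstrained."""
    model = CONTENT_MODELS.get(elem.tag)
    if model is not None:
        children = "".join(child.tag + " " for child in elem)
        if re.fullmatch(model, children) is None:
            return False
    return all(conforms(child) for child in elem)

good = ET.fromstring(
    "<bibliography><paper><authors><author>A</author></authors>"
    "<title>T</title><booktitle>B</booktitle></paper></bibliography>")
bad = ET.fromstring(
    "<bibliography><paper><title>T</title><authors/>"
    "<booktitle>B</booktitle></paper></bibliography>")
```

Here `good` satisfies all three rules, while `bad` violates two: `title` precedes `authors`, and `authors` has no `author` child.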
XML DTDs vs. XML Schema
• XML DTDs
– set of allowed tag names
– their nesting structure (via grammar rules)
• XML Schema
– tag names and nesting structure
– user-defined complex data types
– subtyping (no multiple inheritance): RESTRICT and EXTEND
– separate “namespace” for type names and tag (=element) names
– ...
XML Schema: User-Defined Type/Class Hierarchy
XML Schema Declarations (“home-style” syntax)
Complex Type Declarations
XML Schema (“home-style”)
Simple Type Declarations
Complex Types
XML Schema: Substitution Groups
Elements of a substitution group (hexagons) and
associated complex types (boxes)
XML Schema Declarations (W3C syntax)
XML Query Languages
• XPath:
– root//books/book[cover_style="paperback"][price<80]
• XQuery
– the W3C XML query language
• XSLT
– XML transformations (XML=>HTML, XML=>XML)
• ...
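The XPath expression above can be approximated with Python's standard library. `xml.etree.ElementTree` supports only a subset of XPath: the equality predicate on a child element works, but the numeric comparison `price<80` does not, so it is applied in Python here. The sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data matching the element names used in the XPath example.
doc = ET.fromstring("""
<root><books>
  <book><title>A</title><cover_style>paperback</cover_style><price>25</price></book>
  <book><title>B</title><cover_style>hardcover</cover_style><price>40</price></book>
  <book><title>C</title><cover_style>paperback</cover_style><price>95</price></book>
</books></root>""")

# Step 1: the part ElementTree understands natively.
paperbacks = doc.findall(".//books/book[cover_style='paperback']")
# Step 2: the numeric predicate [price<80], evaluated in Python.
cheap = [b for b in paperbacks if float(b.findtext("price")) < 80]
titles = [b.findtext("title") for b in cheap]
```

A full XPath engine would evaluate both predicates in a single expression; splitting them is a workaround for the standard library's limited dialect.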
Transforming and Rendering XML: XSLT
XMAS: XML Matching And Structuring language
Integrated View Definition: “Find books from amazon.com and DBLP, join on author, group by authors and title”
CONSTRUCT <books>
            <book>
              $a1
              $t
              <pubs>
                $p { $p }
              </pubs>
            </book> { $a1, $t }
          </books>
WHERE <books.book>
        $a1 : <author />
        $t : <title />
      </> IN "amazon.com"
  AND
      <authors.author>
        $a2 : <author />
        <pubs> $p : <pub/> </>
      </> IN "www...DBLP… "
  AND value( $a1 ) = value( $a2 )
[Figure: the XMAS query is translated into an XMAS Algebra plan]
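The effect of this integrated view definition (join on author, group by author and title, collect publications) can be sketched in plain Python. The two lists stand in for the wrapped amazon.com and DBLP sources; their contents and the function name `integrated_view` are hypothetical.

```python
from collections import defaultdict

# Hypothetical stand-ins for the XML views exported by the two wrappers.
amazon_books = [
    {"author": "B. Schatz", "title": "SemWeb Tractat"},
    {"author": "T.B. Lee", "title": "SemWeb Tractat"},
]
dblp_authors = [
    {"author": "B. Schatz", "pubs": ["pub1", "pub2"]},
    {"author": "T.B. Lee", "pubs": ["pub3"]},
    {"author": "A. Nobody", "pubs": ["pub4"]},
]

def integrated_view(books, authors):
    """Join on author value, then group publications by (author, title)."""
    groups = defaultdict(list)
    for book in books:
        for entry in authors:
            if book["author"] == entry["author"]:      # value($a1) = value($a2)
                groups[(book["author"], book["title"])].extend(entry["pubs"])
    return [{"author": a, "title": t, "pubs": pubs}
            for (a, t), pubs in groups.items()]

view = integrated_view(amazon_books, dblp_authors)
```

In a mediator the view stays virtual: this nested loop would only run, in optimized form, when a user query is composed with the view definition.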
Database Mediation Theory Primer
Mediator Query Processing
[Figure: mediator query processing pipeline. The query Q and the integrated view definition V go through the Translator (parsed plan), Composition (Q o V) (composed plan), and the Rewriter/Optimizer (optimized plan) at compile-time, followed by Plan Execution at run-time.]
Logic View Definitions (Global-as-View)
or
Querying and Reasoning with the Family ...
• Warm up: Who says this?
– “You are my son, but I’m not your father!”
• The mother!
Logic View Definitions (Global-as-View)
• Global-as-View (GAV)
– Integrated view V is defined in terms of the sources Src_1, ... , Src_k
• Given the following source databases:
– Src_1 schema = { father(Child,Father), mother(Child,Mother) }
– Src_2 schema = { spouse(Spouse, Spouse) }
– Src_3 schema = { male(Person), female(Person) }
• Can you define integrated views V for ... ?
– parent(Child,Parent)
• short: parent/2, i.e., table/relation name is ‘parent’, arity (#columns) is 2
– son/2, daughter/2
– brother/2, sister/2
– brother_in_law/2, sister_in_law/2
– aunt/2, uncle/2
– married/1, bachelor/1
Logic View Definitions (Global-as-View)
Source relations: father/2, mother/2, spouse/2, male/1, female/1
Notation: ∧ = “,” = conjunction (and); ∨ = “;” = disjunction (or); ¬ = “not” = negation
• parent(C,P) ← father(C,P) ; mother(C,P) .
• son(P,S) ← parent(S,P) , male(S) .
• brother(X,B) ← parent(X,P), son(P,B), X ≠ B .
• brother_in_law(X,B) ← sister(X, Z), spouse(Z, B) ; spouse(X, Z), brother(Z, B) .
Logic View Definitions (Global-as-View)
Source relations: father/2, mother/2, spouse/2, male/1, female/1
Notation: ∧ = “,” = conjunction (and); ∨ = “;” = disjunction (or); ¬ = “not” = negation
• uncle(X, U) ← parent(X, Z), brother(Z, U) ; parent(X, Z), brother_in_law(Z, U) .
• aunt(X, A) ← parent(X, Z), sister(Z, A) ; parent(X, Z), sister_in_law(Z, A) .
• married(X) ← spouse(X, _) .
• bachelor(X) ← [person(X)] , not married(X) .
Query Rewriting and Query Evaluation
• Query Rewriting:
– Given a user query Q in terms of virtual views V...
– Find an equivalent query Q’ in terms of the sources Src_1,...,Src_k
• Query Evaluation:
– Given a query Q’, evaluate Q’ over the source databases D := Src_1 ∪ ... ∪ Src_k
• Examples:
– Q_uncle/2 = { (X,Y) | uncle(X,Y) holds in D }
– Q_tom’s_uncle/1 = { X | uncle(tom, X) holds in D }
– Q_whose_uncle_is_tom/1 = { X | uncle(X, tom) holds in D }
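The GAV view definitions and the three example queries can be tried out directly by evaluating the rules bottom-up over a small instance. This is only a sketch: all facts are invented, `daughter` and `sister` are defined by analogy with `son` and `brother` (the slides do not spell them out), and `person` is assumed to be `male ∪ female`.

```python
# Hypothetical source facts. Argument order follows the rules:
# father(Child, Father), mother(Child, Mother).
father = {("tom", "bob"), ("ann", "bob"), ("bob", "gus"), ("jim", "gus")}
mother = {("tom", "eve"), ("ann", "eve"), ("sue", "ann")}
spouse = {("bob", "eve"), ("eve", "bob")}
male = {"tom", "bob", "gus", "jim"}
female = {"ann", "eve", "sue"}

def parent():
    return father | mother                      # disjunction = union

def son():
    return {(p, s) for (s, p) in parent() if s in male}

def daughter():                                 # assumption: analogous to son/2
    return {(p, d) for (d, p) in parent() if d in female}

def brother():
    return {(x, b) for (x, p) in parent() for (q, b) in son()
            if p == q and x != b}

def sister():                                   # assumption: analogous to brother/2
    return {(x, s) for (x, p) in parent() for (q, s) in daughter()
            if p == q and x != s}

def brother_in_law():
    via_sister = {(x, b) for (x, z) in sister() for (y, b) in spouse if z == y}
    via_spouse = {(x, b) for (x, z) in spouse for (y, b) in brother() if z == y}
    return via_sister | via_spouse

def uncle():
    candidates = brother() | brother_in_law()
    return {(x, u) for (x, z) in parent() for (y, u) in candidates if z == y}

def married():
    return {x for (x, _) in spouse}

def bachelor():
    return (male | female) - married()          # assumption: person = male ∪ female

q_uncle = uncle()
q_toms_uncle = {u for (x, u) in uncle() if x == "tom"}
q_whose_uncle_is_tom = {x for (x, u) in uncle() if u == "tom"}
```

Evaluating the views this way materializes every intermediate relation; a mediator instead rewrites the query into source-level subqueries first, which is the topic of the next slides.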
Query Rewriting (for GAV)
• Query rewriting:
– Given a user query Q in terms of virtual views V...
– Find an equivalent query Q’ in terms of the sources Src_1,...,Src_k
• Query Q, views V, source schemas S
• View unfolding:
– starting with Q, repeatedly replace view predicates by their definitions
• Creating a feasible plan:
– here: compute disjunctive normal form (DNF)
– DNF = disjunction of conjunctions (= “union of joins”)
– order goals within each conjunction according to sources’ query capabilities
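View unfolding followed by DNF computation can be sketched as a short recursive function. The encoding is an assumption made for this example: atoms are tuples, each view body is a list of conjunctions, and local variables get a fresh suffix at each unfolding step so they cannot clash with the caller's variables.

```python
import itertools

# View definitions: (head arguments, body as a list of conjunctions of atoms).
VIEWS = {
    "parent": (("C", "P"), [[("father", "C", "P")], [("mother", "C", "P")]]),
    "son": (("P", "S"), [[("parent", "S", "P"), ("male", "S")]]),
    "brother": (("X", "B"),
                [[("parent", "X", "P"), ("son", "P", "B"), ("neq", "X", "B")]]),
}
fresh = itertools.count()

def unfold(atom):
    """Return a DNF over source predicates: a list of conjunctions (lists of atoms)."""
    pred, *args = atom
    if pred not in VIEWS:
        return [[atom]]                       # source predicate: nothing to unfold
    params, body = VIEWS[pred]
    subst = dict(zip(params, args))
    suffix = f"_{next(fresh)}"                # fresh names for local variables

    def rename(var):
        return subst.get(var, var + suffix)

    dnf = []
    for conjunction in body:
        goals = [(p, *map(rename, a)) for p, *a in conjunction]
        # DNF of a conjunction = cross product of the DNFs of its goals.
        for combo in itertools.product(*(unfold(g) for g in goals)):
            dnf.append([a for part in combo for a in part])
    return dnf

plan = unfold(("brother", "X0", "X1"))
```

For `brother(X0, X1)` this yields four conjunctions, one per father/mother combination, which is the "union of joins" shape described above.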
Example
• ?- plan(brother(X0,X1)) .
brother(X0, X1)
== LQP ==>
(father(X0, X2) v mother(X0, X2))
& (father(X1, X2) v mother(X1, X2)) & male(X1) & neq(X0, X1)
brother(X0, X1)
==NNF LQP==>
(father(X0, X2) v mother(X0, X2))
& (father(X1, X2) v mother(X1, X2)) & male(X1) & neq(X0, X1)
Example (Cont’d)
• ?- plan(brother(X0,X1)) .
brother(X0, X1)
==DNF LQP==>
father(X0, X2)&father(X1, X2)&male(X1)&neq(X0, X1)
v mother(X0, X2)&father(X1, X2)&male(X1)&neq(X0, X1)
v father(X0, X2)&mother(X1, X2)&male(X1)&neq(X0, X1)
v mother(X0, X2)&mother(X1, X2)&male(X1)&neq(X0, X1)
Example (Cont’d)
• ?- plan(brother(X0,X1)) .
brother(X0, X1)
==Bp ordered LQP==>
parentDb(father(X1, X2) & father(X0, X2))
& genderDb(male(X1)) & mediator(neq(X0, X1))
v parentDb(father(X1, X2) & mother(X0, X2))
& genderDb(male(X1)) & mediator(neq(X0, X1))
v parentDb(mother(X1, X2)&father(X0,X2))
& genderDb(male(X1)) & z_mediator(neq(X0, X1))
v parentDb(mother(X1, X2)&mother(X0, X2))
& genderDb(male(X1))&z_mediator(neq(X0, X1))
Computing Feasible Plans (Goal Ordering)
• A conjunctive query Q is an expression of the form
– q( X ) ← p1( X1 ) , ..., pn( Xn )
– order of subgoals p_i is irrelevant
• An ordered plan P is an expression of the form
– q( X ) ← [p1( X1 ) , ..., pn( Xn )]
– order of subgoals p_i is important
• Problem:
– given Q, compute P which is feasible, i.e., observes the limited query capabilities of sources
– Here: binding patterns, i.e., predicates’ arguments can be
• “b” – bound
• “f” – free
• “_” – bound or free
A Simple Algorithm for Ordering Goals
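One simple way to order goals is a greedy loop: repeatedly pick any goal whose binding pattern is satisfied by the variables bound so far, then mark its variables as bound. The sketch below is an illustration and not necessarily the algorithm shown on the original slide. It assumes every argument is a variable and that only "b" positions restrict callability (an "f" argument that happens to be bound can be filtered at the mediator); `PATTERNS` is a made-up capability table.

```python
def order_goals(goals, patterns, bound=()):
    """Return a feasible ordering of goals, or None if no goal can be called.

    goals:    list of atoms (pred, var, ...)
    patterns: pred -> list of binding patterns, one 'b'/'f'/'_' per argument
    """
    bound = set(bound)
    remaining = list(goals)
    plan = []
    while remaining:
        for goal in remaining:
            pred, *args = goal
            feasible = any(
                all(mode != "b" or arg in bound for mode, arg in zip(pattern, args))
                for pattern in patterns[pred])
            if feasible:
                plan.append(goal)
                remaining.remove(goal)
                bound.update(args)        # calling the goal binds its variables
                break
        else:
            return None                   # no remaining goal is callable
    return plan

# Hypothetical capabilities: father needs one bound argument, neq needs both.
PATTERNS = {"father": ["b_", "_b"], "male": ["_"], "neq": ["bb"]}
GOALS = [("neq", "X0", "X1"), ("father", "X0", "X2"),
         ("father", "X1", "X2"), ("male", "X1")]
plan = order_goals(GOALS, PATTERNS)
```

Because the set of bound variables only grows, the greedy choice never blocks a later goal: if any feasible order exists, this loop finds one.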
Query Containment
• A query Q1 is contained in Q2, denoted Q1 ⊆ Q2
– if for all possible database instances, the set of answers to Q1 is contained in the set of answers to Q2.
• Q1 and Q2 are called equivalent
– if Q1 ⊆ Q2 and Q2 ⊆ Q1.
• Query containment is undecidable for many languages, e.g., for the relational calculus (SQL).
• For conjunctive queries, the problem is NP-complete (and thus decidable)
– Since query sizes tend to be “small” (in particular, when compared to database sizes), query containment is still of use in practice (indeed, it is one of the most fundamental tools for logic-based query optimization).
Query Containment
• Q1(Xs,Ys) is contained in Q2(Xs,Zs) iff
ALL Xs: (EXISTS Ys: Q1(Xs,Ys)) → (EXISTS Zs: Q2(Xs,Zs))
• iff we can refute its negation
• iff
NOT ALL Xs: ((EXISTS Ys: Q1(Xs,Ys)) → (EXISTS Zs: Q2(Xs,Zs))) |= []
• iff
EXISTS Xs: (EXISTS Ys: Q1(Xs,Ys)) AND NOT (EXISTS Zs: Q2(Xs,Zs)) |= []
• iff
– canonical_db(Q1) AND ¬ Q2(Xs,Zs) |= []
• create database from Q1, then run Q2 as a query...
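The last step ("create database from Q1, then run Q2 as a query") is the classical canonical-database test for conjunctive queries. A minimal Python sketch follows. It assumes variables start with an uppercase letter and constants with a lowercase one, and that queries are given as `(head_args, body)` pairs; none of these conventions come from the slides.

```python
def is_var(term):
    return term[:1].isupper()

def match(term, value, subst):
    """Bind a variable to value (or check an existing binding); constants must be equal."""
    if is_var(term):
        return subst.setdefault(term, value) == value
    return term == value

def answers(body, facts, subst):
    """Yield every substitution that maps all body atoms onto facts."""
    if not body:
        yield subst
        return
    (pred, *args), rest = body[0], body[1:]
    for fact in facts:
        if fact[0] != pred or len(fact) != len(args) + 1:
            continue
        s = dict(subst)
        if all(match(a, v, s) for a, v in zip(args, fact[1:])):
            yield from answers(rest, facts, s)

def freeze(term):
    """Turn a variable of Q1 into a fresh constant of the canonical database."""
    return "c_" + term if is_var(term) else term

def contained_in(q1, q2):
    """True iff conjunctive query q1 is contained in q2."""
    head1, body1 = q1
    head2, body2 = q2
    canonical_db = [(p, *map(freeze, a)) for p, *a in body1]
    target = tuple(map(freeze, head1))
    return any(tuple(s.get(t, t) for t in head2) == target
               for s in answers(body2, canonical_db, {}))

# Q1: X has a paternal grandfather; Q2: X has a father.
Q1 = (("X",), [("father", "X", "Y"), ("father", "Y", "Z")])
Q2 = (("X",), [("father", "X", "Y")])
```

With these definitions, `contained_in(Q1, Q2)` holds while `contained_in(Q2, Q1)` does not. The search in `answers` is exponential in the size of Q2 in the worst case, which is where the NP-completeness shows up in practice.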
Query Containment Algorithm (in Prolog)
• Applications:
– query minimization (a conjunctive query is minimal if no conjunct can be dropped)
– semantic query optimization
• Q ⊆ denial
• here: denial is an integrity constraint and states what must not hold
• example: denial = false ← mother(X,M), father(Y,M)
Example
• 50% of the clauses of the executable plan are irrelevant ...
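The 50% figure can be reproduced on the DNF plan for brother(X0, X1): a disjunct is unsatisfiable whenever the body of the denial false ← mother(X,M), father(Y,M) can be mapped into it. The sketch below assumes that all arguments of the denial are variables; `embeds` is a hypothetical helper name.

```python
def embeds(pattern, atoms, subst=None):
    """True if every atom of pattern maps onto some atom in atoms under one
    consistent variable binding (all pattern arguments are variables)."""
    subst = subst or {}
    if not pattern:
        return True
    (pred, *args), rest = pattern[0], pattern[1:]
    for atom in atoms:
        if atom[0] != pred or len(atom) != len(args) + 1:
            continue
        s = dict(subst)
        if all(s.setdefault(a, v) == v for a, v in zip(args, atom[1:])):
            if embeds(rest, atoms, s):
                return True
    return False

# Integrity constraint: nobody is both a mother and a father.
DENIAL = [("mother", "X", "M"), ("father", "Y", "M")]

# The four disjuncts of the DNF plan for brother(X0, X1).
PLAN = [
    [("father", "X0", "X2"), ("father", "X1", "X2"), ("male", "X1"), ("neq", "X0", "X1")],
    [("mother", "X0", "X2"), ("father", "X1", "X2"), ("male", "X1"), ("neq", "X0", "X1")],
    [("father", "X0", "X2"), ("mother", "X1", "X2"), ("male", "X1"), ("neq", "X0", "X1")],
    [("mother", "X0", "X2"), ("mother", "X1", "X2"), ("male", "X1"), ("neq", "X0", "X1")],
]

pruned_plan = [conj for conj in PLAN if not embeds(DENIAL, conj)]
```

The two mixed disjuncts require X2 to be both a mother and a father, so the denial embeds into them and they are dropped before any source is contacted.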
Mediator Demo
• Computer Science Challenges:
– Given a query Q over virtual integrated database V, how to come up with Q’ over the source schemas? (cf. Garlic, DiscoveryLink, ...)
• query rewriting of Q(V) into Q’(SRCs) using unfolding and normalization
• computation of feasible orders (NP-complete!?) while minimizing number of “chunks” sent to sources
• semantic query optimization (reasoning over plans!); e.g. conjunctive query containment is NP-complete [Chandra-Merlin-77]
• A Quick Demo of the current prototype:
– Find 3D reconstructions of cells found in ‘cerebellar cortex’:
• ?- ccdbData('cerebellar cortex').
• Join everything reachable along ‘cerebellar-cortex’.(has-a)* in UMLS
• ... with concept markup in CCDB
• ... retrieve (links to) results
• ... also show on SmartAtlas tool
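The path query in the demo, "everything reachable along (has-a)*", is a transitive-closure traversal followed by a join with the concept markup. A small sketch under heavy assumptions: the `HAS_A` edges and `MARKUP` entries below are invented placeholders, not actual UMLS or CCDB content, and `ccdb_data` only mimics the shape of the demo query.

```python
from collections import deque

# Hypothetical has-a edges of a domain map (NOT real UMLS data).
HAS_A = {
    "cerebellar cortex": ["purkinje cell layer", "granule cell layer"],
    "purkinje cell layer": ["purkinje cell"],
    "purkinje cell": ["dendrite"],
}
# Hypothetical concept markup: concept -> registered reconstructions.
MARKUP = {
    "purkinje cell": ["recon_17"],
    "dendrite": ["recon_23"],
    "hippocampus": ["recon_40"],
}

def reachable(start, edges):
    """All concepts reachable from start via zero or more has-a edges."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def ccdb_data(concept):
    """Join the has-a closure of concept with the concept markup."""
    return sorted(r for c in reachable(concept, HAS_A) for r in MARKUP.get(c, []))

results = ccdb_data("cerebellar cortex")
```

Data registered under unrelated concepts (here `hippocampus`) is not returned, because the join only follows has-a edges starting from the query concept.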
Mediator Demo
From XML-Based to Logic and Model-Based (“Semantic”) Mediation
What’s the Problem with XML & Complex Multiple-Worlds?
• XML is Syntax
– DTDs talk about element nesting
– XML Schema schemas give you data types
– need anything else? => write comments!
• Domain Semantics is complex:
– implicit assumptions, hidden semantics
=> sources seem unrelated to the non-expert
• Need Structure and Semantics beyond XML trees!
=> employ richer OO models
=> make domain semantics and “glue knowledge” explicit
=> use ontologies to fix terminology and conceptualization
=> avoid ambiguities by using formal semantics
From XML-Based to Model-Based Mediation
• Data and Knowledge Sharing Potential:
Database Mediation
+ Knowledge Representation
________________________
= Model-Based Mediation
• Basic Ideas:
– turn primary data sources into knowledge sources
– employ secondary glue knowledge sources
• generic: UMLS, ...
• specific: community/laboratory ontologies
Information Integration Landscape
[Figure: information integration landscape. Vertical axis: conceptual complexity/depth (low to high). Horizontal axis: conceptual distance (one-world to multiple-worlds). Labels placed in the plane: DB mediation techniques; addall, book-buyer, home-buyer, 24x7 consumer; Bioinformatics and Geoinformatics (BLAST, MIA, Entrez, Tambis, RiboWeb); Ontologies and KR formalisms (GO, EcoCyc, UMLS, Cyc, WordNet); Model-Based Mediation.]
Knowledge Representation:
Relating Theory to the World via Formal Models
John F. Sowa, Knowledge Representation: Logical, Philosophical, and Computational Foundations
“All models are wrong, but some are useful!”
XML-Based vs. Model-Based Mediation
[Figure: side-by-side comparison of the two approaches.
XML-Based Mediation: raw data → XML elements → XML models; structural constraints only (DTDs: parent, child, sibling, ..., e.g., A = (B*|C),D); no domain constraints; Integrated-DTD := XML-QL(Src1-DTD,...).
Model-Based Mediation: raw data → (XML) objects → conceptual models (classes, relations, is-a, has-a, ...); logical domain constraints (IF ... THEN ... rules); glue maps (DMs, PMs); Integrated-CM := CM-QL(Src1-CM,...).
CM ~ {Descr. Logic, ER, UML, RDF/XML(-Schema), …}; CM-QL ~ {F-Logic, DAML+OIL, …}]
What’s the Glue? What’s in a Link?
[Figure: a link between X and Y]
• Syntactic Joins (equality)
– link(X,Y) := X.SSN = Y.SSN
– link(X,Y) := X.UMLS-ID = Y.UID
• “Speciality” Joins (similarity)
– link(X,Y,Score) := BLAST(X,Y,Score)
• Semantic/Rule-Based Joins
– link(X,Y,C) := X isa C, Y isa C, BLAST(X,Y,S), S>0.8 (homology, lub)
– link(X,Y,[produces,B,increased_in]) := X produces B, B increased_in Y. (rule-based)
e.g., X = β-secretase, B = beta amyloid, Y = Alzheimer’s disease
• Challenge:
– compile semantic joins into efficient syntactic ones
Model-Based Mediation Methodology ...
• Lift Sources to export CMs: CM(S) = OM(S) + KB(S) + CON(S)
• Object Model OM(S):
– complex objects (frames), class hierarchy, OO constraints
• Knowledge Base KB(S):
– explicit representation of (“hidden”) source semantics
– logic rules over OM(S)
• Contextualization CON(S):
– situate OM(S) data using “glue maps” (GMs):
• domain maps DMs (ontology) = terminological knowledge: concepts + roles
• process maps PMs = “procedural knowledge”: states + transitions
... Model-Based Mediation Methodology
• Integrated View Definition (IVD)
– declarative (logic) rules with object-oriented features
– defined over CM(S), domain maps, process maps
– needs “mediation engineers” = domain + KRDB experts
• Knowledge-Based Querying and Browsing (runtime):
– mediator composes the user query Q with the IVD
... rewrites (Q o IVD), sends subqueries to sources
... post-processes returned results (e.g., situate in context)
Model-Based Mediator Architecture
[Figure: the USER/Client queries the CM (Integrated View) at the mediator. The Mediator Engine evaluates the Integrated View Definition IVD using rule processing (LP rule proc., FL rule proc., XSB Engine) and graph processing (Graph proc.), together with “Glue” Maps GMs: Domain Maps (DMs) and Process Maps (PMs), which provide the semantic context CON(S). CM queries and results (exchanged in XML, via a generic conceptual model GCM) flow between the mediator and the sources’ conceptual models CM S1, CM S2, CM S3, each exported by a CM-Wrapper on top of an XML-Wrapper over S1, S2, S3. CM(S) = OM(S)+KB(S)+CON(S).]
First results & Demos: KIND prototype, formal DM semantics, PMs [SSDBM00] [VLDB00] [ICDE01] [NIH-HB01] (w/ Gupta, Martone)
Domain Maps (Ontologies) as Glue Knowledge Sources
• Domain Map = Ontology
– representation of terminological knowledge
• Use in Model-Based Mediation
– (derived) concepts as “drop points”, “anchor points”, “context”
for source classes
– compile-time use: view definition, subsumption,
classification,...
– runtime use: querying/deduction, path queries, ....
• Formalisms:
– Semantic nets, Thesauri, Frame-logic, Description logics, ...
Ontologies
• So what is an Ontology?
– definition of things that are relevant to your application
– representation of terminological knowledge (“TBox”)
– explicit specification of a conceptualization
– concept hierarchy (“is-a”)
– further semantic relationships between concepts
– abstractions of relational schemas, (E)ER, UML classes, XML Schemas
• Examples:
– NCMIR ANATOM
– GO (Gene Ontology)
– UMLS (Unified Medical Language System)
– CYC
Formalism for Ontologies: Description Logic
• DL definition of “Happy Father”
(Example from Ian Horrocks, U Manchester, UK)
Description Logic Statements as Rules
• In first-order logic (rule form):
happyFather(X) ←
man(X), child(X,C1), child(X,C2), blue(C1), green(C2),
not ( child(X,C3), poorunhappyChild(C3) ).
poorunhappyChild(C) ←
not rich(C), not happy(C).
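Read as database rules with negation, the two definitions can be evaluated over a concrete instance; this is the "querying" use, as opposed to reasoning about all instances. A small sketch with invented facts follows; the relation names mirror the rules, everything else is hypothetical.

```python
# Hypothetical database instance.
man = {"al", "bo"}
child = {("al", "c1"), ("al", "c2"), ("bo", "c3"), ("bo", "c4"), ("bo", "c5")}
blue = {"c1", "c3"}
green = {"c2", "c4"}
rich = {"c1"}
happy = {"c2", "c3", "c4"}

def poor_unhappy_child(c):
    # poorunhappyChild(C) <- not rich(C), not happy(C).
    return c not in rich and c not in happy

def happy_father(x):
    # happyFather(X) <- man(X), child(X,C1), child(X,C2), blue(C1), green(C2),
    #                   not (child(X,C3), poorunhappyChild(C3)).
    kids = {c for (p, c) in child if p == x}
    return (x in man
            and any(c in blue for c in kids)
            and any(c in green for c in kids)
            and not any(poor_unhappy_child(c) for c in kids))

happy_fathers = {x for x in man if happy_father(x)}
```

In this instance `bo` has a blue and a green child but also `c5`, who is neither rich nor happy, so only `al` qualifies.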
Description Logics
• Terminological Knowledge (TBox)
– Concept Definition (naming of concepts):
– Axiom (constraining of concepts):
=> a mediator’s “glue knowledge source”
• Assertional Knowledge (ABox)
– the marked neuron in image 27
=> the concrete instances/individuals of the concepts/classes that your sources export
Querying vs. Reasoning
• Querying:
– given a DB instance I (= logic interpretation), evaluate a query expression (e.g. SQL, FO formula, Prolog program, ...)
– boolean query: check if I |= φ (i.e., if I is a model of φ)
– (ternary) query: { (X, Y, Z) | I |= φ(X,Y,Z) }
=> check happyFathers in a given database
• Reasoning:
– check if I |= φ implies I |= ψ for all databases I,
– i.e., if φ => ψ
– undecidable for FO, F-logic, etc.
– Description Logics are decidable fragments
=> concept subsumption, concept hierarchy, classification
=> semantic tableaux, resolution, specialized algorithms
What’s in an Answer?
(What’s in a Link? revisited)
[Figure: a link between X and Y]
• Semantic/Rule-Based Joins
– link(X,Y,[produces,B,increased_in]) := X produces B, B increased_in Y. (rule-based)
e.g., X = β-secretase, B = beta amyloid, Y = Alzheimer’s disease
• What is the Erdős number of person P?
– 3
• Really? Why?
– authority based: <VIP> said so
– faith based: don’t know but firmly believe
– query statement Q = ... derived it from DB I
– query Q = ... derived it from DB I and KB T using derivation D
=> logic-based systems often “come with explanations” (“computations as proofs”)
Formalizing Glue Knowledge:
Domain Map for SYNAPSE and NCMIR
Domain Map (DM) = labeled graph with concepts ("classes") and roles ("associations")
• additional semantics: expressed as logic rules (F-logic)
Domain Expert Knowledge:
Purkinje cells and Pyramidal cells have dendrites that have higher-order branches that contain spines. Dendritic spines are ion (calcium) regulating components. Spines have ion binding proteins. Neurotransmission involves ionic activity (release). Ion-binding proteins control ion activity (propagation) in a cell. Ion-regulating components of cells affect ionic activity (release).
[Figure: the domain expert knowledge rendered as a Domain Map (DM) graph and as a DM in Description Logic]
Source Contextualization & DM Refinement
In addition to registering (“hanging off”) data relative to existing concepts, a source may also refine the mediator’s domain map...
=> sources can register new concepts at the mediator ...
Example:
ANATOM Domain Map
Browsing Registered Data with Domain Maps
Process Maps with Abstractions and Elaborations:
From Terminological to Procedural Glue
• nodes ~ states
• edges ~ processes, transitions
• blue/red edges:
• processes in Src1/Src2
• general form of edges:
related formalisms
Summary: Mediation Scenarios & Techniques
Federated Databases | XML-Based Mediation | Model-Based Mediation
One-World | One-/Multiple-Worlds | Complex Multiple-Worlds
Common Schema | Mediated Schema | Common Glue Maps
SQL, rules | XML query languages | DOOD query languages
Schema Transformations | Syntax-Aware Mappings | Semantics-Aware Mappings
Syntactic Joins | Syntactic Joins | “Semantic” Joins via Glue Maps
DB expert | DB expert | KRDB + domain expert
Semantic (Community) Webs
“Within the next decade, computing
technology will transform the Internet into
the Interspace, an information
infrastructure that supports semantic
indexing and concept navigation across
widely distributed community
repositories.”
Bruce Schatz, IEEE Computer, Jan. 2002
"The Semantic Web is an extension of the
current web in which information is given
well-defined meaning, better enabling
computers and people to work in
cooperation."
Tim Berners-Lee et al., 2001
Combine Everything:
Die eierlegende Wollmilchsau (German idiom: the “egg-laying wool-milk-sow”, i.e., one system that does everything):
• Database Federation/Mediation
– query rewriting under GAV/LAV
– w/ binding pattern constraints
– distributed query processing
• Semantic Mediation
– semantic integrity constraints, reasoning w/ plans, automated
deduction
– deductive database/logic programming technology, AI “stuff”...
– Semantic Web technology
• Scientific Workflow Management
– more procedural than database mediation (often the scientist is
the query planner)
– deployment using web services
BREAK
... followed by demos ...
GEON SMART Metadata: Multihierarchical Rock Classification for “Thematic Queries” (GSC)
[Figure: concept hierarchies for Genesis, Fabric, Composition, and Texture]
• “smart discovery & querying” via multiple, independent concept hierarchies (controlled vocabularies)
• data at different description levels can be found and processed
GEON SMART Metadata: Multihierarchical Rock Classification for “Thematic Queries”
http://klin-pc.sdsc.edu:8080/examples/jsp/geon/composition.jsp
GEON Ontology Demo
• http://klin-pc.sdsc.edu:8080/examples/jsp/geon/old-rock.jsp
• http://klin-pc.sdsc.edu:8080/examples/jsp/geon/rock.jsp
Architecture of Ontology Based Map Integration
[Figure: a Global Web Map Server uses an Ontology Mapping to integrate three Web Map Servers, each backed by its own Database]
DOE Scientific Data Management Center
• Scientific Workflow Demo
Example: A Scientific Workflow
[Figure: example workflow with steps labeled A, B, C: microarray analysis yields a cDNA cluster; a database search for promoter identification yields promoter sequences; common promoter alignment yields a promoter model; a further database search with the model yields new candidate target genes (marked *).]
Adapted from Thomas Werner, Biomolecular Engineering, 17: 87-94 (2001)
Conceptual Workflow
[Figure: conceptual workflow with the following steps]
• Compute clusters (min. distance)
• Select gene-set (cluster-level)
• For each gene:
– Retrieve matching cDNA
– Retrieve genomic sequence
– Extract promoter region (begin, end)
• For each promoter:
– Retrieve transcription factors
– Compute subsequence labels
– Arrange transcription factors
• With all promoter models:
– Align promoters
– Create consensus sequence
– Compute joint promoter model
Mapping This Workflow To Web Sites
Customized CGI Application
ClustalW
Transfac
Output
Query Results
SDM-SciDAC System Architecture
[Diagram, components as labeled on the slide:]
• User
• WF-Pilot: design, execution monitoring
• WF-Engine: scheduling and execution
• WF-Compiler: AWF → EWF translation
– query rewriting
– web service matching
– semantic type checking
– data type conversion
• Web service invocation of executable tasks (ET), e.g. Genbank, BLAST
• Abstract Task (AT) Repository: AAV rules
• Executable Task (ET) Repository: ET schemas
• Data & Parameter Ontologies
• Datatype & Conversion Repository: conversion rules
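Two of the compiler steps named on this slide, type checking between consecutive tasks and data type conversion, can be illustrated with a small sketch. It assumes each task declares one input type and one output type and that a conversion repository maps type pairs to converter names. The type names, the `genbank_to_fasta` converter, and the function `compile_chain` are hypothetical and are not taken from the SDM system.

```python
# Hypothetical conversion repository: (from_type, to_type) -> converter name.
CONVERSION_RULES = {
    ("GenBankRecord", "FASTA"): "genbank_to_fasta",
}


def compile_chain(tasks, rules=CONVERSION_RULES):
    """Check adjacent task types and insert converters where a rule exists.

    `tasks` is a list of (name, input_type, output_type) tuples.
    Returns the list of step names to execute, converters included.
    """
    plan = []
    previous_output = None
    for name, input_type, output_type in tasks:
        if previous_output is not None and previous_output != input_type:
            converter = rules.get((previous_output, input_type))
            if converter is None:
                raise TypeError(
                    "no conversion from %s to %s before %s"
                    % (previous_output, input_type, name)
                )
            plan.append(converter)
        plan.append(name)
        previous_output = output_type
    return plan
```

A mismatch with a known conversion rule is repaired silently by inserting the converter, while a mismatch without a rule is rejected at compile time rather than discovered during execution.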
AWF to EWF
• For each gene:
– Retrieve matching cDNA
– Retrieve genomic sequence
– Extract promoter region (begin, end)
Declarative specification:
GetGenomicSequence(+{selectedGene}, -{{GenomicSequence}}) :-
    GENBANK(+{selectedGene}, -{cDNASequence}),
    BLAST(+{cDNASequence}, +dbName, +format, {rankedGenomicSequenceList}).
GetGenomicSequence(+{selectedGene}, -{{GenomicSequence}}) :-
    GENBANK(+{selectedGene}, -{cDNASequence}),
    BLAT(+{cDNASequence}, +QueryType, +SortCriteria, +OutputType, {rankedGenomicSequenceList}).
IdentifyPromoterElements(+{rankedGenomicSequenceList}, -{element}) :-
    PromoterSequences(+{rankedGenomicSequenceList},
        getBeginEnd(+Species, -Begin, -End), -{element}).
Slide callouts:
• User supplied
• Need extra domain knowledge
• Translation to EWF needs creation of iterators
• Same functionality, different operational constraints and availability
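Two of the callouts can be made concrete with a short sketch: an abstract task may have several executable definitions (here the BLAST and BLAT variants of GetGenomicSequence), and running a set-valued abstract task over single-item services requires an explicit iterator. The sketch below assumes a simple availability check and toy service functions. `select_plan`, `run_for_each`, and the service stubs are hypothetical and do not call any real GenBank, BLAST, or BLAT interface.

```python
# Alternative executable chains for one abstract task, mirroring the two rules.
ALTERNATIVES = {
    "GetGenomicSequence": [
        ["GENBANK", "BLAST"],
        ["GENBANK", "BLAT"],
    ],
}


def select_plan(abstract_task, available_services):
    """Return the first alternative whose services are all available."""
    for chain in ALTERNATIVES[abstract_task]:
        if all(service in available_services for service in chain):
            return chain
    raise LookupError("no executable plan for %s" % abstract_task)


def run_for_each(genes, chain, services):
    """Iterator wrapper: apply the single-gene service chain to every gene."""
    results = []
    for gene in genes:
        value = gene
        for service_name in chain:
            value = services[service_name](value)
        results.append(value)
    return results


# Toy service stubs (assumptions for illustration only).
TOY_SERVICES = {
    "GENBANK": lambda gene: gene + "_cdna",
    "BLAST": lambda cdna: [cdna + "_blast_hit"],
    "BLAT": lambda cdna: [cdna + "_blat_hit"],
}
```

If BLAST is unavailable, the same abstract step still compiles to a runnable plan through BLAT, and the loop in `run_for_each` is the iterator that the abstract, set-oriented rule leaves implicit.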
Abstract Task (AT) Registration
Abstract Task (AT)
View and Delete
Abstract Task (AT)
Update
AWF Design
EWF Planning and Compilation
EWF Execution
BIRN Tools Demo
Some References (starting points)
• XML
– General: http://xml.coverpages.org/xml.html
– XQuery: http://www.w3.org/XML/Query
– XSLT: http://xml.coverpages.org/xsl.html
• Query Rewriting:
– database research literature
• Logic Programming
– Learn Prolog Now! http://www.coli.uni-sb.de/~kris/learn-prolog-now/
– SWI-Prolog (nice free Prolog system): http://www.swi-prolog.org/
• Ontologies
– Web Ontology Language (OWL): http://www.w3.org/TR/owl-features/
– http://www-ksl.stanford.edu/kst/what-is-an-ontology.html
– http://www.cs.utexas.edu/users/mfkb/related.html
• Model-Based Mediation:
– http://www.sdsc.edu/~ludaesch/Paper/icde01.html
• Semantic Web:
– http://www.w3.org/2001/sw/
References: Project Web Sites
• GEOsciences Network (NSF)
– www.geongrid.org
• Biomedical Informatics Research Network (NIH)
– www.nbirn.net
• Science Environment for Ecological Knowledge (NSF)
– seek.ecoinformatics.org
• Scientific Data Management Center (DOE)
– sdm.lbl.gov/sdmcenter/