Transcript Slides
iTrails: Pay-as-you-go Information Integration
in Dataspaces
Marcos Vaz Salles Jens Dittrich
Shant Karakashian
Olivier Girard
Lukas Blunschi
ETH Zurich
VLDB 2007
September 26, 2007
Outline
Motivation
iTrails
Experiments
Conclusions and Future Work
September 26, 2007
Marcos Vaz Salles / ETH Zurich / [email protected]
Problem: Querying Several Sources
Query: What is the impact of global warming in Zurich?
Systems: ?
Data Sources: Laptop, Email Server, Web Server, DB Server
Solution 1: Use a Search Engine
Query: global warming zurich
System: Graph IR Search Engine, e.g., TopX [VLDB05], FleXPath [SIGMOD04], XSearch [VLDB03], XRank [SIGMOD03]
Drawback: Query semantics are not precise!
Taken from each data source: text, links
Data Sources: Laptop, Email Server, Web Server, DB Server
Solution 2: Use an Information
Integration System
Query: //Temperatures/*[city = “zurich”]
System: Information Integration System, e.g., GAV (e.g. [ICDE95]), LAV (e.g. [VLDB96]), GLAV [AAAI99], P2P (e.g. [SIGMOD04])
Integrated schema: Temps, Cities, CO2, Sunspots, ...
Drawback: Too much effort to provide schema mappings!
Schema mappings to the data sources: two provided, two missing
Data Sources: Laptop, Email Server, Web Server, DB Server
Research Challenge:
Is There an Integration Solution in-between
These Two Extremes?
Graph IR Search Engine
Query: global warming zurich
Uses text, links from each data source
Pay-as-you-go Information Integration: Dataspace System
Query: global warming zurich
Between text, links and full-blown schema mappings: ?
Information Integration System
Query: //Temperatures/*[city = “zurich”]
Integrated schema: Temps, Cities, CO2, Sunspots, ...
Requires full-blown schema mappings
Data Sources: Laptop, Email Server, Web Server, DB Server
Dataspace Vision by Franklin, Halevy, and Maier [SIGMOD Record 05]
Outline
Motivation
iTrails
Experiments
Conclusions and Future Work
iTrails Core Idea: Add Integration Hints
Incrementally
Step 1: Provide a search service over all the data
Use a general graph data model (see VLDB 2006)
Works for unstructured documents, XML, and relations
Step 2: Add integration semantics via hints (trails) on top
of the graph
Works across data sources, not only between sources
Step 3: If more semantics needed, go back to step 2
Impact:
Smooth transition between search and data integration
Semantics added incrementally improve precision / recall
iTrails: Defining Trails
Basic Form of a Trail:
QL [.CL] → QR [.CR]
QL, QR: queries (NEXI-like keyword and path expressions)
CL, CR: attribute projections (optional)
Intuition:
When I query for QL [.CL], you should also query for QR [.CR]
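As a rough illustration of the basic form above, a trail can be modeled as a pair of queries with optional attribute projections. The sketch below is only an assumption made for this transcript: the class name `Trail` and its field names are hypothetical and are not taken from iMeMex.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Trail:
    """Hypothetical model of a trail QL[.CL] -> QR[.CR]."""
    left_query: str                         # QL: keyword or path expression
    right_query: str                        # QR: keyword or path expression
    left_component: Optional[str] = None    # CL: optional attribute projection
    right_component: Optional[str] = None   # CR: optional attribute projection

    def __str__(self) -> str:
        # Append the projection only when one was given.
        left = self.left_query + (f".{self.left_component}" if self.left_component else "")
        right = self.right_query + (f".{self.right_component}" if self.right_component else "")
        return f"{left} -> {right}"


# Two trails in the style of the examples in this talk.
thesaurus = Trail("car", "auto")
schema_match = Trail("//Employee//*", "//Person//*", "tuple.empName", "tuple.name")
print(thesaurus)
print(schema_match)
```

Keeping the projections optional mirrors the bracket notation [.CL] and [.CR]: keyword trails simply leave them out.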
Trail Examples: Global Warming Zurich
Query: global warming zurich
Temperatures (DB Server):
date    city    region  celsius
24-Sep  Bern    BE      20
24-Sep  Uster   ZH      15
25-Sep  Zurich  ZH      14
26-Sep  Zurich  ZH      9
Trail for Implicit Meaning: “When I query for global warming, you should also query for Temperature data above 10 degrees”
global warming → //Temperatures/*[celsius > 10]
Trail for an Entity: “When I query for zurich, you should also query for references of zurich as a region”
zurich → //*[region = “ZH”]
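To make the effect of the two trails concrete, the sketch below evaluates their right-hand sides against the Temperatures rows shown on this slide. It is a simplified illustration in plain Python, not the iMeMex query processor: the predicates stand in for the path expressions, and combining the two keywords conjunctively is an assumption.

```python
# Rows of the Temperatures table from the slide.
temperatures = [
    {"date": "24-Sep", "city": "Bern", "region": "BE", "celsius": 20},
    {"date": "24-Sep", "city": "Uster", "region": "ZH", "celsius": 15},
    {"date": "25-Sep", "city": "Zurich", "region": "ZH", "celsius": 14},
    {"date": "26-Sep", "city": "Zurich", "region": "ZH", "celsius": 9},
]


def above_10_degrees(row):
    """Stand-in for //Temperatures/*[celsius > 10]."""
    return row["celsius"] > 10


def in_region_zh(row):
    """Stand-in for //*[region = "ZH"]."""
    return row["region"] == "ZH"


# Rows contributed by each trail on its own.
from_global_warming = [r for r in temperatures if above_10_degrees(r)]
from_zurich = [r for r in temperatures if in_region_zh(r)]

# Assumption: the keywords of "global warming zurich" are combined conjunctively,
# so a row must be reachable through both trails.
both = [r for r in temperatures if above_10_degrees(r) and in_region_zh(r)]
print([(r["date"], r["city"]) for r in both])
```

Under these assumptions the Uster row is returned even though it never mentions the word zurich, which is the kind of result a plain keyword search would miss.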
Trail Example: Deep Web Bookmarks
Web
Server
train home
Trail for a Bookmark: “When I
query for train home, you should
also query for the TrainCompany’s
website with origin at ETH Uni
and destination at Seilbahn
Rigiblick”
train home →
//trainCompany.com//*[origin=“ETH Uni”
and dest =“Seilbahn Rigiblick”]
Trail Examples: Thesauri, Dictionaries,
Language-agnostic Search
Data sources shown: Email Server, Laptop
Trail for Thesauri: “When I query for car, you should also query for auto”
car → auto
Trails for Dictionary: “When I query for car, you should also query for carro and vice-versa”
car → carro
carro → car
Trail Examples: Schema Equivalences
DB Server tables:
Employee (empId, empName, salary)
Person (SSN, name, age, income)
Trail for schema match on names: “When I query for Employee.empName, you should also query for Person.name”
//Employee//*.tuple.empName → //Person//*.tuple.name
Trail for schema match on salaries: “When I query for Employee.salary, you should also query for Person.income”
//Employee//*.tuple.salary → //Person//*.tuple.income
Outline
Motivation
iTrails
Core Idea
Trail Examples
How are Trails Created?
Uncertainty and Trails
Rewriting Queries with Trails
Recursive Matches
Experiments
Conclusion and Future Work
How are Trails Created?
Given by the user
Explicitly
Via Relevance Feedback
(Semi-)Automatically
Information extraction techniques
Automatic schema matching
Ontologies and thesauri (e.g., WordNet)
User communities (e.g., trails on gene data, bookmarks)
Uncertainty and Trails
Probabilistic Trails:
model uncertain trails
probabilities used to rank trails
QL [.CL] → QR [.CR] with probability p, 0 ≤ p ≤ 1
Example: car → auto, p = 0.8
Certainty and Trails
Scored Trails:
give higher value to certain trails
scoring factors used to boost scores of query results obtained
by the trail
QL [.CL] → QR [.CR] with scoring factor sf, sf > 1
Examples:
- T1: weather → //Temperatures/*, p = 0.9, sf = 2
- T2: yesterday → //*[date = today() – 1], p = 1, sf = 3
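The two annotations can be pictured with a small sketch: probabilities order the trails, and scoring factors boost the scores of results obtained through a trail. The slides do not give the exact scoring formula, so the multiplicative boost below is an assumption, and the names `ScoredTrail`, `rank_trails`, and `boost` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ScoredTrail:
    name: str
    left: str
    right: str
    p: float = 1.0    # probability, 0 <= p <= 1, used to rank trails
    sf: float = 1.0   # scoring factor, used to boost result scores


def rank_trails(trails: List[ScoredTrail]) -> List[ScoredTrail]:
    """Order trails from most to least probable."""
    return sorted(trails, key=lambda t: t.p, reverse=True)


def boost(result_score: float, trail: ScoredTrail) -> float:
    """Assumption: a result obtained via a trail has its score multiplied by sf."""
    return result_score * trail.sf


# T1 and T2 as given on the slide.
t1 = ScoredTrail("T1", "weather", "//Temperatures/*", p=0.9, sf=2)
t2 = ScoredTrail("T2", "yesterday", "//*[date = today() - 1]", p=1.0, sf=3)

print([t.name for t in rank_trails([t1, t2])])
print(boost(0.5, t1))
```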
Rewriting Queries with Trails
Query: weather yesterday
Trail T2: yesterday → //*[date = today() – 1]
(1) Matching: T2 matches the yesterday subtree of the query
(2) Transformation: yesterday U //*[date = today() – 1]
(3) Merging: the transformed subtree takes the place of yesterday in the query, next to weather
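The three steps can be sketched on a tiny query tree. This is a simplified illustration, not the iMeMex rewriter: queries are nested tuples, trails are plain (left, right) pairs matched by string equality, and treating the top-level keyword query as an intersection is an assumption. The name `rewrite_once` is hypothetical.

```python
from typing import List, Tuple, Union

Query = Union[str, tuple]   # a leaf (keyword or path) or (operator, child, ...)
Trail = Tuple[str, str]     # (left side, right side)


def rewrite_once(query: Query, trails: List[Trail]) -> Query:
    """Apply every matching trail once, using union semantics."""
    if isinstance(query, str):
        # (1) Matching: find trails whose left side equals this leaf.
        rights = [right for left, right in trails if left == query]
        if not rights:
            return query
        # (2) Transformation: the leaf becomes a union of itself and the right sides.
        return ("union", query, *rights)
    operator, *children = query
    # (3) Merging: rebuild the operator node around the rewritten children.
    return (operator, *(rewrite_once(child, trails) for child in children))


t2 = ("yesterday", "//*[date = today() - 1]")
original = ("intersect", "weather", "yesterday")   # assumed shape of "weather yesterday"
rewritten = rewrite_once(original, [t2])
print(rewritten)
```

Because the function rewrites each leaf exactly once, it sidesteps the question of what happens when the rewritten query matches a trail again.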
Replacing Trails
Trails that use replace instead of union
semantics
Query: weather yesterday
Trail T2: yesterday, replaced by //*[date = today() – 1]
(1) Matching: T2 matches the yesterday subtree of the query
(2) Transformation: yesterday is replaced by //*[date = today() – 1]
(3) Merging: the query now combines weather with //*[date = today() – 1]
Problem: Recursive Matches (1/2)
T2: yesterday → //*[date = today() – 1]
T2 matches yesterday in the query weather yesterday, giving:
weather, yesterday U //*[date = today() – 1]
New query still matches T2, so T2 could be applied again:
weather, yesterday U //*[date = today() – 1] U //*[date = today() – 1] U ... U //*[date = today() – 1]
Infinite recursion
Problem: Recursive Matches (2/2)
Trails may be mutually recursive
T3: //*.tuple.date → //*.tuple.modified
T10: //*.tuple.modified → //*.tuple.date
Query after applying T2: weather, yesterday U //*[date = today() – 1]
T3 matches: adds //*[modified = today() – 1]
T10 matches: adds //*[date = today() – 1]
We again match T3 and enter an infinite loop
Solution: Multiple Match Coloring
Algorithm
Trails:
T1: weather → //Temperatures/*
T2: yesterday → //*[date = today() – 1]
T3: //*.tuple.date → //*.tuple.modified
T4: //*.tuple.date → //*.tuple.received
Query: weather yesterday
First Level: T1 matches, T2 matches
weather U //Temperatures/*
yesterday U //*[date = today() – 1]
Second Level: T3, T4 match
weather U //Temperatures/*
yesterday U //*[date = today() – 1] U //*[modified = today() – 1] U //*[received = today() – 1]
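The slide shows the levels but not the algorithm itself, so the following is only a sketch of the general idea under stated assumptions: every rewritten subquery carries the set of trails ("colors") already used to derive it, and a trail is never applied to a subquery that carries its own color. The `Trail` class, the string-based matching, and the `expand` function are hypothetical simplifications; T10 is taken from the recursive-matches example.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Tuple


@dataclass(frozen=True)
class Trail:
    name: str
    left: str
    right: str
    attribute: bool = False   # True: left and right are attribute names

    def apply(self, query: str) -> Optional[str]:
        """Return the rewritten query, or None if the trail does not match."""
        if self.attribute:
            needle = f"[{self.left} "
            if needle in query:
                return query.replace(needle, f"[{self.right} ")
            return None
        return self.right if query == self.left else None


def expand(leaf: str, trails: List[Trail], max_levels: int = 5) -> List[str]:
    """Collect the union members for one leaf, level by level."""
    union_members = [leaf]
    frontier: List[Tuple[str, FrozenSet[str]]] = [(leaf, frozenset())]
    for _ in range(max_levels):
        next_frontier = []
        for query, colors in frontier:
            for trail in trails:
                if trail.name in colors:
                    continue   # already used on this derivation: do not match again
                rewritten = trail.apply(query)
                if rewritten is None:
                    continue
                if rewritten not in union_members:
                    union_members.append(rewritten)   # keep the union free of duplicates
                next_frontier.append((rewritten, colors | {trail.name}))
        if not next_frontier:
            break
        frontier = next_frontier
    return union_members


trails = [
    Trail("T2", "yesterday", "//*[date = today() - 1]"),
    Trail("T3", "date", "modified", attribute=True),
    Trail("T4", "date", "received", attribute=True),
    Trail("T10", "modified", "date", attribute=True),
]
print(expand("yesterday", trails))
```

In this sketch termination does not depend on `max_levels`: each derivation can use a given trail at most once, so the mutually recursive pair T3 and T10 stops after one round trip.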
Multiple Match Coloring Algorithm
Analysis
Problem: MMCA is exponential in number of levels
Solution: Trail Pruning
Prune by number of levels
Prune by top-K trails matched in each level
Prune by both top-K trails and number of levels
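A minimal sketch of the top-K option, assuming that matched trails are ranked by their probability p (the slides say probabilities are used to rank trails, but do not state the pruning criterion). The function name `prune_top_k` and the (name, p) pair representation are hypothetical.

```python
import heapq
from typing import List, Tuple

MatchedTrail = Tuple[str, float]   # (trail name, probability p)


def prune_top_k(matched: List[MatchedTrail], k: int) -> List[MatchedTrail]:
    """Keep only the k most probable trails matched in one level."""
    return heapq.nlargest(k, matched, key=lambda m: m[1])


level_matches = [("T1", 0.9), ("T2", 1.0), ("T3", 0.4)]
print(prune_top_k(level_matches, 2))
```

Pruning by number of levels would simply cap how many times this per-level step is repeated, and the third option applies both limits together.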
Outline
Motivation
iTrails
Experiments
Conclusion and Future Work
iTrails Evaluation in iMeMex
iMeMex Dataspace System: Open-source prototype
available at http://www.imemex.org
Main Questions in Evaluation
Quality: Top-K Precision and Recall
Performance: Use of Materialization
Scalability: Query-rewrite Time vs. Number of Trails
iTrails Evaluation in iMeMex
Scenario 1: Few High-quality Trails
Closer to information integration use cases
Obtained real datasets and indexed them
18 hand-crafted trails
14 hand-crafted queries
Scenario 2: Many Low-quality Trails
Closer to search use cases
Generated up to 10,000 trails
iTrails Evaluation in iMeMex: Scenario 1
Configured iMeMex to act in three modes
Baseline: Graph / IR search engine
iTrails: Rewrite search queries with trails
Perfect Query: Semantics-aware query
Data: shipped to central index from Laptop, Web Server, Email Server, DB Server (sizes in MB)
Quality: Top-K Precision and Recall
Scenario 1: few high-quality trails (18 trails)
K = 20
[Plot: precision and recall per query; x-axis: Queries]
Search Engine misses relevant results
Search Query is partially semantics-aware (Q3: pdf yesterday; Q13: to = raimund.grube@enron.com)
Perfect Query always has precision and recall equal to 1
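The slides do not define the two measures, so the sketch below assumes the standard information-retrieval definitions: precision is the fraction of the top-K results that are relevant, and recall is the fraction of relevant items found in the top K. The function name and the sample data are hypothetical.

```python
from typing import List, Set, Tuple


def precision_recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> Tuple[float, float]:
    """Compute top-K precision and recall for one ranked result list."""
    top = ranked[:k]
    hits = sum(1 for item in top if item in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result list and relevance judgments.
ranked_results = ["a", "b", "c", "d"]
relevant_items = {"a", "c", "e"}
print(precision_recall_at_k(ranked_results, relevant_items, k=2))
```

With these definitions, a query whose top-K results are exactly the relevant items scores 1 on both measures, which is consistent with the statement about the perfect query.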
Performance: Use of Materialization
Scenario 1: few high-quality trails (18 trails)
[Plot: response times in sec.]
Trail merging adds overhead to query execution
Trail Materialization provides interactive times for all queries
Scalability: Query-rewrite Time vs.
Number of Trails
Scenario 2: many low-quality trails
Query-rewrite time can be controlled with pruning
Conclusion: Pay-as-you-go Information
Integration
Query global warming zurich, answered by the Dataspace System over text and links from the Data Sources
Step 1: Provide a search service over all the data
Step 2: Add integration semantics via trails
Step 3: If more semantics needed, go back to step 2
Our Contributions
iTrails: generic method to model semantic relationships (e.g. implicit meaning, bookmarks, dictionaries, thesauri, attribute matches, ...)
We propose a framework and algorithms for Pay-as-you-go Information Integration
Smooth transition between search and data integration
Future Work
Trail Creation
Use collections (ontologies, thesauri, Wikipedia)
Work on automatic mining of trails from the dataspace
Other types of trails
Associations
Lineage
Questions?
Thanks in advance for your feedback!
[email protected]
http://www.imemex.org
Backup Slides
Related Work: Search vs. Data Integration
vs. Dataspaces
Features             Integration Solution
                     Search              Dataspaces          Data Integration
Integration Effort   Low                 Pay-as-you-go       High
Query Semantics      Precision / Recall  Precision / Recall  Precise
Need for Schema      Schema-never        Schema-later        Schema-first
Personal Dataspaces Literature
Dittrich, Salles, Kossmann, Blunschi. iMeMex: Escapes from the Personal Information Jungle (Demo Paper). VLDB, September 2005.
Dittrich, Salles. iDM: A Unified and Versatile Data Model for Personal Dataspace Management. VLDB, September 2006.
Dittrich. iMeMex: A Platform for Personal Dataspace Management. SIGIR PIM, August 2006.
Blunschi, Dittrich, Girard, Karakashian, Salles. A Dataspace Odyssey: The iMeMex Personal Dataspace Management System (Demo Paper). CIDR, January 2007.
Dittrich, Blunschi, Färber, Girard, Karakashian, Salles. From Personal Desktops to Personal Dataspaces: A Report on Building the iMeMex Personal Dataspace Management System. BTW, March 2007.
Salles, Dittrich, Karakashian, Girard, Blunschi. iTrails: Pay-as-you-go Information Integration in Dataspaces. VLDB, September 2007.
iDM: iMeMex Data Model
Our approach: get the data model closer to personal information – not the other way around
Supports:
Unstructured, semi-structured and structured data, e.g., files&folders, XML, relations
Clear separation of logical and physical representation of data
Arbitrary directed graph structures, e.g., section references in LaTeX documents, links in filesystems, etc.
Lazily computed data, e.g., ActiveXML (Abiteboul et al.)
Infinite data, e.g., media and data streams
See VLDB 2006
Data Model Options
Data Models compared: Bag of Words, Relational, XML, iDM
Support for Personal Data: Non-schematic data; Serialization independent
Support for Graph data: Relational: specific schema; XML: extension (XLink/XPointer)
Support for Lazy Computation: Relational: view mechanism; XML: extension (ActiveXML)
Support for Infinite data: Bag of Words: extension (document streams); Relational: extension (relational streams); XML: extension (XML streams)
Data Models for Personal Information
Abstraction Level
lower
higher
Relational
iDM
Physical
Level
Personal
Information
XML
Document /
Bag of Words
Architectural Perspective of iMeMex
Application Layer
Search & Browse
Indexes&Replicas access
(warehousing)
Email
Office Tools
...
iMeMex PDSMS
iQL
Query
Processor
Complex operators
(query algebra)
Data source access
(mediation)
...
iDM
Query
Processor
Data Source
Query
Processor
Operators
Physical
Algebra
Catalog
Data Store
Result
Cache
Operators
Data
Cleaning
Catalog
Data Store
Indexes
&
Replicas
Operators
Content
Converters
Catalog
Data Source
Plugins
Data Source Layer
...
File System
IMAP
...
DBMS