Transcript Slides

iTrails: Pay-as-you-go Information Integration
in Dataspaces
Marcos Vaz Salles Jens Dittrich
Shant Karakashian
Olivier Girard
Lukas Blunschi
ETH Zurich
VLDB 2007
September 26, 2007
Outline
• Motivation
• iTrails
• Experiments
• Conclusions and Future Work
September 26, 2007
Marcos Vaz Salles / ETH Zurich / [email protected]
Problem: Querying Several Sources
Query: What is the impact of global warming in Zurich?
Systems: ? (one per data source)
Data Sources: Laptop, Email Server, Web Server, DB Server
Solution 1: Use a Search Engine
Query: global warming zurich
System: Graph IR Search Engine, using text and links from each data source
(TopX [VLDB05], FleXPath [SIGMOD04], XSearch [VLDB03], XRank [SIGMOD03])
Data Sources: Laptop, Email Server, Web Server, DB Server
Drawback: Query semantics are not precise!
Solution 2: Use an Information
Integration System
Query: //Temperatures/*[city = “zurich”]
System: Information Integration System
Schema: Temps, Cities, CO2, Sunspots, ...
Schema mappings to the data sources: two in place, two missing
Data Sources: Laptop, Email Server, Web Server, DB Server
Drawback: Too much effort to provide schema mappings!
(GAV (e.g. [ICDE95]), LAV (e.g. [VLDB96]), GLAV [AAAI99], P2P (e.g. [SIGMOD04]))
Research Challenge:
Is There an Integration Solution in-between
These Two Extremes?
• Graph IR Search Engine
  Query: global warming zurich
  Uses text and links from the data sources
• Pay-as-you-go Information Integration: Dataspace System (?)
  Query: global warming zurich
  Uses text and links from the data sources
• Information Integration System
  Query: //Temperatures/*[city = “zurich”]
  Schema: Temps, Cities, CO2, Sunspots, ...
  Requires full-blown schema mappings to the data sources
Data Sources: Laptop, Email Server, Web Server, DB Server
Dataspace Vision by Franklin, Halevy, and Maier [SIGMOD Record 05]
Outline
• Motivation
• iTrails
• Experiments
• Conclusions and Future Work
iTrails Core Idea: Add Integration Hints Incrementally
• Step 1: Provide a search service over all the data
  - Use a general graph data model (see VLDB 2006)
  - Works for unstructured documents, XML, and relations
• Step 2: Add integration semantics via hints (trails) on top of the graph
  - Works across data sources, not only between sources
• Step 3: If more semantics needed, go back to step 2
• Impact:
  - Smooth transition between search and data integration
  - Semantics added incrementally improve precision / recall
iTrails: Defining Trails
• Basic Form of a Trail:
  QL [.CL] → QR [.CR]
  - QL, QR: queries (NEXI-like keyword and path expressions)
  - CL, CR: attribute projections
• Intuition: When I query for QL [.CL], you should also query for QR [.CR]
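The basic form of a trail can be sketched as a small record type. This is only an illustration of the notation QL [.CL] → QR [.CR]: the class name `Trail`, its field names, and the `describe` helper are hypothetical and are not taken from iMeMex.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Trail:
    """One trail of the form QL [.CL] -> QR [.CR] (hypothetical representation)."""
    ql: str                    # left-hand query: keyword or path expression
    qr: str                    # right-hand query
    cl: Optional[str] = None   # optional attribute projection on the left side
    cr: Optional[str] = None   # optional attribute projection on the right side

    def describe(self) -> str:
        # Render the intuition stated on the slide for this trail.
        left = self.ql + (f".{self.cl}" if self.cl else "")
        right = self.qr + (f".{self.cr}" if self.cr else "")
        return f"When I query for {left}, you should also query for {right}"


# Two trails that appear later in the talk, written in this representation.
thesaurus = Trail(ql="car", qr="auto")
schema_match = Trail(ql="//Employee//*", cl="tuple.empName",
                     qr="//Person//*", cr="tuple.name")

print(thesaurus.describe())
print(schema_match.describe())
```

Keeping the projections optional mirrors the square brackets in the notation: a keyword trail such as car → auto has none, while a schema-match trail carries one on each side.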
Trail Examples: Global Warming Zurich
Query: global warming zurich
Temperatures (DB Server)
date    city    region  celsius
24-Sep  Bern    BE      20
24-Sep  Uster   ZH      15
25-Sep  Zurich  ZH      14
26-Sep  Zurich  ZH      9
• Trail for Implicit Meaning: “When I query for global warming, you should also query for Temperature data above 10 degrees”
  global warming → //Temperatures/*[celsius > 10]
• Trail for an Entity: “When I query for zurich, you should also query for references of zurich as a region”
  zurich → //*[region = “ZH”]
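To make the effect of these two trails concrete, the sketch below evaluates the query global warming zurich over the Temperatures rows, once with keywords only and once with both trails applied. The row values are read from the slide's table as best the transcript allows, and the exact-match `matches_keyword` function is a toy stand-in for a real search engine, so treat the whole block as an assumption-laden illustration.

```python
# Rows of the Temperatures relation as read from the slide (assumed reconstruction).
ROWS = [
    {"date": "24-Sep", "city": "Bern",   "region": "BE", "celsius": 20},
    {"date": "24-Sep", "city": "Uster",  "region": "ZH", "celsius": 15},
    {"date": "25-Sep", "city": "Zurich", "region": "ZH", "celsius": 14},
    {"date": "26-Sep", "city": "Zurich", "region": "ZH", "celsius": 9},
]


def matches_keyword(row, keyword):
    """Toy keyword search: true if any cell equals the keyword, ignoring case."""
    return any(str(value).lower() == keyword.lower() for value in row.values())


def plain_query(row):
    # Keywords only: "global warming" AND "zurich".
    return matches_keyword(row, "global warming") and matches_keyword(row, "zurich")


def rewritten_query(row):
    # global warming -> //Temperatures/*[celsius > 10]
    left = matches_keyword(row, "global warming") or row["celsius"] > 10
    # zurich -> //*[region = "ZH"]
    right = matches_keyword(row, "zurich") or row["region"] == "ZH"
    return left and right


plain = [row["city"] for row in ROWS if plain_query(row)]
rewritten = [row["city"] for row in ROWS if rewritten_query(row)]
print(plain)      # no row contains the phrase "global warming"
print(rewritten)  # rows in region ZH with more than 10 degrees
```

The keyword-only query finds nothing in the table, while the rewritten query returns the Uster row and the 25-Sep Zurich row, which is the kind of recall gain the trails are meant to provide.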
Trail Example: Deep Web Bookmarks
Query: train home (Web Server)
• Trail for a Bookmark: “When I query for train home, you should also query for the TrainCompany’s website with origin at ETH Uni and destination at Seilbahn Rigiblick”
  train home → //trainCompany.com//*[origin = “ETH Uni” and dest = “Seilbahn Rigiblick”]
Trail Examples: Thesauri, Dictionaries, Language-agnostic Search
Data Sources: Email Server, Laptop
• Trail for Thesauri: “When I query for car, you should also query for auto”
  car → auto
• Trails for Dictionary: “When I query for car, you should also query for carro and vice-versa”
  car → carro
  carro → car
Trail Examples: Schema Equivalences
Relations on the DB Server:
  Employee(empId, empName, salary)
  Person(SSN, name, age, income)
• Trail for schema match on names: “When I query for Employee.empName, you should also query for Person.name”
  //Employee//*.tuple.empName → //Person//*.tuple.name
• Trail for schema match on salaries: “When I query for Employee.salary, you should also query for Person.income”
  //Employee//*.tuple.salary → //Person//*.tuple.income
Outline
• Motivation
• iTrails
  - Core Idea
  - Trail Examples
  - How are Trails Created?
  - Uncertainty and Trails
  - Rewriting Queries with Trails
  - Recursive Matches
• Experiments
• Conclusion and Future Work
How are Trails Created?
• Given by the user
  - Explicitly
  - Via Relevance Feedback
• (Semi-)Automatically
  - Information extraction techniques
  - Automatic schema matching
  - Ontologies and thesauri (e.g., WordNet)
  - User communities (e.g., trails on gene data, bookmarks)
Uncertainty and Trails
• Probabilistic Trails:
  - model uncertain trails
  - probabilities used to rank trails
  QL [.CL] → QR [.CR] with probability p, 0 ≤ p ≤ 1
  - Example: car → auto, p = 0.8
Certainty and Trails
• Scored Trails:
  - give higher value to certain trails
  - scoring factors used to boost scores of query results obtained by the trail
  QL [.CL] → QR [.CR] with scoring factor sf, sf > 1
  - Examples:
    T1: weather → //Temperatures/*, p = 0.9, sf = 2
    T2: yesterday → //*[date = today() – 1], p = 1, sf = 3
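The two annotations, probability p and scoring factor sf, can be sketched as extra fields on a trail. The slides state only that probabilities rank trails and that scoring factors boost result scores; the multiplicative boost in `boosted_score` is an assumption made for illustration, and all names here are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredTrail:
    name: str
    lhs: str
    rhs: str
    p: float    # probability of the trail, 0 <= p <= 1
    sf: float   # scoring factor, sf > 1

    def __post_init__(self):
        # Enforce the ranges given on the slides.
        if not 0.0 <= self.p <= 1.0:
            raise ValueError("p must lie in [0, 1]")
        if self.sf <= 1.0:
            raise ValueError("sf must be greater than 1")


# The two example trails from the slide ("-" stands in for the slide's dash).
T1 = ScoredTrail("T1", "weather", "//Temperatures/*", p=0.9, sf=2)
T2 = ScoredTrail("T2", "yesterday", "//*[date = today() - 1]", p=1.0, sf=3)


def rank_trails(trails):
    """Order trails by decreasing probability."""
    return sorted(trails, key=lambda trail: trail.p, reverse=True)


def boosted_score(base_score, trail):
    """Assumed boost rule: multiply the result's base score by the trail's sf."""
    return base_score * trail.sf


print([trail.name for trail in rank_trails([T1, T2])])
print(boosted_score(0.5, T1))
```

Under these assumptions T2 ranks ahead of T1 because it is certain (p = 1), and a result with base score 0.5 obtained through T1 is boosted to 1.0.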
Rewriting Queries with Trails
Query: weather ∩ yesterday
Trail T2: yesterday → //*[date = today() – 1]
(1) Matching: T2 matches the query node yesterday
(2) Transformation: yesterday becomes yesterday ∪ //*[date = today() – 1]
(3) Merging: weather ∩ (yesterday ∪ //*[date = today() – 1])
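The three steps (matching, transformation, merging) can be sketched on a tiny query tree. The tuple encoding of queries and the function name `rewrite` are hypothetical; matching is reduced to string equality on leaves, which is far simpler than matching real path expressions.

```python
def rewrite(query, lhs, rhs):
    """Apply one trail lhs -> rhs with union semantics.

    A query is either a leaf (str) or a tuple (operator, child, child, ...),
    where operator is "and" (intersection) or "or" (union).
    """
    if isinstance(query, str):
        if query == lhs:               # (1) matching: the leaf equals the trail's left side
            return ("or", lhs, rhs)    # (2) transformation: keep the leaf, add the right side
        return query
    operator, *children = query
    # (3) merging: rebuild the tree around the transformed children
    return (operator, *[rewrite(child, lhs, rhs) for child in children])


# Keyword query "weather yesterday", read as an intersection of its keywords.
query = ("and", "weather", "yesterday")

# Trail T2 from the slides ("-" stands in for the slide's dash).
rewritten = rewrite(query, "yesterday", "//*[date = today() - 1]")
print(rewritten)
```

The result is the tree for weather ∩ (yesterday ∪ //*[date = today() - 1]). Applying `rewrite` to its own output would match yesterday again, which is the recursion problem the following slides address.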
Replacing Trails
• Trails that use replace instead of union semantics
Query: weather ∩ yesterday
Trail T2: yesterday → //*[date = today() – 1] (replace semantics)
(1) Matching: T2 matches the query node yesterday
(2) Transformation: yesterday is replaced by //*[date = today() – 1]
(3) Merging: weather ∩ //*[date = today() – 1]
Problem: Recursive Matches (1/2)
Trail T2: yesterday → //*[date = today() – 1]
Query after applying T2: weather ∩ (yesterday ∪ //*[date = today() – 1])
New query still matches T2, so T2 could be applied again:
  weather ∩ (yesterday ∪ //*[date = today() – 1] ∪ //*[date = today() – 1] ∪ ...)
Infinite recursion
Problem: Recursive Matches (2/2)
Trails may be mutually recursive:
  T3: //*.tuple.date → //*.tuple.modified
  T10: //*.tuple.modified → //*.tuple.date
Query: weather ∩ (yesterday ∪ //*[date = today() – 1])
T3 matches: weather ∩ (yesterday ∪ //*[date = today() – 1] ∪ //*[modified = today() – 1])
T10 matches: weather ∩ (yesterday ∪ //*[date = today() – 1] ∪ //*[modified = today() – 1] ∪ //*[date = today() – 1])
We again match T3 and enter an infinite loop
Solution: Multiple Match Coloring Algorithm
Trails:
  T1: weather → //Temperatures/*
  T2: yesterday → //*[date = today() – 1]
  T3: //*.tuple.date → //*.tuple.modified
  T4: //*.tuple.date → //*.tuple.received
Query: weather ∩ yesterday
First Level (T1, T2 match):
  (weather ∪ //Temperatures/*) ∩ (yesterday ∪ //*[date = today() – 1])
Second Level (T3, T4 match):
  (weather ∪ //Temperatures/*) ∩ (yesterday ∪ //*[date = today() – 1] ∪ //*[modified = today() – 1] ∪ //*[received = today() – 1])
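The level-wise behaviour on this slide can be imitated with a small coloring scheme. The slides do not spell out the algorithm, so the rule used here is an assumption: every node remembers the set of trails ("colors") already applied along its lineage, and a trail is never matched against a node that carries its color. Trail matching is again reduced to toy string tests, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Optional


@dataclass(frozen=True)
class Leaf:
    text: str
    colors: FrozenSet[str] = frozenset()   # trails already applied to this lineage


@dataclass(frozen=True)
class Trail:
    name: str
    apply: Callable[[str], Optional[str]]  # rewritten text, or None when there is no match


def keyword_trail(name, lhs, rhs):
    return Trail(name, lambda text: rhs if text == lhs else None)


def attribute_trail(name, src, dst):
    # Toy matching: a leaf uses attribute src if it contains "[src =".
    def apply(text):
        marker = f"[{src} ="
        return text.replace(marker, f"[{dst} =") if marker in text else None
    return Trail(name, apply)


def rewrite_levels(groups, trails, max_levels=10):
    """groups is a list of unions (lists of Leaf); the query is their intersection."""
    for _ in range(max_levels):
        changed = False
        next_groups = []
        for group in groups:
            kept, produced = [], []
            for leaf in group:
                colors = leaf.colors
                for trail in trails:
                    if trail.name in leaf.colors:
                        continue                      # color blocks a repeated match
                    out = trail.apply(leaf.text)
                    if out is None:
                        continue
                    colors = colors | {trail.name}    # color the matched node
                    produced.append(Leaf(out, leaf.colors | {trail.name}))
                    changed = True
                kept.append(Leaf(leaf.text, colors))
            next_groups.append(kept + produced)
        groups = next_groups
        if not changed:
            break
    return groups


TRAILS = [
    keyword_trail("T1", "weather", "//Temperatures/*"),
    keyword_trail("T2", "yesterday", "//*[date = today() - 1]"),
    attribute_trail("T3", "date", "modified"),
    attribute_trail("T4", "date", "received"),
]
result = rewrite_levels([[Leaf("weather")], [Leaf("yesterday")]], TRAILS)
texts = [[leaf.text for leaf in group] for group in result]
print(texts)
```

With T1 to T4 this reproduces the two levels on the slide and then stops because no uncolored match remains. Replacing T4 by a trail from modified back to date also terminates, since the date node produced at the third level already carries the color of T3.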
Multiple Match Coloring Algorithm Analysis
• Problem: MMCA is exponential in number of levels
• Solution: Trail Pruning
  - Prune by number of levels
  - Prune by top-K trails matched in each level
  - Prune by both top-K trails and number of levels
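A short calculation shows why pruning helps. The growth model below is an assumption, not a result from the talk: it supposes that every node added at one level is matched by the same number of trails at the next level, so the number of added nodes multiplies per level. The function name and parameters are hypothetical.

```python
def produced_nodes(matches_per_node, levels, top_k=None, max_levels=None):
    """Count nodes added to the query under a simple uniform-branching model.

    matches_per_node: trails assumed to match each newly added node
    top_k:            if set, keep at most this many matched trails per level
    max_levels:       if set, stop rewriting after this many levels
    """
    if max_levels is not None:
        levels = min(levels, max_levels)
    total = 0
    frontier = 1  # nodes added at the previous level; start with one query node
    for _ in range(levels):
        new_nodes = frontier * matches_per_node
        if top_k is not None:
            new_nodes = min(new_nodes, top_k)
        total += new_nodes
        frontier = new_nodes
    return total


print(produced_nodes(3, 5))                          # no pruning
print(produced_nodes(3, 5, top_k=2))                 # prune by top-K per level
print(produced_nodes(3, 5, max_levels=2))            # prune by number of levels
print(produced_nodes(3, 5, top_k=2, max_levels=2))   # prune by both
```

With three matches per node and five levels the unpruned count is 3 + 9 + 27 + 81 + 243 = 363, while top-K pruning with K = 2 gives 10, a two-level limit gives 12, and both together give 4.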
Outline
• Motivation
• iTrails
• Experiments
• Conclusion and Future Work
iTrails Evaluation in iMeMex
• iMeMex Dataspace System: Open-source prototype available at http://www.imemex.org
• Main Questions in Evaluation
  - Quality: Top-K Precision and Recall
  - Performance: Use of Materialization
  - Scalability: Query-rewrite Time vs. Number of Trails
iTrails Evaluation in iMeMex
• Scenario 1: Few High-quality Trails
  - Closer to information integration use cases
  - Obtained real datasets and indexed them
  - 18 hand-crafted trails
  - 14 hand-crafted queries
• Scenario 2: Many Low-quality Trails
  - Closer to search use cases
  - Generated up to 10,000 trails
iTrails Evaluation in iMeMex: Scenario 1
• Configured iMeMex to act in three modes
  - Baseline: Graph / IR search engine
  - iTrails: Rewrite search queries with trails
  - Perfect Query: Semantics-aware query
• Data: shipped to central index
  (Figure: sizes in MB for the data sources Laptop, Web Server, Email Server, DB Server)
Quality: Top-K Precision and Recall
Scenario 1: few high-quality trails (18 trails), K = 20
(Figure: top-K precision and recall per query, compared against the perfect query)
• Search Engine misses relevant results
• Search Query is partially semantics-aware (Q3: pdf yesterday; Q13: to = raimund.grube@enron.com)
• Perfect Query always has precision and recall equal to 1
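For reference, top-K precision and recall can be computed as below. The slides do not give the exact definitions used in the evaluation, so this sketch follows the common convention: precision divides the relevant hits by the number of results actually returned in the top K, and recall divides them by the total number of relevant items. The function name and the sample data are hypothetical.

```python
def precision_recall_at_k(ranked_results, relevant, k=20):
    """Return (precision, recall) over the first k entries of a ranked result list."""
    top = ranked_results[:k]
    hits = sum(1 for item in top if item in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up example: four ranked results, three relevant items overall.
ranked = ["a", "b", "c", "d"]
relevant = {"a", "c", "e"}

precision, recall = precision_recall_at_k(ranked, relevant, k=4)
print(precision, recall)
```

In this example two of the four returned results are relevant, so precision is 0.5, and two of the three relevant items are found, so recall is 2/3. A query whose results are exactly the relevant set scores 1 on both measures.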
Performance: Use of Materialization
Scenario 1: few high-quality trails (18 trails)
(Figure: response times in sec. per query)
• Trail merging adds overhead to query execution
• Trail Materialization provides interactive times for all queries
Scalability: Query-rewrite Time vs. Number of Trails
Scenario 2: many low-quality trails
• Query-rewrite time can be controlled with pruning
Conclusion: Pay-as-you-go Information Integration
(Figure: query global warming zurich posed to the Dataspace System, which draws text and links from the Data Sources)
• Step 1: Provide a search service over all the data
• Step 2: Add integration semantics via trails
• Step 3: If more semantics needed, go back to step 2
• Our Contributions
  - iTrails: generic method to model semantic relationships (e.g. implicit meaning, bookmarks, dictionaries, thesauri, attribute matches, ...)
  - We propose a framework and algorithms for Pay-as-you-go Information Integration
  - Smooth transition between search and data integration
Future Work
• Trail Creation
  - Use collections (ontologies, thesauri, Wikipedia)
  - Work on automatic mining of trails from the dataspace
• Other types of trails
  - Associations
  - Lineage
Questions?
Thanks in advance for your feedback!
[email protected]
http://www.imemex.org
Backup Slides
Related Work: Search vs. Data Integration vs. Dataspaces
Integration Solution:
Features            Search              Dataspaces          Data Integration
Integration Effort  Low                 Pay-as-you-go       High
Query Semantics     Precision / Recall  Precision / Recall  Precise
Need for Schema     Schema-never        Schema-later        Schema-first
Personal Dataspaces Literature
• Dittrich, Salles, Kossmann, Blunschi. iMeMex: Escapes from the Personal Information Jungle (Demo Paper). VLDB, September 2005.
• Dittrich, Salles. iDM: A Unified and Versatile Data Model for Personal Dataspace Management. VLDB, September 2006.
• Dittrich. iMeMex: A Platform for Personal Dataspace Management. SIGIR PIM, August 2006.
• Blunschi, Dittrich, Girard, Karakashian, Salles. A Dataspace Odyssey: The iMeMex Personal Dataspace Management System (Demo Paper). CIDR, January 2007.
• Dittrich, Blunschi, Färber, Girard, Karakashian, Salles. From Personal Desktops to Personal Dataspaces: A Report on Building the iMeMex Personal Dataspace Management System. BTW 2007, March 2007.
• Salles, Dittrich, Karakashian, Girard, Blunschi. iTrails: Pay-as-you-go Information Integration in Dataspaces. VLDB, September 2007.
iDM: iMeMex Data Model
• Our approach: get the data model closer to personal information, not the other way around
• Supports:
  - Unstructured, semi-structured and structured data, e.g., files & folders, XML, relations
  - Clear separation of logical and physical representation of data
  - Arbitrary directed graph structures, e.g., section references in LaTeX documents, links in filesystems, etc.
  - Lazily computed data, e.g., ActiveXML (Abiteboul et al.)
  - Infinite data, e.g., media and data streams
See VLDB 2006
Data Model Options
Data models compared: Bag of Words, Relational, XML, iDM
Support for Personal Data, by feature:
• Non-schematic data
• Serialization independent
• Support for Graph data: specific schema; extension: XLink/XPointer
• Support for Lazy Computation: view mechanism; extension: ActiveXML
• Support for Infinite data: extension: document streams; extension: relational streams; extension: XML streams
Data Models for Personal Information
(Figure: the data models Document / Bag of Words, XML, Relational, and iDM placed along an Abstraction Level axis, from lower to higher, between the Physical Level and Personal Information)
Architectural Perspective of iMeMex
Application Layer: Search & Browse, Email, Office Tools, ...
iMeMex PDSMS (interfaces: iQL, iDM), with three query processors:
• Complex operators (query algebra): Operators, Physical Algebra, Catalog, Data Store (Result Cache)
• Indexes & Replicas access (warehousing): Operators, Data Cleaning, Catalog, Data Store (Indexes & Replicas)
• Data source access (mediation): Operators, Content Converters, Catalog, Data Source Plugins
Data Source Layer: File System, IMAP, DBMS, ...