Transcript PPT slides

eTuner: Tuning Schema Matching
Software using Synthetic Scenarios
Mayssam Sayyadian, Yoonkyong Lee, AnHai Doan
University of Illinois, USA
Arnon Rosenthal
MITRE Corp., USA
Main Points

Tuning matching systems:
long standing problem
– becomes increasingly worse

We propose a principled solution
– exploits synthetic input/output pairs
– promising, though much work remains

Idea applicable to other contexts
2
Schema Matching
price
Schema 1
agent-name
address
120,000 George Bush
Crawford, TX
239,900 Hillary Clinton New York City, NY
1-1 match
listed-price
Schema 2
320K
240K
contact-name
Jane Brown
Mike Smith
complex match
city
state
Seattle WA
Miami FL
3
Schema Matching is Ubiquitous

Databases
–
–
–
–
–
–

data integration,
model management
data translation,
collaborative data sharing
keyword querying, schema/view integration
data warehousing, peer data management, …
AI
– knowledge bases, ontology merging, information gathering agents, ...

Web
– e-commerce, Deep Web, Semantic Web

eGovernment, bio-informatics, scientific data management
4
Current State of Affairs

Finding semantic mappings is now a key bottleneck!
– largely done by hand, labor intensive & error prone

Numerous matching techniques have been developed
– Databases: IBM Almaden, Microsoft Research, BYU, George Mason,
U Leipzig, U Wisconsin, NCSU, U Illinois, Washington,
Humboldt-Universität zu Berlin, ...
– AI: Stanford, Karlsruhe University, NEC Japan, ...

Techniques are often synergistic, leading to
multi-component matching architectures
– each component employs a particular technique
– final predictions combine those of the components
5
An Example: LSD [SIGMOD-01]
agent
Schema 1
address
Urbana, IL James Smith
Seattle, WA Mike Doan
Schema 2
area
Peoria, IL
Kent, WA
name
agent-name
contact-agent
Name
Matcher
agent
0.5
Combiner
Naive Bayes
Matcher
0.1
0.3
(206) 634 9435
(617) 335 4243
area
=> (address, 0.7), (description, 0.3)
contact-agent => (agent-phone, 0.7), (agent-name, 0.3)
comments
contact
Constraint
Enforcer
Match
Selector
=> (address, 0.6), (desc, 0.4)
area = address
Only one attribute
of Schema 2
matches address
contact-agent = agent-phone
...
comments = desc
6
Multi-Component Matching Solutions

Developed in many recent works
– e.g., Doan et. al., WebDB-00, SIGMOD-01; Do&Rahm, VLDB-02;
Embley et.al.-02; Bernstein et. al. SIGMOD Record-04; Madhavan et. al. 05

Now commonly adopted, with industrial-strength systems
– e.g., Protoplasm [MSR], COMA++ [Univ of Lepzig]
Match
selector
Match
selector
Constraint
enforcer
Combiner
Matcher 1 … Matcher n
LSD
Match
selector
Match
selector
Constraint
enforcer
Combiner
Constraint
enforcer
Combiner
Matcher 1 … Matcher n
Matcher
COMA
SF
Matcher
Combiner
Matcher 1
…
Matcher n
LSD-SF

Such systems are very powerful ...
– maximize accuracy; highly customizable to individual domain

... but place a serious tuning burden on domain users
7
Tuning Schema Matching Systems

Given a particular matching situation
– how to select the right components?
– how to adjust the multitude of knobs?
Match
selector
Threshold
selector
Bipartite
graph selector
A* search enforcer Relax. labeler
Constraint
enforcer
Combiner
Matcher 1 …
Matcher n
Execution graph

ILP
Average
Min
Max
Weighted
combiner combiner combiner sum combiner
q-gram name Decision tree
matcher
matcher
TF/IDF
name matcher
Naïve Bays
matcher
SVM
matcher
Knobs of
decision tree matcher
•
•
•
•
Characteristics of attr.
Split measure
Post-prune?
Size of validation set
•
•
•
Library of matching components
Untuned versions produce inferior accuracy, however ...
8
... Tuning is Extremely Difficult

Large number of knobs
– e.g., 8-29 in our experiments

Wide variety of techniques
– database, machine learning, IR, information theory, etc.




Complex interaction among components
Not clear how to compare the quality of knob configs
Matching systems are still tuned manually, by trial and error
Multiple component systems make tuning even worse
Developing efficient tuning techniques is crucial
to making matching systems attractive in practice
9
The eTuner Solution

Given schema S & matching system M
– tunes M to maximize average accuracy
of matching S with future schemas
– incurs virtually no cost to user

Key challenge 1: Evaluation
– must search for “best” knob config
– how to compute the quality of any knob config C?
– if knowing “ground-truth” matches for a representative workload
W = {(S,T1), ..., (S,Tn)}, then can use W to evaluate C
– but often have no such W

Key challenge 2: Search
– how to efficiently evaluate the huge space of knob configs?
10
Key Idea: Generate Synthetic Input/Output Pairs

Need workload
W = {(S,T1), (S,T2), …, (S,Tn)}

To generate W
– start with S
– perturb S to generate T1
– perturb S to generate T2
– etc.

Know the perturbation => know matches between S & Ti
11
Key Idea: Generate Synthetic Input/Output Pairs
V
V1
1
Perturb # of tables
3
2
12
3
Perturb # of columns
in each table
.
.
.
Split S into V and U with
disjoint data tuples
EMPLOYEES
id
Schema S
first
last
Vn
salary ($)
1
Bill
Laup
40,000 $
2
Mike
Brown
60,000 $
12
3
Perturb column
and table names
EMPLOYEES
last
1
2
Laup
3
Brown
id
salary($)
1
40,000$
2
60,000$
EMPS
1
2
Perturb data tuples
in each table
3
12
U
3
id
wage
Laup
emp-last
1
40,000$
Brown
2
60,000$
12
3
EMPLOYEES
id
first
last
salary ($)
EMPS
EMPLOYEES
1
Bill
Laup
40,000 $
2
Mike
Brown
60,000 $
id
first
3
Jean
Ann
30,000 $
3
4
Roy
Bond
70,000 $
4
last
salary ($)
Jean
Ann
30,000$
Roy
Bond
70,000$
U
EMPS.emp-last = EMPLOYEES.last
EMPS.id
= EMPLOYEES.id
EMPS.wage
= EMPLOYEES.salary($)
Ω1: a set of semantic matches
emp-last
id
wage
Laup
1
45200
Brown
2
59328
V1
12
Examples of Perturbation Rules

Number of tables
– merge two tables based on a join path
– splits a table into two

Structure of table
– merges two columns
– e.g., neighboring columns, or sharing prefix/suffix (last-name, first-name)
– drop a column
– swap location of two columns

Names of tables/columns
– rules capture common name transformations
– abbreviation to the first 3-4 characters, dropping all vowels, synonyms,
dropping prefixes, adding table name to column name, etc

Data values
– rules capture common format transformations: 12/4 => Dec 4
– values are changed based on some distributions (e.g., Gaussian)
See paper for details
13
The eTuner Architecture
Tuning Procedures
Perturbation Rules
Workload
Generator
(Optional)
Schema S
Synthetic
Workload
U
Ω1
V1
U
Ω2
V2
U
Ωn
Vn
Staged
Tuner
Tuned Matching Tool M
Matching Tool M
14
The Staged Tuner
Match
selector
Level 4
Constraint
enforcer
Level 3
Combiner
Level 2
Matcher 1


…
Matcher n
Tuning
direction
Level 1
Tune sequentially starting with lowest-level components
Assume
– execution graph has k levels, m nodes per level
– each node can be assigned one of n components
– each component has p knobs, each of which has q values
tuning examines (npqkm) out of (npq)^(km) knob configs
15
Empirical Evaluation
Domains
Domain
# schemas
# tables per
schema
# attributes
per schema
# tuples
per table
Real Estate
5
2
30
1000
Courses
5
3
13
50
LSD
Inventory
10
4
10
20
Corpus (ICDE’05)
Product
2
2
50
120
iMAP (SIGMOD’04)
reference paper
LSD (SIGMOD’01)
Matching systems
LSD:
6 Matchers, 6 Combiners, 1 Constraint enforcer, 2 Match selectors, 21 Knobs
iCOMA: 10 Matchers, 4 Combiners, 2 Match selectors, 20 Knobs
SF:
3 Matchers, 1 Constraint enforcer, 2 Match selectors, 8 Knobs
LSD-SF: 7 Matcher, 7 Combiners, 1 Constraint enforcer, 2 Match selectors, 29 Knobs
16
Matching Accuracy
Off-the-shelf
Domain-independent
Domain-dependent
Source-dependent
0.9
0.8
0.9
0.8
LSD
0.7
0.5
0.4
0.5
0.4
0.3
0.3
0.2
0.2
0.1
0.1
0
0
Real Estate
Product
COMA
0.7
0.6
0.6
Inventory
Course
eTUNER: Automatic
eTUNER: Human-assisted
Real Estate
Product
Inventory
Course
0.9
0.8
0.9
SF
0.7
LSD-SF
0.8
0.7
0.6
0.6
0.5
0.5
0.4
0.4
0.3
0.3
0.2
0.2
0.1
0.1
0
0
Real Estate
Product
Inventory
Course
Real Estate
Product
Inventory
eTuner achieves higher accuracy than current best
methods, at virtually no cost to the user
Course
17
Cost of Using eTuner


You have a schema S and a matching system M
Vendor supplies eTuner
– will hook it up with matching system M

Vendor supplies a matching system M
– bundles eTuner inside
18
Sensitivity Analysis


Adding perturbation rules
Exploiting prior match results (enriching the workload)
0.7
Accuracy (F1)
0.6
0.5
0.4
0.3
0.2
Inventory Domain
Real Estate Domain
0.1
Average
0.0
1
10
20
25
40
50
Schemas in Synthetic Workload (#)
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0.0
Tuned LSD
0
22
44
66
88
Previous matches in collection (%)
19
Summary: The eTuner Project @ Illinois

Tuning matching systems is crucial
– long standing problem, is getting worse
– a next logical step in schema matching research

Provides an automatic & principled solution
– generates a synthetic workload, employs it to tune efficiently
– incurs virtually no cost to human users
– exploits user assistance whenever available

Extensive experiments over 4 domains with 4 systems

Future directions
– find optimal synthetic workload
– apply to other matching scenarios
– adapt ideas to scenarios beyond schema matching (see 3rd speaker)
20
Backup: User Assistance
S(phone1,phone2,…)
 Generate V by dropping phone2: V(phone1,…)
 Rename phone1 in V: V(x,…)
 Problem:

– x matches phone1, x does not match phone2

User:
– group phone1 and phone2
– so if x matches phone1, it will also match phone2

Intuition: tell system do not bother to try distinguish phone1
and phone2
21