Transcript slides

Corleone: Hands-Off Crowdsourcing
for Entity Matching
Chaitanya Gokhale
University of Wisconsin-Madison
Joint work with AnHai Doan, Sanjib Das, Jeffrey
Naughton, Ram Rampalli, Jude Shavlik, and
Jerry Zhu
@WalmartLabs
Entity Matching
Amazon:
  id | name                            | brand     | price
  1  | HP Biscotti G72 17.3” Laptop .. | HP        | 395.0
  2  | Transcend 16 GB JetFlash 500    | Transcend | 17.5
  .. | ....                            | ....      | ....

Walmart:
  id | name                            | brand     | price
  1  | Transcend JetFlash 700          | Transcend | 30.0
  2  | HP Biscotti 17.3” G72 Laptop .. | HP        | 388.0
  .. | ....                            | ....      | ....

Goal: find tuple pairs that refer to the same real-world entity
(e.g., the two HP Biscotti G72 laptop listings above)



Has been studied extensively for decades
No satisfactory solution as yet
Recent work has considered crowdsourcing
2
Recent Crowdsourced EM Work

Verifying predicted matches
– e.g., [Demartini et al. WWW’12, Wang et al. VLDB’12, SIGMOD’13]

Finding best questions to ask crowd
– to minimize number of such questions
– e.g., [Whang et al. VLDB’13]

Finding best UI to pose questions
– display 1 question per page, or 10, or …?
– display record pairs or clusters?
– e.g., [Marcus et al. VLDB’11, Whang et al. TR’12]
3
Recent Crowdsourced EM Work

Example: verifying predicted matches
[Figure: tables A = {a, b, c} and B = {d, e}]
Blocking → candidate pairs: (a,d), (b,e), (c,d), (c,e)
Matching → predictions: (a,d) Y, (b,e) N, (c,d) Y, (c,e) Y
Verifying → crowd-confirmed matches: (a,d) Y, (c,e) Y
– sample blocking rule: if prices differ by at least $50 → do not match

Shows that crowdsourced EM is highly promising
 But suffers from a major limitation
– crowdsources only parts of workflow
– needs a developer to execute the remaining parts
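
To make this workflow concrete, here is a minimal Python sketch of the blocking, matching, and verifying steps on the toy example above; the record values, the $50 price rule, and the stand-in matcher are illustrative assumptions, not the actual system.

```python
from itertools import product

# Toy records for tables A and B (values are made up for illustration).
A = {"a": {"title": "HP G72 laptop",      "price": 395.0},
     "b": {"title": "JetFlash 700",       "price": 30.0},
     "c": {"title": "JetFlash 500",       "price": 17.5}}
B = {"d": {"title": "HP G72 17.3 laptop", "price": 388.0},
     "e": {"title": "Transcend JetFlash", "price": 25.0}}

# Blocking: drop pairs whose prices differ by at least $50.
def survives_blocking(x, y):
    return abs(x["price"] - y["price"]) < 50.0

candidates = [(i, j) for (i, x), (j, y) in product(A.items(), B.items())
              if survives_blocking(x, y)]

# Matching: a stand-in predictor (a real matcher would be learned).
def predicted_match(x, y):
    shared = set(x["title"].lower().split()) & set(y["title"].lower().split())
    return len(shared) >= 2

predicted = [(i, j) for i, j in candidates if predicted_match(A[i], B[j])]

# Verifying: the predicted matches would be sent to the crowd to confirm.
print("candidate pairs:", candidates)
print("predicted matches to verify:", predicted)
```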
4
Need for Developer Poses Serious Problems

Does not scale to EM at enterprises
– enterprises often have tens to hundreds of EM problems
– can’t afford so many developers

Example: matching products at WalmartLabs
walmart.com
Walmart Stores (brick&mortar)
all
clothes
shirts
pants
electronics
TVs
all
……
……...
……...
books
……
science romance
electronics clothes ……...
TVs
……
……...
– hundreds of major product categories
– to obtain high accuracy, must match each category separately
– so have hundreds of EM problems, one per category
5
Need for Developer Poses Serious Problems

Cannot handle crowdsourcing for the masses
– masses can’t be developers, can’t use crowdsourcing startups either

E.g., journalist wants to match two long lists of political
donors
– can’t use current EM solutions, because can’t act as a developer
– can pay up to $500
– can’t ask a crowdsourcing startup to help
 $500 is too little for them to engage a developer
– same problem for domain scientists, small business workers,
end users, data enthusiasts, …
6
Our Solution: Hands-Off Crowdsourcing

Crowdsources the entire workflow of a task
– requiring no developers

Given a problem P supplied by user U,
a crowdsourced solution to P is hands-off iff
– uses no developers, only crowd
– user U does little or no initial setup work, requiring no special skills

Example: to match two tables A and B, user U supplies
– the two tables
– a short textual instruction to the crowd on what it means to match
– two negative & two positive examples to illustrate the instruction
7
Hands-Off Crowdsourcing (HOC)

A next logical direction for EM research
– from no- to partial- to complete crowdsourcing

Can scale up EM at enterprises
 Can open up crowdsourcing for the masses
 E.g., journalist wants to match two lists of donors
– uploads two lists to an HOC website
– specifies a budget of $500 on a credit card
– HOC website uses crowd to execute the EM workflow,
returns matches to journalist

Very little work so far on crowdsourcing for the masses
– even though that’s where crowdsourcing can make a lot of impact
8
Our Solution:
Corleone, an HOC System for EM
[Architecture diagram]
User input: tables A and B, instructions to the crowd, four examples
Crowd of workers (e.g., on Amazon Mechanical Turk) labels pairs for every component
Pipeline: Blocker → candidate tuple pairs → Matcher → predicted matches →
Accuracy Estimator → predicted matches + accuracy estimates (P, R);
a Difficult Pairs’ Locator identifies pairs the current matcher likely mismatches
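
A skeletal sketch of how these components might be composed; the function names, signatures, and placeholder bodies are hypothetical, not Corleone's actual code.

```python
# Hypothetical skeleton of a Corleone-style pipeline. Component names follow
# the diagram above; bodies are illustrative stubs only.

def block(A, B, crowd):
    """Blocker: learn blocking rules with crowd help, return candidate pairs."""
    return [(a, b) for a in A for b in B]          # placeholder: no pruning

def match(candidates, crowd):
    """Matcher: crowdsourced active learning, return predicted matches."""
    return []                                       # placeholder

def estimate_accuracy(matches, crowd):
    """Accuracy Estimator: estimate precision and recall of the matches."""
    return 0.0, 0.0                                 # placeholder (P, R)

def locate_difficult_pairs(candidates, matches):
    """Difficult Pairs' Locator: pairs the current matcher likely gets wrong."""
    return []                                       # placeholder

def corleone(A, B, instructions, examples, crowd):
    candidates = block(A, B, crowd)
    matches = match(candidates, crowd)
    P, R = estimate_accuracy(matches, crowd)
    return matches, (P, R)
```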
9
Blocking

|A x B| is often very large (e.g., 10B pairs or more)
– developer writes rules to remove obviously non-matched pairs, e.g.
  trigram(a.title, b.title) < 0.2                                  [for matching Citations]
  overlap(a.brand, b.brand) = 0 AND cosine(a.title, b.title) ≤ 0.1 AND
  (a.price/b.price ≥ 3 OR b.price/a.price ≥ 3 OR isNULL(a.price, b.price))
                                                                   [for matching Products]
  (a code sketch of such rules appears at the end of this slide)
– critical step in EM

How do we get the crowd to do this?
– ordinary workers can’t write machine-readable rules
– if they write rules in English, we can’t reliably convert them into machine-readable form

Crowdsourced EM so far asks people to label examples
– no work has asked people to write machine-readable rules
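
As an illustration, machine-readable blocking rules like the ones above might look as follows in code; the similarity helpers here (set-based trigram Jaccard, token overlap, token-set cosine) are simplified stand-ins, not the exact functions used.

```python
import math

# Hedged sketch: blocking rules as Python predicates that return True
# when a tuple pair can safely be discarded.

def trigrams(s):
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

def trigram_sim(s1, s2):
    g1, g2 = trigrams(s1), trigrams(s2)
    return len(g1 & g2) / len(g1 | g2) if g1 | g2 else 0.0

def token_overlap(s1, s2):
    return len(set(s1.lower().split()) & set(s2.lower().split()))

def cosine_sim(s1, s2):
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    return len(t1 & t2) / math.sqrt(len(t1) * len(t2)) if t1 and t2 else 0.0

def drop_citation_pair(a, b):
    # trigram(a.title, b.title) < 0.2
    return trigram_sim(a["title"], b["title"]) < 0.2

def drop_product_pair(a, b):
    # overlap(brand) = 0 AND cosine(title) <= 0.1 AND
    # (price ratio >= 3 in either direction OR a price is missing)
    price_off = (not a["price"] or not b["price"] or
                 a["price"] / b["price"] >= 3 or b["price"] / a["price"] >= 3)
    return (token_overlap(a["brand"], b["brand"]) == 0 and
            cosine_sim(a["title"], b["title"]) <= 0.1 and
            price_off)
```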
10
Our Key Idea

Ask people to label examples, as before

Use them to generate many machine-readable rules
– using machine learning, specifically a random forest

Ask crowd to evaluate, select and apply the best rules

This has proven highly promising
– e.g., reduced the # of tuple pairs from 168M to 38.2K at a cost of $7.20,
and from 56M to 173.4K at a cost of $22
– with no developer involved
– in some cases did much better than using a developer
(bigger reduction, higher accuracy)
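
A minimal sketch of this idea with scikit-learn, reusing the similarity helpers sketched on the previous slide; the feature set and the forest parameters are illustrative assumptions.

```python
from sklearn.ensemble import RandomForestClassifier

# Each candidate pair becomes a vector of similarity scores between the records.
def featurize(a, b):
    return [trigram_sim(a["title"], b["title"]),
            token_overlap(a["brand"], b["brand"]),
            abs(a["price"] - b["price"]) / max(a["price"], b["price"])]

def train_forest(labeled_pairs):
    """labeled_pairs: list of (record_a, record_b, is_match) labeled by the crowd."""
    X = [featurize(a, b) for a, b, _ in labeled_pairs]
    y = [int(is_match) for _, _, is_match in labeled_pairs]
    forest = RandomForestClassifier(n_estimators=10, random_state=0)
    forest.fit(X, y)
    return forest
```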
11
Blocking in Corleone

Decide if blocking is necessary
– If |A X B| < τ, no blocking, return A X B. Otherwise do blocking.


Take sample S from A x B
Train a random forest F on S (to match tuple pairs)
– using active learning, where crowd labels pairs
[Figure: the active learning loop]
Start with the four examples supplied by the user (2 positive, 2 negative)
and a sample S from A x B, then repeat:
  train a random forest F on the labeled examples
  if the stopping criterion is satisfied → return F
  otherwise select the q “most informative” unlabeled examples from S
  and have the crowd (Amazon Mechanical Turk) label them
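
A hedged sketch of this loop; the “most informative” criterion used here (entropy of the forest’s predicted match probability) and the stopping test are illustrative stand-ins, not necessarily the criteria Corleone actually uses.

```python
import math
from sklearn.ensemble import RandomForestClassifier

def label_entropy(p):
    """Uncertainty of a predicted match probability p."""
    return 0.0 if p in (0.0, 1.0) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def active_learn(seed_examples, sample, crowd_label, q=20, max_rounds=20):
    """seed_examples: the four user-supplied (features, label) pairs;
    sample: unlabeled feature vectors from S;
    crowd_label(x): asks the crowd whether the pair x is a match."""
    labeled = list(seed_examples)
    unlabeled = list(sample)
    forest = None
    for _ in range(max_rounds):
        forest = RandomForestClassifier(n_estimators=10, random_state=0)
        forest.fit([x for x, _ in labeled], [y for _, y in labeled])
        if not unlabeled:                        # stand-in stopping criterion
            break
        # Pick the q unlabeled examples the forest is most uncertain about.
        probs = forest.predict_proba(unlabeled)[:, 1]
        ranked = sorted(range(len(unlabeled)), key=lambda i: -label_entropy(probs[i]))
        picked = set(ranked[:q])
        for i in picked:
            labeled.append((unlabeled[i], crowd_label(unlabeled[i])))
        unlabeled = [x for i, x in enumerate(unlabeled) if i not in picked]
    return forest
```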
Blocking in Corleone

Extract candidate rules from random forest F
[Figure: example random forest F for matching books, with two trees]
Tree 1: isbn_match = N → No; isbn_match = Y → check #pages_match (N → No, Y → Yes)
Tree 2: title_match = N → No; title_match = Y → check publisher_match
        (N → No, Y → check year_match: N → No, Y → Yes)

Extracted candidate rules (each predicts “do not match”):
(isbn_match = N) → No
(isbn_match = Y) and (#pages_match = N) → No
(title_match = N) → No
(title_match = Y) and (publisher_match = N) → No
(title_match = Y) and (publisher_match = Y) and (year_match = N) → No
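
A hedged sketch of extracting such negative rules from a scikit-learn forest by walking each tree and collecting the root-to-leaf paths that end in a “no match” prediction; the traversal uses sklearn’s tree_ arrays, while the rule format and the assumption that class 0 means “no match” are mine.

```python
import numpy as np

def extract_negative_rules(forest, feature_names):
    """Return every root-to-leaf path in the forest that predicts 'no match'."""
    rules = []
    for tree in forest.estimators_:
        t = tree.tree_
        def walk(node, conditions):
            if t.children_left[node] == t.children_right[node]:   # leaf node
                if np.argmax(t.value[node]) == 0:                 # class 0 = No
                    rules.append(" and ".join(conditions) or "(always)")
                return
            name, thr = feature_names[t.feature[node]], t.threshold[node]
            walk(t.children_left[node],  conditions + [f"({name} <= {thr:.2f})"])
            walk(t.children_right[node], conditions + [f"({name} > {thr:.2f})"])
        walk(0, [])
    return rules
```

Each returned conjunction is a candidate blocking rule: any pair that satisfies it would be predicted as “do not match”.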
13
Blocking in Corleone

Evaluate the precision of extracted candidate rules
– for each rule R, apply R to predict “match / no match” on sample S
– ask crowd to evaluate R’s predictions
– compute precision for R

Select most precise rules as “blocking rules”
Apply blocking rules to A and B using Hadoop, to
obtain a smaller set of candidate pairs to be matched

Multiple difficult optimization problems in blocking

– to minimize crowd effort & scale up to very large tables A and B
– see paper
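
A minimal sketch of the rule-evaluation step at the top of this slide, assuming crowd labels on sample S serve as ground truth for the pairs each rule covers; the 0.95 precision threshold is an illustrative parameter, not Corleone’s actual setting.

```python
def rule_precision(rule, sample, crowd_labels):
    """Precision of a 'do not match' rule R on crowd-labeled sample pairs.

    rule(pair) -> True if R predicts 'do not match' for the pair;
    crowd_labels[pair_id] -> True if the crowd says the pair is a match."""
    fired = [pid for pid, pair in sample if rule(pair)]
    if not fired:
        return 0.0
    correct = sum(1 for pid in fired if not crowd_labels[pid])
    return correct / len(fired)

def select_blocking_rules(candidate_rules, sample, crowd_labels, min_precision=0.95):
    return [r for r in candidate_rules
            if rule_precision(r, sample, crowd_labels) >= min_precision]
```

The selected rules would then be applied to all of A x B (e.g., as a Hadoop map job) to discard pairs and produce the candidate set.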
14
The Rest of Corleone
[Architecture diagram from slide 9, repeated: user input (tables A and B, instructions
to the crowd, four examples) and the crowd of workers (e.g., on Amazon Mechanical Turk);
Blocker → candidate tuple pairs → Matcher → predicted matches → Accuracy Estimator →
predicted matches + accuracy estimates; plus the Difficult Pairs’ Locator]
15
Empirical Evaluation
Datasets

Dataset       Table A   Table B   |A x B|    |M|    # attributes   # features
Restaurants   533       331       176,423    112    4              12
Citations     2616      64,263    168.1 M    5347   4              7
Products      2554      21,537    55 M       1154   9              23

Mechanical Turk settings
– Turker qualifications: at least 100 HITs completed with ≥ 95%
approval rate
– Payment: 1-2 cents per question

Repeated three times on each data set,
each run in a different week
16
Performance Comparison

Two traditional solutions: Baseline 1 and Baseline 2
– developer performs blocking
– supervised learning to match the candidate set

Baseline 1: labels the same # of pairs as Corleone

Baseline 2: labels 20% of the candidate set
– for Products, Corleone labels 3205 pairs, Baseline 2 labels 36076

Also compare with results from published work
17
Performance Comparison
              Corleone                       Baseline 1          Baseline 2          Published works
Datasets      P     R     F1    Cost         P     R     F1      P     R     F1      F1
Restaurants   97.0  96.1  96.5  $9.20        10.0  6.1   7.6     99.2  93.8  96.4    92-97 % [1,2]
Citations     89.9  94.3  92.1  $69.50       90.4  84.3  87.1    93.0  91.1  92.0    88-92 % [2,3,4]
Products      91.5  87.4  89.3  $256.80      92.9  26.6  40.5    95.0  54.8  69.5    Not available
[1] CrowdER: Crowdsourcing Entity Resolution. Wang et al., VLDB’12.
[2] Frameworks for Entity Matching: A Comparison. Kopcke et al., Data Knowl. Eng., 2010.
[3] Evaluation of Entity Resolution Approaches on Real-World Match Problems. Kopcke et al., PVLDB’10.
[4] Active Sampling for Entity Matching. Bellare et al., SIGKDD’12.
18
Blocking
Datasets      Cartesian product   Candidate set   Recall (%)   Total cost   Time
Restaurants   176.4K              176.4K          100          $0           -
Citations     168 million         38.2K           99           $7.20        6.2 hours
Products      56 million          173.4K          92           $22.00       2.7 hours

Comparison against blocking by a developer
– Citations: 100% recall with 202.5K candidate pairs
– Products: 90% recall with 180.2K candidate pairs

See paper for more experiments
– on blocking, matcher, accuracy estimator, difficult pairs’ locator, etc.
19
Conclusion


Current crowdsourced EM often requires a developer
Need for developer poses serious problems
– does not scale to EM at enterprises
– cannot handle crowdsourcing for the masses

Proposed hands-off crowdsourcing (HOC)
– crowdsource the entire workflow, no developer

Developed Corleone, the first HOC system for EM
– competitive with or outperforms current solutions
– no developer effort, relatively little money
– being transitioned into production at WalmartLabs

Future directions
– scaling up to very large data sets
– HOC for other tasks, e.g., joins in crowdsourced RDBMSs, information extraction (IE)