Transcript of slides
Corleone: Hands-Off Crowdsourcing for Entity Matching
Chaitanya Gokhale, University of Wisconsin-Madison
Joint work with AnHai Doan, Sanjib Das, Jeffrey Naughton, Ram Rampalli, Jude Shavlik, and Jerry Zhu
@WalmartLabs

Entity Matching
[Figure: two example product tables to be matched]

  Amazon
  id | name                             | brand     | price
  1  | HP Biscotti G72 17.3" Laptop ... | HP        | 395.0
  2  | Transcend 16 GB JetFlash 500     | Transcend | 17.5
  ...

  Walmart
  id | name                             | brand     | price
  1  | Transcend JetFlash 700           | Transcend | 30.0
  2  | HP Biscotti 17.3" G72 Laptop ... | HP        | 388.0
  ...

- Has been studied extensively for decades
- No satisfactory solution as yet
- Recent work has considered crowdsourcing

Recent Crowdsourced EM Work
- Verifying predicted matches
  – e.g., [Demartini et al. WWW'12, Wang et al. VLDB'12, SIGMOD'13]
- Finding the best questions to ask the crowd
  – to minimize the number of such questions
  – e.g., [Whang et al. VLDB'13]
- Finding the best UI to pose questions
  – display 1 question per page, or 10, or ...?
  – display record pairs or clusters?
  – e.g., [Marcus et al. VLDB'11, Whang et al. TR'12]

Recent Crowdsourced EM Work
- Example: verifying predicted matches
  [Figure: tables A = {a, b, c} and B = {d, e}; Blocking produces the candidate pairs (a,d), (b,e), (c,d), (c,e); Matching predicts (a,d) Y, (b,e) N, (c,d) Y, (c,e) Y; the crowd then verifies the predicted matches, confirming (a,d) Y and (c,e) Y]
  – sample blocking rule: if prices differ by at least $50, do not match
- Shows that crowdsourced EM is highly promising
- But suffers from a major limitation
  – crowdsources only parts of the workflow
  – needs a developer to execute the remaining parts
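Written as a machine-readable predicate, the sample blocking rule above might look like the following minimal Python sketch; the dict-based record format and the 'price' field name are illustrative assumptions, not part of the original slides.

```python
def price_blocking_rule(a, b, min_diff=50.0):
    """Illustrative blocking rule from the slide above: if the prices of the two
    records differ by at least $50, declare 'do not match' and drop the pair.
    Records are assumed to be plain dicts with a numeric 'price' field."""
    if a.get("price") is None or b.get("price") is None:
        return False  # cannot apply the rule without both prices
    return abs(a["price"] - b["price"]) >= min_diff

# Example with the laptop and the USB drive from the Entity Matching tables
a = {"name": 'HP Biscotti G72 17.3" Laptop', "brand": "HP", "price": 395.0}
b = {"name": "Transcend JetFlash 700", "brand": "Transcend", "price": 30.0}
print(price_blocking_rule(a, b))  # True -> the pair is removed before matching
```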
Need for Developer Poses Serious Problems
- Does not scale to EM at enterprises
  – enterprises often have tens to hundreds of EM problems
  – can't afford so many developers
- Example: matching products at WalmartLabs
  [Figure: product category taxonomies of walmart.com and Walmart Stores (brick & mortar), e.g., clothes (shirts, pants), electronics (TVs), books (science, romance), ...]
  – hundreds of major product categories
  – to obtain high accuracy, must match each category separately
  – so there are hundreds of EM problems, one per category

Need for Developer Poses Serious Problems
- Cannot handle crowdsourcing for the masses
  – the masses can't be developers, and can't use crowdsourcing startups either
- E.g., a journalist wants to match two long lists of political donors
  – can't use current EM solutions, because she can't act as a developer
  – can pay up to $500
  – can't ask a crowdsourcing startup to help: $500 is too little for them to engage a developer
  – the same problem holds for domain scientists, small-business workers, end users, data enthusiasts, ...

Our Solution: Hands-Off Crowdsourcing
- Crowdsources the entire workflow of a task
  – requiring no developers
- Given a problem P supplied by user U, a crowdsourced solution to P is hands-off iff
  – it uses no developers, only the crowd
  – user U does no or little initial setup work, requiring no special skills
- Example: to match two tables A and B, user U supplies
  – the two tables
  – a short textual instruction to the crowd on what it means to match
  – two negative & two positive examples to illustrate the instruction

Hands-Off Crowdsourcing (HOC)
- A next logical direction for EM research
  – from no- to partial- to complete crowdsourcing
- Can scale up EM at enterprises
- Can open up crowdsourcing for the masses
- E.g., a journalist wants to match two lists of donors
  – uploads the two lists to an HOC website
  – specifies a budget of $500 on a credit card
  – the HOC website uses the crowd to execute the EM workflow and returns the matches to the journalist
- Very little work so far on crowdsourcing for the masses
  – even though that's where crowdsourcing can make a lot of impact

Our Solution: Corleone, an HOC System for EM
[Figure: Corleone architecture. The user supplies tables A and B, instructions to the crowd, and four examples. The Blocker produces candidate tuple pairs, the Matcher produces predicted matches, followed by the Accuracy Estimator and the Difficult Pairs' Locator. All components use a crowd of workers (e.g., on Amazon Mechanical Turk). The system outputs the predicted matches and accuracy estimates (P, R).]

Blocking
- |A x B| is often very large (e.g., 10B pairs or more)
  – a developer writes rules to remove obviously non-matched pairs, e.g.:
      trigram(a.title, b.title) < 0.2   [for matching Citations]
      overlap(a.brand, b.brand) = 0 AND cosine(a.title, b.title) ≤ 0.1 AND (a.price/b.price ≥ 3 OR b.price/a.price ≥ 3 OR isNULL(a.price, b.price))   [for matching Products]
  – a critical step in EM
- How do we get the crowd to do this?
  – ordinary workers can't write machine-readable rules
  – if they write rules in English, we can't convert them into machine-readable form
- Crowdsourced EM so far asks people to label examples
  – no work has asked people to write machine-readable rules

Our Key Idea
- Ask people to label examples, as before
- Use the labeled examples to generate many machine-readable rules
  – using machine learning, specifically a random forest
- Ask the crowd to evaluate, select, and apply the best rules
- This has proven highly promising
  – e.g., reduces the number of tuple pairs from 168M to 38.2K at a cost of $7.20, and from 56M to 173.4K at a cost of $22
  – with no developer involved
  – in some cases did much better than using a developer (bigger reduction, higher accuracy)

Blocking in Corleone
- Decide if blocking is necessary
  – if |A x B| < τ, do no blocking and return A x B; otherwise do blocking
- Take a sample S from A x B
- Train a random forest F on S (to match tuple pairs)
  – using active learning, where the crowd labels pairs
  [Figure: active learning loop. The four examples supplied by the user (2 positive, 2 negative) seed the labeled set; train a random forest F; if the stopping criterion is satisfied, output F; otherwise select the q "most informative" unlabeled examples from the sample S, have the crowd label them on Amazon Mechanical Turk, and retrain.]
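To make the active learning loop above concrete, here is a minimal sketch in Python. It is illustrative only, not Corleone's actual implementation: the scikit-learn random forest, the featurized sample X_pool, the crowd_label callback, the batch size q, and the fixed-round stopping rule are all assumptions standing in for the components described in the paper, which uses its own informativeness measure and stopping criterion.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_matcher_actively(X_pool, seed_idx, seed_labels, crowd_label,
                           q=20, max_rounds=10, n_trees=10):
    """Sketch of the crowd-driven active learning loop for training the matcher.

    X_pool      : (n_pairs, n_features) feature vectors for the sampled tuple pairs S
    seed_idx    : indices of the user-supplied examples (2 positive, 2 negative)
    seed_labels : their labels (1 = match, 0 = no match)
    crowd_label : callback that posts the given pair indices to the crowd
                  (e.g., Mechanical Turk) and returns their labels
    """
    labeled_idx = list(seed_idx)
    labels = list(seed_labels)
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)

    for _ in range(max_rounds):                 # placeholder stopping criterion
        forest.fit(X_pool[labeled_idx], labels)

        unlabeled = np.setdiff1d(np.arange(len(X_pool)), labeled_idx)
        if len(unlabeled) == 0:
            break

        # "Most informative" pairs: those the forest is least sure about,
        # i.e., predicted match probability closest to 0.5.
        proba = forest.predict_proba(X_pool[unlabeled])[:, 1]
        most_informative = unlabeled[np.argsort(np.abs(proba - 0.5))[:q]]

        # Ask the crowd to label the selected pairs, then retrain.
        labeled_idx.extend(most_informative.tolist())
        labels.extend(crowd_label(most_informative))

    return forest
```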
Blocking in Corleone
- Extract candidate rules from the random forest F
- Example random forest F for matching books:
  [Figure: two decision trees over the binary features isbn_match, #pages_match, title_match, publisher_match, and year_match, with Yes/No leaves]
- Extracted candidate rules (each predicts "No", i.e., do not match):
  – (isbn_match = N) → No
  – (isbn_match = Y) and (#pages_match = N) → No
  – (title_match = N) → No
  – (title_match = Y) and (publisher_match = N) → No
  – (title_match = Y) and (publisher_match = Y) and (year_match = N) → No

Blocking in Corleone
- Evaluate the precision of the extracted candidate rules
  – for each rule R, apply R to predict "match / no match" on the sample S
  – ask the crowd to evaluate R's predictions
  – compute the precision of R
- Select the most precise rules as "blocking rules"
- Apply the blocking rules to A and B using Hadoop, to obtain a smaller set of candidate pairs to be matched
- Multiple difficult optimization problems in blocking
  – to minimize crowd effort & scale up to very large tables A and B
  – see paper

The Rest of Corleone
[Figure: the Corleone architecture again, highlighting the components downstream of the Blocker: the Matcher, the Accuracy Estimator, and the Difficult Pairs' Locator]

Empirical Evaluation
- Datasets

    Dataset      | Table A | Table B | |A x B|  | |M|  | # attributes | # features
    Restaurants  | 533     | 331     | 176,423  | 112  | 4            | 12
    Citations    | 2616    | 64,263  | 168.1 M  | 5347 | 4            | 7
    Products     | 2554    | 21,537  | 55 M     | 1154 | 9            | 23

- Mechanical Turk settings
  – Turker qualifications: at least 100 HITs completed with ≥ 95% approval rate
  – payment: 1-2 cents per question
- Repeated three times on each data set, each run in a different week

Performance Comparison
- Two traditional solutions: Baseline 1 and Baseline 2
  – a developer performs blocking
  – supervised learning is used to match the candidate set
- Baseline 1: labels the same number of pairs as Corleone
- Baseline 2: labels 20% of the candidate set
  – for Products, Corleone labels 3,205 pairs while Baseline 2 labels 36,076
- Also compared with results from published work

Performance Comparison

    Dataset      | Corleone P / R / F1 (cost)    | Baseline 1 P / R / F1 | Baseline 2 P / R / F1 | Published works (F1)
    Restaurants  | 97.0 / 96.1 / 96.5 ($9.20)    | 10.0 / 6.1 / 7.6      | 99.2 / 93.8 / 96.4    | 92-97% [1,2]
    Citations    | 89.9 / 94.3 / 92.1 ($69.50)   | 90.4 / 84.3 / 87.1    | 93.0 / 91.1 / 92.0    | 88-92% [2,3,4]
    Products     | 91.5 / 87.4 / 89.3 ($256.80)  | 92.9 / 26.6 / 40.5    | 95.0 / 54.8 / 69.5    | not available

[1] CrowdER: Crowdsourcing entity resolution, Wang et al., VLDB'12.
[2] Frameworks for entity matching: A comparison, Kopcke et al., Data Knowl. Eng., 2010.
[3] Evaluation of entity resolution approaches on real-world match problems, Kopcke et al., PVLDB'10.
[4] Active sampling for entity matching, Bellare et al., SIGKDD'12.

Blocking

    Dataset      | Cartesian product | Candidate set | Recall (%) | Total cost | Time
    Restaurants  | 176.4K            | 176.4K        | 100        | $0         | -
    Citations    | 168 million       | 38.2K         | 99         | $7.20      | 6.2 hours
    Products     | 56 million        | 173.4K        | 92         | $22.00     | 2.7 hours

- Comparison against blocking by a developer
  – Citations: 100% recall with 202.5K candidate pairs
  – Products: 90% recall with 180.2K candidate pairs
- See the paper for more experiments
  – on blocking, the matcher, the accuracy estimator, the difficult pairs' locator, etc.
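As a concrete illustration of the blocking machinery behind these numbers, here is a minimal sketch, under assumptions, of the two steps described on the earlier blocking slides: reading candidate "do not match" rules off the trees of a trained random forest, and estimating each rule's precision on the crowd-labeled sample S. It assumes a scikit-learn forest already fitted on 0/1 labels (0 = no match) and uses its tree internals; Corleone's actual rule extraction, crowd evaluation, and rule selection are more involved (see the paper).

```python
import numpy as np

def extract_no_match_rules(forest, feature_names):
    """Collect root-to-leaf paths of a fitted sklearn RandomForestClassifier
    that end in a 'no match' (class 0) leaf. Each rule is a list of
    (feature_name, threshold, op) conditions, op being '<=' (left) or '>' (right)."""
    rules = []
    for tree in forest.estimators_:
        t = tree.tree_
        def walk(node, conditions):
            if t.children_left[node] == t.children_right[node]:   # leaf node
                if np.argmax(t.value[node][0]) == 0:              # predicts "no match"
                    rules.append(list(conditions))
                return
            name, thr = feature_names[t.feature[node]], t.threshold[node]
            walk(t.children_left[node], conditions + [(name, thr, "<=")])
            walk(t.children_right[node], conditions + [(name, thr, ">")])
        walk(0, [])
    return rules

def rule_fires(rule, x, feature_index):
    """True if the pair with feature vector x satisfies every condition of the rule."""
    return all(x[feature_index[n]] <= thr if op == "<=" else x[feature_index[n]] > thr
               for n, thr, op in rule)

def rule_precision(rule, X_sample, crowd_labels, feature_names):
    """Precision of a candidate rule on the crowd-labeled sample S: of the pairs
    the rule declares 'no match', the fraction the crowd also labeled 0."""
    feature_index = {n: i for i, n in enumerate(feature_names)}
    fired = [i for i in range(len(X_sample))
             if rule_fires(rule, X_sample[i], feature_index)]
    if not fired:
        return 0.0
    return sum(crowd_labels[i] == 0 for i in fired) / len(fired)
```

In the system described above, the labels used to score each rule come from crowd answers rather than a pre-labeled sample, and only the most precise rules are kept and applied to A and B (via Hadoop) as blocking rules.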
Conclusion
- Current crowdsourced EM often requires a developer
- The need for a developer poses serious problems
  – does not scale to EM at enterprises
  – cannot handle crowdsourcing for the masses
- Proposed hands-off crowdsourcing (HOC)
  – crowdsource the entire workflow, no developer
- Developed Corleone, the first HOC system for EM
  – competitive with or outperforms current solutions
  – no developer effort, relatively little money
  – being transitioned into production at WalmartLabs
- Future directions
  – scaling up to very large data sets
  – HOC for other tasks, e.g., joins in crowdsourced RDBMSs, information extraction (IE)