Record Linkage and Disclosure Limitation
William W. Cohen, CALD
Steve Fienberg, Statistics, CALD & C3S
Pradeep Ravikumar, CALD

Research Goals
• Understand the current “state of the art” in record linkage
• Understand the interplay between record linkage and disclosure limitation problems
  – More generally, understand the interplay between record linkage and analysis of linked data

Initial research question: What’s the state of the art in record linkage?
• Same/related problems studied (in statistics, databases, artificial intelligence) variously as:
  – Merge-purge, duplicate detection, de-duping, database hardening, field-matching, object identity problem, object identification, object consolidation, identity uncertainty, reference resolution, co-reference resolution, reference matching, name matching, …
• Very few comparative studies across areas
• Very few studies on multiple datasets
  – Importance of problem-specific tuning unclear

Initial research question: What’s the state of the art in record linkage?
• Test suite of 14 (small) linkage problems
• “SecondString”: open-source Java toolkit implementing:
  – Edit distance: Levenshtein, Needleman-Wunsch, Smith-Waterman, “Monge-Elkan”
  – Jaro-like: Jaro measure, Jaro-Winkler
  – Token-based: Jaccard, TFIDF, Jensen-Shannon (smoothed w/ Dirichlet, Jelinek-Mercer)
  – Hybrid: Monge-Elkan’s “Level 2”, SoftTFIDF (TFIDF-Jaro hybrid)

Initial research question: What’s the state of the art in record linkage?
• “SecondString” supports:
  – Comparing methods on multiple datasets
    • Methodology from information retrieval
    • 11-pt interpolated precision
  – Easily implemented novel hybrid methods
  – Combining methods (via learned SVM)
    • Labeled data; proxy for hand-tuning on task
    • Different distance metrics for the same field
      2.6*TFIDF(x,y) + 0.4*Levenshtein(x,y) + 1.2*Jaro(x,y)
    • Same method on different fields
      1.3*dist(x-addr,y-addr) + 2.7*dist(x-lname,y-lname)

Comparison: 7 methods vs 11 datasets
SoftTFIDF is best on average

Comparison: 5 edit-distance-like metrics on 11 datasets
Monge-Elkan is best on average

Comparison: 5 metrics, 11 datasets
Monge-Elkan may not be the best choice on a particular dataset

Levenshtein vs SoftTFIDF
• Compare best average performer with one of the worst
• Not strictly better!
• Solution: look at learning the best (combination of) methods
• Training data is a proxy for hand-tuning to a problem

Research Goals
• Understand the current “state of the art” in record linkage
• Understand the interplay between record linkage and disclosure limitation problems
  – More generally, understand the interplay between record linkage and analysis of linked data

Initial Research Goals
• SecondString & experiments
  – Used by researchers at U Washington, elsewhere
  – Additional code release coming
  – Still need to implement/evaluate some advanced models
• (Cohen, Ravikumar, Fienberg, 2003a) A Comparison of String Distance Metrics for Name-Matching Tasks (IIWeb workshop at IJCAI-03)
• (Cohen, Ravikumar, Fienberg, 2003b) A Comparison of String Distance Metrics for Matching Names and Records (Data Cleaning workshop at KDD-03)
• (Bilenko, Mooney, Cohen, Ravikumar, Fienberg, 2003) Adaptive name-matching in information integration (IEEE Intelligent Systems, to appear)
• (Ravikumar, Cohen, Fienberg, 2004?) More extensive survey paper, in preparation…

Current Research Goals
• Understand the interplay between record linkage and disclosure limitation problems (more generally, analysis of linked data)
• Draft paper formalizing
  – Disclosure control for data A:
    • A → A’ so only Pr(A|A’) is available
  – Disclosure policy (attack) as preventing (attempting) inference of:
    • Pr(PRIVATE | A’, OutsideInfo)
  – Linkage attack as using A’, B, joint Pr(A,B)

Current Research Goals
• Understand the interplay between record linkage and disclosure limitation
• Draft paper
• Data selected for initial analysis (NLTCS)
• Linkage and analysis:
  – Analytic linkage: given (X,Y) and (X’,Z) where X and X’ can be linked, find links from X → X’ and Pr(Y,Z) using a sort of bootstrap procedure
    • Pr(Y,Z) constrains possible links
  – How to modify this if Pr(Y,Z) is the important output? What if we only care about some property of Pr(Y,Z), e.g. estimating z = f(y)?
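
Illustrative sketch (not from the slides): the idea of scoring one field with a weighted sum of several string metrics, as in the 2.6*TFIDF + 0.4*Levenshtein + 1.2*Jaro example, can be shown in a few lines of Python. This is a minimal sketch under stated assumptions, not SecondString code: the function names are hypothetical, only a normalized Levenshtein similarity and a token-level Jaccard similarity are implemented, and the weights are passed in by hand, whereas the slides describe learning them with an SVM from labeled data.

```python
def levenshtein(s: str, t: str) -> int:
    """Unit-cost edit distance (insert, delete, substitute) via dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def levenshtein_sim(s: str, t: str) -> float:
    """Map edit distance to a similarity in [0, 1]; 1.0 means identical strings."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - levenshtein(s, t) / longest


def jaccard_sim(s: str, t: str) -> float:
    """Jaccard overlap of lower-cased, whitespace-separated token sets."""
    a, b = set(s.lower().split()), set(t.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical registry of metrics available to the combiner.
METRICS = {"levenshtein": levenshtein_sim, "jaccard": jaccard_sim}


def combined_score(s: str, t: str, weights: dict) -> float:
    """Weighted sum of metric scores; weights are assumed, not learned here."""
    return sum(w * METRICS[name](s, t) for name, w in weights.items())


if __name__ == "__main__":
    pair = ("William Cohen", "Cohen William")
    print(levenshtein_sim(*pair), jaccard_sim(*pair))
    print(combined_score(*pair, weights={"levenshtein": 0.4, "jaccard": 2.6}))
```

The reordered-name pair shows why no single metric is strictly better: the token-based score ignores word order, while the edit-distance score penalizes it heavily. A learned combination lets labeled data decide how much each metric should count for a given field.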