Record Linkage and
Disclosure Limitation
William W. Cohen, CALD
Steve Fienberg, Statistics, CALD & C3S
Pradeep Ravikumar, CALD
Research Goals
• Understand the current “state of the
art” in record linkage
• Understand the interplay between
record linkage and disclosure limitation
problems
– More generally, understand the interplay
between record linkage and analysis of
linked data
Initial research question: What’s the state of the art in record linkage?
• Same/related problems studied (in statistics,
database, artificial intelligence) variously as:
– Merge-purge, duplicate detection, de-duping,
database hardening, field-matching, object
identity problem, object identification, object
consolidation, identity uncertainty, reference
resolution, co-reference resolution, reference
matching, name matching, …
• Very few comparative studies across areas
• Very few studies on multiple datasets
– Importance of problem-specific tuning unclear
Initial research question: What’s the state of the art in record linkage?
• Test suite of 14 (small) linkage problems
• “SecondString”: open-source, Java
toolkit implementing:
– Edit distance: Levenshtein, Needleman-Wunsch, Smith-Waterman, “Monge-Elkan”
– Jaro-like: Jaro measure, Jaro-Winkler
– Token-based: Jaccard, TFIDF, Jensen-Shannon (smoothed w/ Dirichlet, Jelinek-Mercer)
– Hybrid: Monge-Elkan’s “Level 2”,
SoftTFIDF (TFIDF-Jaro hybrid)
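
To make the metric families above concrete, here is a minimal Python sketch of one representative from each of the first three groups: Levenshtein (edit distance), Jaro and Jaro-Winkler (Jaro-like), and Jaccard over whitespace tokens (token-based). It is not SecondString code (the toolkit is in Java), and the function names, the whitespace tokenizer, and the Jaro-Winkler defaults (prefix scale 0.1, prefix length capped at 4) are assumptions chosen for illustration.

```python
def levenshtein(s, t):
    """Unit-cost edit distance: insertions, deletions, substitutions."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete from s
                            curr[j - 1] + 1,      # insert into s
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def jaro(s, t):
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_matched = [False] * len(s)
    t_matched = [False] * len(t)
    matches = 0
    for i, cs in enumerate(s):
        lo, hi = max(0, i - window), min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_matched[j] and t[j] == cs:
                s_matched[i] = t_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    s_seq = [c for c, used in zip(s, s_matched) if used]
    t_seq = [c for c, used in zip(t, t_matched) if used]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s, t, prefix_scale=0.1, max_prefix=4):
    """Jaro score boosted for a shared prefix (assumed default parameters)."""
    base = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:max_prefix], t[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


def jaccard(s, t):
    """Jaccard similarity over lower-cased whitespace tokens (assumed tokenizer)."""
    a, b = set(s.lower().split()), set(t.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


print(levenshtein("kitten", "sitting"))            # 3
print(round(jaro("MARTHA", "MARHTA"), 4))          # 0.9444
print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
print(jaccard("William Cohen", "Cohen William W"))
```

Note that Levenshtein returns a distance (lower is closer) while the other three return similarities (higher is closer), so any comparison across them needs a consistent orientation.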
Initial research question: What’s the state of the art in record linkage?
• “SecondString” supports:
– Comparing methods on multiple datasets
• Methodology from information retrieval
• 11-pt interpolated precision
– Easily implemented novel hybrid methods
– Combining methods (via learned SVM)
• Labeled data; proxy for hand-tuning on task
• Different distance metrics for the same field
2.6*TFIDF(x,y) + 0.4*Levenshtein(x,y) + 1.2*Jaro(x,y)
• Same method on different fields
1.3*dist(x-addr,y-addr) + 2.7*dist(x-lname,y-lname)
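
The weighted sums above can be read as a linear scoring function over per-metric (or per-field) similarity features, with the weights learned from labeled match/non-match pairs. The Python sketch below illustrates that idea only: it swaps in a plain perceptron for the SVM named on the slide, and the two scorers (`exact_match`, `token_jaccard`), the training pairs, and all function names are hypothetical stand-ins.

```python
def exact_match(x, y):
    """1.0 if the strings are equal ignoring case and outer whitespace."""
    return 1.0 if x.strip().lower() == y.strip().lower() else 0.0


def token_jaccard(x, y):
    """Jaccard similarity over lower-cased whitespace tokens."""
    a, b = set(x.lower().split()), set(y.lower().split())
    return len(a & b) / len(a | b) if (a or b) else 1.0


def combined_score(weights, bias, feats):
    """Linear combination of similarity features, as in w1*f1 + w2*f2 + ..."""
    return bias + sum(w * f for w, f in zip(weights, feats))


def train_perceptron(examples, scorers, epochs=10, lr=1.0):
    """Learn weights from ((x, y), label) pairs, label = +1 match / -1 non-match.

    A perceptron is used here only as a simple stand-in for the SVM.
    """
    weights, bias = [0.0] * len(scorers), 0.0
    for _ in range(epochs):
        for (x, y), label in examples:
            feats = [f(x, y) for f in scorers]
            if label * combined_score(weights, bias, feats) <= 0:
                weights = [w + lr * label * f for w, f in zip(weights, feats)]
                bias += lr * label
    return weights, bias


scorers = [exact_match, token_jaccard]
examples = [
    (("William Cohen", "william cohen"), +1),
    (("William Cohen", "Cohen William"), +1),
    (("William Cohen", "Steve Fienberg"), -1),
    (("Pradeep Ravikumar", "Steve Fienberg"), -1),
]
weights, bias = train_perceptron(examples, scorers)
print(weights, bias)
```

Combining the same metric across different fields follows the same pattern: each (field, metric) pair contributes one feature, for example a hypothetical `dist(x_addr, y_addr)` and `dist(x_lname, y_lname)`.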
Comparison: 7 methods vs 11 datasets
SoftTFIDF is best on average
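
For reference, the 11-pt interpolated precision named earlier as the evaluation methodology can be sketched as follows. The sketch assumes each method produces a ranking of candidate pairs, best first, with each pair labeled as a true match or not; the function names and the input format are assumptions, not SecondString's API.

```python
def eleven_point_interpolated_precision(ranked_labels, total_relevant=None):
    """Interpolated precision at recall 0.0, 0.1, ..., 1.0.

    ranked_labels: booleans, True where the pair at that rank is a true match.
    total_relevant: number of true matches overall (defaults to those ranked).
    """
    if total_relevant is None:
        total_relevant = sum(ranked_labels)
    if total_relevant == 0:
        return [0.0] * 11

    # (recall, precision) after each rank position.
    points = []
    hits = 0
    for rank, is_match in enumerate(ranked_labels, start=1):
        if is_match:
            hits += 1
        points.append((hits / total_relevant, hits / rank))

    # Interpolated precision at level r is the best precision
    # achieved at any recall >= r.
    interpolated = []
    for i in range(11):
        level = i / 10
        reachable = [p for r, p in points if r >= level - 1e-12]
        interpolated.append(max(reachable) if reachable else 0.0)
    return interpolated


def average_eleven_point(ranked_labels, total_relevant=None):
    """Mean of the 11 interpolated precision values."""
    values = eleven_point_interpolated_precision(ranked_labels, total_relevant)
    return sum(values) / len(values)


ranking = [True, False, True, False]
print(eleven_point_interpolated_precision(ranking))
print(round(average_eleven_point(ranking), 4))  # 0.8485
```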
Comparison: 5 edit-distance-like metrics on 11 datasets
Monge-Elkan is best on average
Comparison: 5 metrics, 11 datasets
Monge-Elkan may not be best choice on a particular dataset
Levenshtein vs SoftTFIDF
• Compare best average performer with one of the worst
• Not strictly better!
• Solution: look at learning best (combination of) methods.
• Training data proxy for hand-tuning to a problem
Initial Research Goals
• SecondString & experiments
– Used by researchers at U Washington, elsewhere
– Additional code release coming
– Still need to implement/evaluate some advanced models
• (Cohen, Ravikumar, Fienberg, 2003a) A Comparison of
String Distance Metrics for Name-Matching Tasks
(IIWeb workshop at IJCAI-03)
• (Cohen, Ravikumar, Fienberg, 2003b) A Comparison of
String Distance Metrics for Matching Names and
Records (Data Cleaning workshop at KDD-03)
• (Bilenko, Mooney, Cohen, Ravikumar, Fienberg, 2003)
Adaptive name-matching in information integration,
(IEEE Intelligent Systems, to appear)
• (Ravikumar, Cohen, Fienberg, 2004?) More extensive
survey paper, in preparation…
Current Research Goals
• Understand the interplay between
record linkage and disclosure limitation
problems (more generally, analysis of linked
data)
• Draft paper formalizing
– Disclosure control for data A:
• A → A’ so only Pr(A|A’) is available
– Disclosure policy (attack) as preventing
(attempting) inference of:
• Pr( PRIVATE | A’, OutsideInfo)
– Linkage attack as using A’, B, joint Pr(A,B)
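
As a toy illustration of the quantity Pr(PRIVATE | A’, OutsideInfo), the Python sketch below assumes a single binary private attribute released through randomized response (the true value is kept with a known probability and flipped otherwise), with the attacker's outside information reduced to a prior on the attribute. The release mechanism, the prior, and all names are assumptions made for this example; the draft paper's formalization may differ.

```python
def posterior_private(released, prior_true, keep_prob):
    """Pr(A = True | A' = released) under a hypothetical randomized-response release.

    released:   the published (possibly flipped) value A'
    prior_true: attacker's prior Pr(A = True), standing in for OutsideInfo
    keep_prob:  probability that the release keeps the true value
    """
    # Likelihood of the released value under each possible true value.
    like_if_true = keep_prob if released else 1 - keep_prob
    like_if_false = 1 - keep_prob if released else keep_prob
    numerator = like_if_true * prior_true
    evidence = numerator + like_if_false * (1 - prior_true)
    return numerator / evidence  # Bayes' rule


# A rare attribute (prior 0.1) released with 90% fidelity:
print(round(posterior_private(True, prior_true=0.1, keep_prob=0.9), 4))  # 0.5
# With keep_prob = 0.5 the release carries no information:
print(round(posterior_private(True, prior_true=0.1, keep_prob=0.5), 4))  # 0.1
```

Under these assumptions, a disclosure policy amounts to choosing `keep_prob` so that the posterior stays below an acceptable level, and an attack amounts to sharpening the prior with outside data such as a linked file B.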
Current Research Goals
• Understand the interplay between
record linkage and disclosure limitation
• Draft paper
• Data selected for initial analysis (NLTCS)
• Linkage and analysis:
– Analytic linkage: given (X,Y) and (X’,Z) where X and
X’ can be linked, find links from X → X’ and Pr(Y,Z)
using a sort of bootstrap procedure
• Pr(Y,Z) constrains possible links
– How to modify this if Pr(Y,Z) is the important
output? What if we only care about some property
of Pr(Y,Z), e.g. estimating z = f(y) ?