DUPLICATE RECORD DETECTION
AHMED K. ELMAGARMID
PURDUE UNIVERSITY, WEST LAFAYETTE, IN
Senior Member, IEEE
PANAGIOTIS G. IPEIROTIS
LEONARD N. STERN SCHOOL OF BUSINESS, NEW YORK, NY
Member, IEEE Computer Society
VASSILIOS S. VERYKIOS
UNIVERSITY OF THESSALY, VOLOS, GREECE
Member, IEEE Computer Society
INTRODUCTION TO THE PROBLEM
Databases play an important role in today's IT infrastructure: many businesses and organizations depend on the quality of the data (or suffer from the lack thereof) stored in their databases.
Any discrepancies in the data can have significant cost implications for a system that relies on that information to function.
Data are not carefully controlled for quality, nor defined in a consistent way across different data sources, so data quality is compromised by many factors. Examples:
Data entry errors. Ex: "Microsft" instead of "Microsoft"
Integrity errors. Ex: EmployeeAge = 567
Multiple conventions for recording information. Ex: "44 W. 4th St." and "44 West Fourth Street"
While integrating data from different sources into a warehouse, organizations become aware of potential systematic differences; these problems and conflicts fall under the umbrella term "DATA HETEROGENEITY".
Two types of heterogeneity can be distinguished: structural heterogeneity and lexical heterogeneity.
Data cleaning refers to the process of resolving such identification problems in the data.
Structural heterogeneity: different record structures. Ex: a single Addr field versus separate City, State, and Zip code fields.
Lexical heterogeneity: identical record structures, but the data differ. Ex: "44 W. 4th St." versus "44 West Fourth Street".
Data preparation is the step before duplicate record detection: it improves the quality of the data and makes the data more comparable and more usable.
Data preparation stage includes three steps.
STEPS IN DATA PREPARATION
Parsing: locates, identifies, and isolates individual data elements.
Makes it easier to correct, standardize, and match data, since individual components are compared rather than long complex strings.
For example, the appropriate parsing of name and address components into consistent packets is a very important step.
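A minimal illustration of the parsing step for names. This is an assumed toy parser: the prefix list and output field names are inventions for this sketch, while real preparation tools use far richer grammars and dictionaries.

```python
import re

def parse_name(raw):
    """Split a raw name string into prefix/first/last components.

    Illustrative only: the prefix set below is a small assumed
    sample, not a standard. Middle names/initials are ignored.
    """
    prefixes = {"prof.", "dr.", "mr.", "mrs.", "ms."}
    tokens = re.split(r"[\s,]+", raw.strip())
    result = {"prefix": None, "first": None, "last": None}
    if tokens and tokens[0].lower() in prefixes:
        result["prefix"] = tokens.pop(0)  # peel off an honorific if present
    if tokens:
        result["first"] = tokens[0]
        result["last"] = tokens[-1] if len(tokens) > 1 else None
    return result
```

Once names are in this packet form, matching can compare last names to last names instead of whole strings.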
Data transformation: simple conversions of data type and decoding of field values.
Range checking: involves examining data in a field to ensure that it falls within the expected range, usually a numeric or date range.
Dependency checking: a slightly more complex kind of data transformation, where we check the values in a particular field against the values in another field to ensure a minimal level of consistency in the data.
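The two checks can be sketched as follows. The field names, the age bounds, and the minimum-working-age rule in the dependency check are all assumptions made for this illustration; they reuse the EmployeeAge = 567 error from the examples above.

```python
def range_check(value, low, high):
    # Range checking: flag values outside the expected numeric range.
    return low <= value <= high

def dependency_check(record):
    # Dependency checking: cross-field consistency. Assumed rule for
    # illustration: hire year must be at least 16 years after birth year.
    return record["hire_year"] >= record["birth_year"] + 16

employees = [
    {"name": "A", "age": 567, "birth_year": 1980, "hire_year": 2005},
    {"name": "B", "age": 42,  "birth_year": 1980, "hire_year": 1990},
]
bad_range = [e["name"] for e in employees if not range_check(e["age"], 0, 120)]
bad_dep = [e["name"] for e in employees if not dependency_check(e)]
```

Record A fails the range check (age 567), while record B fails the dependency check (hired at age 10).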
Represent certain fields in a standard format
US Postal Service Address Verification tool
Date and time formatting
Names (first, last, middle, prefix, suffix)
LAST STEP IN DATA PREPARATION
Store data in tables with comparable fields.
Identify fields suitable for comparison.
Even after preparation, data may still contain inconsistencies due to misspellings and the different conventions used to record information.
FIELD MATCHING TECHNIQUES
One of the most common sources of mismatches in database entries is typographical error.
The field matching metrics designed to overcome this problem fall into four groups:
Character-based similarity metrics
Token-based similarity metrics
Phonetic similarity metrics
Numeric similarity metrics
CHARACTER BASED SIMILARITY
Works best on typographical errors.
Edit distance: the shortest sequence of edit commands that can transform a string s into a string t.
Three types of edit operations: insert, delete, and replace. If each operation has cost 1, this version of edit distance is referred to as the Levenshtein distance.
Example: s1 = "tin", s2 = "tan". We need to replace "i" with "a" to convert s1 into s2, so the edit distance is 1, because only one operation was needed.
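The Levenshtein distance above can be sketched with the standard dynamic program, kept to two rows of the table at a time:

```python
def edit_distance(s, t):
    """Levenshtein distance: minimum number of single-character
    inserts, deletes, and replacements (each of cost 1) needed
    to turn string s into string t."""
    m, n = len(s), len(t)
    prev = list(range(n + 1))          # distances from "" to prefixes of t
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # delete from s
                          curr[j - 1] + 1,     # insert into s
                          prev[j - 1] + cost)  # replace (or keep)
        prev = curr
    return prev[n]
```

For the slides' examples, edit_distance("tin", "tan") is 1, and "Microsft" is one insertion away from "Microsoft".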
CHARACTER BASED SIMILARITY
Affine gap distance: handles strings that have been truncated or shortened.
Ex: "John R. Smith" versus "Jonathan Richard Smith".
Smith-Waterman distance: substring matching which ignores mismatching prefixes and suffixes.
Ex: "Prof. John R. Smith" and "John R. Smith, Prof."
Jaro distance: designed mainly for comparing first and last names.
Q-grams: divide a string into a series of overlapping substrings of length q.
Ex: NELSON and NELSEN are phonetically similar but spelled differently. Their 2-grams are {NE, EL, LS, SO, ON} and {NE, EL, LS, SE, EN}, so most q-grams are shared.
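A minimal q-gram comparison. Using the Jaccard overlap of the two q-gram sets as the combining measure is an illustrative choice; other overlap measures are used in practice.

```python
def qgrams(s, q=2):
    # All overlapping substrings of length q (here 2-grams).
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def qgram_similarity(a, b, q=2):
    """Jaccard overlap of the q-gram sets: 1.0 for identical sets,
    0.0 for disjoint sets."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    return len(ga & gb) / len(ga | gb)
```

NELSON and NELSEN share three of their seven distinct 2-grams, so their similarity is 3/7, even though their edit distance already flags them as close.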
TOKEN BASED SIMILARITY
Works best when words (tokens) are transposed.
WHIRL: weights words based on their frequency to determine similarity.
Each word in the database is assigned a (tf-idf style) weight, and two strings are compared using the cosine similarity of their weight vectors.
Example: in a database of company names, the words "AT&T" and "IBM" are less frequent than the word "inc.", so they carry more weight.
The similarity of "John Smith" and "Mr. John Smith" is therefore close to 1, but the similarity of "comptr department" and "deprtment of computer" is zero, since this metric does not account for misspelled words.
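A minimal sketch of frequency weighting with cosine similarity in the spirit of WHIRL. The tiny corpus and the plain log idf formula are illustrative assumptions, not WHIRL's exact weighting.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Assign each token a tf-idf weight: tokens frequent across the
    corpus (e.g., "mr.") get low weight, rare tokens get high weight."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(tok for toks in tokenized for tok in set(toks))
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(toks).items()}
            for toks in tokenized]

def cosine(u, v):
    # Cosine similarity between two sparse weight vectors (dicts).
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Assumed toy corpus: "mr." appears in most names, so it carries little weight.
docs = ["john smith", "mr. john smith", "mr. robert jones", "mr. james brown"]
vecs = tfidf_vectors(docs)
```

Because "mr." is frequent and hence lightly weighted, "john smith" and "mr. john smith" come out close to 1, matching the example above.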
Q-grams with tf-idf weighting: extends WHIRL to handle spelling errors by using q-grams, rather than whole words, as tokens.
PHONETIC SIMILARITY
Comparison is based on how words sound rather than how they are spelled.
NUMERIC SIMILARITY
Considers only numbers. Typical approaches convert numbers to text data or use simple range queries; the survey's authors provide little insight in this area.
SUMMARY OF METRICS
Character-based metrics handle typographical errors; token-based metrics handle transposed words; phonetic metrics handle words that sound alike but are spelled differently; numeric metrics remain comparatively primitive.
DUPLICATE RECORD DETECTION
The methods described so far address similarity checking within a single field.
Real-life situations involve records with multiple fields that must be checked for duplicates. Two broad families of approaches exist:
Probabilistic approaches and supervised machine learning techniques.
Approaches that rely on domain knowledge or on generic distance metrics.
Models derived from Bayes' theorem use prior knowledge to make a decision about the current tuple pair.
A tuple pair is assigned to one of two classes, M or U: the M (match) class represents pairs that describe the same entity, and the U (non-match) class represents pairs that describe different entities.
The assignment is determined by calculating the probability of each class given the observed comparison vector.
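The probabilistic decision can be sketched as a log-likelihood-ratio test in the style of Fellegi and Sunter. The per-field agreement probabilities below are invented for illustration; in practice they are estimated from data.

```python
import math

# Assumed per-field conditional probabilities (illustrative values):
# m[f] = P(field f agrees | pair is a match, class M)
# u[f] = P(field f agrees | pair is a non-match, class U)
m = {"last_name": 0.95, "zip": 0.85}
u = {"last_name": 0.01, "zip": 0.05}

def match_weight(agreements):
    """Sum of per-field log-likelihood ratios over agreement
    indicators; large positive weights favor class M."""
    w = 0.0
    for field, agrees in agreements.items():
        if agrees:
            w += math.log(m[field] / u[field])
        else:
            w += math.log((1 - m[field]) / (1 - u[field]))
    return w

def classify(agreements, threshold=0.0):
    # Assign the pair to M (match) or U (non-match) by thresholding.
    return "M" if match_weight(agreements) > threshold else "U"
```

A pair agreeing on both fields gets a strongly positive weight and lands in M; a pair agreeing on neither lands in U.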
Approaches include rule-based decision trees and support vector machines (SVMs).
These rely on the existence of training data, in the form of record pairs labeled as matching or not.
The SVM approach outperforms the simpler alternatives.
A post-processing step creates a graph over all the records, linking the matching records; records are then considered identical by applying the transitivity relation to the connected components of the graph.
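The transitive-closure post-processing step can be sketched with a union-find structure: records joined by any chain of pairwise match decisions end up in the same group.

```python
def connected_duplicates(records, matches):
    """Group records into connected components given pairwise
    match decisions, i.e., take the transitive closure of 'matches'."""
    parent = {r: r for r in records}

    def find(r):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    for a, b in matches:
        parent[find(a)] = find(b)      # union the two components

    groups = {}
    for r in records:
        groups.setdefault(find(r), []).append(r)
    return [sorted(g) for g in groups.values()]
```

If r1 matches r2 and r2 matches r3, all three are reported as one duplicate group even though r1 and r3 were never compared directly.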
DISTANCE BASED TECHNIQUES
These methods can be used in the absence of training data, or of the human effort needed to create matching rules.
Treat a record as one long field and use a distance metric to compare records.
The best matches are ranked using a weighting algorithm.
Alternatively, use a single field for matching; that field must be highly discriminating.
RULE BASED TECHNIQUES
Rely on business rules to derive a matching key.
Must determine functional dependencies among fields.
Require a subject matter expert to build the matching rules.
RULE BASED TECHNIQUES
The accompanying figure (not reproduced here) depicts an equational theory that dictates the logic of domain equivalence: it specifies inferences about the similarity of the records.
UNSUPERVISED LEARNING
Classifies data as matched or unmatched without a training set.
The comparison vector generally indicates which category a pair belongs to; when it does not, the labeling has to be done manually.
One way to avoid manual labeling is to use clustering:
Group together similar comparison vectors.
Each cluster contains vectors with similar characteristics.
By knowing the real class of only a few vectors per cluster, we can infer the class of all the vectors.
TECHNIQUES TO IMPROVE EFFICIENCY
Two goals: reduce the number of record comparisons, and improve the efficiency of each record comparison.
Elementary nested loop: compare every record in one table to every record in the other table.
This requires |A| * |B| comparisons (a Cartesian product), which is very expensive.
The cost of a single comparison also matters and must account for the number of fields per record.
REDUCE RECORD COMPARISONS
Techniques: blocking, sorted neighborhood, and clustering and canopies.
Blocking: partition the files into subsets (blocks), e.g., by computing a hash value for each record, and compare only records in the same bucket. Phonetic codes such as Soundex, NYSIIS, or Metaphone are common blocking keys.
The increase in speed may increase the number of missed matches; a compromise is to perform multiple runs using different blocking keys.
Sorted neighborhood: create a composite key, sort the data, and merge nearby records. Duplicate records will be close in the sorted order, but the result is highly dependent upon the comparison key.
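Blocking can be sketched as follows. The three-letter-prefix key and the toy records are assumptions for illustration; in practice a phonetic code such as Soundex serves as the key.

```python
from itertools import combinations

def block_pairs(records, key):
    """Blocking: assign each record to a bucket via a cheap key
    function, then generate candidate pairs only within buckets."""
    buckets = {}
    for r in records:
        buckets.setdefault(key(r), []).append(r)
    for bucket in buckets.values():
        yield from combinations(bucket, 2)

# Assumed toy key: first three characters of the "surname, first" string.
records = ["smith, john", "smith, jon", "smythe, anna", "jones, bob"]
pairs = list(block_pairs(records, key=lambda r: r[:3]))
```

The nested loop would compare all 6 pairs; blocking produces just 1 candidate pair. Note that "smythe" never meets "smith" under this key, illustrating why multiple runs with different keys are used.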
CLUSTERING AND CANOPIES
Clustering: duplicate records are kept in a cluster, and only a representative of each cluster is retained for future comparisons.
This reduces the total number of record comparisons without compromising accuracy.
Canopies: the records are grouped into overlapping clusters called "canopies," and records are then compared only within each canopy, which leads to better qualitative results.
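Canopy formation can be sketched with a cheap distance and two thresholds. The numeric items and the loose/tight values below are chosen purely for illustration; real systems use a cheap string distance here and run the expensive comparisons only inside each canopy.

```python
def canopies(items, cheap_dist, loose, tight):
    """Canopy clustering: pick a center, put everything within the
    loose threshold into its (possibly overlapping) canopy, and
    remove from future consideration anything within the tight one."""
    remaining = set(items)
    result = []
    while remaining:
        center = remaining.pop()
        canopy = {center}
        for item in list(remaining):
            d = cheap_dist(center, item)
            if d < loose:
                canopy.add(item)         # item may appear in several canopies
            if d < tight:
                remaining.discard(item)  # close enough: never a center itself
        result.append(canopy)
    return result
```

Items within the tight threshold of a center always share its canopy, so near-duplicates are guaranteed to be compared.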
Freely Extensible Biomedical Record Linkage
(FEBRL) - Python
TAILOR – MLC++, DBGen
WHIRL – C++
Flamingo Project - C
BigMatch - C
SQL Server 2005 implements "fuzzy" matching in its Integration Services.
Oracle 11g also implements these techniques.
There is a lack of a standardized, large-scale benchmarking dataset.
Training data is needed to produce accurate matching models.
The database community emphasizes simple, fast, and efficient techniques, while the machine learning and statistics communities rely on sophisticated techniques and probabilistic models.
More synergy is needed among the various communities.
Detection systems need to be adaptive over time.