Transcript Real-World Data Is Dirty
Real-World Data Is Dirty
Data Cleansing and the Merge/Purge Problem David Garnier 03/19/2001 Presented by Corey Gilmore, 02.April.2002
TOPICS
Introduction Basic Data Cleansing Solutions Test Results Incremental Merge Purge Conclusion cfg 02.April.2002
Introduction
The problem:
Some corporations acquire large amounts of information every month The data is stored in many large databases (DB) Theses databases may be heterogeneous Data in theses DB may simply be inaccurate cfg 02.April.2002
Requirement of the analysis
The data mining needs to be done Quickly Efficiently Accurately cfg 02.April.2002
Examples of real-world applications Credit card companies Assess risk of potential new customers Find false identities Match disparate records concerning a customer Mass Marketing companies Government agencies cfg 02.April.2002
Basic Data Cleansing Solutions
Duplicate Elimination
Sorted-Neighborhood Method (SNM) This is done in three phases Create a Key for each record Sort records on this key Merge/Purge records cfg 02.April.2002
SNM: Create key
First Compute a key for each record by extracting relevant fields or portions of fields Example: Last Address ID Key Sal Stolfo 123 First Street 45678987 STLSAL123FRST456 cfg 02.April.2002
SNM: Sort Data
Sort the records in the data list using the key in step 1 This can be very time consuming (N log N) Alternative: Instead of sorting, use a single scan process to create clusters cfg 02.April.2002
SNM: Merge records
Move a fixed size window through the sequential list of records.
This limits the comparisons to the records in the window cfg 02.April.2002
Question on SNM
What is the optimal window size while Maximizing accuracy Minimizing computational cost Execution time for large DB will be bound by Disk I/O Number of passes over the data set cfg 02.April.2002
Selection of Keys
The effectiveness the SNM highly depends on the key selected to sort the records A key is defined to be a sequence of a subset of attributes Keys must provide sufficient discriminating power cfg 02.April.2002
Example of Records and Keys
First Sal Sal Sal Sal Last Stolfo Address 123 First Street Stolfo Stolpho Stiles 123 First Street 123 First Street 123 Forest Street ID 45678987 45678987 45678987 45654321 Key STLSAL123FRST456 STLSAL123FRST456 STLSAL123FRST456 STLSAL123FRST456 cfg 02.April.2002
Equational Theory
The comparison during the merge phase is an inferential process Compares much more information than simply the key The more information there is, the better inferences can be made cfg 02.April.2002
Equational Theory - Example
Two names are spelled nearly identically and have the same address It may be inferred that they are the same person Two social security numbers are the same but the names and addresses are totally different Could be the same person who moved Could be two different people and there is an error in the social security number cfg 02.April.2002
A simplified rule in English
Given two records, r1 and r2 IF the last name of r1 equals the last name of r2, AND the first names differ slightly, AND the address of r1 equals the address of r2 THEN r1 is equivalent to r2 cfg 02.April.2002
The distance function
A “distance function” is used to compare pieces of data (usually text) Tested on Edit distance Phonetic distance “Typewriter” distance cfg 02.April.2002
Examples of matched records
SSN 334600443 334600443 525520001 525520001 0 0 0 0 850982319 950982319 Name (First, Initial, Last) Lisa Boardman Lisa Brown Ramon Bonilla Raymond Bonilla Diana D. Ambrosion Diana A. Dambrosion Colette Johnen John Colette Ivette A Keegan Yvette A Kegan cfg 02.April.2002
Address 144 Wars St.
144 Ward St.
38 Ward St.
38 Ward St.
40 Brik Church Av.
40 Brick Church Av.
600 113 th 600 113 th St. apt. 5a5 St. ap. 585 23 Florida Av.
23 Flordia St.
Building an equational theory
The process of creating a good equational theory is similar to the process of creating a good knowledge base for an expert system In complex problems, an expert’s assistance is needed to write the equational theory cfg 02.April.2002
Transitive Closure
In general, no single pass (i.e. no single key) will be sufficient to catch all matching records An attribute that appears first in the key has higher discriminating power than those appearing after them If an employee has two records in a DB with SNN 193456782 and 913456782, it’s unlikely they will fall under the same window cfg 02.April.2002
Transitive Closure
To increase the number of similar records merged Widen the scanning window size, w Execute several independent runs of the SNM Use a different key each time Use a relatively small window Call this the Multi-Pass approach cfg 02.April.2002
Transitive Closure
Each independent run of the Multi-Pass approach will produce a set of pairs of records Although one field in a record may be in error, another field may not Transitive closure can be applied to those pairs to be merged cfg 02.April.2002
Transitive Equality Example
IF A implies B AND B implies C THEN A implies C cfg 02.April.2002
Test Results
Test Environment
Test data was created by a database generator Names are randomly chosen from a list of 63000 real names The database generator provides a large number of parameters: size of the DB, percentage of duplicates, amount of error… cfg 02.April.2002
Correct Duplicate Detection
cfg 02.April.2002
Time for each run
cfg 02.April.2002
Accuracy for each run
cfg 02.April.2002
Real-World Test
Data was obtained from the Office of Children Administrative Research (OCAR) of the Department of Social and Health Services OCAR’s goals How long do children stay in foster care?
How many different homes do children typically stay in?
cfg 02.April.2002
OCAR’s Database
Most of OCAR’s data is stored in one relation The DB contains 6,000,000 total records The DB grows by about 50,000 records per month cfg 02.April.2002
Typical Problems in the DB
Names are frequently misspelled SSN or birthdays are either missing or clearly wrong Case number often changes when the child’s family moves to another part of the state Some records use service provider names instead of the child’s Finally, the schema is just awful… cfg 02.April.2002
OCAR Equational Theory
Keys for the independent runs Last Name, First Name, SSN, Case Number First Name, Last Name, SSN, Case Number Case Number, First Name, Last Name, SSN cfg 02.April.2002
OCAR Results
cfg 02.April.2002
Incremental Merge/Purge
New Data
Initially, the input list of records are concatenated and before being cleansed This is an unavoidable step and presumably acceptable the first time New data must be cleansed first before concatenating to an already clean DB cfg 02.April.2002
Incremental Merge/Purge
Simply concatenating new data before reapplying the merge/purge process may be very expensive in both time and space An incremental merge/purge approach is needed: the Prime Representatives method cfg 02.April.2002
Prime-Representative: Definition A set of records extracted from each cluster of records used to represent the information in the cluster The “Cluster Centroid” cfg 02.April.2002
Prime-Representative creation
Initially, no PR exists After the execution of the first merge/purge Input relations can be separated into cluster of similar records Sometimes more than one PR is required cfg 02.April.2002
Strategies for Choose a PR
Random Sample N-Latest Generalization Syntactic Utility cfg 02.April.2002
Important Assumption
No data previously used to select each cluster’s PR will be deleted No changes in the rule-set will occur after the first increment of data is processed cfg 02.April.2002
Results
Cumulative running time for the Incremental Merge/Purge algorithm is higher than the classic algorithm But total running time of the Incremental Merge/Purge algorithm is always smaller Finally, initial results seems interesting cfg 02.April.2002
Conclusion
Cleansing of Data
Sorted-Neighborhood Method is expensive Due heavily to the sort phase Performing multiple, relatively cheap processes result in a more accurate outcome Multiple, relatively cheap processes are also faster cfg 02.April.2002
Recap
Merging two databases is often complicated when the databases are heterogeneous, and when identifiers or strings in the data differ, or are erroneous.
cfg 02.April.2002
The 3 steps in SNM are: Creation of key(s) Sorting data Merging cfg 02.April.2002
A prime representative is a set of records from a cluster, considered to be representative of the data contained in the cluster Strategies for selecting a PR: Random Sample N-Latest Generalization Syntactic Utility cfg 02.April.2002
The End
This presentation may be found at: http://dln.uvm.edu/~cfg/cs295/ cfg 02.April.2002