Real-World Data Is Dirty

Download Report

Transcript Real-World Data Is Dirty

Real-World Data Is Dirty

Data Cleansing and the Merge/Purge Problem David Garnier 03/19/2001 Presented by Corey Gilmore, 02.April.2002

TOPICS

Introduction Basic Data Cleansing Solutions Test Results Incremental Merge Purge Conclusion cfg 02.April.2002

Introduction

The problem:

Some corporations acquire large amounts of information every month The data is stored in many large databases (DB) Theses databases may be heterogeneous Data in theses DB may simply be inaccurate cfg 02.April.2002

Requirement of the analysis

The data mining needs to be done    Quickly Efficiently Accurately cfg 02.April.2002

Examples of real-world applications Credit card companies   Assess risk of potential new customers Find false identities Match disparate records concerning a customer   Mass Marketing companies Government agencies cfg 02.April.2002

Basic Data Cleansing Solutions

Duplicate Elimination

Sorted-Neighborhood Method (SNM) This is done in three phases    Create a Key for each record Sort records on this key Merge/Purge records cfg 02.April.2002

SNM: Create key

First Compute a key for each record by extracting relevant fields or portions of fields Example: Last Address ID Key Sal Stolfo 123 First Street 45678987 STLSAL123FRST456 cfg 02.April.2002

SNM: Sort Data

Sort the records in the data list using the key in step 1 This can be very time consuming (N log N) Alternative: Instead of sorting, use a single scan process to create clusters cfg 02.April.2002

SNM: Merge records

Move a fixed size window through the sequential list of records.

This limits the comparisons to the records in the window cfg 02.April.2002

Question on SNM

What is the optimal window size while   Maximizing accuracy Minimizing computational cost Execution time for large DB will be bound by   Disk I/O Number of passes over the data set cfg 02.April.2002

Selection of Keys

The effectiveness the SNM highly depends on the key selected to sort the records A key is defined to be a sequence of a subset of attributes Keys must provide sufficient discriminating power cfg 02.April.2002

Example of Records and Keys

First Sal Sal Sal Sal Last Stolfo Address 123 First Street Stolfo Stolpho Stiles 123 First Street 123 First Street 123 Forest Street ID 45678987 45678987 45678987 45654321 Key STLSAL123FRST456 STLSAL123FRST456 STLSAL123FRST456 STLSAL123FRST456 cfg 02.April.2002

Equational Theory

The comparison during the merge phase is an inferential process Compares much more information than simply the key The more information there is, the better inferences can be made cfg 02.April.2002

Equational Theory - Example

Two names are spelled nearly identically and have the same address  It may be inferred that they are the same person Two social security numbers are the same but the names and addresses are totally different  Could be the same person who moved  Could be two different people and there is an error in the social security number cfg 02.April.2002

A simplified rule in English

Given two records, r1 and r2 IF the last name of r1 equals the last name of r2, AND the first names differ slightly, AND the address of r1 equals the address of r2 THEN r1 is equivalent to r2 cfg 02.April.2002

The distance function

A “distance function” is used to compare pieces of data (usually text) Tested on    Edit distance Phonetic distance “Typewriter” distance cfg 02.April.2002

Examples of matched records

SSN 334600443 334600443 525520001 525520001 0 0 0 0 850982319 950982319 Name (First, Initial, Last) Lisa Boardman Lisa Brown Ramon Bonilla Raymond Bonilla Diana D. Ambrosion Diana A. Dambrosion Colette Johnen John Colette Ivette A Keegan Yvette A Kegan cfg 02.April.2002

Address 144 Wars St.

144 Ward St.

38 Ward St.

38 Ward St.

40 Brik Church Av.

40 Brick Church Av.

600 113 th 600 113 th St. apt. 5a5 St. ap. 585 23 Florida Av.

23 Flordia St.

Building an equational theory

The process of creating a good equational theory is similar to the process of creating a good knowledge base for an expert system In complex problems, an expert’s assistance is needed to write the equational theory cfg 02.April.2002

Transitive Closure

In general, no single pass (i.e. no single key) will be sufficient to catch all matching records An attribute that appears first in the key has higher discriminating power than those appearing after them  If an employee has two records in a DB with SNN 193456782 and 913456782, it’s unlikely they will fall under the same window cfg 02.April.2002

Transitive Closure

To increase the number of similar records merged   Widen the scanning window size, w Execute several independent runs of the SNM  Use a different key each time  Use a relatively small window  Call this the Multi-Pass approach cfg 02.April.2002

Transitive Closure

Each independent run of the Multi-Pass approach will produce a set of pairs of records Although one field in a record may be in error, another field may not Transitive closure can be applied to those pairs to be merged cfg 02.April.2002

Transitive Equality Example

IF A implies B AND B implies C THEN A implies C cfg 02.April.2002

Test Results

Test Environment

Test data was created by a database generator  Names are randomly chosen from a list of 63000 real names The database generator provides a large number of parameters:  size of the DB,   percentage of duplicates, amount of error… cfg 02.April.2002

Correct Duplicate Detection

cfg 02.April.2002

Time for each run

cfg 02.April.2002

Accuracy for each run

cfg 02.April.2002

Real-World Test

Data was obtained from the Office of Children Administrative Research (OCAR) of the Department of Social and Health Services OCAR’s goals   How long do children stay in foster care?

How many different homes do children typically stay in?

cfg 02.April.2002

OCAR’s Database

Most of OCAR’s data is stored in one relation The DB contains 6,000,000 total records The DB grows by about 50,000 records per month cfg 02.April.2002

Typical Problems in the DB

Names are frequently misspelled SSN or birthdays are either missing or clearly wrong Case number often changes when the child’s family moves to another part of the state Some records use service provider names instead of the child’s Finally, the schema is just awful… cfg 02.April.2002

OCAR Equational Theory

Keys for the independent runs    Last Name, First Name, SSN, Case Number First Name, Last Name, SSN, Case Number Case Number, First Name, Last Name, SSN cfg 02.April.2002

OCAR Results

cfg 02.April.2002

Incremental Merge/Purge

New Data

Initially, the input list of records are concatenated and before being cleansed  This is an unavoidable step and presumably acceptable the first time New data must be cleansed first before concatenating to an already clean DB cfg 02.April.2002

Incremental Merge/Purge

Simply concatenating new data before reapplying the merge/purge process may be very expensive in both time and space An incremental merge/purge approach is needed: the Prime Representatives method cfg 02.April.2002

Prime-Representative: Definition A set of records extracted from each cluster of records used to represent the information in the cluster The “Cluster Centroid” cfg 02.April.2002

Prime-Representative creation

Initially, no PR exists After the execution of the first merge/purge  Input relations can be separated into cluster of similar records Sometimes more than one PR is required cfg 02.April.2002

Strategies for Choose a PR

Random Sample N-Latest Generalization Syntactic Utility cfg 02.April.2002

Important Assumption

No data previously used to select each cluster’s PR will be deleted No changes in the rule-set will occur after the first increment of data is processed cfg 02.April.2002

Results

Cumulative running time for the Incremental Merge/Purge algorithm is higher than the classic algorithm But total running time of the Incremental Merge/Purge algorithm is always smaller Finally, initial results seems interesting cfg 02.April.2002

Conclusion

Cleansing of Data

Sorted-Neighborhood Method is expensive  Due heavily to the sort phase Performing multiple, relatively cheap processes result in a more accurate outcome Multiple, relatively cheap processes are also faster cfg 02.April.2002

Recap

Merging two databases is often complicated when the databases are heterogeneous, and when identifiers or strings in the data differ, or are erroneous.

cfg 02.April.2002

   The 3 steps in SNM are: Creation of key(s) Sorting data Merging cfg 02.April.2002

A prime representative is a set of records from a cluster, considered to be representative of the data contained in the cluster      Strategies for selecting a PR: Random Sample N-Latest Generalization Syntactic Utility cfg 02.April.2002

The End

This presentation may be found at: http://dln.uvm.edu/~cfg/cs295/ cfg 02.April.2002