2. Data Preparation and Preprocessing

- Data and Its Forms
- Preparation
- Preprocessing and Data Reduction
Data Types and Forms

- Attribute-vector data: each instance is a vector of attribute values (A1, A2, ..., An) plus a class label C
- Data types
  - numeric, categorical (see the hierarchy for their relationship)
  - static, dynamic (temporal)
- Other data forms
  - distributed data
  - text, Web, metadata
  - images, audio/video
Data Preparation

- An important & time-consuming task in KDD
- Raw data typically come with problems that must be handled first:
  - high dimensionality (20, 100, 1000, ... attributes)
  - huge size (volume)
  - missing values
  - outliers
  - erroneous data (inconsistent, mis-recorded, distorted)
Data Preparation Methods

- Data annotation
- Data normalization
  - different types
  - examples: image pixels, age
- Dealing with sequential or temporal data
  - transform to tabular form
- Removing outliers
Normalization

- Decimal scaling:
  - v'(i) = v(i) / 10^k for the smallest k such that max(|v'(i)|) < 1
  - Example: for values ranging between -991 and 99, 10^k is 1000, so -991 → -0.991
- Min-max normalization into a new max/min range:
  - v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A
  - Example: v = 73600 in [12000, 98000] → v' = 0.716 in the new range [0, 1]
- Zero-mean normalization:
  - v' = (v - mean_A) / std_dev_A
  - Example: (1, 2, 3), with mean 2 and std_dev 1, becomes (-1, 0, 1)
  - If mean_Income = 54000 and std_dev_Income = 16000, then v = 73600 → 1.225
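All three schemes are short to express in code. Below is a minimal Python sketch (the function names are ours, not from the slides); zero-mean normalization uses the sample standard deviation (ddof=1) so that the (1, 2, 3) example above yields std_dev = 1.

```python
import numpy as np

def decimal_scaling(v):
    # Divide by 10^k for the smallest k with max(|v'|) < 1.
    k = int(np.floor(np.log10(np.max(np.abs(v))))) + 1
    return v / 10.0 ** k

def min_max(v, new_min=0.0, new_max=1.0):
    # Map [min_A, max_A] linearly onto [new_min, new_max].
    v_min, v_max = v.min(), v.max()
    return (v - v_min) / (v_max - v_min) * (new_max - new_min) + new_min

def zero_mean(v):
    # Z-score with the sample standard deviation, matching the slide's example.
    return (v - v.mean()) / v.std(ddof=1)

print(decimal_scaling(np.array([-991.0, 99.0])))  # [-0.991  0.099]
print(zero_mean(np.array([1.0, 2.0, 3.0])))       # [-1.  0.  1.]
```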
Temporal Data

- The goal is to forecast t(n+1) from the previous values X = {t(1), t(2), ..., t(n)}
- An example with two features and window size 3: the original series (left) is transformed into tabular instances (right)
- How to determine the window size?

  Time  A   B
  1     7   215
  2     10  211
  3     6   214
  4     11  221
  5     12  210
  6     14  218

  Inst  A(n-2)  A(n-1)  A(n)  B(n-2)  B(n-1)  B(n)
  1     7       10      6     215     211     214
  2     10      6       11    211     214     221
  3     6       11      12    214     221     210
  4     11      12      14    221     210     218
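The windowing transformation can be sketched as below (the helper name windowize is ours); each instance packs `width` consecutive readings of each feature into one row, reproducing the table above.

```python
def windowize(series, width=3):
    """Turn a list of (A, B) readings into tabular instances holding
    `width` consecutive values per feature."""
    rows = []
    for i in range(len(series) - width + 1):
        window = series[i:i + width]
        a_vals = [a for a, _ in window]   # A(n-2), A(n-1), A(n)
        b_vals = [b for _, b in window]   # B(n-2), B(n-1), B(n)
        rows.append(a_vals + b_vals)
    return rows

series = [(7, 215), (10, 211), (6, 214), (11, 221), (12, 210), (14, 218)]
for row in windowize(series):
    print(row)
# [7, 10, 6, 215, 211, 214]
# [10, 6, 11, 211, 214, 221]
# ...
```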
Outlier Removal

- Outlier: data points inconsistent with the majority of the data
- Different kinds of outliers
  - valid: a CEO's salary
  - noisy: a person's age = 200; widely deviated points
- Removal methods
  - clustering
  - curve-fitting
  - hypothesis-testing with a given model
Data Preprocessing

- Data cleaning
  - missing data
  - noisy data
  - inconsistent data
- Data reduction
  - dimensionality reduction
  - instance selection
  - value discretization
Missing Data

- Many types of missing data
  - not measured
  - not applicable
  - wrongly placed, ...
- Some methods
  - leave as is
  - ignore/remove the instance with the missing value
  - manual fix (assign a value for implicit meaning)
  - statistical methods (majority, most likely, mean, nearest neighbor, ...)
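As one illustration of the statistical methods, here is a minimal imputation sketch assuming pandas (the column names are hypothetical): the mean fills a numeric attribute, the majority value fills a categorical one.

```python
import pandas as pd

df = pd.DataFrame({"age": [23, None, 41, 35],
                   "city": ["A", "B", None, "B"]})

# Numeric attribute: fill with the mean of the observed values.
df["age"] = df["age"].fillna(df["age"].mean())
# Categorical attribute: fill with the majority (mode) value.
df["city"] = df["city"].fillna(df["city"].mode()[0])
```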
Noisy Data

- Noise: random error or variance in a measured variable. Where does it come from?
  - measuring errors (source)
  - inconsistent values for features or classes (processing)
- Noise is normally a minority in the data set. Why?
- Removing noise
  - clustering/merging
  - smoothing (rounding, averaging within a window)
  - outlier detection (deviation-based or distance-based)
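A minimal sketch of smoothing by averaging within a window (the function name is ours): each value is replaced by the mean of its neighborhood, which damps random fluctuations.

```python
def smooth(values, width=3):
    # Replace each value by the mean of a window centered on it;
    # windows are truncated at the ends of the sequence.
    half = width // 2
    out = []
    for i in range(len(values)):
        window = values[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out

print(smooth([4, 8, 15, 16, 23, 42]))
# [6.0, 9.0, 13.0, 18.0, 27.0, 32.5]
```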
Inconsistent Data

- Inconsistent with our models or with common sense
- Examples
  - the same entity occurs under different names in an application
  - different names appear the same (Dennis vs. Denis)
  - inappropriate values (male & pregnant, negative age)
  - one bank's database shows that 5% of its customers were born on 11/11/11
  - ...
Dimensionality Reduction

- Feature selection
  - select m of the n original features, m ≤ n
  - remove irrelevant and redundant features
  - benefit: savings in search space
- Feature transformation (e.g., PCA)
  - form new features (a) in a new domain from the original features (f)
  - many uses, but it does not reduce the original dimensionality
  - often used in visualization of data
Feature Selection

- Problem illustration: the search space ranges between the full set and the empty set; enumerating all subsets is the extreme approach
- Search
  - exhaustive/complete (enumeration, branch & bound)
  - heuristic (sequential forward/backward)
  - stochastic (generate and evaluate)
  - candidates: individual features or subsets; generation and evaluation
Feature Selection (2)

- Goodness metrics
  - Dependency: dependence on classes
  - Distance: separating classes
  - Information: entropy
  - Consistency: 1 - #inconsistencies/N
    - Example: for the data below, the feature sets (F1, F2, F3) and (F1, F3) both have a 2/6 inconsistency rate
  - Accuracy (classifier-based): 1 - errorRate
- Their comparisons: time complexity, number of features, removing redundancy

  F1  F2  F3  C
  0   0   1   1
  0   0   1   0
  0   0   1   1
  1   0   0   1
  1   0   0   0
  1   0   0   0
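The inconsistency count can be computed by grouping instances that agree on the selected features and counting everything outside each group's majority class. A minimal sketch (function and variable names are ours), reproducing the 2/6 rates above:

```python
from collections import Counter, defaultdict

def inconsistency_rate(rows, feature_idx):
    """#inconsistencies/N: for each distinct pattern over the chosen
    features, count the instances outside the majority class."""
    groups = defaultdict(Counter)
    for *features, cls in rows:
        pattern = tuple(features[i] for i in feature_idx)
        groups[pattern][cls] += 1
    inconsistencies = sum(sum(c.values()) - max(c.values())
                          for c in groups.values())
    return inconsistencies / len(rows)

data = [(0, 0, 1, 1), (0, 0, 1, 0), (0, 0, 1, 1),
        (1, 0, 0, 1), (1, 0, 0, 0), (1, 0, 0, 0)]
print(inconsistency_rate(data, [0, 1, 2]))  # (F1, F2, F3) -> 2/6
print(inconsistency_rate(data, [0, 2]))     # (F1, F3)     -> 2/6
```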
Feature Selection (3)

- Filter vs. wrapper model
  - pros and cons: time, generality, performance (such as accuracy)
- Stopping criteria
  - thresholding (number of iterations, some accuracy level, ...)
  - anytime algorithms
    - provide approximate solutions
    - solutions improve over time
Feature Selection (Examples)

- SFS using consistency (cRate)
  - select 1 from n features, then 1 from the remaining n-1, n-2, ... features
  - increase the number of selected features until the prespecified cRate is reached
- LVF using consistency (cRate)
  1. randomly generate a subset S from the full set
  2. if S satisfies the prespecified cRate, keep S if it has the fewest features so far (min #S)
  3. go back to 1 until a stopping criterion is met
  - LVF is an anytime algorithm
- Many other algorithms: SBS, B&B, ...
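A minimal LVF sketch, reusing inconsistency_rate from the earlier example (parameter names are ours): random subsets are generated, and a subset is kept when it meets the consistency threshold with fewer features than the current best, so the answer improves the longer it runs.

```python
import random

def lvf(rows, n_features, max_crate, iterations=1000):
    """Las Vegas Filter: keep the smallest random subset whose
    inconsistency rate stays within max_crate."""
    best = list(range(n_features))  # start from the full feature set
    for _ in range(iterations):
        size = random.randint(1, len(best))
        subset = random.sample(range(n_features), size)
        if (inconsistency_rate(rows, subset) <= max_crate
                and size < len(best)):
            best = subset
    return best
```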
Transformation: PCA

- D' = DA, where D is the mean-centered data matrix (N x n)
- Calculate and rank the eigenvalues λ_i of the covariance matrix
- r = (λ_1 + ... + λ_m) / (λ_1 + ... + λ_n)
- Select the m largest eigenvalues such that r > threshold (e.g., 0.95); the corresponding eigenvectors form A (n x m)
- Example: eigenvalues of the Iris data

  m  E-value  Diff     Prop     Cumu
  1  2.91082  1.98960  0.72771  0.72770
  2  0.92122  0.77387  0.23031  0.95801
  3  0.14735  0.12675  0.03684  0.99485
  4  0.02061           0.00515  1.00000

  Eigenvectors:

        V1         V2        V3         V4
  F1    0.522372   0.372318  -0.721017  -0.261996
  F2   -0.263355   0.925556   0.242033   0.124135
  F3    0.581254   0.021095   0.140892   0.801154
  F4    0.565611   0.065416   0.633801  -0.523546
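A minimal NumPy sketch of the procedure (the function name pca is ours): mean-center, eigendecompose the covariance matrix, keep enough of the largest eigenvalues to pass the threshold, and project.

```python
import numpy as np

def pca(D, threshold=0.95):
    """Project mean-centered data onto the top eigenvectors of the
    covariance matrix, keeping enough to reach r > threshold."""
    D = D - D.mean(axis=0)                      # mean-center
    eigvals, eigvecs = np.linalg.eigh(np.cov(D, rowvar=False))
    order = np.argsort(eigvals)[::-1]           # rank eigenvalues, largest first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    r = np.cumsum(eigvals) / eigvals.sum()      # cumulative proportion
    m = int(np.searchsorted(r, threshold)) + 1  # smallest m with r > threshold
    A = eigvecs[:, :m]                          # A is n x m
    return D @ A                                # D' = DA

# On the Iris data with threshold 0.95, m = 2 (cumulative proportion 0.958).
```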
Instance Selection

- Sampling methods
  - random sampling
  - stratified sampling
- Search-based methods
  - representatives
  - prototypes
  - sufficient statistics (N, mean, stdDev)
  - support vectors
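As an illustration of stratified sampling, a minimal sketch (names are ours): each class (stratum) is sampled independently, so class proportions are preserved in the reduced set.

```python
import random
from collections import defaultdict

def stratified_sample(rows, labels, fraction):
    """Sample each class independently so class proportions are kept."""
    strata = defaultdict(list)
    for row, label in zip(rows, labels):
        strata[label].append(row)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(random.sample(members, k))
    return sample
```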
Value Discretization

- Binning methods
  - equal-width
  - equal-frequency
  - class information is not used
- Entropy-based
- ChiMerge
  - Chi2
Binning

- Attribute values (for one attribute, e.g., age):
  0, 4, 12, 16, 16, 18, 24, 26, 28
- Equi-width binning, for a bin width of e.g. 10:
  - Bin 1, [-,10): 0, 4
  - Bin 2, [10,20): 12, 16, 16, 18
  - Bin 3, [20,+): 24, 26, 28
  - (we use - to denote negative infinity, + for positive infinity)
- Equi-frequency binning, for a bin density of e.g. 3:
  - Bin 1, [-,14): 0, 4, 12
  - Bin 2, [14,21): 16, 16, 18
  - Bin 3, [21,+): 24, 26, 28
- Any problems with the above methods?
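Both schemes fit in a few lines; a minimal Python sketch (function names are ours) that reproduces the bins above:

```python
def equal_width(values, width):
    # Bin boundaries at multiples of `width` (assumes non-negative values).
    bins = {}
    for v in sorted(values):
        bins.setdefault(v // width, []).append(v)
    return list(bins.values())

def equal_frequency(values, density):
    # Consecutive groups of `density` sorted values.
    s = sorted(values)
    return [s[i:i + density] for i in range(0, len(s), density)]

ages = [0, 4, 12, 16, 16, 18, 24, 26, 28]
print(equal_width(ages, 10))     # [[0, 4], [12, 16, 16, 18], [24, 26, 28]]
print(equal_frequency(ages, 3))  # [[0, 4, 12], [16, 16, 18], [24, 26, 28]]
```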
Entropy-based

- Given attribute-value/class pairs:
  (0,P), (4,P), (12,P), (16,N), (16,N), (18,P), (24,N), (26,N), (28,N)
- Entropy-based binning via binarization:
  - intuitively, find the best split so that the bins are as pure as possible
  - formally characterized by maximal information gain
- Let S denote the above 9 pairs, p = 4/9 the fraction of P pairs, and n = 5/9 the fraction of N pairs
- Entropy(S) = -p log p - n log n
  - smaller entropy: the set is relatively pure; the smallest is 0
  - larger entropy: the set is mixed; the largest is 1
Entropy-based (2)

- Let v be a possible split. Then S is divided into two sets:
  - S1: value <= v, and S2: value > v
- Information of the split:
  - I(S1,S2) = (|S1|/|S|) Entropy(S1) + (|S2|/|S|) Entropy(S2)
- Information gain of the split:
  - Gain(v,S) = Entropy(S) - I(S1,S2)
  - maximum Gain means minimum I
- Goal: find the split with maximal information gain
- Possible splits: midpoints between any two consecutive values
  - For v = 14: I(S1,S2) = 0 + (6/9) * Entropy(S2) = (6/9) * 0.65 = 0.433, so Gain(14,S) = Entropy(S) - 0.433
- The best split is found after examining all possible split points
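The binarization search can be sketched as follows (function names are ours); on the 9 pairs above it reproduces v = 14 as the split with maximal gain.

```python
from math import log2

def entropy(pairs):
    n = len(pairs)
    counts = {}
    for _, cls in pairs:
        counts[cls] = counts.get(cls, 0) + 1
    return -sum(c / n * log2(c / n) for c in counts.values())

def best_split(pairs):
    """Try the midpoint between consecutive distinct values; return the
    split v with maximal information gain Entropy(S) - I(S1, S2)."""
    values = sorted({v for v, _ in pairs})
    base = entropy(pairs)
    best = None
    for lo, hi in zip(values, values[1:]):
        v = (lo + hi) / 2
        s1 = [p for p in pairs if p[0] <= v]
        s2 = [p for p in pairs if p[0] > v]
        info = (len(s1) * entropy(s1) + len(s2) * entropy(s2)) / len(pairs)
        if best is None or base - info > best[1]:
            best = (v, base - info)
    return best

pairs = [(0, 'P'), (4, 'P'), (12, 'P'), (16, 'N'), (16, 'N'),
         (18, 'P'), (24, 'N'), (26, 'N'), (28, 'N')]
print(best_split(pairs))  # (14.0, ...): v = 14 gives the maximal gain
```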
ChiMerge and Chi2

- Given attribute-value/class pairs, e.g.:

  F   C
  12  P
  12  N
  12  P
  16  N
  16  N
  16  P
  24  N
  24  N
  24  N

- Build a contingency table for every pair of adjacent intervals:

        C1   C2   sum
  I-1   A11  A12  R1
  I-2   A21  A22  R2
  sum   C1   C2   N

- Chi-squared test (goodness-of-fit):
  chi² = Σ_{i=1..2} Σ_{j=1..k} (A_ij - E_ij)² / E_ij,
  where E_ij = R_i * C_j / N is the expected frequency
- Parameters: df = k-1 and a p% level of significance
- The Chi2 algorithm provides an automatic way to adjust p
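A minimal sketch of the chi² statistic for one pair of adjacent intervals (names are ours): ChiMerge repeatedly merges the adjacent pair with the lowest chi², stopping when the threshold implied by df and the significance level is exceeded.

```python
def chi_squared(interval1, interval2):
    """Chi-squared statistic for two adjacent intervals, each given as
    a list of class counts, e.g. interval1 = [count_C1, count_C2]."""
    rows = [interval1, interval2]
    col_totals = [sum(col) for col in zip(*rows)]  # C_j
    n = sum(col_totals)                            # N
    chi2 = 0.0
    for row in rows:
        r = sum(row)                               # R_i
        for a, c in zip(row, col_totals):
            e = r * c / n                          # E_ij = R_i * C_j / N
            if e:
                chi2 += (a - e) ** 2 / e
    return chi2

# Intervals for F=12 (P=2, N=1) and F=16 (P=1, N=2) from the table above.
print(chi_squared([2, 1], [1, 2]))
```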
Summary

- Data have many forms
  - attribute-vectors: the most common form
- Raw data need to be prepared and preprocessed for data mining
  - data miners have to work on the data provided
  - domain expertise is important in DPP
  - data preparation: normalization, transformation
  - data preprocessing: cleaning and reduction
- DPP is a critical and time-consuming task. Why?
Bibliography

- H. Liu & H. Motoda, 1998. Feature Selection for Knowledge Discovery and Data Mining. Kluwer.
- M. Kantardzic, 2003. Data Mining: Concepts, Models, Methods, and Algorithms. IEEE Press and Wiley-Interscience.
- H. Liu & H. Motoda, eds., 2001. Instance Selection and Construction for Data Mining. Kluwer.
- H. Liu, F. Hussain, C.L. Tan, and M. Dash, 2002. Discretization: An Enabling Technique. DMKD 6:393-423.