2011 Data Mining
Industrial & Information Systems Engineering
Chapter 3:
Data Exploration & Dimension Reduction
• Pilsung Kang
• Industrial & Information Systems Engineering
• Seoul National University of Science & Technology
Steps in Data Mining revisited
1. Define and understand the purpose of the data mining project
2. Formulate the data mining problem
3. Obtain/verify/modify the data
4. Explore and customize the data
5. Build data mining models
6. Evaluate and interpret the results
7. Deploy and monitor the model
Example: Boston Housing Data
Define and understand the purpose of the data mining project
  - Stabilize the local economy by maintaining home prices.
Example: Boston Housing Data
Formulate the data mining problem
  - What is the purpose?
    • To predict the median value of a housing unit in the neighborhood.
  - What data mining task is appropriate?
    • Prediction.
Example: Boston Housing Data
Obtain/verify/modify the data: Data acquisition
Variable   Description
CRIM       per capita crime rate by town
ZN         proportion of residential land zoned for lots over 25,000 sq. ft.
INDUS      proportion of non-retail business acres per town
CHAS       Charles River dummy variable (1 if tract bounds river; 0 otherwise)
NOX        nitric oxides concentration (parts per 10 million)
RM         average number of rooms per dwelling
AGE        proportion of owner-occupied units built prior to 1940
DIS        weighted distances to five Boston employment centres
RAD        index of accessibility to radial highways
TAX        full-value property-tax rate per $10,000
PTRATIO    pupil-teacher ratio by town
B          1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town
LSTAT      lower status of the population
MEDV       median value of owner-occupied homes in $1000s
Example: Boston Housing Data
Obtain/verify/modify the data: Data example
[Table: sample records of the Boston Housing data showing CRIM, ZN, INDUS, CHAS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, B, LSTAT, MEDV, and the derived CAT.MEDV column.]
Example: Boston Housing Data
Data exploration: Basic statistics
Variable   Average   Median      Min      Max    Stdev   Skew.    Kurt.
CRIM          3.61     0.26     0.01    88.98     8.60    5.22    37.13
ZN           11.36     0.00     0.00   100.00    23.32    2.23     4.03
INDUS        11.14     9.69     0.46    27.74     6.86    0.30    -1.23
CHAS          0.07     0.00     0.00     1.00     0.25    3.41     9.64
NOX           0.55     0.54     0.39     0.87     0.12    0.73    -0.06
RM            6.28     6.21     3.56     8.78     0.70    0.40     1.89
AGE          68.57    77.50     2.90   100.00    28.15   -0.60    -0.97
DIS           3.80     3.21     1.13    12.13     2.11    1.01     0.49
RAD           9.55     5.00     1.00    24.00     8.71    1.00    -0.87
TAX         408.24   330.00   187.00   711.00   168.54    0.67    -1.14
PTRATIO      18.46    19.05    12.60    22.00     2.16   -0.80    -0.29
B           356.67   391.44     0.32   396.90    91.29   -2.89     7.23
LSTAT        12.65    11.36     1.73    37.97     7.14    0.91     0.49
MEDV         22.53    21.20     5.00    50.00     9.20    1.11     1.50
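The same summary table can be reproduced with pandas. A minimal sketch, assuming the data set is available as a CSV file named BostonHousing.csv with the column names listed above (the file name is an assumption, not from the slides):

```python
# Minimal sketch: basic statistics of the Boston Housing data with pandas.
import pandas as pd

df = pd.read_csv("BostonHousing.csv")   # assumed file name

summary = pd.DataFrame({
    "Average": df.mean(numeric_only=True),
    "Median":  df.median(numeric_only=True),
    "Min":     df.min(numeric_only=True),
    "Max":     df.max(numeric_only=True),
    "Stdev":   df.std(numeric_only=True),
    "Skew.":   df.skew(numeric_only=True),
    "Kurt.":   df.kurt(numeric_only=True),
})
print(summary.round(2))
```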
Example: Boston Housing Data
Data exploration: Single variable
• Histogram
  - Shows a rough distribution of a single variable.

[Figure: histogram of MEDV; frequency on the vertical axis, MEDV from 5 to 50 on the horizontal axis.]
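A minimal sketch of the histogram above, under the same BostonHousing.csv assumption:

```python
# Minimal sketch: histogram of MEDV.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("BostonHousing.csv")   # assumed file name
df["MEDV"].plot(kind="hist", bins=9, edgecolor="black")
plt.xlabel("MEDV")
plt.ylabel("Frequency")
plt.title("Histogram of MEDV")
plt.show()
```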
Example: Boston Housing Data
Data exploration: Single variable
• Box plot
  - Shows basic statistics of a single variable.

[Figure: left, a single box plot of MEDV annotated with outliers, "max", quartile 3, median, mean, quartile 1, and "min"; right, a conditional box plot of MEDV grouped by CHAS (0 vs. 1).]
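A minimal sketch of the single and conditional box plots, again assuming BostonHousing.csv:

```python
# Minimal sketch: single box plot of MEDV and conditional box plot by CHAS.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("BostonHousing.csv")   # assumed file name

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
df.boxplot(column="MEDV", ax=axes[0])             # single box plot
df.boxplot(column="MEDV", by="CHAS", ax=axes[1])  # conditional on CHAS
plt.show()
```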
Example: Boston Housing Data
Data exploration: Multiple variables
• Correlation analysis
  - Shows the correlation between every pair of variables.
  - Helps to select a representative variable among highly correlated (positively or negatively) variables.

          CRIM     ZN  INDUS   CHAS    NOX     RM    AGE    DIS    RAD    TAX  PTRATIO     B  LSTAT   MEDV
CRIM      1.00
ZN       -0.20   1.00
INDUS     0.41  -0.53   1.00
CHAS     -0.06  -0.04   0.06   1.00
NOX       0.42  -0.52   0.76   0.09   1.00
RM       -0.22   0.31  -0.39   0.09  -0.30   1.00
AGE       0.35  -0.57   0.64   0.09   0.73  -0.24   1.00
DIS      -0.38   0.66  -0.71  -0.10  -0.77   0.21  -0.75   1.00
RAD       0.63  -0.31   0.60  -0.01   0.61  -0.21   0.46  -0.49   1.00
TAX       0.58  -0.31   0.72  -0.04   0.67  -0.29   0.51  -0.53   0.91   1.00
PTRATIO   0.29  -0.39   0.38  -0.12   0.19  -0.36   0.26  -0.23   0.46   0.46    1.00
B        -0.39   0.18  -0.36   0.05  -0.38   0.13  -0.27   0.29  -0.44  -0.44   -0.18  1.00
LSTAT     0.46  -0.41   0.60  -0.05   0.59  -0.61   0.60  -0.50   0.49   0.54    0.37 -0.37   1.00
MEDV     -0.39   0.36  -0.48   0.18  -0.43   0.70  -0.38   0.25  -0.38  -0.47   -0.51  0.33  -0.74   1.00

(Lower triangle only; the matrix is symmetric.)
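A minimal sketch that computes the same correlation matrix and flags strongly correlated pairs; the 0.7 cutoff and the file name are assumptions:

```python
# Minimal sketch: pairwise correlation matrix of the Boston Housing data.
import pandas as pd

df = pd.read_csv("BostonHousing.csv")   # assumed file name
corr = df.corr(numeric_only=True).round(2)
print(corr)

# Flag strongly correlated pairs, |r| >= 0.7 (each pair appears twice).
strong = corr.where(corr.abs() >= 0.7).stack()
print(strong[strong < 1.0])
```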
Example: Boston Housing Data
Data exploration: Multiple variables
• Scatter plot matrix
  - Shows the pairwise interactions between variables.

[Figure: scatter plot matrix of the Boston Housing variables.]
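A minimal sketch of a scatter plot matrix for a few of the variables, using pandas' scatter_matrix; the variable subset is chosen only for illustration:

```python
# Minimal sketch: scatter plot matrix for selected variables.
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

df = pd.read_csv("BostonHousing.csv")   # assumed file name
scatter_matrix(df[["CRIM", "RM", "LSTAT", "MEDV"]], figsize=(8, 8), diagonal="hist")
plt.show()
```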
Example: Boston Housing Data
Data exploration: Multiple variables
• Pivot table in Excel
  - A user-specified data summarization tool.
  - Able to reveal non-linear relations between two variables (see the sketch after the table).

Average of MEDV by CHAS and binned RM:

                      Binned_RM
CHAS           4.05     5.15     6.25     7.35     8.84   Grand total
0             15.35    15.83    20.11    33.08    46.06    22.09
1                      17.00    25.19    37.71    40.63    28.44
Grand total   15.35    15.91    20.41    33.55    44.97    22.53
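A minimal sketch of the same kind of pivot table in pandas; the five equal-width RM bins are an assumption meant to mimic the slide's Binned_RM:

```python
# Minimal sketch: average MEDV by CHAS and binned RM.
import pandas as pd

df = pd.read_csv("BostonHousing.csv")                 # assumed file name
df["Binned_RM"] = pd.cut(df["RM"], bins=5).astype(str)  # 5 equal-width RM bins

pivot = pd.pivot_table(df, values="MEDV", index="CHAS",
                       columns="Binned_RM", aggfunc="mean", margins=True)
print(pivot.round(2))
```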
Dimensionality Reduction
Data customization: Dimensionality reduction
• Variable selection
  - Select a small set of the original variables.
    • Filter: the variable selection and model building processes are independent.
    • Wrapper: variable selection is guided by the results of data mining models (forward, backward, stepwise).
• Variable extraction
  - Construct a small set of variables by transforming and combining the original variables.
  - An independent performance criterion is used.
Dimensionality Reduction
Variable selection: Filter approach
• Example: Select variables based on the correlation matrix (a sketch of this idea follows the table).
  - Remove NOX, AGE, DIS, and TAX.
    • NOX, DIS, and TAX are highly correlated with INDUS.
    • AGE is highly correlated with NOX.

[Table: the same correlation matrix as above, with MEDV excluded.]
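A minimal sketch of the filter idea: greedily drop any variable whose absolute correlation with an already-kept variable exceeds a threshold. The 0.7 threshold and the file name are assumptions, and the automatic result need not match the slide's hand-picked removal exactly:

```python
# Minimal sketch: correlation-based filter for variable selection.
import pandas as pd

df = pd.read_csv("BostonHousing.csv")          # assumed file name
X = df.drop(columns=["MEDV"])

corr = X.corr(numeric_only=True).abs()
kept, dropped = [], []
for col in corr.columns:
    if any(corr.loc[col, k] >= 0.7 for k in kept):
        dropped.append(col)                    # too similar to a kept variable
    else:
        kept.append(col)

print("kept:   ", kept)
print("dropped:", dropped)
```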
Dimensionality Reduction
Variable selection: Wrapper approach
• Select variables based on the model results (a forward-selection sketch follows this list).
  - Forward selection
    • Start with the most relevant variable.
    • Add another variable if it increases the accuracy of the data mining model.
  - Backward elimination
    • Start with all variables.
    • Remove the most irrelevant variable if the accuracy of the data mining model increases (or at least does not decrease).
  - Stepwise selection
    • Alternate forward selection and backward elimination.
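A minimal sketch of wrapper-style forward selection, using linear regression with cross-validated R² as the model-based criterion; both choices and the file name are assumptions (scikit-learn's SequentialFeatureSelector offers ready-made forward and backward variants):

```python
# Minimal sketch: forward selection guided by cross-validated model accuracy.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("BostonHousing.csv")            # assumed file name
X, y = df.drop(columns=["MEDV"]), df["MEDV"]

selected, best_score = [], float("-inf")
remaining = list(X.columns)
while remaining:
    # Try adding each remaining variable and keep the best candidate.
    scores = {v: cross_val_score(LinearRegression(), X[selected + [v]], y,
                                 cv=5, scoring="r2").mean()
              for v in remaining}
    v, score = max(scores.items(), key=lambda kv: kv[1])
    if score <= best_score:                      # stop when accuracy no longer improves
        break
    selected.append(v)
    remaining.remove(v)
    best_score = score

print("selected variables:", selected)
```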
Dimensionality Reduction
Variable extraction: Principal component analysis (PCA)
• Purpose
  - Preserve the variance as much as possible with fewer bases.
• Example:

[Figure: scatter plot of a two-variable example data set; both axes range from 0 to 3.5.]
Dimensionality Reduction
Variable extraction: Principal component analysis (PCA)
• Mathematical backgrounds
  - Projection: project a vector b onto a vector a; the projection is x = pa.

      (b - pa)^T a = 0  ⇒  b^T a - p a^T a = 0  ⇒  p = b^T a / (a^T a)
      x = pa = (b^T a / a^T a) a

      If a is a unit vector: p = b^T a and x = pa = (b^T a) a

[Figure: vectors a and b with the projection x = pa.]
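A minimal numeric check of the projection formula; the vectors a and b below are arbitrary examples, not from the slides:

```python
# Minimal sketch: projection of b onto a, and orthogonality of the residual.
import numpy as np

a = np.array([2.0, 1.0])
b = np.array([1.0, 3.0])

p = (b @ a) / (a @ a)        # p = b^T a / a^T a
x = p * a                    # projection of b onto a
print(x, (b - x) @ a)        # residual is orthogonal to a: ~0.0
```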
Dimensionality Reduction
Variable extraction: Principal component analysis (PCA)
• Mathematical backgrounds
  - Covariance:
    • X: a data set (m by n; m: # of variables, n: # of records).

        Cov(X) = (1/n) (X - X̄)(X - X̄)^T

    • Cov(X)_ij = Cov(X)_ji
    • Total variance of the data set
        = tr[Cov(X)]
        = Cov(X)_11 + Cov(X)_22 + Cov(X)_33 + … + Cov(X)_mm
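A minimal sketch of the covariance formula and the trace identity, using the slide's m-by-n convention and the two-variable example data that appears later in the PCA procedure:

```python
# Minimal sketch: Cov(X) = (1/n)(X - Xbar)(X - Xbar)^T and total variance = trace.
import numpy as np

X = np.array([[2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1],   # x1
              [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9]])  # x2
n = X.shape[1]
Xc = X - X.mean(axis=1, keepdims=True)   # subtract each variable's mean
cov = (Xc @ Xc.T) / n                    # the slide's 1/n convention

print(cov)
print("total variance:", np.trace(cov), "=", cov[0, 0] + cov[1, 1])
```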
Dimensionality Reduction
Variable extraction: Principal component analysis (PCA)
• Mathematical backgrounds
  - Eigen problem:
    • Given a matrix A, a scalar value λ and a vector x that satisfy

        Ax = λx,  or  (A - λI)x = 0,

      are called an eigenvalue and an eigenvector, respectively.
  - If A is an m by m non-singular matrix,
    • There are m different eigenvalues and eigenvectors.
    • Eigenvectors are orthogonal.
    • tr(A) = λ1 + λ2 + λ3 + … + λm
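A minimal sketch with NumPy, using the sample covariance matrix that appears later in the example; it checks that tr(A) equals the sum of the eigenvalues and that the eigenvectors are orthogonal:

```python
# Minimal sketch: eigenvalues/eigenvectors and the trace identity.
import numpy as np

A = np.array([[0.6166, 0.6154],
              [0.6154, 0.7166]])         # the sample covariance used later
eigvals, eigvecs = np.linalg.eig(A)

print("eigenvalues :", eigvals)           # ~0.0491 and ~1.2840 (order not guaranteed)
print("eigenvectors:\n", eigvecs)         # columns are eigenvectors
print("tr(A) =", np.trace(A), " sum of eigenvalues =", eigvals.sum())
print("orthogonality check:", eigvecs[:, 0] @ eigvecs[:, 1])   # ~0
```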
Dimensionality Reduction
Variable extraction: Principal component analysis (PCA)
• PCA Procedure 1: Normalize the data

Original data        Normalized data (mean-centered)
  x1     x2            x1      x2
  2.5    2.4           0.69    0.49
  0.5    0.7          -1.31   -1.21
  2.2    2.9           0.39    0.99
  1.9    2.2           0.09    0.29
  3.1    3.0           1.29    1.09
  2.3    2.7           0.49    0.79
  2.0    1.6           0.19   -0.31
  1.0    1.1          -0.81   -0.81
  1.5    1.6          -0.31   -0.31
  1.1    0.9          -0.71   -1.01

[Figure: scatter plots of the original data (X1 from -1 to 4) and the normalized data (X1 from -2 to 2).]
Dimensionality Reduction
Variable extraction: Principal component analysis (PCA)
• PCA Procedure 2: Formulate the problem
  - If a set of vectors x is projected onto w, then the variance after projection becomes

      V = (w^T X)(w^T X)^T = w^T X X^T w = n w^T S w

    where S is the sample covariance matrix of the normalized x.
  - PCA aims at maximizing V:

      max  w^T S w
      s.t. w^T w = 1

  - For the example data:

      S = [ 0.6166  0.6154
            0.6154  0.7166 ]
Dimensionality Reduction
Variable extraction: Principal component analysis (PCA)
• PCA Procedure 3: Solve the problem
  - Use a Lagrangian multiplier.

      max  w^T S w   s.t.  w^T w = 1
      L = w^T S w - λ(w^T w - 1)
      ∂L/∂w = 0  ⇒  S w - λw = 0  ⇒  (S - λI)w = 0

  - For the example data:

      Eigenvectors = [ -0.7352   0.6779
                        0.6779   0.7352 ]
      Eigenvalues  = [ 0.0491   1.2840 ]
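A minimal sketch of Procedures 1-3 on the example data; it reproduces S, the eigenvalues, and the eigenvectors above (eigenvector signs may be flipped):

```python
# Minimal sketch: normalize, build the covariance matrix, solve the eigen problem.
import numpy as np

x1 = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1])
x2 = np.array([2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9])

X = np.vstack([x1 - x1.mean(), x2 - x2.mean()])   # Procedure 1: mean-center
S = np.cov(X)                                     # Procedure 2: sample covariance
print("S =\n", S)                                 # ~[[0.6166, 0.6154], [0.6154, 0.7166]]

eigvals, eigvecs = np.linalg.eigh(S)              # Procedure 3: eigen problem
print("eigenvalues :", eigvals)                   # ~[0.0491, 1.2840]
print("eigenvectors:\n", eigvecs)                 # columns; signs may differ from the slide
```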
Dimensionality Reduction
Variable extraction: Principal component analysis (PCA)
• PCA Procedure 4: Select the bases
  - In descending order of eigenvalues:

      FeatureVector = (eig_1, eig_2, …, eig_n)

      FeatureVector = [ 0.6779  -0.7352
                        0.7352   0.6779 ]

  - With only one basis, 96% of the original variance is preserved.

      Let w1 be one of the eigenvectors and λ1 the corresponding eigenvalue.
      The variation of the samples projected onto w1 is
        (w1^T X)(w1^T X)^T = w1^T X X^T w1 = w1^T S w1
      Since S w1 = λ1 w1,
        w1^T S w1 = w1^T λ1 w1 = λ1 w1^T w1 = λ1
Dimensionality Reduction
Variable extraction: Principal component analysis (PCA)
• PCA Procedure 5: Construct new data

   x1      x2      z1
  0.69    0.49    0.83
 -1.31   -1.21   -1.78
  0.39    0.99    0.99
  0.09    0.29    0.27
  1.29    1.09    1.68
  0.49    0.79    0.91
  0.19   -0.31   -0.10
 -0.81   -0.81   -1.14
 -0.31   -0.31   -0.44
 -0.71   -1.01   -1.22

[Figure: scatter plot of the normalized data (x1, x2) and of the new one-dimensional data z1.]
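A minimal sketch of Procedures 4-5: keep the eigenvector with the largest eigenvalue and project the normalized data onto it to obtain z1 (the sign of z1 may be flipped relative to the slide):

```python
# Minimal sketch: select the best basis and project the data onto it.
import numpy as np

x1 = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1])
x2 = np.array([2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9])
X = np.vstack([x1 - x1.mean(), x2 - x2.mean()])   # normalized data
eigvals, eigvecs = np.linalg.eigh(np.cov(X))

w1 = eigvecs[:, np.argmax(eigvals)]               # Procedure 4: basis with largest eigenvalue
print("variance preserved:", eigvals.max() / eigvals.sum())   # ~0.96

z1 = w1 @ X                                       # Procedure 5: new 1-D data
print(np.round(z1, 2))                            # ~[0.83 -1.78 0.99 ...] (sign may flip)
```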
Dimensionality Reduction
PCA Example: Breakfast cereals
• Original data
Dimensionality Reduction
PCA Example: Breakfast cereals
• When there are only two variables
Dimensionality Reduction
PCA Example: Breakfast cereals
• Covariance matrix

            Calories    Rating
Calories      379.63   -188.68
Rating       -188.68    197.32

• Scatter plot
[Figure: scatter plot of calories vs. rating.]
Dimensionality Reduction
PCA Example: Breakfast cereals
• Eigenvalues and eigenvectors

                        Components
Variable                1              2
calories          -0.84705347     0.53150767
rating             0.53150767     0.84705347
Variance         498.0244751     78.932724
Variance%         86.31913757    13.68086338
Cum%              86.31913757   100
P-value            0               1
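A minimal sketch of the same analysis with scikit-learn's PCA; the file name Cereals.csv and the column names calories and rating are assumptions based on the slide:

```python
# Minimal sketch: PCA on the two cereal variables.
import pandas as pd
from sklearn.decomposition import PCA

df = pd.read_csv("Cereals.csv")[["calories", "rating"]].dropna()   # assumed file/columns
print(df.cov())                        # ~[[379.63, -188.68], [-188.68, 197.32]]

pca = PCA(n_components=2)
scores = pca.fit_transform(df)         # the newly constructed variables
print(pca.components_)                 # loadings; rows are components, signs may flip
print(pca.explained_variance_ratio_)   # ~[0.863, 0.137]
```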
Dimensionality Reduction
PCA Example: Breakfast cereals
• Newly constructed variables
Dimensionality Reduction
PCA Example: Breakfast cereals
• General case: more than two variables
Dimensionality Reduction
PCA Example: Breakfast cereals
• Scatter plot on principal components

[Figure: the data plotted on the first two principal components.]