2011 Data Mining
Industrial & Information Systems Engineering
Chapter 4:
Evaluating Classification &
Predictive Performance
•Pilsung Kang
•Industrial & Information Systems Engineering
•Seoul National University of Science & Technology
Steps in Data Mining revisited
1. Define and understand the purpose of the data mining project
2. Formulate the data mining problem
3. Obtain/verify/modify the data
4. Explore and customize the data
5. Build data mining models
6. Evaluate and interpret the results
7. Deploy and monitor the model
Why Evaluate?
 Over-fitting to the training data
[Figure: red and blue decision boundaries drawn over the training, validation, and test data.]
Is the red boundary better than the blue one?
Why Evaluate?
 Over-fitting to the training data
Do not memorize them all!!
[Figure: the same training, validation, and test data.]
Why Evaluate?
 Multiple methods are available to classify or predict.
 Classification:
• Naïve Bayes, linear discriminant analysis, k-nearest neighbors, classification trees, etc.
 Prediction:
• Multiple linear regression, neural networks, regression trees, etc.
 For each method, multiple choices are available for settings.
 Neural networks: # hidden nodes, activation functions, etc.
 To choose the best model, we need to assess each model’s performance.
 Best setting (parameters) among various candidates for an algorithm
(validation).
 Best model among various data mining algorithms for the task (test).
Classification Performance
Example: Gender classification
 Classify a person based on his/her body fat percentage (BFP).
BFP: 10.0, 21.7, 8.9, 19.9, 23.4, 28.9, 15.7, 21.6, 21.5, 23.2
 Simple classifier: if BFP > 20 then female else male.
BFP:        10.0   21.7   8.9   19.9   23.4   28.9   15.7   21.6   21.5   23.2
Predicted:   M      F     M     M      F      F      M      F      F      F
 How do you evaluate the performance of the above classifier?
Classification Performance
Confusion Matrix
 Summarizes the correct and incorrect classifications that a
classifier produced for a certain data set.
BFP:        10.0   21.7   8.9   19.9   23.4   28.9   15.7   21.6   21.5   23.2
Predicted:   M      F     M     M      F      F      M      F      F      F
 The confusion matrix can be constructed as:

Confusion Matrix   Predicted F   Predicted M
Actual F                4             1
Actual M                2             3
Classification Performance
Confusion Matrix
 Summarizes the correct and incorrect classifications that a
classifier produced for a certain data set.
Confusion Matrix   Predicted 1 (+)   Predicted 0 (-)
Actual 1 (+)            n11               n10
Actual 0 (-)            n01               n00
• Sensitivity (true positive rate, recall) = n11/(n11+n10)
• Specificity (true negative rate) = n00/(n01+n00)
• Precision = n11/(n11+n01)
• Type I error (false positive rate) = n01/(n01+n00)
• Type II error (false negative rate) = n10/(n11+n10)
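These measures follow mechanically from the four counts. Below is a minimal Python sketch (the function name confusion_metrics is my own, not from the slides), applied to the gender example with F as the positive class (n11 = 4, n10 = 1, n01 = 2, n00 = 3):

```python
from math import sqrt

def confusion_metrics(n11, n10, n01, n00):
    """Measures from the slide, given the four confusion-matrix counts:
    n11 = actual 1 predicted 1, n10 = actual 1 predicted 0,
    n01 = actual 0 predicted 1, n00 = actual 0 predicted 0."""
    n = n11 + n10 + n01 + n00
    sensitivity = n11 / (n11 + n10)      # recall / true positive rate
    specificity = n00 / (n01 + n00)      # true negative rate
    precision = n11 / (n11 + n01)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "type I error": n01 / (n01 + n00),   # false positive rate
        "type II error": n10 / (n11 + n10),  # false negative rate
        "accuracy": (n11 + n00) / n,
        "BCR": sqrt(sensitivity * specificity),
        "F1": 2 * precision * sensitivity / (precision + sensitivity),
    }

# Gender example with F as the positive class:
print(confusion_metrics(n11=4, n10=1, n01=2, n00=3))
```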
Classification Performance
Confusion Matrix: continued
Confusion Matrix   Predicted 1 (+)   Predicted 0 (-)
Actual 1 (+)            n11               n10
Actual 0 (-)            n01               n00
 Misclassification error = (n10 + n01) / (n11 + n10 + n01 + n00)
 Accuracy (1 - misclassification error) = (n11 + n00) / (n11 + n10 + n01 + n00)
 Balanced correction rate (BCR) = sqrt( [n11/(n11+n10)] × [n00/(n01+n00)] ) = sqrt(sensitivity × specificity)
 F1 measure (harmonic mean of recall and precision) = (2 × Recall × Precision) / (Recall + Precision)
Classification Performance
Confusion Matrix
 For the previous example:
Confusion Matrix   Predicted F   Predicted M
Actual F                4             1
Actual M                2             3
 Sensitivity: 4/5 = 0.8, Specificity: 3/5 = 0.6
 Recall: 4/5 = 0.8, Precision: 4/6 = 0.67
 Type I error: 2/5 = 0.4, Type II error: 1/5 = 0.2
 Misclassification error: (1+2)/(4+1+2+3) = 0.3, Accuracy: 0.7
 Balanced correction rate: sqrt(0.8 × 0.6) = 0.69
 F1 measure: (2 × 0.8 × 0.67)/(0.8 + 0.67) = 0.73
Classification Performance
Cut-off for classification
 A new classifier: if BFP > θ then female else male.
 Sort the data in descending order of BFP:

28.6, 25.4, 24.2, 23.6, 22.7, 21.5, 19.9, 15.7, 10.0, 8.9
 How do you decide the cut-off for classification?
Classification Performance
Cut-off for classification
 Performance measures for different cut-offs:

No.   BFP    Gender
 1    28.6     F
 2    25.4     M
 3    24.2     F
 4    23.6     F
 5    22.7     F
 6    21.5     M
 7    19.9     F
 8    15.7     M
 9    10.0     M
10     8.9     M

If θ = 24:
Confusion Matrix   Predicted F   Predicted M
Actual F               2             3
Actual M               1             4
• Misclassification error: 0.4, Accuracy: 0.6
• Balanced correction rate: 0.57, F1 measure: 0.5

If θ = 22:
Confusion Matrix   Predicted F   Predicted M
Actual F               4             1
Actual M               1             4
• Misclassification error: 0.2, Accuracy: 0.8
• Balanced correction rate: 0.8, F1 measure: 0.8

If θ = 18:
Confusion Matrix   Predicted F   Predicted M
Actual F               5             0
Actual M               2             3
• Misclassification error: 0.2, Accuracy: 0.8
• Balanced correction rate: 0.77, F1 measure: 0.83
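A minimal Python sketch of this cut-off sweep (variable names are mine; the data are the ten sorted records above, with F treated as the positive class):

```python
# Sorted BFP values and actual genders from the table above.
bfp    = [28.6, 25.4, 24.2, 23.6, 22.7, 21.5, 19.9, 15.7, 10.0, 8.9]
gender = ["F", "M", "F", "F", "F", "M", "F", "M", "M", "M"]

for theta in (24, 22, 18):
    pred = ["F" if x > theta else "M" for x in bfp]            # apply the cut-off
    acc = sum(p == a for p, a in zip(pred, gender)) / len(bfp)
    print(f"theta = {theta}: accuracy = {acc:.1f}")            # 0.6, 0.8, 0.8
```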
Classification Performance
Cut-off for classification
 In general, classification algorithms can produce a likelihood for each class, expressed as a probability, a degree of evidence, etc.
 Classification performance depends strongly on the chosen cut-off.
 For model selection & model comparison, cut-off-independent performance measures are recommended.
• Lift charts, receiver operating characteristic (ROC) curves, etc.
Classification Performance
Lift charts: An example
 Cancer diagnosis:
• A total of 100 patients.
• Predict each patient’s probability of being malignant, P(Malignant).
• 20 patients are malignant.
• Malignant ratio: 0.2.
[Table: the 100 patients sorted by P(Malignant) in descending order, from patient 1 (P(Malignant) = 0.976, malignant) down to patient 100 (P(Malignant) = 0.002, benign), each with its actual status (1 = malignant, 0 = benign).]
Classification Performance
Confusion matrix
 Set the cut-off to 0.9
• Malignant if P(Malignant) > 0.9, else benign.
Confusion Matrix   Predicted M   Predicted B
Actual M               6             14
Actual B               3             77

• Misclassification error = 0.17
• Accuracy = 0.83
 Is it a good classification model?
Classification Performance
Confusion matrix
 Set the cut-off to 0.8
• Malignant if P(Malignant) > 0.8, else benign.
Confusion Matrix   Predicted M   Predicted B
Actual M              10             10
Actual B              10             70

• Misclassification error = 0.2
• Accuracy = 0.8
 Is it worse than the previous model?
Classification Performance
Lift charts
 Useful for assessing performance in terms of identifying the most
important class.
 Compare performance of DM model to “no model, pick randomly.”
 Measures ability of DM model to identify the important class,
relative to its average prevalence.
 Charts give explicit assessment of results over a large number of
cutoffs.
Classification Performance
Lift charts: Preparation
 Benchmark model (B): randomly assign “malignant” with probability 0.2.
 Compute the number of malignant patients for each decile.
          Non-cumulative        Cumulative
Decile      A      B             A      B
  1         6      2             6      2
  2         4      2            10      4
  3         3      2            13      6
  4         2      2            15      8
  5         2      2            17     10
  6         1      2            18     12
  7         1      2            19     14
  8         1      2            20     16
  9         0      2            20     18
 10         0      2            20     20
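A minimal sketch of the cumulative proportion and lift computed from this table (list name is mine; 0.2 is the base malignant ratio):

```python
# Cumulative malignant counts per decile for model A, from the table above.
cum_A = [6, 10, 13, 15, 17, 18, 19, 20, 20, 20]

for d, a in enumerate(cum_A, start=1):
    proportion = a / (10 * d)        # malignant proportion in the top d deciles
    lift = proportion / 0.2          # relative to the 0.2 base rate
    print(f"decile {d}: proportion = {proportion:.2f}, lift = {lift:.2f}")
# e.g. decile 3: proportion = 0.43, lift = 2.17, matching the slides below
```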
Classification Performance
Lift charts
 Plot the case/relative ratio/proportion for each decile.
[Charts: the number of malignant cases per decile (left) and the relative ratio of A to B per decile (right), model A vs. benchmark B.]
Classification Performance
Lift charts
 Plot the case/relative ratio/proportion for each decile.
Proportion (non-cumulative)
[Chart: non-cumulative proportion of malignant patients per decile, model A vs. benchmark B.]
• Top 20~30% Prob.
• 30% of them are malignant.
• Lift = 0.3/0.2 = 1.5
Classification Performance
Lift charts
 Plot the case/relative ratio/proportion for each decile.
Proportion (cumulative)
[Chart: cumulative proportion of malignant patients by decile, model A vs. benchmark B.]
• Top 0~30% Prob.
• 43.33% of them are malignant.
• Cumulative lift: 0.43/0.2 = 2.17
Classification Performance
Gain chart
 Compare two models for each cumulative decile.
[Chart: gain chart, cumulative fraction of all malignant patients captured by decile, model A vs. benchmark B.]
• Top 0~30% Prob.
• 65% of malignant patients belong to this group.
• Cumulative lift: 0.65/0.3 = 2.17
• Cumulative lift chart (y-axis): (malignant / total patients) in the group.
• Gain chart (y-axis): (malignant in the group) / total malignant.
Classification Performance
Receiver operating characteristic (ROC) curve
 Sort the records by P(interesting class) in descending order.
 Compute the true positive rate and false positive rate by varying the cut-off.
 Draw a chart whose x and y axes are the false positive rate and true positive rate, respectively.
[Table: the 100 patients sorted by P(Malignant), with the cumulative true positive and false positive rates at each cut-off; e.g., a cut-off just below patient 1 (P(Malignant) = 0.976, malignant) gives a true positive rate of 0.050 and a false positive rate of 0.000.]
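A minimal sketch of this construction (the roc_points helper is my own naming, and tied scores are handled naively):

```python
def roc_points(scores, labels):
    """Sweep the cut-off down the sorted scores, returning (FPR, TPR) pairs.
    labels: 1 for the interesting (positive) class, 0 otherwise."""
    pos = sum(labels)
    neg = len(labels) - pos
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    points = [(0.0, 0.0)]            # cut-off above every score
    for i in order:                  # lower the cut-off one record at a time
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

# Toy example, not the slide's 100-patient table:
print(roc_points([0.9, 0.8, 0.4, 0.2], [1, 0, 1, 0]))
```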
Classification Performance
Receiver operating characteristic (ROC) curve
[Figure: ROC curve with the false positive rate (1 - specificity) on the x-axis and the true positive rate (sensitivity) on the y-axis, both from 0.0 to 1.0. The ideal classifier passes through the top-left corner (0, 1); the random classifier lies on the diagonal.]
Classification Performance
Receiver operating characteristic (ROC) curve
[Figure: ROC curves of a good, a so-so, and a bad classifier. Moving along a curve from a high cut-off to a low cut-off increases both the true positive rate and the false positive rate.]
Classification Performance
ROC curve and confusion matrix
[Figure: the confusion matrices obtained at different points on the ROC curve as the cut-off moves from high to low.]
Classification Performance
ROC curve, lift chart, and gain chart
[Figure: the ROC curve, lift chart, and gain chart of the same model, side by side.]
Classification Performance
Area Under ROC curve (AUROC)
 The area under the ROC curve.
 Can be a useful metric for parameter/model selection.
 1 for the ideal classifier.
 0.5 for the random classifier.
[Figure: ROC curve with the AUROC shaded underneath; false positive rate (1 - specificity) on the x-axis, true positive rate (sensitivity) on the y-axis.]
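A minimal sketch of AUROC via the trapezoidal rule, applied to a list of (false positive rate, true positive rate) points such as the output of the roc_points sketch above (function name mine; equivalent in spirit to sklearn.metrics.roc_auc_score):

```python
def auroc(points):
    """Integrate the ROC curve with trapezoids between adjacent points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

print(auroc([(0.0, 0.0), (0.0, 0.5), (0.5, 1.0), (1.0, 1.0)]))  # 0.875
```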
Classification Performance
Asymmetric misclassification costs
 In many cases it is more important to identify members of one
class.
• Cancer diagnosis, tax fraud, credit default, response to
promotional offer, etc.
 In such cases, we are willing to tolerate greater overall error, in
return for better identifying the important class for further
attention.
 The cost of making a misclassification error may be higher for
one class than the other(s).
 The benefit of making a correct classification may be higher for
one class than the other(s).
Classification Performance
Example: Response to promotional offer
 Suppose we send an offer to 1000 people, with 1% average
response rate (“1” = response, “0” = non-response).
 “Naïve rule”
• Classify everyone as “0”.
Confusion Matrix   Predicted 1   Predicted 0
Actual 1               0             10
Actual 0               0            990

• Misclassification error = 1%
• Accuracy = 99%
Classification Performance
Example: Response to promotional offer
 DM model
• Correctly classify eight 1’s as 1’s, at the cost of misclassifying twenty 0’s as 1’s and two 1’s as 0’s.
Confusion Matrix   Predicted 1   Predicted 0
Actual 1               8              2
Actual 0              20            970

• Misclassification error = 2.2%
• Accuracy = 97.8%
 Is it worse than the previous model?
Classification Performance
Profit/Cost matrix
 Assign profit/cost for each cell of confusion matrix.
• Example:
 $10: net profit from a responder who is sent the offer.
 $10: net cost of not sending the offer to a responder.
 $1: net cost of sending an offer.
Confusion Matrix   Predicted 1   Predicted 0
Actual 1               $9           -$10
Actual 0              -$1            $0

(A correctly targeted responder yields $10 − $1 = $9 after the mailing cost.)
• Total profit for the naïve rule: 10*(-$10) = -$100
• Total profit for DM model: 8*($9)+2*(-$10)+20*(-$1) = $32
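A minimal sketch of this profit calculation (the dictionary layout is mine; cells are keyed by (actual, predicted) and the counts come from the two confusion matrices above):

```python
# Profit per (actual, predicted) cell from the profit/cost matrix above.
profit = {(1, 1): 9, (1, 0): -10, (0, 1): -1, (0, 0): 0}

naive = {(1, 1): 0, (1, 0): 10, (0, 1): 0, (0, 0): 990}   # naive-rule counts
dm    = {(1, 1): 8, (1, 0): 2, (0, 1): 20, (0, 0): 970}   # DM-model counts

for name, counts in (("naive rule", naive), ("DM model", dm)):
    total = sum(counts[cell] * profit[cell] for cell in counts)
    print(f"{name}: total profit = ${total}")   # -$100 and $32
```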
Classification Performance
Profit/Cost matrix for cancer diagnosis
 Can we assign a net cost to classifying a malignant patient as benign?

Confusion Matrix   Predicted 1                             Predicted 0
Actual 1           Saves one’s life (can we measure it?)   Misdiagnosis cost
Actual 0           0                                       0

• This is why doctors’ diagnoses are usually very conservative.
Classification Performance
Cost ratio
 In general, actual costs and benefits are hard to estimate.
 Need to express everything in terms of costs (i.e., the cost of misclassification per record).
 The goal is to minimize the average cost per record.
 A good practical substitute for individual costs is the ratio of misclassification costs:
• Misclassifying responders costs 10 times more than misclassifying non-responders.
• Misclassifying fraudulent firms is 5 times worse than misclassifying solvent firms.
Classification Performance
Cost ratio
 Evaluation using the cost ratio:
• q0 / q1: the cost of misclassifying a negative (0) / positive (1) record.

Confusion Matrix   Predicted 1 (+)   Predicted 0 (-)
Actual 1 (+)            n11               n10
Actual 0 (-)            n01               n00
 Expected misclassification cost per record:

= (q0·n01 + q1·n10) / n
= q0 · [n01/(n00+n01)] · [(n00+n01)/n] + q1 · [n10/(n10+n11)] · [(n10+n11)/n]
= p(C0) · q0 · [n01/(n00+n01)] + p(C1) · q1 · [n10/(n10+n11)]
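A minimal sketch of this formula (function name mine), applied to the promotional-offer matrices with the 10:1 cost ratio mentioned on the previous slide:

```python
def expected_cost(n11, n10, n01, n00, q0, q1):
    """Expected misclassification cost per record: q0 weights misclassified
    negatives (n01), q1 weights misclassified positives (n10)."""
    n = n11 + n10 + n01 + n00
    return (q0 * n01 + q1 * n10) / n

# Promotional-offer example with a 10:1 cost ratio (q1 = 10, q0 = 1):
print(expected_cost(0, 10, 0, 990, q0=1, q1=10))   # naive rule: 0.10
print(expected_cost(8, 2, 20, 970, q0=1, q1=10))   # DM model:   0.04
```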
Classification Performance
Oversampling for asymmetric costs
 When misclassification costs are equal:
[Figure: two classes (o and x) and the decision boundary that minimizes overall error.]
Classification Performance
Oversampling for asymmetric costs
 When misclassification costs are unequal:
• Misclassification cost for o is 5 times higher than that of x.
[Figure: the same two classes with unequal misclassification costs.]
Classification Performance
Oversampling for asymmetric costs
 Oversampling:
• Generate four synthetic o instances around each o.
[Figure: the oversampled data, with four synthetic copies around each o.]
Classification Performance
Confusion matrix for over-sampled data
 Assume that 2% of the records are class 1 and 98% are class 0.
 Conduct over-sampling so that there are equal numbers of class 1 and class 0 records.
 After oversampling:
Confusion Matrix   Predicted 1   Predicted 0
Actual 1              420            80
Actual 0              110           390

• Misclassification rate = (80 + 110) / 1,000 = 19%
Classification Performance
Confusion matrix for over-sampled data
 # of records in the original data: the 500 class-1 records are 2% of the total, so 0.02 × X = 500 and X = 25,000.
 # of class-0 records: 25,000 × 0.98 = 24,500.
 For the original data:
Confusion Matrix   Predicted 1   Predicted 0
Actual 1              420             80
Actual 0            5,390         19,110

• The class-0 row is rescaled from the 500 over-sampled records to the 24,500 original ones (110/500 → 5,390/24,500 and 390/500 → 19,110/24,500).
• Misclassification rate = (80 + 5,390) / 25,000 = 21.9%
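A minimal sketch of this rescaling (names are mine): each row of the oversampled matrix is scaled from its 500 sampled records to the true class size.

```python
# Oversampled confusion-matrix rows: class -> (predicted 1, predicted 0).
rows = {1: (420, 80), 0: (110, 390)}
true_size = {1: 500, 0: 24_500}        # class sizes in the original data

for cls, (p1, p0) in rows.items():
    scale = true_size[cls] / (p1 + p0)     # 1.0 for class 1, 49.0 for class 0
    print(cls, p1 * scale, p0 * scale)     # class-0 row -> 5390.0, 19110.0
```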
Prediction Performance
Example
 Predict a baby’s weight (kg) based on age.
Age   Actual Weight (y)   Predicted Weight (y')
 1          5.6                  6.0
 2          6.9                  6.4
 3         10.4                 10.9
 4         13.7                 12.4
 5         17.4                 15.6
 6         20.7                 21.5
 7         23.5                 23.0

[Chart: actual vs. predicted weight by age.]
Prediction Performance
Average error
 Indicates whether the predictions are, on average, over- or under-estimates.

Average error = (1/n) Σ (y − y') ≈ 0.343
Prediction Performance
Mean absolute error (MAE)
 Gives the magnitude of the average error.

MAE = (1/n) Σ |y − y'| ≈ 0.829
Prediction Performance
Mean absolute percentage error (MAPE)
 Gives a percentage score of how much the predictions deviate (on average) from the actual values.

MAPE = 100% × (1/n) Σ |y − y'| / y ≈ 6.43%
Prediction Performance
(Root) Mean squared error ((R)MSE)
 Standard error of estimate.
 Same units as the variable predicted.

MSE = (1/n) Σ (y − y')² ≈ 0.926
RMSE = sqrt( (1/n) Σ (y − y')² ) ≈ 0.962
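A minimal sketch computing all four measures on the baby-weight data above (list names are mine):

```python
from math import sqrt

y  = [5.6, 6.9, 10.4, 13.7, 17.4, 20.7, 23.5]    # actual weights
yp = [6.0, 6.4, 10.9, 12.4, 15.6, 21.5, 23.0]    # predicted weights
n = len(y)

avg_error = sum(a - p for a, p in zip(y, yp)) / n                # ~ 0.343
mae  = sum(abs(a - p) for a, p in zip(y, yp)) / n                # ~ 0.829
mape = 100 * sum(abs(a - p) / a for a, p in zip(y, yp)) / n      # ~ 6.43 (%)
rmse = sqrt(sum((a - p) ** 2 for a, p in zip(y, yp)) / n)        # ~ 0.962

print(avg_error, mae, mape, rmse)
```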