Computer Science 1 - Maastricht University


Evaluation of learned models
Kurt Driessens
again with slides stolen from Evgueni
Smirnov and Hendrik Blockeel
Overview
• Motivation
• Metrics for Classifier Evaluation
• Methods for Classifier Evaluation &
Comparison
• Costs in Data Mining
– Cost-Sensitive Classification and Learning
– Lift Charts
– ROC Curves
Motivation
• It is important to evaluate a classifier's generalization performance in order to:
– Determine whether to employ the classifier;
(For example: when learning the effectiveness of medical treatments from limited-size data, it is important to estimate the accuracy of the classifiers.)
– Optimize the classifier.
(For example: when post-pruning decision trees we must evaluate the accuracy of the decision trees at each pruning step.)
Model’s Evaluation in the KDD Process
[Diagram: the KDD process: Data → Selection → Target data → Preprocessing & cleaning → Processed data → Transformation & feature selection → Transformed data → Data Mining → Patterns → Interpretation / Evaluation → Knowledge. Model evaluation belongs to the final Interpretation / Evaluation step.]
How to evaluate the Classifier’s
Generalization Performance?
Assume that we test a classifier on some test set and obtain the following confusion matrix:
               Predicted class
               Pos    Neg    Total
Actual    +    TP     FN     P
class     -    FP     TN     N
Metrics for Classifier’s Evaluation
Accuracy = (TP+TN)/(P+N)
Error = (FP+FN)/(P+N)
Precision = TP/(TP+FP)
Recall/TP rate = TP/P
FP Rate = FP/N
               Predicted class
               Pos    Neg    Total
Actual    +    TP     FN     P
class     -    FP     TN     N
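As a small illustration (not part of the slides), these metrics can be computed directly from hypothetical confusion-matrix counts:

```python
# Hypothetical confusion-matrix counts for a binary classifier.
TP, FN = 70, 30   # actual positives: P = TP + FN
FP, TN = 10, 90   # actual negatives: N = FP + TN

P, N = TP + FN, FP + TN

accuracy  = (TP + TN) / (P + N)
error     = (FP + FN) / (P + N)
precision = TP / (TP + FP)
recall    = TP / P            # recall = TP rate
fp_rate   = FP / N

print(f"accuracy={accuracy:.2f}  error={error:.2f}  precision={precision:.2f}  "
      f"recall={recall:.2f}  FP rate={fp_rate:.2f}")
```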
How to Estimate the Metrics?
• We can use:
– Training data;
– Independent test data;
– Hold-out method;
– k-fold cross-validation method;
– Leave-one-out method;
– Bootstrap method;
– And many more…
Estimation with Training Data
The accuracy/error estimates on the training data are
not good indicators of performance on future data.
[Diagram: the same training set is used both to build and to test the classifier.]
– Q: Why?
– A: Because new data will probably not be exactly the same
as the training data!
• The accuracy/error estimates on the training data mainly reflect how well the classifier fits, and possibly overfits, the training data.
Estimation with Independent Test Data
Estimation with independent test data is used when we have plenty of data and there is a natural way of forming training and test data.
[Diagram: the classifier is built on the training set and evaluated on a separate test set.]
• For example: Quinlan in 1987 reported experiments in a
medical domain for which the classifiers were trained on
data from 1985 and tested on data from 1986.
Hold-out Method
The hold-out method splits the data into training data and test
data (usually 2/3 for train, 1/3 for test). Then we build a
classifier using the train data and test it using the test data.
[Diagram: the data is split into a training set and a test set; the classifier is built on the training set and evaluated on the test set.]
• used with thousands of instances,
• including plenty from each class.
Classification: Train, Validation, Test Split
[Diagram: data with known results is split into a training set, a validation set, and a final test set. A model builder produces classifiers from the training set; they are tuned by evaluating their predictions on the validation set; the chosen classifier is evaluated once on the final test set.]
The test data can't be used for parameter tuning!
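A minimal sketch of such a split, assuming scikit-learn; the dataset, classifier, and depth grid are illustrative choices, not part of the slides:

```python
# Train / validation / test split with scikit-learn (assumed available).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)

# Carve off the final test set first; it is never touched during tuning.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
# Split the remainder into training and validation data.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

# Tune a parameter (tree depth) using the validation set only.
best_depth, best_acc = None, 0.0
for depth in (2, 4, 8, None):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_val, model.predict(X_val))
    if acc > best_acc:
        best_depth, best_acc = depth, acc

# The chosen model is evaluated once on the untouched final test set.
final = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
final.fit(X_train, y_train)
print("final test accuracy:", accuracy_score(y_test, final.predict(X_test)))
```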
Making the Most of the Data
• Once evaluation is complete, all the data can
be used to build the final classifier.
• Generally, the larger the training data the
better the classifier (but returns diminish).
• The larger the test data the more accurate
the error estimate.
Stratification
• The holdout method reserves a certain
amount for testing and uses the remainder
for training.
–Usually: one third for testing, the rest for training.
• For “unbalanced” datasets, samples might
not be representative.
–Few or no instances of some classes.
• Stratified sampling: advanced version of
balancing the data.
–Make sure that each class is represented with
approximately equal proportions in both subsets.
Repeated Holdout Method
In general, estimates can be made more reliable
by repeated sampling
– Each iteration, a certain proportion is randomly
selected for training (possibly with stratification).
– The error rates on the different iterations are
averaged to yield an overall error rate.
This is called the repeated holdout method.
Repeated Holdout Method, 2
Random sampling ≠ optimal
– the different test sets overlap
– we would like all our instances from the data to be
tested at least once
Can we prevent overlapping?
k-Fold Cross-Validation
• k-fold cross-validation avoids overlapping test sets:
– First step: data is split into k subsets of equal size;
– Second step: each subset in turn is used for testing and the
remainder for training.
• The subsets are stratified
before the cross-validation.
• The estimates are averaged to
yield an overall estimate.
[Diagram: the data is split into k folds; in each iteration one fold serves as the test set and the remaining folds form the training set of the classifier.]
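A possible implementation of stratified 10-fold cross-validation, again assuming scikit-learn and an illustrative dataset and classifier:

```python
# Stratified 10-fold cross-validation with scikit-learn (assumed available).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

fold_accs = []
for train_idx, test_idx in skf.split(X, y):
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X[train_idx], y[train_idx])
    fold_accs.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

# The fold estimates are averaged to yield the overall estimate.
print(f"accuracy: {np.mean(fold_accs):.3f} +/- {np.std(fold_accs):.3f}")
```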
More on Cross-Validation
• Standard method for evaluation: stratified 10-fold cross-validation.
• Why 10? Extensive experiments have shown that this is the
best choice to get an accurate estimate.
• Stratification reduces the estimate’s variance.
Even better: repeated stratified cross-validation:
– E.g. ten-fold cross-validation is repeated ten times and
results are averaged (reduces the variance).
Leave-One-Out Cross-Validation
• Leave-One-Out is a particular form of cross-validation:
– Set number of folds to number of training
instances;
– I.e., for n training instances, build classifier n
times.
• Makes best use of the data.
• Involves no random sub-sampling.
• Very computationally expensive.
Leave-One-Out Cross-Validation and
Stratification
• A disadvantage of Leave-One-Out-CV is that
stratification is not possible:
– It guarantees a non-stratified sample because
there is only one instance in the test set!
• Extreme example - random dataset split
equally into two classes:
– Best inducer predicts majority class;
– 50% accuracy on fresh data;
– Leave-One-Out-CV estimate is 100% error!
Bootstrap Method
• Cross validation uses sampling without
replacement:
– The same instance, once selected, can not be selected
again for a particular training/test set
• The bootstrap uses sampling with replacement to
form the training set:
– Sample a dataset of n instances n times with replacement
to form a new dataset of n instances;
– Use this data as the training set;
– Use the instances from the original dataset that don’t
occur in the new training set for testing.
Bootstrap Method
The bootstrap method is also called the 0.632
bootstrap:
– A particular instance has a probability of 1–1/n of not
being picked;
– Thus its probability of ending up in the test data is:
$$\left(1 - \frac{1}{n}\right)^{n} \approx e^{-1} \approx 0.368$$
– This means the training data will contain approximately
63.2% of the instances and the test data will contain
approximately 36.8% of the instances.
Estimating Error with the Bootstrap Method
The error estimate on the test data will be very
pessimistic because the classifier is trained on
approx. 63% of the instances.
– Therefore, combine it with the training error:
$$err = 0.632 \cdot e_{\text{test instances}} + 0.368 \cdot e_{\text{training instances}}$$
– The training error gets less weight than the error on the
test data.
– Repeat process several times with different replacement
samples; average the results.
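A rough sketch of this procedure, assuming NumPy and scikit-learn; the dataset, classifier, and the 30 repetitions are illustrative choices:

```python
# 0.632 bootstrap error estimate (NumPy and scikit-learn assumed).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)
n = len(y)

estimates = []
for _ in range(30):                            # repeat with different resamples
    boot = rng.integers(0, n, size=n)          # sample n instances with replacement
    oob = np.setdiff1d(np.arange(n), boot)     # instances never picked -> test set
    model = DecisionTreeClassifier(random_state=0).fit(X[boot], y[boot])
    e_test = np.mean(model.predict(X[oob]) != y[oob])
    e_train = np.mean(model.predict(X[boot]) != y[boot])
    estimates.append(0.632 * e_test + 0.368 * e_train)

print("0.632 bootstrap error estimate:", np.mean(estimates))
```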
Confidence Intervals for Performance
Assume that the error errorS(h) of the classifier h, estimated by 10-fold cross-validation, is 25%.
• How close is the estimated error errorS(h) to
the true error errorD(h) ?
Confidence intervals (2)
If the test data contain n examples, drawn independently of each other, with n ≥ 30, then with approximately N% probability, errorD(h) lies in the interval
$$error_S(h) \pm z_N \sqrt{\frac{error_S(h)\,(1 - error_S(h))}{n}}$$
where
N%: 50% 68% 80% 90% 95% 98% 99%
zN: 0.67 1.00 1.28 1.64 1.96 2.33 2.58
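A small helper illustrating this interval; the sample size n = 100 in the example call is a hypothetical value, not given on the slide:

```python
# Confidence interval for the true error, given the error estimated
# on n independently drawn test examples.
import math

def error_confidence_interval(error_s, n, z_n=1.96):
    """Normal-approximation interval; z_n = 1.96 corresponds to 95% confidence."""
    margin = z_n * math.sqrt(error_s * (1 - error_s) / n)
    return error_s - margin, error_s + margin

print(error_confidence_interval(0.25, 100))    # roughly (0.165, 0.335)
```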
Comparison of hypotheses
Given two hypotheses, which one has lower true
error?
• Statistical hypothesis test:
– the claim (null hypothesis) is that both are equally good
– if the claim is rejected, accept that one is better
• 2 cases:
– compare 2 hypotheses on possibly different test
sets
– compare 2 hypotheses on same test set
Different Test Sets
To compare h1 and h2, estimate p1-p2 from
samples S1 (p’1) and S2 (p’2)
– if “very likely” p1-p2 > 0 (i.e., confidence interval is entirely to
the right of 0): h1 is better
– similarly, < 0 : h2 is better
– otherwise, no difference demonstrated
Formula for confidence interval of difference:
$$p'_1 - p'_2 \pm z_N \sqrt{\frac{p'_1(1 - p'_1)}{n_1} + \frac{p'_2(1 - p'_2)}{n_2}}$$
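A sketch of this interval in code; the error rates and test-set sizes in the example call are made up for illustration:

```python
# Confidence interval for the difference of two error rates measured on
# independent test sets.
import math

def diff_confidence_interval(p1, n1, p2, n2, z=1.96):
    margin = z * math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2) - margin, (p1 - p2) + margin

lo, hi = diff_confidence_interval(0.25, 200, 0.18, 150)
print(lo, hi)   # here the interval contains 0, so no difference is demonstrated
```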
Same Test Set
When comparing hypotheses on the same data
set, more powerful procedure possible
– uses more information from test
– possible influence of easy/difficult examples
removed
A more informative method:
– for each single example, compare h1 and h2
– how often was h1 correct and h2 wrong on the
same example, vs. the other way around?
– McNemar’s test
McNemar's test
• Consider table:
              h1 correct   h1 wrong
  h2 correct      A            B
  h2 wrong        C            D
• If h1 is equally good as h2:
– for each instance where h1 and h2 differ, probability
0.5 that either is correct
– hence we expect B ≈ C ≈ (B+C)/2
– B and C follow binomial (+/- normal) distribution
• Reject equality if B deviates too much from
(B+C)/2
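A minimal sketch of the test, assuming SciPy and the common chi-square approximation with continuity correction; the example call uses the counts from the comparison on the next slide:

```python
# McNemar's test via the chi-square approximation with continuity correction
# (SciPy assumed). B and C are the off-diagonal counts of the table above.
from scipy.stats import chi2

def mcnemar_p_value(B, C):
    """p-value for the hypothesis that h1 and h2 are equally good (needs B + C > 0)."""
    stat = (abs(B - C) - 1) ** 2 / (B + C)
    return chi2.sf(stat, df=1)

# Counts from the example comparison below: B = 10 (h2 correct, h1 wrong), C = 0.
print(mcnemar_p_value(10, 0))   # ~0.004: reject equality, h2 is better
```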
Example comparison
Consider table below
• Method with independent test sets:
– 55-45 in favour of h2 (out of 100)
– not very convincing
• Method with same test set:
– much more convincing: 10-0 in favour of h2
              h1 correct   h1 wrong
  h2 correct      45           10
  h2 wrong         0           45

– h2 clearly better than h1
– might not be discovered using the "conservative" comparison
Metric Evaluation Summary:
1. Use test sets and the hold-out method for “large”
data;
2. Use the cross-validation method for “middle-sized”
data;
3. Use the leave-one-out and bootstrap methods for
small data;
Don’t use test data for parameter tuning - use separate
validation data.
Comparing two classifiers to each other can use more
advanced statistics: t-test, McNemar, …
Drawbacks of Accuracy
• Evaluation based on accuracy is not always
appropriate
• Shortcomings:
– can sometimes be misleading
– unstable when class distribution may change
– assumes symmetric misclassification costs
1: Accuracy can be misleading
E.g., "99% correct prediction": is this good?
– Yes, if 50% "+" and 50% "-"
– No, if 1% "+" and 99% "-"
always predicting "neg" gives 99% accuracy
• Accuracy is a relative measure
– Should be compared with "base accuracy" of always
predicting the majority class
base accuracy = max{P, N} / T, i.e., the proportion of the majority class
– Even then, it may be misleading...
Assume all examples are negative (-), except a blue region of positives (+).
• Which of these classifiers is best?
[Figure: Classifier 1: IF false THEN pos (96% correct). Classifier 2: IF green area THEN pos (92% correct).]
• An alternative measure: correlation
– e.g., correlation φ = (ad - bc) / sqrt(Tpos · Tneg · T+ · T-)
• close to 1: high correlation between predictions and classes
• close to 0: no correlation
• (close to -1: predicting the opposite)
– Avoids the unintuitive results just mentioned
                actual value
                +      -      Sum
prediction Pos  a      b      Tpos
           Neg  c      d      Tneg
           Sum  T+     T-     T

note: +/- are actual values; pos/neg are predictions
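The correlation above is the phi (Matthews) correlation coefficient; a plain-Python sketch with hypothetical counts:

```python
import math

def phi_correlation(a, b, c, d):
    T_pos, T_neg = a + b, c + d        # prediction totals
    T_plus, T_minus = a + c, b + d     # actual-class totals
    denom = math.sqrt(T_pos * T_neg * T_plus * T_minus)
    return (a * d - b * c) / denom if denom else 0.0   # convention: 0 when undefined

print(phi_correlation(a=40, b=10, c=10, d=40))   # 0.6: decent correlation
print(phi_correlation(a=0, b=0, c=10, d=990))    # 0.0: majority-class predictor, despite 99% accuracy
```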
2: Accuracy is sensitive to class distributions
If class distribution in test set differs from that in
training set, accuracy will also differ
E.g.:
– Suppose a classifier has TP = 0.8, TN = 0.6
– Tested on test set with T+/T = 0.5, T-/T = 0.5:
• Acc = 0.7
– Employed in environment with T+/T = 0.3, T-/T = 0.7:
• Acc = 0.66
3: Accuracy ignores misclassification costs
Accuracy ignores possibility of different
misclassification costs
– sometimes, incorrectly predicting "pos" costs more/less
than incorrectly predicting "neg”
E.g.:
• not treating an ill patient vs. treating a healthy patient
• refusing credit to client who would have paid back vs.
assigning credit to client who won't pay back
Need to distinguish probability of making different
types of errors
Misclassification Costs
Solution: distinguish “predictive accuracy” for
different classes
– Acc: probability that some instance is classified
correctly
– Decomposed into
• TP: “true positive” rate, (estimated) probability that a
positive instance is classified correctly
• TN: “true negative” rate, (estimated) probability that a
negative instance is classified correctly
– We also define
• FP = 1-TN: “false positive rate”: estimated probability that a
negative is classified as positive
• analogously FN = 1-TP
Misclassification Costs (2)
Consider costs CFP and CFN
= cost of false positive resp. false negative
Expected cost of a single prediction:
C = CFP P(pos|-) P(-) + CFN P(neg|+) P(+)
– estimated by C = CFP FP T-/T + CFN FN T+ /T
Note :
– Acc is weighted average of TP and TN
Acc = TP T+/T + TN T-/T
– C is not computable from Acc alone
Cost Sensitive Learning
Simple methods for cost sensitive learning:
• Resampling of instances according to costs
• Weighting of instances according to costs
In Weka Cost Sensitive Classification and Learning can be
applied for any classifier using the meta scheme:
CostSensitiveClassifier.
Lift Charts
In practice, decisions are usually made by comparing
possible scenarios taking into account different costs.
E.g.
- Promotional mailout to 1,000,000 households. If we mail to all households, we get a 0.1% response rate (1,000 responses).
- A data mining tool identifies (a) a subset of 100,000 households with a 0.4% response rate (400); or (b) a subset of 400,000 households with a 0.2% response rate (800).
- Depending on the costs we can make final decision using lift
charts!
- A lift chart allows a visual comparison.
Generating a Lift Chart
Instances are sorted according to their predicted probability
of being a true positive:
Rank   Predicted probability   Actual class
1      0.95                    Pos
2      0.93                    Pos
3      0.93                    Neg
4      0.88                    Pos
…      …                       …
In the lift chart, x axis is sample size and y axis is number of
true positives.
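A minimal sketch of how the lift-chart points can be computed, assuming NumPy; the scores and labels are hypothetical:

```python
# Lift-chart points from ranked predictions (NumPy assumed).
import numpy as np

scores = np.array([0.95, 0.93, 0.93, 0.88, 0.60, 0.40, 0.30, 0.10])
labels = np.array([1, 1, 0, 1, 0, 1, 0, 0])        # 1 = Pos, 0 = Neg

order = np.argsort(-scores)                        # sort by predicted probability, descending
true_positives = np.cumsum(labels[order])          # y axis: number of true positives
sample_size = np.arange(1, len(labels) + 1)        # x axis: sample size

for x, y in zip(sample_size, true_positives):
    print(x, y)
```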
Hypothetical Lift Chart
ROC diagrams
• ROC = "Receiver operating characteristic"
• Allows to see
– how well a classifier will perform given certain
misclassification costs and class distribution
– in which environments one classifier is better
than another
• Explicitly aims at solving problems 2 and 3
mentioned before
ROC diagram (2)
• ROC diagram plots TP-rate versus FP-rate
• From the confusion matrix:
– TP rate = a/(a+c) = a/T+
– FP rate = b/(b+d) = b/T-

                actual value
                +      -      Sum
prediction Pos  a      b      Tpos
           Neg  c      d      Tneg
           Sum  T+     T-     T
Classifier in ROC diagram
1 classifier = 1 point on ROC diagram
[ROC diagram: x axis = FP rate (0 to 1), y axis = TP rate (0 to 1). The point (0,1) is perfect prediction; "if false then pos" lies at (0,0) (no negatives returned as positives); "if true then pos" lies at (1,1) (no positives forgotten); the diagonal corresponds to random prediction.
 Example classifiers, given as confusion matrices with rows = true class (+/-) and columns = predicted (pos/neg):
 – (80, 20; 50, 50): TP rate 0.8, FP rate 0.5
 – (60, 40; 20, 80): TP rate 0.6, FP rate 0.2
 – (40, 60; 30, 70): TP rate 0.4, FP rate 0.3]
Dominance in the ROC Space
Classifier A dominates classifier B if and only if
TPrA > TPrB and FPrA < FPrB.
ROC Convex Hull (ROCCH)
Determined by the dominant classifiers:
– Classifiers below the ROCCH are always sub-optimal.
– Any point on the line segment connecting two classifiers can be achieved by randomly choosing between them.
– The classifiers on the ROCCH can be combined to form a hybrid.
Rank classifiers
Rank classifiers assign a rank to their predictions
some predictions are more certain than others -> higher
rank
E.g.
(1) decision trees:
– use purity of leaf used for prediction to rank it
– E.g. leaf with 90% positives is ranked higher than leaf
with 80% positives
(2) neural nets:
– criterion: <0.5 = neg, >=0.5 = pos
– but 0.9 is more certainly positive than 0.51
– raise/lower threshold of 0.5: TP and FP go down or up
Rank classifiers yield a ROC curve
each specific threshold = 1 point on that curve
[ROC diagram: the Ranker traces a curve from (0,0) to (1,1), one point per threshold. With a low threshold the Ranker is better than the Blue classifier; with a high threshold it is worse than the Red classifier.]
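A sketch of how a ranker's ROC points follow from sweeping the threshold, assuming NumPy; the scores and labels are hypothetical:

```python
# One ROC point per threshold of a rank classifier (NumPy assumed).
import numpy as np

scores = np.array([0.95, 0.93, 0.93, 0.88, 0.60, 0.40, 0.30, 0.10])
labels = np.array([1, 1, 0, 1, 0, 1, 0, 0])
P, N = labels.sum(), (1 - labels).sum()

for threshold in np.unique(scores)[::-1]:          # sweep the threshold from high to low
    predicted_pos = scores >= threshold
    tp_rate = (predicted_pos & (labels == 1)).sum() / P
    fp_rate = (predicted_pos & (labels == 0)).sum() / N
    print(f"threshold={threshold:.2f}  TP rate={tp_rate:.2f}  FP rate={fp_rate:.2f}")
```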
ROC for one Classifier
[Figure: five example ROC curves for a single rank classifier:
 – Good separation between the classes, convex curve.
 – Reasonable separation between the classes, mostly convex.
 – Fairly poor separation between the classes, mostly convex.
 – Poor separation between the classes, large and small concavities.
 – Random performance.]
The AUC Metric
The area under ROC curve (AUC) assesses the ranking in terms of
separation of the classes.
AUC estimates the probability that a randomly chosen positive instance will be ranked before a randomly chosen negative instance.
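A minimal sketch of this pairwise interpretation of AUC (ties counted as half), in plain Python with hypothetical scores:

```python
# AUC as the probability that a random positive is ranked above a random negative.
def auc(scores, labels):
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([0.95, 0.93, 0.93, 0.88, 0.60, 0.40, 0.30, 0.10],
          [1, 1, 0, 1, 0, 1, 0, 0]))   # about 0.78
```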
Note
• To generate ROC curves or Lift charts we need
to use some evaluation methods considered in
this lecture.
• ROC curves and Lift charts can be used for
internal optimization of classifiers.
Costs in ROC diagram
Given misclassification costs:
– CFP: cost of a false positive
– CFN: cost of a false negative (undetected "+")
Average cost is
– C = CFP * FP * T-/T + CFN * (1-TP) * T+/T
– Lines of equal cost can be drawn in ROC diagram
(straight lines)
• Slope of such a line : (CFP * T-/T) / (CFN * T+/T)
[ROC diagrams with iso-cost lines:
 – With a high cost of false positives, Red is better.
 – With a low cost of false positives, a Ranker with a low threshold is better.
 – Blue and Green are never better than the Ranker or Red.]
Iso-Accuracy Lines
Remember:
Accuracy is a weighted average of the TP and TN rates:
Acc = TP T+/T + TN T-/T
    = TP T+/T + (1-FP) T-/T
so lines of equal accuracy satisfy TP = (N/P) FP + constant.
Higher iso-accuracy lines are better.
Example
For uniform class distribution,
C4.5 is optimal and achieves
about 82% accuracy.
With 4 times as many positives as
negatives SVM is optimal and
achieves about 84% accuracy.
With 4 times as many negatives
as positives CN2 is optimal and
achieves about 86% accuracy.
Summary
• Metrics for Classifier’s Evaluation
• Methods for Classifier’s Evaluation &
Comparison
• Costs in Data Mining
– Cost-Sensitive Classification and Learning
– Lift Charts
– ROC Curves
Evaluation of regression models
• Predicting numbers: a prediction is not simply "right" or "wrong"
• Possible measures:
– Sum of squared errors SSE
• Is an absolute measure
– Relative error RE: measures improvement over trivial model
• RE = SSE(hypothesis) / SSE(trivial hypothesis)
• Trivial hypothesis: e.g. always predict mean
• RE normally between 0 and 1
– Spearman correlation r
• measures how well predictions and actual values correlate
• less sensitive to actual errors
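A small sketch computing these three measures, assuming NumPy and SciPy; the predictions are hypothetical:

```python
# SSE, relative error, and Spearman correlation for a numeric predictor.
import numpy as np
from scipy.stats import spearmanr

actual    = np.array([3.0, 5.0, 2.5, 7.0, 4.5])
predicted = np.array([2.8, 5.5, 2.0, 6.5, 5.0])

sse = np.sum((actual - predicted) ** 2)
sse_trivial = np.sum((actual - actual.mean()) ** 2)   # trivial model: always predict the mean
relative_error = sse / sse_trivial
rho, _ = spearmanr(predicted, actual)

print(f"SSE={sse:.2f}  RE={relative_error:.2f}  Spearman r={rho:.2f}")
```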