
Learning Algorithm Evaluation
Algorithm evaluation: Outline

Why?

How?
  Train/Test vs Cross-validation

What?
  Overfitting
  Evaluation measures

Who wins?
  Statistical significance
Introduction

A model should perform well on unseen data drawn from the same distribution

Classification accuracy (performance measure):
  Success: instance’s class is predicted correctly
  Error: instance’s class is predicted incorrectly
  Error rate: #errors / #instances
  Accuracy: #successes / #instances
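As a quick illustration of these definitions (the labels below are made up, not from the slides), accuracy and error rate can be computed directly in Python:

    # Hypothetical true and predicted class labels (made-up example)
    y_true = ["+", "-", "+", "+", "-", "+", "-", "+", "-", "+"]
    y_pred = ["+", "-", "-", "+", "-", "+", "+", "+", "-", "+"]

    successes = sum(t == p for t, p in zip(y_true, y_pred))
    errors = len(y_true) - successes

    accuracy = successes / len(y_true)    # #successes / #instances
    error_rate = errors / len(y_true)     # #errors / #instances
    print(accuracy, error_rate)           # 0.8 0.2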
Quiz

50 examples, 10 classified incorrectly
• Accuracy? Error rate?
Evaluation
Rule #1
Never evaluate on training data!
Train and Test
Step 1: Randomly split data into training and test set (e.g. 2/3 - 1/3); the test set is also known as the holdout set
Train and Test
Step 2: Train model on training data
Train and Test
Step 3: Evaluate model on test data
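A minimal sketch of Steps 1-3 with scikit-learn (not from the slides); it assumes a feature matrix X and label vector y are already loaded, and the decision tree is an arbitrary choice:

    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    # Step 1: random 2/3 - 1/3 split into training and holdout (test) set
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=1/3, random_state=0)

    # Step 2: train the model on the training data only
    model = DecisionTreeClassifier().fit(X_train, y_train)

    # Step 3: evaluate on the held-out test data
    print(accuracy_score(y_test, model.predict(X_test)))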
Train and Test
Quiz: Can I retry with other parameter settings?
Evaluation
Rule #1
Never evaluate on training data!
Rule #2
Never train on test data!
(that includes parameter setting or feature selection)
Train and Test
Step 4: Optimize parameters on separate validation set
Test data leakage

Never use test data to create the classifier
  Can be tricky: e.g. social network data, where linked instances can leak information between training and test set

Proper procedure uses three sets (see the sketch below):
  training set: train models
  validation set: optimize algorithm parameters
  test set: evaluate final model
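A sketch of this three-set procedure; X, y, the 60/20/20 split and the max_depth grid are illustrative assumptions, not part of the slides:

    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    # Split off the test set first, then carve a validation set out of the rest
    X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

    # Optimize a parameter on the validation set only (never on the test set)
    best_depth, best_acc = None, -1.0
    for depth in [1, 3, 5, 10]:
        model = DecisionTreeClassifier(max_depth=depth).fit(X_train, y_train)
        acc = accuracy_score(y_val, model.predict(X_val))
        if acc > best_acc:
            best_depth, best_acc = depth, acc

    # Evaluate the chosen configuration once on the untouched test set
    final = DecisionTreeClassifier(max_depth=best_depth).fit(X_train, y_train)
    print(accuracy_score(y_test, final.predict(X_test)))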
Making the most of the data

Once evaluation is complete, all the data can be used to build the final classifier

Trade-off: performance vs. evaluation accuracy
  More training data, better model (but returns diminish)
  More test data, more accurate error estimate
Train and Test
Step 5: Build final model on ALL data (more data, better model)
Cross-Validation
k-fold Cross-validation
  Split data (stratified) into k folds
  Use (k-1) folds for training, 1 for testing
  Repeat k times
  Average results
[Figure: the original data is split into k folds (Fold 1, Fold 2, Fold 3, ...); each fold serves once as test set while the remaining folds are used for training]
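A sketch of k-fold cross-validation with scikit-learn; X, y and the classifier are assumed, and k = 10 anticipates the standard choice discussed next:

    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # Stratified folds: each fold keeps roughly the original class distribution
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

    # Train on k-1 folds, test on the remaining fold, repeat k times, average
    scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=cv)
    print(scores.mean(), scores.std())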
Cross-validation

Standard method: stratified ten-fold cross-validation
  Why 10? Enough to reduce sampling bias (experimentally determined)
Leave-One-Out Cross-validation
[Figure: with 100 instances, Fold 1 ... Fold 100 each hold out a single instance for testing]

A particular form of cross-validation:
  #folds = #instances
  n instances, build classifier n times
Makes best use of the data, no sampling bias
Computationally expensive
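The same idea with #folds = #instances, again assuming X and y exist; this trains n classifiers, so it is only practical for small datasets:

    from sklearn.model_selection import LeaveOneOut, cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # One fold per instance: each model is tested on a single held-out example
    scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=LeaveOneOut())
    print(scores.mean())   # fraction of instances predicted correctly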
ROC Analysis
Stands for “Receiver Operating Characteristic”
  From signal processing: tradeoff between hit rate and false alarm rate over a noisy channel

Compute FPR and TPR and plot them in ROC space
  Every classifier is a point in ROC space

For probabilistic algorithms:
  Collect many points by varying the prediction threshold
  Or, make cost sensitive and vary costs (see below)
Confusion Matrix

                  actual +               actual -
predicted +       TP (true positive)     FP (false positive)
predicted -       FN (false negative)    TN (true negative)
column totals     TP + FN                FP + TN

TPrate (sensitivity) = TP / (TP + FN)
FPrate (fall-out)    = FP / (FP + TN)
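A small sketch of these two rates using scikit-learn's confusion_matrix; y_true and y_pred are assumed to hold the actual and predicted "+"/"-" labels:

    from sklearn.metrics import confusion_matrix

    # With labels=["-", "+"] the matrix is [[TN, FP], [FN, TP]]
    # (rows = actual class, columns = predicted class)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=["-", "+"]).ravel()

    tpr = tp / (tp + fn)   # TPrate (sensitivity)
    fpr = fp / (fp + tn)   # FPrate (fall-out)
    print(tpr, fpr)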
ROC space

[Figure: ROC space with each classifier as a single point, e.g. OneR, J48, and J48 with fitted parameters]
ROC curves

[Figure: ROC curve obtained by changing the prediction threshold t, classifying as + when P(+) > t; Area Under Curve (AUC) = 0.75 in the example]
ROC curves

Alternative method (easier, but less intuitive), sketched in code below:
  Rank the predicted probabilities
  Start the curve in (0,0) and move down the probability list
  If the instance is positive, move up; if negative, move right

Jagged curve: one set of test data
Smooth curve: use cross-validation
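A sketch of this ranking method in Python; the probabilities and labels are made up for illustration (1 = positive, 0 = negative):

    # Predicted P(+) and true labels, paired per test instance (made-up data)
    scores = [0.95, 0.85, 0.80, 0.70, 0.55, 0.45, 0.40, 0.30, 0.20, 0.10]
    labels = [1,    1,    0,    1,    1,    0,    1,    0,    0,    0]

    P = sum(labels)            # number of positives
    N = len(labels) - P        # number of negatives

    x, y, curve = 0.0, 0.0, [(0.0, 0.0)]
    # Walk down the list from highest to lowest predicted probability
    for score, label in sorted(zip(scores, labels), reverse=True):
        if label == 1:
            y += 1 / P         # positive: move up
        else:
            x += 1 / N         # negative: move right
        curve.append((x, y))
    print(curve)               # the jagged ROC curve, ending at (1.0, 1.0)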
ROC curves
Method selection
  Overall: use the method with the largest Area Under the ROC curve (AUROC); see the sketch below
  If you aim to cover just 40% of the true positives in a sample: use method A
  Large sample: use method B
  In between: choose between A and B with appropriate probabilities
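As a sketch of the "largest AUROC" rule: two already-trained probabilistic classifiers (clf_a and clf_b, both hypothetical) can be compared on a held-out test set with scikit-learn:

    from sklearn.metrics import roc_auc_score

    # Probability of the positive class for each test instance
    # (assumes the positive class is the second entry of classes_)
    auc_a = roc_auc_score(y_test, clf_a.predict_proba(X_test)[:, 1])
    auc_b = roc_auc_score(y_test, clf_b.predict_proba(X_test)[:, 1])

    print("A" if auc_a > auc_b else "B", "has the larger area under the ROC curve")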
ROC Space and Costs

[Figure: iso-performance lines in ROC space for equal costs vs. skewed costs]
Different Costs


In practice, false positive and false negative errors incur different costs
Examples:
  Medical diagnostic tests: does X have leukemia?
  Loan decisions: approve mortgage for X?
  Promotional mailing: will X buy the product?
Add a cost matrix to the evaluation that weighs TP, FP, ... (see the sketch below):
              pred +      pred -
actual +      cTP = 0     cFN = 1
actual -      cFP = 1     cTN = 0
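A sketch of applying such a cost matrix to a confusion matrix; the confusion counts are invented just to show the arithmetic:

    import numpy as np

    # Cost matrix from the slide: rows = actual (+, -), columns = predicted (+, -)
    costs = np.array([[0, 1],     # actual +: cTP = 0, cFN = 1
                      [1, 0]])    # actual -: cFP = 1, cTN = 0

    # Confusion counts in the same layout (made-up numbers)
    counts = np.array([[40, 10],  # TP, FN
                       [ 5, 45]]) # FP, TN

    total_cost = (costs * counts).sum()           # 10 FN + 5 FP = 15
    print(total_cost, total_cost / counts.sum())  # total and per-instance cost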
Statistical Significance
Comparing data mining schemes

Which of two learning algorithms performs better?
  Note: this is domain dependent!
Obvious way: compare 10-fold CV estimates
Problem: variance in the estimate
  Variance can be reduced using repeated CV
  However, we still don’t know whether the results are reliable
Significance tests

Significance tests tell us how confident we can be that there really is a difference
  Null hypothesis: there is no “real” difference
  Alternative hypothesis: there is a difference
A significance test measures how much evidence there is in favor of rejecting the null hypothesis
E.g. 10 cross-validation scores: is B better than A?

[Figure: per-fold performance distributions P(perf) of Algorithm A and Algorithm B]
Paired t-test
[Figure: overlapping performance distributions P(perf) of Algorithm A and Algorithm B]

Student’s t-test tells whether the means of two samples (e.g., 10 cross-validation scores) are significantly different
Use a paired t-test when the individual samples are paired, i.e. they use the same randomization:
  the same CV folds are used for both algorithms (see the sketch below)
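A minimal sketch with SciPy, assuming ten per-fold accuracies for each algorithm obtained on the same folds (the numbers are invented):

    from scipy.stats import ttest_rel

    # Per-fold accuracies of algorithms A and B on the SAME 10 CV folds (made up)
    scores_a = [0.81, 0.79, 0.84, 0.80, 0.78, 0.83, 0.82, 0.80, 0.79, 0.81]
    scores_b = [0.84, 0.82, 0.85, 0.83, 0.80, 0.86, 0.84, 0.83, 0.81, 0.84]

    t_stat, p_value = ttest_rel(scores_b, scores_a)
    print(t_stat, p_value)   # small p-value: reject "no real difference"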
William Gosset
Born 1876 in Canterbury; died 1937 in Beaconsfield, England.
Worked as a chemist at the Guinness brewery in Dublin from 1899. Invented the t-test to handle small samples for quality control in brewing. Wrote under the name "Student".
Performing the test
1. Fix a significance level α
   A significant difference at the α% level implies a (100-α)% chance that there really is a difference
   Scientific work: 5% or smaller (>95% certainty)

[Figure: overlapping performance distributions P(perf) of Algorithm A and Algorithm B]

2. Divide α by two (two-tailed test)

3. Look up the z-value corresponding to α/2:

   α        z
   0.1%     4.3
   0.5%     3.25
   1%       2.82
   5%       1.83
   10%      1.38
   20%      0.88

4. If t ≤ -z or t ≥ z: the difference is significant, i.e. the null hypothesis can be rejected
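The four steps above can also be carried out by hand; a sketch using the made-up fold scores from the paired t-test example and SciPy's t distribution for the table lookup (9 degrees of freedom for 10 folds):

    import numpy as np
    from scipy.stats import t as t_dist

    # Per-fold differences (scores_a, scores_b from the earlier made-up example)
    d = np.array(scores_b) - np.array(scores_a)
    k = len(d)

    # Paired t statistic
    t_stat = d.mean() / (d.std(ddof=1) / np.sqrt(k))

    # Steps 1-3: significance level alpha, two-tailed, so look up alpha/2
    alpha = 0.10
    z = t_dist.ppf(1 - alpha / 2, df=k - 1)   # alpha/2 = 5% -> z = 1.83 (see table)

    # Step 4: compare t with -z and z
    print("significant" if abs(t_stat) >= z else "not significant")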