
Evaluation of Learning Models

Evgueni Smirnov

Overview

• Motivation
• Metrics for Classifier's Evaluation
• Methods for Classifier's Evaluation
• Comparing the Performance of Two Classifiers
• Costs in Classification
  – Cost-Sensitive Classification and Learning
  – Lift Charts
  – ROC Curves

Motivation

• It is important to evaluate a classifier's generalization performance in order to:
  – Determine whether to employ the classifier.
    (For example: when learning the effectiveness of medical treatments from limited-size data, it is important to estimate the accuracy of the classifiers.)
  – Optimize the classifier.
    (For example: when post-pruning decision trees we must evaluate the accuracy of the decision trees at each pruning step.)

Model’s Evaluation in the KDD Process

[Figure: the KDD process — Selection produces the target data, Preprocessing & cleaning the processed data, Transformation & feature selection the transformed data, Data Mining the patterns, and Interpretation/Evaluation the knowledge. Model evaluation takes place in the Interpretation/Evaluation step.]

How to evaluate the Classifier’s Generalization Performance?

• Assume that we test a classifier on some test set and derive at the end the following confusion matrix:

                        Predicted class
                        Pos        Neg
  Actual class   Pos    TP         FN        P
                 Neg    FP         TN        N

Metrics for Classifier's Evaluation

• Accuracy = (TP + TN) / (P + N)
• Error = (FP + FN) / (P + N)
• Precision = TP / (TP + FP)
• Recall / TP rate = TP / P
• FP rate = FP / N

where the confusion matrix is:

                        Predicted class
                        Pos        Neg
  Actual class   Pos    TP         FN        P
                 Neg    FP         TN        N
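A minimal sketch of these metrics in Python, assuming the four confusion-matrix counts are already available (the function name and the counts in the example call are illustrative, not from the slides):

import numpy as np

def evaluation_metrics(tp, fn, fp, tn):
    """Compute the slide's metrics from confusion-matrix counts."""
    p = tp + fn          # actual positives
    n = fp + tn          # actual negatives
    return {
        "accuracy":  (tp + tn) / (p + n),
        "error":     (fp + fn) / (p + n),
        "precision": tp / (tp + fp) if (tp + fp) > 0 else 0.0,
        "recall":    tp / p if p > 0 else 0.0,   # TP rate
        "fp_rate":   fp / n if n > 0 else 0.0,
    }

# Example with made-up counts
print(evaluation_metrics(tp=40, fn=60, fp=30, tn=70))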

How to Estimate the Metrics?

• We can use:
  – Training data;
  – Independent test data;
  – Hold-out method;
  – k-fold cross-validation method;
  – Leave-one-out method;
  – Bootstrap method;
  – And many more…

Estimation with Training Data

• The accuracy/error estimates on the training data are not good indicators of performance on future data.
  – Q: Why?
  – A: Because new data will probably not be exactly the same as the training data!
• The accuracy/error estimates on the training data measure the degree of the classifier's overfitting.

Estimation with Independent Test Data

• Estimation with independent test data is used when we have plenty of data and there is a natural way of forming training and test data.
• For example: Quinlan in 1987 reported experiments in a medical domain for which the classifiers were trained on data from 1985 and tested on data from 1986.

Hold-out Method

• The hold-out method splits the data into training data and test data (usually 2/3 for training, 1/3 for testing). Then we build a classifier using the training data and test it using the test data.
• The hold-out method is usually used when we have thousands of instances, including several hundred instances from each class.
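A minimal hold-out sketch using scikit-learn, assuming a feature matrix X and label vector y are already loaded; the 2/3–1/3 split matches the slide, while the decision-tree learner is just an illustrative choice:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Hold out 1/3 of the data for testing, stratified by class
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=42)

clf = DecisionTreeClassifier().fit(X_train, y_train)
print("hold-out accuracy:", accuracy_score(y_test, clf.predict(X_test)))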

Classification: Train, Validation, Test Split

[Figure: the data with known results is split into a training set, a validation set, and a final test set. The training set is fed to the classifier builder, the validation set is used to evaluate the predictions and build the model, and the final test set is used only for the final evaluation of the resulting classifier.]

The test data can’t be used for parameter tuning!

Making the Most of the Data

• Once evaluation is complete, all the data can be used to build the final classifier.
• Generally, the larger the training data the better the classifier (but returns diminish).
• The larger the test data the more accurate the error estimate.

Stratification

• The holdout method reserves a certain amount of the data for testing and uses the remainder for training.
  – Usually: one third for testing, the rest for training.
• For "unbalanced" datasets, random samples might not be representative.
  – Few or no instances of some classes.
• Stratified sample: an advanced version of balancing the data.
  – Make sure that each class is represented with approximately equal proportions in both subsets.

Repeated Holdout Method

• The holdout estimate can be made more reliable by repeating the process with different subsamples:
  – In each iteration, a certain proportion is randomly selected for training (possibly with stratification);
  – The error rates on the different iterations are averaged to yield an overall error rate.
• This is called the repeated holdout method.

Repeated Holdout Method, 2

• Still not optimal: the different test sets overlap, but we would like all instances from the data to be tested at least once.
• Can we prevent overlapping?

(Witten & Eibe)

k-Fold Cross-Validation

• k-fold cross-validation avoids overlapping test sets:
  – First step: the data is split into k subsets of equal size;
  – Second step: each subset in turn is used for testing and the remainder for training.
• The subsets are stratified before the cross-validation.
• The estimates are averaged to yield an overall estimate.

[Figure: the data is split into k folds; in each round one fold serves as the test set and the remaining folds form the training set.]
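A minimal stratified k-fold sketch with scikit-learn, assuming X, y, and a classifier clf are already defined; the choice of k = 10 matches the next slide:

from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
print("per-fold accuracy:", scores)
print("overall estimate :", scores.mean())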

More on Cross-Validation

• Standard method for evaluation: stratified 10-fold cross-validation.
• Why 10? Extensive experiments have shown that this is the best choice to get an accurate estimate.
• Stratification reduces the estimate's variance.
• Even better: repeated stratified cross-validation:
  – E.g., ten-fold cross-validation is repeated ten times and the results are averaged (this reduces the variance).

Leave-One-Out Cross-Validation

• Leave-one-out is a particular form of cross-validation:
  – Set the number of folds to the number of training instances;
  – I.e., for n training instances, build the classifier n times.
• Makes the best use of the data.
• Involves no random sub-sampling.
• Very computationally expensive.
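A leave-one-out sketch along the same lines (again assuming X, y, and clf are defined):

from sklearn.model_selection import LeaveOneOut, cross_val_score

scores = cross_val_score(clf, X, y, cv=LeaveOneOut())  # one fold per instance
print("leave-one-out accuracy:", scores.mean())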

Leave-One-Out Cross-Validation and Stratification

• A disadvantage of leave-one-out CV is that stratification is not possible:
  – It guarantees a non-stratified sample because there is only one instance in the test set!
• Extreme example: a random dataset split equally into two classes:
  – The best inducer predicts the majority class;
  – 50% accuracy on fresh data;
  – The leave-one-out CV estimate is 100% error!

Bootstrap Method

• Cross-validation uses sampling without replacement:
  – The same instance, once selected, cannot be selected again for a particular training/test set.
• The bootstrap uses sampling with replacement to form the training set:
  – Sample a dataset of n instances n times with replacement to form a new dataset of n instances;
  – Use this data as the training set;
  – Use the instances from the original dataset that do not occur in the new training set for testing.

Bootstrap Method

• The bootstrap method is also called the 0.632 bootstrap:
  – A particular instance has a probability of 1 - 1/n of not being picked in a single draw;
  – Thus its probability of ending up in the test data (never being picked in n draws) is:

    (1 - 1/n)^n ≈ e^(-1) ≈ 0.368

  – This means the training data will contain approximately 63.2% of the instances and the test data will contain approximately 36.8% of the instances.

Estimating Error with the Bootstrap Method

• The error estimate on the test data will be very pessimistic because the classifier is trained on just ~63% of the instances.
• Therefore, combine it with the training error:

  err = 0.632 · e_test instances + 0.368 · e_training instances

• The training error gets less weight than the error on the test data.
• Repeat the process several times with different replacement samples; average the results.
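A sketch of the 0.632 bootstrap estimate, assuming X and y are NumPy arrays and clf is a scikit-learn classifier; the resampling and the weighting follow the formula above, while the function name and the number of repeats are illustrative:

import numpy as np
from sklearn.base import clone
from sklearn.metrics import accuracy_score

def bootstrap_632_error(clf, X, y, n_repeats=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    errors = []
    for _ in range(n_repeats):
        # Sample n instances with replacement for training
        train_idx = rng.integers(0, n, size=n)
        test_mask = np.ones(n, dtype=bool)
        test_mask[train_idx] = False          # instances never drawn form the test set
        if not test_mask.any():
            continue
        model = clone(clf).fit(X[train_idx], y[train_idx])
        e_test = 1 - accuracy_score(y[test_mask], model.predict(X[test_mask]))
        e_train = 1 - accuracy_score(y[train_idx], model.predict(X[train_idx]))
        errors.append(0.632 * e_test + 0.368 * e_train)
    return float(np.mean(errors))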

Confidence Intervals for Accuracy

• Assume that the estimated accuracy acc_S(h) of classifier h is 75%.
• How close is the estimated accuracy acc_S(h) to the true accuracy acc_D(h)?

Confidence Intervals for Accuracy

• Classification of an instance is a Bernoulli trial.
  – A Bernoulli trial has 2 outcomes: correct and wrong;
  – The random variable X, the number of correct outcomes of N Bernoulli trials, has a Binomial distribution b(x; N, acc_D);
  – Example: if we have a classifier with true accuracy acc_D equal to 50%, then classifying 30 randomly chosen instances gives the following Binomial distribution:

[Figure: Binomial distribution of the number of correct classifications X for N = 30 and acc_D = 0.5.]

Confidence Intervals for Accuracy

The main question: given the number N of test instances and the number x of correct classifications, or equivalently the empirical accuracy acc_S = x/N, can we predict the true accuracy acc_D of the classifier?

Confidence Intervals for Accuracy

• The binomial distribution of X has mean equal to N acc_D and variance N acc_D (1 - acc_D).
• It can be shown that the empirical accuracy acc_S = X/N also follows a binomial distribution, with mean equal to acc_D and variance acc_D (1 - acc_D) / N.

Confidence Intervals for Accuracy

• For large test sets (N > 30), the binomial distribution is approximated by a normal distribution with mean acc_D and variance acc_D (1 - acc_D) / N.
• Thus,

  P( Z_{α/2} ≤ (acc_S - acc_D) / sqrt( acc_D (1 - acc_D) / N ) ≤ Z_{1-α/2} ) = 1 - α,

  where Z_{α/2} and Z_{1-α/2} are the lower and upper critical values of the standard normal distribution and the area between them equals 1 - α.
• Confidence interval for acc_D:

  acc_D ∈ [ ( 2 N acc_S + Z_{α/2}^2 ± Z_{α/2} sqrt( Z_{α/2}^2 + 4 N acc_S - 4 N acc_S^2 ) ) / ( 2 (N + Z_{α/2}^2) ) ]

Confidence Intervals for Accuracy

• Confidence interval for acc_D:

  acc_D ∈ [ ( 2 N acc_S + Z_{α/2}^2 ± Z_{α/2} sqrt( Z_{α/2}^2 + 4 N acc_S - 4 N acc_S^2 ) ) / ( 2 (N + Z_{α/2}^2) ) ]

• The confidence intervals shrink when we decrease the confidence level 1 - α:

  1 - α:    0.99   0.98   0.95   0.9    0.8    0.7    0.5
  Z_{α/2}:  2.58   2.33   1.96   1.65   1.28   1.04   0.67

• The confidence intervals also become tighter when the number N of test instances increases. See below the evolution of the 95% confidence intervals for a classifier with observed accuracy acc_S = 80%, as N grows:

  N:    20            50            100           500           1000          5000
  CI:   [0.58, 0.92]  [0.67, 0.89]  [0.71, 0.87]  [0.76, 0.83]  [0.77, 0.82]  [0.78, 0.81]
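A small Python sketch of this interval, assuming acc_S is the observed accuracy and using the 95% value Z_{α/2} = 1.96 from the table above (the function name is illustrative):

import math

def accuracy_confidence_interval(acc_s, n, z=1.96):
    """Normal-approximation confidence interval for the true accuracy."""
    centre = 2 * n * acc_s + z**2
    spread = z * math.sqrt(z**2 + 4 * n * acc_s - 4 * n * acc_s**2)
    denom = 2 * (n + z**2)
    return (centre - spread) / denom, (centre + spread) / denom

# Reproduces (approximately) the table above, e.g. N = 100, acc_S = 0.8
print(accuracy_confidence_interval(0.8, 100))   # roughly (0.71, 0.87)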

Estimating Confidence Intervals of the Difference of Generalization Performances of Two Classifier Models

• Assume that we have two classifiers, M1 and M2, and we would like to know which one is better for a classification problem.
• We test the classifiers on n test data sets D1, D2, …, Dn, and we receive error rate estimates e11, e12, …, e1n for classifier M1 and error rate estimates e21, e22, …, e2n for classifier M2.
• Using these rate estimates we can compute the mean error rate e1 for classifier M1 and the mean error rate e2 for classifier M2.
• These mean error rates are just estimates of the error on the true population of future data cases.

What if the difference between the two error rates is just attributed to chance?

Estimating Confidence Intervals of the Difference of Generalization Performances of Two Classifier Models

• We note that the error rate estimates e11, e12, …, e1n for classifier M1 and the error rate estimates e21, e22, …, e2n for classifier M2 are paired. Thus, we consider the differences d1, d2, …, dn, where dj = |e1j - e2j|.

• The differences d1, d2, …, dn are instantiations of n random variables D1, D2, …, Dn with mean µ_D and standard deviation σ_D.
• We need to establish confidence intervals for µ_D in order to decide whether the difference in the generalization performance of the classifiers M1 and M2 is statistically significant or not.

Estimating Confidence Intervals of the Difference of Generalization Performances of Two Classifier Models

• Since the standard deviation σ_D is unknown, we approximate it using the sample standard deviation s_d:

  s_d = sqrt( (1/n) Σ_{i=1..n} [ (e_{1i} - e_{2i}) - (ē_1 - ē_2) ]^2 )

• Since we approximate the true standard deviation σ_D, we introduce the T statistic:

  T = D̄ / ( s_d / sqrt(n) )

Estimating Confidence Intervals of the Difference of Generalization Performances of Two Classifier Models

• The T statistic is governed by the t-distribution with n - 1 degrees of freedom.

[Figure: t-distribution with the central area 1 - α between the critical values t_{α/2} and t_{1-α/2}.]

Estimating Confidence Intervals of the Difference of Generalization Performances of Two Classifier Models

• If d̄ and s_d are the mean and standard deviation of the normally distributed differences of n random pairs of errors, a (1 - α)·100% confidence interval for µ_D = µ1 - µ2 is:

  d̄ - t_{α/2} · s_d / sqrt(n)  <  µ_D  <  d̄ + t_{α/2} · s_d / sqrt(n),

  where t_{α/2} is the t-value with v = n - 1 degrees of freedom, leaving an area of α/2 to the right.
• Thus, if the interval contains 0.0, we cannot conclude at significance level α that the difference between the two classifiers is statistically significant.
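A sketch of this paired comparison with SciPy, assuming errors_m1 and errors_m2 are the paired per-dataset error rates of the two classifiers; the sketch uses signed differences, as in the usual paired t-test, and the arrays below are made-up numbers:

import numpy as np
from scipy import stats

errors_m1 = np.array([0.12, 0.15, 0.10, 0.14, 0.11])   # classifier M1
errors_m2 = np.array([0.13, 0.18, 0.12, 0.15, 0.14])   # classifier M2

d = errors_m1 - errors_m2
n = len(d)
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 1)            # t-value for a 95% interval
half_width = t_crit * d.std(ddof=1) / np.sqrt(n)
print("95% CI for mu_D:", (d.mean() - half_width, d.mean() + half_width))

# Equivalent built-in paired t-test
print(stats.ttest_rel(errors_m1, errors_m2))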

Metric Evaluation Summary:

• Use test sets and the hold-out method for "large" data;
• Use the cross-validation method for "middle-sized" data;
• Use the leave-one-out and bootstrap methods for small data;
• Don't use test data for parameter tuning - use separate validation data.

Counting the Costs

• In practice, different types of classification errors often incur different costs.
• Examples:
  – Terrorist profiling ("not a terrorist" is correct 99.99% of the time);
  – Loan decisions;
  – Fault diagnosis;
  – Promotional mailing.

Cost Matrices

                        Hypothesized class
                        Pos        Neg
  True class    Pos     TP Cost    FN Cost
                Neg     FP Cost    TN Cost

Usually, TP Cost and TN Cost are set equal to 0!

Cost-Sensitive Classification

• If the classifier outputs a probability for each class, it can be adjusted to minimize the expected cost of its predictions.
• The expected cost is computed as the dot product of the vector of class probabilities and the appropriate column of the cost matrix:

                        Hypothesized class
                        Pos        Neg
  True class    Pos     TP Cost    FN Cost
                Neg     FP Cost    TN Cost

Cost Sensitive Classification

• Assume that the classifier returns for an instance the probabilities p_pos = 0.6 and p_neg = 0.4, and that the cost matrix is the one below.
  – The expected cost if the instance is classified as positive is 0.6 * 0 + 0.4 * 10 = 4.
  – The expected cost if the instance is classified as negative is 0.6 * 5 + 0.4 * 0 = 3.
• To minimize the cost, the instance is classified as negative.

                        Hypothesized class
                        Pos        Neg
  True class    Pos     0          5
                Neg     10         0
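A minimal sketch of this expected-cost rule in Python; the cost matrix is indexed [true class][hypothesized class] as in the tables above, and the numbers reproduce the worked example:

import numpy as np

# cost[true_class][hypothesized_class], classes ordered (pos, neg)
cost = np.array([[0, 5],     # true pos: TP cost, FN cost
                 [10, 0]])   # true neg: FP cost, TN cost

class_probs = np.array([0.6, 0.4])         # (p_pos, p_neg) from the classifier

expected_costs = class_probs @ cost         # dot product with each column
print(expected_costs)                       # [4. 3.]
print("predict:", ["pos", "neg"][int(expected_costs.argmin())])   # "neg"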

Cost Sensitive Learning

• Simple methods for cost-sensitive learning:
  – Resampling of instances according to costs;
  – Weighting of instances according to costs.
• In Weka, cost-sensitive classification and learning can be applied to any classifier using the meta scheme CostSensitiveClassifier.

Lift Charts

• In practice, decisions are usually made by comparing possible scenarios, taking different costs into account.
• Example: promotional mailout to 1,000,000 households.
  – If we mail to all households, we get a 0.1% response rate (1,000 respondents).
  – A data mining tool identifies (a) a subset of 100,000 households with a 0.4% response rate (400 respondents), or (b) a subset of 400,000 households with a 0.2% response rate (800 respondents).
  – Depending on the costs, we can make the final decision using lift charts!
• A lift chart allows a visual comparison.

Generating a Lift Chart

• Instances are sorted according to their predicted probability of being a true positive:

  Rank   Predicted probability   Actual class
  1      0.95                    Yes
  2      0.93                    Yes
  3      0.93                    No
  4      0.88                    Yes
  …      …                       …

• In a lift chart, the x axis is the sample size and the y axis is the number of true positives.
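A sketch of how the lift-chart points could be computed, assuming arrays of predicted probabilities and 0/1 labels (the function name and data are illustrative, with the four instances taken from the table above):

import numpy as np

def lift_chart_points(probs, labels):
    """Return (sample size, cumulative true positives) pairs for a lift chart."""
    probs, labels = np.asarray(probs, float), np.asarray(labels)
    order = np.argsort(-probs, kind="stable")   # sort by predicted probability, descending
    cum_tp = np.cumsum(labels[order])
    sample_size = np.arange(1, len(labels) + 1)
    return sample_size, cum_tp

sizes, tps = lift_chart_points([0.95, 0.93, 0.93, 0.88], [1, 1, 0, 1])
print(list(zip(sizes, tps)))   # [(1, 1), (2, 2), (3, 2), (4, 3)]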

Hypothetical Lift Chart

ROC Curves and Analysis

• Three classifiers evaluated on the same test set (rows: predicted class, columns: true class):

  Classifier 1            Classifier 2            Classifier 3
          pos   neg               pos   neg               pos   neg
  pos     40    30        pos     70    50        pos     60    20
  neg     60    70        neg     30    50        neg     40    80

  TPr = 0.4, FPr = 0.3    TPr = 0.7, FPr = 0.5    TPr = 0.6, FPr = 0.2

ROC Space

[Figure: the ROC space, showing the ideal classifier, the "always positive" and "always negative" classifiers, and the diagonal corresponding to chance performance.]

Dominance in the ROC Space

Classifier A dominates classifier B if and only if TPr_A > TPr_B and FPr_A < FPr_B.

ROC Convex Hull (ROCCH)

• The ROCCH is determined by the dominant classifiers;
• Classifiers on the ROCCH achieve the best accuracy;
• Classifiers below the ROCCH are always sub-optimal.

Convex Hull

• Any performance on a line segment connecting two ROC points can be achieved by randomly choosing between them;
• The classifiers on the ROCCH can be combined to form a hybrid classifier.

Iso-Accuracy Lines

• An iso-accuracy line connects ROC points with the same accuracy A:
  – (P · TPr + N · (1 - FPr)) / (P + N) = A;
  – TPr = (A · (P + N) - N) / P + (N / P) · FPr.
• Iso-accuracy lines have slope N/P.
• Higher iso-accuracy lines are better.

Iso-Accuracy Lines

• For uniform class distribution, C4.5 is optimal and achieves about 82% accuracy.

Iso-Accuracy Lines

• With four times as many positives as negatives, SVM is optimal and achieves about 84% accuracy.

Iso-Accuracy Lines

• With four times as many negatives as positives, CN2 is optimal and achieves about 86% accuracy.

Iso-Accuracy Lines

• With less than 9% positives, AlwaysNeg is optimal.

• With less than 11% negatives, AlwaysPos is optimal.

How to Construct a ROC Curve for One Classifier

• Sort the instances according to their predicted probability P_pos of being positive.
• Move a threshold over the sorted instances.
• For each threshold, define a classifier with its confusion matrix.
• Plot the TP and FP rates of these classifiers.

• Example instances sorted by P_pos:

  P_pos        0.99   0.98   0.7   0.6   0.43
  True class   pos    pos    neg   pos   neg

• For instance, with the threshold between 0.7 and 0.6 (predicting the top three instances as positive) the confusion matrix is:

                      True
                      pos   neg
  Predicted   pos     2     1
              neg     1     1
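A sketch of this threshold-sweeping construction in Python, using the five example instances above (labels 1 = pos, 0 = neg; the function name is illustrative):

import numpy as np

def roc_points(probs, labels):
    """TP rate / FP rate pairs obtained by moving a threshold over sorted instances."""
    probs, labels = np.asarray(probs), np.asarray(labels)
    order = np.argsort(-probs, kind="stable")
    labels = labels[order]
    p, n = labels.sum(), (1 - labels).sum()
    points = [(0.0, 0.0)]
    tp = fp = 0
    for lab in labels:                       # lower the threshold one instance at a time
        tp, fp = tp + lab, fp + (1 - lab)
        points.append((fp / n, tp / p))
    return points                            # (FPr, TPr) pairs from (0, 0) to (1, 1)

print(roc_points([0.99, 0.98, 0.7, 0.6, 0.43], [1, 1, 0, 1, 0]))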

ROC for one Classifier

Good separation between the classes, convex curve.

ROC for one Classifier

Reasonable separation between the classes, mostly convex.

ROC for one Classifier

Fairly poor separation between the classes, mostly convex.

ROC for one Classifier

Poor separation between the classes, large and small concavities.

ROC for one Classifier

Random performance.

The AUC Metric

• The area under the ROC curve (AUC) assesses the ranking in terms of the separation of the classes.
• The AUC estimates the probability that a randomly chosen positive instance will be ranked before a randomly chosen negative instance.
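A sketch of the AUC computed directly from this pairwise-ranking interpretation (ties counted as half), using the same example instances; in practice one would typically use sklearn.metrics.roc_auc_score instead:

import numpy as np

def auc_by_ranking(probs, labels):
    """Probability that a random positive is ranked above a random negative."""
    probs, labels = np.asarray(probs, float), np.asarray(labels)
    pos, neg = probs[labels == 1], probs[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

print(auc_by_ranking([0.99, 0.98, 0.7, 0.6, 0.43], [1, 1, 0, 1, 0]))   # 0.8333...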

Note

• To generate ROC curves or lift charts we need to use some of the evaluation methods considered in this lecture.
• ROC curves and lift charts can be used for internal optimization of classifiers.

Summary

• In this lecture we have considered:
  – Metrics for Classifier's Evaluation
  – Methods for Classifier's Evaluation
  – Comparing Data Mining Schemes
  – Costs in Data Mining
    • Cost-Sensitive Classification and Learning
    • Lift Charts
    • ROC Curves