Approximate Randomization tests


February 5, 2013

Classic t-test

Why AR testing?

• Classic tests often assume a given distribution (Student's t, normal, …) of the variable
• This is ≈ok for recall, but not for precision or F-score
• The set of hypotheses that can be tested with non-parametric tests is limited

Illustration

• 30,000 runs, 1000 instances, 500 of class A
• True positives (TP): 400 (stdev: 80)
• False positives (FP): 60 (stdev: 15)
• Assumption: true and false positives for class A are normally distributed. This is already an approximation since TP and FP are bounded by 0 and the number of instances.

Definitions

• Recall = truly predicted A / A in reference = truly predicted A / constant
  → If the number of truly predicted A (TP) is normal, recall is normal.
• Precision = truly predicted A / A in system
  → A in system is TP + FP, so precision = TP / (TP + FP) is a non-linear combination of TP and FP. Precision is not normal.
• F-score: non-linear combination of recall and precision
  → Not normal.
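The illustration and definitions above can be checked with a small simulation. This is a minimal sketch, not the original experiment: it assumes TP and FP are drawn independently, does not cap TP at 500 (so recall can exceed 1, in line with the normal approximation), and uses skewness as a rough indicator of non-normality. The function names are hypothetical.

```python
import random
import statistics


def skewness(values):
    """Skewness of a sample; close to 0 for a symmetric (e.g. normal) distribution."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    return sum(((v - mean) / stdev) ** 3 for v in values) / len(values)


def simulate(runs=30_000, class_a_in_reference=500, seed=0):
    """Draw TP and FP from normal distributions and derive the three scores."""
    rng = random.Random(seed)
    recall, precision, f_score = [], [], []
    for _ in range(runs):
        tp = rng.gauss(400, 80)  # true positives for class A
        fp = rng.gauss(60, 15)   # false positives for class A
        if tp <= 0 or fp < 0:
            continue             # discard the (very rare) impossible draws
        r = tp / class_a_in_reference  # denominator is a constant
        p = tp / (tp + fp)             # denominator varies with TP and FP
        recall.append(r)
        precision.append(p)
        f_score.append(2 * p * r / (p + r))
    return recall, precision, f_score


recall, precision, f_score = simulate()
for name, values in (("recall", recall), ("precision", precision), ("F-score", f_score)):
    print(f"{name:9s} skewness = {skewness(values):+.2f}")
```

Under these assumptions recall stays symmetric, while precision shows a clear skew.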

Approximate randomization test

• No assumption on distribution
• Can handle complicated statistics
• Only assumption: independence between shuffled elements
• References:
  – Computer Intensive Methods for Testing Hypotheses, Noreen, 1989.
  – More accurate tests for the statistical significance of result differences, Yeh, 2000.

Basic idea

• Exact randomization test

          Contents   Expert
Glass 1   Polish     Polish
Glass 2   Premium    Premium
Glass 3   Russian    Budget
Glass 4   Budget     Russian

Exact probability

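H0: expert is independent of contents
The expert has 2 glasses correct; 7 of the 4! = 24 possible orderings of the expert's labels have at least 2 correct.
P(ncorrect ≥ 2) = 7/24 = 0.29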

Thus, do not reject H0 because the probability is larger than alpha=0.05.
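The exact probability for the glasses example can be reproduced by enumerating every ordering of the expert's labels. A minimal sketch; the function names are hypothetical and not taken from the script discussed later.

```python
from fractions import Fraction
from itertools import permutations

contents = ["Polish", "Premium", "Russian", "Budget"]
expert = ["Polish", "Premium", "Budget", "Russian"]


def n_correct(guess, truth):
    """Number of glasses for which the guess matches the contents."""
    return sum(g == t for g, t in zip(guess, truth))


def exact_p(truth, guess):
    """Fraction of all orderings of `guess` scoring at least as well as `guess`."""
    actual = n_correct(guess, truth)
    orderings = list(permutations(guess))  # 4! = 24 orderings here
    nge = sum(n_correct(o, truth) >= actual for o in orderings)
    return Fraction(nge, len(orderings))


p = exact_p(contents, expert)
print(p, float(p))
```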

Approximate probability

• The number of permutations is n!, which grows quickly with n
• If there are too many permutations to compute, approximate: P = (nge + 1) / (NS + 1)
  – nge: number of times pseudostatistic ≥ actual statistic
  – NS: number of shuffles
  – +1: correction for validity
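The approximation P = (nge + 1) / (NS + 1) can be sketched on the same glasses example by sampling random orderings instead of enumerating all of them. This is an illustrative sketch with hypothetical function names; the seed and the default NS are arbitrary choices.

```python
import random

contents = ["Polish", "Premium", "Russian", "Budget"]
expert = ["Polish", "Premium", "Budget", "Russian"]


def n_correct(guess, truth):
    """Number of glasses for which the guess matches the contents."""
    return sum(g == t for g, t in zip(guess, truth))


def approximate_p(truth, guess, ns=10_000, seed=0):
    """Estimate the randomization probability from `ns` random shuffles."""
    rng = random.Random(seed)
    actual = n_correct(guess, truth)
    shuffled = list(guess)
    nge = 0
    for _ in range(ns):
        rng.shuffle(shuffled)  # one random ordering of the expert's labels
        if n_correct(shuffled, truth) >= actual:
            nge += 1           # pseudostatistic >= actual statistic
    return (nge + 1) / (ns + 1)


print(approximate_p(contents, expert))
```

With enough shuffles the estimate should land close to the exact 7/24.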

DIFFERENT SETUPS

Translation to instances

• Each glass is an instance
• Contents and expert are two labeling systems
• Contents has an accuracy of 100%, expert has an accuracy of 50%
• Statistic is precision, F-score, recall, … instead of accuracy

Stratified shuffling

• For labeled instances, it makes no sense to move the class label of one instance to another instance
• Only shuffle labels per instance: swap the labels of the two systems for the same instance
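Stratified shuffling can be sketched as follows: for every instance, the labels of the two systems are swapped with probability 0.5, and the statistic is recomputed on the shuffled outputs. This is a minimal sketch, not the code of the script; the function names, the two-sided comparison on the absolute difference, and the convention that precision is 0 when a system never predicts the label are all assumptions.

```python
import random


def precision(reference, predicted, label="A"):
    """Precision for `label`; taken as 0.0 if the label is never predicted (assumption)."""
    in_system = [r for r, p in zip(reference, predicted) if p == label]
    if not in_system:
        return 0.0
    return sum(r == label for r in in_system) / len(in_system)


def stratified_art(reference, sys1, sys2, statistic=precision, ns=1000, seed=0):
    """Approximate randomization test with per-instance (stratified) shuffling."""
    rng = random.Random(seed)
    actual = abs(statistic(reference, sys1) - statistic(reference, sys2))
    nge = 0
    for _ in range(ns):
        shuf1, shuf2 = [], []
        for a, b in zip(sys1, sys2):
            if rng.random() < 0.5:  # swap the two labels of this instance only
                a, b = b, a
            shuf1.append(a)
            shuf2.append(b)
        pseudo = abs(statistic(reference, shuf1) - statistic(reference, shuf2))
        if pseudo >= actual:
            nge += 1
    return (nge + 1) / (ns + 1)


# Hypothetical toy data, for illustration only.
reference = ["A", "A", "B", "B", "A", "B"]
system1 = ["A", "A", "B", "B", "A", "A"]
system2 = ["B", "A", "A", "B", "B", "B"]
print(stratified_art(reference, system1, system2))
```

Passing a different `statistic` function (recall, F-score, …) leaves the shuffling unchanged.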

MBT

• Assumption of independence between instances
• Shuffle per sentence rather than per token

This is nice .

System 1

DT VBZ JJ .

System 2

NNS VB RB .
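Shuffling per sentence can be sketched by swapping whole tag sequences between the two systems, so that tokens of one sentence never end up split over both systems. A minimal sketch with a hypothetical function name, using the example sentence above as data.

```python
import random


def shuffle_per_sentence(sys1, sys2, rng):
    """Swap complete sentences (tag sequences) between two systems at random."""
    out1, out2 = [], []
    for tags1, tags2 in zip(sys1, sys2):
        if rng.random() < 0.5:  # the whole sentence moves, never a single token
            tags1, tags2 = tags2, tags1
        out1.append(tags1)
        out2.append(tags2)
    return out1, out2


# One sentence ("This is nice .") tagged by both systems.
system1 = [["DT", "VBZ", "JJ", "."]]
system2 = [["NNS", "VB", "RB", "."]]

shuffled1, shuffled2 = shuffle_per_sentence(system1, system2, random.Random(0))
print(shuffled1, shuffled2)
```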

Term extraction

• Shuffling extracted terms between output of two term extraction systems

Reference

happy good lively

System 1

happy

System 2

sad good happy angry
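One possible reading of shuffling extracted terms is sketched below, applied to the example above. It is an assumption, not necessarily what the script does: each term's membership is swapped between the two systems with probability 0.5, so a term extracted by both systems stays in both, and a term extracted by one system may move to the other. The function names are hypothetical.

```python
import random


def f_score(extracted, reference):
    """F-score of a set of extracted terms against the reference terms."""
    correct = len(extracted & reference)
    if correct == 0:
        return 0.0
    p = correct / len(extracted)
    r = correct / len(reference)
    return 2 * p * r / (p + r)


def shuffle_terms(sys1, sys2, rng):
    """Swap each term's membership between the two systems with probability 0.5."""
    new1, new2 = set(), set()
    for term in sorted(sys1 | sys2):  # sorted for reproducibility
        in1, in2 = term in sys1, term in sys2
        if rng.random() < 0.5:
            in1, in2 = in2, in1
        if in1:
            new1.add(term)
        if in2:
            new2.add(term)
    return new1, new2


reference = {"happy", "good", "lively"}
system1 = {"happy"}
system2 = {"sad", "good", "happy", "angry"}

new1, new2 = shuffle_terms(system1, system2, random.Random(0))
print(f_score(system1, reference), f_score(system2, reference))
print(f_score(new1, reference), f_score(new2, reference))
```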

Script

• http://www.clips.ua.ac.be/~vincent/software.html#art
• http://www.clips.ua.ac.be/scripts/art
• Options:
  – Exact and approximate randomization tests
  – Instance based, also for MBT
  – Term extraction based
  – Stratified shuffling
  – Two-sided / one-sided (check code!)

Remarks on usage

• It makes no sense to shuffle if exact randomization can be computed
• The value of p depends on NS. The larger NS, the lower p can be: the minimum is 1 / (NS + 1), reached when nge = 0
• Validity check
  – Sign test
  – Re-test: to alleviate bad randomization

Sign test

• Can be compared with P for accuracy
• H0: correctness is independent of system, i.e. P(green) = 0.5
• Binomial test

[Figure: System 1 vs. System 2]
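The sign test can be sketched as a two-sided binomial test with success probability 0.5. This sketch assumes the test counts only the instances on which exactly one of the two systems is correct; the function names and the toy data are hypothetical.

```python
from math import comb


def count_disagreements(reference, sys1, sys2):
    """Count instances where exactly one of the two systems is correct."""
    only1 = sum(a == r and b != r for r, a, b in zip(reference, sys1, sys2))
    only2 = sum(b == r and a != r for r, a, b in zip(reference, sys1, sys2))
    return only1, only2


def sign_test_p(only_sys1_correct, only_sys2_correct):
    """Two-sided binomial probability under H0: both outcomes have probability 0.5."""
    n = only_sys1_correct + only_sys2_correct
    if n == 0:
        return 1.0
    k = min(only_sys1_correct, only_sys2_correct)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n  # P(X <= k)
    return min(1.0, 2 * tail)


# Hypothetical toy data, for illustration only.
reference = ["A", "B", "A", "B", "A", "B"]
system1 = ["A", "B", "A", "B", "B", "B"]
system2 = ["B", "A", "A", "A", "B", "B"]
only1, only2 = count_disagreements(reference, system1, system2)
print(only1, only2, sign_test_p(only1, only2))
```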

Interpretation (1)

Reference   System 1   System 2
A           A          B
B           A          B
C           A          B

How much do these two systems differ based on precision for the A label?

Maximally Intermediate Minimally

Interpretation (2)

Labels (System 1, System 2) per instance for each of the 8 possible shuffles:

Reference     1     2     3     4     5     6     7     8
A             AB    BA    AB    BA    BA    AB    BA    AB
B             AB    AB    AB    BA    AB    BA    BA    BA
C             AB    AB    BA    AB    BA    BA    BA    AB

Precision A
System 1      1/3   0     1/2   0     0     1     0     1/2
System 2      0     1     0     1/2   1/2   0     1/3   0
Δ             1/3   -1    1/2   -1/2  -1/2  1     -1/3  1/2
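The three-instance example can be enumerated exactly, since stratified shuffling of 3 instances gives only 2^3 = 8 shuffles. This sketch assumes the example is read as reference A, B, C with System 1 predicting A, A, A and System 2 predicting B, B, B, and takes precision as 0 when a system predicts no A. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

reference = ["A", "B", "C"]
system1 = ["A", "A", "A"]
system2 = ["B", "B", "B"]


def precision_a(reference, predicted):
    """Precision for label A; 0 if A is never predicted (assumption)."""
    hits = [r == "A" for r, p in zip(reference, predicted) if p == "A"]
    return Fraction(sum(hits), len(hits)) if hits else Fraction(0)


def all_deltas(reference, sys1, sys2):
    """Precision difference (System 1 minus System 2) for every stratified shuffle."""
    deltas = []
    for swaps in product([False, True], repeat=len(reference)):
        s1 = [b if swap else a for a, b, swap in zip(sys1, sys2, swaps)]
        s2 = [a if swap else b for a, b, swap in zip(sys1, sys2, swaps)]
        deltas.append(precision_a(reference, s1) - precision_a(reference, s2))
    return deltas


deltas = all_deltas(reference, system1, system2)
actual = deltas[0]  # the first combination swaps nothing
p_one_sided = Fraction(sum(d >= actual for d in deltas), len(deltas))
print([str(d) for d in deltas], p_one_sided)
```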

Conclusion

• Approximate randomization testing can be used for many applications.
• The basic idea is to assess how (im)probable the actual difference between two systems is when all possible permutations of the outputs are evaluated.
• The difference can be computed in many ways as long as the shuffled elements are independent.