
Artificial Intelligence
Empirical Evaluation of AI Systems
Ian Gent
[email protected]

Artificial Intelligence
Empirical Evaluation of Computer Systems
Part I:   Philosophy of Science
Part II:  Experiments in AI
Part III: Basics of Experimental Design with AI case studies

Science as Refutation
• Modern view of the progress of science is based on Popper (Sir Karl Popper, that is)
• A scientific theory is one that can be refuted
  • i.e. it should make testable predictions
  • if these predictions are incorrect, the theory is false
    • the theory may still be useful, e.g. Newtonian physics
• Therefore science is hypothesis testing
• Artificial intelligence aspires to be a science

Empirical Science
• Empirical = “relying upon or derived from observation or experiment”
• Most (all) of science is empirical
• Consider theoretical computer science
  • study based on Turing machines, lambda calculus, etc.
  • founded on the empirical observation that computer systems developed to date are Turing-complete
• Quantum computers might challenge this
  • if so, an empirically based theory of quantum computing will develop

Theory, not Theorems
• Theory-based science need not be all theorems
  • otherwise science would be Mathematics
• Compare the physics theory “QED”:
  • most accurate theory in the whole of science?
  • based on a model of the behaviour of particles
  • predictions accurate to many decimal places (9?)
  • success derived from the accuracy of its predictions
    • not the depth or difficulty or beauty of theorems
  • i.e. QED is an empirical theory
• AI/CS has too many theorems and not enough theory
  • compare advice on how to publish in JACM

Empirical CS/AI
• Computer programs are formal objects
  • so some use only theory that can be proved by theorems
  • but theorems are hard
• Treat computer programs as natural objects
  • like quantum particles, chemicals, living objects
  • perform empirical experiments on them
• We have a huge advantage over other sciences
  • no need for supercolliders (expensive) or animal experiments (ethical problems)
  • we should have complete command of our experiments

What are our hypotheses?
• My search program is better than yours
• Search cost grows exponentially with the number of variables for this kind of problem (one way to test this is sketched after this list)
• Constraint search systems are better at handling overconstrained systems, but OR systems are better at handling underconstrained systems
• My company should buy an AI search system rather than an OR one
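
A hypothesis like the second one is directly testable. Below is a minimal sketch of one way to test it; the problem sizes and mean costs are made-up placeholders, not measurements from any real solver. If search cost c grows roughly as a·b^n in the number of variables n, then log c should be close to linear in n; a poor linear fit, or a slope near zero, would count as evidence against the hypothesis, which is exactly the refutability Popper asks for.

```python
# Sketch: testing "search cost grows exponentially with the number of variables".
# The sizes and mean costs below are hypothetical placeholders; in a real experiment
# they would come from running your solver on many instances at each size.
import numpy as np

sizes = np.array([10, 15, 20, 25, 30, 35])                   # number of variables n
mean_cost = np.array([42, 310, 2400, 17500, 1.3e5, 9.8e5])   # hypothetical mean search nodes

# If cost ~ a * b**n then log(cost) ~ log(a) + n*log(b), so fit a line to the logs.
slope, intercept = np.polyfit(sizes, np.log(mean_cost), 1)
fitted = np.exp(intercept + slope * sizes)

print(f"estimated growth factor per extra variable: {np.exp(slope):.2f}")
print(f"worst relative fit error: {np.max(np.abs(fitted - mean_cost) / mean_cost):.2%}")
```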

Why do experiments?
• Too often AI experimenters might talk like this:
  • What is your experiment for?
    • Is my algorithm better than his?
  • Why?
    • I want to know which is faster.
  • Why?
    • Lots of people use each kind …
  • How will these people use your result?
    • ?

Why do experiments?
• Compare experiments on identical twins:
  • What is your experiment for?
    • I want to compare twins reared apart to those reared together, and nonidentical twins too.
  • Why?
    • We can get estimates of the genetic and social contributors to performance.
  • Why?
    • Because the role of genetics in behavior is one of the great unsolved questions.
• Experiments should address research questions
  • otherwise they can just be “track meets”

Basic issues in Experimental Design
• From Paul R. Cohen, Empirical Methods for Artificial Intelligence, MIT Press, 1995, Chapter 3
• Control
• Ceiling and floor effects
• Sampling biases

Control
• A control is an experiment in which the hypothesised variation does not occur
  • so the hypothesised effect should not occur either
• E.g. macaque monkeys given a vaccine based on human T-cells infected with SIV (a relative of HIV)
  • macaques gained immunity from SIV
• Later, macaques were given uninfected human T-cells
  • and the macaques still gained immunity!
• The control experiment was not originally done
  • and controls are not always obvious (you can’t control for all variables)

Case Study: MYCIN
• MYCIN was a medical expert system
  • recommended therapy for blood/meningitis infections
• How to evaluate its recommendations?
• Shortliffe used:
  • 10 sample problems
  • 8 other therapy recommenders
    • 5 faculty at Stanford Med. School, 1 senior resident, 1 senior postdoctoral researcher, 1 senior student
  • 8 impartial judges, each giving 1 point per problem
    • max score was 80 (see the scoring sketch below)
• Results: MYCIN 65; faculty 40-60; fellow 60; resident 45; student 30
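
To make the arithmetic concrete: 8 judges each awarding at most 1 point per problem over 10 problems gives a maximum of 80. The sketch below shows how such a blinded tally might be computed; the recommender labels and ratings are illustrative placeholders, not Shortliffe's actual data or protocol.

```python
# Sketch of a blinded scoring tally: judge ratings are indexed only by an
# anonymous recommender label, so judges cannot favour (or punish) the program.
# All names and ratings here are illustrative placeholders.
import random

recommenders = ["A", "B", "C"]          # anonymised: one of these is the expert system
problems = range(10)                     # 10 sample problems
judges = range(8)                        # 8 impartial judges

# ratings[judge][problem][recommender] = 1 if the judge accepts the therapy, else 0
ratings = {
    j: {p: {r: random.randint(0, 1) for r in recommenders} for p in problems}
    for j in judges
}

# Total score per recommender: at most 8 judges * 10 problems = 80 points.
totals = {
    r: sum(ratings[j][p][r] for j in judges for p in problems)
    for r in recommenders
}
print(totals)   # identities behind the labels are revealed only after scoring
```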

Case Study: MYCIN
• What were the controls?
• Control for judges’ bias for/against computers
  • judges did not know who recommended each therapy
• Control for easy problems
  • the medical student did badly, so the problems were not easy
• Control for our standard being low
  • e.g. random choice should do worse
• Control for the factor of interest
  • e.g. the hypothesis in MYCIN that “knowledge is power”
  • have groups with different levels of knowledge

Ceiling and Floor Effects
• Well-designed experiments can go wrong
• What if all our algorithms do particularly well (or they all do badly)?
  • we’ve got little evidence to choose between them
• Ceiling effects arise when test problems are insufficiently challenging
  • floor effects are the opposite, when problems are too challenging
• A problem in AI because we often use benchmark sets
• But how do we detect the effect?

Ceiling Effects: Machine Learning
• 14 datasets from the UCI corpus of benchmarks
  • used as a mainstay of the ML community
• Problem is learning classification rules
  • each item is a vector of features plus a classification
  • measure classification accuracy of each method (max 100%)
• Compare C4 with 1R*, two competing algorithms:

  DataSet:  BC    CH    GL    G2    HD    HE    ...   Mean
  C4        72    99.2  63.2  74.3  73.6  81.2  ...   85.9
  1R*       72.5  69.2  56.4  77    78    85.1  ...   83.8

Ceiling Effects

  DataSet:  BC    CH    GL    G2    HD    HE    ...   Mean
  C4        72    99.2  63.2  74.3  73.6  81.2  ...   85.9
  1R*       72.5  69.2  56.4  77    78    85.1  ...   83.8
  Max       72.5  99.2  63.2  77    78    85.1  ...   87.4

• C4 achieves only about 2% better accuracy than 1R*
• Even if we take the better of C4/1R* on each dataset, we only achieve 87.4% accuracy
• We have only weak evidence that C4 is better
  • both methods are performing near the ceiling of what is possible
• The ceiling effect is that we can’t compare the two methods well because both are achieving near the best practicable (the arithmetic is redone in the sketch below)
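
The ceiling check itself is just a per-dataset maximum followed by a mean. The sketch below redoes that arithmetic for the six accuracies shown above; since the full study used 14 UCI datasets, the means printed here will differ from the 85.9 / 83.8 / 87.4 on the slide.

```python
# Sketch of the ceiling-effect check: take the per-dataset best of the two
# learners, then the mean of that "best possible choice". Only the six
# accuracies shown on the slide are included here.
from statistics import mean

datasets = ["BC", "CH", "GL", "G2", "HD", "HE"]
c4 = [72.0, 99.2, 63.2, 74.3, 73.6, 81.2]
r1 = [72.5, 69.2, 56.4, 77.0, 78.0, 85.1]   # 1R*

best = [max(a, b) for a, b in zip(c4, r1)]

print(f"mean C4 accuracy over these datasets:  {mean(c4):.1f}")
print(f"mean 1R* accuracy over these datasets: {mean(r1):.1f}")
print(f"mean of per-dataset best:              {mean(best):.1f}")
# If even the per-dataset best barely beats either method, both are near the
# ceiling and the comparison tells us little about which learner is better.
```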

Ceiling Effects
• In fact 1R* only uses one feature (the best one); a minimal sketch of that idea follows below
• C4 uses on average 6.6 features
• The extra 5.6 features buy only about 2% improvement
• Conclusion?
  • either real-world learning problems are easy (use 1R*)
  • or we need more challenging datasets
• We need to be aware of ceiling effects in results
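
For reference, the core of the one-rule (1R) idea fits in a few lines: for each feature, map each of its values to the majority class seen with that value, then keep the single feature whose rule scores best on the training data. The sketch below shows only that core, on made-up discrete data; the 1R* variant used in the study includes refinements (e.g. handling of continuous and missing values) that are not shown here.

```python
# Minimal sketch of a 1R-style learner: one rule over one feature.
# Each example is (feature_vector, class_label); features are assumed discrete.
from collections import Counter, defaultdict

def one_r(examples):
    n_features = len(examples[0][0])
    best = None
    for f in range(n_features):
        # For feature f, map each observed value to the majority class.
        by_value = defaultdict(Counter)
        for x, y in examples:
            by_value[x[f]][y] += 1
        rule = {v: counts.most_common(1)[0][0] for v, counts in by_value.items()}
        correct = sum(rule[x[f]] == y for x, y in examples)
        if best is None or correct > best[0]:
            best = (correct, f, rule)
    _, f, rule = best
    default = Counter(y for _, y in examples).most_common(1)[0][0]
    return lambda x: rule.get(x[f], default)

# Toy usage with placeholder data: the second feature perfectly predicts the class.
train = [((0, "a"), "+"), ((1, "a"), "+"), ((0, "b"), "-"), ((1, "b"), "-")]
classify = one_r(train)
print(classify((1, "b")))   # -> '-'
```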

Sampling Bias
• Sampling bias is when data collection is biased against certain data
  • e.g. a teacher who says “Girls don’t answer maths questions”
  • observation might suggest that …
    • indeed girls don’t answer many questions
    • but that is because the teacher doesn’t ask them many questions
• Experienced AI researchers don’t do that, right?

Case Study: Phoenix
• Phoenix = an AI system to fight (simulated) forest fires
• Experiments suggested that wind speed was uncorrelated with the time to put out the fire
  • obviously incorrect (high winds spread forest fires)
• Wind speed vs. containment time (max 150 hours):
  • wind 3: 120, 55, 79, 10, 140, 26, 15, 110, 12, 54, 10, 103
  • wind 6: 78, 61, 58, 81, 71, 57, 21, 32, 70
  • wind 9: 62, 48, 21, 55, 101
• What’s the problem?

Sampling bias in Phoenix
• The cut-off of 150 hours introduces sampling bias
  • many high-wind fires get cut off, not many low-wind ones
• On the remaining data, the expected positive correlation between wind speed and time disappears (r = -0.53)
• In fact, the data shows that:
  • a lot of high-wind fires take > 150 hours to contain
  • those that don’t are similar to low-wind fires
• You wouldn’t do this, right?
  • You might if you had automated data analysis (see the sketch below).
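
The effect is easy to check numerically. The sketch below computes the Pearson correlation between wind speed and containment time using only the fires listed on the previous slide, i.e. the censored sample that survives the 150-hour cut-off. Because it uses only those 26 fires, the value it prints need not match the r quoted above, but it shows nothing like the strong positive correlation wind speed ought to produce. Re-running the same two lines on the uncensored data, including the fires that ran past 150 hours, is the quickest way to expose the bias.

```python
# Sketch: Pearson correlation between wind speed and containment time, using
# only the fires from the previous slide, i.e. those contained within the
# 150-hour cut-off. The censoring removes exactly the long, high-wind fires.
import numpy as np

times_by_wind = {
    3: [120, 55, 79, 10, 140, 26, 15, 110, 12, 54, 10, 103],
    6: [78, 61, 58, 81, 71, 57, 21, 32, 70],
    9: [62, 48, 21, 55, 101],
}

wind  = np.array([w for w, ts in times_by_wind.items() for _ in ts])
hours = np.array([t for ts in times_by_wind.values() for t in ts])

r = np.corrcoef(wind, hours)[0, 1]
print(f"r on censored data: {r:.2f}")   # nowhere near the strong positive
                                        # correlation wind speed should produce
```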