Transcript: Lecture 6

Today’s Topics
• Dealing with Noise
• Overfitting (the key issue in all of ML)
• A ‘Greedy’ Algorithm for Pruning D-Trees
• Generating IF-THEN Rules from D-Trees
• Rule Pruning
Noise: Major Issue in ML
Worst Case of Noise
+, - at same point in feature space
Causes of Noise
1. Too few features (“hidden variables”)
or too few possible values
2. Incorrectly reported/measured/judged feature values
3. Mis-classified instances
Noise - Major Issue in ML (cont.)
Overfitting
Producing an ‘awkward’ concept because
of a few ‘noisy’ points
[Figure: a scatter of + and - training examples with two candidate decision boundaries: a convoluted one that bends around the noisy points (bad performance on future ex's?) and a simpler one that ignores them (better performance?)]
Overfitting Viewed in Terms of Function-Fitting
(can exactly fit N points with a degree-(N-1) polynomial)
[Figure: f(x) vs. x for noisy data (Data = Red Line + Noise Model); a wiggly high-degree curve through every + point illustrates "Overfitting?", while a too-simple fit illustrates "Underfitting?"]
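To make the picture above concrete, here is a minimal sketch (my own illustration, not from the lecture) using only numpy: N noisy points generated as "red line + noise" are fit by polynomials of degree 0 (underfitting), 1 (matches the true line), and N-1 (passes through every training point). The exact fit has the lowest training error but typically the worst error on fresh data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 12
x = np.linspace(0.0, 1.0, n)
true_line = lambda t: 2.0 * t + 1.0                  # the "red line" (assumed slope/intercept)
y = true_line(x) + rng.normal(scale=0.3, size=n)     # data = red line + noise model

x_new = np.linspace(0.0, 1.0, 200)                   # fresh ("future") examples
y_new = true_line(x_new) + rng.normal(scale=0.3, size=200)

for degree in (0, 1, n - 1):                         # underfit, about right, overfit
    coeffs = np.polyfit(x, y, degree)                # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_new) - y_new) ** 2)
    print(f"degree {degree:2d}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```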
Definition of Overfitting
Assuming a test set large enough to be representative, concept C overfits the training data if there exists a simpler concept S such that

  Training-set accuracy of C > Training-set accuracy of S
but
  Test-set accuracy of C < Test-set accuracy of S
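As a concrete (hypothetical, not from the lecture) illustration of this definition using scikit-learn: let C be an unpruned decision tree and S a depth-2 tree, both trained on data with 20% label noise. Typically C beats S on the training set but loses to it on the test set, so C overfits.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% of the labels flipped (label noise)
X, y = make_classification(n_samples=600, n_features=10, flip_y=0.20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

C = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)               # unpruned: fits train (almost) exactly
S = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_tr, y_tr)  # much simpler concept

print("train accuracy: C =", C.score(X_tr, y_tr), " S =", S.score(X_tr, y_tr))
print("test  accuracy: C =", C.score(X_te, y_te), " S =", S.score(X_te, y_te))
# Usually train(C) > train(S) but test(C) < test(S), i.e., C overfits.
```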
Remember!
• It is easy to learn/fit the training data
• What’s hard is generalizing well to future
(‘test set’) data!
• Overfitting avoidance (reduction, really) is
the key issue in ML
• Easy to think ‘spurious correlations’ are
meaningful signals
See a Pattern?
The first 10 digits of Pi: 3.14159265
What comes next in Pi? 3 (already used). After that? 5.
"35" rounds to "4" (in the fractional part of the number).
Picture taken (by me) June 2015 in Lambeau Field Atrium, Green Bay, WI
"4" has since been added! Presumably a 'spurious correlation'
Can One Underfit?
• Sure, if not fully fitting the training set
Eg, just return majority category
(+ or -) in the trainset as the learned model
• But also if not enough data to illustrate
important distinctions
Eg, color may be important, but all examples seen are
red, so no reason to include color and make more
complex model
Overfitting + Noise
Using the strict definition of overfitting
presented earlier, is it possible to overfit
noise-free data?
(Remember: overfitting is the key ML issue, not just a decision-tree topic)
Example of Overfitting Noise-Free Data
Let
– Correct concept = A ∧ B
– Feature C be true 50% of the time, for both + and - examples
– Prob(pos example) = 0.66
– Training set
  +: A B C D E,  A B C ¬D E,  A B C D ¬E
  -: A ¬B ¬C D ¬E,  ¬A B ¬C ¬D E
Example (concluded)

Tree                        Trainset Accuracy   TestSet Accuracy
C? (T → +, F → -)                 100%                50%
Pruned (always +)                  60%                66%
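A quick sanity check of those numbers (my own sketch, not from the slides): the tree that splits on C classifies the five training examples perfectly but can do no better than chance on test data, since C is independent of the class; the pruned tree does worse on the training set but matches the 66% base rate at test time.

```python
train = [  # the five training examples from the previous slide (1 = true, 0 = false)
    {"A": 1, "B": 1, "C": 1, "D": 1, "E": 1, "class": "+"},
    {"A": 1, "B": 1, "C": 1, "D": 0, "E": 1, "class": "+"},
    {"A": 1, "B": 1, "C": 1, "D": 1, "E": 0, "class": "+"},
    {"A": 1, "B": 0, "C": 0, "D": 1, "E": 0, "class": "-"},
    {"A": 0, "B": 1, "C": 0, "D": 0, "E": 1, "class": "-"},
]

tree   = lambda ex: "+" if ex["C"] else "-"   # the learned tree: split on C
pruned = lambda ex: "+"                       # pruned tree: always predict +

acc = lambda h: sum(h(ex) == ex["class"] for ex in train) / len(train)
print("trainset accuracy:", acc(tree), acc(pruned))             # 1.0 and 0.6

# Expected testset accuracy: C is true 50% of the time for both classes, so the
# split on C is a coin flip; always predicting + matches Prob(pos) = 0.66.
p_pos = 0.66
print("testset accuracy :", p_pos * 0.5 + (1 - p_pos) * 0.5, p_pos)  # 0.5 and 0.66
```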
ID3 & Noisy Data
To avoid overfitting, could allow splitting to stop
before all ex’s are of one class
– Early stopping was Quinlan’s original idea
Stop if further splitting not justified by a statistical test
(just skim the text's material on the χ² test)
– But post-pruning now seen as better
More robust to weaknesses of greedy algo’s
(eg, post-pruning benefits from seeing the full tree;
a node may look bad when building tree, but not in hindsight)
ID3 & Noisy Data (cont.)
Recap: Build complete tree, then use
some ‘spare’ (tuning) examples to decide
which parts of tree can be pruned
- called Reduced [tuneset] Error Pruning
ID3 & Noisy Data (cont.)
[Figure: the full tree with one subtree marked 'discard?': would dropping it give better tuneset accuracy?]
• See which dropped subtree leads to
highest tune-set accuracy
• Repeat (ie, another greedy algo)
Greedily Pruning D-Trees
Sample (Hill Climbing) Search Space
[Figure: hill-climbing search space of pruned trees, expanding the 'best' child at each step; stop here if a node's best child is not an improvement]
Note: in pruning we're reversing the tree-building process
Greedily Pruning D-Trees - Pseudocode

1. Run ID3 to fully fit the TRAIN' set; measure accuracy on TUNE
2. Consider all subtrees where ONE interior node is removed and replaced by a leaf
   - label the leaf with the majority category in the pruned subtree
   IF progress on TUNE, choose the best such subtree
   ELSE (ie, if no improvement) quit
3. Go to 2
(A runnable sketch of this loop follows below.)
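Below is a runnable sketch of that loop (my own Python, not code from the lecture). It assumes a tree represented as nested dicts in which each interior node stores the feature it tests, its children, and the majority class of the training examples reaching it; a leaf just stores its label.

```python
import copy

# Assumed (illustrative) tree format:
#   leaf:     {"label": "+"}
#   interior: {"feature": "Color", "children": {"Red": subtree, ...}, "majority": "+"}

def classify(tree, example):
    """Follow decisions down to a leaf and return its label."""
    while "label" not in tree:
        tree = tree["children"][example[tree["feature"]]]
    return tree["label"]

def accuracy(tree, examples):
    return sum(classify(tree, ex) == ex["class"] for ex in examples) / len(examples)

def interior_nodes(tree, path=()):
    """Yield the path (sequence of branch values) to every interior node."""
    if "label" in tree:
        return
    yield path
    for value, child in tree["children"].items():
        yield from interior_nodes(child, path + (value,))

def prune_at(tree, path):
    """Copy the tree, replacing the node at `path` with a majority-class leaf."""
    new_tree = copy.deepcopy(tree)
    if not path:
        return {"label": new_tree["majority"]}
    node = new_tree
    for value in path[:-1]:
        node = node["children"][value]
    node["children"][path[-1]] = {"label": node["children"][path[-1]]["majority"]}
    return new_tree

def reduced_error_prune(tree, tune_examples):
    """Greedily replace interior nodes by leaves while TUNE accuracy improves."""
    best_acc = accuracy(tree, tune_examples)
    while True:
        candidates = [prune_at(tree, p) for p in interior_nodes(tree)]
        if not candidates:                              # tree is already a single leaf
            return tree
        best = max(candidates, key=lambda t: accuracy(t, tune_examples))
        if accuracy(best, tune_examples) > best_acc:    # progress on TUNE?
            tree, best_acc = best, accuracy(best, tune_examples)
        else:
            return tree                                 # no improvement: quit
```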
Train/Tune/Test Accuracies
(same sort of curves for other tuned param’s in other algo’s)
[Figure: Train, Tune, and Test accuracy (up to 100%) plotted against amount of pruning; the chosen pruned tree is the one that maximizes Tune-set accuracy, which lies near, but generally not exactly at, the ideal tree to choose (the Test-set optimum)]
The General Tradeoff in
Greedy Algorithms (more later)
Efficiency vs. Optimality
[Figure: an initial tree rooted at R with interior nodes A-F; assuming the true best cuts, one would discard C's and F's subtrees, but greedy search makes the single best cut, discarding B's subtrees (an irrevocable choice)]
Greedy Search: Powerful, General-Purpose Trick-of-the-Trade
Generating IF-THEN
Rules from Trees
• Antecedent: Conjunction of all decisions
leading to terminal node
• Consequent: Label of terminal node
[Figure: example tree, COLOR? at the root with Green → -, Blue → +, and Red → SIZE?, where Big → + and Small → -]
Generating Rules (cont)
Previous slide’s tree generates these rules
If Color=Green → Output = -
If Color=Blue → Output = +
If Color=Red and Size=Big → +
If Color=Red and Size=Small → -
Note
1. Can 'clean up' the rule set (next slide)
2. Decision trees learn disjunctive concepts
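A small companion sketch (mine, using the same illustrative nested-dict tree format as the pruning code above, minus the majority fields, which rule generation does not need): each root-to-leaf path becomes one IF-THEN rule, with the antecedent the conjunction of decisions along the path and the consequent the leaf's label. The example tree encodes exactly the four rules listed above.

```python
def tree_to_rules(tree, conditions=()):
    """Return a list of (antecedent, label) pairs, one per leaf."""
    if "label" in tree:                           # terminal node -> one rule
        return [(conditions, tree["label"])]
    rules = []
    for value, child in tree["children"].items():
        test = (tree["feature"], value)           # e.g., ("Color", "Red")
        rules += tree_to_rules(child, conditions + (test,))
    return rules

# The tree from the previous slide (structure recovered from the rules it generates)
tree = {"feature": "Color", "children": {
    "Green": {"label": "-"},
    "Blue":  {"label": "+"},
    "Red":   {"feature": "Size", "children": {
        "Big": {"label": "+"}, "Small": {"label": "-"}}}}}

for antecedent, label in tree_to_rules(tree):
    tests = " and ".join(f"{f}={v}" for f, v in antecedent)
    print(f"IF {tests} THEN Output = {label}")
```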
Rule Post-Pruning
(Another Greedy Algorithm)
1. Induce a decision tree
2. Convert to rules (see earlier slide)
3. Consider dropping any one
rule antecedent
– Delete the one that improves tuning set
accuracy the most
– Repeat as long as progress is being made (see the sketch below)
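And a matching sketch of this antecedent-dropping loop (again my own illustration, not lecture code). Because pruned rules can overlap, the scorer below assumes a naive first-matching-rule-wins scheme plus a default class, purely so that accuracy can be measured; the need for some conflict-resolution scheme is exactly the point raised on the next slide.

```python
def matches(antecedent, example):
    """True if the example satisfies every test in the antecedent."""
    return all(example.get(feat) == val for feat, val in antecedent)

def rule_set_accuracy(rules, examples, default="+"):
    """Score a rule list with a naive first-match-wins conflict-resolution scheme."""
    correct = 0
    for ex in examples:
        pred = next((label for ant, label in rules if matches(ant, ex)), default)
        correct += (pred == ex["class"])
    return correct / len(examples)

def prune_rules(rules, tune_examples):
    """Greedily drop single antecedent tests while tuning-set accuracy improves."""
    best = rule_set_accuracy(rules, tune_examples)
    while True:
        candidates = []
        for i, (ant, label) in enumerate(rules):
            for j in range(len(ant)):             # drop one test from one rule
                shorter = ant[:j] + ant[j + 1:]
                candidates.append(rules[:i] + [(shorter, label)] + rules[i + 1:])
        if not candidates:
            return rules
        top = max(candidates, key=lambda rs: rule_set_accuracy(rs, tune_examples))
        if rule_set_accuracy(top, tune_examples) > best:
            rules, best = top, rule_set_accuracy(top, tune_examples)
        else:
            return rules
```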
Rule Post-Pruning (cont)
Advantages
– Allows an intermediate node to be pruned from some rules but retained in others
– Can correct poor early decisions in tree construction
– Final concept more understandable
But note that the final rules will overlap one another – so a 'conflict resolution' scheme is needed
Also applicable to ML algo's that directly learn rules (eg, ILP, MLNs)
Training with Noisy Data
If we can clean up the training data,
should we do so?
– No (assuming one can’t clean up the testing
data when the learned concept will be used)
– Better to train with the same type of data as
will be experienced when the result of
learning is put into use
– Recall the story where hadBankruptcy was the best indicator of
"good candidate for credit card"!
Aside:
A Rose by Any Other Name …
Tuning sets also called
– Pruning sets (in d-tree algorithms)
– Validation sets (in general),
but note that some of the literature (eg, the stats
community) calls AI's test sets "validation sets"
(and AI's tuning sets "test sets"!)