Transcript: CS 540 - Fall 2015 (© Jude Shavlik), Lecture 4, Week 2 (9/15/15)

Today’s Topics
• Learning Decision Trees (Chapter 18)
– We’ll use d-trees to introduce/motivate many general
issues in ML (eg, overfitting reduction)
• “Forests” of decision trees are a very successful
ML approach, arguably the best on many tasks
• Expected-Value Calculations
(a topic we’ll revisit a few times)
• Information Gain
• Advanced Topic: Regression Trees
• Coding Tips for HW1
Learning Decision Trees (Ch 18):
The ID3 Algorithm
(Quinlan 1979; Machine Learning 1:1 1986)
Induction of Decision Trees (top-down)
– Based on Hunt’s CLS psych model (1963)
– Handles noisy & missing feature values
– C4.5 and C5.0 successors; CART very similar
[Figure: an example decision tree that tests COLOR? and then SIZE?, with + and - leaves]
Main Hypothesis of ID3
Ross Quinlan
The simplest tree that classifies training
examples will work best on future examples
(Occam’s Razor)
[Figure: two trees that fit the training examples, compared ("VS."): a small tree that tests only SIZE? vs. a larger tree that tests COLOR? and then SIZE?, each with + and - leaves]
NP-hard to find the smallest tree (Hyafil & Rivest, 1976)
Why Occam’s Razor?
(Occam lived 1285 – 1349)
• There are fewer short hypotheses
(small trees in ID3) than long ones
• Short hypothesis that fits training data
unlikely to be coincidence
• Long hypothesis that fits training data
might be (since many more possibilities)
• COLT community formally addresses
these issues (ML theory)
Finding Small
Decision Trees
ID3 generates small trees with a greedy algorithm:
– Find a feature that “best” divides the data
– Recur on each subset of the data that
the feature creates
What does “best” mean?
– We’ll briefly postpone answering this
Overview of ID3 (Recursion!)
[Figure: ID3 run on a dataset of positive examples +1 ... +5 and negative examples -1 ... -3 with candidate features A1 ... A4. The splitting attribute (aka feature) chosen at the root partitions the dataset, and ID3 recurses on each subset with the remaining features; the resulting d-tree is shown in red. A branch that receives no examples ("?") is given the majority class at the parent node, here +. (Why?)]
ID3 Algorithm
(Figure 18.5 of textbook)
Given
  E, a set of classified examples
  F, a set of features not yet in the decision tree

If |E| = 0 then return the majority class at the parent
Else if All_Examples_Same_Class, return <the class>
Else if |F| = 0 then return the majority class (we have +/- ex's with the same feature values)
Else
  Let bestF = FeatureThatGainsMostInfo(E, F)
  Let leftF = F - bestF
  Add node bestF to the decision tree
  For each possible value, v, of bestF do
    Add an arc (labeled v) to the decision tree
    and connect it to the result of
      ID3({ex in E | ex has value v for feature bestF}, leftF)
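Below is a minimal Python sketch of this pseudocode. It assumes examples are dicts mapping feature names to values plus a 'class' key; the function and argument names are illustrative, not the HW's required interface, and the feature chooser defaults to a random pick until info gain is introduced later.

import random
from collections import Counter

def id3(examples, features, parent_majority=None, choose=None):
    # examples: list of dicts {feature: value, ..., 'class': label}
    # features: feature names not yet used on this path
    # choose:   (examples, features) -> feature name; defaults to a random pick
    if not examples:                                  # |E| = 0
        return parent_majority                        # majority class at parent
    classes = [ex['class'] for ex in examples]
    majority = Counter(classes).most_common(1)[0][0]
    if len(set(classes)) == 1:                        # all examples same class
        return classes[0]
    if not features:                                  # |F| = 0
        return majority
    choose = choose or (lambda E, F: random.choice(list(F)))
    best = choose(examples, features)                 # bestF
    left = [f for f in features if f != best]         # leftF = F - bestF
    tree = {'feature': best, 'branches': {}}          # node bestF
    for v in set(ex[best] for ex in examples):        # one arc per observed value v
        subset = [ex for ex in examples if ex[best] == v]
        tree['branches'][v] = id3(subset, left, majority, choose)
    return tree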
Venn Diagram View of ID3
Question: How do decision trees
divide feature space?
[Figure: training examples plotted in a two-dimensional feature space (axes F1 and F2), with + and - points scattered throughout]
Venn Diagram View of ID3
Question: How do decision trees
divide the feature space?
Answer: with "axis-parallel splits."

[Figure: the same F1-F2 feature space, now carved into rectangular regions by splits parallel to the axes; each region is labeled + or -]
[Figure: an example d-tree printed in plain ASCII, to use as a guide on how to print d-trees in ASCII]
Main Issue
How to choose the next feature to place in the decision tree?
– Random choice? [works better than you'd expect]
– The feature with the largest number of values?
– The feature with the fewest?
– An information-theoretic measure (Quinlan's approach), a general-purpose tool, eg, often used for "feature selection"
Expected Value
Calculations: Sample Task
• Imagine you invest $1 in a lottery ticket
• It says odds are
– 1 in 10 times you’ll win $5
– 1 in 1,000,000 times you’ll win $100,000
• How much do you expect to get back?
0.1 x $5 + 0.000001 x $100,000 = $0.50 + $0.10 = $0.60
More Generally
Assume event A has N discrete and disjoint random outcomes. Then

  Expected value(A) = Σ_{i=1..N} prob(outcome_i occurs) × value(outcome_i)
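A tiny Python sketch of this expected-value calculation (the function name is illustrative); the numbers are the lottery odds from the previous slide:

def expected_value(outcomes):
    # outcomes: list of (probability, value) pairs for disjoint outcomes
    return sum(p * v for p, v in outcomes)

# 1-in-10 chance of $5, 1-in-1,000,000 chance of $100,000
print(expected_value([(0.1, 5), (0.000001, 100000)]))   # ~0.60 dollars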
Scoring the Features
(so we can pick the best one)
Let f+ = fraction of positive examples
Let f - = fraction of negative examples
f+ = p / (p + n), f - = n / (p + n)
where p = #pos, n = #neg
(From where will we get this info?)
The expected information needed to determine the category of one of these examples is

  InfoNeeded(f+, f-) = -f+ lg(f+) - f- lg(f-)
This is also called the entropy of the set of examples
(derived later)
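A possible Python version of InfoNeeded (ie, two-class entropy), using the convention that 0 × lg(0) = 0; the name info_needed is illustrative:

from math import log2

def info_needed(f_pos, f_neg):
    # entropy of a set with fractions f_pos positive and f_neg negative,
    # using the convention 0 * lg(0) = 0
    return sum(-f * log2(f) for f in (f_pos, f_neg) if f > 0)

print(info_needed(1.0, 0.0))   # 0.0   (all one class)
print(info_needed(0.5, 0.5))   # 1.0   (50-50 mixture)
print(info_needed(0.6, 0.4))   # ~0.97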
Consider the Extreme Cases of
InfoNeeded(f +, f -)
All same class (+, say):
  InfoNeeded(1, 0) = -1 lg(1) - 0 lg(0) = 0   [0 lg(0) = 0 by def’n]

50-50 mixture:
  InfoNeeded(½, ½) = 2 × [ -½ lg(½) ] = 1

[Plot: InfoNeeded(f+, 1 - f+) as a function of f+, rising from 0 at f+ = 0 to a maximum of 1 at f+ = 0.5 and falling back to 0 at f+ = 1]
Evaluating a Feature
• How much does it help to know the
value of attribute/feature A ?
• Assume A divides the current set of
examples into N groups
Let qi = fraction of data on branch i
fi+ = fraction of +’s on branch i
fi - = fraction of –’s on branch i
Evaluating a Feature (cont.)
InfoRemaining(A) ≡ Σ_{i=1..N} qi × InfoNeeded(fi+, fi-)
– Info still needed after determining
the value of attribute A
– Another expected value calc
Pictorially:

[Figure: a node testing attribute A with outgoing arcs labeled v1 ... vN; the full set reaching A needs InfoNeeded(f+, f-), and the subset on each branch vi needs InfoNeeded(fi+, fi-)]
Info Gain
Gain(A) ≡ InfoNeeded(f+, f-) - InfoRemaining(A)

Gain is our scoring function in our hill-climbing (greedy) algorithm. Since InfoNeeded(f+, f-) is constant for all features, picking the feature with the largest Gain(A) is the same as picking the one with the smallest Remainder(A). That is, choose the feature that statistically tells us the most about the class of another example drawn from this distribution.
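A sketch of these two scoring functions in Python, continuing with the same example representation as the ID3 sketch above (dicts with a 'class' key; the positive class is assumed to be '+', purely for illustration):

from math import log2

def info_needed(f_pos, f_neg):
    # two-class entropy, with the convention 0 * lg(0) = 0 (as above)
    return sum(-f * log2(f) for f in (f_pos, f_neg) if f > 0)

def info_remaining(examples, feature, pos_class='+'):
    # expected info still needed after splitting on `feature`:
    # sum over branches i of q_i * InfoNeeded(f_i+, f_i-)
    remaining = 0.0
    for v in set(ex[feature] for ex in examples):
        branch = [ex for ex in examples if ex[feature] == v]
        q = len(branch) / len(examples)                      # q_i
        f_pos = sum(ex['class'] == pos_class for ex in branch) / len(branch)
        remaining += q * info_needed(f_pos, 1 - f_pos)
    return remaining

def info_gain(examples, feature, pos_class='+'):
    # Gain(A) = InfoNeeded(f+, f-) - InfoRemaining(A)
    f_pos = sum(ex['class'] == pos_class for ex in examples) / len(examples)
    return info_needed(f_pos, 1 - f_pos) - info_remaining(examples, feature, pos_class)

def feature_that_gains_most_info(examples, features):
    # greedy choice; equivalently, the feature with the smallest remainder
    return max(features, key=lambda f: info_gain(examples, f))

Plugging feature_that_gains_most_info in as the chooser in the earlier ID3 sketch gives the full info-gain version of the learner.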
Sample Info-Gain Calculation
InfoNeeded( f+, f -) = - f+ lg (f+) - f - lg (f -)
Color     Shape     Size      Class
Red                 BIG       +
Blue                BIG       +
Red                 SMALL     -
Yellow              SMALL     -
Red                 BIG       +
Info-Gain Calculation (cont.)
InfoNeeded(f+, f-) = I(0.6, 0.4) = -0.6 × lg(0.6) - 0.4 × lg(0.4) ≈ 0.97

Remainder(color) = 0.6 × I(2/3, 1/3) + 0.2 × I(1, 0) + 0.2 × I(0, 1)
Remainder(shape) = 0.6 × I(2/3, 1/3) + 0.4 × I(1/2, 1/2) > Remainder(color)
Remainder(size) = 0.6 × I(1, 0) + 0.4 × I(0, 1) = 0
Note that “Size” provides complete classification, so done
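As a quick sanity check, the scoring functions sketched above reproduce these numbers, using just the Color and Size columns of the table (example encoding as in the ID3 sketch):

examples = [
    {'Color': 'Red',    'Size': 'BIG',   'class': '+'},
    {'Color': 'Blue',   'Size': 'BIG',   'class': '+'},
    {'Color': 'Red',    'Size': 'SMALL', 'class': '-'},
    {'Color': 'Yellow', 'Size': 'SMALL', 'class': '-'},
    {'Color': 'Red',    'Size': 'BIG',   'class': '+'},
]
print(round(info_needed(0.6, 0.4), 2))               # 0.97
print(round(info_remaining(examples, 'Color'), 3))   # 0.551  (= 0.6 * I(2/3, 1/3))
print(round(info_remaining(examples, 'Size'), 3))    # 0.0    -> Size has the largest gain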
Recursive Methods
You’ll Need to Write
• The d-tree learning algo
(pseudocode appeared above)
• Classifying a 'testset' example
  Leaf nodes: return the leaf's label (ie, the predicted category)
  Interior nodes: determine which feature value to look up in the example,
    then return the result of the recursive call on the 'left' or 'right' branch
• Printing the d-tree in 'plain ASCII' (you need not follow this verbatim)
  Tip: pass in 'currentDepthOfRecursion' (initially 0)
  Leaf nodes:
    print the LABEL (and maybe the # of training ex's reaching here) + LINEFEED
  Interior nodes: for each outgoing arc
    print LINEFEED and 3 × currentDepthOfRecursion spaces
    print FEATURE NAME + " = " + the arc's value + ": "
    make the recursive call on that arc, with currentDepthOfRecursion + 1
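A Python sketch of both routines, assuming the nested-dict trees produced by the earlier ID3 sketch (a leaf is a bare class label, an interior node is {'feature': ..., 'branches': {...}}); the names and exact output format are illustrative:

def classify(tree, example):
    # walk from the root to a leaf, following the branch that matches
    # the example's value for each tested feature
    while isinstance(tree, dict):                    # interior node
        tree = tree['branches'][example[tree['feature']]]
    return tree                                      # leaf: the predicted category

def print_tree(tree, depth=0):
    # print the d-tree in plain ASCII, indenting 3 spaces per level
    if not isinstance(tree, dict):                   # leaf node
        print(tree)                                  # label + linefeed
        return
    for value, subtree in tree['branches'].items():  # interior node: one line per arc
        print()                                      # linefeed
        print(' ' * (3 * depth) + tree['feature'] + ' = ' + str(value) + ': ', end='')
        print_tree(subtree, depth + 1)

Note that in this sketch classify raises a KeyError on a feature value never seen during training; a fuller version might fall back to a majority class instead.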
Suggested Approach
• Randomly choose a feature
– Get tree building to work
– Get tree printing to work
– Get tree traversal (for test ex’s) to work
• Add in code for infoGain
– Test on simple, handcrafted datasets
• Train and test on SAME file (why?)
– Should get ALL correct (except if extreme noise)
• Produce what the HW requests