CS 540 Lecture 5 Transcript (Fall 2015)

Today’s Topics
• Read Chapter 3 & Section 4.1 (skim Section 3.6 and the rest of Chapter 4), and Sections 5.1, 5.2, 5.3, 5.7, 5.8, & 5.9 (skim the rest of Chapter 5) of the textbook
• Reviewing the Info Gain Calc from Last Week
• HW0 due 11:55pm; HW1 due in one week (two with late days)
• Fun reading: http://homes.cs.washington.edu/~pedrod/Prologue.pdf
• Information Gain Derived (and Generalized to k Output Categories)
• Handling Numeric and Hierarchical Features
• Advanced Topic: Regression Trees
• The Trouble with Too Many Possible Values
• What if Measuring Features is Costly?
ID3 Info Gain Measure Justified
(Ref: C4.5, J. R. Quinlan, Morgan Kaufmann, 1993, pp. 21-22)
Definition of Information
Info conveyed by message M depends on its probability, i.e.,
info(M) ≡ −log2[ Prob(M) ]    (due to Claude Shannon)
Note: last week we used infoNeeded() as a more informative name for info()
The Supervised Learning Task
Select an example from a set S and announce that it belongs to class C
The probability of this occurring is approximately f_C, the fraction of C’s in S.
Hence the info in this announcement is, by definition, −log2(f_C).
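As a quick illustration (mine, not from the slides), here is a minimal Python sketch of this definition; the function name and example probability are just for illustration:

```python
import math

def info(prob):
    """Bits of information conveyed by an event that occurs with the given probability."""
    return -math.log2(prob)

# Example: announcing that an example belongs to a class that makes up 1/4 of S
print(info(0.25))  # 2.0 bits
```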
ID3 Info Gain Measure (cont.)
Let there be K different classes in set S, namely C1, C2, …, CK
What’s the expected info from a message about the class of an example in set S?
info(S) = − f_C1 · log2(f_C1) − f_C2 · log2(f_C2) − … − f_CK · log2(f_CK)

i.e.,  info(S) = − Σ_{j=1..K} f_Cj · log2(f_Cj),  where f_Cj is the fraction of set S that are of class Cj.
info(S) is the average number of bits of information (obtained by looking at feature values) needed to classify a member of set S.
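A minimal Python sketch of this formula (my illustration, not from the slides), assuming the class labels of S are given as a list:

```python
import math
from collections import Counter

def info(labels):
    """Expected bits needed to announce the class of an example drawn from set S."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# Example: a set with 9 positives and 5 negatives
print(round(info(['+'] * 9 + ['-'] * 5), 3))  # about 0.94 bits
```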
Handling Hierarchical Features in ID3
Define a new feature for each level in the hierarchy, e.g.,

[Hierarchy diagram: Shape at the root, with children Circular and Polygonal; the more specific shapes at the next level are drawn pictorially]

Shape1 = { Circular, Polygonal }
Shape2 = { the more specific shapes pictured at the next level down }
Let ID3 choose the appropriate level of abstraction!
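A minimal sketch of the idea in Python (my illustration; the concrete level-2 shape names below are hypothetical, since the slide shows them only as pictures):

```python
# Map each most-specific shape to its ancestors, one per level of the hierarchy.
# The level-2 names (Circle, Triangle, ...) are made-up stand-ins for the slide's pictures.
SHAPE_HIERARCHY = {
    "Circle":   ["Circular",  "Circle"],
    "Ellipse":  ["Circular",  "Ellipse"],
    "Triangle": ["Polygonal", "Triangle"],
    "Square":   ["Polygonal", "Square"],
}

def hierarchical_features(value, name="Shape"):
    """Turn one hierarchical value into one feature per level (Shape1, Shape2, ...)."""
    return {f"{name}{level + 1}": ancestor
            for level, ancestor in enumerate(SHAPE_HIERARCHY[value])}

print(hierarchical_features("Triangle"))
# {'Shape1': 'Polygonal', 'Shape2': 'Triangle'}
```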
Handling Numeric Features in ID3
On the fly, create binary features and choose the best
Step 1: Plot the current examples (green = pos, red = neg)

[Number line labeled “Value of Feature” with example points at 5, 7, 9, 11, and 13]
Step 2: Divide midway between every consecutive pair of points with different categories to create new binary features, e.g.,
  feature_new1 ≡ F < 8   and   feature_new2 ≡ F < 10
Step 3: Choose the split with the best info gain (it competes with all other features); see the sketch below.
Note: “on the fly” means in each recursive call to ID3.
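A minimal Python sketch of Step 2 (my illustration; the labels assigned to the plotted points are a guess at the slide’s colors):

```python
def candidate_thresholds(values, labels):
    """Midpoints between consecutive points whose categories differ.

    Each returned threshold t defines a candidate binary feature F < t.
    """
    points = sorted(zip(values, labels))
    thresholds = []
    for (v1, y1), (v2, y2) in zip(points, points[1:]):
        if y1 != y2:
            thresholds.append((v1 + v2) / 2)
    return thresholds

# Example matching the slide's number line (the +/- pattern is assumed)
print(candidate_thresholds([5, 7, 9, 11, 13], ['+', '+', '-', '+', '+']))  # [8.0, 10.0]
```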
Handling Numeric Features (cont.)
Technical note: a numeric feature cannot be discarded after it has been used in one portion of the d-tree.
[Example d-tree: the root tests F < 10; one branch is a + leaf, while the other branch tests the same feature again with F < 5, leading to + and − leaves]
Advanced Topic: Regression Trees
(assume features are numerically valued)
Age > 25?
├─ No:  Output = 4·f3 + 7·f5 − 2·f9
└─ Yes: Gender?
        ├─ M: Output = 100·f4 − 2·f8
        └─ F: Output = 7·f6 − 2·f1 − 2·f8 + f7
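A minimal Python sketch of how such a tree makes a prediction (my illustration; the pairing of leaves with branches is my reading of the slide’s layout, and the feature names f1..f9 are treated as dictionary keys):

```python
def predict(ex):
    """Walk the example regression tree above; each leaf is a linear model over the features."""
    if ex["age"] <= 25:
        return 4 * ex["f3"] + 7 * ex["f5"] - 2 * ex["f9"]
    if ex["gender"] == "M":
        return 100 * ex["f4"] - 2 * ex["f8"]
    return 7 * ex["f6"] - 2 * ex["f1"] - 2 * ex["f8"] + ex["f7"]

example = {"age": 30, "gender": "M", "f1": 0, "f3": 0, "f4": 1.0,
           "f5": 0, "f6": 0, "f7": 0, "f8": 0.5, "f9": 0}
print(predict(example))  # 99.0
```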
Advanced Topic: Scoring “Splits” for Regression (Real-Valued) Problems
We want to return real values at the leaves
- For each feature, F, “split” as done in ID3
- Use the residual error remaining, say using Linear Least Squares (LLS), instead of info gain to score candidate splits
Error(F, i)  =  Σ_{ex ∈ subset i} [ out(ex) − LLS(ex) ]²

TotalError(F)  =  Σ_{i ∈ splits of F} Error(F, i)

[Figure: scatter plot of Output vs. X, with the LLS fit line drawn through the points]
Why not a weighted sum in total error?
Commonly, the models at the leaves are weighted sums of the features (y = mx + b).
Some approaches just place constants at the leaves.
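A minimal sketch of this scoring rule (my illustration, assuming a single real-valued input x per example and a hand-rolled 1-D least-squares fit within each subset):

```python
def lls_fit(xs, ys):
    """Ordinary least squares for y = m*x + b with a single input variable."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    m = 0.0 if var_x == 0 else sum((x - mean_x) * (y - mean_y)
                                   for x, y in zip(xs, ys)) / var_x
    return m, mean_y - m * mean_x

def total_error(subsets):
    """TotalError(F): sum over F's subsets of squared residuals from each subset's LLS fit."""
    total = 0.0
    for xs, ys in subsets:
        m, b = lls_fit(xs, ys)
        total += sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    return total

# Example: a binary split on F produces two subsets of (x, output) pairs
left = ([1, 2, 3], [1.0, 2.1, 2.9])
right = ([4, 5, 6], [10.0, 9.5, 9.0])
print(round(total_error([left, right]), 3))  # 0.015
```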
Unfortunate Characteristic Property of Using Info-Gain Measure
FAVORS FEATURES WITH HIGH BRANCHING FACTORS (i.e., many possible values)
Extreme case: split on Student ID

[Tree: Student ID at the root with one branch per ID value (1, …, 999999); each leaf holds at most one example, e.g., 1+ / 0− at one leaf and 0+ / 1− at another]
There is at most one example per leaf, so every leaf’s Info(.,.) score equals zero and the split gets a perfect score! But it generalizes very poorly (i.e., it memorizes the data).
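To see the problem numerically, here is a small sketch (my illustration) comparing the info gain of a unique-ID feature against an uninformative two-valued feature:

```python
import math
from collections import Counter

def info(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(labels, feature_values):
    """info(S) minus the weighted info of the subsets the feature creates."""
    n = len(labels)
    subsets = {}
    for y, v in zip(labels, feature_values):
        subsets.setdefault(v, []).append(y)
    remainder = sum(len(s) / n * info(s) for s in subsets.values())
    return info(labels) - remainder

labels     = ['+', '-', '+', '-', '+', '-']
student_id = [1, 2, 3, 4, 5, 6]               # unique value per example
coin_flip  = ['H', 'T', 'T', 'H', 'H', 'T']   # carries no real signal

print(info_gain(labels, student_id))           # 1.0 -- the maximum possible here
print(round(info_gain(labels, coin_flip), 3))  # about 0.082
```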
One Fix (used in HW0/HW1)
Convert all features to binary
e.g., Color = { Red, Blue, Green }
From one N-valued feature to N binary-valued features
Color = Red?
Color = Blue?
Color = Green?
Used in Neural Nets and SVMs
D-tree readability is probably reduced, but not necessarily
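A minimal sketch of that conversion (my illustration):

```python
def to_binary_features(name, possible_values, example_value):
    """Replace one N-valued feature with N binary 'name = value?' features."""
    return {f"{name} = {v}?": example_value == v for v in possible_values}

print(to_binary_features("Color", ["Red", "Blue", "Green"], "Blue"))
# {'Color = Red?': False, 'Color = Blue?': True, 'Color = Green?': False}
```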
Considering the Cost of Measuring a Feature
• Want trees with high accuracy and whose tests are inexpensive to compute
  – take a temperature vs. do a CAT scan
• Common Heuristic
  – InformationGain(F)² / Cost(F)
  – Used in medical domains as well as robot-sensing tasks
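A minimal sketch of the heuristic (my illustration; the gain and cost numbers are hypothetical):

```python
def cost_sensitive_score(gain, cost):
    """The slide's heuristic: InformationGain(F)^2 / Cost(F)."""
    return gain ** 2 / cost

# Hypothetical features: (name, info gain in bits, cost of measuring)
features = [("temperature", 0.30, 1.0), ("CAT scan", 0.60, 50.0)]
best = max(features, key=lambda f: cost_sensitive_score(f[1], f[2]))
print(best[0])  # 'temperature' wins despite its lower info gain
```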