Decision Trees
Advanced Statistical Methods in NLP
Ling572
January 10, 2012
Information Gain
InfoGain(S,A): expected reduction in entropy due to A
InfoGain(S, A) = H(S) - H(S \mid A)
             = H(S) - \sum_{a} p(A = a)\, H(S \mid A = a)
             = H(S) - \sum_{a \in Values(A)} \frac{|S_a|}{|S|}\, H(S_a)
Select A with max InfoGain
Resulting in lowest average entropy
Computing Average Entropy
[Figure: a node with |S| instances split into Branch 1 and Branch 2, each branch holding its per-class subsets S_{a1,a}, S_{a1,b} and S_{a2,a}, S_{a2,b}]

AvgEntropy = \sum_{a \in Values(A)} \frac{|S_a|}{|S|} \left( \sum_{c \in classes} -\frac{|S_{a,c}|}{|S_a|} \log_2 \frac{|S_{a,c}|}{|S_a|} \right)

|S_a| / |S|: fraction of samples down branch a
Inner sum: disorder of the class distribution on branch a
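A minimal Python sketch of these two quantities (the function names and the list-of-dicts instance format are illustrative choices, not from the slides); the later sketches reuse them:

```python
import math
from collections import Counter

def entropy(labels):
    """H(S): disorder of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def info_gain(instances, feature, label_key="Result"):
    """InfoGain(S, A) = H(S) - sum_a |S_a|/|S| * H(S_a)."""
    total = len(instances)
    avg_entropy = 0.0
    for value in set(x[feature] for x in instances):
        subset = [x[label_key] for x in instances if x[feature] == value]
        avg_entropy += (len(subset) / total) * entropy(subset)
    return entropy([x[label_key] for x in instances]) - avg_entropy
```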
Sunburn Example
Name    Hair     Height   Weight   Lotion   Result
Sarah   Blonde   Average  Light    No       Burn
Dana    Blonde   Tall     Average  Yes      None
Alex    Brown    Short    Average  Yes      None
Annie   Blonde   Short    Average  No       Burn
Emily   Red      Average  Heavy    No       Burn
Pete    Brown    Tall     Heavy    No       None
John    Brown    Average  Heavy    No       None
Katie   Blonde   Short    Light    Yes      None
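The same table as a small Python structure, used only by the illustrative checks below:

```python
# Sunburn data from the table above, as a list of dicts.
sunburn = [
    {"Name": "Sarah", "Hair": "Blonde", "Height": "Average", "Weight": "Light",   "Lotion": "No",  "Result": "Burn"},
    {"Name": "Dana",  "Hair": "Blonde", "Height": "Tall",    "Weight": "Average", "Lotion": "Yes", "Result": "None"},
    {"Name": "Alex",  "Hair": "Brown",  "Height": "Short",   "Weight": "Average", "Lotion": "Yes", "Result": "None"},
    {"Name": "Annie", "Hair": "Blonde", "Height": "Short",   "Weight": "Average", "Lotion": "No",  "Result": "Burn"},
    {"Name": "Emily", "Hair": "Red",    "Height": "Average", "Weight": "Heavy",   "Lotion": "No",  "Result": "Burn"},
    {"Name": "Pete",  "Hair": "Brown",  "Height": "Tall",    "Weight": "Heavy",   "Lotion": "No",  "Result": "None"},
    {"Name": "John",  "Hair": "Brown",  "Height": "Average", "Weight": "Heavy",   "Lotion": "No",  "Result": "None"},
    {"Name": "Katie", "Hair": "Blonde", "Height": "Short",   "Weight": "Light",   "Lotion": "Yes", "Result": "None"},
]
```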
Picking a Test
Hair Color:
  Blonde: Sarah: B, Dana: N, Annie: B, Katie: N
  Red: Emily: B
  Brown: Alex: N, Pete: N, John: N

Height:
  Short: Alex: N, Annie: B, Katie: N
  Average: Sarah: B, Emily: B, John: N
  Tall: Dana: N, Pete: N

Weight:
  Light: Sarah: B, Katie: N
  Average: Dana: N, Alex: N, Annie: B
  Heavy: Emily: B, Pete: N, John: N

Lotion:
  No: Sarah: B, Annie: B, Emily: B, Pete: N, John: N
  Yes: Dana: N, Alex: N, Katie: N
Entropy in Sunburn Example

S = [3B, 5N]

H(S) = -(3/8 \log_2 3/8 + 5/8 \log_2 5/8) = 0.954

InfoGain(S, A) = H(S) - \sum_{a \in Values(A)} \frac{|S_a|}{|S|} \left( \sum_{c \in classes} -\frac{|S_{a,c}|}{|S_a|} \log_2 \frac{|S_{a,c}|}{|S_a|} \right)

Hair color: 0.954 - (4/8 (-2/4 \log_2 2/4 - 2/4 \log_2 2/4) + 1/8 * 0 + 3/8 * 0) = 0.954 - 0.5 = 0.454
Height: 0.954 - 0.69 = 0.264
Weight: 0.954 - 0.94 = 0.014
Lotion: 0.954 - 0.61 = 0.344
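A quick check of these numbers with the sketch functions above:

```python
for feature in ["Hair", "Height", "Weight", "Lotion"]:
    print(feature, round(info_gain(sunburn, feature), 3))
# -> Hair 0.454, Height 0.266, Weight 0.016, Lotion 0.348
# (the slide's 0.264 / 0.014 / 0.344 come from rounding the average
#  entropies to two decimals before subtracting)
```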
Picking a Test
Remaining instances after the Hair Color = Blonde split:

Height:
  Short: Annie: B, Katie: N
  Average: Sarah: B
  Tall: Dana: N

Weight:
  Light: Sarah: B, Katie: N
  Average: Dana: N, Annie: B
  Heavy: (none)

Lotion:
  No: Sarah: B, Annie: B
  Yes: Dana: N, Katie: N
Entropy in Sunburn Example
S = [2B, 2N]

H(S) = -(1/2 \log_2 1/2 + 1/2 \log_2 1/2) = 1

InfoGain(S, A) = H(S) - \sum_{a \in Values(A)} \frac{|S_a|}{|S|} \left( \sum_{c \in classes} -\frac{|S_{a,c}|}{|S_a|} \log_2 \frac{|S_{a,c}|}{|S_a|} \right)

Height: 1 - (2/4 (-1/2 \log_2 1/2 - 1/2 \log_2 1/2) + 1/4 * 0 + 1/4 * 0) = 1 - 0.5 = 0.5
Weight: 1 - (2/4 (-1/2 \log_2 1/2 - 1/2 \log_2 1/2) + 2/4 (-1/2 \log_2 1/2 - 1/2 \log_2 1/2)) = 1 - 1 = 0
Lotion: 1 - 0 = 1
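The same check on the remaining Blonde instances:

```python
blonde = [x for x in sunburn if x["Hair"] == "Blonde"]
for feature in ["Height", "Weight", "Lotion"]:
    print(feature, info_gain(blonde, feature))
# -> Height 0.5, Weight 0.0, Lotion 1.0: Lotion separates this branch perfectly
```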
Building Decision Trees with Information Gain

Until there are no inhomogeneous leaves:
Select an inhomogeneous leaf node
Replace that leaf node by a test node creating subsets that yield highest information gain (see the sketch below)
Effectively creates set of rectangular regions
Repeatedly draws lines in different axes
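A minimal recursive sketch of this loop, reusing the info_gain sketch and the sunburn data from above (build_tree and its dict-of-dicts tree layout are hypothetical choices, not the slides' notation):

```python
from collections import Counter

def build_tree(instances, features, label_key="Result"):
    """Grow a tree by repeatedly splitting on the highest-InfoGain feature."""
    labels = [x[label_key] for x in instances]
    # Homogeneous leaf, or no features left to test: return the majority label.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: info_gain(instances, f, label_key))
    tree = {best: {}}
    for value in set(x[best] for x in instances):
        subset = [x for x in instances if x[best] == value]
        remaining = [f for f in features if f != best]
        tree[best][value] = build_tree(subset, remaining, label_key)
    return tree

# build_tree(sunburn, ["Hair", "Height", "Weight", "Lotion"])
# -> splits on Hair first, then on Lotion under the Blonde branch
```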
Alternate Measures
Issue with Information Gain:
Favors features with more values
Option:
Gain Ratio:

GainRatio(S, A) = \frac{InfoGain(S, A)}{SplitRatio(S, A)}

SplitRatio(S, A) = -\sum_{a \in Values(A)} \frac{|S_a|}{|S|} \log_2 \frac{|S_a|}{|S|}

S_a: elements of S with value A = a
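A matching sketch, reusing info_gain from above (gain_ratio is a hypothetical name; it guards against a zero split ratio when a feature takes only one value):

```python
import math
from collections import Counter

def gain_ratio(instances, feature, label_key="Result"):
    """InfoGain normalized by the entropy of the split itself."""
    total = len(instances)
    counts = Counter(x[feature] for x in instances)
    split_ratio = -sum((n / total) * math.log2(n / total)
                       for n in counts.values())
    if split_ratio == 0.0:  # feature takes only one value here
        return 0.0
    return info_gain(instances, feature, label_key) / split_ratio
```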
Overfitting
Overfitting:
Model fits the training data TOO well
Fits noise, irrelevant details
Why is this bad?
Harms generalization
Fits training data too well, fits new data badly
For a model m, compare training_error(m) and D_error(m), where D = all data
m overfits if there is another model m' with:
training_error(m) < training_error(m'), but
D_error(m) > D_error(m')
Avoiding Overfitting
Strategies to avoid overfitting:
Early stopping:
Stop when InfoGain < threshold
Stop when number of instances < threshold
Stop when tree depth > threshold
Post-pruning
Grow full tree and remove branches
Which is better?
Unclear, both used.
For some applications, post-pruning better
Post-Pruning
Divide data into
Training set: used to build the original tree
Validation set: used to perform pruning
Build decision tree based on training data
Until pruning does not reduce validation set performance
Compute performance for pruning each node (and its children)
Greedily remove nodes whose removal does not reduce validation set performance
Yields smaller tree with best performance
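A rough sketch of this greedy validation-based pruning, assuming the dict-based trees from the build_tree sketch and a held-out list val_data of instances; predict, accuracy, and the majority-leaf shortcut are illustrative simplifications, not the exact procedure:

```python
from collections import Counter

def predict(tree, instance, default="None"):
    """Follow feature tests down a dict-based tree to a leaf label.
    Unseen feature values fall back to `default` (a simplification)."""
    while isinstance(tree, dict):
        feature = next(iter(tree))
        tree = tree[feature].get(instance[feature], default)
    return tree

def accuracy(tree, data, label_key="Result"):
    return sum(predict(tree, x) == x[label_key] for x in data) / len(data)

def prune(tree, val_data, label_key="Result"):
    """Bottom-up: replace a subtree with a majority leaf whenever
    validation-set accuracy does not drop."""
    root = {"root": tree}  # wrapper so the root itself can be collapsed

    def leaf_labels(node):
        # Proxy for the majority training label at a node: its leaf labels.
        if not isinstance(node, dict):
            return Counter([node])
        feature = next(iter(node))
        counts = Counter()
        for child in node[feature].values():
            counts += leaf_labels(child)
        return counts

    def try_collapse(children, key):
        node = children[key]
        if not isinstance(node, dict):
            return
        feature = next(iter(node))
        for value in node[feature]:
            try_collapse(node[feature], value)
        before = accuracy(root["root"], val_data, label_key)
        children[key] = leaf_labels(node).most_common(1)[0][0]
        if accuracy(root["root"], val_data, label_key) < before:
            children[key] = node  # pruning hurt validation accuracy: undo

    try_collapse(root, "root")
    return root["root"]

# e.g. tree = build_tree(train_data, ["Hair", "Height", "Weight", "Lotion"])
#      tree = prune(tree, val_data)
```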
Performance Measures
Compute accuracy on:
Validation set
k-fold cross-validation
Weighted classification error cost:
Weight some types of errors more heavily
Minimum description length:
Favor good accuracy on compact models
MDL = error(tree) + model_size(tree)
Rule Post-Pruning
Convert tree to rules
Prune rules independently
Sort final rule set
Probably most widely used method (toolkits)
Modeling Features
Different types of features need different tests
Binary: Test branches on true/false
Discrete: Branches for each discrete value
Continuous?
Need to discretize
Enumerating all values is not possible or desirable
Pick a value x
Branches: value < x; value >= x
How can we pick split points?
Picking Splits
Need useful, sufficient split points
What’s a good strategy?
Approach:
Sort all values for the feature in training data
Identify adjacent instances of different classes
Candidate split points between those instances
Select candidate with highest information gain
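A small sketch of this strategy for one numeric feature (candidate_splits is a hypothetical helper):

```python
def candidate_splits(instances, feature, label_key="Result"):
    """Sort by the feature, then propose midpoints between adjacent
    instances whose class labels differ."""
    ordered = sorted(instances, key=lambda x: x[feature])
    splits = []
    for left, right in zip(ordered, ordered[1:]):
        # Skip ties in the feature value; only class changes matter.
        if left[label_key] != right[label_key] and left[feature] != right[feature]:
            splits.append((left[feature] + right[feature]) / 2)
    return splits

# The chosen split is the candidate x (branching on value < x vs. value >= x)
# with the highest information gain.
```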
Advanced Topics
Missing features:
What do you do if an instance lacks a feature value?
Feature costs:
How do you model different costs for features?
Regression trees:
How do you build trees with real-valued predictions?
Missing Features
Problem:
What if you don’t know the value for a feature?
Not binary presence/absence
Create synthetic value:
‘blank’: allow a distinguished value ‘blank’
most common value: assign the most common value of the feature in the training set at that node
common value by class: assign the most common value of the feature in the training set at that node for that class
Assign probability p_i to each possible value v_i of A
Assign a fraction (p_i) of the example to each descendant in the tree
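One of these options as a tiny sketch, filling in the node's most common observed value (fill_missing is a hypothetical helper):

```python
from collections import Counter

def fill_missing(instances, feature, missing=None):
    """Replace missing values of `feature` with the most common value
    observed for it among the instances at this node."""
    observed = [x[feature] for x in instances if x[feature] != missing]
    if not observed:
        return instances
    common = Counter(observed).most_common(1)[0][0]
    return [dict(x, **{feature: common}) if x[feature] == missing else x
            for x in instances]
```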
Features with Cost
Issue:
Obtaining a value for a feature can be expensive
e.g., medical diagnosis:
Feature value is result of some diagnostic test
Goal: Build best tree with lowest expected cost
Approach: Modify feature selection
Replace information gain with measure including cost
Tan & Schlimmer (1990):

\frac{Gain^2(S, A)}{Cost(A)}
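A sketch of that selection rule, reusing info_gain; the costs dict mapping feature names to costs is a hypothetical input:

```python
def cost_sensitive_pick(instances, features, costs, label_key="Result"):
    """Pick the feature maximizing Gain^2(S, A) / Cost(A)."""
    return max(features,
               key=lambda f: info_gain(instances, f, label_key) ** 2 / costs[f])
```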
Regression Trees
Leaf nodes provide real-valued predictions
e.g. level of sunburn, rather than binary
Height of pitch accent, rather than +/-
Leaf nodes provide
Value or linear function
E.g. the mean of the training values on that branch
What measure of inhomogeneity?
Variance, standard deviation,…
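For real-valued targets, the analogous split score replaces entropy with variance; a small illustrative sketch:

```python
def variance(values):
    """Inhomogeneity measure for real-valued targets."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def variance_reduction(instances, feature, target_key):
    """Drop in weighted variance after splitting on `feature`
    (the regression analogue of information gain)."""
    total = len(instances)
    after = 0.0
    for value in set(x[feature] for x in instances):
        subset = [x[target_key] for x in instances if x[feature] == value]
        after += (len(subset) / total) * variance(subset)
    return variance([x[target_key] for x in instances]) - after
```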
Decision Trees: Strengths
Simplicity (conceptual)
Feature selection
Handling of diverse features:
Binary, discrete, continuous
Fast decoding
Perspicuousness (Interpretability)
Decision Trees: Weaknesses
Features
Assumed independent
If want group effect, must model explicitly
E.g. make new feature AorB
Feature tests conjunctive
Inefficiency of training: complex, multiple calculations
Lack of formal guarantees:
greedy training, non-optimal trees
Inductive bias: Rectangular decision boundaries
Sparse data problems: splits at each node
Lack of stability/robustness
Decision Trees
Train:
Build tree by forming subsets of least disorder
Predict:
Traverse tree based on feature tests
Assign leaf node sample label
Pros: Robust to irrelevant features, some noise; fast prediction; perspicuous rule reading
Cons: Poor feature combination, dependency; optimal tree build intractable