Decision Trees
Advanced Statistical Methods in NLP
Ling572
January 10, 2012

Information Gain

- InfoGain(S, A): expected reduction in entropy due to A

    InfoGain(S, A) = H(S) - H(S | A)
                   = H(S) - \sum_a p(A = a) H(S | A = a)
                   = H(S) - \sum_{a \in Values(A)} \frac{|S_a|}{|S|} H(S_a)

- Select the A with maximum InfoGain
  - Resulting in the lowest average entropy
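
As a concrete illustration, here is a minimal Python sketch of these quantities (function and variable names are illustrative, not from the original slides; instances are assumed to be dicts mapping feature names to values):

from collections import Counter
from math import log2

def entropy(labels):
    """H(S) for a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, feature):
    """InfoGain(S, A) = H(S) - sum over values a of |S_a|/|S| * H(S_a)."""
    n = len(labels)
    remainder = 0.0
    for a in set(r[feature] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[feature] == a]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder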

Computing Average Entropy

[Figure: a set S of |S| instances is split by a test into Branch 1 and Branch 2, giving subsets S_a1 and S_a2, each containing a mix of the classes (S_a1,a, S_a1,b, S_a2,a, S_a2,b).]

    AvgEntropy = \sum_{a \in Values(A)} \frac{|S_a|}{|S|} \left( \sum_{c \in classes} -\frac{|S_{a,c}|}{|S_a|} \log_2 \frac{|S_{a,c}|}{|S_a|} \right)

- |S_a| / |S|: fraction of samples down branch a
- inner sum: disorder of the class distribution on branch a

Sunburn Example

Name    Hair    Height   Weight   Lotion   Result
Sarah   Blonde  Average  Light    No       Burn
Dana    Blonde  Tall     Average  Yes      None
Alex    Brown   Short    Average  Yes      None
Annie   Blonde  Short    Average  No       Burn
Emily   Red     Average  Heavy    No       Burn
Pete    Brown   Tall     Heavy    No       None
John    Brown   Average  Heavy    No       None
Katie   Blonde  Short    Light    Yes      None

Picking a Test

Hair Color:  Blonde -> Sarah:B, Dana:N, Annie:B, Katie:N | Red -> Emily:B | Brown -> Alex:N, Pete:N, John:N
Height:      Short -> Alex:N, Annie:B, Katie:N | Average -> Sarah:B, Emily:B, John:N | Tall -> Dana:N, Pete:N
Weight:      Light -> Sarah:B, Katie:N | Average -> Dana:N, Alex:N, Annie:B | Heavy -> Emily:B, Pete:N, John:N
Lotion:      No -> Sarah:B, Annie:B, Emily:B, Pete:N, John:N | Yes -> Dana:N, Alex:N, Katie:N

Entropy in Sunburn Example

S = [3B, 5N]

    H(S) = -\left( \frac{3}{8} \log_2 \frac{3}{8} + \frac{5}{8} \log_2 \frac{5}{8} \right) = 0.954

    InfoGain(S, A) = H(S) - \sum_{a \in Values(A)} \frac{|S_a|}{|S|} \left( \sum_{c \in classes} -\frac{|S_{a,c}|}{|S_a|} \log_2 \frac{|S_{a,c}|}{|S_a|} \right)

Hair color: 0.954 - (4/8 (-2/4 log2 2/4 - 2/4 log2 2/4) + 1/8 * 0 + 3/8 * 0) = 0.954 - 0.5 = 0.454
Height:     0.954 - 0.69 = 0.264
Weight:     0.954 - 0.94 = 0.014
Lotion:     0.954 - 0.61 = 0.344
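
A quick, standalone check of these numbers (only H(S) and the hair-color gain are recomputed exactly; the other gains follow the same pattern, with the slide rounding the intermediate remainders):

from math import log2

def H(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)

print(round(H([3, 5]), 3))                 # H(S) = 0.954
# Hair color: Blonde -> [2B, 2N], Red -> [1B], Brown -> [3N]
remainder = 4/8 * H([2, 2]) + 1/8 * H([1]) + 3/8 * H([3])
print(round(H([3, 5]) - remainder, 3))     # InfoGain = 0.454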

Picking a Test (Blonde branch)

After splitting on Hair Color, only the Blonde subset {Sarah, Dana, Annie, Katie} is still inhomogeneous:

Height:  Short -> Annie:B, Katie:N | Average -> Sarah:B | Tall -> Dana:N
Weight:  Light -> Sarah:B, Katie:N | Average -> Dana:N, Annie:B | Heavy -> (none)
Lotion:  No -> Sarah:B, Annie:B | Yes -> Dana:N, Katie:N

Entropy in Sunburn Example

S = [2B, 2N]

    H(S) = -\left( \frac{1}{2} \log_2 \frac{1}{2} + \frac{1}{2} \log_2 \frac{1}{2} \right) = 1

    InfoGain(S, A) = H(S) - \sum_{a \in Values(A)} \frac{|S_a|}{|S|} \left( \sum_{c \in classes} -\frac{|S_{a,c}|}{|S_a|} \log_2 \frac{|S_{a,c}|}{|S_a|} \right)

Height: 1 - (2/4 (-1/2 log2 1/2 - 1/2 log2 1/2) + 1/4 * 0 + 1/4 * 0) = 1 - 0.5 = 0.5
Weight: 1 - (2/4 (-1/2 log2 1/2 - 1/2 log2 1/2) + 2/4 (-1/2 log2 1/2 - 1/2 log2 1/2)) = 1 - 1 = 0
Lotion: 1 - 0 = 1

Lotion gives the highest gain, so it is the next test on the Blonde branch.

Building Decision Trees with Information Gain

- Until there are no inhomogeneous leaves:
  - Select an inhomogeneous leaf node
  - Replace that leaf node by a test node creating the subsets that yield the highest information gain (see the sketch below)
- Effectively creates a set of rectangular regions
  - Repeatedly draws lines along different axes
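
A minimal sketch of this greedy, ID3-style procedure, assuming rows represented as feature-to-value dicts; all names here are illustrative:

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def build_tree(rows, labels, features):
    """Grow until leaves are homogeneous (or features run out); each internal
    node is (feature, {value: subtree}), each leaf is a class label."""
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]      # leaf: majority label
    def remainder(f):                                    # lowest remainder = highest gain
        total = 0.0
        for v in set(r[f] for r in rows):
            sub = [lab for r, lab in zip(rows, labels) if r[f] == v]
            total += len(sub) / len(labels) * entropy(sub)
        return total
    best = min(features, key=remainder)
    children = {}
    for v in set(r[best] for r in rows):
        keep = [(r, lab) for r, lab in zip(rows, labels) if r[best] == v]
        children[v] = build_tree([r for r, _ in keep], [lab for _, lab in keep],
                                 [f for f in features if f != best])
    return (best, children)

# e.g. build_tree(rows, labels, ["Hair", "Height", "Weight", "Lotion"]) on the sunburn table above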

Alternate Measures

- Issue with Information Gain:
  - Favors features with more values
- Option:
  - Gain Ratio

    GainRatio(S, A) = \frac{InfoGain(S, A)}{SplitRatio(S, A)}

    SplitRatio(S, A) = -\sum_{a \in Values(A)} \frac{|S_a|}{|S|} \log_2 \frac{|S_a|}{|S|}

- S_a: elements of S with value A = a
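
A standalone sketch of the gain ratio under the same dict-of-features representation (names illustrative):

from collections import Counter
from math import log2

def _entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)

def gain_ratio(rows, labels, feature):
    """GainRatio(S, A) = InfoGain(S, A) / SplitRatio(S, A)."""
    n = len(labels)
    by_value = {}
    for r, lab in zip(rows, labels):
        by_value.setdefault(r[feature], []).append(lab)
    remainder = sum(len(ls) / n * _entropy(Counter(ls).values()) for ls in by_value.values())
    info_gain = _entropy(Counter(labels).values()) - remainder
    split_ratio = -sum(len(ls) / n * log2(len(ls) / n) for ls in by_value.values())
    return info_gain / split_ratio if split_ratio > 0 else 0.0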

Overfitting

- Overfitting:
  - Model fits the training data TOO well
  - Fits noise, irrelevant details
- Why is this bad?
  - Harms generalization
  - Fits training data too well, fits new data badly
- For a model m: training_error(m), D_error(m), where D = all data
- If m overfits, then for another model m':
  - training_error(m) < training_error(m'), but
  - D_error(m) > D_error(m')

Avoiding Overfitting

- Strategies to avoid overfitting:
  - Early stopping:
    - Stop when InfoGain < threshold
    - Stop when number of instances < threshold
    - Stop when tree depth > threshold
  - Post-pruning:
    - Grow full tree and remove branches
- Which is better?
  - Unclear; both are used
  - For some applications, post-pruning is better

Post-Pruning

- Divide data into:
  - Training set: used to build the original tree
  - Validation set: used to perform pruning
- Build decision tree based on training data
- Until further pruning reduces validation set performance:
  - Compute validation performance for pruning each node (and its children)
  - Greedily remove nodes that do not reduce validation set performance
- Yields a smaller tree with the best validation performance (see the sketch below)
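
One way to realize this is bottom-up reduced-error pruning; a sketch, assuming the (feature, {value: subtree}) trees from the building sketch above (helper names are mine):

from collections import Counter

def predict(tree, row):
    """Follow feature tests down to a leaf label; unseen values fall through as None."""
    while isinstance(tree, tuple):
        feature, children = tree
        tree = children.get(row.get(feature))
    return tree

def prune(tree, train_rows, train_labels, val_rows, val_labels):
    """Replace a subtree by a majority-label leaf whenever that does not hurt
    validation accuracy; children are pruned first."""
    if not isinstance(tree, tuple):
        return tree
    feature, children = tree
    new_children = {}
    for v, child in children.items():
        tr = [(r, l) for r, l in zip(train_rows, train_labels) if r[feature] == v]
        va = [(r, l) for r, l in zip(val_rows, val_labels) if r.get(feature) == v]
        new_children[v] = prune(child, [r for r, _ in tr], [l for _, l in tr],
                                [r for r, _ in va], [l for _, l in va])
    pruned = (feature, new_children)
    if not val_labels:
        return pruned                                  # no validation data reaches this node
    leaf = Counter(train_labels).most_common(1)[0][0]
    def acc(t):
        return sum(predict(t, r) == l for r, l in zip(val_rows, val_labels)) / len(val_labels)
    return leaf if acc(leaf) >= acc(pruned) else pruned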

Performance Measures

- Compute accuracy on:
  - Validation set
  - k-fold cross-validation
- Weighted classification error cost:
  - Weight some types of errors more heavily
- Minimum description length:
  - Favor good accuracy on compact models
  - MDL = error(tree) + model_size(tree)

Rule Post-Pruning

- Convert the tree to rules (see the sketch below)
- Prune rules independently
- Sort final rule set
- Probably the most widely used method (in toolkits)
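
A sketch of the conversion step, again assuming (feature, {value: subtree}) trees; each root-to-leaf path becomes one (conditions, label) rule:

def tree_to_rules(tree, conditions=()):
    """Collect one rule per root-to-leaf path: ([(feature, value), ...], label)."""
    if not isinstance(tree, tuple):                    # leaf reached: emit a rule
        return [(list(conditions), tree)]
    feature, children = tree
    rules = []
    for value, child in children.items():
        rules.extend(tree_to_rules(child, conditions + ((feature, value),)))
    return rules

# e.g. one sunburn rule might come out as ([("Hair", "Blonde"), ("Lotion", "No")], "Burn")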

Modeling Features

- Different types of features need different tests
- Binary: test branches on true/false
- Discrete: branches for each discrete value
- Continuous?
  - Need to discretize
  - Enumerate all values -> not possible or desirable
  - Pick a value x
    - Branches: value < x; value >= x
  - How can we pick split points?

Picking Splits

- Need useful, sufficient split points
- What's a good strategy?
- Approach:
  - Sort all values for the feature in the training data
  - Identify adjacent instances of different classes
    - Candidate split points lie between those instances
  - Select the candidate with the highest information gain (see the sketch below)
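
A sketch of this strategy for a single continuous feature (standalone; names illustrative):

from collections import Counter
from math import log2

def candidate_splits(values, labels):
    """Sort by value and propose a midpoint wherever adjacent instances differ in class."""
    pairs = sorted(zip(values, labels))
    return sorted({(pairs[i][0] + pairs[i + 1][0]) / 2
                   for i in range(len(pairs) - 1)
                   if pairs[i][1] != pairs[i + 1][1]})

def best_split(values, labels):
    """Return the candidate threshold x with the highest information gain for value < x vs >= x."""
    def H(ls):
        n = len(ls)
        return -sum(c / n * log2(c / n) for c in Counter(ls).values())
    n = len(labels)
    def gain(x):
        left = [l for v, l in zip(values, labels) if v < x]
        right = [l for v, l in zip(values, labels) if v >= x]
        return H(labels) - (len(left) / n * H(left) + len(right) / n * H(right))
    return max(candidate_splits(values, labels), key=gain, default=None)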

Advanced Topics

- Missing features:
  - What do you do if an instance lacks a feature value?
- Feature costs:
  - How do you model different costs for features?
- Regression trees:
  - How do you build trees with real-valued predictions?

Missing Features

- Problem:
  - What if you don't know the value for a feature?
  - Not binary presence/absence
- Create a synthetic value (see the sketch below):
  - 'blank': allow a distinguished value 'blank'
  - most common value: assign the most common value of the feature in the training set at that node
  - common value by class: assign the most common value of the feature in the training set at that node for that class
- Or split the instance across branches:
  - Assign probability p_i to each possible value v_i of A
  - Assign a fraction (p_i) of the example to each descendant in the tree
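
A sketch of the "most common value" and "common value by class" options (standalone; None marks a missing value; the fractional-weight option is not shown):

from collections import Counter

def fill_missing(rows, labels, feature, by_class=False):
    """Impute missing values of `feature` with the most common observed value,
    optionally restricted to training instances of the same class."""
    observed = [r[feature] for r in rows if r[feature] is not None]
    overall = Counter(observed).most_common(1)[0][0]   # assumes at least one observed value
    filled = []
    for r, lab in zip(rows, labels):
        if r[feature] is not None:
            filled.append(dict(r))
            continue
        if by_class:
            same = [x[feature] for x, l in zip(rows, labels)
                    if l == lab and x[feature] is not None]
            value = Counter(same).most_common(1)[0][0] if same else overall
        else:
            value = overall
        filled.append({**r, feature: value})
    return filled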

Features with Cost

- Issue:
  - Obtaining a value for a feature can be expensive
  - e.g. medical diagnosis: the feature value is the result of some diagnostic test
- Goal: build the best tree with the lowest expected cost
- Approach: modify feature selection
  - Replace information gain with a measure that includes cost
  - Tan & Schlimmer (1990):

    \frac{Gain^2(S, A)}{Cost(A)}
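
As a tiny sketch, feature selection then scores each candidate by this ratio (the info_gain helper and the cost table are assumptions from earlier sketches, not part of the slides):

def cost_sensitive_score(gain, cost):
    """Tan & Schlimmer-style criterion: Gain^2(S, A) / Cost(A)."""
    return gain ** 2 / cost

# e.g. best = max(features, key=lambda f: cost_sensitive_score(info_gain(rows, labels, f), cost[f]))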

Regression Trees

- Leaf nodes provide real-valued predictions
  - e.g. level of sunburn, rather than binary burn/none
  - height of pitch accent, rather than +/-
- Leaf nodes provide:
  - a value or a linear function
  - e.g. the mean of the training instances reaching that leaf
- What measure of inhomogeneity?
  - Variance, standard deviation, ... (see the sketch below)
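
A sketch of variance as the inhomogeneity measure, and its reduction for a candidate split (standalone; names illustrative; `parent` is assumed to be the union of `left` and `right`):

def variance(values):
    """Spread of the real-valued targets at a node; 0 means perfectly homogeneous."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def variance_reduction(parent, left, right):
    """Regression analogue of information gain: drop in weighted variance after a split."""
    n = len(parent)
    return variance(parent) - (len(left) / n * variance(left) + len(right) / n * variance(right))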

Decision Trees: Strengths

- Simplicity (conceptual)
- Feature selection
- Handling of diverse features: binary, discrete, continuous
- Fast decoding
- Perspicuousness (interpretability)

Decision Trees: Weaknesses

- Features:
  - Assumed independent
  - If a group effect is wanted, it must be modeled explicitly (e.g. make a new feature AorB)
  - Feature tests are conjunctive
- Inefficiency of training: complex, multiple calculations
- Lack of formal guarantees: greedy training, non-optimal trees
- Inductive bias: rectangular decision boundaries
- Sparse data problems: data fragments with the splits at each node
- Lack of stability/robustness

Decision Trees (Summary)

- Train:
  - Build the tree by forming subsets of least disorder
- Predict:
  - Traverse the tree based on feature tests
  - Assign the leaf node's label to the sample
- Pros: robust to irrelevant features and some noise; fast prediction; perspicuous rule reading
- Cons: poor modeling of feature combinations and dependencies; building the optimal tree is intractable