Decision Trees
Advanced Statistical Methods in NLP
Ling572
January 10, 2012
Information Gain

InfoGain(S, A): the expected reduction in entropy due to splitting on feature A.

  InfoGain(S, A) = H(S) - H(S \mid A)
                 = H(S) - \sum_{a} p(A = a) \, H(S \mid A = a)
                 = H(S) - \sum_{a \in Values(A)} \frac{|S_a|}{|S|} H(S_a)

Select the feature A with maximum InfoGain, i.e. the one resulting in the lowest average entropy.

Computing Average Entropy

For |S| instances split across the branches of a test on A:

  AvgEntropy(S, A) = \sum_{a \in Values(A)} \frac{|S_a|}{|S|} \sum_{c \in classes} -\frac{|S_{a,c}|}{|S_a|} \log_2 \frac{|S_{a,c}|}{|S_a|}

The first factor is the fraction of samples sent down branch a; the inner sum is the disorder of the class distribution on that branch.

Sunburn Example

  Name   Hair    Height   Weight   Lotion   Result
  Sarah  Blonde  Average  Light    No       Burn
  Dana   Blonde  Tall     Average  Yes      None
  Alex   Brown   Short    Average  Yes      None
  Annie  Blonde  Short    Average  No       Burn
  Emily  Red     Average  Heavy    No       Burn
  Pete   Brown   Tall     Heavy    No       None
  John   Brown   Average  Heavy    No       None
  Katie  Blonde  Short    Light    Yes      None

Picking a Test

How each candidate feature partitions the eight instances (B = Burn, N = None):

  Hair Color:  Blonde -> Sarah:B, Dana:N, Annie:B, Katie:N | Red -> Emily:B | Brown -> Alex:N, Pete:N, John:N
  Height:      Short -> Alex:N, Annie:B, Katie:N | Average -> Sarah:B, Emily:B, John:N | Tall -> Dana:N, Pete:N
  Weight:      Light -> Sarah:B, Katie:N | Average -> Dana:N, Alex:N, Annie:B | Heavy -> Emily:B, Pete:N, John:N
  Lotion:      No -> Sarah:B, Annie:B, Emily:B, Pete:N, John:N | Yes -> Dana:N, Alex:N, Katie:N

Entropy in Sunburn Example

  InfoGain(S, A) = H(S) - \sum_{a \in Values(A)} \frac{|S_a|}{|S|} \sum_{c \in classes} -\frac{|S_{a,c}|}{|S_a|} \log_2 \frac{|S_{a,c}|}{|S_a|}

S = [3B, 5N], so

  H(S) = -(3/8 log2 3/8 + 5/8 log2 5/8) = 0.954

  Hair color: 0.954 - (4/8 (-2/4 log2 2/4 - 2/4 log2 2/4) + 1/8 * 0 + 3/8 * 0) = 0.954 - 0.5 = 0.454
  Height:     0.954 - 0.69 = 0.264
  Weight:     0.954 - 0.94 = 0.014
  Lotion:     0.954 - 0.61 = 0.344

Hair color gives the highest gain, so it becomes the root test. (A code sketch reproducing these numbers follows this section.)

Picking a Test (Blonde branch)

After splitting on Hair Color, only the Blonde branch (Sarah, Dana, Annie, Katie) is still mixed:

  Height:  Short -> Annie:B, Katie:N | Average -> Sarah:B | Tall -> Dana:N
  Weight:  Light -> Sarah:B, Katie:N | Average -> Dana:N, Annie:B
  Lotion:  No -> Sarah:B, Annie:B | Yes -> Dana:N, Katie:N

Entropy in Sunburn Example (Blonde branch)

S = [2B, 2N], so

  H(S) = -(1/2 log2 1/2 + 1/2 log2 1/2) = 1

  Height: 1 - (2/4 (-1/2 log2 1/2 - 1/2 log2 1/2) + 1/4 * 0 + 1/4 * 0) = 1 - 0.5 = 0.5
  Weight: 1 - (2/4 (-1/2 log2 1/2 - 1/2 log2 1/2) + 2/4 (-1/2 log2 1/2 - 1/2 log2 1/2)) = 1 - 1 = 0
  Lotion: 1 - 0 = 1

Lotion gives the highest gain on this branch, so it is chosen as the next test.

Building Decision Trees with Information Gain

Until there are no inhomogeneous leaves:
- Select an inhomogeneous leaf node.
- Replace that leaf node by a test node whose subsets yield the highest information gain.

This effectively creates a set of rectangular regions, repeatedly drawing decision boundaries along different axes.
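For concreteness, here is a small Python sketch (not part of the original slides) that reproduces the InfoGain numbers above; the helper names (DATA, entropy, info_gain) are ours.

```python
from collections import Counter
from math import log2

# Sunburn training data: (hair, height, weight, lotion, result); names in comments.
DATA = [
    ("Blonde", "Average", "Light",   "No",  "Burn"),   # Sarah
    ("Blonde", "Tall",    "Average", "Yes", "None"),   # Dana
    ("Brown",  "Short",   "Average", "Yes", "None"),   # Alex
    ("Blonde", "Short",   "Average", "No",  "Burn"),   # Annie
    ("Red",    "Average", "Heavy",   "No",  "Burn"),   # Emily
    ("Brown",  "Tall",    "Heavy",   "No",  "None"),   # Pete
    ("Brown",  "Average", "Heavy",   "No",  "None"),   # John
    ("Blonde", "Short",   "Light",   "Yes", "None"),   # Katie
]
FEATURES = {"hair": 0, "height": 1, "weight": 2, "lotion": 3}

def entropy(labels):
    """H(S) for a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def info_gain(rows, feature_index):
    """InfoGain(S, A) = H(S) - sum over a of |S_a|/|S| * H(S_a)."""
    by_value = {}
    for row in rows:
        by_value.setdefault(row[feature_index], []).append(row[-1])
    remainder = sum(len(sub) / len(rows) * entropy(sub) for sub in by_value.values())
    return entropy([row[-1] for row in rows]) - remainder

for name, idx in FEATURES.items():
    print(f"{name:7s} InfoGain = {info_gain(DATA, idx):.3f}")
# Matches the slides up to rounding: hair 0.454, height 0.265, weight 0.016, lotion 0.348
```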
Alternate Measures

Issue with information gain: it favors features with more values.

Option: Gain Ratio

  GainRatio(S, A) = \frac{InfoGain(S, A)}{SplitRatio(S, A)}

  SplitRatio(S, A) = -\sum_{a \in Values(A)} \frac{|S_a|}{|S|} \log_2 \frac{|S_a|}{|S|}

where S_a is the set of elements of S with value A = a.

Overfitting

Overfitting: the model fits the training data TOO well, fitting noise and irrelevant details.

Why is this bad? It harms generalization: the model fits the training data too well and fits new data badly.

For a model m, compare training_error(m) with D_error(m), where D is all data. If m overfits, then for some other model m', training_error(m) < training_error(m') but D_error(m) > D_error(m').

Avoiding Overfitting

Strategies to avoid overfitting:

Early stopping:
- Stop when InfoGain < threshold
- Stop when the number of instances < threshold
- Stop when tree depth > threshold

Post-pruning:
- Grow the full tree, then remove branches

Which is better? Unclear; both are used. For some applications, post-pruning is better.

Post-Pruning

Divide the data into:
- Training set: used to build the original tree
- Validation set: used to perform pruning

Build the decision tree on the training data. Then, until pruning no longer helps validation-set performance:
- Compute validation-set performance for pruning each node (and its children)
- Greedily remove nodes whose removal does not reduce validation-set performance

This yields a smaller tree with the best validation performance. (A sketch of this loop appears below.)
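As a rough illustration of the greedy pruning loop just described, the following sketch assumes a simple Node class and an externally supplied accuracy(tree, data) function; both names are hypothetical, not the course's implementation.

```python
# Minimal sketch of greedy reduced-error post-pruning over a hypothetical tree structure.
class Node:
    def __init__(self, feature=None, children=None, label=None):
        self.feature = feature          # test feature at an internal node
        self.children = children or {}  # feature value -> child Node
        self.label = label              # majority class, used if the node becomes a leaf

def internal_nodes(node):
    """Yield every internal (non-leaf) node in the tree."""
    if node.children:
        yield node
        for child in node.children.values():
            yield from internal_nodes(child)

def reduced_error_prune(tree, validation_data, accuracy):
    """Greedily turn internal nodes into leaves while validation accuracy does not drop."""
    best = accuracy(tree, validation_data)
    improved = True
    while improved:
        improved = False
        for node in list(internal_nodes(tree)):
            saved_children, node.children = node.children, {}   # tentatively prune
            score = accuracy(tree, validation_data)
            if score >= best:            # keep the prune if performance does not drop
                best = score
                improved = True
            else:
                node.children = saved_children                   # undo the prune
    return tree
```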
Performance Measures

Compute accuracy on:
- A validation set
- k-fold cross-validation

Weighted classification error cost: weight some types of errors more heavily.

Minimum description length: favor good accuracy on compact models.

  MDL = error(tree) + model_size(tree)

Rule Post-Pruning

- Convert the tree to rules
- Prune the rules independently
- Sort the final rule set

Probably the most widely used method (in toolkits).

Modeling Features

Different types of features need different tests:
- Binary: test branches on true/false
- Discrete: one branch for each discrete value
- Continuous? Need to discretize:
  - Enumerating all values is not possible or desirable
  - Instead, pick a value x and branch on value < x vs. value >= x

How can we pick split points?

Picking Splits

We need useful, sufficient split points. What's a good strategy?

Approach:
- Sort all values of the feature in the training data
- Identify adjacent instances of different classes; candidate split points lie between those instances
- Select the candidate with the highest information gain

(A sketch of this procedure appears after this list.)
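The split-point procedure can be sketched as follows; the function names and the midpoint convention for candidate thresholds are our assumptions.

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def best_threshold(values, labels):
    """Pick a threshold for a continuous feature by information gain.
    values: continuous feature values; labels: parallel class labels."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_gain, best_x = 0.0, None
    for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
        if y1 == y2 or v1 == v2:
            continue                      # candidates lie only between differing classes
        x = (v1 + v2) / 2                 # midpoint candidate split point
        left = [y for v, y in pairs if v < x]
        right = [y for v, y in pairs if v >= x]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = base - remainder
        if gain > best_gain:
            best_gain, best_x = gain, x
    return best_x, best_gain

# e.g. best_threshold([48, 60, 72, 80, 90], ["no", "no", "yes", "yes", "no"])
```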
Advanced Topics

- Missing features: what do you do if an instance lacks a feature value?
- Feature costs: how do you model different costs for features?
- Regression trees: how do you build trees with real-valued predictions?

Missing Features

Problem: what if you don't know the value of a feature for an instance? This is not the same as binary presence/absence.

Create a synthetic value:
- 'blank': allow a distinguished value 'blank'
- Most common value: assign the most common value of the feature in the training set at that node
- Most common value by class: assign the most common value of the feature in the training set at that node for that class
- Fractional instances: assign a probability p_i to each possible value v_i of A, and pass a fraction p_i of the example to each descendant in the tree

Features with Cost

Issue: obtaining a value for a feature can be expensive. E.g. in medical diagnosis, a feature value may be the result of a diagnostic test.

Goal: build the best tree with the lowest expected cost.
Approach: modify feature selection by replacing information gain with a measure that includes cost, e.g. Tan & Schlimmer (1990):

  \frac{Gain^2(S, A)}{Cost(A)}

Regression Trees

Leaf nodes provide real-valued predictions, e.g. the level of sunburn rather than a binary label, or the height of a pitch accent rather than +/-.

Leaf nodes provide a value or a linear function, e.g. the mean of the training instances on that branch.

What measure of inhomogeneity? Variance, standard deviation, ... (a variance-based sketch appears after the Strengths list below).

Decision Trees: Strengths

- Conceptual simplicity
- Feature selection
- Handling of diverse features: binary, discrete, continuous
- Fast decoding
- Perspicuousness (interpretability)
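For the inhomogeneity measure mentioned under Regression Trees, a variance-reduction split criterion might look like the sketch below; the helper names and the example severity values are hypothetical.

```python
# Minimal sketch of a regression-tree split criterion: node impurity is the
# variance of its targets; the chosen split most reduces the weighted child variance.
def variance(ys):
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)

def variance_reduction(parent_ys, child_splits):
    """child_splits: list of target lists, one per branch of a candidate test."""
    weighted = sum(len(c) / len(parent_ys) * variance(c) for c in child_splits if c)
    return variance(parent_ys) - weighted

# e.g. splitting hypothetical sunburn severities by Lotion = No / Yes:
# variance_reduction([0.9, 0.8, 0.0, 0.7], [[0.9, 0.8, 0.7], [0.0]])
```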
Decision Trees: Weaknesses

- Features are assumed independent; if you want a group effect, you must model it explicitly (e.g. make a new feature AorB)
- Feature tests are conjunctive
- Inefficiency of training: complex, with multiple calculations
- Lack of formal guarantees: greedy training yields non-optimal trees
- Inductive bias: rectangular decision boundaries
- Sparse data problems: the data splits at each node
- Lack of stability/robustness

Decision Trees

Train: build the tree by forming subsets of least disorder.
Predict: traverse the tree based on feature tests and assign the leaf node's sample label (see the sketch below).

Pros: robust to irrelevant features and some noise; fast prediction; perspicuous rule reading.
Cons: poor handling of feature combinations and dependencies; building the optimal tree is intractable.
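As a final illustration of decoding, this sketch traverses the tree induced for the sunburn example (Hair Color at the root, Lotion on the Blonde branch); the nested-dict tree representation is an assumption for illustration.

```python
# Internal node = {"feature": ..., "branches": {value: subtree}}; leaf = class label.
SUNBURN_TREE = {
    "feature": "hair",
    "branches": {
        "Blonde": {"feature": "lotion",
                   "branches": {"No": "Burn", "Yes": "None"}},
        "Red": "Burn",
        "Brown": "None",
    },
}

def classify(tree, instance):
    """Traverse the tree: follow the branch matching each tested feature value."""
    while isinstance(tree, dict):
        tree = tree["branches"][instance[tree["feature"]]]
    return tree

print(classify(SUNBURN_TREE, {"hair": "Blonde", "lotion": "No"}))   # -> Burn
print(classify(SUNBURN_TREE, {"hair": "Brown", "lotion": "No"}))    # -> None
```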