Data Mining and Knowledge Discovery in Business Databases
Decision Trees and their Pruning
CART (Classification and Regression Trees):
Key Parts of Tree-Structured Data Analysis
Tree growing
Splitting rules to generate tree
Stopping criteria: how far to grow?
Missing values: using surrogates (substitutes)
Tree pruning
Trimming off parts of the tree that don’t work
Ordering the nodes of a large tree by contribution to tree accuracy
… which nodes come off first?
Optimal tree selection
Deciding on the best tree after growing and pruning
Balancing simplicity against accuracy
Maximal Tree Example
Stopping criteria for growing the tree
All instances in the node belong to the same class
The maximum tree depth has been reached
Size of the data in the node is below a threshold
(e.g. 5% of the original dataset)
The best splitting criterion value is below a threshold
…
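As an illustration, the rules above might be coded roughly as follows; the function name, node representation, and threshold values are assumptions made for this sketch, not part of any particular tool:

```python
import numpy as np

def should_stop(node_labels, depth, best_gain,
                max_depth=10, min_fraction=0.05, n_total=1000, min_gain=1e-3):
    """Hypothetical check of the stopping criteria listed above."""
    # All instances in the node belong to the same class.
    if len(np.unique(node_labels)) == 1:
        return True
    # The maximum tree depth has been reached.
    if depth >= max_depth:
        return True
    # The data in the node is below a size threshold (e.g. 5% of the original dataset).
    if len(node_labels) < min_fraction * n_total:
        return True
    # The best splitting criterion improvement is below a threshold.
    if best_gain < min_gain:
        return True
    return False
```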
How to Address Overfitting
Pre-Pruning (Early Stopping Rule)
Stop the algorithm before it becomes a fully-grown tree
Typical stopping conditions for a node:
Stop if all instances belong to the same class
Stop if all the attribute values are the same
More restrictive conditions:
Stop if number of instances is less than some user-specified threshold
Stop if the class distribution of instances is independent of the available features (e.g., using the χ² test)
Stop if expanding the current node does not improve impurity
measures (e.g., Gini or information gain).
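For comparison, the same kind of early stopping can be expressed through hyperparameters in scikit-learn; the dataset and the threshold values below are illustrative assumptions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Each hyperparameter is one pre-pruning (early stopping) rule.
pre_pruned = DecisionTreeClassifier(
    max_depth=5,                 # stop when the maximum depth is reached
    min_samples_leaf=20,         # stop if a leaf would contain too few instances
    min_impurity_decrease=0.01,  # stop if the split barely improves impurity (Gini)
    random_state=0,
).fit(X, y)

print("terminal nodes:", pre_pruned.get_n_leaves())
```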
How to Address Overfitting…
Post-pruning
Grow the decision tree to its entirety
Trim the nodes of the decision tree in a bottom-up fashion
If the generalization error improves after trimming, replace the sub-tree by a leaf node
The class label of the leaf node is determined from the majority class of instances in the sub-tree
Can use MDL (Minimum Description Length) for post-pruning
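A simplified sketch of the bottom-up idea, using error on a held-out validation set as a stand-in for generalization error (the Node class, the helper names, and the tie-breaking rule are assumptions of this sketch, not the MDL-based procedure):

```python
from collections import Counter

class Node:
    """Toy binary tree node; labels holds the training labels that reached it."""
    def __init__(self, feature=None, threshold=None, left=None, right=None, labels=()):
        self.feature, self.threshold = feature, threshold
        self.left, self.right = left, right
        # Majority class of instances in the sub-tree rooted here.
        self.leaf_label = Counter(labels).most_common(1)[0][0] if labels else None

    def is_leaf(self):
        return self.left is None and self.right is None

def predict(node, x):
    while not node.is_leaf():
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.leaf_label

def prune(node, X_val, y_val):
    """Trim bottom-up: collapse a sub-tree to a leaf when that does not hurt validation error."""
    if node.is_leaf() or len(X_val) == 0:
        return node
    # Route the validation cases that reached this node down each branch.
    go_left = [x[node.feature] <= node.threshold for x in X_val]
    left = [(x, y) for x, y, g in zip(X_val, y_val, go_left) if g]
    right = [(x, y) for x, y, g in zip(X_val, y_val, go_left) if not g]
    node.left = prune(node.left, [x for x, _ in left], [y for _, y in left])
    node.right = prune(node.right, [x for x, _ in right], [y for _, y in right])
    # Error of the current sub-tree vs. error of a single majority-class leaf.
    subtree_err = sum(predict(node, x) != y for x, y in zip(X_val, y_val))
    leaf_err = sum(node.leaf_label != y for y in y_val)
    if leaf_err <= subtree_err:          # trimming does not hurt: replace sub-tree by a leaf
        node.left = node.right = None
    return node
```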
CART Pruning Method:
Grow Full Tree, Then Prune
You will never know when to stop . . . so don’t!
Instead . . . grow trees that are obviously too big
Largest tree grown is called “maximal” tree
Maximal tree could have hundreds or thousands of nodes
usually we instruct CART to grow trees only moderately too big
rule of thumb: grow trees about twice the size of the truly best tree
This becomes first stage in finding the best tree
Next we will have to get rid of the parts of the overgrown tree that don’t work (not supported by test data)
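A rough sketch of the “grow it too big first” step using scikit-learn; the dataset and settings are assumptions for illustration only:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# No depth limit and single-case leaves allowed: the tree grows "obviously too big".
maximal_tree = DecisionTreeClassifier(min_samples_leaf=1, random_state=0)
maximal_tree.fit(X_train, y_train)
print("terminal nodes in the maximal tree:", maximal_tree.get_n_leaves())
```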
Tree Pruning
Take a very large tree (“maximal” tree)
Tree may be radically over-fit
Tracks all the idiosyncrasies of THIS data set
Tracks patterns that may not be found in other data sets
At bottom of tree splits based on very few cases
Analogous to a regression with very large number of variables
PRUNE away branches from this large tree
But which branch to cut first?
CART determines a pruning sequence:
the exact order in which each node should be removed
pruning sequence determined for EVERY node
sequence determined all the way back to root node
Pruning: Which nodes come off next?
Order of Pruning: Weakest Link Goes First
Prune away the “weakest link”: the nodes that add least to the overall accuracy of the tree
a node’s contribution to the overall tree is a function of both its increase in accuracy and its size
the accuracy gain is weighted by the node’s share of the sample
small nodes tend to get removed before large ones!
If several nodes have the same contribution, they are all pruned away simultaneously
hence more than two terminal nodes could be cut off in one pruning step
The sequence is determined all the way back to the root node
need to allow for the possibility that the entire tree is bad
if the target variable is unpredictable, we will want to prune back to the root: the “no model” solution
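scikit-learn exposes one concrete realization of this weakest-link ordering through minimal cost-complexity pruning; the sketch below (dataset and settings are illustrative assumptions) lists the nested sub-trees from the full tree all the way back to the root-only “no model” solution:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# One effective alpha per weakest link: each value removes the node(s) whose
# accuracy contribution per terminal node removed is currently smallest.
path = full_tree.cost_complexity_pruning_path(X, y)
for alpha in path.ccp_alphas:
    subtree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)
    print(f"alpha={alpha:.4f}  terminal nodes={subtree.get_n_leaves()}")
```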
Pruning Sequence Example
[Figure: successive pruned trees with 24, 21, 20, and 18 terminal nodes]
Now we test every tree in the pruning sequence
Take a test data set and drop it down the largest tree in
the sequence and measure its predictive accuracy
how many cases right and how many wrong
measure accuracy overall and by class
Do the same for the 2nd largest tree, 3rd largest tree, etc.
Performance of every tree in sequence is measured
Results reported in table and graph formats
Note that this critical stage is impossible to complete
without test data
CART procedure requires test data to guide tree
evaluation
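A sketch of this evaluation step, dropping a held-out test set down every tree in the pruning sequence (dataset, split, and settings are assumptions; the sequence here comes from scikit-learn’s cost-complexity path rather than the CART software itself):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
alphas = full_tree.cost_complexity_pruning_path(X_train, y_train).ccp_alphas

# Measure every tree in the sequence on both the learn and the test data.
for alpha in alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    print(f"{tree.get_n_leaves():3d} terminal nodes  "
          f"learn error={1 - tree.score(X_train, y_train):.3f}  "
          f"test error={1 - tree.score(X_test, y_test):.3f}")
```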
Training Data vs. Test Data Error Rates
Compare error rates measured on the learn data and on a large test set
Learn R(T) always decreases as the tree grows (Q: Why?)
Test R(T) first declines, then increases (Q: Why?)
Overfitting is the result of too much reliance on the learn-data R(T)
This can lead to disasters when the tree is applied to new data
No. Terminal Nodes    R(T)    Rts(T)
        71            .00      .42
        63            .00      .40
        58            .03      .39
        40            .10      .32
        34            .12      .32
        19            .20      .31
      **10            .29      .30
         9            .32      .34
         7            .41      .47
         6            .46      .54
         5            .53      .61
         2            .75      .82
         1            .86      .91
(R(T) = learn-data error rate, Rts(T) = test-data error rate; ** marks the tree with the lowest test error)
Why look at training data error rates (or cost) at all?
First, it provides a rough guide of how you are doing
The truth will typically be WORSE than the training data measure
If the tree is performing poorly on the training data, you may not want to pursue it further
Training data error rate more accurate for smaller trees
So reasonable guide for smaller trees
Poor guide for larger trees
At the optimal tree, training and test error rates should be similar
if not, something is wrong
useful to compare not just the overall error rate but also within-node performance between training and test data
CART: Optimal Tree
Within a single CART run, which tree is best?
The process of pruning the maximal tree can yield many sub-trees
A test data set or cross-validation measures the error rate of each tree
The Best Pruned Subtree: An Estimation Problem
[Plot: estimated error rate R̂(Tk) versus tree size |T̃k|, from 0 to 50 terminal nodes]
Current wisdom: select the tree with the smallest error rate
Only drawback: the minimum may not be precisely estimated
The typical error rate as a function of tree size has a flat region
The minimum could be anywhere in this region
In what sense is the optimal tree “best”?
Optimal tree has lowest or near lowest cost as
determined by a test procedure
Tree should exhibit very similar accuracy when applied
to new data
BUT Best Tree is NOT necessarily the one that happens to be most
accurate on a single test database
trees somewhat larger or smaller than “optimal” may be preferred
Room for user judgment
judgment not about split variable or values
judgment as to how much of tree to keep
determined by story tree is telling
willingness to sacrifice a small amount of accuracy for simplicity
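One way such a judgment might be expressed in code, keeping the smallest tree whose test accuracy is within a small tolerance of the best (the 1% tolerance, the dataset, and the use of scikit-learn’s pruning path are all assumptions of this sketch, not a CART rule):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
candidates = []
for alpha in full_tree.cost_complexity_pruning_path(X_train, y_train).ccp_alphas:
    t = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    candidates.append((t.get_n_leaves(), t.score(X_test, y_test)))

best_acc = max(acc for _, acc in candidates)
# Sacrifice up to one percentage point of test accuracy for a simpler tree.
leaves, acc = min((l, a) for l, a in candidates if a >= best_acc - 0.01)
print(f"chosen tree: {leaves} terminal nodes, test accuracy {acc:.3f}")
```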
Decision Tree Summary
Decision Trees
splits – binary, multi-way
split criteria – entropy, Gini, …
missing value treatment
pruning
rule extraction from trees (see the sketch below)
Both C4.5 and CART are robust tools
No method is always superior –
experiment!
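As a small illustration of the rule-extraction bullet above, scikit-learn can print a fitted tree as nested if/then rules; the dataset and settings are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# export_text renders each path from the root to a leaf as a readable rule.
print(export_text(tree, feature_names=list(iris.feature_names)))
```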
(Witten & Eibe)