Overfitting in Tree Induction
Chapter 5
Overfitting and Its Avoidance
Advisor: Prof. 徐立群
Students: R16014101 陳怡齊
R16011234 吳年鑫
Overfitting
Overfitting (過度擬合) means fitting a model with too many parameters relative to the total amount of data available: even an absurd model, if complex enough, can fit the data perfectly.
Such a model fails to generalize (generalization).
It violates the principle of Occam's Razor.
Overfitting & Generalization
An extreme example: predicting customer churn or non-churn with a "table model" that simply memorizes the training data.
Such a model is perfectly accurate on the training data but no better than the base rate on holdout data.
Overfitting Examined
• Holdout Data and Fitting Graphs: a fitting graph shows the accuracy of a model as a function of complexity.
Figure 1. A typical fitting graph.
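A sketch of how such a fitting graph could be produced (sklearn and the synthetic dataset are my assumptions, not from the slides): sweep a complexity parameter and record training versus holdout accuracy at each setting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, test_size=0.3, random_state=0)

# Complexity axis: maximum number of leaf nodes allowed in the tree.
for leaves in [2, 4, 8, 16, 32, 64, 128, 256]:
    tree = DecisionTreeClassifier(max_leaf_nodes=leaves, random_state=0)
    tree.fit(X_tr, y_tr)
    print(f"{leaves:4d} leaves | train {tree.score(X_tr, y_tr):.3f} "
          f"| holdout {tree.score(X_ho, y_ho):.3f}")
# Training accuracy rises with complexity; holdout accuracy peaks, then falls.
```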
Overfitting Examined
Base rate: what would b be? On holdout data the table model drops to the base rate b, the accuracy achievable by always predicting the most common class.
Figure 2. A fitting graph for the customer churn (table) model.
Overfitting in Tree Induction
In decision tree induction, the tree grows ever more complex as it fits the training data; overfitting starts to occur once complexity passes the "sweet spot" in the graph, where holdout accuracy peaks.
Figure 3. A typical fitting graph for tree induction.
Overfitting in Mathematical Functions
As we add more attributes x_i, the function becomes more and more complicated, e.g. f(x) = w_0 + w_1·x_1 + w_2·x_2 + ⋯. Each x_i has a corresponding w_i, which is a learned parameter of the model.
In two dimensions you can fit a line to any two points, and in three dimensions you can fit a plane to any three points.
This concept generalizes: as you increase the dimensionality, you can perfectly fit larger and larger sets of arbitrary points.
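A small numpy check of that dimensionality claim (my own illustration): in d dimensions a linear function with an intercept has d+1 parameters, so it can pass exactly through d+1 arbitrary points in general position.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5                                  # dimensionality
X = rng.normal(size=(d + 1, d))        # d+1 arbitrary points in d dimensions
y = rng.normal(size=d + 1)             # arbitrary targets

# Solve [1 | X] w = y exactly: d+1 equations, d+1 unknowns (w0 plus w1..wd).
A = np.hstack([np.ones((d + 1, 1)), X])
w = np.linalg.solve(A, y)

print(np.allclose(A @ w, y))           # True: the hyperplane hits all d+1 points
```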
Example: Overfitting Linear Functions
Data: sepal width, petal width
Types: Iris Setosa, Iris Versicolor
Two different separation lines:
a. Logistic regression
b. Support vector machine
Figure 4
Figures 5, 6, and 7 show further separation lines for the same example.
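A sketch reproducing the setup behind these figures (the feature indices and two-class subset are my reading of the slides): fit a logistic regression and a linear SVM on two iris features for Setosa versus Versicolor, and inspect the two separating lines.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

iris = load_iris()
# Setosa (class 0) vs. Versicolor (class 1), using sepal width and petal width.
mask = iris.target < 2
X = iris.data[mask][:, [1, 3]]
y = iris.target[mask]

logreg = LogisticRegression().fit(X, y)
svm = SVC(kernel="linear").fit(X, y)

# Each model learns a separating line w1*x1 + w2*x2 + b = 0.
print("logistic regression:", logreg.coef_[0], logreg.intercept_)
print("linear SVM:         ", svm.coef_[0], svm.intercept_)
```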
From Holdout Evaluation to Cross-Validation
Holdout evaluation splits the data into only one training set and one holdout set.
Cross-validation computes its estimates over all the data by performing multiple splits and systematically swapping out samples for testing (k folds; typically k is 5 or 10).
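A minimal illustration of those k-fold mechanics (sklearn's KFold; the toy data are placeholders): each instance lands in exactly one test fold, and each split trains on the rest.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)       # 10 toy instances
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Every split systematically swaps out which samples are held out for testing.
for i, (train_idx, test_idx) in enumerate(kf.split(X)):
    print(f"fold {i}: train on {train_idx}, test on {test_idx}")
```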
The Churn Dataset Revisited
See "Example: Addressing the Churn Problem with Tree Induction" in Chapter 3.
The logistic regression models show slightly lower average accuracy (64.1%) and higher variation (standard deviation of 1.3).
The average accuracy of the folds with classification trees is 68.6%, significantly lower than our previous measurement of 73% (the standard deviation of the fold accuracies is 1.1).
Classification trees may be preferable to logistic regression because of their greater stability and performance.
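The churn dataset itself is not bundled here, so this sketch uses a synthetic stand-in just to show how such a comparison is computed: the mean and standard deviation of the fold accuracies for each model family.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=15, random_state=0)

for name, model in [("classification tree", DecisionTreeClassifier(random_state=0)),
                    ("logistic regression", LogisticRegression(max_iter=1000))]:
    scores = cross_val_score(model, X, y, cv=10)   # 10-fold cross-validation
    print(f"{name}: mean {scores.mean():.3f}, std {scores.std():.3f}")
```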
Learning Curves
The generalization performance of data-driven
modeling generally improves as more training data
become available.
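A sketch of how a learning curve can be measured (sklearn's learning_curve on synthetic data is my assumption): holdout performance is estimated at increasing training-set sizes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=3000, n_features=20, random_state=0)

sizes, _, test_scores = learning_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

# Generalization accuracy as a function of the amount of training data.
for n, s in zip(sizes, test_scores.mean(axis=1)):
    print(f"{n:5d} training instances -> holdout accuracy {s:.3f}")
```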
Overfitting Avoidance & Complexity Control
Concept in Tree Induction:
Tree induction commonly uses two techniques to avoid overfitting. These strategies are:
(i) to stop growing the tree before it gets too complex, and
(ii) to grow the tree until it is too large, then "prune" it back, reducing its size (and thereby its complexity).
Methods in Tree Induction:
One way to limit tree size is to specify a minimum number of instances that must be present in a leaf.
Another is a hypothesis test (p-value): test whether the improvement from a split could be due to chance, and split only when it is statistically significant. Both techniques are sketched below.
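A sketch of both strategies in sklearn terms (the parameter values are arbitrary): a minimum-instances-per-leaf constraint stops growth early, while cost-complexity pruning grows the tree fully and then prunes it back.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# (i) Stop early: require at least 20 training instances in every leaf.
stopped = DecisionTreeClassifier(min_samples_leaf=20, random_state=0).fit(X, y)

# (ii) Grow fully, then prune back via cost-complexity pruning.
pruned = DecisionTreeClassifier(ccp_alpha=0.005, random_state=0).fit(X, y)

full = DecisionTreeClassifier(random_state=0).fit(X, y)
for name, t in [("full", full), ("stopped", stopped), ("pruned", pruned)]:
    print(f"{name}: {t.get_n_leaves()} leaves")
```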
Overfitting Avoidance & Complexity Control
General Method for Avoiding Overfitting:
Compare the best model we can build from one family (say, classification trees) against the best model from another family (say, logistic regression).
Nested holdout testing: split the training set further into a training subset and a validation set, keeping the test set as a final holdout. Suppose the best validation accuracy is reached at a complexity of 122 nodes (the sweet spot). Then induce a new tree with 122 nodes from the whole, original training data and evaluate it once on the final holdout.
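A sketch of that nested holdout procedure (the 122-node figure is the book's example; here the sweet spot is found empirically on synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=3000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
# Nested split: carve a validation set out of the training data only.
X_sub, X_val, y_sub, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

# Find the sweet spot on the validation set.
best_nodes, best_acc = None, 0.0
for nodes in [2, 4, 8, 16, 32, 64, 128, 256]:
    t = DecisionTreeClassifier(max_leaf_nodes=nodes, random_state=0).fit(X_sub, y_sub)
    acc = t.score(X_val, y_val)
    if acc > best_acc:
        best_nodes, best_acc = nodes, acc

# Induce a new tree of that complexity from the whole training data,
# then evaluate once on the final holdout.
final = DecisionTreeClassifier(max_leaf_nodes=best_nodes, random_state=0).fit(X_train, y_train)
print(f"sweet spot: {best_nodes} nodes, test accuracy {final.score(X_test, y_test):.3f}")
```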
Overfitting Avoidance & Complexity Control
Nested Cross-Validation: the same idea applied within cross-validation. For each fold of the original data, an inner cross-validation on that fold's training set chooses the complexity, and the resulting model is evaluated on the fold's test set.
Sequential Forward Selection (SFS) uses the same nested idea for feature selection: starting from the empty set, repeatedly add the feature whose inclusion most improves validation performance.
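A sketch of nested cross-validation (GridSearchCV inside cross_val_score; the parameter grid is illustrative): the inner loop selects the complexity, and the outer loop estimates the performance of the whole selection procedure.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1500, n_features=20, random_state=0)

# Inner loop: choose the tree's complexity on each outer fold's training data.
inner = GridSearchCV(DecisionTreeClassifier(random_state=0),
                     {"max_leaf_nodes": [8, 32, 128]}, cv=5)

# Outer loop: estimate generalization performance of the selection procedure.
scores = cross_val_score(inner, X, y, cv=5)
print(f"nested CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```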