Contributions to Decision Tree Induction

Overfitting, Bias/Variance tradeoff, and Ensemble methods
Pierre Geurts
Stochastic methods (Prof. L. Wehenkel)
University of Liège
Content of the presentation
• Bias and variance definitions
• Parameters that influence bias and variance
• Decision/regression tree variance
• Bias and variance reduction techniques
Content of the presentation
• Bias and variance definitions:
– A simple regression problem with no input
– Generalization to full regression problems
– A short discussion about classification
• Parameters that influence bias and variance
• Decision/regression tree variance
• Bias and variance reduction techniques
Regression problem - no input
• Goal: predict as well as possible the height of a
Belgian male adult
• More precisely:
– Choose an error measure, for example the square error.
– Find an estimation ŷ such that the expectation Ey{(y-ŷ)²}
over the whole population of Belgian male adults is
minimized.
Regression problem - no input
• The estimation that minimizes the error can be computed by
taking the derivative and setting it to zero:
∂/∂ŷ Ey{(y-ŷ)²} = -2(Ey{y}-ŷ) = 0  ⟹  ŷ = Ey{y}
• So, the estimation which minimizes the error is Ey{y}. In
ML, it is called the Bayes model.
• But in practice, we cannot compute the exact value of
Ey{y} (this would imply measuring the height of every
Belgian male adult).
Learning algorithm
• As p(y) is unknown, find an estimation ŷ from a
sample of individuals, LS={y1,y2,…,yN}, drawn
from the Belgian male adult population.
• Examples of learning algorithms:
– the sample mean, ŷ1 = (1/N)·Σi yi,
– an estimate shrunk towards 180, controlled by a
parameter λ (if we know that the height is close to 180)
Good learning algorithm
• As the LS is randomly drawn, the prediction ŷ will also be a
random variable, with some distribution pLS(ŷ) over learning samples
• A good learning algorithm should not be good only on one
learning sample but on average over all learning samples
(of size N) ⇒ we want to minimize:
E = ELS{Ey{(y-ŷ)²}}
• Let us analyse this error in more detail
Bias/variance decomposition (1)
E = ELS{Ey{(y-ŷ)²}} = ELS{Ey{((y-Ey{y}) + (Ey{y}-ŷ))²}}
  = Ey{(y-Ey{y})²} + ELS{(Ey{y}-ŷ)²}
(the cross term vanishes because Ey{y-Ey{y}} = 0)
Bias/variance decomposition (2)
[Figure: distribution of y, centred on Ey{y} with spread vary{y}]
E = Ey{(y-Ey{y})²} + ELS{(Ey{y}-ŷ)²}
The first term, Ey{(y-Ey{y})²}, is the residual error = minimal
attainable error = vary{y}
Bias/variance decomposition (3)
ELS{(Ey{y}-ŷ)²} = ELS{((Ey{y}-ELS{ŷ}) + (ELS{ŷ}-ŷ))²}
  = (Ey{y}-ELS{ŷ})² + ELS{(ŷ-ELS{ŷ})²}
(the cross term vanishes because ELS{ŷ-ELS{ŷ}} = 0)
Bias/variance decomposition (4)
[Figure: distributions of y and ŷ, with the gap bias² between Ey{y} and ELS{ŷ}]
E = vary{y} + (Ey{y}-ELS{ŷ})² + …
ELS{ŷ} = average model (over all LS)
bias² = error between the Bayes model and the average model
Bias/variance decomposition (5)
[Figure: distribution of ŷ around ELS{ŷ}, with spread varLS{ŷ}]
E = vary{y} + bias² + ELS{(ŷ-ELS{ŷ})²}
varLS{ŷ} = estimation variance = consequence of over-fitting
Bias/variance decomposition (6)
[Figure: the three terms vary{y}, bias², and varLS{ŷ}, located around Ey{y} and ELS{ŷ}]
E = vary{y} + bias² + varLS{ŷ}
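The decomposition can be checked numerically. The sketch below (not part of the original slides) uses the sample mean as the learning algorithm; the Gaussian population (mean 178 cm, standard deviation 7 cm) and the sample size N=25 are illustrative assumptions.

```python
# Minimal sketch (not from the slides): Monte Carlo check of
# E = vary{y} + bias² + varLS{ŷ} for the no-input height problem.
# The Gaussian population (mean 178 cm, std 7 cm) and N=25 are
# illustrative assumptions; the learning algorithm is the sample mean.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, N, n_ls = 178.0, 7.0, 25, 10000

# Draw many learning samples LS of size N and store each prediction ŷ
y_hats = np.array([rng.normal(mu, sigma, N).mean() for _ in range(n_ls)])

residual = sigma ** 2                     # vary{y}, minimal attainable error
bias2 = (mu - y_hats.mean()) ** 2         # (Ey{y} - ELS{ŷ})², essentially 0 here
variance = y_hats.var()                   # varLS{ŷ}, close to sigma²/N

# Direct estimate of E = ELS{Ey{(y - ŷ)²}} for comparison
y_test = rng.normal(mu, sigma, 20000)
E = np.mean((y_test[None, :] - y_hats[:200, None]) ** 2)

print(f"residual={residual:.2f} bias²={bias2:.4f} variance={variance:.2f}")
print(f"sum={residual + bias2 + variance:.2f}  direct estimate E={E:.2f}")
```

The bias term comes out essentially zero (the sample mean is an unbiased estimate of Ey{y}) and the variance close to σ²/N, which is the situation discussed on the next slide.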
Our simple example
• ŷ1, the sample mean:
– zero bias
– From statistics, ŷ1 is the best estimate with zero bias
• ŷ2, the estimate shrunk towards 180 (parameter λ):
– some bias, but a smaller variance
– So, the first one may not be the best estimator because of
variance (there is a bias/variance tradeoff w.r.t. λ)
Bayesian approach (1)
• Hypotheses:
– The average height is close to 180 cm (a prior on the mean)
– The height of one individual is Gaussian around the
mean
• What is the most probable value of the mean after having
seen the learning sample?
Bayesian approach (2)
The most probable value is obtained by applying Bayes' theorem
(P(LS) is constant) and using the independence of the learning cases.
Regression problem – full (1)
• Actually, we want to find a function ŷ(x) of several inputs
⇒ average over the whole input space
• The error at a point x becomes: Err(x) = Ey|x{(y-ŷ(x))²}
• Over all learning sets: ELS{Err(x)} = ELS{Ey|x{(y-ŷ(x))²}}
Regression problem – full (2)
ELS{Ey|x{(y-ŷ(x))²}} = Noise(x) + Bias²(x) + Variance(x)
• Noise(x) = Ey|x{(y-hB(x))²}:
Quantifies how much y varies from hB(x) = Ey|x{y}, the
Bayes model.
• Bias²(x) = (hB(x)-ELS{ŷ(x)})²:
Measures the error between the Bayes model and the
average model.
• Variance(x) = ELS{(ŷ(x)-ELS{ŷ(x)})²}:
Quantifies how much ŷ(x) varies from one learning sample
to another.
Illustration (1)
• Problem definition:
– One input x, a uniform random variable in [0,1]
– y = h(x) + ε where ε ~ N(0,1)
[Figure: data points y against x, with the curve h(x) = Ey|x{y}]
Illustration (2)
• Low variance, high bias method ⇒ underfitting
[Figure: average model ELS{ŷ(x)} compared to the data]
Illustration (3)
• Low bias, high variance method ⇒ overfitting
[Figure: average model ELS{ŷ(x)} compared to the data]
Illustration (4)
• No noise doesn’t imply no variance (but less variance)
[Figure: average model ELS{ŷ(x)} on the noise-free data]
Classification problems (1)
• The mean misclassification error is: Ex,y{1(y ≠ ŷ(x))}
• The best possible model is the Bayes model:
hB(x) = arg maxc P(y=c|x)
• The “average” model is the most probable prediction over all
learning samples: ŷavg(x) = arg maxc PLS(ŷ(x)=c)
• Unfortunately, there is no such decomposition of
the mean misclassification error into bias and
variance terms.
• Nevertheless, we observe the same phenomena
Classification problems (2)
[Figure: class boundaries (classes 0 and 1) obtained from two learning
samples LS1 and LS2, for a single test node (left) and a full decision
tree (right)]
Classification problems (3)
• Bias ⇒ systematic error component (independent of the
learning sample)
• Variance ⇒ error due to the variability of the model with
respect to the learning sample randomness
• There are errors due to bias and errors due to variance
[Figure: class boundaries for a single test node (left) and a full
decision tree (right), illustrating errors due to bias and errors due
to variance]
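Even without an exact decomposition, these two error components can be observed empirically. The sketch below (not part of the original slides) compares the mean misclassification error with the error of the “average” (majority-vote) model for a one-test-node tree and a full tree; the synthetic two-class problem and scikit-learn trees are illustrative assumptions.

```python
# Sketch (not from the slides): compare the mean error over learning
# samples with the error of the majority-vote "average" model, for a
# decision stump vs. a full decision tree, on a synthetic problem.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

rng = np.random.default_rng(0)
X_all, y_all = make_classification(n_samples=20000, n_features=10, random_state=1)
X_test, y_test = X_all[:2000], y_all[:2000]      # fixed test set
X_pool, y_pool = X_all[2000:], y_all[2000:]      # pool to draw learning samples from

def error_components(max_depth, n_ls=50, ls_size=200):
    preds = []
    for _ in range(n_ls):
        idx = rng.choice(len(y_pool), size=ls_size, replace=False)
        clf = DecisionTreeClassifier(max_depth=max_depth).fit(X_pool[idx], y_pool[idx])
        preds.append(clf.predict(X_test))
    preds = np.array(preds)                              # shape (n_ls, 2000)
    mean_err = np.mean(preds != y_test)                  # mean misclassification error
    majority = (preds.mean(axis=0) > 0.5).astype(int)    # "average" (majority-vote) model
    return mean_err, np.mean(majority != y_test)

for depth in (1, None):   # one test node vs. a full decision tree
    mean_err, avg_err = error_components(depth)
    print(f"max_depth={depth}: mean error={mean_err:.3f}, "
          f"error of average model={avg_err:.3f}")
```

The gap between the mean error and the error of the average model behaves like a variance term: it is small for the stump and larger for the full tree.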
Content of the presentation
• Bias and variance definitions
• Parameters that influence bias and variance
– Complexity of the model
– Complexity of the Bayes model
– Noise
– Learning sample size
– Learning algorithm
• Decision/regression tree variance
• Bias and variance reduction techniques
Illustrative problem
• Artificial problem with 10 inputs, all uniform
random variables in [0,1]
• The true function depends only on 5 inputs:
y(x) = 10·sin(π·x1·x2) + 20·(x3-0.5)² + 10·x4 + 5·x5 + ε,
where ε is a N(0,1) random variable
• Experiments:
– ELS ⇒ average over 50 learning sets of size 500
– Ex,y ⇒ average over 2000 cases
⇒ Estimate variance and bias (+ residual error)
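This experimental protocol can be sketched in code as follows (not in the original slides); scikit-learn's DecisionTreeRegressor stands in for the regression-tree learner actually used, so the exact numbers will differ somewhat from the tables on the following slides.

```python
# Sketch (not from the slides): estimate noise, bias² and variance of a
# full regression tree on the 10-input illustrative problem, averaging
# over 50 learning sets of size 500 and 2000 test cases.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def true_fn(X):
    # The 5 relevant inputs of the illustrative problem
    return (10 * np.sin(np.pi * X[:, 0] * X[:, 1])
            + 20 * (X[:, 2] - 0.5) ** 2 + 10 * X[:, 3] + 5 * X[:, 4])

def sample(n):
    X = rng.uniform(0, 1, size=(n, 10))            # 10 inputs, 5 are irrelevant
    return X, true_fn(X) + rng.normal(0, 1, n)     # additive N(0,1) noise

X_test, _ = sample(2000)                           # 2000 test cases
h_bayes = true_fn(X_test)                          # Bayes model hB(x) = Ey|x{y}

preds = []
for _ in range(50):                                # 50 learning sets of size 500
    X_ls, y_ls = sample(500)
    preds.append(DecisionTreeRegressor().fit(X_ls, y_ls).predict(X_test))
preds = np.array(preds)                            # shape (50, 2000)

noise = 1.0                                        # variance of epsilon
bias2 = np.mean((h_bayes - preds.mean(axis=0)) ** 2)
variance = np.mean(preds.var(axis=0))
print(f"noise={noise:.1f} bias²={bias2:.2f} variance={variance:.2f} "
      f"E≈{noise + bias2 + variance:.2f}")
```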
Complexity of the model
[Figure: schematic of E = bias² + var, with bias² decreasing and var
increasing as functions of model complexity]
Usually, the bias is a decreasing function of the
complexity, while variance is an increasing
function of the complexity.
Complexity of the model – neural networks
• Error, bias, and variance w.r.t. the number of
neurons in the hidden layer
[Figure: error, bias, and variance curves as a function of the number
of hidden perceptrons (0 to 12)]
Complexity of the model – regression trees
• Error, bias, and variance w.r.t. the number of test
nodes
[Figure: error, bias, and variance curves as a function of the number
of test nodes (0 to 50)]
Complexity of the model – k-NN
• Error, bias, and variance w.r.t. k, the number of
neighbors
[Figure: error, bias, and variance curves as a function of the number
of neighbors k (0 to 30)]
Learning problem
• Complexity of the Bayes model:
– At fixed model complexity, bias increases with the complexity of
the Bayes model. However, the effect on variance is difficult to
predict.
• Noise:
– Variance increases with noise and bias is mainly unaffected.
– E.g. with (full) regression trees
[Figure: error, noise, bias, and variance curves as a function of the
noise standard deviation (0 to 6)]
Learning sample size (1)
• At fixed model complexity, bias remains constant and
variance decreases with the learning sample size. E.g.
linear regression
[Figure: error, bias, and variance curves as a function of the
learning sample size (0 to 2000) for linear regression]
Learning sample size (2)
• When the complexity of the model is dependent on the
learning sample size, both bias and variance decrease with
the learning sample size. E.g. regression trees
[Figure: error, bias, and variance curves as a function of the
learning sample size (0 to 2000) for regression trees]
Learning algorithms – linear regression
Method          Err²    Bias²+Noise    Variance
Linear regr.     7.0        6.8           0.2
k-NN (k=1)      15.4        5.0          10.4
k-NN (k=10)      8.5        7.2           1.3
MLP (10)         2.0        1.2           0.8
MLP (10-10)      4.6        1.4           3.2
Regr. Tree      10.2        3.5           6.7

• Very few parameters: small variance
• Goal function is not linear: high bias
Learning algorithms – k-NN
Method          Err²    Bias²+Noise    Variance
Linear regr.     7.0        6.8           0.2
k-NN (k=1)      15.4        5.0          10.4
k-NN (k=10)      8.5        7.2           1.3
MLP (10)         2.0        1.2           0.8
MLP (10-10)      4.6        1.4           3.2
Regr. Tree      10.2        3.5           6.7

• Small k: high variance and moderate bias
• High k: smaller variance but higher bias
Learning algorithms - MLP
Method          Err²    Bias²+Noise    Variance
Linear regr.     7.0        6.8           0.2
k-NN (k=1)      15.4        5.0          10.4
k-NN (k=10)      8.5        7.2           1.3
MLP (10)         2.0        1.2           0.8
MLP (10-10)      4.6        1.4           3.2
Regr. Tree      10.2        3.5           6.7

• Small bias
• Variance increases with the model complexity
Learning algorithms – regression trees
Method          Err²    Bias²+Noise    Variance
Linear regr.     7.0        6.8           0.2
k-NN (k=1)      15.4        5.0          10.4
k-NN (k=10)      8.5        7.2           1.3
MLP (10)         2.0        1.2           0.8
MLP (10-10)      4.6        1.4           3.2
Regr. Tree      10.2        3.5           6.7

• Small bias: a (complex enough) tree can
approximate any non-linear function
• High variance
Content of the presentation
• Bias and variance definitions
• Parameters that influence bias and variance
• Decision/regression tree variance
• Bias and variance reduction techniques
Decision/regression tree variance (1)
• DT/RT are among the machine learning methods that
present the highest variance. Even a small change of the
learning sample can result in a very different tree.
• Even small trees have a high variance

Method                 E      Bias    Variance
k-NN (k=10)           8.5      7.2      1.3
MLP (10-10)           4.6      1.4      3.2
RT, no test          25.5     25.4      0.1
RT, 1 test           19.0     17.7      1.3
RT, 3 tests          14.8     11.1      3.7
RT, full (250 tests) 10.2      3.5      6.7
Decision/regression tree variance (2)
• Possible sources of variance:
– Discretization of numerical attributes
• The selected threshold has a high variance (see next slide).
– Structure choice
• Sometimes, attribute scores are very close.
– Estimation at leaf nodes
• Because of the recursive partitioning, predictions at leaf nodes
are based on very small samples of objects.
• Consequences:
– sub-optimality in terms of accuracy
– questionable interpretability, since the parameters cannot
be trusted
Decision/regression tree variance (3)
• The discretization thresholds chosen in trees are
very unstable
[Figure: three different thresholds on attribute A1 (0.48, 0.61, and
0.82) selected from different learning samples]
• This variance puts the interpretability into question
Content of the presentation
• Bias and variance definitions
• Parameters that influence bias and variance
• Decision/regression tree variance
• Bias and variance reduction techniques
– Introduction
– Dealing with the bias/variance tradeoff of one
algorithm
– Ensemble methods
Bias and variance reduction techniques
• In the context of a given method:
– Adapt the learning algorithm to find the best trade-off
between bias and variance.
– Not a panacea but the least we can do.
– Example: pruning, weight decay.
• Ensemble methods:
– Change the bias/variance trade-off.
– Universal but destroys some features of the initial
method.
– Example: bagging, boosting.
Variance reduction: 1 model (1)
• General idea: reduce the ability of the learning
algorithm to fit the LS
– Pruning
• reduces the model complexity explicitly
– Early stopping
• reduces the amount of search
– Regularization
• reduces the size of the hypothesis space
• Weight decay with neural networks consists in penalizing large
weight values
Variance reduction: 1 model (2)
[Figure: E = bias² + var as a function of the level of fitting, with
the optimal fitting level at the minimum of E]
• Selection of the optimal level of fitting
– a priori (not optimal)
– by cross-validation (less efficient): Bias² ⇒ error on the
learning set, E ⇒ error on an independent test set
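As an illustration of selecting the level of fitting by cross-validation (not in the original slides), the sketch below picks the number of leaf nodes of a scikit-learn regression tree; max_leaf_nodes is an illustrative stand-in for the pruning level used in the slides.

```python
# Sketch (not from the slides): choose the fitting level of a
# regression tree by cross-validation, via the number of leaf nodes.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

def select_tree_by_cv(X, y, candidate_sizes=(5, 10, 20, 45, 100, None)):
    best_size, best_score = None, -np.inf
    for size in candidate_sizes:
        tree = DecisionTreeRegressor(max_leaf_nodes=size)
        # Mean negative MSE over 10 folds approximates the error on an
        # independent test set (the quantity to minimize above)
        score = cross_val_score(tree, X, y, cv=10,
                                scoring="neg_mean_squared_error").mean()
        if score > best_score:
            best_size, best_score = size, score
    return DecisionTreeRegressor(max_leaf_nodes=best_size).fit(X, y)
```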
Variance reduction: 1 model (3)
• Examples:
– Post-pruning of regression trees
– Early stopping of MLP by cross-validation
Method                  E      Bias    Variance
Full regr. Tree (250)  10.2     3.5      6.7
Pr. regr. Tree (45)     9.1     4.3      4.8
Full learned MLP        4.6     1.4      3.2
Early stopped MLP       3.8     1.5      2.3

• As expected, variance decreases but bias increases
Ensemble methods
• Combine the predictions of several models built with a
learning algorithm in order to improve with respect to the
use of a single model
• Two important families:
– Averaging techniques
• Grow several models independently and simply average their
predictions
• Ex: bagging, random forests
• Decrease mainly variance
– Boosting type algorithms
• Grow several models sequentially
• Ex: Adaboost, MART
• Decrease mainly bias
Bagging (1)
ELS{Err(x)} = Ey|x{(y-hB(x))²} + (hB(x)-ELS{ŷ(x)})² + ELS{(ŷ(x)-ELS{ŷ(x)})²}
• Idea: the average model ELS{ŷ(x)} has the same bias as
the original method but zero variance
• Bagging (Bootstrap AGGregatING):
– To compute ELS{ŷ(x)}, we should draw an infinite number of LS
(of size N)
– Since we have only one single LS, we simulate sampling from
nature by bootstrap sampling from the given LS
– Bootstrap sampling = sampling with replacement of N objects from
LS (N is the size of LS)
Bagging (2)
[Schematic: the original LS gives bootstrap samples LS1, LS2, …, LST,
each producing a model ŷ1(x), ŷ2(x), …, ŷT(x)]
In regression: ŷ(x) = (1/T)·(ŷ1(x)+ŷ2(x)+…+ŷT(x))
In classification: ŷ(x) = the majority class in {ŷ1(x),…,ŷT(x)}
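A compact sketch of this procedure for regression (not in the original slides), using scikit-learn trees as the base learner:

```python
# Sketch (not from the slides): bagging for regression. Draw T bootstrap
# samples from the single available LS, fit a full regression tree on
# each, and average the T predictions. For classification, replace the
# average by a majority vote.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_fit(X, y, T=25, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    models = []
    for _ in range(T):
        idx = rng.integers(0, n, size=n)   # sampling with replacement of N objects
        models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    # In regression: average of the individual predictions
    return np.mean([m.predict(X) for m in models], axis=0)
```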
Bagging (3)
• Usually, bagging greatly reduces the variance without
increasing the bias too much.
• Application to regression trees
Method              E      Bias    Variance
3 Test regr. Tree  14.8    11.1      3.7
Bagged (T=25)      11.7    10.7      1.0
Full regr. Tree    10.2     3.5      6.7
Bagged (T=25)       5.3     3.8      1.5
• Strong variance reduction without increasing the bias
(although the model is much more complex than a single
tree)
Bagging (4)
[Figure: bagging illustrated on the one-dimensional regression
problem (y versus x)]
Other averaging techniques
• Perturb and Combine paradigm:
– Perturb the data or the learning algorithm to obtain several models
that are good on the learning sample.
– Combine the predictions of these models
• Usually, these methods decrease the variance (because of
averaging) but (slightly) increase the bias (because of the
perturbation)
• Examples:
– Bagging perturbs the learning sample.
– Learn several neural networks with random initial weights
Method               E      Bias    Variance
MLP (10-10)         4.6     1.4       3.2
Average of 10 MLPs  2.0     1.4       0.6

– Random forests.
Random forests (1)
• Perturb and combine algorithm specifically designed for
trees
• Combine bagging and random attribute subset selection:
– Build the tree from a bootstrap sample
– Instead of choosing the best split among all attributes, select the
best split among a random subset of k attributes
(= bagging when k is equal to the number of attributes)
• There is a bias/variance tradeoff with k: The smaller k, the
greater the reduction of variance but also the higher the
increase of bias
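For reference (not in the original slides), this variant corresponds closely to scikit-learn's RandomForestRegressor, where max_features plays the role of k:

```python
# Sketch: the random forest described above, via scikit-learn.
# bootstrap=True gives the bagging part; max_features=k restricts each
# split to a random subset of k attributes (k = number of attributes
# reduces to plain bagging).
from sklearn.ensemble import RandomForestRegressor

def random_forest(k, T=25):
    return RandomForestRegressor(n_estimators=T, max_features=k, bootstrap=True)

# Usage: rf = random_forest(k=3).fit(X_ls, y_ls); y_pred = rf.predict(X_test)
```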
Random forests (2)
• Application to our illustrative problem:
Method                  E      Bias    Variance
Full regr. Tree        10.2     3.5      6.7
Bagging (k=10)          5.3     3.8      1.5
Random Forests (k=7)    4.8     3.8      1.0
Random Forests (k=5)    4.9     4.0      0.9
Random Forests (k=3)    5.6     4.7      0.8
• Other advantage: it decreases computing times with respect
to bagging since only a subset of all attributes needs to be
considered when splitting a node.
Boosting methods (1)
• The motivation of boosting is to combine the
outputs of many “weak” models to produce a
powerful ensemble of models.
• Weak model = a model that has a high bias
(strictly, in classification, a model slightly better
than random guessing)
• Differences with previous ensemble methods:
– Models are built sequentially on modified versions of
the data
– The predictions of the models are combined through a
weighted sum/vote
Boosting methods (2)
[Schematic: the LS is modified sequentially into LS1, LS2, …, LST,
each producing a model ŷ1(x), ŷ2(x), …, ŷT(x)]
In regression: ŷ(x) = b1·ŷ1(x) + b2·ŷ2(x) + … + bT·ŷT(x)
In classification: ŷ(x) = the majority class in {ŷ1(x),…,ŷT(x)}
according to the weights {b1,b2,…,bT}
Adaboost (1)
• Assume that the learning algorithm accepts weighted
objects
• This is the case of many learning algorithms:
– With trees, simply take into account the weights when counting
objects
– In neural networks, minimize the weighted squared error
• At each step, Adaboost increases the weights of cases from
the learning sample misclassified by the last model
• Thus, the algorithm focuses on the difficult cases from the
learning sample
• In the weighted majority vote, Adaboost gives higher
influence to the more accurate models
Adaboost (2)
• Input: a learning algorithm and a learning sample {(xi,yi):
i=1,…,N}
• Initialize the weights wi=1/N, i=1,…,N
• For t=1 to T
– Build a model ŷt(x) with the learning algorithm using weights wi
– Compute the weighted error:
errt = Σi wi·I(yi ≠ ŷt(xi)) / Σi wi
– Compute bt = log((1-errt)/errt)
– Change weights:
wi ← wi·exp[bt·I(yi ≠ ŷt(xi))]
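A sketch of this loop (not in the original slides), using scikit-learn decision stumps as the weighted base learner; the binary labels encoded as -1/+1 are an assumption made for the final weighted vote.

```python
# Sketch (not from the slides) of the Adaboost loop above.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    n = len(y)
    w = np.full(n, 1.0 / n)                      # wi = 1/N
    models, betas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = (stump.predict(X) != y).astype(float)
        err = np.sum(w * miss) / np.sum(w)       # weighted error err_t
        if err <= 0 or err >= 0.5:               # stop if perfect or worse than chance
            break
        beta = np.log((1 - err) / err)           # b_t
        w = w * np.exp(beta * miss)              # increase weights of misclassified cases
        models.append(stump)
        betas.append(beta)
    return models, betas

def adaboost_predict(models, betas, X):
    # Weighted majority vote, assuming labels in {-1, +1}
    scores = sum(b * m.predict(X) for b, m in zip(models, betas))
    return np.sign(scores)
```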
MART (multiple additive regression trees)
MART is a boosting algorithm for regression
• Input: a learning sample {(xi,yi): i=1,…,N}
• Initialize
– ŷ0(x) = (1/N)·Σi yi ; ri = yi, i=1,…,N
• For t=1 to T:
– For i=1 to N, compute the residuals: ri ← ri - ŷt-1(xi)
– Build a regression tree ŷt(x) from the learning sample {(xi,ri): i=1,…,N}
• Return the model ŷ(x) = ŷ0(x)+ŷ1(x)+…+ŷT(x)
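A sketch of this loop (not in the original slides); scikit-learn regression trees with a limited number of leaves stand in for the small trees discussed on the next slide.

```python
# Sketch (not from the slides) of MART: each new regression tree is
# fitted to the current residuals, and the final model is the initial
# mean plus the sum of all trees.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def mart_fit(X, y, T=50, max_leaf_nodes=6):
    y0 = np.mean(y)                              # ŷ0(x) = (1/N)·Σi yi
    r = y - y0                                   # residuals after the constant model
    trees = []
    for _ in range(T):
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes).fit(X, r)
        r = r - tree.predict(X)                  # ri ← ri - ŷt(xi)
        trees.append(tree)
    return y0, trees

def mart_predict(y0, trees, X):
    # ŷ(x) = ŷ0(x) + ŷ1(x) + … + ŷT(x)
    return y0 + sum(t.predict(X) for t in trees)
```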
Boosting methods
• Adaboost and MART are only two boosting variants. There
are many other boosting type algorithms.
• Boosting decision/regression trees often improves their accuracy
dramatically. However, boosting is more sensitive to
noise than averaging techniques (overfitting).
• For boosting to work, the models must not be perfect on
the learning sample. With trees, there are two possible
strategies:
– Use pruned trees (pre-pruned or post-pruned by cross-validation)
– Limit the number of tree tests (and split the most impure
nodes first)
• ⇒ There is again a bias/variance tradeoff with respect to the
tree size.
Experiment with MART
• On our illustrative problem:
Method                    E      Bias    Variance
Full regr. Tree          10.2     3.5      6.7
Regr. Tree with 1 test   18.9    17.8      1.1
+ MART (T=50)             5.0     3.1      1.9
+ Bagging (T=50)         17.9    17.3      0.6
Regr. Tree with 5 tests  11.7     8.8      2.9
+ MART (T=50)             6.4     1.7      4.7
+ Bagging (T=50)          9.1     8.7      0.4
• Boosting reduces the bias but increases the variance.
However, with respect to full trees, it decreases both bias
and variance.
Interpretability and efficiency of ensembles
• Since we average several models, we lose
interpretability and efficiency, which are two of the
main advantages of decision/regression trees
• However,
– We can still use the ensembles to compute variable
importance by averaging over all trees. Actually, this
even stabilizes the estimates.
– Averaging techniques can be parallelized and boosting-type
algorithms use smaller trees. So, the increase in
computing times is not so detrimental.
Experiments on Golub’s microarray data
• 72 objects, 7129 numerical attributes (gene expressions), 2 classes
(ALL and AML)
• Leave-one-out error with several variants
Method                          Error
1 decision tree                 22.2% (16/72)
Random forests (k=85, T=500)     9.7% (7/72)
Extra-trees (sth=0.5, T=500)     5.5% (4/72)
Adaboost (1 test node, T=500)    1.4% (1/72)
• Variable importance with boosting
[Figure: importance scores of the variables (gene expressions)
computed with boosting]
Conclusion (1)
• The notions of bias and variance are very useful to predict
how changing the (learning and problem) parameters will
affect the accuracy. E.g. this explains why very simple
methods can work much better than more complex ones on
very difficult tasks
• Variance reduction is a very important topic:
– To reduce bias is easy, but to keep variance low is not as easy.
– Especially in the context of new applications of machine learning
to very complex domains: temporal data, biological data, Bayesian
network learning, text mining…
• Not all learning algorithms are equal in terms of variance.
Trees are among the worst methods according to this criterion.
Conclusion (2)
• Ensemble methods are very effective techniques to reduce
bias and/or variance. They can turn a not-so-good
method into a competitive method in terms of accuracy.
• Adaboost with trees is considered one of the best “off-the-shelf”
classification methods.
• Interpretability of the model and efficiency of the method
are difficult to preserve if we want to reduce variance
significantly.
• There are other ways to tackle the variance/overfitting
problem, e.g.:
– Bayesian approaches (related to averaging techniques)
– Support vector machines (they maintain a low variance by
maximizing the classification margin)
References
• About bias and variance:
– Neural networks and the bias/variance dilemma, S. Geman et al., Neural
Computation, 4(1), 1992, 1-58
– Neural networks for statistical pattern recognition, C. M. Bishop, Oxford
University Press, 1994
– The Elements of Statistical Learning, T. Hastie et al., Springer, 2001
– Contributions to decision tree induction: bias/variance tradeoff and time
series classification, P. Geurts, PhD thesis, 2002
• About ensemble methods:
– Bagging predictors, L. Breiman, Machine Learning, 24, 1996
– A decision-theoretic generalization of on-line learning and an application
to boosting, Y. Freund and R. Schapire, Journal of Computer and System
Sciences, 1995
– Random Forests, L. Breiman, Machine Learning, 45, 2001
– Ensemble methods in machine learning, T. Dietterich, First International
Workshop on Multiple Classifier Systems, 2000
– An introduction to boosting and leveraging, R. Meir and G. Rätsch,
Advanced Lectures on Machine Learning, Springer, 2003
Software
• Random forests
– http://stat-www.berkeley.edu/users/breiman/rf.html
– R package randomForest
• Boosting:
– See www.boosting.org