Transcript Document

Recent developments in tree induction for KDD « Towards soft tree induction »

Louis WEHENKEL University of Liège – Belgium Department of Electrical and Computer Engineering 1

A. Supervised learning

(notation)
 x = (x_1, …, x_m) : vector of input variables (numerical and/or symbolic)
 y : single output variable
   Symbolic : classification problem
   Numeric : regression problem
 LS = ( (x_1,y_1), …, (x_N,y_N) ) : sample of N input/output pairs
 Learning (or modeling) algorithm : mapping from the sample space to the hypothesis space H
 Say y = f(x) + e : « guess » 'f_LS' in H, where 'e' = modeling error, so as to minimize 'e'²
 2

Statistical viewpoint

 x and y are random variables distributed according to p(x,y)
 LS is distributed according to p^N(x,y)
 f_LS is a random function (selected in H), so e(x) = y – f_LS(x) is also a random variable
 Given a 'metric' to measure the error, we can define the best possible model (Bayes model)
   Regression : f_B(x) = E(y|x)
   Classification : f_B(x) = argmax_y P(y|x)
 3

B. Crisp decision trees

(what is it ?)
 [Figure: example of a crisp tree, with tests X1 < 0.6 and X2 < 1.5 leading to the leaves "Y is very big", "Y is big" and "Y is small".]
 4

B. Crisp decision trees

(what is it ?)
 [Figure: the corresponding axis-parallel partition of the input space, with a boundary at X1 = 0.6.]
 5
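As a concrete illustration of the "crisp" semantics, here is a minimal Python sketch of a tree of this shape: each test case follows exactly one path of hard threshold comparisons and receives the label of the leaf it reaches. Branch orientation and leaf labels are illustrative assumptions, not taken verbatim from the figure.

```python
# Minimal sketch of a crisp tree like the one in the figure
# (branch orientation and leaf labels are illustrative).
def crisp_tree_predict(x1: float, x2: float) -> str:
    if x1 < 0.6:
        if x2 < 1.5:
            return "Y is very big"
        return "Y is big"
    return "Y is small"

print(crisp_tree_predict(0.3, 1.0))  # falls in the region x1 < 0.6 and x2 < 1.5
```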

Tree induction

(Overview)  Growing the tree (uses GS, a part of LS)  Top down (until all nodes are closed)  At each step  Select open node to split (best first, greedy approach)  Find best input variable and best question  If node can be purified split , otherwise close the node  Pruning the tree (uses PS, rest of LS)  Bottom up (until all nodes are contracted)  At each step  Select test node to contract (worst first, greedy…)  Contract and evaluate 6

Tree Growing

 Demo : Titanic database
 Comments :
   Tree growing is a local process
   Very efficient
   Can select relevant input variables
   Cannot determine the appropriate tree shape (just like real trees…)
 7

Tree Pruning

 Strategy : to determine the appropriate tree shape, let the tree grow too big (along all branches), then reshape it by pruning away the irrelevant parts
 Tree pruning uses a global criterion to determine the appropriate shape
 Tree pruning is even faster than growing
 Tree pruning avoids overfitting the data
 8

Growing – Pruning

(graphically)
 [Figure: error on GS and on PS as a function of tree complexity; growing moves from underfitting towards overfitting, pruning moves back towards the final tree.]
 9

C. Soft trees

(what is it ?)  Generalization of crisp trees using continuous splits and aggregation of terminal node predictions  (

x

)  1   ,  (

x

)

y

1 (

x

) 

y

1    ,  (

x

)  ˆ 2 0 2   1    ,  (

x

)  

x

10

Soft trees

(discussion)  Each split is defined by two parameters  Position  , and width  of transition region  Generalize decision/regression trees into a continuous and differentiable model w.r.t. the model parameters   Test nodes : 

j

, 

j

Terminal nodes : 

i

 Other names (of similar models)  Fuzzy trees, continuous trees  Tree structured (neural, bayesian) networks  Hierarchical models 11

Soft trees

(Motivations)  Improve performance (w.r.t. crisp trees)  Use of a larger hypothesis space  Reduced variance and bias  Improved optimization (à la backprop)  Improve interpretability  More « honest » model  Reduced parameter variance  Reduced complexity 12

D. Plan of the presentation

 Bias/Variance tradeoff (in tree induction)  Main techniques to reduce variance  Why soft trees have lower variance  Techniques for learning soft trees 13

Concept of variance

 The learning sample is random
 The learned model is a function of the sample
 The model is therefore also random : variance
   Model predictions have variance
   Model structure / parameters have variance
 Variance reduces accuracy and interpretability
 Variance can be reduced by various 'averaging or smoothing' techniques
 14

Theoretical explanation

 Bias, variance and residual error
   Residual error : difference between the output variable and the best possible model (i.e. error of the Bayes model)
   Bias : difference between the best possible model and the average model produced by the algorithm
   Variance : average variability of the model around the average model
 Expected error² = res² + bias² + var
 NB : these notions depend on the 'metric' used for measuring the error
 15

Regression

(locally, at point x)
 [Figure: the conditional density p(y|x) of the output at a given input point x.]
 Find y' = f(x) such that E_{y|x}{ err(y, y') } is minimum, where err is an error measure
 Usually, err = squared error = (y – y')²
 f(x) = E_{y|x}{y} minimizes the error at every point : the Bayes model is the conditional expectation
 16

Learning algorithm

(1)
 Usually, p(y|x) is unknown
 Use LS = ( (x_1,y_1), …, (x_N,y_N) ) and a learning algorithm to choose a hypothesis in H : ŷ_LS(x) = f(LS, x)
 At each input point x, the prediction ŷ_LS(x) is a random variable
 The distribution of ŷ_LS(x) depends on the sample size N and on the learning algorithm used
 17

Learning algorithm

(2)
 [Figure: the distribution p_LS(ŷ(x)) of the estimate over learning sets.]
 Since LS is randomly drawn, the estimate ŷ(x) is a random variable
 18

Good learning algorithm

 A good learning algorithm should minimize the average (generalization) error over all learning sets
 In regression, the usual error is the mean squared error, so we want to minimize, at each point x :
   Err(x) = E_LS{ E_{y|x}{ (y – ŷ_LS(x))² } }
 There exists a useful additive decomposition of this error into three (positive) terms
 19

Bias/variance decomposition

(1)
 [Figure: the conditional distribution of y at x, centred at E_{y|x}{y} with spread var_{y|x}{y}.]
 Err(x) = E_{y|x}{ (y – E_{y|x}{y})² } + …
 E_{y|x}{y} = arg min_{y'} E_{y|x}{ (y – y')² } = Bayes model
 var_{y|x}{y} = residual error = minimal error
 20

Bias/variance decomposition

(2)
 [Figure: bias²(x) is the squared gap between E_{y|x}{y} and E_LS{ŷ(x)}.]
 Err(x) = var_{y|x}{y} + ( E_{y|x}{y} – E_LS{ŷ(x)} )² + …
 E_LS{ŷ(x)} = average model (w.r.t. LS)
 bias²(x) = error between the Bayes model and the average model
 21

Bias/variance decomposition

(3)
 [Figure: the spread of ŷ(x) around its mean E_LS{ŷ(x)}.]
 Err(x) = var_{y|x}{y} + bias²(x) + E_LS{ ( ŷ(x) – E_LS{ŷ(x)} )² }
 var_LS{ŷ(x)} = variance
 22

Bias/variance decomposition

(4)
 [Figure: the three components var_{y|x}{y}, bias²(x) and var_LS{ŷ(x)}, located around E_{y|x}{y} and E_LS{ŷ(x)}.]
 Local error decomposition :
   Err(x) = var_{y|x}{y} + bias²(x) + var_LS{ŷ(x)}
 Global error decomposition (take the average w.r.t. p(x)) :
   E_X{Err(x)} = E_X{ var_{y|x}{y} } + E_X{ bias²(x) } + E_X{ var_LS{ŷ(x)} }
 23
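The decomposition can be checked numerically. Below is a minimal Monte-Carlo sketch on a toy 1-D problem where p(y|x) is known, so the Bayes model and the residual error are available in closed form; the regression-tree model class, sample sizes and seed are illustrative choices, not the presentation's.

```python
# Monte-Carlo sketch of Err(x) = noise + bias^2 + variance at a single point x0,
# for a toy problem with known Bayes model h(x) = E{y|x} and known noise.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
h = lambda x: np.sin(2 * np.pi * x)           # Bayes model E{y|x}
sigma = 0.3                                    # residual (noise) standard deviation
N, n_ls, x0 = 100, 200, np.array([[0.25]])     # sample size, number of LS, evaluation point

preds = []
for _ in range(n_ls):                          # draw many learning sets
    X = rng.uniform(0, 1, (N, 1))
    y = h(X.ravel()) + rng.normal(0, sigma, N)
    preds.append(DecisionTreeRegressor(max_depth=3).fit(X, y).predict(x0)[0])
preds = np.array(preds)

noise = sigma ** 2                             # var_{y|x}{y}
bias2 = (h(x0.ravel()[0]) - preds.mean()) ** 2 # (Bayes model - average model)^2
variance = preds.var()                         # var_LS{y_hat(x0)}
print(f"noise={noise:.3f}  bias^2={bias2:.3f}  var={variance:.3f}  Err(x0)={noise+bias2+variance:.3f}")
```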

Illustration

(1)
 Problem definition :
   One input x, uniform random variable in [0,1]
   y = h(x) + ε, where ε ~ N(0,1)
 [Figure: the target h(x) = E_{y|x}{y} as a function of x.]
 24

Illustration

(2)
 Small variance, high bias method
 [Figure: the average model E_LS{ŷ(x)} for such a method, compared to h(x).]
 25

Illustration

(3)
 Small bias, high variance method
 [Figure: the average model E_LS{ŷ(x)} for such a method, compared to h(x).]
 26

Illustration

(Methods comparison)  Artificial problem with 10 inputs, all uniform random variables in [0,1]  The true function depends only on 5 inputs:

y

(

x

)=10.sin(

π

.

x

1 .

x

2 )+20.(

x

3 -0.5) 2 +10.

x

4 +5.

x

5 + ε, where ε is a N(0,1) random variable  Experimentation:  

E LS E x,y

  average over 50 learning sets of size 500 average over 2000 cases  Estimate variance and bias (+ residual error) 27
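A sketch of this experimental protocol in Python is given below: the 10-input benchmark with 5 relevant inputs, 50 learning sets of size 500, and 2000 test points over which bias² and variance are averaged. The regression tree stands in for any of the compared methods; any estimator with fit/predict could be substituted.

```python
# Sketch of the bias/variance experiment on the 10-input benchmark.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)

def target(X):                                  # noiseless part of the true function
    return (10 * np.sin(np.pi * X[:, 0] * X[:, 1]) + 20 * (X[:, 2] - 0.5) ** 2
            + 10 * X[:, 3] + 5 * X[:, 4])

def sample(n):                                  # one learning set of size n
    X = rng.uniform(0, 1, (n, 10))
    return X, target(X) + rng.normal(0, 1, n)

X_test = rng.uniform(0, 1, (2000, 10))          # fixed test points for the E_x average
preds = np.empty((50, 2000))
for i in range(50):                             # 50 learning sets of size 500
    X, y = sample(500)
    preds[i] = DecisionTreeRegressor().fit(X, y).predict(X_test)

bias2 = ((preds.mean(axis=0) - target(X_test)) ** 2).mean()
variance = preds.var(axis=0).mean()
print(f"bias^2 ~ {bias2:.2f}   variance ~ {variance:.2f}   noise = 1.0")
```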

Illustration

(Linear regression)
 Method          Err²   Bias²+Noise   Variance
 Linear regr.     7.0        6.8          0.2
 k-NN (k=1)      15.4        5.0         10.4
 k-NN (k=10)      8.5        7.2          1.3
 MLP (10)         2.0        1.2          0.8
 MLP (10 – 10)    4.6        1.4          3.2
 Regr. tree      10.2        3.5          6.7
 Very few parameters : small variance
 Goal function is not linear : high bias
 28

Illustration

(k-Nearest Neighbors)
 (same table as above)
 Small k : high variance and moderate bias
 High k : smaller variance but higher bias
 29

Illustration

(Multilayer Perceptrons)
 (same table as above)
 Small bias
 Variance increases with the model complexity
 30

Illustration

(Regression trees)
 (same table as above)
 Small bias : a (complex enough) tree can approximate any non-linear function
 High variance (see later)
 31

Variance reduction techniques

 In the context of a given method :
   Adapt the learning algorithm to find the best trade-off between bias and variance
   Not a panacea, but the least we can do
   Examples : pruning, weight decay
 Wrapper techniques :
   Change the bias/variance trade-off
   Universal, but destroy some features of the initial method
   Example : bagging
 32

Variance reduction: 1 model

(1)
 General idea : reduce the ability of the learning algorithm to over-fit the LS
   Pruning : reduces the model complexity explicitly
   Early stopping : reduces the amount of search
   Regularization : reduces the size of the hypothesis space
 33

Variance reduction: 1 model

(2)
 [Figure: bias², variance and E = bias² + var as functions of the degree of fitting; the optimal fitting level minimizes E.]
 Bias² ~ error on the learning set ; E ~ error on an independent test set
 Selection of the optimal level of tuning :
   a priori (not optimal)
   by cross-validation (less efficient)
 34
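As one concrete instance of "selection by cross-validation", the sketch below tunes the pruning strength of a scikit-learn regression tree (cost-complexity pruning) by 5-fold cross-validation; the data generator and the alpha grid are placeholders, not taken from the presentation.

```python
# Choosing the pruning level of a regression tree by cross-validation.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, (500, 10))
y = 10 * np.sin(np.pi * X[:, 0] * X[:, 1]) + rng.normal(0, 1, 500)

search = GridSearchCV(
    DecisionTreeRegressor(),
    param_grid={"ccp_alpha": np.logspace(-3, 0, 10)},   # pruning strength grid (illustrative)
    cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)
print("selected ccp_alpha:", search.best_params_["ccp_alpha"])
```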

Variance reduction: 1 model

(3)
 Examples :
   Post-pruning of regression trees
   Early stopping of MLP by cross-validation
 Method                   E      Bias    Variance
 Full regr. tree (488)   10.2    3.5      6.7
 Pruned regr. tree (93)   9.1    4.3      4.8
 Full learned MLP         4.6    1.4      3.2
 Early stopped MLP        3.8    1.5      2.3
 As expected, this reduces variance and increases bias
 35

Variance reduction: bagging

(1)
 Idea : the average model E_LS{ŷ(x)} has the same bias as the original method, but zero variance
 Bagging (Bootstrap AGGregatING) :
   To compute E_LS{ŷ(x)}, we should draw an infinite number of LS (of size N)
   Since we have only one single LS, we simulate sampling from nature by bootstrap sampling from the given LS
   Bootstrap sampling = sampling with replacement of N objects from LS (N is the size of LS)
 36

Variance reduction: bagging

(2)
 [Figure: from the given LS, draw bootstrap samples LS_1, LS_2, …, LS_k, build a model on each, and average the predictions at each test point x :]
   ŷ(x) = 1/k · ( ŷ_1(x) + ŷ_2(x) + … + ŷ_k(x) )
 37

Variance reduction: bagging

(3)
 Application to regression trees :
 Method                      E      Bias    Variance
 3-test regr. tree          14.8   11.1      3.7
 3-test regr. tree, bagged  11.7   10.7      1.0
 Full regr. tree            10.2    3.5      6.7
 Full regr. tree, bagged     5.3    3.8      1.5
 Strong variance reduction without increasing bias (although the model is much more complex than a single tree)
 38

Dual bagging

(1)
 Instead of perturbing learning sets to obtain several predictions, directly perturb the test case at the prediction stage
 Given a model ŷ(.) and a test case x :
   Form k attribute vectors by adding Gaussian noise to x : { x + ε_1, x + ε_2, …, x + ε_k }
   Average the predictions of the model at these points to get the prediction at point x : 1/k · ( ŷ(x + ε_1) + ŷ(x + ε_2) + … + ŷ(x + ε_k) )
 Noise level λ (variance of the Gaussian noise) selected by cross-validation
 39

Dual bagging

(2)
 With regression trees :
 Noise level    0.0    0.2    0.5    2.0
 E             10.2    6.3    5.3   13.3
 Bias           3.5    3.5    4.4   13.1
 Variance       6.7    2.8    0.9    0.2
 The perturbation smooths the function ŷ(.)
 Too much noise increases bias : there is a (new) trade-off between bias and variance
 40

Dual bagging

(classification trees)
 [Figure: decision boundaries of a dual-bagged classification tree for different noise levels]
   λ = 0 : error = 3.7 %
   λ = 0.3 : error = 1.4 %
   λ = 1.5 : error = 4.6 %
 41

Variance in tree induction

 Tree induction is among the ML methods of highest variance (together with 1-NN)
 Main reason : generalization is local, it depends on small parts of the learning set
 Sources of variance :
   Discretization of numerical attributes (60 %) : the selected thresholds have a high variance
   Structure choice (10 %) : sometimes, attribute scores are very close
   Estimation at leaf nodes (30 %) : because of the recursive partitioning, prediction at leaf nodes is based on very small samples of objects
 Consequences : questionable interpretability and higher error rates
 42

Threshold variance

(1)
 Test on a numerical attribute : [ a(o) < a_th ]
 Discretization : find the threshold a_th which optimizes the score
   Classification : maximize information
   Regression : minimize residual variance
 [Figure: score curve as a function of the threshold, with its optimum at a_th.]
 43
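A sketch of this discretization step for regression: scan the candidate thresholds and keep the one with minimal residual (within-branch) variance. Purely illustrative; it is not claimed to be the exact score used in the presentation.

```python
# Exhaustive threshold search minimizing the weighted within-branch variance.
import numpy as np

def best_threshold(a, y):
    """a: attribute values, y: outputs; returns (threshold, score)."""
    order = np.argsort(a)
    a, y = a[order], y[order]
    best = (None, np.inf)
    for i in range(1, len(a)):
        if a[i] == a[i - 1]:
            continue                              # no valid cut between equal values
        th = 0.5 * (a[i] + a[i - 1])
        left, right = y[:i], y[i:]
        score = (len(left) * left.var() + len(right) * right.var()) / len(y)
        if score < best[1]:
            best = (th, score)
    return best
```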

Threshold variance

(2)
 [Figure, N = 100 : score curves and the empirical distribution of the optimal threshold, as functions of the threshold value.]
 44

Threshold variance

(3)
 [Figure, N = 1000 : score curves and the empirical distribution of the optimal threshold, as functions of the threshold value.]
 45

Tree variance

 DT/RT are among the machine learning methods which present the highest variance
 Method                  E      Bias    Variance
 RT, no test            25.5   25.4      0.1
 RT, 1 test             19.0   17.7      1.3
 RT, 3 tests            14.8   11.1      3.7
 RT, full (250 tests)   10.2    3.5      6.7
 46

DT variance reduction

 Pruning :
   Necessary to select the right complexity
   Decreases variance but increases bias : small effect on accuracy
 Threshold stabilization :
   Smoothing of score curves, bootstrap sampling…
   Reduces parameter variance but has only a slight effect on accuracy and prediction variance
 Bagging :
   Very efficient at reducing variance
   But jeopardizes interpretability of trees and computational efficiency
 Dual bagging :
   In terms of variance reduction, similar to bagging
   Much faster, and can be simulated by soft trees
 Fuzzy tree induction :
   Build soft trees in a full-fledged approach
 47

Dual tree bagging = Soft trees

 Reformulation of dual bagging as an explicit soft tree propagation algorithm
 Algorithms :
   Forward-backward propagation in soft trees
   Softening of thresholds during the learning stage
 Some results
 48

Dual bagging = soft thresholds

 x + ε < x_th : sometimes left, sometimes right
 Multiple 'crisp' propagations can be 'replaced' by one 'soft' propagation
 E.g. if ε has a uniform pdf of width l, the probability of right propagation rises linearly from 0 to 1 over the transition region [ a_th – l/2 , a_th + l/2 ]
 [Figure: the transition region of width l around a_th, between TS_left and TS_right.]
 49

Forward-backward algorithm

 Top-down propagation of probability :
   P(Root | x) = 1
   P(N_1 | x) = P(Test1 | x) · P(Root | x)
   P(L_3 | x) = P(¬Test1 | x) · P(Root | x)
   P(L_1 | x) = P(Test2 | x) · P(N_1 | x)
   P(L_2 | x) = P(¬Test2 | x) · P(N_1 | x)
 Bottom-up aggregation of predictions
 50

Learning of l values

 Use of an independent 'validation' set and a bisection search
 One single value can be learned very efficiently (amounts to 10 full tests of a DT/RT on the validation set)
 A combination of several values can also be learned, with a risk of overfitting (see fuzzy tree induction, in what follows)
 51
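A sketch of tuning a single width value on an independent validation set. The presentation mentions a bisection-type search; a coarse scan over candidate widths is shown instead for simplicity, reusing the Node/soft_predict sketch above together with a simple helper that assigns the same width to every test node.

```python
# Select one global transition width l by minimizing the validation-set MSE.
import numpy as np

def set_widths(node, l):
    """Assign the same transition width l to every test node (a simple choice)."""
    if node.label is None:
        node.l = l
        set_widths(node.left, l)
        set_widths(node.right, l)

def tune_width(tree, X_val, y_val, candidates=np.linspace(0.0, 2.0, 11)):
    best_l, best_err = 0.0, np.inf
    for l in candidates:
        set_widths(tree, l)
        preds = np.array([soft_predict(tree, x) for x in X_val])
        err = np.mean((preds - y_val) ** 2)          # validation-set MSE
        if err < best_err:
            best_l, best_err = l, err
    set_widths(tree, best_l)                          # keep the best width
    return best_l
```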

Some results with dual bagging

 [Figure: test error rates (Pe %) of DT, DT + dual bagging, bagging, and bagging + dual, on several datasets (Omib, Waveform, Two-norm, VST, Dig44, Letter, Segment, Satellite, Pendigits).]
 52

Fuzzy tree induction

 General ideas  Learning Algorithm  Growing  Refitting  Pruning  Backfitting 53

General Ideas

 Obviously, soft trees have much lower variance than crisp trees
 In the « dual bagging » approach, attribute selection is carried out in a classical way, then tests are softened in a post-processing stage
 It might be more effective to combine the two steps : fuzzy tree induction
 54

Soft trees

 Samples are handled as fuzzy subsets
   Each observation belongs to such a fuzzy subset with a certain membership degree
 The SCORE measure is modified
   Objects are weighted by their membership degree
 Output y
   Denotes the membership degree to a class
 Goal of fuzzy tree induction : provide a smooth model of 'y' as a function of the input variables
 55

Fuzzy discretization

 Same as fuzzification
 Carried out locally, at the tree growing stage
 At each test node, on the basis of the local fuzzy sub-training set :
   Select the attribute, together with its discriminator, so as to maximize the local SCORE
   Split in a soft way and proceed recursively
 Criteria for SCORE :
   Minimal residual variance
   Maximal (fuzzy) information quantity
   Etc.
 56

Attaching labels to leaves

 Basically, for each terminal node, we need to determine a local estimate ŷ_i of y
 During intermediate steps :
   Use the average of 'y' in the local sub-learning set (direct computation)
 Refitting of the labels :
   Once the full tree has been grown, and at each step of pruning
   Determine all values simultaneously, so as to minimize the square error
   Amounts to a linear least squares problem (direct solution)
 57

Refitting

(Explanation)  A leaf corresponds to a basis function m i (

x

)  Product of discriminators encountered on the path from root  Tree prediction is equivalent to a weighted average of these basis functions  

y

(x) =

ŷ

1 * m 1 (

x

) +

ŷ

2 * m 2 (

x

) + … +

ŷ

k * m k (

x

) the weights

ŷ

i terminal nodes are the labels attached to the  Refitting amounts to tune the

ŷ

i parameters to minimize square error on training set 58

Tree growing and pruning

 Grow the tree
 Refit the leaf labels
 Prune the tree, refitting the leaf labels at each stage
 Test the sequence of pruned trees on a validation set
 Select the best pruning level
 59

Backfitting

(1)
 After growing and pruning, the fuzzy tree structure has been determined
 The leaf labels are globally optimal, but not the parameters of the discriminators (which were tuned locally)
 The resulting model has 2 parameters per test node and 1 parameter per terminal node
 The output (and hence the mean square error) of the fuzzy tree is a smooth function of these parameters
 The parameters can be optimized using a standard LSE technique, e.g. Levenberg-Marquardt
 60

Backfitting

(2)
 How to compute the derivatives needed by the nonlinear optimization technique ?
   Use a modified version of backpropagation to compute derivatives with respect to the parameters
   Yields an efficient algorithm (linear in the size of the tree)
 Backfitting starts from the tree produced after growing and pruning
   Already a good approximation of a local optimum
   Only a small number of iterations are necessary to backfit
 Backfitting may also lead to overfitting…
 61
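An illustrative backfitting sketch: all soft-tree parameters (one position and one width per test node, one label per leaf) are packed into a vector and handed to a generic nonlinear least-squares routine. The presentation computes exact derivatives with a backprop-like pass; here SciPy's default numerically-approximated Jacobian is used as a simpler stand-in, reusing the Node/soft_predict helpers from the earlier sketches.

```python
# Backfitting of all soft-tree parameters by nonlinear least squares.
import numpy as np
from scipy.optimize import least_squares

def collect_nodes(node, tests=None, leaves=None):
    if tests is None:
        tests, leaves = [], []
    if node.label is not None:
        leaves.append(node)
    else:
        tests.append(node)
        collect_nodes(node.left, tests, leaves)
        collect_nodes(node.right, tests, leaves)
    return tests, leaves

def backfit(tree, X, y):
    tests, leaves = collect_nodes(tree)
    theta0 = np.array([p for n in tests for p in (n.a_th, n.l)] +
                      [n.label for n in leaves])

    def unpack(theta):
        for i, n in enumerate(tests):
            n.a_th, n.l = theta[2 * i], abs(theta[2 * i + 1])   # keep widths non-negative
        for j, n in enumerate(leaves):
            n.label = theta[2 * len(tests) + j]

    def residuals(theta):
        unpack(theta)
        return np.array([soft_predict(tree, x) for x in X]) - y

    result = least_squares(residuals, theta0)   # nonlinear LSE (method='lm' gives Levenberg-Marquardt)
    unpack(result.x)
    return tree
```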

Summary and conclusions

 Variance is the number one problem in decision/regression tree induction
 It is possible to reduce variance significantly
   Bagging and/or tree softening
   Soft trees have the advantage of preserving interpretability and computational efficiency
 Two approaches have been presented to obtain soft trees
   Dual bagging
     Generic approach, fast and simple
     Best approach for very large databases
   Fuzzy tree induction
     Similar to an ANN type of model, but (more) interpretable
     Best approach for small learning sets (probably)
 62

Some references for further reading

 Variance evaluation/reduction, bagging
   Contact : Pierre GEURTS (PhD student)
   Papers :
     Discretization of continuous attributes for supervised learning: variance evaluation and variance reduction. (Invited) L. Wehenkel. Proc. of IFSA'97, International Fuzzy Systems Association World Congress, Prague, June 1997, pp. 381-388.
     Investigation and reduction of discretization variance in decision tree induction. P. Geurts and L. Wehenkel. Proc. of ECML'2000.
     Some enhancements of decision tree bagging. P. Geurts. Proc. of PKDD'2000.
     Dual perturb and combine algorithm. P. Geurts. Proc. of AI and Statistics 2001.
 63

 See also : www.montefiore.ulg.ac.be/services/stochastic/
 Fuzzy/soft tree induction
   Contact : Cristina OLARU (PhD student)
   Papers :
     Automatic induction of fuzzy decision trees and its application to power system security assessment. X. Boyen and L. Wehenkel. Int. Journal on Fuzzy Sets and Systems, Vol. 102, No. 1, pp. 3-19, 1999.
     On neurofuzzy and fuzzy decision trees approaches. (Invited) C. Olaru and L. Wehenkel. Proc. of IPMU'98, 7th Int. Congr. on Information Processing and Management of Uncertainty in Knowledge-based Systems, 1998.
 64