Tree-Based Methods (V&R 9.1)


STAT 6601 Project
Tree-Based Methods
(V&R 9.1)
Demeke Kasaw, Andreas Nguyen, Mariana Alvaro
Overview of Tree-based Methods

• What are they?
• How do they work?
• Examples…
• Tree pictorials are common: they are a simple way to depict relationships in data.
• Tree-based methods use this pictorial form to represent relationships between random variables.
• Trees can be used for both classification and regression.
[Figure: two example trees. Classification tree: Presence of Surgery Complications vs. Patient Age and Treatment Start Date, with splits on Start (months, e.g. Start >= 8.5, Start >= 14.5), Age (< 12 yrs vs. >= 12 yrs), and Sex (F/M), and leaves labeled Present/Absent. Regression tree: Time to Next Eruption vs. Length of Last Eruption, with splits on Last Eruption (< 3.0 min, < 4.1 min) and leaves giving predicted times 54.49, 76.83, and 81.18.]
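As a concrete illustration of the two tree types, here is a minimal R sketch (not from the original slides) fitting one classification tree and one regression tree with rpart; the built-in iris and faithful datasets are used only for illustration.

library(rpart)

# Classification tree: categorical response (iris species)
cls <- rpart(Species ~ ., data = iris, method = "class")

# Regression tree: numeric response (waiting time to the next Old Faithful
# eruption, predicted from the length of the last eruption)
reg <- rpart(waiting ~ eruptions, data = faithful, method = "anova")

# Both fits can be drawn as the tree pictorials shown above
plot(cls, margin = 0.1); text(cls, use.n = TRUE)
plot(reg, margin = 0.1); text(reg)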
General Computation Issues and Unique Solutions

• Over-fitting: when do we stop splitting? Stop generating new nodes when subsequent splits yield only little improvement.
• Evaluate the quality of the prediction: prune the tree so as to select the simplest, most accurate solution.
• Methods (a test-sample split is sketched below):
  – Cross-validation: apply the tree computed from one set of observations (the learning sample) to a completely independent set of observations (the testing sample).
  – V-fold cross-validation: repeat the analysis with different randomly drawn samples from the data. Use the tree that shows the best average accuracy for cross-validated predicted classifications or predicted values.
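A minimal sketch of test-sample cross-validation for a classification tree, assuming the rpart package and the iris data; the 70/30 split and the object names are illustrative choices, not from the slides.

library(rpart)

set.seed(1)
n <- nrow(iris)
train <- sample(n, size = round(0.7 * n))      # learning sample (70%)

# Fit the tree on the learning sample only
fit <- rpart(Species ~ ., data = iris[train, ], method = "class")

# Evaluate on the completely independent testing sample (remaining 30%)
pred <- predict(fit, newdata = iris[-train, ], type = "class")
mean(pred != iris$Species[-train])             # test-sample misclassification rate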
Computational Details

• Specify the criteria for predictive accuracy:
  – Minimum costs: lowest misclassification rate
  – Case weights
• Selecting splits:
  – Define a measure of impurity for a node. A node is "pure" if it contains observations of a single class.
• Determine when to stop splitting (see the rpart.control sketch after this list):
  – All nodes are pure or contain no more than a specified number n of cases
  – All nodes contain no more than a specified fraction of objects
• Selecting the "right-size" tree:
  – Test-sample cross-validation
  – V-fold cross-validation
  – Tree selection after pruning: if there are several trees with costs close to the minimum, select the smallest (least complex) one
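In rpart, these stopping and sizing choices are exposed through control parameters; a hedged sketch follows, with parameter values chosen only for illustration.

library(rpart)

# minsplit  : do not attempt a split in a node with fewer than this many cases
# minbucket : smallest allowed number of cases in a terminal node
# cp        : complexity parameter; splits that do not improve the fit by at
#             least cp are not tried (guards against over-fitting)
# xval      : number of cross-validations used to estimate each subtree's error
ctrl <- rpart.control(minsplit = 20, minbucket = 7, cp = 0.01, xval = 10)

fit <- rpart(Species ~ ., data = iris, method = "class", control = ctrl)

# Prune back to the "right-size" tree for a chosen complexity parameter
pruned <- prune(fit, cp = 0.02)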
Computational Formulas

• Estimation of accuracy in classification trees
  – Resubstitution estimate:
    $R(d) = \frac{1}{N}\sum_{n=1}^{N} X\bigl(d(x_n) \neq j_n\bigr)$
    where d(x) is the classifier and X(·) is the indicator function:
    X = 1 if d(x_n) ≠ j_n is true, and X = 0 if it is false.
• Estimation of accuracy in regression trees
  – Resubstitution estimate:
    $R(d) = \frac{1}{N}\sum_{i=1}^{N}\bigl(y_i - d(x_i)\bigr)^2$
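A minimal R sketch of both resubstitution estimates, computed on the learning sample itself; the object names cls and reg and the datasets are illustrative assumptions, not from the slides.

library(rpart)

# Classification: resubstitution estimate R(d) = proportion of
# learning-sample cases the tree misclassifies
cls <- rpart(Species ~ ., data = iris, method = "class")
R_class <- mean(predict(cls, type = "class") != iris$Species)

# Regression: resubstitution estimate R(d) = mean squared error
# on the learning sample
reg <- rpart(waiting ~ eruptions, data = faithful, method = "anova")
R_reg <- mean((faithful$waiting - predict(reg))^2)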
Computational Formulas
Estimation of Node Impurity

• Gini index:
  $g(t) = \sum_{j \neq i} p(j \mid t)\, p(i \mid t)$
  – p(j | t) is the probability of category j at node t
  – Reaches zero when only one class is present at a node
• Entropy (information):
  $h(t) = -\sum_{j} p(j \mid t)\, \log p(j \mid t)$
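A small R sketch of these two impurity measures computed from the class counts at a node; the function names gini_impurity and entropy_impurity are illustrative and not part of rpart.

# Class proportions p(j | t) from the class counts at node t
node_props <- function(counts) counts / sum(counts)

# Gini index: sum over pairs j != i of p(j|t) p(i|t), equivalently 1 - sum p(j|t)^2
gini_impurity <- function(counts) {
  p <- node_props(counts)
  1 - sum(p^2)
}

# Entropy / information: -sum p(j|t) log p(j|t) (terms with p = 0 contribute 0)
entropy_impurity <- function(counts) {
  p <- node_props(counts)
  p <- p[p > 0]
  -sum(p * log(p))
}

gini_impurity(c(50, 0, 0))     # pure node -> 0
gini_impurity(c(0, 49, 5))     # mixed node -> positive
entropy_impurity(c(0, 49, 5))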
Classification Tree Example:
What species are these flowers?

[Figure: photographs of the three iris species (setosa, versicolor, virginica) alongside the four measurements fed to the tree: Sepal Length, Sepal Width, Petal Length, Petal Width.]
Iris Classification Data

• The iris dataset relates species to petal and sepal dimensions reported in centimeters. It was originally used by R. A. Fisher and E. Anderson for a discriminant analysis example.
• The data are pre-packaged in R's datasets library and are also available from DASL.

  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
           6.7          3.0           5.0          1.7  versicolor
           5.8          2.7           3.9          1.2  versicolor
           7.3          2.9           6.3          1.8  virginica
           5.2          4.1           1.5          0.1  setosa
           4.4          3.2           1.3          0.2  setosa
Iris Classification
Method and Code
library(rpart)   # Load tree fitting package
data(iris)       # Load iris data

# Let x = tree object fitting Species vs. all other variables in iris,
# with 10-fold cross-validation
x <- rpart(Species ~ ., iris, xval = 10)

# Plot tree diagram with uniform spacing, diagonal branches,
# a 10% margin, and a title
plot(x, uniform = TRUE, branch = 0, margin = 0.1,
     main = "Classification Tree\nIris Species by Petal and Sepal Length")

# Add labels to tree with final counts, fancy shapes, and blue text color
text(x, use.n = TRUE, fancy = TRUE, col = "blue")
Results:
Classification Tree: Iris Species by Petal and Sepal Length

[Tree diagram:
  Petal.Length < 2.45                        -> setosa     (50/0/0)
  Petal.Length >= 2.45, Petal.Width < 1.75   -> versicolor (0/49/5)
  Petal.Length >= 2.45, Petal.Width >= 1.75  -> virginica  (0/1/45)]
Identify this flower…

New observation: Sepal Length = 6, Sepal Width = 3.4, Petal Length = 4.5, Petal Width = 1.6.

The tree-based approach is much simpler than the alternative, linear discriminant analysis:

Linear Discriminant Function for Groups
              setosa  versicolor  virginica
Constant      -85.21      -71.75    -103.27
Sepal.Length   23.54       15.70      12.45
Sepal.Width    23.59        7.07       3.69
Petal.Length  -16.43        5.21      12.77
Petal.Width   -17.40        6.43      21.08

Discriminant scores for the new flower (coefficients rounded):
  setosa:      -85 + 24(6) + 24(3.4) - 16(4.5) - 17(1.6) = 41
  versicolor:  -72 + 16(6) +  7(3.4) +  5(4.5) +  6(1.6) = 80
  virginica:  -103 + 12(6) +  4(3.4) + 13(4.5) + 21(1.6) = 75
Since versicolor has the highest score, we classify this flower as an Iris versicolor.

Classification with Cross-validation
                          True Group
Put into Group    setosa  versicolor  virginica
setosa                50           0          0
versicolor             0          48          1
virginica              0           2         49
Total N               50          50         50
N correct             50          48         49
Proportion         1.000       0.960      0.980
N = 150, N correct = 147

By contrast, the classification tree reaches the same answer with one or two comparisons (see the predict() sketch below):

Classification Tree: Iris Species by Petal and Sepal Length
  Petal.Length < 2.45                        -> setosa     (50/0/0)
  Petal.Length >= 2.45, Petal.Width < 1.75   -> versicolor (0/49/5)
  Petal.Length >= 2.45, Petal.Width >= 1.75  -> virginica  (0/1/45)
For this flower, Petal.Length = 4.5 >= 2.45 and Petal.Width = 1.6 < 1.75, so the tree also classifies it as versicolor.
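A minimal sketch of the same tree-based classification in R, refitting the tree x from the earlier slide; the data frame name new_flower is an illustrative choice.

library(rpart)
x <- rpart(Species ~ ., iris, xval = 10)   # tree from the earlier slide

# New flower from the slide: sepals 6 x 3.4 cm, petals 4.5 x 1.6 cm
new_flower <- data.frame(Sepal.Length = 6, Sepal.Width = 3.4,
                         Petal.Length = 4.5, Petal.Width = 1.6)

# The fitted tree needs only Petal.Length and Petal.Width to decide
predict(x, newdata = new_flower, type = "class")
# -> versicolor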
Regression Tree Example

• Software used: R, rpart package
• Goal: apply the regression tree method to the CPU data and predict the response variable, performance (perf).
CPU Data

• CPU performance of 209 different processors.

  name             syct  mmin  mmax cach chmin chmax perf
1 ADVISOR 32/60     125   256  6000  256    16   128  198
2 AMDAHL 470V/7      29  8000 32000   32     8    32  269
3 AMDAHL 470/7A      29  8000 32000   32     8    32  220
4 AMDAHL 470V/7B     29  8000 32000   32     8    32  172
5 AMDAHL 470V/7C     29  8000 16000   32     8    16  132
6 AMDAHL 470V/8      26  8000 32000   64     8    32  318
...
R Code
library(MASS); library(rpart); data(cpus); attach(cpus)

# Fit regression tree to data
cpus.rp <- rpart(log(perf) ~ ., cpus[, 2:8], cp = 0.001)

# Print and plot the complexity parameter (cp) table
printcp(cpus.rp); plotcp(cpus.rp)

# Prune and display tree
cpus.rp <- prune(cpus.rp, cp = 0.0055)
plot(cpus.rp, uniform = TRUE, main = "Regression Tree")
text(cpus.rp, digits = 3)

# Plot residuals vs. predicted values
plot(predict(cpus.rp), resid(cpus.rp)); abline(h = 0)
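Because the tree models log(perf), predictions come back on the log scale; a hedged sketch of predicting performance for the first few CPUs and back-transforming with exp() follows. The object cpus.rp is from the slide; everything else is illustrative.

library(MASS); library(rpart); data(cpus)

cpus.rp <- rpart(log(perf) ~ ., cpus[, 2:8], cp = 0.001)
cpus.rp <- prune(cpus.rp, cp = 0.0055)

# Predicted log(perf) for the first six processors, back-transformed to perf
log_pred <- predict(cpus.rp, newdata = cpus[1:6, 2:8])
data.frame(name           = cpus$name[1:6],
           predicted_perf = exp(log_pred),
           actual_perf    = cpus$perf[1:6])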
Determine the Best Complexity Parameter (cp) Value for the Model

[Figure: plotcp(cpus.rp) output — cross-validated relative error (X-val Relative Error, roughly 0.2 to 1.2) plotted against cp (from Inf down to 0.0012) and size of tree (1 to 17).]

printcp(cpus.rp) output; "rel error" is the resubstitution relative error (1 - R^2), "xerror" the cross-validated error, and "xstd" its standard deviation:

        CP  nsplit  rel error   xerror     xstd
 0.5492697       0    1.00000  1.00864  0.096838
 0.0893390       1    0.45073  0.47473  0.048229
 0.0876332       2    0.36139  0.46518  0.046758
 0.0328159       3    0.27376  0.33734  0.032876
 0.0269220       4    0.24094  0.32043  0.031560
 0.0185561       5    0.21402  0.30858  0.030180
 0.0167992       6    0.19546  0.28526  0.028031
 0.0157908       7    0.17866  0.27781  0.027608
 0.0094604       9    0.14708  0.27231  0.028788
 0.0054766      10    0.13762  0.25849  0.026970
 0.0052307      11    0.13215  0.24654  0.026298
 0.0043985      12    0.12692  0.24298  0.027173
 0.0022883      13    0.12252  0.24396  0.027023
 0.0022704      14    0.12023  0.24256  0.027062
 0.0014131      15    0.11796  0.24351  0.027246
 0.0010000      16    0.11655  0.24040  0.026926
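A hedged sketch of choosing cp from this table programmatically; the one-standard-error rule shown here is a common convention and an assumption on my part, not something the slides prescribe.

library(MASS); library(rpart); data(cpus)

cpus.rp <- rpart(log(perf) ~ ., cpus[, 2:8], cp = 0.001)
tab <- cpus.rp$cptable                     # same table as printcp(cpus.rp)

# Option 1: cp with the smallest cross-validated error
best <- which.min(tab[, "xerror"])
cp_min <- tab[best, "CP"]

# Option 2 (1-SE rule): smallest tree whose xerror is within one xstd
# of the minimum xerror
threshold <- tab[best, "xerror"] + tab[best, "xstd"]
cp_1se <- tab[which(tab[, "xerror"] <= threshold)[1], "CP"]

pruned <- prune(cpus.rp, cp = cp_1se)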
Regression Tree

[Figures: the regression tree before pruning (16 splits) and after pruning (cp = 0.0055). Both are rooted at the split cach < 27, with further splits on mmax, syct, cach, chmin, and chmax; leaf values are predicted log(perf), ranging from about 2.51 to 6.14.]
How well does it fit?

[Figure: plot of residuals, resid(cpus.rp), against fitted values, predict(cpus.rp), with a horizontal reference line at zero; residuals lie roughly between -0.5 and 1.0.]
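A brief sketch of quantifying the fit from the same objects; the pseudo-R^2 computed on the log scale is my own illustrative summary, not a number reported on the slides.

library(MASS); library(rpart); data(cpus)

cpus.rp <- rpart(log(perf) ~ ., cpus[, 2:8], cp = 0.001)
cpus.rp <- prune(cpus.rp, cp = 0.0055)

# Residual diagnostics on the log scale
r <- resid(cpus.rp)
summary(r)

# Pseudo-R^2 on the log scale: 1 - SS(residual) / SS(total)
y <- log(cpus$perf)
1 - sum(r^2) / sum((y - mean(y))^2)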
Summary

Advantages of C&RT
• Simplicity of results:
  – The interpretation of results summarized in a tree is very simple.
  – This simplicity is useful for rapid classification of new observations.
  – It is much easier to evaluate just one or two logical conditions.
• Tree methods are nonparametric and nonlinear:
  – There is no implicit assumption that the underlying relationships between the predictor variables and the dependent variable are linear or follow some specific nonlinear link function.
References

• Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S, 4th ed. Springer, pp. 251-266.
• StatSoft (2003) "Classification and Regression Trees", Electronic Textbook, StatSoft; retrieved on 11/8/2004 from http://www.statsoft.com/textbook/stcart.html
• Fisher, R. A. (1936) "The use of multiple measurements in taxonomic problems", Annals of Eugenics, 7, Part II, 179-188.
Using Trees in R (the 30-second version)

1) Load the rpart library:
   library(rpart)
2) For classification trees, make sure the response is of type factor. If you don't know how to do this, look up help(as.factor) or consult a general R reference.
   y = as.factor(y)
3) Fit the tree model:
   f = rpart(y ~ x1 + x2 + ..., data = ..., cp = 0.001)
   If using an unattached data frame, you must specify data=. If using global variables, data= can be omitted. A good starting value for cp, which controls the complexity of the tree, is the one shown (cp = 0.001).
4) Plot and check the model:
   plot(f, uniform = T, margin = 0.1); text(f, use.n = T)
   plotcp(f); printcp(f)
   Look at the xerror values in the printcp summary and choose the smallest number of splits that achieves the smallest xerror. Consider the trade-off between model fit and complexity (i.e. overfitting). Based on your judgement, repeat step 3 with the cp value of your choice.
5) Predict results:
   predict(f, newdata, type = "class")
   where newdata is a data frame containing the independent variables.