
Additive Models, Trees, and Related Models
Prof. Liqing Zhang
Dept. Computer Science & Engineering,
Shanghai Jiaotong University
Introduction
9.1: Generalized Additive Models
9.2: Tree-Based Methods
9.3: PRIM: Bump Hunting
9.4: MARS: Multivariate Adaptive Regression Splines
9.5: HME: Hierarchical Mixture of Experts
CART
• Overview
– Underlying principle: divide and conquer
– Partition the feature space into a set of rectangles
• For simplicity, use recursive binary partitions
– Fit a simple model (e.g. a constant) in each rectangle
– Classification and Regression Trees (CART)
• Regression Trees
• Classification Trees
– Popular in medical applications
CART
• An example (in the regression case; the slide's figure is not reproduced in this transcript):
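In place of the figure, a minimal sketch of the same idea, assuming scikit-learn; the data, depth, and query points are illustrative:

```python
# A regression tree partitions the feature space into rectangles via
# recursive binary splits and predicts a constant in each rectangle.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))                 # two features X1, X2
y = np.where(X[:, 0] < 0.5, 1.0, 3.0) + 0.1 * rng.standard_normal(200)

tree = DecisionTreeRegressor(max_depth=2)      # small tree: few rectangles
tree.fit(X, y)
print(tree.predict([[0.2, 0.7], [0.8, 0.3]]))  # one constant per rectangle
```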
Basic Issues in Tree-based Methods
• How to grow a tree?
• How large should we grow the tree?
Regression Trees
• Partition the space into M regions R1, R2, ..., RM and fit a constant in each:

  $f(x) = \sum_{m=1}^{M} c_m I(x \in R_m)$

  where $\hat{c}_m = \mathrm{average}(y_i \mid x_i \in R_m)$
Note that this is still an additive model
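A minimal sketch of this model evaluated directly; the rectangle bounds and helper names are illustrative:

```python
import numpy as np

def region_means(X, y, regions):
    """c_m = mean of the responses y_i whose x_i fall in rectangle R_m."""
    return [y[np.all((X >= low) & (X < high), axis=1)].mean()
            for low, high in regions]

def predict_one(x, regions, cs):
    """f(x) = c_m for the rectangle R_m that contains x."""
    for (low, high), c in zip(regions, cs):
        if np.all((x >= low) & (x < high)):
            return c
    return float("nan")  # x lies outside every rectangle

# two rectangles splitting [0, 1]^2 at x1 = 0.5 (illustrative)
regions = [(np.array([0.0, 0.0]), np.array([0.5, 1.0])),
           (np.array([0.5, 0.0]), np.array([1.0, 1.0]))]
```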
Regression Trees - Grow the Tree
• The best partition minimizes the sum of squared errors:

  $\sum_{i=1}^{N} (y_i - f(x_i))^2$
• Finding the global minimum is computationally infeasible
• Greedy algorithm: at each level choose the variable j and split value s as:
  $\arg\min_{j,s} \Big[ \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2 \Big]$
• The greedy algorithm makes the tree unstable
– Errors made at the upper levels are propagated to the lower levels
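A minimal sketch of one greedy split: scan every variable j and candidate split point s, and keep the pair minimizing the two-sided sum of squared errors (the function name and exhaustive scan are illustrative):

```python
import numpy as np

def best_split(X, y):
    """Return (j, s, sse): the greedy choice of variable and split point."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] <= s], y[X[:, j] > s]   # R1(j,s), R2(j,s)
            if len(left) == 0 or len(right) == 0:
                continue
            # inner minimizations: c1, c2 are just the region means
            sse = ((left - left.mean()) ** 2).sum() + \
                  ((right - right.mean()) ** 2).sum()
            if sse < best[2]:
                best = (j, s, sse)
    return best
```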
Regression Tree – how large?
• Trade-off between bias and variance
– Very large tree: overfit (low bias, high variance)
– Small tree (low variance, high bias): might not
capture the structure
• Strategies:
– 1: split only when we can decrease the error
(usually short-sighted)
– 2: Cost-complexity pruning (preferred)
Regression Tree - Pruning
• Cost-complexity pruning:
– Pruning: collapsing some internal nodes
– Cost complexity:

  $C_\alpha(T) = \sum_{m=1}^{|T|} N_m Q_m(T) + \alpha |T|$

  The first term is the cost (sum of squared errors in the leaves); the second term penalizes the complexity/size of the tree.
– Choose the best alpha via weakest-link pruning:
• Repeatedly collapse the internal node that adds the smallest increase in error
• Choose the best tree from the resulting sequence by cross-validation
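A minimal sketch, assuming scikit-learn (whose cost_complexity_pruning_path computes the weakest-link alpha sequence); the data are illustrative:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 3))
y = X[:, 0] + 0.1 * rng.standard_normal(300)

# alpha sequence produced by weakest-link pruning of the full tree
path = DecisionTreeRegressor().cost_complexity_pruning_path(X, y)
# pick the alpha whose pruned tree cross-validates best
scores = [cross_val_score(DecisionTreeRegressor(ccp_alpha=a), X, y).mean()
          for a in path.ccp_alphas]
print(path.ccp_alphas[int(np.argmax(scores))])
```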
Classification Trees
• Classify the observations in node m to the majority class in the node:

  $k(m) = \arg\max_k \hat{p}_{mk}$

– $\hat{p}_{mk}$ is the proportion of observations of class k in node m
• Define impurity measures for a node:
– Misclassification error:

  $\frac{1}{N_m} \sum_{x_i \in R_m} I(y_i \neq k(m)) = 1 - \hat{p}_{m,k(m)}$

– Entropy:

  $-\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}$

– Gini index:

  $\sum_{k=1}^{K} \hat{p}_{mk} (1 - \hat{p}_{mk})$
Classification Trees
(Figure: node impurity measures versus class proportion for a 2-class problem.)
• Entropy and Gini are more sensitive to changes in the node probabilities
• To grow the tree: use entropy or Gini
• To prune the tree: use the misclassification rate (or any other measure); a sketch of the three measures follows
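A minimal sketch of the three measures, computed from a node's class proportions (the function name is illustrative):

```python
import numpy as np

def impurities(p):
    """p: class proportions p_mk of one node; returns the three measures."""
    p = np.asarray(p, dtype=float)
    misclass = 1.0 - p.max()                        # 1 - p_{m,k(m)}
    entropy = -np.sum(p[p > 0] * np.log(p[p > 0]))  # -sum p log p
    gini = np.sum(p * (1.0 - p))                    # sum p (1 - p)
    return misclass, entropy, gini

print(impurities([0.5, 0.5]))  # maximally impure 2-class node
```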
Tree-based Methods: Discussions
• Categorical Predictors
– Problem: splitting a node t into tL and tR on a categorical predictor x with q possible values allows $2^{q-1} - 1$ partitions!
– Remedy: treat the categorical predictor as ordered by, say, the proportion of class 1 (see the sketch below)
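A minimal sketch of that ordering trick for a binary outcome; names and data are illustrative:

```python
import numpy as np

def order_categories(x_cat, y):
    """Order the q categories by their proportion of class 1; the q - 1
    prefixes of this ordering are then the only splits to check."""
    cats = np.unique(x_cat)
    p1 = [np.mean(y[x_cat == c]) for c in cats]
    return cats[np.argsort(p1)]

x = np.array(["a", "b", "c", "a", "b", "c"])
y = np.array([0, 1, 1, 0, 0, 1])
print(order_categories(x, y))
```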
Tree-based Methods: Discussions
• Linear Combination Splits
– Split the node based on $\sum_j a_j X_j \leq s$
– Improve the predictive power
– Hurt interpretability
• Instability of Trees
– Inherited from the hierarchical nature
– Bagging (section 8.7 of [HTF]) can reduce the
variance
Bootstrap Trees
Construct B trees from B bootstrap samples ("bootstrap trees")
Bagging The Bootstrap Trees
  $\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x)$

where $\hat{f}^{*b}(x)$ is computed from the b-th bootstrap sample (in this case, a tree)
Bagging reduces the variance of the original tree by aggregation
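A minimal sketch of bagging regression trees, assuming scikit-learn for the base trees; B and the data are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = np.sin(4 * X[:, 0]) + 0.2 * rng.standard_normal(200)

B, preds = 50, []
for b in range(B):
    idx = rng.integers(0, len(X), size=len(X))  # bootstrap sample (with replacement)
    preds.append(DecisionTreeRegressor().fit(X[idx], y[idx]).predict(X))
f_bag = np.mean(preds, axis=0)                  # (1/B) * sum_b f^{*b}(x)
```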
Bagged Tree Performance
(Figure: results for the spam example, comparing the majority-vote and averaging versions of the bagged classifier.)
9.3 PRIM: Bump Hunting
• The patient rule induction method (PRIM) finds boxes in the feature space in which the response average is high. Hence it looks for maxima in the target function, an exercise known as bump hunting.
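A minimal sketch of PRIM's top-down peeling step, assuming the usual peel fraction alpha; parameter names and the stopping rule are illustrative:

```python
import numpy as np

def prim_peel(X, y, alpha=0.1, min_support=10):
    """Shrink the box by repeatedly peeling the fraction alpha of points
    off whichever edge most raises the mean response inside the box."""
    inside = np.ones(len(y), dtype=bool)
    while inside.sum() > min_support:
        best_mask, best_mean = None, y[inside].mean()
        for j in range(X.shape[1]):
            for q in (alpha, 1 - alpha):    # peel the low or the high edge
                cut = np.quantile(X[inside, j], q)
                trial = inside & ((X[:, j] > cut) if q == alpha else (X[:, j] < cut))
                if trial.sum() >= min_support and y[trial].mean() > best_mean:
                    best_mask, best_mean = trial, y[trial].mean()
        if best_mask is None:
            break                           # no peel raises the box mean
        inside = best_mask
    return inside                           # indicator of the final box
```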
9.4 MARS: Multivariate Adaptive Regression Splines
• In multi-dimensional splines the number of basis functions grows exponentially: the curse of dimensionality
• A partial remedy is a greedy forward search algorithm
– Create a simple basis-construction dictionary
– Construct basis functions on-the-fly
– Choose the best-fit basis function at each step
Basis functions
• 1-dim linear spline: the reflected pair $(X - t)_+$ and $(t - X)_+$, where t represents the knot
• Basis collection C, with knots at every observed value of every input:
|C| = 2 * N * p
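A minimal sketch of one reflected pair (the function name is illustrative):

```python
import numpy as np

def reflected_pair(x, t):
    """The pair of piecewise linear bases with a knot at t."""
    return np.maximum(x - t, 0.0), np.maximum(t - x, 0.0)

print(reflected_pair(np.linspace(0, 1, 5), t=0.5))
```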
MARS(cont.)
• The idea of MARS is to form reflected pairs for each input Xj with knots at each observed value xij of that input. Therefore, the collection of basis functions is:

  $C = \{ (X_j - t)_+, (t - X_j)_+ \}, \quad t \in \{x_{1j}, \ldots, x_{Nj}\}, \; j = 1, \ldots, p$

• The model has the form:

  $f(X) = \beta_0 + \sum_{m=1}^{M} \beta_m h_m(X)$

  where each $h_m(X)$ is a function in C, or a product of two or more such functions.
The MARS procedure (1st stage)
1. Initialize the basis set M with a constant function
2. Form candidates (cross-products of M with the set C)
3. Add the best-fit basis pair (the one that decreases the residual error the most) to M
4. Repeat from step 2 (until e.g. |M| >= threshold)

(Diagram: candidate products of M (old) with C yield M (new); a sketch of one forward step follows.)
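A minimal sketch of a single forward step under these rules, fitting each candidate by least squares; it omits [HTF]'s restriction that an input may appear at most once per product, and all names are illustrative:

```python
import numpy as np

def forward_step(H, X, y):
    """H: current basis matrix (N x |M|). Adds the best reflected pair."""
    best_sse, best_cols = np.inf, None
    for m in range(H.shape[1]):                 # existing basis function h_m
        for j in range(X.shape[1]):             # input X_j
            for t in np.unique(X[:, j]):        # knot at an observed value
                pos = H[:, m] * np.maximum(X[:, j] - t, 0.0)
                neg = H[:, m] * np.maximum(t - X[:, j], 0.0)
                Hc = np.column_stack([H, pos, neg])
                beta, *_ = np.linalg.lstsq(Hc, y, rcond=None)
                sse = ((y - Hc @ beta) ** 2).sum()
                if sse < best_sse:
                    best_sse, best_cols = sse, (pos, neg)
    return np.column_stack([H, *best_cols])

# start from the constant basis h_0(x) = 1, then grow:
# H = np.ones((len(y), 1)); H = forward_step(H, X, y)
```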
MARS(cont.)
• We start with only the constant function h0(x) = 1 in the model set M, and all functions in the set C are candidate functions. At each stage we add to the model set M the term of the form

  $\hat{\beta}_{M+1} h_\ell(X)(X_j - t)_+ + \hat{\beta}_{M+2} h_\ell(X)(t - X_j)_+, \quad h_\ell \in M$

that produces the largest decrease in training error. The process is continued until the model set M contains some preset maximum number of terms.
The MARS procedure (2nd stage)
The final model M typically overfits the data
=> Need to reduce the model size (# of terms)
Backward deletion procedure:
1. Remove the term that causes the smallest increase in residual error
2. Compute the GCV criterion (below)
3. Repeat from step 1
Choose the model size with minimum GCV.
Generalized Cross Validation (GCV)

  $GCV(\lambda) = \frac{\sum_{i=1}^{N} (y_i - \hat{f}_\lambda(x_i))^2}{(1 - M(\lambda)/N)^2}$

• $M(\lambda) = r + cK$ measures the effective # of parameters:
– r: # of linearly independent basis functions
– K: # of knots selected
– c = 3
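A minimal sketch of this criterion (the function name is illustrative):

```python
import numpy as np

def gcv(y, y_hat, r, K, c=3.0):
    """GCV = RSS / (1 - M(lambda)/N)^2, with M(lambda) = r + c*K."""
    N = len(y)
    M = r + c * K                     # effective number of parameters
    rss = ((np.asarray(y) - np.asarray(y_hat)) ** 2).sum()
    return rss / (1.0 - M / N) ** 2
```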
Discussion
• Piecewise linear reflected basis
– Allows operation on a local region
– Fitting N reflected basis pairs takes O(N) instead of O(N^2)
• The left part is zero, and the right parts of successive bases differ by a constant slope, so the fit can be updated incrementally as the knot moves
(Figure: reflected bases with knots at successive observed values X[i-1], X[i], X[i+1], X[i+2].)
MARS & CART relationship
IF we
– replace the piecewise linear basis functions by step functions, and
– keep only the newly formed product terms in M (the leaf nodes of a binary tree)
THEN
the MARS forward procedure = the CART tree-growing procedure