
Comp 540
Chapter 9: Additive Models, Trees, and Related Methods
Ryan King
Overview
9.1 Generalized Additive Models
9.2 Tree-based Methods (CART)
9.4 MARS
9.6 Missing Data
9.7 Computational Considerations
Generalized Additive Models
Generally have the form:
$E(Y \mid X_1, X_2, \ldots, X_p) = \alpha + f_1(X_1) + \cdots + f_p(X_p)$
Example: logistic regression
$\log\left(\frac{\mu(X)}{1 - \mu(X)}\right) = \alpha + \beta_1 X_1 + \cdots + \beta_p X_p$
becomes additive logistic regression:
$\log\left(\frac{\mu(X)}{1 - \mu(X)}\right) = \alpha + f_1(X_1) + \cdots + f_p(X_p)$
Link Functions
• The conditional mean $\mu(X)$ is related to an additive function of the predictors via a link function $g$:
  $g[\mu(X)] = \alpha + f_1(X_1) + \cdots + f_p(X_p)$
• Identity: $g(\mu) = \mu$ (Gaussian)
• Logit: $g(\mu) = \log\left(\frac{\mu}{1 - \mu}\right)$ (binomial)
• Log: $g(\mu) = \log(\mu)$ (Poisson)
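The three links can be written out directly. The following is a minimal sketch assuming NumPy; the function names are illustrative and do not come from the slides.

```python
import numpy as np

# Each link g maps the conditional mean mu onto the additive-predictor scale.
def identity_link(mu):
    return np.asarray(mu, dtype=float)

def logit_link(mu):
    mu = np.asarray(mu, dtype=float)
    return np.log(mu / (1.0 - mu))

def log_link(mu):
    return np.log(np.asarray(mu, dtype=float))

# Inverse of the logit link: maps an additive predictor eta back to a probability.
def inverse_logit(eta):
    return 1.0 / (1.0 + np.exp(-np.asarray(eta, dtype=float)))

mu = np.array([0.1, 0.5, 0.9])
eta = logit_link(mu)   # additive-predictor scale for a binomial response
```

Applying `inverse_logit` to `eta` recovers the original means, which is how fitted values on the additive scale are turned back into probabilities.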
9.1.1 Fitting Additive Models
• Ex: Additive Cubic Splines
  $Y = \alpha + \sum_{j=1}^{p} f_j(X_j) + \varepsilon$
• Penalized Sum of Squares Criterion (PRSS)
  $\mathrm{PRSS}(\alpha, f_1, f_2, \ldots, f_p) = \sum_{i=1}^{N} \left( y_i - \alpha - \sum_{j=1}^{p} f_j(x_{ij}) \right)^2 + \sum_{j=1}^{p} \lambda_j \int f_j''(t_j)^2 \, dt_j$
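9.1 Backfitting Algorithm
1. Initialize: $\hat\alpha = \frac{1}{N} \sum_{i=1}^{N} y_i$, $\hat f_j \equiv 0$, $\forall i, j$
2. Cycle: $j = 1, 2, \ldots, p, 1, 2, \ldots, p, \ldots$
   $\hat f_j \leftarrow S_j\left[ \left\{ y_i - \hat\alpha - \sum_{k \ne j} \hat f_k(x_{ik}) \right\}_1^N \right]$
   $\hat f_j \leftarrow \hat f_j - \frac{1}{N} \sum_{i=1}^{N} \hat f_j(x_{ij})$
   Until the functions $\hat f_j$ change less than a threshold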
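The backfitting cycle can be sketched in a few lines of Python. This is an illustration under assumptions, not the slides' implementation: the smoother $S_j$ here is a Gaussian kernel (Nadaraya-Watson) smoother standing in for a cubic smoothing spline, and the bandwidth, tolerance, and simulated data are all made up.

```python
import numpy as np

def kernel_smoother(x, r, bandwidth=0.3):
    """Stand-in for S_j: Nadaraya-Watson smooth of the values r against x."""
    d = (x[:, None] - x[None, :]) / bandwidth
    w = np.exp(-0.5 * d ** 2)
    return (w @ r) / w.sum(axis=1)

def backfit(X, y, n_cycles=50, tol=1e-6):
    N, p = X.shape
    alpha = y.mean()                 # step 1: alpha_hat is the mean of y
    f = np.zeros((N, p))             # f_hat_j = 0, stored at the points x_ij
    for _ in range(n_cycles):        # step 2: cycle j = 1..p, 1..p, ...
        f_old = f.copy()
        for j in range(p):
            # partial residual: y_i - alpha_hat - sum over k != j of f_hat_k(x_ik)
            partial = y - alpha - f.sum(axis=1) + f[:, j]
            f[:, j] = kernel_smoother(X[:, j], partial)
            f[:, j] -= f[:, j].mean()            # re-center f_hat_j
        if np.max(np.abs(f - f_old)) < tol:      # stop when the functions settle
            break
    return alpha, f

# Simulated additive data (hypothetical): y = sin(x1) + x2^2 + noise
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)
alpha, f = backfit(X, y)
fitted = alpha + f.sum(axis=1)
```

Each column of `f` holds one fitted function evaluated at the training points, so plotting `f[:, j]` against `X[:, j]` gives the usual per-predictor view of an additive fit.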
9.1.3 Summary
• Additive models extend linear models
• Flexible, but still interpretable
• Simple, modular, backfitting procedure
• Limitations for large data-mining applications
9.2 Tree-Based Methods
• Partition the feature space
• Fit a simple model (constant) in each partition
• Simple, but powerful
• CART: Classification and Regression Trees, Breiman et al., 1984
9.2 Binary Recursive Partitions
[Figure: binary recursive partition of the (x1, x2) plane, with regions labeled a through f]
9.2 Regression Trees
• CART is a top-down (divisive) greedy procedure
• Partitioning is a local decision for each node
• A partition on variable j at value s creates regions:
  $R_1(j, s) = \{X \mid X_j \le s\}$ and $R_2(j, s) = \{X \mid X_j > s\}$
9.2 Regression Trees
• Each node chooses j, s to solve:
  $\min_{j, s} \left[ \min_{c_1} \sum_{x_i \in R_1(j, s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j, s)} (y_i - c_2)^2 \right]$
• For any choice j, s the inner minimization is solved by: $\hat c_m = \mathrm{ave}(y_i \mid x_i \in R_m)$
• Easy to scan through all choices of j, s to find the optimal split
• After the split, recur on $R_1$ and $R_2$
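The split search at a single node can be sketched as below. This is a brute-force illustration assuming NumPy; candidate thresholds are taken as midpoints between consecutive observed values, which is one common choice and not something the slides specify. A production implementation would sort each predictor once and update running sums instead of recomputing the means.

```python
import numpy as np

def best_split(X, y):
    """Return (j, s, sse) minimizing the two-region sum of squares."""
    N, p = X.shape
    best = (None, None, np.inf)
    for j in range(p):
        values = np.unique(X[:, j])
        # Candidate thresholds: midpoints between consecutive observed values.
        for s in (values[:-1] + values[1:]) / 2.0:
            left = y[X[:, j] <= s]       # responses in R1(j, s)
            right = y[X[:, j] > s]       # responses in R2(j, s)
            # Inner minimization: c1 and c2 are the region means.
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best[2]:
                best = (j, s, sse)
    return best

# Toy data (hypothetical): the response jumps when the first predictor passes 2.5
X = np.array([[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]])
y = np.array([0.0, 0.0, 10.0, 10.0])
j, s, sse = best_split(X, y)   # j = 0, s = 2.5, sse = 0.0
```

Growing a tree then amounts to calling `best_split` again on the rows falling in each of the two resulting regions.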
9.2 Cost-Complexity Pruning
• How large do we grow the tree? Which nodes should we keep?
• Grow the tree out to a fixed depth, then prune back based on the cost-complexity criterion.
9.2 Terminology
• A subtree: $T \subset T_0$ implies T is a pruned version of $T_0$
• Tree T has M leaf nodes, each indexed by m
• Leaf node m maps to region $R_m$
• $|T|$ denotes the number of leaf nodes of T
• $N_m$ is the number of data points in region $R_m$
9.2 Cost-Complexity Pruning
$\hat c_m = \frac{1}{N_m} \sum_{x_i \in R_m} y_i$
$Q_m(T) = \frac{1}{N_m} \sum_{x_i \in R_m} (y_i - \hat c_m)^2$
• We define the cost complexity criterion:
  $C_\alpha(T) = \sum_{m=1}^{|T|} N_m Q_m(T) + \alpha |T|$
• For each $\alpha \ge 0$, find $T_\alpha \subseteq T_0$ to minimize $C_\alpha(T)$
• Choose $\alpha$ by cross-validation
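To see how $\alpha$ trades fit against tree size, the criterion can be evaluated for a tree and one of its pruned subtrees. The sketch below assumes NumPy and represents a tree only by the responses falling in each leaf; the data are made up for illustration.

```python
import numpy as np

def cost_complexity(leaves, alpha):
    """C_alpha(T) = sum_m N_m * Q_m(T) + alpha * |T| for a regression tree.

    `leaves` is a list of 1-D arrays, one per leaf, holding the y_i in R_m.
    """
    total = 0.0
    for y_m in leaves:
        c_hat = y_m.mean()                     # c_hat_m
        q_m = np.mean((y_m - c_hat) ** 2)      # Q_m(T)
        total += len(y_m) * q_m                # N_m * Q_m(T)
    return total + alpha * len(leaves)

# Hypothetical data: a 3-leaf tree T0 and the 2-leaf subtree obtained by
# collapsing its last two leaves into their parent.
full_tree = [np.array([1.0, 1.0]), np.array([4.0, 6.0]), np.array([5.0, 7.0])]
pruned_tree = [full_tree[0], np.concatenate(full_tree[1:])]
```

With these numbers the full tree costs 4 + 3α and the pruned tree 5 + 2α, so the full tree is preferred for α below 1 and the pruned tree for α above 1.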
9.2 Classification Trees
• $\hat p_{mk} = \frac{1}{N_m} \sum_{x_i \in R_m} I(y_i = k)$
  $k(m) = \arg\max_k \hat p_{mk}$
• We define the same cost complexity criterion:
  $C_\alpha(T) = \sum_{m=1}^{|T|} N_m Q_m(T) + \alpha |T|$
• But choose a different measure of node impurity $Q_m(T)$
9.2 Impurity Measures $Q_m(T)$
1. Misclassification Error
   $Q_m(T) = \frac{1}{N_m} \sum_{x_i \in R_m} I(y_i \ne k(m)) = 1 - \hat p_{m k(m)}$
2. Gini index
   $Q_m(T) = \sum_{k \ne k'} \hat p_{mk} \hat p_{mk'} = \sum_{k=1}^{K} \hat p_{mk} (1 - \hat p_{mk})$
3. Cross-entropy
   $Q_m(T) = -\sum_{k=1}^{K} \hat p_{mk} \log \hat p_{mk}$
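The three impurity measures are short enough to write out. The sketch below assumes NumPy and takes the vector of class proportions $\hat p_{mk}$ for one node; the function names are illustrative.

```python
import numpy as np

def misclassification(p):
    """1 - p_hat for the majority class k(m)."""
    return float(1.0 - np.max(p))

def gini(p):
    """Sum over classes of p_hat_mk * (1 - p_hat_mk)."""
    return float(np.sum(p * (1.0 - p)))

def cross_entropy(p):
    """Minus the sum of p_hat_mk * log(p_hat_mk), treating 0 * log(0) as 0."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

mixed = np.array([0.5, 0.5])   # maximally impure two-class node
pure = np.array([1.0, 0.0])    # pure node
```

All three measures are zero on the pure node and largest on the evenly mixed node, where they take the values 0.5, 0.5 and log 2 respectively.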
9.2 Categorical Predictors
1. How do we handle categorical variables?
2. In general, $2^{q-1} - 1$ possible partitions of q values into two groups
3. Trick for 0-1 case: sort the predictor classes by proportion falling in outcome class 1, then partition as normal
   $X_1 \in \{a, b, c, d\}$
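The ordering trick for a 0-1 outcome can be sketched as follows, assuming NumPy. The data and the helper name `order_categories` are hypothetical.

```python
import numpy as np

def order_categories(x, y):
    """Order the categories of x by the proportion of y == 1 within each."""
    cats = sorted(set(x))
    rate = {c: float(np.mean(y[x == c])) for c in cats}
    return sorted(cats, key=lambda c: rate[c]), rate

# Hypothetical predictor with q = 4 categories and a 0-1 outcome
x = np.array(["a", "a", "b", "b", "c", "c", "d", "d"])
y = np.array([1, 1, 0, 0, 1, 0, 0, 0])
ordered, rate = order_categories(x, y)

# Treat the predictor as ordered: only the q - 1 cuts along this ordering
# are scanned, instead of all 2^(q-1) - 1 two-group partitions.
candidate_splits = [(ordered[:i], ordered[i:]) for i in range(1, len(ordered))]
```

Here the four categories give 3 candidate splits to scan rather than 7.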
9.2 CART Example
1. Examples…
9.3 PRIM-Bump Hunting
• Partition based, but not tree-based
• Seeks boxes where the response average is high
• Top-down algorithm
Patient Rule Induction Method
1. Start with all data, and maximal box
2. Shrink the box by compressing one face, to peel off a fraction alpha of the observations. Choose the peeling that produces the highest response mean.
3. Repeat step 2 until some minimal number of observations remain
4. Expand the box along any face, as long as the resulting box mean increases.
5. Steps 1-4 give a sequence of boxes; use cross-validation to choose a member of the sequence, and call that box B1
6. Remove B1 from the dataset, and repeat the process to find another box, as desired.
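The peeling phase (steps 1 to 3) can be sketched as below. This is only an illustration assuming NumPy and quantitative predictors; the expansion step, the cross-validated choice of box, and the removal of B1 are omitted, and the function names, `min_obs` value, and simulated data are made up.

```python
import numpy as np

def peel_once(X, y, inside, alpha=0.1):
    """Try peeling a fraction alpha off each face; keep the peel with the highest mean."""
    best_mean, best_inside = -np.inf, None
    for j in range(X.shape[1]):
        xj = X[inside, j]
        lo, hi = np.quantile(xj, alpha), np.quantile(xj, 1.0 - alpha)
        # Two candidate peels per predictor: drop the lower face or the upper face.
        for keep in (inside & (X[:, j] > lo), inside & (X[:, j] < hi)):
            if 0 < keep.sum() < inside.sum() and y[keep].mean() > best_mean:
                best_mean, best_inside = y[keep].mean(), keep
    return best_inside

def prim_peel(X, y, alpha=0.1, min_obs=20):
    inside = np.ones(len(y), dtype=bool)      # step 1: the maximal box holds all data
    boxes = [inside]
    while inside.sum() > min_obs:             # step 3: repeat until few points remain
        nxt = peel_once(X, y, inside, alpha)  # step 2: one peel
        if nxt is None:
            break
        inside = nxt
        boxes.append(inside)
    return boxes

# Simulated data (hypothetical): the response is high in the upper-right quadrant
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 2))
y = ((X[:, 0] > 0.5) & (X[:, 1] > 0.5)).astype(float) + rng.normal(scale=0.1, size=500)
boxes = prim_peel(X, y)
sizes = [int(b.sum()) for b in boxes]
```

Each entry of `boxes` is a boolean mask over the observations, so the sequence of box means is `[y[b].mean() for b in boxes]`. Because each peel removes only a fraction alpha of the remaining points, the sequence shrinks slowly.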
9.3 PRIM Summary
• Can handle categorical predictors, as CART does
• Designed for regression; two-class classification can be handled by coding the outcome as 0-1
• Non-trivial to deal with k > 2 classes
• More patient than CART
9.4 Multivariate Adaptive Regression Splines (MARS)
• Generalization of stepwise linear regression
• Modification of CART to improve regression performance
• Able to capture additive structure
• Not tree-based
9.4 MARS Continued
• Additive model with an adaptive set of basis functions
• Basis built up from simple piecewise linear functions: the reflected pair $(x - t)_+$ and $(t - x)_+$ with knot at t
• Set "C" represents the candidate set of linear splines, with "knees" at each data point Xi. Models are built with elements from C or their products.
  $C = \{ (X_j - t)_+, (t - X_j)_+ \}$, for $t \in \{x_{1j}, x_{2j}, \ldots, x_{Nj}\}$ and $j = 1, 2, \ldots, p$
9.4 MARS Procedure
Model has form:
$f(X) = \beta_0 + \sum_{m=1}^{M} \beta_m h_m(X)$
1. Given a choice for the $h_m$, the coefficients $\beta_m$ are chosen by standard linear regression.
2. Start with $h_0(X) = 1$. All functions in C are candidate functions.
3. At each stage, consider as a new basis function pair all products of a function $h_m$ in the model set $\mathcal{M}$ with one of the reflected pairs in C:
   $h_m(X) \cdot (X_j - t)_+$ and $h_m(X) \cdot (t - X_j)_+$
4. We add to the model terms of the form:
   $\hat\beta_{M+1} h_l(X) \cdot (X_j - t)_+ + \hat\beta_{M+2} h_l(X) \cdot (t - X_j)_+, \quad h_l \in \mathcal{M}$
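One forward step of this procedure can be sketched as follows. The sketch assumes NumPy and scores each candidate pair by its residual sum of squares after a least-squares refit, which is an assumption about the selection rule. It does not enforce the restriction that each input appears at most once per product, and the simulated data are made up.

```python
import numpy as np

def hinge_pair(x, t):
    """Reflected pair (x - t)_+ and (t - x)_+ with knot t."""
    return np.maximum(x - t, 0.0), np.maximum(t - x, 0.0)

def forward_step(X, y, basis):
    """Add the product pair h_l(X)(X_j - t)_+, h_l(X)(t - X_j)_+ with the lowest RSS."""
    best_rss, best_basis = np.inf, None
    for h in basis:                                  # h_l in the current model set
        for j in range(X.shape[1]):
            for t in np.unique(X[:, j]):             # knots at observed values x_ij
                pos, neg = hinge_pair(X[:, j], t)
                cand = basis + [h * pos, h * neg]
                H = np.column_stack(cand)
                # Coefficients by standard least squares for this candidate model.
                beta, *_ = np.linalg.lstsq(H, y, rcond=None)
                rss = float(np.sum((y - H @ beta) ** 2))
                if rss < best_rss:
                    best_rss, best_basis = rss, cand
    return best_basis, best_rss

# Simulated data (hypothetical): a single hinge in the first predictor
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(60, 2))
y = 2.0 * np.maximum(X[:, 0] - 0.2, 0.0) + rng.normal(scale=0.05, size=60)
basis = [np.ones(len(y))]                            # h_0(X) = 1
basis, rss = forward_step(X, y, basis)
```

Calling `forward_step` repeatedly on its own output grows the model two terms at a time, with later steps able to multiply new hinges into terms that are already in the model.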
9.4 Choosing Number of Terms
• Large models can overfit.
• Backward deletion procedure: delete the terms which cause the smallest increase in residual squared error, to give a sequence of models.
• Pick the model using Generalized Cross Validation:
  $\mathrm{GCV}(\lambda) = \frac{\sum_{i=1}^{N} (y_i - \hat f_\lambda(x_i))^2}{(1 - M(\lambda)/N)^2}$
• $M(\lambda)$ is the effective number of parameters in the model: $M(\lambda) = r + cK$, where c = 3, r is the number of basis functions, and K is the number of knots
• Choose the model which minimizes $\mathrm{GCV}(\lambda)$
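The GCV formula is simple to compute once the fitted values, r, and K are known. The sketch below assumes NumPy; the data are made up so that the arithmetic is easy to follow.

```python
import numpy as np

def effective_params(r, K, c=3.0):
    """M(lambda) = r + c * K."""
    return r + c * K

def gcv(y, fitted, r, K, c=3.0):
    """Residual sum of squares divided by (1 - M(lambda) / N)^2."""
    N = len(y)
    rss = np.sum((y - fitted) ** 2)
    return float(rss / (1.0 - effective_params(r, K, c) / N) ** 2)

# Hypothetical fit: N = 20 observations, every residual equal to 0.5, so RSS = 5
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0] * 2)
fitted = y + 0.5

small_model = gcv(y, fitted, r=3, K=1)   # M = 6,  GCV = 5 / 0.7^2
large_model = gcv(y, fitted, r=5, K=2)   # M = 11, GCV = 5 / 0.45^2
```

With the same residual sum of squares, the larger model gets the larger GCV score, which is the penalty that steers backward deletion toward smaller models.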
9.4 MARS Summary
• Basis functions operate locally
• Forward modeling is hierarchical: multiway products are built up only from existing terms
• Each input appears only once in each product
• A useful option is to set a limit on the order of interaction. A limit of two allows only pairwise products. A limit of one results in an additive model
9.5 Hierarchical Mixture of Experts (HME)
• Variant of tree-based methods
• Soft splits, not hard decisions
• At each node, an observation goes left or right with probability depending on its input values
• Smooth parameter optimization, instead of discrete split point search
9.5 HMEs continued
• Linear (or logistic) regression model fit at each leaf node (Expert)
• Splits can be multi-way, instead of binary
• Splits are probabilistic functions of linear combinations of inputs (gating network), rather than functions of single inputs
• Formally a mixture model
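A soft split can be illustrated with the smallest possible case: one gating node and two linear experts. The sketch below assumes NumPy and a logistic gating function; the parameters are chosen by hand for illustration and are not fitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hme_predict(X, gate_w, expert_w):
    """One-level, two-expert mixture: a soft split followed by linear experts."""
    Xb = np.column_stack([np.ones(len(X)), X])   # add an intercept column
    g_left = sigmoid(Xb @ gate_w)                # gating network: P(go left | x)
    experts = Xb @ expert_w                      # shape (N, 2), one column per expert
    pred = g_left * experts[:, 0] + (1.0 - g_left) * experts[:, 1]
    return pred, g_left

# Hypothetical parameters, chosen by hand rather than fitted.
gate_w = np.array([0.0, 4.0])                    # gate intercept and slope
expert_w = np.array([[1.0, -1.0],                # intercepts of expert 0 and expert 1
                     [2.0, 0.5]])                # slopes of expert 0 and expert 1
X = np.array([[-2.0], [0.0], [2.0]])
pred, g_left = hme_predict(X, gate_w, expert_w)
```

At x = 0 the gate weights both experts equally, while at x = 2 and x = -2 the prediction is almost entirely that of one expert. Because the gate is a smooth function of its parameters, the split can be tuned by continuous optimization.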
9.6 Missing Data
• Quite common to have data with missing values for one or more input features
• Missing values may or may not distort the data
• For response vector y and input matrix X, let Xobs denote the observed entries of X, Z = (X, y), and Zobs = (Xobs, y); R is an indicator matrix for missing values
9.6 Missing Data
• Missing at Random (MAR):
  $P(R \mid Z, \theta) = P(R \mid Z_{obs}, \theta)$
• Missing Completely at Random (MCAR):
  $P(R \mid Z, \theta) = P(R \mid \theta)$
• MCAR is a stronger assumption
9.6 Dealing with Missing Data
Three approaches for handling MCAR data:
1. Discard observations with missing features
2. Rely on the learning algorithm to deal with missing values in its training phase
3. Impute all the missing values before training
9.6 Dealing…MCAR
• If few values are missing, (1) may work
• For (2), CART can work well with missing values via surrogate splits. Additive models can assume average values.
• (3) is necessary for most algorithms. Simplest tactic is to use the mean or median.
• If features are correlated, can build predictive models for missing features in terms of known features
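Both imputation tactics can be sketched briefly, assuming NumPy with missing entries coded as NaN. The regression version uses a straight-line fit on a single fully observed predictor, which is only one simple choice of predictive model; the function names and data are hypothetical.

```python
import numpy as np

def impute_mean(X):
    """Replace each NaN with the mean of the observed values in its column."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

def impute_by_regression(X, target, predictor):
    """Fill NaNs in column `target` from a least-squares line on column `predictor`.

    Assumes `predictor` is observed wherever `target` is missing.
    """
    X = X.copy()
    missing = np.isnan(X[:, target])
    obs = ~missing & ~np.isnan(X[:, predictor])
    slope, intercept = np.polyfit(X[obs, predictor], X[obs, target], deg=1)
    X[missing, target] = intercept + slope * X[missing, predictor]
    return X

# Hypothetical data: the second feature is exactly twice the first, with one gap
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, np.nan], [4.0, 8.0]])
X_mean = impute_mean(X)                                   # fills the gap with 14/3
X_reg = impute_by_regression(X, target=1, predictor=0)    # fills the gap with about 6
```

When the features are strongly correlated, as here, the regression fill recovers the missing value much more closely than the column mean does.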
9.7 Computational Considerations
• For N observations, p predictors
• Additive Models: $O(pN \log N + mpN)$
• Trees: $O(pN \log N)$
• MARS: $O(NM^3 + pM^2 N)$
• HME, at each step: $O(Np^2 K^2)$