A Manager’s Guide
Through Random Forests
Kirk Monteverde
Student Loan Risk Advisors
February 2012
Inferno: Canto I
Midway upon the journey of our life
I found myself within a forest dark,
Leo Breiman (1928–2005)
Our Virgil
Breiman’s Intellectual Journey

Breiman, L., Friedman, J., Olshen, R. and Stone, C. (1984).
Classification and Regression Trees (CART), Wadsworth, New York.

Breiman, L. (1996a). Bagging predictors,
Machine Learning 26: 123–140.

Breiman, L. (2001). Random forests, Machine
Learning 45: 5–32.
What is CART?

CART is one of a number of recursive “tree-structure”
techniques (others include CHAID and C4.5)

DVs can be categorical or continuous (we concentrate
here on dichotomous DV models)

Idea is to start by choosing the predictor variable (and
its cut-point) that separates your development sample
into two piles that best differentiate on the DV

Repeat this process for each of the daughter nodes
and continue until some stopping rule is reached
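A minimal sketch of this splitting step, using Gini impurity as the "best differentiation" criterion (one common choice; full CART additionally handles pruning, costs, and surrogate splits). The toy data and variable names below are assumptions for illustration only:

    import numpy as np

    def gini(y):
        """Gini impurity of a pile of 0/1 DV values."""
        if len(y) == 0:
            return 0.0
        p = y.mean()
        return 2.0 * p * (1.0 - p)

    def best_split(X, y):
        """Find the (predictor, cut-point) pair whose two piles are purest on the DV."""
        n, n_vars = X.shape
        best = (None, None, np.inf)          # (variable index, cut-point, weighted impurity)
        for j in range(n_vars):
            for cut in np.unique(X[:, j]):
                left, right = y[X[:, j] <= cut], y[X[:, j] > cut]
                if len(left) == 0 or len(right) == 0:
                    continue
                score = (len(left) * gini(left) + len(right) * gini(right)) / n
                if score < best[2]:
                    best = (j, cut, score)
        return best

    # Hypothetical development sample: 2 predictors, dichotomous DV
    rng = np.random.default_rng(0)
    X = rng.uniform(size=(200, 2))
    y = (X[:, 0] > 0.5).astype(int)          # only predictor 0 truly matters
    print(best_split(X, y))                  # expect variable 0, cut near 0.5

Repeating best_split on each daughter node, until a stopping rule fires, is the recursion described above.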
CART topics

Non-parametric (non-linear) modeling

Nearest Neighbor approach to modeling

Bias / Variance trade-off and Over-fitting
Contrast with Logistic Regression
(linear model for categorical DVs)
CART topics: non-linearity
Contrast with Logistic Regression
(linear model for categorical DVs)
Modeling with Predictor A only
Modeling with both A and B
CART topics: non-linearity
Contrast with Logistic Regression
(Predicted Good Odds)
CART topics: non-linearity
Contrast with Logistic Regression
(perturbed data)
CART topics: non-linearity
Recursive Tree (CART) approach
(Perturbed Data: Split Depth 1)
CART topics: non-linearity
Recursive Tree (CART) approach
(Perturbed Data: Split Depth 2)
CART topics: non-linearity
Recursive Tree (CART) approach

“After putting the rabbit into the hat in the full view of the
audience, it does not seem necessary to make so much
fuss about drawing it out again.”
Robinson, Joan (1966), “Comment on Samuelson and Modigliani”,
Review of Economic Studies, 33, 307-8.

Unfair characterization of standard tree techniques
◦ precisely because of the concern for over-fitting, trees are typically not
grown to their full depth (and/or they are “pruned”)

Yet Random Forests does grow maximum depth trees
◦ controlling for over-fitting in another manner
CART topics: non-linearity
Linear Fit
CART topics: nearest neighbor
1-Nearest Neighbor
CART topics: nearest neighbor
15-Nearest Neighbors
CART topics: nearest neighbor
Model bias and variance:
using a development sample

Measure of a model’s “usefulness” has two parts
Model Bias: How far from the population’s true value do we
expect the model’s prediction to be? A model that perfectly fits its
random sample is unbiased, but an unbiased model need not fit
perfectly, so long as we do not expect it to be off target.
Model Variance: How far off do we expect this model’s
predictions (however biased) to be from the mean of all
predictions that the model would give were we to draw multiple
samples from the population and average all the model’s
predictions over those samples?

Mean Squared Error (MSE) = Bias(θ*)² + Var(θ*)
It is MSE that one should seek to minimize
CART topics: bias/variance trade-off
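A small simulation sketch of this decomposition, with made-up numbers: a deliberately shrunken (biased) estimate of a population mean is recomputed over many samples, and its average squared error is compared to squared bias plus variance:

    import numpy as np

    rng = np.random.default_rng(1)
    true_mean, n, reps = 5.0, 25, 50_000
    shrink = 0.8                                  # deliberately biased estimator: 0.8 * sample mean

    estimates = np.array([shrink * rng.normal(true_mean, 2.0, n).mean() for _ in range(reps)])

    mse      = np.mean((estimates - true_mean) ** 2)
    bias_sq  = (estimates.mean() - true_mean) ** 2
    variance = estimates.var()
    print(mse, bias_sq + variance)                # the two numbers agree up to simulation noise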
Model bias and variance:
using a development sample

In Ordinary Least Squares (OLS) the bias term is zero
(again, a model need not perfectly fit its sample to be unbiased)

“Goodness of fit” reduces to finding the model with
the smallest variance

OLS models are Best Linear Unbiased Estimators
(“BLUE”); Gauss-Markov Theorem

BLUE models may not be the best overall estimators
◦ Non-linear models may have lower MSE
◦ Biased models may have lower MSE (e.g. Ridge Regression)
CART topics: bias/variance trade-off
Problem of Over-Fitting

For prediction, model “usefulness” needs to be
assessed using data not used to develop the model
◦ Task is to minimize MSE of Test or “Hold-out” sample
◦ Over-fitting is related to the bias/variance trade-off

Expected Prediction Error (EPE) example using “k”NN
◦ Assume Y = f(x) + ε, with E(ε) = 0 and Var(ε) = σ²
◦ EPE(xᵢ) = σ² + [f(xᵢ) − (average of f over the k nearest neighbors of xᵢ)]² + σ²/k

As k increases, bias increases and variance decreases
CART topics: bias/variance trade-off
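A hedged simulation of this EPE behavior, assuming a hypothetical f(x) = sin(3x) and σ = 0.5: the kNN estimate at a fixed point x₀ is recomputed over many samples, and its squared bias and variance are tabulated as k grows:

    import numpy as np

    rng = np.random.default_rng(2)
    f = lambda x: np.sin(3 * x)                   # hypothetical true f(x)
    sigma, n, reps, x0 = 0.5, 100, 2000, 0.5

    for k in (1, 5, 15, 50):
        preds = []
        for _ in range(reps):
            x = rng.uniform(size=n)
            y = f(x) + rng.normal(0, sigma, n)
            nearest = np.argsort(np.abs(x - x0))[:k]       # the k nearest neighbors of x0
            preds.append(y[nearest].mean())                # kNN estimate of f(x0)
        preds = np.array(preds)
        bias_sq, var = (preds.mean() - f(x0)) ** 2, preds.var()
        print(f"k={k:3d}  bias^2={bias_sq:.4f}  variance={var:.4f}")

The variance column shrinks roughly like σ²/k while the squared-bias column grows, as the slide states.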
Over-fitting

Over-fitting issue is not restricted to non-traditional
methods

Issue is famously illustrated by fitting an OLS model to
variables first transformed into higher-degree polynomials
◦ A line can be perfectly fit through two development sample points
◦ A quadratic curve (2nd degree polynomial) can be perfectly fit through 3 pts
◦ An n-degree polynomial can be passed perfectly through n+1 data points

Adding “extra” variables to an OLS model can over-fit
◦ Occam’s razor
◦ Albert Einstein’s advice on modeling
CART topics: bias/variance trade-off
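A quick sketch of the polynomial illustration (NumPy, with a hypothetical sine-curve relationship): as the degree approaches the development-sample size, the development fit becomes essentially perfect while hold-out error deteriorates:

    import numpy as np

    rng = np.random.default_rng(3)
    f = lambda x: np.sin(2 * np.pi * x)                    # hypothetical true relationship
    x_dev = np.sort(rng.uniform(size=10))                  # small development sample
    y_dev = f(x_dev) + rng.normal(0, 0.2, 10)
    x_test = np.linspace(0.05, 0.95, 200)                  # hold-out sample
    y_test = f(x_test) + rng.normal(0, 0.2, 200)

    for degree in (1, 3, 9):                               # degree 9 can pass through all 10 points
        coefs = np.polyfit(x_dev, y_dev, degree)
        dev_mse  = np.mean((np.polyval(coefs, x_dev)  - y_dev)  ** 2)
        test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
        print(f"degree {degree}: development MSE={dev_mse:.4f}  hold-out MSE={test_mse:.2f}")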
Over-fitting
e.g., a key value of traditional tree-growing techniques is their focus on
avoiding over-fitting, using such approaches as cross-validation
CART topics: bias/variance trade-off
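A minimal cross-validation sketch, assuming scikit-learn and a synthetic dichotomous-DV sample: candidate tree depths are compared on 10-fold cross-validated accuracy rather than on the development-sample fit:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # Hypothetical development sample with a dichotomous DV
    X, y = make_classification(n_samples=500, n_features=10, n_informative=3, random_state=0)

    for depth in (2, 4, 8, None):                  # None = grow the tree to full depth
        tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
        acc = cross_val_score(tree, X, y, cv=10).mean()
        print(f"max_depth={depth}: 10-fold CV accuracy = {acc:.3f}")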
Bagging topics

Aggregation

Bootstrapping
Bagging = Bootstrap Aggregating
Aggregation as Variance Reduction

“Bagging helps… in short because averaging reduces
variance and leaves bias alone”
Hastie, Tibshirani, and Friedman, The Elements of Statistical
Learning, 2nd edition, p. 285

Therefore aggregation typically helps improve
performance of high variance/low bias techniques
(e.g., trees) and does not improve linear models

And variance reduction derives from the fundamental
observation that the square of an average is less
than (or equal to) the average of the squares
mars.csie.ntu.edu.tw/~cychen/papersurvey/Bagging.ppt
(see especially the slide entitled Why Bagging Works (2))
BAGGING topics: aggregation
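Two quick numeric checks of these claims (NumPy, arbitrary simulated numbers): the square of an average never exceeds the average of the squares, and averaging many equally noisy, roughly independent predictions shrinks variance by about the number averaged:

    import numpy as np

    rng = np.random.default_rng(4)

    # Check 1: the square of an average never exceeds the average of the squares
    a = rng.normal(size=1000)
    print((a.mean()) ** 2 <= (a ** 2).mean())            # always True

    # Check 2: averaging B independent, equally noisy predictions cuts variance by ~1/B
    single   = rng.normal(0, 1, size=(100_000,))                   # one noisy prediction, variance ~1
    averaged = rng.normal(0, 1, size=(100_000, 25)).mean(axis=1)   # average of B = 25 such predictions
    print(single.var(), averaged.var())                  # ~1.0 versus ~0.04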
The Wisdom of Crowds
James Surowiecki, 2004

Francis Galton’s experience at the 1906 West of
England Fat Stock and Poultry Exhibition

Jack Treynor’s jelly-beans-in-the-jar experiment
◦ Only one of 56 student guessers came closer to the
truth than the average of the class’s guesses

Who Wants to Be a Millionaire?
◦ Call an expert? 65% correct
◦ Ask the audience? 91% correct
BAGGING topics: aggregation
Scott E. Page, The Difference, 2007
Which person from the following list was not a member of the
Monkees (a 1960s pop band)?
(A) Peter Tork
(B) Davy Jones
(C) Roger Noll
(D) Michael Nesmith
The non-Monkee is Roger Noll, a Stanford economist. Now
imagine a crowd of 100 people with knowledge distributed as:
7 know all 3 of the Monkees
10 know 2 of the Monkees
15 know 1 of the Monkees
68 have no clue
So Noll will garner, on average, 34 votes (7 + 10/2 + 15/3 + 68/4)
versus 22 votes, i.e. (100 − 34)/3, for each of the other choices.
BAGGING topics: aggregation
Crowd Wisdom:
more than reduced variance

Implication of Surowiecki’s examples is that one
should not expend energy trying to identify an
expert within a group but instead rely on the
group’s collective wisdom, but
◦ Opinions must be independent
◦ Some knowledge of the truth must reside with some
group members

Kindergartners guessing the weight of a 747
◦ The squared error of the average of bad guesses, no matter
how bad, is still no further from the truth (and is usually
much closer) than the average of the squared errors of the
individual guesses
BAGGING topics: aggregation
Hastie, Tibshirani, and Friedman
BAGGING topics: aggregation
Bootstrapping:
a method, not a concept

First used to quantify, via a simulation-like approach,
the accuracy of statistical estimates (e.g. the variance
of a predicted y value at a specific value of x)

METHOD
◦ Draw from one’s development sample a selection of records, one at a
time, returning each selected record to the pool each time, giving it a
chance to be selected repeatedly
◦ Make this new “bootstrapped” sample the same size as the original
development sample
◦ Repeat to construct a series of such “bootstrapped” samples
BAGGING topics: bootstrapping
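A minimal sketch of the method, assuming NumPy and a hypothetical 30-record development sample (drawing one record at a time with replacement is equivalent to drawing the whole index vector in one call):

    import numpy as np

    def bootstrap_samples(data, n_samples):
        """Draw n_samples bootstrapped samples, each the same size as the original,
        selecting records with replacement."""
        rng = np.random.default_rng(5)
        n = len(data)
        return [data[rng.integers(0, n, size=n)] for _ in range(n_samples)]

    # Hypothetical development sample of 30 records
    development = np.arange(30)
    for b in bootstrap_samples(development, n_samples=3):
        print(sorted(b))      # some records repeat, others are left out entirely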
Bootstrapping and Aggregation

Bootstrapping is used to create different predictive
models which are then aggregated

The same approach (e.g., CART/ tree navigation)
can lead to very different models (i.e., different y
values predicted for the same x value) when
different bootstrapped samples are used

An example of a situation which can benefit from
bagging is one where model predictors are highly
correlated and the modeling technique is CART
BAGGING topics: bootstrapping
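A hedged sketch of bagging, with scikit-learn's DecisionTreeClassifier standing in for CART and a synthetic sample: full-depth trees are grown on bootstrapped samples and their predictions combined by majority vote:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)
    X_dev, y_dev, X_test, y_test = X[:200], y[:200], X[200:], y[200:]

    rng = np.random.default_rng(6)
    B, votes = 50, np.zeros(len(X_test))
    for _ in range(B):
        idx = rng.integers(0, len(X_dev), size=len(X_dev))          # bootstrapped sample
        tree = DecisionTreeClassifier().fit(X_dev[idx], y_dev[idx]) # CART-style tree, full depth
        votes += tree.predict(X_test)

    bagged_pred = (votes / B >= 0.5).astype(int)                    # aggregate by majority vote
    single_pred = DecisionTreeClassifier(random_state=0).fit(X_dev, y_dev).predict(X_test)
    print("single tree accuracy :", (single_pred == y_test).mean())
    print("bagged trees accuracy:", (bagged_pred == y_test).mean())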
Hastie, Tibshirani, and Friedman

5 candidate predictor variables (standardized normal)
correlated to one another at .95 (multicollinearity)

Only one is in reality related to the dichotomous DV
◦ If “true predictor” value is greater than its mean, the DV has
an 80% chance of being “YES”, otherwise a 20% chance
◦ “YES” values associated with high values of “true predictor”;
“NO” values associated with low values of “true predictor”

When run with bootstrapped sample data, CART often
uses some of the four causally unrelated (but
deceptively correlated) variables to parse the tree

Development sample size: 30
BAGGING topics: bootstrapping
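A hedged reconstruction of this data-generating setup (NumPy; the seed and exact mechanics are assumptions): five standardized normal predictors with pairwise correlation .95, only the first driving the dichotomous DV, development sample of 30:

    import numpy as np

    rng = np.random.default_rng(7)
    n, p, rho = 30, 5, 0.95
    cov = np.full((p, p), rho) + (1 - rho) * np.eye(p)     # all pairwise correlations = .95
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)  # standardized normal predictors

    # Only predictor 0 matters: P(YES) = .8 above its mean, .2 below it
    p_yes = np.where(X[:, 0] > 0, 0.8, 0.2)
    y = (rng.uniform(size=n) < p_yes).astype(int)
    print(np.corrcoef(X, rowvar=False).round(2))           # sample correlations near .95
    print(y)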
Digression on Bayes Error Rate
If “true predictor” value is greater than its mean, then the DV has
an 80% chance of being “YES”, otherwise it has only a 20% chance
Bayes Error Rate: The expected misclassification rate even
though the true data generation algorithm and the values of
the predictors that matter are known. Here, half the time the
predictor is above its mean and the best call (“YES”) is wrong
20% of the time, and likewise below it: .5(.2) + .5(.2) = .1 + .1 = .2
BAGGING topics: bootstrapping
Hastie, Tibshirani, and Friedman
Original and first 5 bootstrapped CART models
BAGGING topics: bootstrapping
Hastie, Tibshirani, and Friedman
BAGGING topics: bootstrapping
Random Forests
“This concept [bagged classifiers] has been popularized
outside of statistics as the Wisdom of Crowds (Surowiecki,
2004) — the collective knowledge of a diverse and
independent body of people typically exceeds the
knowledge of any single individual, and can be harnessed
by voting. Of course, the main caveat here is ‘independent,’
and bagged trees are not.”
Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning, 2nd
edition, p. 286
The Random Forests technique focuses on
making tree models (more) independent
before bagging them
Disclaimer

Random Forests has attracted “neural net”-like buzz

Its strength is in its easy-to-understand pedigree

Other recent advanced non-linear techniques are
arguably “better” and are certainly more robust
(e.g., Stochastic Gradient Boosting)

Features touted in the method (and implemented in
most software packages) not covered here include
◦ Out-of-Bag samples
◦ Variable Importance algorithm
◦ Proximity Plots
RANDOM FORESTS
De-correlating tree algorithms

Trees built using bootstrapped samples are correlated,
meaning that they tend to give the same estimates
for the Dependent Variable
◦ they are built, after all, using the same set of predictors,
varying only the composition of their bootstrapped samples

The trick to de-correlating trees is to randomly select
only a subset of the available predictors to consider at
each split as one builds the trees

Rule of thumb for random forest classifiers is to consider, at
each split, a number of predictors equal to the square root of
the number of available predictors
RANDOM FORESTS
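In scikit-learn this rule of thumb corresponds to max_features="sqrt" (which is also its default for classification); a minimal usage sketch on synthetic data:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=500, n_features=16, n_informative=4, random_state=0)

    # max_features="sqrt": each split considers only sqrt(16) = 4 randomly chosen predictors
    rf = RandomForestClassifier(n_estimators=300, max_features="sqrt", random_state=0)
    print(cross_val_score(rf, X, y, cv=5).mean())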
Important Predecessor Paper
Ho, Tin Kam (1995). Random Decision Forests,
in M. Kavavaugh and P. Storms (eds), Proc. Third International Conference on Document
Analysis and Recognition, Vol. 1, IEEE Computer Society Press, New York, pp. 278–282.

Short, readable paper from Bell Labs scientist; an engineer’s interest in
solving a practical problem (handwritten digit recognition)

Introduced idea of using only a subset of candidate predictors for each
of an ensemble of tree growths (but all against single training data set)

Tree navigation not using CART, but rather two geometric-inspired
classifying algorithms (reminiscent of Nearest Neighbor)

Trees are grown to full depth (no pruning) with each tree “voting”

Insight that for all trees, the training set is 100% correctly classified
(except where 2 sample points with the same IV values are associated
with different DV classifications); technique is only useful for classifying
points not present in the sample (points lying “between” sample points)
RANDOM FORESTS
Recalling problems of trees
grown to their full depth

For categorical (unordered) independent variables, a fully grown
tree is nothing more than a multi-dimensional crosstab; CART, or
any tree navigation method, is simply the means to a known end
◦ Social Scientist’s Saturated Model
◦ Logistic Regression run by including all possible interactive effects

And if all cells of this multi-dimensional crosstab are populated
(e.g., all possible predictor level combinations are represented in
the sample data), then there are no levels of predictor variable
combinations “between” those levels observed in the sample

For ordinal level predictor data (“good, bad, or ugly”) and interval
level data, tree navigation (e.g., using CART) still involves forming
groupings but the mechanism of group formation is constrained
and, for interval level data, groups are defined as ranges
RANDOM FORESTS
Add bootstrapping and use 1-NN
(not CART) for prediction
RANDOM FORESTS
First Bootstrapped “Collective”
6-member collective (collective defined as all
combinations of 4 variables taken 2 at a time)
RANDOM FORESTS
First Collective’s voting on TEST
sample point (.75, .75, .75, .75), BLUE
RANDOM FORESTS
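A toy reconstruction of this collective (the training data here are hypothetical, since the original figure is not reproduced): all 6 pairs of the 4 predictors, each member fit by 1-NN on its own bootstrapped sample, then voting on the test point (.75, .75, .75, .75):

    import numpy as np
    from itertools import combinations
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(9)
    # Hypothetical training data: 4 predictors, two classes (0 = RED, 1 = BLUE)
    X = rng.uniform(size=(40, 4))
    y = (X.mean(axis=1) > 0.5).astype(int)
    test_point = np.array([[0.75, 0.75, 0.75, 0.75]])

    votes = []
    for pair in combinations(range(4), 2):            # the 6-member "collective"
        cols = list(pair)
        idx = rng.integers(0, len(X), size=len(X))    # bootstrapped sample for this member
        member = KNeighborsClassifier(n_neighbors=1).fit(X[idx][:, cols], y[idx])
        votes.append(member.predict(test_point[:, cols])[0])
    print(votes, "-> majority vote:", int(np.mean(votes) >= 0.5))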
RF: the complete algorithm
(from Hastie, Tibshirani, and Friedman)
RANDOM FORESTS
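A from-scratch sketch of the algorithm on synthetic data, using scikit-learn's DecisionTreeClassifier(max_features=m) to supply the per-split random predictor subset: bootstrap, grow each tree to full depth, then aggregate by vote (data and parameters are illustrative assumptions):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=400, n_features=9, n_informative=4, random_state=0)
    X_dev, y_dev, X_test, y_test = X[:300], y[:300], X[300:], y[300:]

    rng = np.random.default_rng(8)
    B = 200
    m = int(np.sqrt(X.shape[1]))                                  # predictors tried per split (rule of thumb)
    trees = []
    for b in range(B):
        idx = rng.integers(0, len(X_dev), size=len(X_dev))        # 1. bootstrapped sample
        tree = DecisionTreeClassifier(max_features=m,             # 2. m random predictors per split,
                                      random_state=b)             #    tree grown to full depth
        trees.append(tree.fit(X_dev[idx], y_dev[idx]))

    votes = np.mean([t.predict(X_test) for t in trees], axis=0)   # 3. aggregate by majority vote
    print("forest accuracy:", ((votes >= 0.5).astype(int) == y_test).mean())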
Summary and Final Caution

CART topics
◦ Non-parametric (non-linear) modeling
◦ Nearest Neighbor approach to modeling
◦ Bias / Variance trade-off and Over-fitting

Bagging Topics
◦ Aggregation and the Wisdom of Crowds
◦ Bootstrapping

Random Forests

Importance of explicitly addressing the relative
costs of misclassification when using non-linear
classification algorithms (e.g. CART and RF)
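One hedged way to express unequal misclassification costs with a Random Forest in scikit-learn is the class_weight argument; the 10-to-1 weighting and synthetic data below are purely illustrative:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import confusion_matrix
    from sklearn.model_selection import train_test_split

    # Imbalanced synthetic sample: class 1 (the costly misclassification) is rare
    X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
    X_dev, X_test, y_dev, y_test = train_test_split(X, y, random_state=0)

    for w in (None, {0: 1, 1: 10}):      # hypothetical 10:1 cost of missing a class-1 case
        rf = RandomForestClassifier(n_estimators=200, class_weight=w, random_state=0).fit(X_dev, y_dev)
        print(w, confusion_matrix(y_test, rf.predict(X_test)).ravel())   # tn, fp, fn, tp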