ICS 278: Data Mining Lectures 10,11: Classification Algorithms Padhraic Smyth

ICS 278: Data Mining
Lectures 10,11: Classification Algorithms
Padhraic Smyth
Department of Information and Computer Science
University of California, Irvine
Data Mining Lectures
Lectures 10/11: Classification
Padhraic Smyth, UC Irvine
Notation
• Variables X, C, … with values x, y (lower case)
• Vectors indicated by bold X
• Components of X indicated by Xj with values xj
• “Matrix” data set D with n rows and p columns
– jth column contains values for variable Xj
– ith row contains a vector of measurements on object i, indicated by x(i)
– The jth measurement value for the ith object is xj(i)
• Unknown parameter for a model = θ
– Can also use other Greek letters, like α, β, δ, γ, ε, ω
– Vector of parameters = bold θ
Classification
• Predictive modeling: predict Y given X
– Y is real-valued => regression
– Y is categorical => classification
• Often use C rather than Y to indicate the “class variable”
• Classification
– Many applications: speech recognition, document classification, OCR, loan approval, face recognition, etc.
Classification v. Regression
• Similar in many ways…
– Both learn a mapping from X to C or Y
– Both sensitive to dimensionality of X
– Generalization to new data is important in both
• Test error versus model complexity
– Many models can be used for either classification or regression, e.g.,
• trees, neural networks
• Most important differences
– Categorical Y versus real-valued Y
– Different score functions
• E.g., classification error versus squared error
Decision Region Terminology
[Figure: two-class data in a two-dimensional feature space (axes: Feature 1, Feature 2), showing Decision Region 1, Decision Region 2, and the decision boundary between them]
Probabilistic view of Classification
• Notation: let there be K classes c1, …, cK
• Class marginals: p(ck) = probability of class k
• Class-conditional probabilities
p( x | ck ) = probability of x given ck , k = 1,…,K
• Posterior class probabilities (by Bayes rule)
$p(c_k \mid x) = p(x \mid c_k)\, p(c_k) / p(x)$, k = 1,…,K
where $p(x) = \sum_{j=1}^{K} p(x \mid c_j)\, p(c_j)$
In theory this is all we need… in practice this may not be the best approach.
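The Bayes-rule computation on the slide above can be sketched in a few lines. This is only an illustration under stated assumptions: x is one-dimensional, each class-conditional density is Gaussian, and the means, standard deviations, and priors are made-up values rather than anything from the lecture. The helper names gaussian_pdf and posteriors are hypothetical.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian; stands in for p(x | c_k)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def posteriors(x, class_models, priors):
    """Return [p(c_k | x)] for every class via Bayes rule.

    class_models: list of (mean, std) pairs, one per class (assumed values).
    priors: list of class marginals p(c_k), summing to 1.
    """
    joint = [gaussian_pdf(x, m, s) * p for (m, s), p in zip(class_models, priors)]
    evidence = sum(joint)                      # p(x) = sum_j p(x | c_j) p(c_j)
    return [j / evidence for j in joint]

# Illustrative (assumed) parameters: two classes with equal priors.
models = [(0.0, 1.0), (2.0, 1.0)]
priors = [0.5, 0.5]
post = posteriors(1.0, models, priors)         # x halfway between the two means
```

With equal priors and equal variances, a point halfway between the two means gets posterior 0.5 for each class, which is where the decision boundary falls in this toy setting.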
Example of Probabilistic Classification
[Figure: class-conditional densities p( x | c1 ) and p( x | c2 ) plotted along x, followed by the resulting posterior p( c1 | x ) on a scale from 0 to 1]
Decision Regions and Bayes Error Rate
[Figure: densities p( x | c1 ) and p( x | c2 ), with the x-axis partitioned into regions labeled Class c1 and Class c2]
Optimal decision regions = regions where 1 class is more likely
Optimal decision regions => optimal decision boundaries
Bayes error rate = fraction of examples misclassified by optimal classifier
= shaded area above (see equation 10.3 in text)
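To make the "shaded area" concrete, here is a rough numerical sketch. It assumes one-dimensional Gaussian class-conditional densities with invented parameters and integrates the probability mass that the optimal classifier gets wrong, using a simple midpoint rule. The function names are hypothetical, and the closed form used as a check holds only for this symmetric two-class toy case.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian; stands in for p(x | c_k)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def bayes_error_1d(class_models, priors, lo, hi, steps=20000):
    """Approximate the Bayes error rate on [lo, hi] with the midpoint rule."""
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * width
        joint = [gaussian_pdf(x, m, s) * p for (m, s), p in zip(class_models, priors)]
        # The optimal classifier picks the largest joint term;
        # the remaining mass at this x is misclassified.
        total += (sum(joint) - max(joint)) * width
    return total

# Assumed toy setting: unit-variance Gaussians at 0 and 2, equal priors.
bayes_err = bayes_error_1d([(0.0, 1.0), (2.0, 1.0)], [0.5, 0.5], lo=-10.0, hi=12.0)

# For this symmetric case the error equals Phi(-1), written via the error function.
closed_form = 0.5 * (1.0 + math.erf(-1.0 / math.sqrt(2.0)))
```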
Procedure for optimal Bayes classifier
• For each class learn a model p( x | ck )
– E.g., each class is multivariate Gaussian with its own mean and covariance
• Use Bayes rule to obtain p( ck | x )
=> this yields the optimal decision regions/boundaries
=> use these decision regions/boundaries for classification
• Correct in theory… but practical problems include:
– How do we model p( x | ck ) ?
– Even if we know the model for p( x | ck ), modeling a distribution or density will be very difficult in high dimensions (e.g., p = 100)
• Alternative approach: model the decision boundaries directly
3 categories of classifiers in general
• Generative (or class-conditional) classifiers:
– Learn models for p( x | ck ), use Bayes rule to find decision boundaries
– Examples: naïve Bayes models, Gaussian classifiers
• Regression (or posterior class probabilities):
– Learn a model for p( ck | x ) directly
– Examples: logistic regression (see lecture 5/6), neural networks
• Discriminative classifiers
– No probabilities
– Learn the decision boundaries directly
– Examples:
• Linear boundaries: perceptrons, linear SVMs
• Piecewise linear boundaries: decision trees, nearest-neighbor classifiers
• Non-linear boundaries: non-linear SVMs
– Note: one can usually “post-fit” class probability estimates p( ck | x ) to a discriminative classifier
Which type of classifier is appropriate?
• Let’s look at the score functions:
– c(i) = true class, c(x(i) ; θ) = class predicted by the classifier
Class-mismatch loss functions:
$S(\theta) = \frac{1}{n} \sum_{i} \mathrm{cost}[\, c(i),\ c(x(i); \theta) \,]$
where cost(i, j) = cost of misclassifying true class i as predicted class j
e.g., cost(i,j) = 0 if i=j, = 1 otherwise (misclassification error or 0-1 loss)
and more generally cost(i,j) is a matrix of K x K losses (e.g., surgery, spam email, etc)
Class-probability loss functions:
$S(\theta) = \frac{1}{n} \sum_{i} \log p(c(i) \mid x(i); \theta)$ (log probability score)
or $S(\theta) = \frac{1}{n} \sum_{i} [\, c(i) - p(c(i) \mid x(i); \theta) \,]^2$ (Brier score)
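The score functions above can be compared on a tiny example. The sketch below uses invented data with classes coded 0 and 1. The Brier score is written in its usual two-class form (0/1 label minus the predicted probability of class 1), which is an assumption about how the slide's formula is meant to be read. The asymmetric cost values are also made up.

```python
import math

def mean_cost(true_classes, predicted_classes, cost):
    """Class-mismatch score: average of cost[true][predicted] over the data."""
    n = len(true_classes)
    return sum(cost[t][p] for t, p in zip(true_classes, predicted_classes)) / n

def mean_log_probability(true_classes, class_probs):
    """Log probability score: average log-probability given to the true class."""
    n = len(true_classes)
    return sum(math.log(probs[t]) for t, probs in zip(true_classes, class_probs)) / n

def brier_score(true_classes, class_probs):
    """Two-class Brier score: mean squared gap between 0/1 label and p(class 1 | x)."""
    n = len(true_classes)
    return sum((t - probs[1]) ** 2 for t, probs in zip(true_classes, class_probs)) / n

# Toy data, invented for illustration.
truth = [1, 0, 1, 1]
probs = [[0.2, 0.8], [0.6, 0.4], [0.7, 0.3], [0.1, 0.9]]   # [p(0|x), p(1|x)]
predicted = [0 if p[0] >= p[1] else 1 for p in probs]      # threshold at 0.5

zero_one = [[0, 1], [1, 0]]       # 0-1 loss
asymmetric = [[0, 1], [5, 0]]     # assumed: missing a true class 1 costs 5

error_rate = mean_cost(truth, predicted, zero_one)
expected_cost = mean_cost(truth, predicted, asymmetric)
log_score = mean_log_probability(truth, probs)
brier = brier_score(truth, probs)
```

The same single mistake (the third example) counts as 0.25 under 0-1 loss but 1.25 under the asymmetric matrix, which is the point of choosing the score function to match the application.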
Example: classifying spam email
• 0-1 loss function
– Appropriate if we just want to maximize accuracy
• Asymmetric cost matrix
– Appropriate if losing a non-spam email (misclassifying it as spam) is more “costly” than failing to detect a spam email
• Probability loss
– Appropriate if we wanted to rank all emails by p(spam | email features), e.g., to allow the user to look at emails via a ranked list.
• In general: don’t solve a harder problem than you need to, or don’t model aspects of the problem you don’t need to (e.g., modeling p(x|c)) - Vapnik, 1996.
Examples of classifiers
• Generative/class-conditional/probabilistic, based on p( x | ck )
– Naïve Bayes (simple, but often effective in high dimensions)
– Parametric generative models, e.g., Gaussian (can be effective in low-dimensional problems: leads to quadratic boundaries in general)
• Regression-based, model p( ck | x ) directly
– Logistic regression: simple, linear in log-odds space
– Neural network: non-linear extension of logistic, can be difficult to work with
• Discriminative models, focus on locating optimal decision boundaries
– Linear discriminants, perceptrons: simple, sometimes effective
– Support vector machines: generalization of linear discriminants, can be quite effective, computational complexity is an issue
– Nearest neighbor: simple, can scale poorly in high dimensions
– Decision trees: “swiss army knife”, often effective in high dimensions
Naïve Bayes Classifiers
• Generative probabilistic model with conditional independence assumption on p( x | ck ), i.e.
$p(x \mid c_k) = \prod_{j} p(x_j \mid c_k)$
• Typically used with nominal variables
– Real-valued variables discretized to create nominal versions
– (alternative is to model each p( xj | ck ) with a parametric model – less widely used)
• Comments:
– Simple to train (just estimate conditional probabilities for each feature-class pair)
– Often works surprisingly well in practice
• e.g., state of the art for text-classification, basis of many widely used spam filters
– Feature selection can be helpful, e.g., information gain
– Note that even if the conditional independence (CI) assumptions are not met, it may still be able to approximate the optimal decision boundaries (seems to happen in practice)
– However… on most problems it can usually be beaten with a more complex model (plus more work)
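Training a naïve Bayes model on nominal variables really is just counting, as the following minimal sketch suggests. The Laplace smoothing constant alpha, the dictionary-based model layout, and the toy email data are all assumptions made for illustration; the slide does not specify how zero counts are handled.

```python
import math
from collections import Counter

def train_naive_bayes(rows, labels, alpha=1.0):
    """Estimate p(c) and p(x_j | c) from nominal data by counting.

    alpha is a Laplace-smoothing constant (an assumption of this sketch).
    """
    class_counts = Counter(labels)
    n_features = len(rows[0])
    # value_counts[c][j][v] = number of class-c rows with feature j equal to v
    value_counts = {c: [Counter() for _ in range(n_features)] for c in class_counts}
    vocab = [set() for _ in range(n_features)]
    for row, c in zip(rows, labels):
        for j, v in enumerate(row):
            value_counts[c][j][v] += 1
            vocab[j].add(v)
    return {"class_counts": class_counts, "value_counts": value_counts,
            "vocab": vocab, "alpha": alpha, "n": len(labels)}

def predict_proba(model, row):
    """Return {class: p(c | x)} using p(x | c) = prod_j p(x_j | c)."""
    log_joint = {}
    for c, n_c in model["class_counts"].items():
        total = math.log(n_c / model["n"])                        # log p(c)
        for j, v in enumerate(row):
            count = model["value_counts"][c][j][v]
            denom = n_c + model["alpha"] * len(model["vocab"][j])
            total += math.log((count + model["alpha"]) / denom)   # log p(x_j | c)
        log_joint[c] = total
    top = max(log_joint.values())                 # subtract max for numerical stability
    unnorm = {c: math.exp(v - top) for c, v in log_joint.items()}
    z = sum(unnorm.values())
    return {c: u / z for c, u in unnorm.items()}

# Toy data, invented: (contains the word "free"?, has attachment?) per email.
rows = [("free", "yes"), ("free", "no"), ("meeting", "no"),
        ("meeting", "no"), ("free", "yes")]
labels = ["spam", "spam", "ham", "ham", "spam"]
model = train_naive_bayes(rows, labels)
probs = predict_proba(model, ("free", "yes"))
```

Working in log space and normalizing at the end avoids underflow when there are many features, which matters for text data.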
Link between Logistic Regression and Naïve Bayes
Naïve Bayes
$\log \frac{P(C \mid d)}{P(\bar{C} \mid d)} = \log \frac{P(C)}{P(\bar{C})} + \sum_{w \in d} \log \frac{P(w \mid C)}{P(w \mid \bar{C})}$
Logistic Regression
$\log \frac{P(C \mid d)}{P(\bar{C} \mid d)} = \alpha + \sum_{w \in d} \beta_w \cdot w$
(here $\bar{C}$ denotes "not class C" and the sums run over the words w in document d)
Imbalanced Class Distributions
• Common in data mining to have one class be much less likely than the others
– e.g., 0.1% of examples are fraudulent or have a disease
• If we train a standard classifier on a random sample of data it is very difficult to beat the “majority classifier” in terms of accuracy
• Approaches:
– Stratified sampling: artificially create training data with 50% of each class being present, and then “correct” for this in prediction
• E.g., learn p(x|c) on stratified data and use true p( c ) when predicting with a probabilistic model
– Use a different score function:
• We are often interested in scoring/screening/ranking cases when using the model
• Thus, scores such as “how many of the class of interest are ranked in the top 1% of predictions” may be more relevant than overall accuracy (e.g., in document retrieval)
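One common way to carry out the "correction" after stratified sampling is to re-weight the model's posteriors by the ratio of true to training priors. The sketch below assumes that p(x | c) is unchanged by the re-sampling, so that Bayes rule gives the re-weighting directly. The function name and the numbers are illustrative, not from the lecture.

```python
def correct_for_priors(posterior_train, prior_train, prior_true):
    """Re-weight posteriors learned on re-sampled data to the true class priors.

    Assumes p(x | c) is the same in the stratified and original data, so that
    p_true(c | x) is proportional to p_train(c | x) * p_true(c) / p_train(c).
    """
    weighted = [post * true / train
                for post, train, true in zip(posterior_train, prior_train, prior_true)]
    z = sum(weighted)
    return [w / z for w in weighted]

# Illustrative numbers: model trained on a 50/50 sample says p(class 1 | x) = 0.9,
# but the true base rate of class 1 is 0.1%.
corrected = correct_for_priors([0.10, 0.90], [0.5, 0.5], [0.999, 0.001])
```

Even a confident-looking 0.9 shrinks to under 1% once the true base rate is restored, which is why accuracy against the majority classifier is so hard to beat in this setting.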
Ranking and Lift Curves
• Many problems where we are interested in ranking examples in terms of how likely they are to belong to the “positive” class
– E.g., credit scoring, fraud detection, medical screening, document retrieval
– E.g., use classifier to rank N test examples according to p(c|x) and then pick the top K, where K is much smaller than N
• Lift curve
– n = number of true positives that appear in top K% of ranked list
– r = number of true positives that would appear if we ranked randomly
– n/r is the “lift” provided by the classifier for top K%
• e.g., K = 10%, r = 200, n = 300, lift = 1.5, i.e., 50% more true positives than random ranking
• Random ranking gives lift = 1, i.e., no improvement
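The lift definition above (n divided by r for the top K% of a ranked list) can be computed directly. This is a small sketch with invented scores and labels; the function name lift_at_top is hypothetical, and it assumes at least one positive example so that r is non-zero.

```python
def lift_at_top(scores, labels, fraction):
    """Lift of a ranking in its top `fraction` of examples (labels: 1 = positive)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_k = max(1, int(round(fraction * len(ranked))))
    n = sum(label for _, label in ranked[:top_k])   # true positives found in the top
    r = top_k * sum(labels) / len(labels)           # expected count under random ranking
    return n / r

# Toy scores and labels, invented for illustration.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]
lift_20 = lift_at_top(scores, labels, 0.20)         # top 2 of 10 examples
lift_all = lift_at_top(scores, labels, 1.00)        # whole list: lift must be 1
```

Evaluating lift_at_top over a grid of fractions gives the points of a lift curve.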
• Target variable = response/no-response from mailing campaign
• Training and test sets each of size 250k
• Standard model had 80 variables: variable selection reduced this to 7
• Note non-monotonicity in lower curve (undesirable)
ROC plots
• Rank the N test examples by p(c|x)
– or whatever real number our classifier produces that indicates likelihood of belonging to class 1
• Let k = number of examples whose true class is 1, and m = number whose true class is 0, with k+m = N
• For all possible thresholds for this ranked list
– count the number of true positives above the threshold, kt
• true positive rate = kt /k
– count the number of “false alarms” above the threshold, mt
• false positive rate = mt /m
– ROC plot = plot of true positive rate kt /k v. false positive rate mt /m
ROC Example
N = 10 examples, k = 6 true class 1’s, m = 4 class 0’s
The first column is a possible ranking from a classifier

Rank   True Class   True Positives   False Positives
1      1            1                0
2      1            2                0
3      1            3                0
4      1            4                0
5      0            4                1
6      1            5                1
7      0            5                2
8      1            6                2
9      0            6                3
10     0            6                4
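The running counts in the ROC example can be turned into ROC points and an area under the curve with a short sketch like the one below. It takes the true classes in ranked order from the example (ranks 1 to 10) and assumes there are no tied scores; the function names are illustrative.

```python
def roc_points(ranked_labels):
    """ROC points (false positive rate, true positive rate) for a ranked list.

    ranked_labels: true classes (1 or 0), ordered from most to least likely class 1.
    One point per threshold position, starting from (0, 0).
    """
    k = sum(ranked_labels)                 # number of true class 1's
    m = len(ranked_labels) - k             # number of class 0's
    points = [(0.0, 0.0)]
    kt = mt = 0
    for label in ranked_labels:
        if label == 1:
            kt += 1                        # one more true positive above the threshold
        else:
            mt += 1                        # one more false alarm above the threshold
        points.append((mt / m, kt / k))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# True classes at ranks 1..10 in the example.
ranking = [1, 1, 1, 1, 0, 1, 0, 1, 0, 0]
points = roc_points(ranking)
area = auc(points)
```

For this ranking the area works out to 21/24 = 0.875, which matches the fraction of (class 1, class 0) pairs that the ranking orders correctly.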
ROC Plot
[Figure: ROC plot; the diagonal line corresponds to random ranking]
• Area under curve (AUC) often used as a metric to summarize ROC
• Online example at http://www.anaesthetist.com/mnm/stats/roc/
Example: Link Prediction in Coauthor Graphs
O’Madadhain, Hutchins, Smyth, SIGKDD, 2005
• Binary classification problem
– Training data:
• graph of coauthor links, 100k authors, 300k links
• data over several years
– Test data: coauthor graph for same authors in a future year
– Classification problem:
• predict if pair(A,B) will coauthor
• Training and test pairs selected in various ways
• Compared a variety of different classifiers and evaluation metrics
– Skewed class distribution
• No link present (class 0) in 93.8 % of test examples
• Link present (class 1) in 6.2 % of test examples
Evaluation Metrics
• Classification error
– If p(link[A,B]) > 0.5, predict a link
• Brier Score
– $\sum_{(A,B)} [\, p(\mathrm{link}[A,B]) - I(A,B) \,]^2$
• ROC Area
– area under ROC plot (between 0 and 1)
Link Prediction Evaluation
                 Classification Error (%)   ROC Area   Brier Score
Baseline         6.2                        0.50       100.0
Single feature   6.2                        0.54       98.6
Naïve Bayes      15.5                       0.78       211.7
Logistic         6.1                        0.80       83.1
Boosting         6.4                        0.79       83.4
Averaged         6.2                        0.80       82.2
Lift Curves for Different Models
Base Rate of links = 6.2%
Interpretation of Ranking at Top of Ranked List
• Top 50 ranked candidates
– Averaged: contains 44 true links
– Logistic: contains 40 true links
– Baseline: contains 3 true links
• Top 500 ranked candidates
– Averaged: contains 300 true links
– Logistic: contains 298 true links
– Baseline: contains 31 true links
Lift Curves for Different Models
Base Rate of links = 0.2%
Calibration
• In addition to ranking we may be interested in how accurate our estimates of p(c|x) are,
– i.e., if the model says p(c|x) = 0.9, how accurate is this number?
• Calibration:
– a model is well-calibrated if its probabilistic predictions match real-world empirical frequencies
– i.e., if a classifier predicts p(c|x) = 0.9 for 100 examples, then on average we would expect about 90 of these examples to belong to class c, and 10 not to.
– We can estimate calibration curves by binning a classifier’s probabilistic predictions, and measuring how many of the examples in each bin actually belong to class c
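The binning idea for estimating a calibration curve might look like the following sketch. Equal-width bins, the number of bins, and the toy predictions are all assumptions made for illustration; the function name calibration_curve is hypothetical.

```python
def calibration_curve(predicted_probs, labels, n_bins=10):
    """Bin predictions into equal-width bins and compare with observed frequencies.

    Returns (mean predicted probability, observed fraction in class c, count)
    for each non-empty bin. labels are 1 if the example belongs to class c, else 0.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(predicted_probs, labels):
        index = min(int(p * n_bins), n_bins - 1)    # p = 1.0 goes in the last bin
        bins[index].append((p, y))
    curve = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            frac = sum(y for _, y in members) / len(members)
            curve.append((mean_p, frac, len(members)))
    return curve

# Invented example: ten predictions of 0.9 (nine truly class c)
# and five predictions of 0.2 (one truly class c).
probs = [0.9] * 10 + [0.2] * 5
labels = [1] * 9 + [0] + [0, 0, 0, 0, 1]
curve = calibration_curve(probs, labels, n_bins=5)
```

A well-calibrated model gives points near the diagonal, i.e., the mean predicted probability in each bin is close to the observed fraction, as it is in this toy case.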
Calibration in Probabilistic Prediction
Linear Discriminants
• Discriminant -> method for computing class decision boundaries
– Linear discriminant -> linear decision boundaries
• Linear Discriminant Analysis (LDA)
– Earliest known classifier (1936, R.A. Fisher)
– See section 10.4 for math details
– Find a projection onto a vector such that means for each class (2 classes) are separated as much as possible (with variances taken into account appropriately)
– Reduces to a special case of parametric Gaussian classifier in certain situations
– Many subsequent variations on this basic theme (e.g., regularized LDA)
• Other linear discriminants
– Decision boundary = (p-1) dimensional hyperplane in p dimensions
– Perceptron learning algorithms (pre-dated neural networks)
• Simple “error correction” based learning algorithms
– Linear SVMs: use a sophisticated “margin” idea for selecting the hyperplane
Nearest Neighbor Classifiers
• kNN: select the k nearest neighbors to x from the training data and select the majority class from these neighbors
• k is a parameter:
– Small k: “noisier” estimates, Large k: “smoother” estimates
– Best value of k often chosen by cross-validation
• Comments
– Virtually assumption free
– Gives piecewise linear boundaries (i.e., non-linear overall)
– Interesting theoretical properties:
Bayes error ≤ error(kNN) ≤ 2 x Bayes error (asymptotically)
• Disadvantages
– Can scale poorly with dimensionality: sensitive to distance metric
– Requires fast lookup at run-time to do classification with large n
– Does not provide any interpretable “model”
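A brute-force version of the kNN rule fits in a few lines. The sketch below assumes Euclidean distance and a linear scan over the training data (so it illustrates the rule, not the fast lookup mentioned under disadvantages). The toy points are invented, and ties between classes are broken arbitrarily.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Majority class among the k training points closest to `query`."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Toy two-class data in two dimensions, invented for illustration.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 5.0), (7.0, 6.0), (6.5, 7.0)]
labels = [1, 1, 1, 2, 2, 2]
pred_a = knn_predict(points, labels, (1.2, 1.4), k=3)
pred_b = knn_predict(points, labels, (6.4, 6.0), k=3)
```

Using an odd k in a two-class problem avoids tied votes; choosing k itself would normally be done by cross-validation, as noted above.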
Local Decision Boundaries
Boundary? Points that are equidistant between points of class 1 and 2
Note: locally the boundary is
(1) linear (because of Euclidean distance)
(2) halfway between the 2 class points
(3) at right angles to connector
[Figure: points labeled 1 and 2 and a query point “?” in the Feature 1 / Feature 2 plane]
Finding the Decision Boundaries
[Figure: three successive views of the same points, building up the local boundaries between neighboring points of class 1 and class 2]
Overall Boundary = Piecewise Linear
[Figure: the same points with the resulting Decision Region for Class 1 and Decision Region for Class 2]
Example: Choosing k in kNN
(example from G. Ridgeway, 2003)
Decision Tree Classifiers
– Widely used in practice
• Can handle both real-valued and nominal inputs (unusual)
• Good with high-dimensional data
– similar algorithms as used in constructing regression trees
– historically, developed both in statistics and computer science
• Statistics:
– Breiman, Friedman, Olshen and Stone, CART, 1984
• Computer science:
– Quinlan, ID3, C4.5 (1980’s-1990’s)
Decision Tree Example
[Figure sequence: data plotted with Income on the horizontal axis and Debt on the vertical axis. The tree is grown one split at a time: first Income > t1, then Debt > t2, then Income > t3, each split adding an axis-parallel boundary at the corresponding threshold.]
Note: tree boundaries are piecewise linear and axis-parallel
Binary split selection criteria
• $Q(t) = N_1 Q_1(t) + N_2 Q_2(t)$, where t is the threshold
~ “average” quality of the split
• Let $p_{1k}$ be the proportion of class k points in region 1
• Error criterion for a branch: $Q_1(t) = 1 - p_{1k^*}$, where $k^*$ is the most common class in region 1
• Gini index: $Q_1(t) = \sum_k p_{1k} (1 - p_{1k})$
• Cross-entropy: $Q_1(t) = -\sum_k p_{1k} \log p_{1k}$
• Cross-entropy and Gini work better in practice than direct minimization of classification error at each node
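The three split criteria can be compared with a small sketch. All three are written here as impurities to be minimized, so the cross-entropy carries a minus sign. Interpreting N1 and N2 as the numbers of points on each side of the split, and trying midpoints between consecutive distinct values as candidate thresholds, are assumptions of this sketch; the income data are invented.

```python
import math
from collections import Counter

def class_proportions(labels):
    """Proportion of each class present in a region."""
    counts = Counter(labels)
    return [c / len(labels) for c in counts.values()]

def error_criterion(labels):
    return 1.0 - max(class_proportions(labels))

def gini(labels):
    return sum(p * (1.0 - p) for p in class_proportions(labels))

def cross_entropy(labels):
    return -sum(p * math.log(p) for p in class_proportions(labels))

def split_quality(values, labels, threshold, criterion):
    """Q(t) = N1*Q1(t) + N2*Q2(t) for the split value <= t versus value > t."""
    left = [c for v, c in zip(values, labels) if v <= threshold]
    right = [c for v, c in zip(values, labels) if v > threshold]
    return sum(len(side) * criterion(side) for side in (left, right) if side)

def best_threshold(values, labels, criterion):
    """Try midpoints between consecutive distinct values; lower Q is better."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2.0 for a, b in zip(distinct, distinct[1:])]
    return min(candidates, key=lambda t: split_quality(values, labels, t, criterion))

# Toy data, invented: one real-valued input and a binary class.
income = [10, 20, 30, 40, 50, 60]
labels = [0, 0, 0, 1, 1, 1]
t_star = best_threshold(income, labels, gini)
```

Swapping gini for cross_entropy or error_criterion in the last line changes only the score, not the search, which makes it easy to see where the criteria disagree on less clean data.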
How to Choose the Right-Sized Tree?
[Figure: predictive error (vertical axis) versus size of decision tree (horizontal axis), with curves for error on training data and error on test data, and the ideal range for tree size marked]
Choosing a Good Tree for Prediction
• General idea
– grow a large tree
– prune it back to create a family of subtrees
• “weakest link” pruning
– score the subtrees and pick the best one
• Massive data sizes (e.g., n ~ 100k data points)
– use training data set to fit a set of trees
– use a validation data set to score the subtrees
• Smaller data sizes (e.g., n ~1k or less)
– use cross-validation
– use explicit penalty terms (e.g., Bayesian methods)
Example: Spam Email Classification
• Data Set: (from the UCI Machine Learning Archive)
– 4601 email messages from 1999
– Manually labelled as spam (40%), non-spam (60%)
– 54 features: percentage of words matching a specific word/character
• Business, address, internet, free, george, !, $, etc
– Average/longest/sum lengths of uninterrupted sequences of CAPS
• Error Rates (Hastie, Tibshirani, Friedman, 2001)
– Training: 3065 emails, Testing: 1536 emails
– Decision tree: error = 8.7%
– Logistic regression: error = 7.6%
– Naïve Bayes: error = 10% (typically)
Treating Missing Data in Trees
• Missing values are common in practice
• Approaches to handling missing values
– During training
• Ignore rows with missing values (inefficient)
– During testing
• Send the example being classified down both branches and average predictions
– Replace missing values with an “imputed value” (can be suboptimal)
• Other approaches
– Treat “missing” as a unique value (useful if missing values are correlated with the class)
– Surrogate splits method
• Search for and store “surrogate” variables/splits during training
Other Issues with Classification Trees
• Why use binary splits?
– Multiway splits can be used, but cause fragmentation
• Linear combination splits?
– can produce small improvements
– optimization is much more difficult (need weights and split point)
– Trees are much less interpretable
• Model instability
– A small change in the data can lead to a completely different tree
– Model averaging techniques (like bagging) can be useful
• Tree “bias”
– Poor at approximating non-axis-parallel boundaries
• Producing rule sets from tree models (e.g., C5.0)
Why Trees are widely used in Practice
• Can handle high dimensional data
– builds a model using 1 dimension at a time
• Can handle any type of input variables
– categorical, real-valued, etc
– most other methods require data of a single type (e.g., only real-valued)
• Invariant to monotonic transformations of input variables
– E.g., using x, 10x + 2, log(x), 2^x, etc, will not change the tree
• Trees are (somewhat) interpretable
– domain expert can “read off” the tree’s logic
• Tree algorithms are relatively easy to code and test
Limitations of Trees
• Representational Bias
– classification: piecewise linear boundaries, parallel to axes
– regression: piecewise constant surfaces
• Trees do not scale well to massive data sets (e.g., N in millions)
– repeated (unpredictable) access of subsets of the data
– e.g., compare to “linear scanning”
• High Variance
– trees can be “unstable” as a function of the sample
• e.g., small change in the data -> completely different tree
– causes two problems
• 1. High variance contributes to prediction error
• 2. High variance reduces interpretability
– Trees are good candidates for model combining
• Often used with boosting and bagging
Decision Trees are not stable
Moving just one example slightly may lead to quite different trees and space partition!
Lack of stability against small perturbation of data.
Figure from Duda, Hart & Stork, Chap. 8
Example of Tree Instability
2 trees fit to 2 splits of data, from G. Ridgeway, 2003
Model Averaging
• Can average over parameters and models
– E.g., weighted linear combination of predictions from multiple models
$y = \sum_k w_k y_k$
– Why? Any prediction from a point estimate of parameters or a single model has only a small chance of being the best
– Averaging makes our predictions more stable and less sensitive to random variations in a particular data set (good for less stable models like trees)
Model Averaging
• Model averaging flavors
– Fully Bayesian: average over uncertainty in parameters and models
– “empirical Bayesian”: learn weights over multiple models
• E.g., stacking and bagging
– Build multiple simple models in a systematic way and combine them, e.g.,
• Bagging
– Build models on random subsets of the data and then combine
– E.g., Random forests: stochastically perturb the data, learn multiple trees, and then combine for prediction
• Stacking/Ensemble methods
– Build multiple different models and then learn to combine them
– Combining weights learned on a different data set than parameter estimation
• Boosting:
– Start with a simple model
– Reweight the training data to emphasize where the model makes errors
– Learn an additional model to “correct” and add to the original model
– Repeat for many iterations
Bagging for Combining Classifiers
• Training data set of size N
• Generate B “bootstrap” sampled data sets of size N
– Bootstrap sample = sample with replacement
– e.g. B = 100
• Build B models (e.g., trees), one for each bootstrap sample
– Intuition is that the bootstrapping “perturbs” the data enough to make the models more resistant to true variability
• For prediction, combine the predictions from the B models
– E.g., for classification p(c | x) = fraction of B models that predict c
– Plus: generally improves accuracy on models such as trees
– Negative: lose interpretability
– Related techniques: random forests, boosting.
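The bagging recipe above can be sketched with very small base models. To keep the example short, it assumes one-dimensional inputs and uses single-threshold "stumps" in place of full trees; the data, the seed, and all function names are invented for illustration.

```python
import random

def fit_stump(xs, ys):
    """Best single-threshold classifier on 1-D data (a very small 'tree').

    Returns (threshold, label_left, label_right) with the fewest training errors.
    """
    best = None
    for t in sorted(set(xs)):
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= t else right) != y for x, y in zip(xs, ys))
            if best is None or errors < best[0]:
                best = (errors, t, left, right)
    return best[1:]

def predict_stump(stump, x):
    t, left, right = stump
    return left if x <= t else right

def bagged_stumps(xs, ys, n_models=100, seed=0):
    """Fit one stump per bootstrap sample (N points drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def bagged_proba(models, x, c=1):
    """p(c | x) estimated as the fraction of the B models that predict c."""
    return sum(predict_stump(m, x) == c for m in models) / len(models)

# Toy data, invented: class 1 becomes more likely as x grows, with one noisy point.
xs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
ys = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
models = bagged_stumps(xs, ys, n_models=100, seed=0)
p_low, p_high = bagged_proba(models, 0.15), bagged_proba(models, 0.95)
```

Because each bootstrap sample places the threshold slightly differently, the vote fraction changes gradually near the class boundary instead of jumping from 0 to 1 as a single stump would.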
[Figure: green = majority vote, purple = averaging the probabilities. From Hastie, Tibshirani, and Friedman, 2001]
Illustration of Boosting:
Color of points = class label
Diameter of points = weight at each iteration
Dashed line: single stage classifier. Green line: combined, boosted classifier
Dotted blue in last two: bagging
(from G. Rätsch, PhD thesis, 2001)
Support Vector Machines
• Support vector machines
– Use a specific loss function, the “margin”
• Results in convex optimization problem, solvable by quadratic programming
– Decision boundary represented by examples (“support vectors”) in training data
– Linear version:
• Uses clever placement of the hyperplane
• Very useful in high-dimensional problems, e.g., text classification
– Non-linear version:
• “kernel trick” for high-dimensional problems
– Some parameter tuning required, e.g., using validation data
– Computational complexity can be O(N^3) without speedups
• Heuristic approximations
• e.g., Platt (1999), Sequential Minimal Optimization (SMO)
– Will discuss SVMs again in future lecture on text classification
Experiments by Komarek and Moore, 2005
Accuracies and Training Time (Komarek and Moore, 2005)
[Figure: results averaged over 8 well-known classification data sets. From Caruana and Niculescu-Mizil, 2005]
Comparison of accuracy across three classifiers: Naive Bayes, Maximum Entropy and Linear
SVM, using three data sets: 20 newsgroups, the Recreation sub-tree of the Open Directory,
and University Web pages from WebKB. From Chakrabarti, 2003, Chapter 5.
Summary on Classifiers
• Simple models (can be effective on some problems)
– Logistic regression
– Naïve Bayes
– K nearest-neighbors
• Decision trees
– Good for high-dimensional problems with different data types
• State of the art:
– Support vector machines
– Boosted trees (e.g., boosting with decision stumps)
• Many tradeoffs in interpretability, score functions, etc
Decision Tree Classifiers
Task: Classification
Representation: Decision boundaries = hierarchy of axis-parallel
Score Function: Cross-validated error
Search/Optimization: Greedy search in tree space
Data Management: None specified
Models, Parameters: Tree
Naïve Bayes Classifier
Task: Classification
Representation: Conditional independence probability model
Score Function: Likelihood
Search/Optimization: Closed form probability estimates
Data Management: None specified
Models, Parameters: Conditional probability tables
Logistic Regression
Task: Classification
Representation: Log-odds(C) = linear function of X’s
Score Function: Log-likelihood
Search/Optimization: Iterative (Newton) method
Data Management: None specified
Models, Parameters: Logistic weights
Nearest Neighbor Classifier
Task: Classification
Representation: Memory-based
Score Function: Cross-validated error (for selecting k)
Search/Optimization: None
Data Management: None specified
Models, Parameters: None
Support Vector Machines
Task: Classification
Representation: Hyperplanes
Score Function: “Margin”
Search/Optimization: Convex optimization (quadratic programming)
Data Management: None specified
Models, Parameters: None
Neural Networks
Task: Regression
Representation: Y = nonlinear function of X’s
Score Function: Least-squares
Search/Optimization: Gradient descent
Data Management: None specified
Models, Parameters: Network weights
Multivariate Linear Regression
Task: Regression
Representation: Y = Weighted linear sum of X’s
Score Function: Least-squares
Search/Optimization: Linear algebra
Data Management: None specified
Models, Parameters: Regression coefficients
Autoregressive Time Series Models
Task: Time Series Regression
Representation: X = Weighted linear sum of earlier X’s
Score Function: Least-squares
Search/Optimization: Linear algebra
Data Management: None specified
Models, Parameters: Regression coefficients
Software for Predictive Modeling
• Research software implementations
– Many very good implementations of algorithms available on the Web from researchers
– E.g., SVMLight by Thorsten Joachims
• Weka
– Free package, useful for classification and regression
• MATLAB
– Many free “toolboxes” on the Web for regression and prediction
– e.g., see http://lib.stat.cmu.edu/matlab/ and in particular the CompStats toolbox
• R
– General purpose statistical computing environment (successor to S)
– Free (!)
– Widely used by statisticians, has a huge library of functions and visualization tools
• Commercial tools
– SAS, other statistical packages
– Data mining packages
– Often are not programmable/customizable: offer a fixed menu of items
Additional Reading
• Chapters 10 and 11 in the text
• Suggested background reading for further information:
– Review paper by Greg Ridgeway on the class Web site
• Thorough and informative
– Elements of Statistical Learning,
• T. Hastie, R. Tibshirani, and J. Friedman, Springer Verlag, 2001
– Learning with Kernels,
• B. Schoelkopf and A. Smola, MIT Press, 2002.
– Classification and Regression Trees,
• Breiman, Friedman, Olshen, and Stone, Wadsworth Press, 1984.
Backup Slides (not used)
Decision Tree Pseudocode
node = tree-design(Data = {X, C})
    # score the best available split on each of the d variables
    for i = 1 to d
        quality_variable(i) = quality_score(Xi, C)
    end
    # split on the variable (and threshold) with the highest quality
    node = {X_split, Threshold} for max{quality_variable}
    if node == leaf?
        # stopping criterion met: do not split further
        return(node)
    else
        {Data_right, Data_left} = split(Data, X_split, Threshold)
        node_right = tree-design(Data_right)
        node_left = tree-design(Data_left)
    end
end
Computational Complexity for a Binary Tree
• At the root node, for each of p variables
– Sort all values, compute quality for each split
– O(pN log N) time for real-valued or ordinal variables
• Subsequent internal node operations each take O(N’ log N’)
– e.g., balanced tree of depth K requires ….. Homework 2 problem
• This assumes data are in main memory
– If data are on disk then repeated access of subsets at different nodes may be very slow (impossible to pre-index)
– Note: time difference between retrieving data in RAM and data on disk may be O(10^3) or more.
Splitting on a nominal attribute
• Nominal attribute with m values
– e.g., the name of a state or a city in marketing data
• $2^{m-1}$ possible subsets => exhaustive search is $O(2^{m-1})$
– For small m, a simple approach is to branch on specific values
– But for large m this may not work well
• Neat trick for the 2-class problem:
– For each predictor value calculate the proportion of class 1’s
– Order the m values according to these proportions
– Now treat as an ordinal variable and select the best split (linear in m)
– This gives the optimal split for the Gini index, among all possible $2^{m-1}$ splits (Breiman et al, 1984).
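The ordering trick for a nominal attribute in the 2-class case can be sketched as follows. The code orders the attribute values by their proportion of class 1 and then scans only the m - 1 splits of that ordering, scoring each with a Gini-based Q. The marketing-style data and the function name are invented, and the sketch assumes labels coded 0/1 and at least two distinct attribute values.

```python
from collections import defaultdict

def best_nominal_split(values, labels):
    """Best binary split of a nominal attribute for a 2-class problem (labels 0/1).

    Returns (set of attribute values sent to the left branch, Q of that split).
    """
    counts = defaultdict(lambda: [0, 0])          # value -> [n, number of class 1]
    for v, y in zip(values, labels):
        counts[v][0] += 1
        counts[v][1] += y
    # Order the m values by their proportion of class 1's.
    ordered = sorted(counts, key=lambda v: counts[v][1] / counts[v][0])

    def weighted_gini(n, ones):
        p = ones / n
        return n * 2.0 * p * (1.0 - p)            # N * sum_k p_k (1 - p_k), 2 classes

    total_n, total_ones = len(labels), sum(labels)
    best = None
    left_n = left_ones = 0
    for i, v in enumerate(ordered[:-1]):          # keep at least one value on the right
        left_n += counts[v][0]
        left_ones += counts[v][1]
        q = (weighted_gini(left_n, left_ones)
             + weighted_gini(total_n - left_n, total_ones - left_ones))
        if best is None or q < best[1]:
            best = (set(ordered[: i + 1]), q)
    return best

# Toy marketing-style data, invented: state of residence and a 0/1 response.
states = ["CA", "CA", "NY", "NY", "TX", "TX", "WA", "WA"]
bought = [1, 1, 0, 0, 1, 0, 0, 0]
left_values, q = best_nominal_split(states, bought)
```

Only m - 1 candidate splits are scored after the sort, rather than every possible subset of the m values.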