Classification and Supervised Learning
Credits
Hand, Mannila and Smyth
Cook and Swayne
Padhraic Smyth’s notes
Shawndra Hill notes
Data Mining - Volinsky - 2011 - Columbia University
Classification
• Classification, or supervised learning
– prediction for a categorical response
• e.g. T/F, color, etc.
• often quantized real values or unscaled numeric codes
– can be used with categorical predictors
– can be used for missing data – treat the missing variable as a response in itself!
– methods for fitting can be
• parametric (e.g. linear discriminant analysis)
• algorithmic (e.g. trees)
• regression-based (e.g. logistic regression with a threshold on the response probability)
• Because labels are known, you can build parametric models for the classes
• You can also define decision regions and decision boundaries
Types of classification models
• Probabilistic, based on p( x | ck ),
– Naïve Bayes
– Linear discriminant analysis
• Regression-based, based on p( ck | x )
– Logistic regression: linear predictor of the logit
– Neural networks: a non-linear extension of logistic regression
• Discriminative models: focus on locating optimal decision boundaries
– Decision trees: Most popular
– Support vector machines (SVM): currently trendy, computationally complex
– Nearest neighbor: simple, elegant
Evaluating Classifiers
• Classifiers predict the class for new data
– some models also give class probability estimates
• Simplest measure: accuracy = % classified correctly (in-sample or out-of-sample)
– not always a great idea – e.g. with rare classes like fraud, always predicting "no fraud" is highly accurate but useless
• Recall: ROC area
– area under the ROC curve (see the sketch below)
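A minimal sketch of computing both metrics on a held-out test set, assuming scikit-learn and a synthetic, imbalanced two-class dataset (a stand-in for something like fraud; none of this is from the course materials):

# Sketch: accuracy vs. ROC AUC on a held-out test set (scikit-learn assumed)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

# synthetic, imbalanced two-class data: roughly 95% negatives, 5% positives
X, y = make_classification(n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

acc = accuracy_score(y_test, clf.predict(X_test))              # % classified correctly
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])   # area under the ROC curve
print(f"accuracy = {acc:.3f}, ROC AUC = {auc:.3f}")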
Linear Discriminant Analysis
• LDA - parametric classification (see the sketch below)
– assume a multivariate normal distribution for each class, with equal covariance structure
– decision boundaries are a linear combination of the variables
– compare the difference between class means with the variance within each class
– pros:
• easy to define the likelihood
• easy to define the boundary
• easy to measure goodness of fit
• easy to interpret
– cons:
• very rare for data to come close to multivariate normal!
• works only on numeric predictors
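A minimal sketch of LDA as a classifier, assuming scikit-learn and using the iris data as a stand-in for the flea beetles example on the next slide:

# Sketch: LDA with a pooled (equal) covariance structure across classes
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix

X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis().fit(X, y)

pred = lda.predict(X)                          # in-sample predictions
print(confusion_matrix(y, pred))               # confusion matrix by class
print("in-sample error:", (pred != y).mean())  # better to estimate this by cross-validation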
LDA
• Flea Beetles data
– Clear classification rule for new data
True class   Predicted 1   Predicted 2   Predicted 3   Error
1            20            0             1             0.048
2            0             22            0             0.000
3            3             0             28            0.097
Total                                                  0.054
In-sample misclassification rate = 5.4%
Better to do X-val
Courtesy Cook/Swayne
Classification (Decision) Trees
• Trees are one of the most popular and useful of all data mining models
• Algorithmic (rather than parametric) approach to classification
• Pros:
– no distributional assumptions
– can handle real and nominal inputs
– speed and scalability
– robustness to outliers and missing values
– interpretability
– compactness of classification rules
• Cons:
– interpretability? (large trees can be hard to read)
– several tuning parameters to set with little guidance
– decision boundary is non-continuous
Decision Tree Example
Example: do people pay their bills? (Courtesy P. Smyth)
[Figure sequence: scatterplot of Debt vs. Income for the two classes, partitioned step by step by the splits Income > t1, then Debt > t2, then Income > t3; the "??" regions are resolved at each step.]
Note: tree boundaries are piecewise linear and axis-parallel.
Example: Titanic Data
• On the Titanic
– 1313 passengers
– 34% survived
– was it a random sample?
– or did survival depend on features of the individual?
• sex
• age
• class
   pclass  survived  name                                             age      embarked     sex
1  1st     1         Allen, Miss Elisabeth Walton                     29.0000  Southampton  female
2  1st     0         Allison, Miss Helen Loraine                      2.0000   Southampton  female
3  1st     0         Allison, Mr Hudson Joshua Creighton              30.0000  Southampton  male
4  1st     0         Allison, Mrs Hudson J.C. (Bessie Waldo Daniels)  25.0000  Southampton  female
5  1st     1         Allison, Master Hudson Trevor                    0.9167   Southampton  male
6  2nd     1         Anderson, Mr Harry                               47.0000  Southampton  male
Decision trees
• At the first 'split', decide which variable best separates the survivors from the non-survivors:
[Tree diagram]
Root (all passengers): N = 1313, p(survived) = 0.34
  Male: N = 850, p = 0.16
    Age < 12: N = 29, p = 0.73
    Age >= 12: N = 821, p = 0.15
      1st class: N = 175, p = 0.31
      2nd or 3rd class: N = 646, p = 0.10
  Female: N = 463, p = 0.66
    1st or 2nd class: N = 250, p = 0.91
    3rd class: N = 213, p = 0.37
Goodness of a split is determined by the 'purity' of the leaves.
Decision Tree Induction
• Basic algorithm (a greedy algorithm)
– the tree is constructed in a top-down, recursive, divide-and-conquer manner
– at the start, all the training examples are at the root
– examples are partitioned recursively to maximize purity
• Conditions for stopping partitioning
– all samples in a node belong to the same class
– leaf node smaller than a specified threshold
– tradeoff between complexity and generalizability
• Predictions for new data (see the sketch below):
– classification by majority vote among the training examples that ended up in the leaf
– class probability estimates based on that training data can be used as well
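A minimal sketch of fitting a tree and reading off both the majority-vote class and the leaf class proportions, assuming scikit-learn rather than whatever software the course used:

# Sketch: greedy top-down tree induction via scikit-learn's DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# min_samples_leaf plays the role of the "leaf smaller than a threshold" stopping rule
tree = DecisionTreeClassifier(criterion="gini", min_samples_leaf=10, random_state=0).fit(X, y)

print(export_text(tree))           # the fitted splits, readable as rules
print(tree.predict(X[:3]))         # majority-vote class of each point's leaf
print(tree.predict_proba(X[:3]))   # class proportions of the training data in that leaf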
Determining optimal splits via Purity
• Purity can be measured by the Gini index or entropy
– for node n with m classes and class proportions p_1, ..., p_m:
Gini(n) = 1 - (p_1^2 + ... + p_m^2)        Entropy(n) = - sum over k of p_k log(p_k)
• The goodness of a split s (resulting in two nodes s1 and s2, with n_1 and n_2 of the node's n observations) is assessed by the weighted Gini of s1 and s2:
Purity(s) = (n_1 / n) Gini(s1) + (n_2 / n) Gini(s2)
Example
• Two-class problem:
• 400 observations in each class: (400, 400)
• Calculate the Gini for each candidate split (worked out in the sketch below):
• Split A:
– (300, 100) and (100, 300)
• Split B:
– (200, 400) and (200, 0)
• What about the misclassification rate?
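A short worked sketch of the comparison in plain Python, using the weighted-Gini criterion from the previous slide (the helper functions are ad hoc, not from any library):

# Sketch: weighted Gini and misclassification rate for splits A and B
def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def misclass(counts):
    return 1 - max(counts) / sum(counts)

def weighted(impurity, children):
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * impurity(c) for c in children)

split_a = [(300, 100), (100, 300)]
split_b = [(200, 400), (200, 0)]

print("root Gini:", gini((400, 400)))                        # 0.5
print("A weighted Gini:", weighted(gini, split_a))           # 0.375
print("B weighted Gini:", weighted(gini, split_b))           # ~0.333 -> B is preferred by Gini
print("A misclassification:", weighted(misclass, split_a))   # 0.25
print("B misclassification:", weighted(misclass, split_b))   # 0.25 -> a tie; misclassification can't separate them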
Finding the right size
• Use a hold-out sample (n-fold cross-validation):
• overfit a tree - with many leaves
• snip the tree back and use the hold-out sample for prediction; calculate the predictive error
• record the error rate for each tree size
• repeat for n folds
• plot the average error rate as a function of tree size
• fit a tree of the optimal size to the entire data set (see the sketch below)
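A minimal sketch of this procedure, assuming scikit-learn, with max_leaf_nodes standing in for "tree size" and the iris data as a placeholder:

# Sketch: pick tree size by cross-validated error, then refit on all the data
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

sizes = list(range(2, 15))   # candidate numbers of leaves
cv_error = [1 - cross_val_score(DecisionTreeClassifier(max_leaf_nodes=k, random_state=0),
                                X, y, cv=10).mean()
            for k in sizes]

best_k = sizes[int(np.argmin(cv_error))]
final_tree = DecisionTreeClassifier(max_leaf_nodes=best_k, random_state=0).fit(X, y)
print("chosen tree size:", best_k)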
Finding the right size: Iris data
[Figure: cross-validated error rate as a function of tree size for the iris data.]
Multi-class example
• Be careful with problems that have more than 2 classes
– the fitted tree might never predict some of the classes
Notes on X-Validation with Trees
To do n-fold x-validation:
split the data into n folds
use each fold to find the optimal number of nodes
average the results across folds to pick the overall optimum size k
the 'final model' is the tree of size k fit on ALL the data
However, if the best trees in each fold are very different (e.g. have different terminal nodes), this is a cause for alarm.
Regression Trees
• Trees can also be used for regression, i.e. when the response is real valued
– the leaf prediction is the mean response in the leaf instead of class probability estimates
– can use variance as a purity measure
– helpful with categorical predictors (see the sketch below)
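A minimal sketch of a regression tree, assuming scikit-learn plus the copy of the tips data that ships with seaborn (downloaded on first use); the choice of predictors here is just illustrative:

# Sketch: regression tree on the tips data
import seaborn as sns
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

tips = sns.load_dataset("tips")                # total_bill, tip, sex, smoker, day, time, size
X = pd.get_dummies(tips.drop(columns="tip"))   # one-hot encode the categorical predictors
y = tips["tip"]

reg = DecisionTreeRegressor(max_leaf_nodes=5, random_state=0)   # variance-reducing splits
reg.fit(X, y)
print(export_text(reg, feature_names=list(X.columns)))          # each leaf predicts the mean tip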
Tips data
[Figure: regression tree example on the tips data.]
Treating Missing Data in Trees
• Missing values are common in practice
• Approaches to handling missing values:
– some algorithms can handle missing data automatically, during both training and testing
• e.g. send the example being classified down both branches and average the predictions
– treat "missing" as a unique value (if the variable is categorical)
Extensions of Classification Trees
• Can use non-binary splits
– multi-way splits
– they tend to increase complexity substantially without improving performance
– binary splits are interpretable, even by non-experts
– and are easy to compute and visualize
• Can also consider linear-combination splits
– can improve predictive performance, but hurts interpretability
– harder to optimize
• Loss functions
– some errors may be more costly than others
– costs can be incorporated into the Gini calculation
• Plain old trees usually work quite well
Why Trees are widely used in Practice
• Can handle high-dimensional data
– builds a model using one dimension at a time
• Can handle any type of input variable
– including categorical predictors
• Invariant to monotonic transformations of the input variables
– e.g., using x, 10x + 2, log(x), 2^x, etc. will not change the tree
– so scaling is not a factor - the user can be sloppy!
• Trees are (somewhat) interpretable
– a domain expert can "read off" the tree's logic as rules
• Tree algorithms are relatively easy to code and test
Limitations of Trees
• Difficulty in modelling linear structure
• Lack of smoothness
• High Variance
– trees can be “unstable” as a function of the sample
• e.g., small change in the data -> completely different tree
– causes two problems
• 1. High variance contributes to prediction error
• 2. High variance reduces interpretability
– Trees are good candidates for model combining
• Often used with boosting and bagging
Decision Trees are not stable
Moving just one example slightly may lead to quite different trees and a different partition of the space!
Trees lack stability against small perturbations of the data.
[Figure from Duda, Hart & Stork, Chap. 8]
Random Forests
• Another con for trees:
– trees are sensitive to the primary split, which can lead the tree in inappropriate directions
– one way to see this: fit a tree on a random sample, or a bootstrapped sample, of the data - the trees can come out looking very different
• Solution:
– random forests: an ensemble of unpruned decision trees
– each tree is built on a random subset (or bootstrap sample) of the training data
– at each split point, only a random subset of the predictors is considered
– the prediction is simply a majority vote of the trees (or the mean prediction of the trees)
• Has the advantages of trees, with more robustness and a smoother decision rule (see the sketch below)
• More on this later, but worth knowing about now
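A minimal sketch, assuming scikit-learn, whose RandomForestClassifier follows the same recipe (bootstrap samples of the data plus a random subset of predictors at each split):

# Sketch: random forest = many unpruned trees on bootstrap samples + random feature subsets
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

rf = RandomForestClassifier(
    n_estimators=500,      # number of trees in the ensemble
    max_features="sqrt",   # random subset of predictors considered at each split
    bootstrap=True,        # each tree sees a bootstrap sample of the training data
    random_state=0,
)
print("CV accuracy:", cross_val_score(rf, X, y, cv=10).mean())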
Other Models: k-NN
• k-Nearest Neighbors (kNN)
• to classify a new point:
– look at its k nearest neighbors in the training set
– what is the class distribution of these neighbors? predict the majority class
K-nearest neighbor
• Advantages
– simple to understand
– simple to implement - nonparametric
• Disadvantages
– what is k?
• k = 1: high variance, sensitive to the data
• k large: robust, reduces variance, but blends everything together - includes 'far away' points
– what is near?
• Euclidean distance assumes all inputs are equally important
• how do you deal with categorical data?
– no interpretable model
• Best to use cross-validation to pick k (see the sketch below).
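A minimal sketch of picking k by cross-validation, assuming scikit-learn; the features are standardized first because Euclidean distance treats all inputs as equally important:

# Sketch: k-NN with k chosen by cross-validation; inputs standardized first
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

knn = make_pipeline(StandardScaler(), KNeighborsClassifier())
search = GridSearchCV(knn, {"kneighborsclassifier__n_neighbors": list(range(1, 26))}, cv=10)
search.fit(X, y)

print("best k:", search.best_params_["kneighborsclassifier__n_neighbors"])
print("CV accuracy:", search.best_score_)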
Probabilistic (Bayesian) Models for Classification
Bayes rule (as applied to classification):
If you belong to class ck, you have a distribution over input vectors x: p( x | ck )
Given priors p( ck ), we can get the posterior distribution over classes:
p( ck | x ) = p( x | ck ) p( ck ) / p( x ),   where p( x ) = sum over k of p( x | ck ) p( ck )
At each point in x space we then have a vector of predicted class probabilities, allowing for decision boundaries.
Example of Probabilistic Classification
[Figure: two overlapping class-conditional densities p( x | c1 ) and p( x | c2 ) on a one-dimensional x axis, together with the posterior p( c1 | x ), which runs between 0 and 1 and crosses 0.5 where the two densities are equal.]
Decision Regions and Bayes Error Rate
[Figure: the class-conditional densities p( x | c1 ) and p( x | c2 ), with the x axis divided into regions labelled Class c1 or Class c2 according to which class is more likely there.]
Optimal decision regions = regions where one class is more likely than the other
Optimal decision regions -> optimal decision boundaries
Decision Regions and Bayes Error Rate
[Figure: the same class-conditional densities; within each decision region, the area under the other class's density (shaded) is the error.]
Under certain conditions we can estimate the BEST-case error, IF our model is correct.
Bayes error rate = fraction of examples misclassified by the optimal classifier (the shaded area above):
p(error) = integral over x of [ 1 - max_k p( ck | x ) ] p( x ) dx
If max_k p( ck | x ) = 1 everywhere, then there is no error.
Procedure for optimal Bayes classifier
• For each class, learn a model p( x | ck )
– e.g., each class is multivariate Gaussian with its own mean and covariance (see the sketch below)
• Use Bayes rule to obtain p( ck | x )
=> this yields the optimal decision regions/boundaries
=> use these decision regions/boundaries for classification
• Correct in theory… but practical problems include:
– how do we model p( x | ck )?
– even if we know the form of p( x | ck ), estimating a distribution or density is very difficult in high dimensions (e.g., p = 100)
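A minimal sketch of this recipe under the Gaussian assumption, spelled out with scipy and numpy (scikit-learn's QuadraticDiscriminantAnalysis packages the same idea); the iris data is just a placeholder:

# Sketch: Bayes classifier with Gaussian class-conditional models
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

# learn p(x | ck) as a Gaussian per class, plus the priors p(ck)
models = [multivariate_normal(X[y == k].mean(axis=0), np.cov(X[y == k].T)) for k in classes]
priors = [np.mean(y == k) for k in classes]

def posterior(x):
    """Bayes rule: p(ck | x) proportional to p(x | ck) p(ck)."""
    joint = np.array([m.pdf(x) * p for m, p in zip(models, priors)])
    return joint / joint.sum()

print(posterior(X[0]))                      # class probabilities for one point
print(classes[np.argmax(posterior(X[0]))])  # predicted class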
Naïve Bayes Classifiers
• To simplify things in high dimensions, make a conditional independence assumption on p( x | ck ), i.e.
p( x | ck ) = p( x1 | ck ) p( x2 | ck ) ··· p( xp | ck )
• Typically used with categorical variables
– real-valued variables are discretized to create nominal versions
• Comments:
– simple to train (estimate conditional probabilities for each feature-class pair)
– often works surprisingly well in practice
• e.g., state of the art for text classification; the basis of many widely used spam filters
Play-tennis example: estimating P(C=win|x)
Training data (14 days; response = Win?):

Outlook    Temperature  Humidity  Windy  Win?
sunny      hot          high      false  N
sunny      hot          high      true   N
overcast   hot          high      false  Y
rain       mild         high      false  Y
rain       cool         normal    false  Y
rain       cool         normal    true   N
overcast   cool         normal    true   Y
sunny      mild         high      false  N
sunny      cool         normal    false  Y
rain       mild         normal    false  Y
sunny      mild         normal    true   Y
overcast   mild         high      true   Y
overcast   hot          normal    false  Y
rain       mild         high      true   N

Priors: P(y) = 9/14, P(n) = 5/14

outlook:      P(sunny|y) = 2/9     P(sunny|n) = 3/5
              P(overcast|y) = 4/9  P(overcast|n) = 0
              P(rain|y) = 3/9      P(rain|n) = 2/5
temperature:  P(hot|y) = 2/9       P(hot|n) = 2/5
              P(mild|y) = 4/9      P(mild|n) = 2/5
              P(cool|y) = 3/9      P(cool|n) = 1/5
humidity:     P(high|y) = 3/9      P(high|n) = 4/5
              P(normal|y) = 6/9    P(normal|n) = 1/5
windy:        P(true|y) = 3/9      P(true|n) = 3/5
              P(false|y) = 6/9     P(false|n) = 2/5
Play-tennis example: classifying X
• An unseen sample X = <rain, hot, high, false>
• P(X|y)·P(y) =
P(rain|y)·P(hot|y)·P(high|y)·P(false|y)·P(y) =
3/9·2/9·3/9·6/9·9/14 = 0.010582
• P(X|n)·P(n) =
P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) = 2/5·2/5·4/5·2/5·5/14
= 0.018286
• Sample X is classified in class n (you'll lose!) - see the sketch below
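A short sketch that reproduces this calculation from the conditional probability tables on the previous slide (plain Python; Fraction just keeps the arithmetic exact):

# Sketch: naive Bayes scores p(X | c) * p(c) for X = <rain, hot, high, false>
from fractions import Fraction as F

priors = {"y": F(9, 14), "n": F(5, 14)}
cond = {  # p(feature value | class), read off the probability tables
    "y": {"rain": F(3, 9), "hot": F(2, 9), "high": F(3, 9), "false": F(6, 9)},
    "n": {"rain": F(2, 5), "hot": F(2, 5), "high": F(4, 5), "false": F(2, 5)},
}

x = ["rain", "hot", "high", "false"]
for c in ("y", "n"):
    score = priors[c]
    for value in x:
        score *= cond[c][value]
    print(c, float(score))   # y: ~0.0106, n: ~0.0183 -> predict n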
The independence hypothesis…
• … makes computation possible
• … yields optimal classifiers when satisfied
• … but is seldom satisfied in practice, as attributes (variables) are often correlated.
• Yet, empirically, naïve Bayes performs really well in practice.
Naïve Bayes
Estimate of the probability that a point x will belong to class ck:
p( ck | x ) ∝ p( ck ) ∏ j=1..p p( xj | ck )
If there are two classes, we look at the ratio of the two probabilities - the "weights of evidence".