Transcript Document

Machine Learning
Georg Pölzlbauer
December 11, 2006
Outline

- Exercises
- Data Preparation
- Decision Trees
- Model Selection
- Random Forests
- Support Vector Machines
Exercises

- Groups of 2 or 3 students
- UCI ML Repository: pick 3 data sets with different characteristics (i.e. number of samples, number of dimensions, number of classes)
- Estimate the classification error with 3 classifiers of choice; compare results
- Estimate appropriate parameters for these classifiers
- Implement in Matlab, R, WEKA, YALE, or KNIME
Exercises: Software

- Matlab
- YALE: http://rapid-i.com/
- WEKA: http://www.cs.waikato.ac.nz/ml/weka/
- KNIME: http://www.knime.org/
- R: http://www.r-project.org/
Exercises: Software




WEKA: recommended; easy to use,
easy to learn, no programming
KNIME, YALE: also easy to use
R: most advanced and powerful
software; do not use if you do not
know R really well!
Matlab: not recommended; requires
installation of packages from
internet etc.
Exercises: Written Report

- The report should be 5-10 pages
- Discuss the characteristics of the data sets (e.g. handling of missing values, scaling, etc.)
- Summarize the classifiers used (one paragraph each)
- Discuss the experimental results (tables, figures)
- Do not include code in the report
Exercises: How to proceed

- It is not necessary to implement anything; rely on libraries, modules, etc.
- UCI ML Repository: http://www.ics.uci.edu/~mlearn/MLSummary.html
- Import the data file, scale the data, apply model selection, and write down any problems/findings
Grading

- No written/oral exam
- Submission of the report at the end of January
- Approx. 15 minutes of discussion of results and code (individually for each group)
- Grading bonus: use of sophisticated models, detailed comparison of classifiers, thorough discussion of experiments, justification of choices
Questions?

- Questions regarding theory:
  - [email protected]
  - [email protected]
- Questions regarding R, WEKA, …:
  - Forum
Machine Learning: Setting

gender  age  smoker  eye color  lung cancer
male    19   yes     green      no
female  44   yes     gray       yes
male    49   yes     blue       yes
male    12   no      brown      no
female  37   no      brown      no
female  60   no      brown      yes
male    44   no      blue       no
female  27   yes     brown      no
female  51   yes     green      yes
female  81   yes     gray       no
male    22   yes     brown      no
male    29   no      blue       no
male    77   yes     gray       ?
male    19   yes     green      ?
female  44   no      gray       ?
Machine Learning: Setting

(Same table as above; the labeled rows are fed into training: Train -> ML Model.)
Machine Learning: Setting

(Same table as above, after training: the ML model fills in the three unknown lung cancer labels with the predictions yes, no, and no.)
Data Preparation

- Example: adult census data
- Table format data (data matrix)
- Missing values
- Categorical data
- Quantitative (continuous) data with different scales
Categorical variables

- Non-numeric variables with a finite number of levels
- E.g. "red", "blue", "green"
- Some ML algorithms can only handle numeric variables
- Solution: 1-to-N coding
1-to-N Coding

feature  red  blue  green
red      1    0     0
blue     0    1     0
green    0    0     1
red      1    0     0
red      1    0     0
green    0    0     1
blue     0    1     0
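A minimal Python sketch of 1-to-N coding, assuming the feature is given as a plain list of categorical values (the helper name one_to_n and the column ordering are illustrative choices):

```python
# 1-to-N (one-hot) coding: one indicator column per level.
def one_to_n(values):
    levels = list(dict.fromkeys(values))   # levels in order of first appearance
    return [[1 if v == level else 0 for level in levels] for v in values]

features = ["red", "blue", "green", "red", "red", "green", "blue"]
for value, coded in zip(features, one_to_n(features)):
    print(value, coded)   # e.g. red [1, 0, 0]
```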
Scaling of continuous variables



Many ML algorithms rely on measuring
the distance between 2 samples
There should be no difference if a length
variable is measured in cm, inch, or km
To remove the unit of measure (e.g. kg,
mph, …) each variable dimension is
normalized:


subtract mean
divide by standard deviation
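A minimal sketch of this normalization (z-score scaling), assuming the values of one variable are given as a plain Python list; in practice a library routine would be used:

```python
import math

def standardize(values):
    """Subtract the mean and divide by the standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

heights_cm = [172.0, 180.5, 165.2, 190.1, 158.7]   # hypothetical sample
print(standardize(heights_cm))   # result has mean ~0 and variance ~1
```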
Scaling of continuous variables

- The data set now has mean 0 and variance 1
- Chebyshev's inequality:
  - at least 75% of the data lies between -2 and +2
  - at least 89% of the data lies between -3 and +3
  - at least 94% of the data lies between -4 and +4
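For reference, the underlying inequality (not spelled out on the slide) is:

```latex
P\bigl(|X - \mu| \ge k\sigma\bigr) \le \frac{1}{k^{2}}
\qquad\Longrightarrow\qquad
P\bigl(|X - \mu| < k\sigma\bigr) \ge 1 - \frac{1}{k^{2}}
\quad (k = 2, 3, 4 \Rightarrow 75\%,\ \approx 89\%,\ \approx 94\%)
```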
Output variables

- ML (here: classification) requires a categorical output (continuous output = regression)
- ML methods can be applied by binning a continuous output (with a loss of prediction accuracy)
- Example: household income between $10,000 and $200,000, binned into very low / low / average / high / very high
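A minimal sketch of such binning; the cut-off values below are hypothetical and not taken from the slide:

```python
# Binning a continuous output (household income) into categories.
# The thresholds are illustrative assumptions.
def bin_income(income):
    if income < 25_000:
        return "very low"
    if income < 60_000:
        return "low"
    if income < 100_000:
        return "average"
    if income < 150_000:
        return "high"
    return "very high"

print([bin_income(x) for x in (10_000, 75_000, 200_000)])
# -> ['very low', 'average', 'very high']
```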
Binary Decision Trees

- Rely on Information Theory (Shannon)
- Recursive algorithm that splits the feature space into 2 areas at each recursion step
- Classification works by going through the tree from the root node until arriving at a leaf node
Decision Trees: Example

(Figure: a binary decision tree over two features x and y, with internal nodes x < 12.3, y < 4.6, y < 3.9, x < 11.7, x < 13.1, y < 1.7 and leaf nodes labeled red or blue.)
Information Theory, Entropy

- Introduced by Claude Shannon
- Applications in data compression
- Concerned with measuring actual information vs. redundancy
- Measures information in bits
What is "Entropy"?

- In Machine Learning, Entropy is a measure of the impurity of a set
- High Entropy => bad for prediction
- High Entropy => needs to be reduced (Information Gain)

H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)
Calculating H(X)

p(x_red)  = 4/12 = 0.33
p(x_blue) = 8/12 = 0.67

H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)
     = -0.33 \log_2 0.33 - 0.67 \log_2 0.67
     = -(0.33)(-1.59) - (0.67)(-0.59)
     = 0.53 + 0.39
     = 0.92
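The same calculation as a minimal Python sketch, assuming class counts of 4 red and 8 blue points:

```python
import math

def entropy(counts):
    """H(X) = -sum p(x_i) * log2 p(x_i) over the class frequencies."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(round(entropy([4, 8]), 2))   # -> 0.92
```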
H(X): Case studies

      p(x_red)  p(x_blue)  H(X)
I     0.5       0.5        1
II    0.3       0.7        0.88
III   0.7       0.3        0.88
IV    0         1          0

H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)

(Figure: H(X) plotted as a function of p(x_red) from 0 to 1, with a maximum of 1 at p(x_red) = 0.5 and a value of 0 at p(x_red) = 0 and 1.)
H(X): Relative vs. absolute frequencies

      red  blue
I     8    4
II    18   9

=> H(X_I) = H(X_II)
Only relative frequencies matter!
Information Gain

- Given a set and a choice between possible subsets, which one is preferable?
- Information Gain: prefer the subsets that reduce the Entropy by the largest amount

IG(X_A, X_B) = H(X) - p(x_A) H(X_A) - p(x_B) H(X_B)
Example: a set with H(X) = 1 (10 points: 5 red, 5 blue) is split into subsets A (green) and B (yellow); three candidate splits are compared:

           Split 1        Split 2        Split 3
           A      B       A      B       A      B
Points     5      5       9      1       6      4
p(X.)      0.5    0.5     0.9    0.1     0.6    0.4
p(x_red)   0.2    0.8     0.44   1       0.33   0.75
p(x_blue)  0.8    0.2     0.56   0       0.67   0.25
H(X.)      0.72   0.72    0.99   0       0.92   0.81
IG         0.28           0.11           0.12
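A minimal Python sketch that reproduces the IG values above; the class counts per subset are reconstructed from the relative frequencies in the table:

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, subsets):
    n = sum(parent_counts)
    return entropy(parent_counts) - sum(
        (sum(s) / n) * entropy(s) for s in subsets)

parent = [5, 5]                      # 5 red, 5 blue points, H(X) = 1
splits = {
    "Split 1": ([1, 4], [4, 1]),     # (red, blue) counts in subsets A and B
    "Split 2": ([4, 5], [1, 0]),
    "Split 3": ([2, 4], [3, 1]),
}
for name, (a, b) in splits.items():
    print(name, round(information_gain(parent, [a, b]), 2))
# -> Split 1 0.28, Split 2 0.11, Split 3 0.12
```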
Information Gain (Properties)

- IG is at most as large as the Entropy of the original set
- IG is the amount by which the original Entropy can be reduced by splitting into subsets
- IG is at least zero (if the Entropy is not reduced)
- 0 <= IG <= H(X)
Building (binary) Decision Trees

- Data set: categorical or quantitative variables
- Iterate over the variables and calculate the IG for every possible split (see the sketch below):
  - categorical variables: one level vs. the rest
  - quantitative variables: sort the values, try a split between each pair of adjacent values
- Recurse until the prediction is good enough
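A minimal sketch of the quantitative case: finding the threshold with the highest IG for a single variable (the function names are illustrative, not from the slides):

```python
import math

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def best_threshold(values, labels):
    """Try a split between each pair of adjacent sorted values, keep the best IG."""
    pairs = sorted(zip(values, labels))
    h_parent = entropy(labels)
    best_ig, best_t = 0.0, None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                              # no threshold between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v < t]
        right = [lab for v, lab in pairs if v >= t]
        ig = (h_parent
              - len(left) / len(pairs) * entropy(left)
              - len(right) / len(pairs) * entropy(right))
        if ig > best_ig:
            best_ig, best_t = ig, t
    return best_t, best_ig

print(best_threshold([1.0, 2.0, 3.0, 10.0, 11.0],
                     ["red", "red", "red", "blue", "blue"]))
# -> (6.5, ~0.97): splitting at 6.5 separates the two classes perfectly
```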
Decision Trees: Quantitative variables

(Figure: a two-dimensional scatter plot with the candidate split positions annotated by their Information Gain values, the best candidate reaching 0.43; original H: 0.99.)
Decision Trees: Quantitative variables

(Figure: the decision tree grown on this data, with internal nodes x < 12.3, y < 4.6, y < 3.9, x < 11.7, x < 13.1, y < 1.7 and leaf nodes labeled red or blue.)
Decision Trees: Classification

(Figures, shown over several slides: a new sample is classified by starting at the root node x < 12.3 and following the matching branches through the internal tests until a leaf node labeled red or blue is reached.)
Decision Trees: More than 2 classes

(Figure: a decision tree whose leaf nodes carry three classes: red, orange, and blue.)

IG(X_1, ..., X_m) = H(X) - \sum_{j=1}^{m} p(x_j) H(X_j)

H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)
Decision Trees: Non-binary trees

(Figure: a non-binary split on the categorical variable drive-wheels, with one branch each for fwd, rwd, and 4wd.)

IG(X_1, ..., X_m) = H(X) - \sum_{j=1}^{m} p(x_j) H(X_j)

H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)
Decision Trees: Overfitting

- Fully grown trees are usually too complicated
Decision Trees: Stopping Criteria

- Stop when the absolute number of samples is low (below a threshold)
- Stop when the Entropy is already relatively low (below a threshold)
- Stop if the IG is low
- Stop if the decision could be random (Chi-Square test)
- The threshold values are hyperparameters
Decision Trees: Pruning

- "Pruning" means removing nodes from a tree after training has finished
- Stopping criteria are sometimes referred to as "pre-pruning"
- Redundant nodes are removed; sometimes the tree is remodeled
Example: Pruning

(Figure: the decision tree from the earlier example, with redundant subtrees collapsed into single leaf nodes after pruning.)
Decision Trees: Stability

(Figures, shown over several slides: trees trained on slightly different samples of the same data; small changes in the data lead to quite different tree structures and decision boundaries.)
Model Selection

- General ML framework
- Takes care of estimating hyperparameters
- Takes care of selecting a good model (avoiding local minima)
Why is Generalization an Issue?

(Figures, shown over several slides: a one-dimensional data set with values between 140 and 200 and class labels m and f; different decision boundaries separate the training data to different degrees, and a boundary that fits the sample perfectly is not necessarily the best choice for new data.)
Bayes Optimal Classifier

(Figure: the same one-dimensional data set, values between 140 and 200, illustrating the Bayes optimal decision boundary.)
Training Set, Test Set

- Solution: split the data into a training set and a test set (see the sketch below)
  - ~80% training, ~20% test
- Train different models
- Classify the test set
- Pick the one model that has the lowest test set error
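A minimal sketch of such an 80/20 split, assuming the data is a list of (features, label) pairs; in practice a library helper would do this:

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and hold out a fraction of them as the test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # (training set, test set)

data = [([float(i)], i % 2) for i in range(100)]  # hypothetical toy data
train, test = train_test_split(data)
print(len(train), len(test))   # -> 80 20
```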
Trade-off complexity vs. generalization

(Figure: Error plotted against Model Complexity, with one curve for the training set and one for the test set; the point labeled "Minimum of Test Set Error" marks the preferred complexity.)
Estimation of Generalization Error

- The test set is used in model selection and for tuning parameters
- Thus, the test set error is not an accurate estimate of the generalization error
- The generalization error is the expected error that the classifier will make on new, unseen data
Training-Test-Validation

- Save part of the data set for validation
- Split, e.g.:
  - 60% training set
  - 20% test set
  - 20% validation set
- Train the classifiers on the training set
- Select a classifier based on test set performance
- Estimate the generalization error on the validation set
Crossvalidation

- Split the data into 10 parts of equal size
- This is called 10-fold crossvalidation
- Repeat 10 times (see the sketch below):
  - use 9 parts for training/tuning
  - calculate the performance on the remaining part (validation set)
- The estimate of the generalization error is the average of the validation set errors
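A minimal sketch of 10-fold crossvalidation; `train_and_evaluate` is a hypothetical function that trains a classifier on one set and returns its error on another:

```python
def crossvalidate(rows, train_and_evaluate, k=10):
    """Average the validation error over k folds."""
    folds = [rows[i::k] for i in range(k)]          # k parts of (nearly) equal size
    errors = []
    for i in range(k):
        validation = folds[i]
        training = [r for j, fold in enumerate(folds) if j != i for r in fold]
        errors.append(train_and_evaluate(training, validation))
    return sum(errors) / k                          # estimate of generalization error
```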
Bootstrapping





A bootstrap sample is a random subset of
the data sample
Validation set is also random sample
In the sampling process, data points may
be selected repeatedly (with replacement)
An arbitrary number of bootstrap samples
may be used
Bootstrapping is more reliable than
crossvalidation, training-test-validation
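A minimal sketch of drawing one bootstrap sample; using the points that were not drawn as the validation set is an assumption of this sketch:

```python
import random

def bootstrap_sample(rows, seed=None):
    """Draw a sample of the same size with replacement; undrawn rows can validate."""
    rng = random.Random(seed)
    sample = [rng.choice(rows) for _ in rows]       # sampling with replacement
    validation = [r for r in rows if r not in sample]
    return sample, validation

data = list(range(10))                              # hypothetical toy data
sample, validation = bootstrap_sample(data, seed=1)
print(sorted(sample), validation)
```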
Example: Bootstrapping

(Figures, shown over two slides: bootstrap samples drawn with replacement from a small example data set.)
Random Forests

- Combination of the decision tree and bootstrapping concepts
- A large number of decision trees is trained, each on a different bootstrap sample
- At each split, only a random subset of the original variables is available (i.e. a small selection of columns)
- Data points are classified by majority voting of the individual trees
Example: Random Forests

var1  var2  var3  var4  var5   class
12    A     0.1   501   red    I
8     B     1.2   499   red    II
9     B     1.1   504   blue   II
15    A     1.8   480   green  II
2     C     1.0   511   red    I
-2    C     0.7   512   green  II
7     C     0.4   488   cyan   I
7     A     0.6   491   cyan   I
10    A     1.5   500   cyan   I
0     C     0.3   505   blue   II
9     B     1.9   502   blue   II

(The column names var1-var5 are assumed; the splits below refer to var1 and var3.)
Example: Random Forests

(Figures, shown over several slides: a bootstrap sample of 7 rows is drawn from the table above, and a decision tree is grown on it. The tree splits on var1 < 8 at the root, then on var3 < 0.4 in one branch and var3 < 0.7 in the other, with leaf nodes labeled I and II; a new sample marked "?" is then dropped through the tree.)
Classification with Random Forests

(Figure: several trees of the forest classify the same sample; green = class I, blue = class II. With 3 votes for I and 2 votes for II, the sample is classified as I.)
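A minimal sketch of this majority voting; each tree is represented here by a hypothetical predict function:

```python
from collections import Counter

def forest_predict(trees, sample):
    """Let every tree vote and return the class with the most votes."""
    votes = Counter(tree(sample) for tree in trees)
    return votes.most_common(1)[0][0]

# Toy ensemble standing in for trained trees: 3 vote for I, 2 vote for II.
trees = [lambda s: "I", lambda s: "I", lambda s: "I",
         lambda s: "II", lambda s: "II"]
print(forest_predict(trees, sample={"var1": 7, "var3": 0.5}))   # -> I
```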
Properties of Random Forests





Easy to use ("off-the-shelve"), only 2
parameters (no. of trees, %variables for
split)
Very high accuracy
No overfitting if selecting large number of
trees (choose high)
Insensitive to choice of split% (~20%)
Returns an estimate of variable
importance
Support Vector Machines

- Introduced by Vapnik
- Rather sophisticated mathematical model
- Based on 2 concepts:
  - Optimization (maximization of the margin)
  - Kernels (non-linear separation)
Linear separation

(Figures, shown over several slides: a linearly separable two-class data set with several different separating lines.)

Largest Margin

(Figure: the separating hyperplane that leaves the largest margin to the nearest points of both classes.)
Largest Margin

- Finding the optimal hyperplane can be expressed as an optimization problem
- Solved by quadratic programming
- Soft margin: not necessarily 100% classification accuracy on the training set
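For reference, the soft-margin optimization problem behind this slide is usually written as follows (not spelled out on the slide):

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_{i}
\qquad \text{subject to} \qquad
y_{i}\,(w^{\top} x_{i} + b) \ge 1 - \xi_{i},\quad \xi_{i} \ge 0
```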
Non-linearly separable data

(Figures, shown over several slides: a two-class data set that cannot be separated by a straight line.)
Additional coordinate z = x²

(Figures, shown over several slides: the data plotted in the original (x, y) coordinates, then with the additional coordinate z = x², and finally projected onto the (z, y) plane, where a linear separation becomes possible.)
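A minimal Python sketch of this idea: adding the coordinate z = x² as an extra feature (the sample points are hypothetical):

```python
def add_square_coordinate(points):
    """Map (x, y) -> (x, y, x**2) so a curved boundary can become a plane."""
    return [(x, y, x ** 2) for x, y in points]

inner = [(-1.0, 0.5), (0.5, -0.2), (1.0, 0.3)]   # class 1: small |x|
outer = [(-3.0, 1.0), (2.5, 2.0), (3.0, -1.0)]   # class 2: large |x|
print(add_square_coordinate(inner))   # z values: 1.0, 0.25, 1.0
print(add_square_coordinate(outer))   # z values: 9.0, 6.25, 9.0
# In the new space, the plane z = 3 separates the two classes.
```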
Kernels

- Projection of the data space into a higher dimensional space
- The data may be separable in this high dimensional space
- Projection = multiplication of the vectors with a kernel matrix
- The kernel matrix determines the shape of the possible separators
Common Kernels

- Quadratic Kernel
- Radial Basis Kernel
- General Polynomial Kernel (arbitrary degree)
- Linear Kernel (= no kernel)
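For reference, the usual functional forms of these kernels (the constants c, d, and σ are hyperparameters; the forms are standard but not given on the slide):

```latex
k_{\text{linear}}(u, v) = u^{\top} v, \qquad
k_{\text{poly}}(u, v) = \bigl(u^{\top} v + c\bigr)^{d} \ \ (d = 2 \text{ for the quadratic kernel}), \qquad
k_{\text{RBF}}(u, v) = \exp\!\Bigl(-\tfrac{\lVert u - v \rVert^{2}}{2\sigma^{2}}\Bigr)
```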
Kernel Trick

- Other ML algorithms could also work with the projected (high dimensional) data, so why bother with SVMs?
- Working with high dimensional data is problematic (complexity)
- Kernel Trick: the optimization problem can be restated such that it only uses distances (inner products) between points in the high dimensional space
- This is computationally very inexpensive
Properties of SVM

- High classification accuracy
- Linear kernels: good for sparse, high dimensional data
- Much research has been directed at SVMs, VC-dimension, etc. => solid background
The End

Additional topics

- Confusion Matrix (weights)
- Prototype based methods (LVQ, …)
- k-NN