Transcript SVM
Summer Course: Data Mining
Support Vector Machines
and other penalization classifiers
Presenter: Georgi Nalbantov
August 2009
Contents
Purpose
Linear Support Vector Machines
Nonlinear Support Vector Machines
(Theoretical justifications of SVM)
Marketing Examples
Other penalization classification methods (some extensions)
Conclusion and Q & A
Purpose
Task to be solved (The Classification Task):
Classify cases (customers) into “type 1” or “type 2” on the basis of
some known attributes (characteristics)
Chosen tool to solve this task:
Support Vector Machines
The Classification Task
Given data on explanatory and explained variables, where the explained variable
can take two values { –1, +1 }, find a function that gives the “best” separation between
the “–1” cases and the “+1” cases:

Given:
( x1, y1 ), … , ( xm , ym )  ∈  R^n × { –1, +1 }

Find:
f : R^n → { –1, +1 }

“best function” = the expected error on unseen data ( xm+1, ym+1 ), … , ( xm+k , ym+k )
is minimal
Existing techniques to solve the classification task:
Linear and Quadratic Discriminant Analysis
Logit choice models (Logistic Regression)
Decision trees, Neural Networks, Least Squares SVM
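The task can be made concrete with a small sketch (illustrative data and a hand-picked rule, not from the slides; assumes numpy): given pairs ( xi, yi ) with yi in { –1, +1 }, a candidate decision function is judged by its 0-1 error, which we would like to be small on unseen data.

```python
import numpy as np

# illustrative training sample: attributes and known labels in {-1, +1}
X = np.array([[1.0, 0.5], [2.0, 1.0], [0.5, 2.5], [1.5, 3.0]])
y = np.array([-1, -1, +1, +1])

def f(x, w=np.array([-0.5, 1.0]), b=-1.0):
    """A candidate linear rule: sign(w . x + b), i.e. a map R^n -> {-1, +1}."""
    return np.where(x @ w + b >= 0, 1, -1)

training_error = np.mean(f(X) != y)   # 0-1 loss on the training sample
print("0-1 training error:", training_error)
```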
Support Vector Machines: Definition
Support Vector Machines are a non-parametric tool for classification/regression
Support Vector Machines are used for prediction rather than description purposes
Support Vector Machines were developed by Vapnik and co-workers
Linear Support Vector Machines
A direct marketing company wants to sell a new book: “The Art History of Florence”.

Problem: How to identify buyers and non-buyers using the two variables:
Months since last purchase
Number of art books purchased

Example from Nissan Levin and Jacob Zahavi in Lattin, Carroll and Green (2003).

[Figure: scatter plot of buyers (∆) and non-buyers (●); x-axis = months since last purchase, y-axis = number of art books purchased.]
Linear SVM: Separable Case
Main idea of SVM: separate the groups by a line.

However: there are infinitely many lines that have zero training error…

… which line shall we choose?

[Figure: the same scatter plot of buyers (∆) and non-buyers (●), with several candidate separating lines.]
Linear SVM: Separable Case

SVM use the idea of a margin around the separating line:
the thinner the margin, the more complex the model.
The best line is the one with the largest margin.

[Figure: buyers (∆) and non-buyers (●) with the separating line and its margin; x-axis = months since last purchase, y-axis = number of art books purchased.]
Linear SVM: Separable Case
The line having the largest margin is:

w1x1 + w2x2 + b = 0

where
w1xi1 + w2xi2 + b ≥ +1   for i ∈ ∆ (buyers)
w1xj1 + w2xj2 + b ≤ –1   for j ∈ ● (non-buyers)

Note:
x1 = months since last purchase
x2 = number of art books purchased

[Figure: the maximal-margin line with normal vector w and its margin.]
Linear SVM: Separable Case
The width of the margin is given by:

( 1 – ( –1 ) ) / || w ||  =  2 / || w || ,   where   || w || = √( w1² + w2² )

[Figure: the separating line with normal vector w and a margin of width 2/||w||; x1 = months since last purchase, x2 = number of art books purchased.]

Note:
maximize the margin 2 / || w ||
⇔ minimize || w ||
⇔ minimize || w ||² / 2
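A small numerical sanity check (a sketch, assuming numpy; the numbers are arbitrary) that the distance between the two margin boundaries w·x + b = +1 and w·x + b = –1 is indeed 2/||w||:

```python
import numpy as np

w, b = np.array([3.0, 4.0]), -2.0            # an arbitrary line w.x + b = 0

x_plus = np.array([0.0, (1 - b) / w[1]])     # a point on w.x + b = +1
x_minus = np.array([0.0, (-1 - b) / w[1]])   # a point on w.x + b = -1

# project the difference onto the unit normal w/||w|| to get the distance
distance = abs((x_plus - x_minus) @ w) / np.linalg.norm(w)
print(distance, "==", 2 / np.linalg.norm(w))  # both equal 0.4 here
```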
Linear SVM: Separable Case
maximize the margin 2 / || w ||   ⇔   minimize || w ||² / 2

The optimization problem for SVM is:

minimize   L( w ) = || w ||² / 2

subject to:
w1xi1 + w2xi2 + b ≥ +1   for i ∈ ∆
w1xj1 + w2xj2 + b ≤ –1   for j ∈ ●
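A sketch of solving this maximal-margin problem numerically (assumes scikit-learn and numpy; the six points are made up to resemble the book example, and a very large C approximates the separable, hard-margin case):

```python
import numpy as np
from sklearn.svm import SVC

# columns: (months since last purchase, number of art books purchased)
X = np.array([[2, 5], [3, 6], [4, 7], [10, 1], [11, 2], [12, 1]], dtype=float)
y = np.array([+1, +1, +1, -1, -1, -1])       # +1 = buyer, -1 = non-buyer

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # huge C ~ hard margin

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, " b =", b)
print("margin width 2/||w|| =", 2 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)
```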
Linear SVM: Separable Case
“Support vectors” are those points that lie on the boundaries of the margin.

The decision surface (line) is determined only by the support vectors. All other points are irrelevant.

[Figure: the separating line with the support vectors lying on the margin boundaries.]
Linear SVM: Nonseparable Case
Non-separable case: there is no line that separates the two groups without error.

[Figure: buyers (∆) and non-buyers (●) overlap, so some points fall on the wrong side of any line; such points receive slacks ξ.]

Training set: 1000 targeted customers

Here, SVM minimize L( w, C ):

L( w, C )  =  || w ||² / 2  +  C Σi ξi
           =  Complexity  +  Errors

maximize the margin          →  minimize || w ||² / 2
minimize the training errors →  minimize C Σi ξi

subject to:
w1xi1 + w2xi2 + b ≥ +1 – ξi   for i ∈ ∆
w1xj1 + w2xj2 + b ≤ –1 + ξj   for j ∈ ●
ξi , ξj ≥ 0
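A sketch of evaluating L( w, C ) for a given candidate line (illustrative numbers, not the 1000-customer data set; assumes numpy). The two constraint sets above are equivalent to the single condition yi ( w·xi + b ) ≥ 1 – ξi, so the slacks can be computed as ξi = max( 0, 1 – yi ( w·xi + b ) ):

```python
import numpy as np

X = np.array([[2.0, 5.0], [4.0, 7.0], [5.0, 4.0],
              [10.0, 1.0], [11.0, 2.0], [3.0, 1.0]])
y = np.array([+1, +1, +1, -1, -1, -1])

def soft_margin_objective(w, b, C):
    slacks = np.maximum(0.0, 1.0 - y * (X @ w + b))   # xi_i >= 0
    complexity = 0.5 * np.dot(w, w)                   # ||w||^2 / 2
    errors = C * slacks.sum()                         # C * sum_i xi_i
    return complexity + errors, slacks

value, slacks = soft_margin_objective(w=np.array([-0.4, 0.4]), b=0.2, C=1.0)
print("L(w, C) =", value, "\nslacks  =", slacks)
```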
Linear SVM: The Role of C
[Figure: two panels with the same data, one with C = 5 and one with C = 1.]

Bigger C  →  increased complexity ( thinner margin ),  smaller number of errors ( better fit on the data )

Smaller C  →  decreased complexity ( wider margin ),  bigger number of errors ( worse fit on the data )

Vary both complexity and empirical error via C … by affecting the optimal w and the optimal number of training errors.
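A sketch of this effect (assumes scikit-learn and numpy; the two overlapping clouds are simulated). Larger C penalizes training errors more heavily and yields a thinner margin; smaller C yields a wider margin and more training errors:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 5], 1.5, (30, 2)),   # simulated "buyers"
               rng.normal([8, 2], 1.5, (30, 2))])  # simulated "non-buyers"
y = np.array([+1] * 30 + [-1] * 30)

for C in (5, 1):                                   # the two values on the slide
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    print(f"C = {C}: margin width = {2 / np.linalg.norm(w):.3f}, "
          f"training errors = {(clf.predict(X) != y).sum()}, "
          f"support vectors = {len(clf.support_)}")
```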
Bias – Variance trade-off
From Regression into Classification
We have a linear model, such as

y = b * x + const

We have to estimate this relation using our training data set and having in mind the so-called “accuracy”, or “0-1”, loss function (our evaluation criterion).

The training data set we have consists of only finitely many observations, for instance:
Training data:

Output (y)   Input (x)
    -1          0.2
     1          0.5
     1          0.7
    ...         ...
    -1         -0.7
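Using the four fully listed pairs from the table, a sketch (assumes numpy; an ordinary least-squares fit merely stands in for “some estimate” of b and const) of scoring the fitted line with the 0-1 loss by classifying through the sign of the fitted value:

```python
import numpy as np

x = np.array([0.2, 0.5, 0.7, -0.7])   # Input (x)
y = np.array([-1,  1,   1,  -1])      # Output (y)

b, const = np.polyfit(x, y, deg=1)    # least-squares estimate of y = b*x + const
y_hat = np.sign(b * x + const)        # classify by the sign of the fitted value

zero_one_loss = np.mean(y_hat != y)   # share of misclassified training points
print(f"b = {b:.3f}, const = {const:.3f}, 0-1 loss = {zero_one_loss}")
```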
From Regression into Classification

The same linear model y = b * x + const, “0-1” loss function, and training data as above, now shown graphically.

[Figure: the training points plotted against x at heights y = +1 and y = –1; the two points closest to the decision boundary are the support vectors, and the band between them is the “margin”.]
From Regression into Classification:
Support Vector Machines
y = b * x + const

flatter line ⇔ greater penalization
equivalently: smaller slope ⇔ bigger margin

[Figure: two fitted lines through the same data; the flatter line has the wider “margin” between the levels y = +1 and y = –1.]
From Regression into Classification:
Support Vector Machines
y = b1 * x1 + b2 * x2 + const

flatter line ⇔ greater penalization
equivalently: smaller slope ⇔ bigger margin

[Figure: two 3-D panels showing the fitted surface over (x1, x2); the flatter surface has the wider “margin”.]
Nonlinear SVM: Nonseparable Case
Mapping into a higher-dimensional space

Each observation ( xi1 , xi2 ) is mapped to ( xi1² , √2 xi1 xi2 , xi2² ):

( x11 , x12 )  →  ( x11² , √2 x11 x12 , x12² )
( x21 , x22 )  →  ( x21² , √2 x21 x22 , x22² )
      ⋮                      ⋮
( xl1 , xl2 )  →  ( xl1² , √2 xl1 xl2 , xl2² )

Optimization task: minimize L( w, C )

L( w, C ) = || w ||² / 2 + C Σi ξi

subject to:
w1 xi1² + w2 √2 xi1 xi2 + w3 xi2² + b ≥ +1 – ξi   for i ∈ ∆
w1 xj1² + w2 √2 xj1 xj2 + w3 xj2² + b ≤ –1 + ξj   for j ∈ ●

[Figure: buyers (∆) and non-buyers (●) in the original (x1, x2) space, where no line separates them.]
Nonlinear SVM: Nonseparable Case
Map the data into higher-dimensional space:

( x1 , x2 )  →  ( x1² , √2 x1 x2 , x2² )

( 1, 1 )    →  ( 1,  √2, 1 )
( 1, –1 )   →  ( 1, –√2, 1 )
( –1, 1 )   →  ( 1, –√2, 1 )
( –1, –1 )  →  ( 1,  √2, 1 )

[Figure: the four points (±1, ±1) in the original (x1, x2) space, with the two classes (∆, ●) on opposite diagonals, and their images in the transformed space with axes x1², √2 x1 x2, x2².]
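A sketch of this mapping in code (assumes numpy; the class assignment of the diagonals is the usual XOR pattern and is illustrative). In the original space no line separates the two classes, but after the mapping the middle coordinate √2·x1·x2 alone splits them:

```python
import numpy as np

X = np.array([[1, 1], [-1, -1], [-1, 1], [1, -1]], dtype=float)
y = np.array([+1, +1, -1, -1])        # one class on each diagonal (XOR pattern)

def phi(x):
    """The feature map (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

Phi = np.array([phi(x) for x in X])
print(Phi)

# the second transformed feature is +sqrt(2) on one diagonal, -sqrt(2) on the other
print("separable by the sign of feature 2:", bool(np.all(np.sign(Phi[:, 1]) == y)))
```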
Nonlinear SVM: Nonseparable Case
Find the optimal hyperplane in the transformed space.

( x1 , x2 )  →  ( x1² , √2 x1 x2 , x2² )

[Figure: the four mapped points in the transformed space with axes x1², √2 x1 x2, x2², now separated by a hyperplane.]
Nonlinear SVM: Nonseparable Case
Observe the decision surface in the original space (optional).

( x1 , x2 )  →  ( x1² , √2 x1 x2 , x2² )

[Figure: the separating hyperplane in the transformed space and the corresponding nonlinear decision surface back in the original (x1, x2) space.]
Nonlinear SVM: Nonseparable Case
Dual formulation of the (primal) SVM minimization problem

Primal:
min over w, b, ξ:   || w ||² / 2 + C Σi ξi
subject to:
yi ( w · xi + b ) ≥ 1 – ξi ,   ξi ≥ 0 ,   with yi ∈ { –1, +1 }

Dual:
max over α:   Σi αi – ½ Σi Σj αi αj yi yj ( xi · xj )
subject to:
0 ≤ αi ≤ C ,   Σi αi yi = 0
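A sketch of solving this dual with a general-purpose solver (assumes numpy and scipy; the four points are illustrative). The equality and box constraints are passed directly to the optimizer, and w is then recovered as Σi αi yi xi:

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 5.0], [3.0, 6.0], [10.0, 1.0], [11.0, 2.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])
C = 10.0

G = (y[:, None] * X) @ (y[:, None] * X).T      # G_ij = y_i y_j <x_i, x_j>

def neg_dual(a):                               # minimize minus the dual objective
    return 0.5 * a @ G @ a - a.sum()

res = minimize(neg_dual, x0=np.zeros(len(y)),
               bounds=[(0.0, C)] * len(y),
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])

alpha = res.x
w = ((alpha * y)[:, None] * X).sum(axis=0)     # w = sum_i alpha_i y_i x_i
print("alpha =", np.round(alpha, 4), "\nw     =", np.round(w, 4))
```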
Nonlinear SVM: Nonseparable Case
Dual formulation of the (primal) SVM minimization problem
( x1 , x2 )  →  ( x1² , √2 x1 x2 , x2² )

φ(xi) · φ(xj)  =  ( xi1² , √2 xi1 xi2 , xi2² ) · ( xj1² , √2 xj1 xj2 , xj2² )
              =  ( ( xi1 , xi2 ) · ( xj1 , xj2 ) )²
              =  ( xi · xj )²

K( xi , xj ) = φ(xi) · φ(xj)   (kernel function)

Dual:
max over α:   Σi αi – ½ Σi Σj αi αj yi yj φ(xi) · φ(xj)
subject to:
0 ≤ αi ≤ C ,   Σi αi yi = 0
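A numerical check of the identity above (a sketch, assuming numpy): φ(x)·φ(z) equals (x·z)², so the dual only ever needs the kernel values, never the mapped features themselves:

```python
import numpy as np

def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def K(x, z):
    return (x @ z) ** 2                 # the kernel corresponding to phi

rng = np.random.default_rng(1)
x, z = rng.normal(size=2), rng.normal(size=2)
print(phi(x) @ phi(z), "==", K(x, z))   # the two numbers coincide

# the kernel matrix needed in the dual objective, for a small sample
Xs = rng.normal(size=(4, 2))
print((Xs @ Xs.T) ** 2)
```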
Nonlinear SVM: Nonseparable Case
Dual formulation of the (primal) SVM minimization problem
( x1 , x2 )  →  ( x1² , √2 x1 x2 , x2² )

Dual (in the transformed space):

max over α:   Σi αi – ½ Σi Σj αi αj yi yj φ(xi) · φ(xj)
           =  Σi αi – ½ Σi Σj αi αj yi yj K( xi , xj )

with   K( xi , xj ) = φ(xi) · φ(xj) = ( xi · xj )²   (kernel function)

subject to:
0 ≤ αi ≤ C ,   Σi αi yi = 0
Strengths and Weaknesses of SVM
Strengths of SVM:
Training is relatively easy
No local minima
It scales relatively well to high dimensional data
Trade-off between classifier complexity and error can be controlled
explicitly via C
Robustness of the results
The “curse of dimensionality” is avoided
Weaknesses of SVM:
What is the best trade-off parameter C ?
Need a good transformation of the original space
The Ketchup Marketing Problem
Two types of ketchup: Heinz and Hunts
Seven Attributes
Feature Heinz
Feature Hunts
Display Heinz
Display Hunts
Feature&Display Heinz
Feature&Display Hunts
Log price difference between Heinz and Hunts
Training Data: 2498 cases (89.11% Heinz is chosen)
Test Data: 300 cases (88.33% Heinz is chosen)
The Ketchup Marketing Problem
[Figure: heat map of cross-validation mean squared errors for the SVM with RBF kernel, over a grid of C and σ values (min to max scale).]

Choose a kernel mapping:
Linear kernel:      K( xi , xj ) = xi · xj
Polynomial kernel:  K( xi , xj ) = ( xi · xj + 1 )^d
RBF kernel:         K( xi , xj ) = exp( – || xi – xj ||² / 2σ² )

Do a (5-fold) cross-validation procedure to find the best combination of the manually adjustable parameters (here: C and σ).
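A sketch of this tuning step (assumes scikit-learn; simulated data with seven attributes stand in for the ketchup variables, and scikit-learn's gamma plays the role of 1/(2σ²)):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# simulated stand-in for the 7-attribute ketchup data
X, y = make_classification(n_samples=300, n_features=7, random_state=0)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)  # 5-fold CV

print("best (C, gamma):", search.best_params_)
print("cross-validated accuracy:", round(search.best_score_, 4))
```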
The Ketchup Marketing Problem – Training Set
Model: Linear Discriminant Analysis

                            Predicted Group Membership
                            Hunts      Heinz      Total
Original   Count   Hunts       68        204        272
                   Heinz       58       2168       2226
           %       Hunts    25.00%     75.00%    100.00%
                   Heinz     2.61%     97.39%    100.00%

Hit Rate: 89.51%
The Ketchup Marketing Problem – Training Set
Model: Logit Choice Model

                            Predicted Group Membership
                            Hunts      Heinz      Total
Original   Count   Hunts      214         58        272
                   Heinz      497       1729       2226
           %       Hunts    78.68%     21.32%    100.00%
                   Heinz    22.33%     77.67%    100.00%

Hit Rate: 77.79%
The Ketchup Marketing Problem – Training Set
Model: Support Vector Machines

                            Predicted Group Membership
                            Hunts      Heinz      Total
Original   Count   Hunts      255         17        272
                   Heinz        6       2220       2226
           %       Hunts    93.75%      6.25%    100.00%
                   Heinz     0.27%     99.73%    100.00%

Hit Rate: 99.08%
The Ketchup Marketing Problem – Training Set
Model: Majority Voting

                            Predicted Group Membership
                            Hunts      Heinz      Total
Original   Count   Hunts        0        272        272
                   Heinz        0       2226       2226
           %       Hunts       0%       100%     100.00%
                   Heinz       0%       100%     100.00%

Hit Rate: 89.11%
The Ketchup Marketing Problem – Test Set
Model: Linear Discriminant Analysis

                            Predicted Group Membership
                            Hunts      Heinz      Total
Original   Count   Hunts        3         32         35
                   Heinz        3        262        265
           %       Hunts     8.57%     91.43%    100.00%
                   Heinz     1.13%     98.87%    100.00%

Hit Rate: 88.33%
The Ketchup Marketing Problem – Test Set
Model: Logit Choice Model

                            Predicted Group Membership
                            Hunts      Heinz      Total
Original   Count   Hunts       29          6         35
                   Heinz       63        202        265
           %       Hunts    82.86%     17.14%    100.00%
                   Heinz    23.77%     76.23%    100.00%

Hit Rate: 77%
The Ketchup Marketing Problem – Test Set
Model: Support Vector Machines

                            Predicted Group Membership
                            Hunts      Heinz      Total
Original   Count   Hunts       25         10         35
                   Heinz        3        262        265
           %       Hunts    71.43%     28.57%    100.00%
                   Heinz     1.13%     98.87%    100.00%

Hit Rate: 95.67%
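A sketch of how such a table and its hit rate can be computed (assumes numpy and scikit-learn; the label vectors below are reconstructed from the counts in the SVM test-set table, not from the original data):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

actual    = np.array(["Hunts"] * 35 + ["Heinz"] * 265)
predicted = np.array(["Hunts"] * 25 + ["Heinz"] * 10      # the 35 actual Hunts cases
                     + ["Hunts"] * 3 + ["Heinz"] * 262)   # the 265 actual Heinz cases

table = confusion_matrix(actual, predicted, labels=["Hunts", "Heinz"])
hit_rate = np.trace(table) / table.sum()                  # share classified correctly

print(table)                          # rows: actual, columns: predicted
print(f"hit rate = {hit_rate:.2%}")   # 95.67% for these counts
```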
Part II: Penalized classification and regression methods

Support Hyperplanes
Nearest Convex Hull classifier
Soft Nearest Neighbor
Application: an example Support Vector Regression financial study
Conclusion
Classification: Support Hyperplanes

Consider a (separable) binary classification case: training data (+, –) and a test point x.

There are infinitely many hyperplanes that are semi-consistent (= commit no error) with the training data.

[Figure: training points of the two classes with a test point x and several semi-consistent hyperplanes.]
Classification: Support Hyperplanes

For the classification of the test point x, use the farthest-away hyperplane that is semi-consistent with the training data (the support hyperplane of x).

The SH decision surface: each point on it has 2 support hyperplanes.

[Figure: the support hyperplane of x and the resulting SH decision surface.]
Classification: Support Hyperplanes

Toy problem experiment with Support Hyperplanes (SH) and Support Vector Machines (SVM).

[Figure: six panels on the same toy data:
SH, linear kernel          |  SVM, linear kernel
SH, RBF kernel, σ = 5      |  SVM, RBF kernel, σ = 5
SH, RBF kernel, σ = 35     |  SVM, RBF kernel, σ = 35]
Classification: Support Vector Machines and Support Hyperplanes

[Figure: two panels comparing the decision boundaries of Support Vector Machines and Support Hyperplanes on the same data.]
Classification: Support Vector Machines and Nearest Convex Hull classification

[Figure: two panels comparing the decision boundaries of Support Vector Machines and Nearest Convex Hull classification.]
Classification: Support Vector Machines and Soft Nearest Neighbor

[Figure: two panels comparing the decision boundaries of Support Vector Machines and Soft Nearest Neighbor.]
Classification: Support Hyperplanes

[Figure: two panels of Support Hyperplanes, the second with bigger penalization.]
Classification: Nearest Convex Hull classification

[Figure: two panels of Nearest Convex Hull classification, the second with bigger penalization.]
Classification: Soft Nearest Neighbor

[Figure: two panels of Soft Nearest Neighbor, the second with bigger penalization.]
Classification: Support Vector Machines, Nonseparable Case

[Figure: Support Vector Machines decision boundary on nonseparable data.]
Classification: Support Hyperplanes, Nonseparable Case

[Figure: Support Hyperplanes decision boundary on nonseparable data.]
Classification: Nearest Convex Hull classification, Nonseparable Case

[Figure: Nearest Convex Hull classification decision boundary on nonseparable data.]
Classification: Soft Nearest Neighbor, Nonseparable Case

[Figure: Soft Nearest Neighbor decision boundary on nonseparable data.]
Summary: Penalization Techniques for Classification
Penalization methods for classification: Support Vector Machines (SVM), Support Hyperplanes (SH), Nearest Convex Hull classification (NCH), and Soft Nearest Neighbor (SNN). In all cases, the classification of a test point x is determined using the hyperplane h. Equivalently, x is labelled +1 (–1) if it is farther away from the set S– (S+).
Conclusion
Support Vector Machines (SVM) can be applied to binary and multi-class classification problems
SVM behave robustly in multivariate problems
Further research in various Marketing areas is needed to justify
or refute the applicability of SVM
Support Vector Regressions (SVR) can also be applied
http://www.kernel-machines.org
Email: [email protected]