Transcript Document

Support Vector Machines
Andrew W. Moore
Associate Professor
School of Computer Science
Carnegie Mellon University
Copyright © 2001, Andrew W. Moore
Outline
• Linear SVMs
• The definition of a maximum margin classifier
• What QP can do for you
• How Maximum Margin can be turned into a QP problem
• How we deal with noisy (non-separable) data
• How we permit non-linear boundaries
• How SVM Kernel functions permit us to pretend we're working with ultra-high-dimensional basis-function terms
Linear Classifiers
[Figure: scatter of datapoints, "denotes +1" / "denotes -1"; diagram of a learning machine x → f(x, α) → y_est]

f(x, w, b) = sign(w . x - b)

How would you classify this data?

The learning machine f takes an input x and transforms it, somehow using weights α, into a predicted output y_est = +1 or -1.
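To make the notation concrete, here is a minimal sketch of this linear classifier in NumPy (not from the original slides; the toy data, w and b are made up):

import numpy as np

def linear_classify(X, w, b):
    """Predict +1/-1 for each row of X using f(x, w, b) = sign(w . x - b)."""
    scores = X @ w - b                      # one score per datapoint
    return np.where(scores >= 0, 1, -1)     # threshold at zero

# Hypothetical toy data: two 2-D points, one from each class
X = np.array([[2.0, 1.0], [-1.0, -2.0]])
w = np.array([1.0, 1.0])
b = 0.0
print(linear_classify(X, w, b))             # -> [ 1 -1]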
Linear Classifiers
x
denotes +1
denotes -1
a
f
yest
f(x,w,b) = sign(w. x - b)
Perceptron rule
stops only if no
misclassification
remains:
Dw = h(t-y)x
 Any of these
lines are possible
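A sketch of the perceptron rule just described, assuming targets t in {-1, +1} and a learning rate η (eta); the bias b is updated alongside w:

import numpy as np

def perceptron_train(X, t, eta=1.0, max_epochs=100):
    """Repeat the update Δw = η (t - y) x until no misclassification remains."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(max_epochs):
        errors = 0
        for x_k, t_k in zip(X, t):
            y_k = 1 if (w @ x_k - b) >= 0 else -1
            if y_k != t_k:
                w += eta * (t_k - y_k) * x_k    # Δw = η (t - y) x
                b -= eta * (t_k - y_k)          # bias treated as a weight on a constant -1 input
                errors += 1
        if errors == 0:                         # stop only when nothing is misclassified
            break
    return w, b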
Linear Classifiers
[Figure: same datapoints and learning-machine diagram]

f(x, w, b) = sign(w . x - b)

The LMS rule uses linear elements and minimizes E = (t - y)².
It does not necessarily guarantee zero misclassification.
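For comparison, a sketch of the LMS (delta) rule on a linear unit; here y = w . x - b is not thresholded during training, and each step follows the gradient of E = (t - y)² (learning rate and epoch count are made-up defaults):

import numpy as np

def lms_train(X, t, eta=0.01, epochs=200):
    """Stochastic gradient steps on E = (t - y)^2 for a linear unit y = w.x - b."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for x_k, t_k in zip(X, t):
            y_k = w @ x_k - b                  # linear output, no sign()
            w += eta * (t_k - y_k) * x_k       # move w to reduce the squared error
            b -= eta * (t_k - y_k)
    return w, b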

Linear Classifiers
[Figure: same datapoints, "denotes +1" / "denotes -1", with several candidate separating lines drawn]

f(x, w, b) = sign(w . x - b)

If the objective is just to assure zero sample error, any of these lines would be fine... but which is best?
Classifier Margin
[Figure: datapoints, "denotes +1" / "denotes -1", with a separating line and its margin band shaded]

f(x, w, b) = sign(w . x - b)

Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint. The margin of the separator is the width of separation between the classes.
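Under this definition, the margin of a candidate (w, b) on a dataset is governed by the closest datapoint to the boundary; a small NumPy sketch (the data and the line below are made-up examples):

import numpy as np

def margin_width(X, w, b):
    """Width the boundary w.x - b = 0 could be grown to before hitting a datapoint."""
    distances = np.abs(X @ w - b) / np.sqrt(w @ w)   # point-to-line distances
    return 2 * distances.min()                       # the band grows on both sides

X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-1.0, -3.0]])
w = np.array([1.0, 1.0])
b = 0.0
print(margin_width(X, w, b))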
Maximum Margin
[Figure: datapoints, "denotes +1" / "denotes -1", with the maximum-margin separating line; the datapoints touching the margin are highlighted]

f(x, w, b) = sign(w . x - b)

The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (called a Linear SVM, or LSVM).

Support Vectors are those datapoints that the margin pushes up against.
Outline
• Linear SVMs
• The definition of a maximum margin classifier
• What QP can do for you
• How Maximum Margin can be turned into a QP problem
• How we deal with noisy (non-separable) data
• How we permit non-linear boundaries
• How SVM Kernel functions permit us to pretend we're working with ultra-high-dimensional basis-function terms
Specifying a line and margin
[Figure: the Plus-Plane, the Classifier Boundary, and the Minus-Plane drawn as three parallel lines]
• How do we represent this mathematically?
• …in m input dimensions?
Specifying a line and margin
[Figure: the same three parallel lines, now labelled with their equations]
• Plus-plane = { x : w . x + b = +1 }
• Minus-plane = { x : w . x + b = -1 }
Classify as:
  +1 if w . x + b >= 1
  -1 if w . x + b <= -1
  Universe explodes if -1 < w . x + b < 1
Computing the margin width
M = margin width. How do we compute M in terms of w and b?
• Plus-plane = { x : w . x + b = +1 }
• Minus-plane = { x : w . x + b = -1 }
Claim: the vector w is perpendicular to the Plus-Plane. Why? Let u and v be two vectors on the Plus-Plane. Then w . (u - v) = (1 - b) - (1 - b) = 0, so w is orthogonal to every direction lying in that plane. And so of course the vector w is also perpendicular to the Minus-Plane.
Computing the margin width
The line from x- to x+ is
perpendicular to the planes.
So to get from x- to x+ travel
some distance in direction w
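The equations on the next two slides did not survive extraction; here is a short reconstruction of the derivation they presumably carried, using the plus/minus-plane definitions above (x+ and x- denote the closest points on the two planes):

\[
\begin{aligned}
x^{+} &= x^{-} + \lambda w, \qquad w \cdot x^{+} + b = 1, \qquad w \cdot x^{-} + b = -1\\
&\Rightarrow\ \lambda\,(w \cdot w) = 2 \ \Rightarrow\ \lambda = \frac{2}{w \cdot w}\\
M &= \lVert x^{+} - x^{-} \rVert = \lambda \lVert w \rVert = \frac{2}{w \cdot w}\sqrt{w \cdot w} = \frac{2}{\sqrt{w \cdot w}}
\end{aligned}
\]

This matches the margin width 2 / sqrt(w . w) quoted on the later slides.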
Outline
• Linear SVMs
• The definition of a maximum margin classifier
• What QP can do for you
• How Maximum Margin can be turned into a QP problem
• How we deal with noisy (non-separable) data
• How we permit non-linear boundaries
• How SVM Kernel functions permit us to pretend we're working with ultra-high-dimensional basis-function terms
Learning the Maximum Margin Classifier
[Figure: the plus-plane through x+ and the minus-plane through x-, separated by the margin]

M = margin width = 2 / sqrt(w . w)

Given a guess of w and b we can
• Compute whether all data points are in the correct half-planes
• Compute the width of the margin
So now we just need to write a program to search the space of w's and b's to find the widest margin that matches all the datapoints. How?
• Gradient descent? Simulated Annealing? Matrix Inversion? EM? Newton's Method?
Learning via Quadratic Programming
• QP is a well-studied class of optimization algorithms for optimizing (maximizing or minimizing) a quadratic function of some real-valued variables subject to linear constraints.
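For reference, one standard way of writing such a problem (a notational assumption, not taken from the slides): find u optimizing a quadratic objective under linear constraints,

\[
\text{minimize}\ \; d^{T} u + \tfrac{1}{2}\, u^{T} R u
\qquad \text{subject to}\qquad A\,u \le b_{0}, \;\; A_{eq}\,u = b_{eq},
\]

with R symmetric. Maximizing the margin will be cast in exactly this shape on the following slides.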
The Optimization Problem
Learning the Maximum Margin Classifier
M = 2 / sqrt(w . w)

The objective?
• To maximize M
• While ensuring all data points are in the correct half-planes

Q1: What is our quadratic optimization criterion? Minimize w . w
Q2: What are the constraints in our QP?
  - How many constraints will we have? R (one per datapoint)
  - What should they be?
• Constraints:
  w . xk + b >= 1 if yk = 1
  w . xk + b <= -1 if yk = -1
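As an illustration (not the method used by real SVM packages, which rely on specialized QP solvers), here is a sketch that feeds exactly this criterion and these constraints to a generic constrained optimizer; the toy data is made up:

import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data; labels y_k in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(z):                            # z = (w_1, ..., w_m, b)
    w = z[:-1]
    return 0.5 * (w @ w)                     # minimizing w.w maximizes M = 2/sqrt(w.w)

constraints = [{'type': 'ineq',              # one constraint per datapoint: y_k (w.x_k + b) >= 1
                'fun': lambda z, xk=xk, yk=yk: yk * (xk @ z[:-1] + z[-1]) - 1.0}
               for xk, yk in zip(X, y)]

res = minimize(objective, x0=np.zeros(X.shape[1] + 1), constraints=constraints)
w, b = res.x[:-1], res.x[-1]
print(w, b, 2 / np.sqrt(w @ w))              # weights, bias, and the resulting margin width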
Outline
• Linear SVMs
• The definition of a maximum margin classifier
• What QP can do for you
• How Maximum Margin can be turned into a QP problem
• How we deal with noisy (non-separable) data
• How we permit non-linear boundaries
• How SVM Kernel functions permit us to pretend we're working with ultra-high-dimensional basis-function terms
Uh-oh!
This is going to be a problem! What should we do?

[Figure: the two classes overlap, "denotes +1" / "denotes -1"; no line separates them perfectly]

Idea: Minimize
  w . w + C (distance of error points to their correct place)
That is, minimize a trade-off between the margin and the training error.
Learning Maximum Margin with Noise
[Figure: the margin M = 2 / sqrt(w . w), with slack distances ε2, ε7, ε11 marked for the points on the wrong side of their margin]

Our QP criterion: Minimize
  (1/2) w . w + C Σ_{k=1..R} εk

• How many constraints? 2R
• What should they be?
  w . xk + b >= 1 - εk if yk = 1
  w . xk + b <= -1 + εk if yk = -1
  εk >= 0 for all k
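To see the trade-off that C controls, a small sketch using scikit-learn's SVC (an assumption of this illustration, not part of the lecture; the noisy toy data is made up):

import numpy as np
from sklearn.svm import SVC

# 1-D toy data with one point on the "wrong" side, so slack is unavoidable
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [1.5]])
y = np.array([-1, -1, -1, 1, 1, 1])

for C in (0.1, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    w = clf.coef_[0]
    # small C tolerates training error and keeps the margin wide; large C does the opposite
    print(C, 2 / np.sqrt(w @ w), int(clf.n_support_.sum()))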
Hard/Soft Margin Separation
Controlling Soft Margin Separation
Outline
• Linear SVMs
• The definition of a maximum margin classifier
• What QP can do for you
• How Maximum Margin can be turned into a QP problem
• How we deal with noisy (non-separable) data
• How we permit non-linear boundaries
• How SVM Kernel functions permit us to pretend we're working with ultra-high-dimensional basis-function terms
Suppose we’re in 1-dimension
What would SVMs do with this data?

[Figure: positive and negative datapoints placed along the one-dimensional x axis, with x = 0 marked]
Suppose we’re in 1-dimension
Not a big surprise.

[Figure: the maximum-margin boundary on the 1-D axis, with the positive "plane" and negative "plane" marked on either side of x = 0]
Harder 1-dimensional dataset
That’s wiped the
smirk off SVM’s
face.
What can be
done about
this?
x=0
Harder 1-dimensional dataset
Remember how permitting non-linear basis functions made linear regression so much nicer? Let's permit them here too:

  zk = ( xk , xk² )

[Figure: the harder 1-D dataset, unchanged, with x = 0 marked]
Harder 1-dimensional dataset
Remember how permitting non-linear basis functions made linear regression so much nicer? Let's permit them here too:

  zk = ( xk , xk² )

[Figure: the same datapoints after the mapping, plotted in the two-dimensional (xk, xk²) space, where the classes become linearly separable]
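A sketch of that idea on a made-up version of the harder 1-D dataset: lift each point to z_k = (x_k, x_k²) and then run an ordinary linear SVM (here scikit-learn's SVC, an assumed tool for the illustration) in the new space:

import numpy as np
from sklearn.svm import SVC

x = np.array([-4.0, -3.0, -2.0, 2.0, 3.0, 4.0, -0.5, 0.0, 0.5])
y = np.array([ 1,    1,    1,   1,   1,   1,  -1,  -1,  -1 ])

Z = np.column_stack([x, x**2])           # z_k = (x_k, x_k^2): the quadratic basis
clf = SVC(kernel='linear').fit(Z, y)     # a *linear* SVM now separates the classes in z-space

x_new = np.array([1.0, -3.5])
print(clf.predict(np.column_stack([x_new, x_new**2])))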
Outline
• Linear SVMs
• The definition of a maximum margin classifier
• What QP can do for you
• How Maximum Margin can be turned into a QP problem
• How we deal with noisy (non-separable) data
• How we permit non-linear boundaries
• How SVM Kernel functions permit us to pretend we're working with ultra-high-dimensional basis-function terms
Common SVM basis functions
zk = ( polynomial terms of xk of degree 1 to q )
zk = ( radial basis functions of xk ):
  zk[j] = φj(xk) = KernelFn( |xk - cj| / KW )
zk = ( sigmoid functions of xk )

This is sensible. Is that the end of the story? No... there's one more trick!
The “Kernel Trick”
Quadratic Dot Products
The explicit quadratic feature map (with √2 scaling on the linear and cross terms) is

\[
\Phi(a) = \bigl(\,1,\ \sqrt{2}\,a_1,\ \sqrt{2}\,a_2,\ \ldots,\ \sqrt{2}\,a_m,\ a_1^2,\ a_2^2,\ \ldots,\ a_m^2,\ \sqrt{2}\,a_1 a_2,\ \sqrt{2}\,a_1 a_3,\ \ldots,\ \sqrt{2}\,a_{m-1} a_m \,\bigr)
\]

and similarly for Φ(b). Multiplying term by term,

\[
\Phi(a)\cdot\Phi(b)
= \underbrace{1}_{\text{constant term}}
\;+\; \underbrace{\sum_{i=1}^{m} 2\,a_i b_i}_{\text{linear terms}}
\;+\; \underbrace{\sum_{i=1}^{m} a_i^2 b_i^2}_{\text{pure quadratic terms}}
\;+\; \underbrace{\sum_{i=1}^{m}\sum_{j=i+1}^{m} 2\,a_i a_j b_i b_j}_{\text{quadratic cross-terms}}
\]
Quadratic Dot Products
Now compare with the simple kernel (a . b + 1)²:

\[
(a\cdot b + 1)^2 = (a\cdot b)^2 + 2\,a\cdot b + 1
\]

\[
\Phi(a)\cdot\Phi(b) = 1 + \sum_{i=1}^{m} 2\,a_i b_i + \sum_{i=1}^{m} a_i^2 b_i^2 + \sum_{i=1}^{m}\sum_{j=i+1}^{m} 2\,a_i a_j b_i b_j
\]

\[
(a\cdot b + 1)^2
= \Bigl(\sum_{i=1}^{m} a_i b_i\Bigr)^{2} + 2\sum_{i=1}^{m} a_i b_i + 1
= \sum_{i=1}^{m}\sum_{j=1}^{m} a_i b_i a_j b_j + 2\sum_{i=1}^{m} a_i b_i + 1
= \sum_{i=1}^{m} (a_i b_i)^2 + \sum_{i=1}^{m}\sum_{j=i+1}^{m} 2\,a_i a_j b_i b_j + 2\sum_{i=1}^{m} a_i b_i + 1
\]
They’re the same!
And this is only O(m) to
compute!
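A quick numerical check of this identity (the vectors a and b are arbitrary made-up examples): build Φ explicitly, take the dot product, and compare with the O(m) kernel evaluation:

import numpy as np

def phi(v):
    """Constant term, linear terms, pure quadratic terms, and quadratic cross-terms."""
    m = len(v)
    cross = [np.sqrt(2) * v[i] * v[j] for i in range(m) for j in range(i + 1, m)]
    return np.concatenate([[1.0], np.sqrt(2) * v, v**2, cross])

a = np.array([1.0, 2.0, 3.0])
b = np.array([0.5, -1.0, 2.0])
print(phi(a) @ phi(b))        # dot product in the high-dimensional space
print((a @ b + 1) ** 2)       # the kernel (a.b + 1)^2 -- same value, only O(m) work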
SVM Kernel Functions
• K(a, b) = (a . b + 1)^d is an example of an SVM Kernel Function
• Beyond polynomials there are other very high dimensional basis functions that can be made practical by finding the right Kernel Function
• Radial-Basis-style Kernel Function:
  K(a, b) = exp( -(a - b)² / (2σ²) )
• Neural-net-style Kernel Function:
  K(a, b) = tanh( κ a . b - δ )
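In practice these kernels are handed straight to the QP machinery instead of building basis functions explicitly; a sketch with scikit-learn's SVC (assumed for the illustration; its gamma parameter plays the role of 1/(2σ²) in the radial-basis kernel above):

import numpy as np
from sklearn.svm import SVC

# The made-up "harder" 1-D data again: negatives in the middle, positives outside
X = np.array([[-3.0], [-2.0], [2.0], [3.0], [-0.5], [0.0], [0.5]])
y = np.array([1, 1, 1, 1, -1, -1, -1])

rbf = SVC(kernel='rbf', gamma=0.5).fit(X, y)     # radial-basis kernel, no explicit basis functions
print(rbf.predict([[0.2], [-2.5]]))              # an inner point vs. an outer point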
Non-linear SVM with Kernels
Doing multi-class classification
• SVMs can only handle two-class outputs (i.e. a categorical output variable with arity 2).
• What can be done?
• Answer: with output arity N, learn N SVMs:
  • SVM 1 learns "Output==1" vs "Output != 1"
  • SVM 2 learns "Output==2" vs "Output != 2"
  • :
  • SVM N learns "Output==N" vs "Output != N"
• Then to predict the output for a new input, just
predict with each SVM and find out which one puts
the prediction the furthest into the positive region.
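A sketch of that one-vs-rest scheme (scikit-learn's SVC and the 3-class toy data are assumptions of the illustration; decision_function reports how far a point lies on the positive side of each machine):

import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0], [0.5], [5.0], [5.5], [10.0], [10.5]])
y = np.array([1, 1, 2, 2, 3, 3])                            # output arity N = 3

classes = np.unique(y)
machines = [SVC(kernel='linear').fit(X, np.where(y == c, 1, -1)) for c in classes]

def predict(x):
    scores = [m.decision_function([x])[0] for m in machines]  # "Output==c" vs "Output!=c"
    return classes[int(np.argmax(scores))]                    # furthest into the positive region wins

print(predict([5.2]))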
What You Should Know
• Linear SVMs
• The definition of a maximum margin classifier
• What QP can do for you (but, for this class, you
don’t need to know how it does it)
• How Maximum Margin can be turned into a QP
problem
• How we deal with noisy (non-separable) data
• How we permit non-linear boundaries
• How SVM Kernel functions permit us to pretend we're working with ultra-high-dimensional basis-function terms
Remaining…
• Actually, I skip:
  • ERM (Empirical Risk Minimization)
  • VC (Vapnik-Chervonenkis) dimension
• But these are needed at data-processing time:
  • Handling unbalanced datasets
• And these are needed at tuning time:
  • Selecting C
  • Selecting a Kernel
    • What is a valid Kernel?
    • How to construct a valid Kernel?
Now, the real game is starting in earnest!
SVMlight
• http://svmlight.joachims.org (Thorsten Joachims @ Cornell Univ.)
• Commands
• svm_learn [options] [train_file] [model_file]
• svm_learn -c 1.5 -x 1 train_file model_file
• svm_classify [test_file] [model_file] [prediction_file]
• svm_classify test_file model_file prediction_file
• File Format
• <y> <featureNumber>:<value> … <featureNumber>:<value>
#comment
• 1 23:0.5 105:0.1 1023:0.8 1999:0.34
The example line above corresponds to the following (sparse) feature vector, where every unlisted index is 0:

  Idx    1  ...  23   ...  105  ...  1023  1024  ...  1998  1999  ...  | cls
  Value  0  ...  0.5  ...  0.1  ...  0.8   0     ...  0     0.34  ...  | 1
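A tiny helper showing how such a sparse line can be produced from a feature dictionary (a hypothetical utility, not part of SVMlight itself):

def to_svmlight_line(label, features):
    """features: {feature_index: value}; zero-valued features are simply left out."""
    parts = [str(label)] + ["%d:%g" % (idx, val) for idx, val in sorted(features.items())]
    return " ".join(parts)

print(to_svmlight_line(1, {23: 0.5, 105: 0.1, 1023: 0.8, 1999: 0.34}))
# -> 1 23:0.5 105:0.1 1023:0.8 1999:0.34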
Options
• -z <char> : selects whether the SVM is trained in classification (c) or regression (r) mode
• -c <float> : controls the trade-off between training error and margin; the lower the value of C, the more training error is tolerated. The best value of C depends on the data and must be determined empirically.
• -w <float> : width of the tube (i.e. ε) of the ε-insensitive loss function used in regression. The value must be non-negative.
• -j <float> : specifies cost-factors for the loss functions, both in classification and regression.
• -b <int> : switches between a biased (1) or an unbiased (0) hyperplane.
• -i <int> : selects how training errors are treated (1-4).
• -x <int> : selects whether to compute leave-one-out estimates of the generalization performance.
• -t <int> : selects the type of kernel function (0-4).
• -q <int> : maximum size of working set.
• -m <int> : specifies the amount of memory (in megabytes) available for caching kernel evaluations.
• ……………………………………..
Example 1: Sense Disambiguation
• AK has 3 senses:
  • above knee, acetate kinase, artificial kidney
• Example:
  • w13, w345, w874, w8, w7345, AK(acetate kinase), w123, w13, w8, w14, w8
  • => 2 w8:3 w13:2 w14:1 w123:1 w345:1 w874:1 w7345:1
  (The surrounding words become count-valued features; the sense, here sense 2 (acetate kinase), becomes the class label.)
Example of training set
1 109:0.060606 193:0.090909 211:0.052632 3079:1.000000 5882:0.200000
1 109:0.030303 143:1.000000 377:0.500000 439:0.500000 1012:0.500000 3808:1.000000
3891:1.000000 4789:0.333333 18363:0.333333 23106:1.000000
1 174:0.166667 244:0.500000 321:0.500000 332:0.500000 723:0.333333 3064:0.500000
3872:1.000000 16401:1.000000 19369:1.000000 23109:1.000000
2 108:0.250000 148:0.333333 202:0.250000 313:0.200000 380:1.000000 3303:1.000000
8944:1.000000 11513:1.000000 23110:1.000000 23111:1.000000
1 100:0.125000 109:0.030303 129:0.200000 131:0.071429 135:0.050000 543:0.500000
5880:0.200000 17153:0.100000 23112:0.100000
1 100:0.125000 109:0.060606 125:0.500000 131:0.071429 193:0.045455 2553:0.200000
4790:0.100000 5880:0.200000
2 382:1.000000 410:1.000000 790:1.000000 1160:0.500000 2052:1.000000 3260:1.000000
7323:0.500000 23113:1.000000 23114:1.000000 23115:1.000000
2 133:1.000000 147:0.250000 148:0.333333 177:0.500000 214:1.000000 960:1.000000
1131:1.000000 1359:1.000000 6378:0.333333 22842:1.000000
1 109:0.060606 917:1.000000 2747:0.500000 2748:0.500000 2749:1.000000
3252:1.000000 21699:1.000000
1 543:0.500000 557:1.000000 563:0.500000 1077:1.000000 1160:0.500000 2747:0.500000
2748:0.500000 12805:1.000000 23116:1.000000 23117:1.000000
1 593:1.000000 1124:0.500000 1615:1.000000 3607:1.000000 13443:1.000000
23118:1.000000 23119:1.000000 23120:1.000000
…………………………………………………………
Trained model of SVM
SVM-multiclass Version V1.01
2 # number of classes
24208 # number of base features
0 # loss function
3 # kernel type
3 # kernel parameter -d
1 # kernel parameter -g
1 # kernel parameter -s
1 # kernel parameter -r
empty# kernel parameter -u
48584 # highest feature index
167 # number of training documents
335 # number of support vectors plus 1
0 # threshold b, each following line is a SV (starting with alpha*y)
0.010000000000000000208166817117217 24401:0.2 24407:0.071428999 24756:1 24909:0.
25 25283:1 25424:0.33333299 25487:0.043478001 25773:0.5 31693:1 37411:1 #
-0.010000000000000000208166817117217 193:0.2 199:0.071428999 548:1 701:0.25 1075
:1 1216:0.33333299 1279:0.043478001 1565:0.5 7485:1 13203:1 #
0.010000000000000000208166817117217 24407:0.071428999 24419:0.0625 25807:0.16666
7 26905:0.166667 27631:1 27664:1 35043:1 35888:1 36118:1 48085:1 #
-0.010000000000000000208166817117217 199:0.071428999 211:0.0625 1599:0.166667 26
97:0.166667 3423:1 3456:1 10835:1 11680:1 11910:1 23877:1 #
……………………………………………………
Example 2: Text Chunking
Example 3: Text Classification
• Corpus: Reuters-21578
  • 12,902 news stories
  • 118 categories
  • ModApte split (75% training: 9,603 docs / 25% test: 3,299 docs)
Q&A