Visualizing and Exploring Data
Kernel Methods and SVMs
Predictive Modeling
Goal: learn a mapping $y = f(x; \theta)$
Need: 1. A model structure
2. A score function
3. An optimization strategy
Categorical $y \in \{c_1, \ldots, c_m\}$: classification
Real-valued $y$: regression
Note: usually assume $\{c_1, \ldots, c_m\}$ are mutually exclusive and exhaustive
Simple Two-Class Perceptron
$$h(x) = \mathrm{sgn}(w \cdot x + b) = \mathrm{sgn}\Big(\sum_{j=1}^{p} w_j x_j + b\Big)$$
Initialize the weight vector and bias: $w_0 = 0$; $b_0 = 0$
Repeat one or more times (indexed by $k$):
    For each training data point $x_i$:
        If $y_i (w_k \cdot x_i + b_k) \le 0$ then
            $w_{k+1} = w_k + y_i x_i$
            $b_{k+1} = b_k + y_i R^2$
            …
        endIf
(a form of "gradient descent"; here $y_i \in \{-1, +1\}$ and $R$ bounds the norms $\|x_i\|$)
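A minimal NumPy sketch of this update loop (the function name, epoch cap, and early-stopping check are illustrative additions, not from the slides):

```python
import numpy as np

def perceptron_primal(X, y, n_epochs=10):
    """Primal perceptron: X is (n, p), labels y are in {-1, +1}."""
    n, p = X.shape
    w = np.zeros(p)
    b = 0.0
    R = np.max(np.linalg.norm(X, axis=1))   # bound on the norms ||x_i||
    for _ in range(n_epochs):
        mistakes = 0
        for i in range(n):
            # Misclassified (or on the boundary): update w and b
            if y[i] * (w @ X[i] + b) <= 0:
                w = w + y[i] * X[i]
                b = b + y[i] * R**2
                mistakes += 1
        if mistakes == 0:   # converged: every point classified correctly
            break
    return w, b
```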
Perceptron Dual Form
Notice that $w$ ends up as a linear combination of the $y_j x_j$:
$$w = \sum_{j=1}^{n} \alpha_j y_j x_j$$
Thus:
$$h(x) = \mathrm{sgn}(w \cdot x + b) = \mathrm{sgn}\Big(\sum_{j=1}^{n} \alpha_j y_j (x_j \cdot x) + b\Big)$$
(the $\alpha_j$ are positive, and bigger for "harder" examples)
This leads to a dual form of the learning algorithm:
Perceptron Dual Form
Initialize $\alpha = 0$; $b_0 = 0$
Repeat until no more mistakes:
    For each training data point $x_i$:
        If $y_i \Big(\sum_{j=1}^{N} \alpha_j y_j (x_j \cdot x_i) + b\Big) \le 0$ then
            $\alpha_i = \alpha_i + 1$
            $b = b + y_i R^2$
            …
        endIf
Note: the training data only enter the algorithm via the inner products $x_j \cdot x_i$.
This is generally true for linear models (e.g., linear regression, ridge regression).
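Because the data enter only through inner products, the dual algorithm can be written entirely in terms of a Gram matrix; here is a minimal sketch (the function name, epoch cap, and stopping rule are illustrative). Replacing the Gram matrix with a kernel matrix is what later "kernelizes" the algorithm.

```python
import numpy as np

def perceptron_dual(X, y, n_epochs=10):
    """Dual perceptron: the data enter only via inner products x_j . x_i."""
    n = X.shape[0]
    alpha = np.zeros(n)
    b = 0.0
    G = X @ X.T                          # Gram matrix of all inner products
    R = np.sqrt(np.max(np.diag(G)))      # bound on the norms ||x_i||
    for _ in range(n_epochs):
        mistakes = 0
        for i in range(n):
            # Mistake on example i: sum_j alpha_j y_j (x_j . x_i) + b has the wrong sign
            if y[i] * (np.sum(alpha * y * G[:, i]) + b) <= 0:
                alpha[i] += 1            # count the mistake on example i
                b += y[i] * R**2
                mistakes += 1
        if mistakes == 0:
            break
    return alpha, b
```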
Learning in Feature Space
We have already seen the idea of changing the representation of the predictors:
$$x = (x_1, \ldots, x_p) \;\longmapsto\; \phi(x) = (\phi_1(x), \ldots, \phi_P(x))$$
$$F = \{\phi(x) : x \in X\}$$
$F$ is called the feature space.
Linear Feature Space Models
Now consider models of the form:
$$f(x) = \sum_{i=1}^{P} w_i \phi_i(x) + b$$
or, equivalently:
$$f(x) = \sum_{i=1}^{N} \alpha_i y_i \,\phi(x_i) \cdot \phi(x) + b$$
A kernel is a function $K$ such that for all $x, z \in X$:
$$K(x, z) = \phi(x) \cdot \phi(z)$$
where $\phi$ is a mapping from $X$ to an inner product feature space $F$.
We just need to know $K$, not $\phi$!
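As a small numerical check (not from the slides): in two dimensions, the degree-2 polynomial kernel $K(x, z) = (x \cdot z)^2$ equals the inner product under the explicit map $\phi(x) = (x_1^2, x_2^2, \sqrt{2}\,x_1 x_2)$, so the kernel can be evaluated without ever forming $\phi$:

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel in 2-D."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

def K(x, z):
    """Kernel computed directly in input space: K(x, z) = (x . z)^2."""
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(phi(x) @ phi(z))   # inner product in feature space
print(K(x, z))           # same value, computed without forming phi
```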
Making Kernels
What properties must $K$ satisfy to be a kernel?
1. Symmetry:
$$K(x, z) = \phi(x) \cdot \phi(z) = \phi(z) \cdot \phi(x) = K(z, x)$$
2. Cauchy-Schwarz:
$$K(x, z)^2 = \big(\phi(x) \cdot \phi(z)\big)^2 \le \|\phi(x)\|^2 \|\phi(z)\|^2 = K(x, x)\, K(z, z)$$
+ other conditions
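The key extra condition, positive semi-definiteness (next slide), can be checked numerically on a finite sample by looking at the eigenvalues of the Gram matrix; a minimal sketch, with illustrative function and variable names:

```python
import numpy as np

def is_psd_gram(X, kernel, tol=1e-10):
    """Check numerically that the Gram matrix of `kernel` on the sample X
    is symmetric positive semi-definite -- a necessary condition for a kernel."""
    n = len(X)
    G = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    if not np.allclose(G, G.T):
        return False
    eigvals = np.linalg.eigvalsh(G)      # eigenvalues of the symmetric Gram matrix
    return bool(np.min(eigvals) >= -tol)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
print(is_psd_gram(X, lambda x, z: (x @ z + 1) ** 2))        # polynomial kernel: True
print(is_psd_gram(X, lambda x, z: np.linalg.norm(x - z)))   # plain distance: False, not a kernel
```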
Mercer’s Theorem
Mercer's Theorem gives necessary and sufficient conditions ($K$ "positive semi-definite") for a continuous symmetric function $K$ to admit this representation:
$$K(x, z) = \phi(x) \cdot \phi(z) = \sum_{i=1}^{\infty} \gamma_i \phi_i(x) \phi_i(z)$$
Such kernels are called "Mercer kernels".
This kernel defines a set of functions $H_K$, elements of which have an expansion of the form:
$$f(x) = \sum_{i=1}^{\infty} c_i \phi_i(x) = \sum_{i=1}^{N} \alpha_i K(x, x_i)$$
So, some kernels correspond to infinite numbers of transformed predictor variables.
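One concrete case of an infinite expansion (a 1-D sketch; the unit bandwidth and truncation length are illustrative assumptions): the Gaussian kernel $e^{-(x-z)^2/2}$ can be written as $\sum_{k=0}^{\infty} \phi_k(x)\phi_k(z)$ with $\phi_k(x) = e^{-x^2/2}\, x^k / \sqrt{k!}$, and a truncated feature map already approximates it closely:

```python
import numpy as np
from math import factorial

def gaussian_kernel(x, z):
    """1-D Gaussian (RBF) kernel with unit bandwidth."""
    return np.exp(-0.5 * (x - z) ** 2)

def phi_truncated(x, n_terms=20):
    """First n_terms coordinates of the infinite feature map
    phi_k(x) = exp(-x^2/2) * x^k / sqrt(k!)."""
    return np.array([np.exp(-0.5 * x**2) * x**k / np.sqrt(factorial(k))
                     for k in range(n_terms)])

x, z = 0.7, -0.3
print(gaussian_kernel(x, z))                  # exact kernel value
print(phi_truncated(x) @ phi_truncated(z))    # converges to it as n_terms grows
```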
Reproducing Kernel Hilbert Space
Define an inner product in this function space as:
$$\Big\langle \sum_{i=1}^{\infty} c_i \phi_i(x), \; \sum_{i=1}^{\infty} d_i \phi_i(x) \Big\rangle = \sum_{i=1}^{\infty} \frac{c_i d_i}{\gamma_i}$$
Note then that:
$$\langle f(x), K(x, y) \rangle = \Big\langle \sum_i c_i \phi_i(x), \; \sum_i \gamma_i \phi_i(x) \phi_i(y) \Big\rangle = \sum_i c_i \phi_i(y) = f(y)$$
and also $\langle K(\cdot, x), K(\cdot, y) \rangle = K(x, y)$.
This is the reproducing property of $H_K$.
Also note that, for a Mercer kernel:
$$\|f\|_K^2 = \sum_{i=1}^{\infty} \frac{c_i^2}{\gamma_i}$$
Regularization and RKHS
A general class of regularization problems has the form:
$$\min_{f \in H} \sum_{i=1}^{N} L(y_i, f(x_i)) + \lambda J(f)$$
where $L$ is some loss function (e.g., squared loss) and $J(f)$ penalizes complex $f$.
Suppose $f$ lives in a RKHS with
$$K(x, y) = \sum_{i=1}^{\infty} \gamma_i \phi_i(x) \phi_i(y) \quad\text{and}\quad f(x) = \sum_{i=1}^{\infty} c_i \phi_i(x)$$
Let $J(f) = \|f\|_K^2$. Then the solution has the form
$$f(x) = \sum_{i=1}^{N} \alpha_i K(x, x_i)$$
and we only need to solve this "easy" finite-dimensional problem:
$$\min_{\alpha} L(y, K\alpha) + \lambda\, \alpha^T K \alpha$$
RKHS Examples
For regression with squared-error loss, we have
$$\hat{\alpha} = (K + \lambda I)^{-1} y \quad\text{and}\quad \hat{f}(x) = \sum_{j=1}^{N} \hat{\alpha}_j K(x, x_j)$$
so that:
$$\hat{f} = K\hat{\alpha} = K (K + \lambda I)^{-1} y = (I + \lambda K^{-1})^{-1} y$$
This generalizes smoothing splines…
Choosing:
$$K(x, y) = \|x - y\|^2 \log(\|x - y\|)$$
leads to the thin-plate spline models.
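A minimal NumPy sketch of the closed-form solution $\hat{\alpha} = (K + \lambda I)^{-1} y$ above (the RBF kernel, the value of $\lambda$, and the toy data are assumptions for illustration):

```python
import numpy as np

def kernel_ridge_fit(X, y, kernel, lam):
    """Solve alpha_hat = (K + lam*I)^{-1} y for the kernelized squared-error problem."""
    n = X.shape[0]
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    return np.linalg.solve(K + lam * np.eye(n), y)

def kernel_ridge_predict(x, X, alpha, kernel):
    """f_hat(x) = sum_j alpha_hat_j K(x, x_j)."""
    return sum(a * kernel(x, xj) for a, xj in zip(alpha, X))

# Toy 1-D example with an RBF kernel
rbf = lambda x, z: np.exp(-0.5 * np.sum((x - z) ** 2))
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)
alpha = kernel_ridge_fit(X, y, rbf, lam=0.1)
print(kernel_ridge_predict(np.array([0.5]), X, alpha, rbf))   # roughly sin(0.5)
```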
Support Vector Machine
Two-class classifier with the form:
$$f(x) = \alpha_0 + \sum_{i=1}^{N} \alpha_i K(x, x_i)$$
with parameters chosen to minimize:
$$\min_{\alpha_0, \alpha} \sum_{i=1}^{N} \big[1 - y_i f(x_i)\big]_+ + \lambda\, \alpha^T K \alpha$$
Many of the fitted $\alpha_i$'s are usually zero; the $x_i$'s corresponding to the non-zero $\alpha_i$'s are the support vectors.
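For a quick illustration with an off-the-shelf implementation (the RBF kernel, the value of C, and the toy data are assumptions; scikit-learn solves an equivalent C-parameterized form of the hinge-loss problem above):

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated Gaussian blobs with labels in {-1, +1}
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1, 1, size=(30, 2)), rng.normal(+1, 1, size=(30, 2))])
y = np.array([-1] * 30 + [+1] * 30)

clf = SVC(kernel="rbf", C=1.0).fit(X, y)

# Only the support vectors get non-zero coefficients in the expansion of f(x)
print(len(clf.support_vectors_), "support vectors out of", len(X), "training points")
print(clf.dual_coef_.shape)   # coefficients (alpha_i * y_i) for the support vectors only
```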