Visualizing and Exploring Data


Kernel Methods and SVMs
Predictive Modeling
Goal: learn a mapping: y = f(x;)
Need: 1. A model structure
2. A score function
3. An optimization strategy
Categorical y  {c1,…,cm}: classification
Real-valued y: regression
Note: usually assume {c1,…,cm} are mutually exclusive and
exhaustive
Simple Two-Class Perceptron
$$h(x) = \mathrm{sgn}(w \cdot x + b) = \mathrm{sgn}\Big(\sum_{j=1}^{p} w_j x_j + b\Big), \qquad y_i \in \{-1, +1\}$$
Initialize the weight vector: $w_0 = 0$; $b_0 = 0$.
Repeat one or more times (indexed by $k$):
  For each training data point $x_i$:
    If $y_i (w_k \cdot x_i + b_k) \le 0$ then
      $w_{k+1} = w_k + \eta\, y_i x_i$
      $b_{k+1} = b_k + \eta\, y_i R^2$
      …
    endIf
(a form of "gradient descent"; here $\eta$ is a learning rate and $R = \max_i \|x_i\|$)
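A minimal NumPy sketch of this update rule (my own, assuming $\eta = 1$ and a fixed maximum number of passes; neither choice comes from the slide):

```python
# Primal perceptron with the R^2 bias update described above.
import numpy as np

def perceptron_primal(X, y, n_passes=10, eta=1.0):
    """X: (n, p) array of predictors; y: (n,) array of labels in {-1, +1}."""
    n, p = X.shape
    w = np.zeros(p)
    b = 0.0
    R2 = np.max(np.sum(X ** 2, axis=1))        # R^2 = max_i ||x_i||^2
    for _ in range(n_passes):
        mistakes = 0
        for i in range(n):
            if y[i] * (X[i] @ w + b) <= 0:     # misclassified (or on the boundary)
                w = w + eta * y[i] * X[i]      # w_{k+1} = w_k + eta y_i x_i
                b = b + eta * y[i] * R2        # b_{k+1} = b_k + eta y_i R^2
                mistakes += 1
        if mistakes == 0:                      # no mistakes in a full pass: done
            break
    return w, b
```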
Perceptron Dual Form
Notice that $w$ ends up as a linear combination of the $y_j x_j$:
$$w = \sum_{j=1}^{n} \alpha_j y_j x_j$$
where each $\alpha_j$ is non-negative and bigger for "harder" examples.
Thus:
$$h(x) = \mathrm{sgn}(w \cdot x + b) = \mathrm{sgn}\Big(\sum_{j=1}^{n} \alpha_j y_j (x_j \cdot x) + b\Big)$$
This leads to a dual form of the learning algorithm:
Perceptron Dual Form
Initialize the coefficient vector: $\alpha = 0$; $b = 0$.
Repeat until no more mistakes:
  For each training data point $x_i$:
    If $y_i \Big(\sum_{j=1}^{N} \alpha_j y_j (x_j \cdot x_i) + b\Big) \le 0$ then
      $\alpha_i \leftarrow \alpha_i + \eta$
      $b \leftarrow b + y_i R^2$
      …
    endIf
Note: the training data only enter the algorithm via the inner products $x_j \cdot x_i$.
This is generally true for linear models (e.g., linear regression, ridge regression).
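A matching sketch of the dual form (same assumptions as above); the data enter only through the Gram matrix of inner products, so swapping in any kernel matrix kernelizes the algorithm with no other changes:

```python
# Dual perceptron: the data appear only through G[i, j] = x_i . x_j.
import numpy as np

def perceptron_dual(X, y, n_passes=10, eta=1.0):
    n = X.shape[0]
    G = X @ X.T                              # Gram matrix of pairwise inner products
    alpha = np.zeros(n)
    b = 0.0
    R2 = np.max(np.diag(G))                  # R^2 = max_i ||x_i||^2
    for _ in range(n_passes):
        mistakes = 0
        for i in range(n):
            if y[i] * (np.sum(alpha * y * G[:, i]) + b) <= 0:
                alpha[i] += eta              # alpha_i <- alpha_i + eta
                b += y[i] * R2               # b <- b + y_i R^2
                mistakes += 1
        if mistakes == 0:
            break
    return alpha, b
```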
Learning in Feature Space
We have already seen the idea of changing the representation of the predictors:
$$x = (x_1, \ldots, x_p) \;\longrightarrow\; \Phi(x) = (\phi_1(x), \ldots, \phi_P(x))$$
$$F = \{\Phi(x) : x \in X\}$$
is called the feature space.
Linear Feature Space Models
Now consider models of the form:
$$f(x) = \sum_{i=1}^{P} w_i \phi_i(x) + b$$
or, equivalently:
$$f(x) = \sum_{i=1}^{N} \alpha_i y_i\, \Phi(x_i) \cdot \Phi(x) + b$$
A kernel is a function $K$ such that for all $x, z \in X$
$$K(x, z) = \Phi(x) \cdot \Phi(z)$$
where $\Phi$ is a mapping from $X$ to an inner product feature space $F$.
We just need to know $K$, not $\Phi$!
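As a concrete illustration (my example, not from the slides): on $\mathbb{R}^2$ the quadratic kernel $K(x, z) = (x \cdot z)^2$ corresponds to the explicit feature map $\Phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$, and the two ways of computing the inner product agree:

```python
# Check numerically that K(x, z) = (x . z)^2 equals Phi(x) . Phi(z).
import numpy as np

def phi(x):
    """Explicit feature map for the quadratic kernel on R^2."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def K(x, z):
    """The same inner product, computed without ever forming Phi."""
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(phi(x) @ phi(z))    # 1.0
print(K(x, z))            # 1.0
```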
Making Kernels
What properties must $K$ satisfy to be a kernel?
1. Symmetry:
$$K(x, z) = \Phi(x) \cdot \Phi(z) = \Phi(z) \cdot \Phi(x) = K(z, x)$$
2. Cauchy-Schwarz:
$$K(x, z)^2 = (\Phi(x) \cdot \Phi(z))^2 \le \|\Phi(x)\|^2\, \|\Phi(z)\|^2 = K(x, x)\, K(z, z)$$
+ other conditions: $K$ must be positive semi-definite (Mercer’s Theorem, below).
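A quick numerical check of the positive semi-definiteness requirement (my example; the kernels and points are arbitrary choices): a valid kernel must yield Gram matrices with no negative eigenvalues, whatever points we evaluate it on.

```python
# Eigenvalues of the Gram matrix K[i, j] = K(x_i, x_j) should all be >= 0
# (up to round-off) if K is a valid kernel.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                  # 50 arbitrary points in R^3

sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_good = np.exp(-sq_dists)                    # Gaussian kernel: a valid kernel
K_bad = -sq_dists                             # -||x - z||^2: not a valid kernel

print(np.linalg.eigvalsh(K_good).min())   # ~0 (tiny round-off at worst): PSD
print(np.linalg.eigvalsh(K_bad).min())    # clearly negative: fails the PSD test
```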
Mercer’s Theorem
Mercer’s Theorem gives necessary and sufficient conditions for a continuous symmetric function $K$ to admit this representation:
$$K(x, z) = \Phi(x) \cdot \Phi(z) = \sum_{i=1}^{\infty} \gamma_i\, \phi_i(x)\, \phi_i(z)$$
Such kernels are called "Mercer kernels".
This kernel defines a set of functions $H_K$, elements of which have an expansion as:
$$f(x) = \sum_{i=1}^{\infty} c_i \phi_i(x) = \sum_{i=1}^{N} \alpha_i K(x, x_i)$$
So, some kernels correspond to an infinite number of transformed predictor variables.
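A finite-sample way to see this (my illustration, not from the slides): eigendecomposing a Gram matrix writes it as a sum of rank-one terms, a discrete analogue of the Mercer expansion; for a kernel like the Gaussian the full expansion has infinitely many terms, and a Gram matrix of $N$ points only ever exposes the first $N$ of them.

```python
# Discrete analogue of the Mercer expansion: K = sum_k lam_k v_k v_k^T.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2))

sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists)                       # Gaussian kernel Gram matrix

lam, V = np.linalg.eigh(K)                  # eigenvalues (ascending) and eigenvectors
print(lam[::-1][:5])                        # leading terms: positive and decaying

K_rebuilt = (V * lam) @ V.T                 # sum_k lam_k v_k v_k^T
print(np.allclose(K, K_rebuilt))            # True: the expansion reproduces K
```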
Reproducing Kernel Hilbert Space
Define an inner product in this function space as:
$$\Big\langle \sum_{i=1}^{\infty} c_i \phi_i(x),\; \sum_{i=1}^{\infty} d_i \phi_i(x) \Big\rangle = \sum_{i=1}^{\infty} \frac{c_i d_i}{\gamma_i}$$
Note then that:
$$\langle f(x), K(x, y) \rangle = \Big\langle \sum_i c_i \phi_i(x),\; \sum_i \gamma_i \phi_i(x) \phi_i(y) \Big\rangle = \sum_i c_i \phi_i(y) = f(y)$$
and also $\langle K(\cdot, x), K(\cdot, y) \rangle = K(x, y)$.
This is the reproducing property of $H_K$.
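A short bridging step (not on the slide, but it follows directly from the reproducing property and connects to the penalty used two slides below): for a function expanded over the training points, $f(x) = \sum_{i=1}^{N} \alpha_i K(x, x_i)$,
$$\|f\|_K^2 = \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j\, \langle K(\cdot, x_i), K(\cdot, x_j) \rangle = \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j\, K(x_i, x_j) = \alpha^T \mathbf{K} \alpha,$$
where $\mathbf{K}$ is the $N \times N$ Gram matrix with entries $K(x_i, x_j)$.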
Also note that a Mercer kernel implies:
$$\|f\|_K^2 = \sum_{i=1}^{\infty} \frac{c_i^2}{\gamma_i} < \infty$$
Regularization and RKHS
A general class of regularization problems has the form:
$$\min_{f \in H} \; \sum_{i=1}^{N} L\big(y_i, f(x_i)\big) + \lambda\, J(f)$$
where $L$ is some loss function (e.g. squared loss) and $J(f)$ penalizes complex $f$.
Suppose $f$ lives in an RKHS with
$$K(x, y) = \sum_{i=1}^{\infty} \gamma_i \phi_i(x) \phi_i(y) \qquad\text{and}\qquad f(x) = \sum_{i=1}^{\infty} c_i \phi_i(x) = \sum_{i=1}^{N} \alpha_i K(x, x_i)$$
Let $J(f) = \|f\|_K^2$. Then we only need to solve this "easy" finite-dimensional problem:
$$\min_{\alpha} \; L(y, \mathbf{K}\alpha) + \lambda\, \alpha^T \mathbf{K} \alpha$$
RKHS Examples
For regression with squared-error loss, we have
$$\hat{\alpha} = (\mathbf{K} + \lambda I)^{-1} y \qquad\text{and}\qquad \hat{f}(x) = \sum_{j=1}^{N} \hat{\alpha}_j K(x, x_j)$$
so that:
$$\hat{f} = \mathbf{K}\hat{\alpha} = \mathbf{K}(\mathbf{K} + \lambda I)^{-1} y = (I + \lambda \mathbf{K}^{-1})^{-1} y$$
This generalizes smoothing splines…
Choosing:
$$K(x, y) = \|x - y\|^2 \log\big(\|x - y\|\big)$$
leads to the thin-plate spline models.
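A minimal sketch of the squared-error case above (the kernel, data, and $\lambda$ value here are my own choices for illustration):

```python
# Kernel regression with squared-error loss:
#   alpha_hat = (K + lam*I)^{-1} y,   f_hat(x) = sum_j alpha_hat_j K(x, x_j)
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """K(a, b) = exp(-gamma ||a - b||^2) for all pairs of rows of A and B."""
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=80)    # noisy 1-D regression data

lam = 0.1
K = rbf_kernel(X, X)
alpha_hat = np.linalg.solve(K + lam * np.eye(len(y)), y)   # (K + lam I)^{-1} y

X_new = np.linspace(-3, 3, 5).reshape(-1, 1)
f_hat = rbf_kernel(X_new, X) @ alpha_hat           # sum_j alpha_hat_j K(x, x_j)
print(np.round(f_hat, 2))                          # roughly tracks sin(x) at the 5 points
```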
Support Vector Machine
A two-class classifier with the form:
$$f(x) = \alpha_0 + \sum_{i=1}^{N} \alpha_i K(x, x_i)$$
with parameters chosen to minimize:
$$\min_{\alpha_0, \alpha} \; \sum_{i=1}^{N} \big[1 - y_i f(x_i)\big]_{+} + \lambda\, \alpha^T \mathbf{K} \alpha$$
Many of the fitted $\alpha_i$'s are usually zero; the $x_i$'s corresponding to the non-zero $\alpha_i$'s are the support vectors.
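Below is a sketch that fits this criterion directly with a general-purpose optimizer (the kernel, data, $\lambda$, and the use of SciPy's minimizer are all my choices; in practice one would use a dedicated SVM solver):

```python
# Minimize  sum_i [1 - y_i f(x_i)]_+  +  lam * alpha^T K alpha
# with  f(x) = alpha_0 + sum_i alpha_i K(x, x_i).
import numpy as np
from scipy.optimize import minimize

def rbf_kernel(A, B, gamma=1.0):
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1.0, -1.0)     # labels needing a non-linear boundary

K = rbf_kernel(X, X)
lam = 0.1

def objective(params):
    alpha0, alpha = params[0], params[1:]
    f = alpha0 + K @ alpha
    hinge = np.maximum(0.0, 1.0 - y * f)           # [1 - y_i f(x_i)]_+
    return hinge.sum() + lam * alpha @ K @ alpha   # loss + penalty

res = minimize(objective, np.zeros(len(y) + 1), method="L-BFGS-B")
alpha0_hat, alpha_hat = res.x[0], res.x[1:]
pred = np.sign(alpha0_hat + K @ alpha_hat)
print("training accuracy:", np.mean(pred == y))    # should be well above chance here
```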