Input Space versus Feature Space in Kernel-Based Methods
Schölkopf, Mika, Burges, Knirsch, Müller, Rätsch, Smola
presented by:
Joe Drish, Department of Computer Science and Engineering, University of California, San Diego
Goals
Objectives of the paper:
1) Introduce and illustrate the kernel trick (mapping the data into a feature space F)
2) Review kernel algorithms: SVMs and kernel PCA
3) Discuss the interpretation of the return from F (mapping from feature space back to input space)
4) Discuss how to construct sparse approximations of feature space expansions
5) Evaluate and discuss the performance of SVMs and kernel PCA
6) Applications of kernel methods:
   1) Handwritten digit recognition
   2) Face recognition
   3) De-noising: this paper
Definition
A reproducing kernel k is a function with values k(x, y) ∈ R.
• The domain of k consists of the data patterns {x_1, …, x_l}.
• The data lives in a compact set, typically a subset of R^N.
• Computing k is equivalent to mapping data patterns into a higher-dimensional space F and then taking the dot product there.
A feature map Φ : R^N → F is a function that maps the input data patterns into a higher-dimensional space F.
Illustration
Using a feature map Φ into a higher-dimensional feature space F:
[Figure: X and O patterns in input space are mapped to Φ(X) and Φ(O) in F, where the two classes become linearly separable.]
Kernel Trick
We would like to compute the dot product in the higher-dimensional space, Φ(x) · Φ(y).
To do this we only need to compute k(x, y), since k(x, y) = Φ(x) · Φ(y).
Note that the feature map Φ is never explicitly computed. We avoid it, and thereby avoid a burdensome computational task.
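As an illustration (not from the paper), the identity can be checked numerically for a degree-2 polynomial kernel, whose explicit feature map is small enough to write down; the following Python sketch assumes 2-dimensional inputs:

    import numpy as np

    def phi(v):
        # Explicit degree-2 feature map for a 2-D input:
        # phi(v) = (v1^2, v2^2, sqrt(2) * v1 * v2)
        return np.array([v[0] ** 2, v[1] ** 2, np.sqrt(2) * v[0] * v[1]])

    def poly_kernel(x, y):
        # Degree-2 polynomial kernel k(x, y) = (x . y)^2
        return np.dot(x, y) ** 2

    x = np.array([1.0, 2.0])
    y = np.array([3.0, 0.5])

    # Both routes give the same value; only the kernel avoids ever
    # constructing the feature-space vectors explicitly.
    print(np.dot(phi(x), phi(y)))  # dot product computed in feature space
    print(poly_kernel(x, y))       # same value computed in input space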
Example kernels
Gaussian:    k(x, y) = exp(−‖x − y‖² / (2σ²))
Polynomial:  k(x, y) = ((x · y) + c)^d,  c ≥ 0
Sigmoid:     k(x, y) = tanh(κ(x · y) + Θ),  κ, Θ ∈ R
Nonlinear separation can be achieved.
Nonlinear Separation
Mercer Theory
Input Space to Feature Space
Necessary condition for the kernel trick (Mercer):
k(x, y) = Σ_{i=1}^{N_F} λ_i ψ_i(x) ψ_i(y)
• N_F is equal to the rank of the outer-product matrix A = Σ_i λ_i u_i u_i^T
• ψ_i is the normalized eigenfunction – analogous to a normalized eigenvector u_i
Mercer :: Linear Algebra
Linear algebra analogy:
Eigenvector problem:    A u = λ u
Eigenfunction problem:  ∫ k(x, y) f(y) dy = λ f(x)
• x and y are vectors
• u is the normalized eigenvector
• λ is the eigenvalue
• f is the normalized eigenfunction
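To make the analogy concrete, the following Python sketch discretizes the integral equation on a grid, so the eigenfunction problem becomes an ordinary symmetric eigenvalue problem (a numerical illustration, not part of the paper; the grid and kernel width are arbitrary choices):

    import numpy as np

    # Sample points x_1, ..., x_l at which the kernel is evaluated.
    xs = np.linspace(-1.0, 1.0, 200)
    dx = xs[1] - xs[0]

    # Gram matrix K_ij = k(x_i, x_j) for a Gaussian kernel of width 0.1.
    K = np.exp(-(xs[:, None] - xs[None, :]) ** 2 / (2 * 0.1 ** 2))

    # Discretizing the integral of k(x, y) f(y) dy = lambda f(x)
    # turns it into the matrix eigenproblem (K * dx) u = lambda u.
    eigvals, eigvecs = np.linalg.eigh(K * dx)

    # The largest eigenvalues approximate the leading Mercer eigenvalues;
    # the corresponding eigenvectors sample the eigenfunctions at xs.
    print(eigvals[-5:])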
RKHS, Capacity, Metric
• Reproducing kernel Hilbert space (RKHS): a Hilbert space of functions f on some set X such that all evaluation functionals are continuous, and the functions can be reproduced by the kernel
• Capacity of the kernel map: a bound on how many training examples are required for learning, measured by the VC-dimension h
• Metric of the kernel map: the intrinsic shape of the manifold to which the data is mapped
Support Vector Machines
The decision function takes the form
f(x) = sgn( Σ_{i=1}^{l} α_i y_i k(x, x_i) + b )
• Similar to a single-layer perceptron
• Training examples x_i with non-zero coefficients α_i are the support vectors
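For illustration only (the paper does not use this library), a short scikit-learn sketch on an assumed toy dataset shows that only the support vectors carry non-zero coefficients:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.RandomState(0)

    # Toy binary problem: two Gaussian blobs (an assumed dataset).
    X = np.vstack([rng.randn(50, 2) - 2, rng.randn(50, 2) + 2])
    y = np.hstack([-np.ones(50), np.ones(50)])

    # Gaussian (RBF) kernel SVM; its decision function is
    # sgn(sum_i alpha_i y_i k(x, x_i) + b).
    clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

    # Only the support vectors carry non-zero coefficients alpha_i y_i.
    print(clf.support_vectors_.shape)  # (number of support vectors, 2)
    print(clf.dual_coef_)              # alpha_i * y_i for the support vectors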
Kernel Principal Component Analysis
KPCA carries out a linear PCA in the feature space F.
The extracted features take the nonlinear form
f_k(x) = Σ_{i=1}^{l} α_i^k k(x_i, x),
where the α_i^k are the components of the k-th eigenvector of the matrix (k(x_i, x_j))_{ij}.
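A minimal Python sketch of this extraction step on assumed toy data (the Gram matrix is centered in feature space, as KPCA requires; the kernel width is an arbitrary choice):

    import numpy as np

    rng = np.random.RandomState(0)
    X = rng.randn(60, 2)                   # toy input patterns x_1, ..., x_l
    l = X.shape[0]

    # Gaussian Gram matrix K_ij = k(x_i, x_j), width sigma = 1.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / 2.0)

    # Center the Gram matrix (centering in feature space).
    one = np.ones((l, l)) / l
    Kc = K - one @ K - K @ one + one @ K @ one

    # Eigenvectors alpha^k of the centered Gram matrix, sorted by
    # decreasing eigenvalue.
    lam, alpha = np.linalg.eigh(Kc)
    lam, alpha = lam[::-1], alpha[:, ::-1]

    # Nonlinear feature of a point x: f_k(x) = sum_i alpha_i^k k(x_i, x),
    # here evaluated on the training points themselves.
    features = Kc @ alpha[:, :2]           # first two kernel PCA features
    print(features.shape)                  # (60, 2)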
KPCA and Dot Products
We wish to find eigenvectors V and eigenvalues λ of the covariance matrix
C = (1/l) Σ_{i=1}^{l} Φ(x_i) Φ(x_i)^T.
Again, replace Φ(x) · Φ(y) with k(x, y).
From Feature Space to Input Space
Pre-image problem: a vector Ψ in F need not have an exact pre-image in input space; here, Ψ is not in the image of Φ.
Projection Distance Illustration
Approximate the vector Ψ ∈ F by Φ(z) for some input-space point z:
Minimizing Projection Distance
z is an approximate pre-image for Ψ if it minimizes the projection distance ‖Φ(z) − Ψ‖².
Expanding, ‖Φ(z) − Ψ‖² = k(z, z) − 2 Ψ · Φ(z) + ‖Ψ‖².
For kernels where k(z, z) = 1 (Gaussian), minimizing this reduces to maximizing
Ψ · Φ(z) = Σ_i γ_i k(x_i, z),  with Ψ = Σ_i γ_i Φ(x_i).
Fixed-point iteration
So, assuming a Gaussian kernel:
• the α_i^k are the eigenvectors of the centered Gram matrix (they determine the expansion coefficients γ_i of Ψ)
• the x_i are the input-space patterns
• σ is the kernel width
Requiring no step-size, we can iterate:
z_{t+1} = ( Σ_i γ_i exp(−‖z_t − x_i‖² / (2σ²)) x_i ) / ( Σ_i γ_i exp(−‖z_t − x_i‖² / (2σ²)) )
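A hedged Python sketch of this iteration (the coefficients gamma, the width sigma, and the starting point z0 are placeholders supplied by the caller; in the paper they come from the KPCA projection of the point being de-noised):

    import numpy as np

    def preimage_fixed_point(X, gamma, sigma, z0, n_iter=100, tol=1e-9):
        # Fixed-point iteration for an approximate pre-image under a
        # Gaussian kernel:
        #   z <- sum_i gamma_i exp(-||z - x_i||^2 / (2 sigma^2)) x_i
        #        / sum_i gamma_i exp(-||z - x_i||^2 / (2 sigma^2))
        z = np.asarray(z0, dtype=float).copy()
        for _ in range(n_iter):
            w = gamma * np.exp(-np.sum((X - z) ** 2, axis=1) / (2 * sigma ** 2))
            z_new = (w[:, None] * X).sum(axis=0) / w.sum()
            if np.linalg.norm(z_new - z) < tol:
                return z_new
            z = z_new
        return z

A natural choice for z0 in de-noising is the noisy pattern itself, e.g. preimage_fixed_point(X, gamma, sigma=1.0, z0=x_noisy).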
Kernel PCA Toy Example
Generated an artificial data set from three point sources, 100 points each.
De-noising by Reconstruction, Part One
• Reconstruction from projections onto the eigenvectors from the previous example
• Generated 20 new points from each Gaussian
• Represented by their first n = 1, 2, …, 8 nonlinear principal components
De-noising by Reconstruction, Part Two
• The original points move in the direction of de-noising
De-noising in 2-dimensions
• A half circle and a square in the plane
• De-noised versions are the solid lines
De-noising USPS data patterns
Patterns: 7291 train, 2007 test; size 16 × 16 pixels.
[Figure: de-noising results compared for linear PCA and kernel PCA.]