(chapters 1,2,3,4)
Introduction to Kernels
Max Welling
October 1 2004
Introduction
• What is the goal of (pick your favorite name):
- Machine Learning
- Data Mining
- Pattern Recognition
- Data Analysis
- Statistics
Automatic detection of non-coincidental structure in data.
• Desiderata:
- Robust algorithms: insensitive to outliers and wrong model assumptions.
- Stable algorithms: generalize well to unseen data.
- Computationally efficient algorithms: large datasets.
Let’s Learn Something
Find the common characteristic (structure) among the following
statistical methods:
1. Principal Components Analysis
2. Ridge regression
3. Fisher discriminant analysis
4. Canonical correlation analysis
Answer:
We consider linear combinations of the input vector:
f(x) = w^T x
Linear algorithms are very well understood and enjoy strong guarantees
(convexity, generalization bounds).
Can we carry these guarantees over to non-linear algorithms?
Feature Spaces
\Phi : x \to \Phi(x), \quad \mathbb{R}^d \to F   (a non-linear mapping to the feature space F)
F can be:
1. a high-dimensional space
2. an infinite-dimensional countable space (e.g. \ell_2)
3. a function space (Hilbert space)
example:
\Phi(x, y) = (x^2, y^2, \sqrt{2}\, x y)
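A quick numerical check of this example (a minimal NumPy sketch of my own, not from the slides): the explicit map \Phi(x, y) = (x^2, y^2, \sqrt{2} x y) reproduces the squared inner product of the inputs.

```python
import numpy as np

def phi(v):
    # explicit feature map Phi(x, y) = (x^2, y^2, sqrt(2)*x*y) for a 2-D input
    x, y = v
    return np.array([x**2, y**2, np.sqrt(2) * x * y])

a, b = np.array([1.0, 2.0]), np.array([3.0, -1.0])

# inner product in feature space equals the squared inner product in input space
print(np.dot(phi(a), phi(b)))   # 1.0
print(np.dot(a, b) ** 2)        # 1.0
```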
Ridge Regression (duality)
problem:
\min_w \sum_{i=1}^{\ell} (y_i - w^T x_i)^2 + \lambda \|w\|^2
(y_i: target, x_i: input, \lambda: regularization)
solution:
w = (X^T X + \lambda I_d)^{-1} X^T y          (d x d inverse)
  = X^T (X X^T + \lambda I_\ell)^{-1} y       (\ell x \ell inverse)
  = X^T (G + \lambda I_\ell)^{-1} y
  = \sum_{i=1}^{\ell} \alpha_i x_i             (linear combination of the data)
Dual representation: \alpha = (G + \lambda I_\ell)^{-1} y, with Gram matrix G_{ij} = \langle x_i, x_j \rangle.
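A minimal NumPy sketch (my own illustration, not from the slides) checking that the primal and dual forms give the same weight vector:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))        # 20 samples (rows), 5 input dimensions
y = rng.normal(size=20)
lam = 0.1

# primal solution: (X^T X + lambda I_d)^(-1) X^T y  -- a d x d inverse
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# dual solution: alpha = (G + lambda I_l)^(-1) y with G = X X^T, then w = X^T alpha
G = X @ X.T                          # Gram matrix G_ij = <x_i, x_j>
alpha = np.linalg.solve(G + lam * np.eye(X.shape[0]), y)
w_dual = X.T @ alpha

print(np.allclose(w_primal, w_dual))  # True: both representations agree
```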
Kernel Trick
Note: In the dual representation we used the Gram matrix
to express the solution.
Kernel Trick: Replace x \to \Phi(x), so that the Gram matrix becomes a kernel matrix:
G_{ij} = \langle x_i, x_j \rangle \;\to\; G_{ij} = \langle \Phi(x_i), \Phi(x_j) \rangle = K(x_i, x_j)
If we use algorithms that only depend on the Gram matrix G,
then we never have to know (compute) the actual features \Phi(x).
This is the crucial point of kernel methods.
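A minimal kernel ridge regression sketch (my own illustration): the dual coefficients \alpha = (K + \lambda I)^{-1} y and the predictions f(x) = \sum_i \alpha_i k(x_i, x) use only kernel evaluations, never the features \Phi.

```python
import numpy as np

def rbf(a, b, c=1.0):
    # RBF kernel k(x, y) = exp(-||x - y||^2 / c)
    return np.exp(-np.sum((a - b) ** 2) / c)

def kernel_ridge_fit(X, y, kernel, lam=0.1):
    l = len(X)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])   # kernel (Gram) matrix
    return np.linalg.solve(K + lam * np.eye(l), y)              # dual coefficients alpha

def kernel_ridge_predict(X_train, alpha, kernel, x_new):
    # f(x) = sum_i alpha_i k(x_i, x): only kernel evaluations, never Phi(x)
    return sum(a * kernel(xi, x_new) for a, xi in zip(alpha, X_train))

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)
alpha = kernel_ridge_fit(X, y, rbf)
print(kernel_ridge_predict(X, alpha, rbf, np.array([0.5])), np.sin(0.5))
```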
Modularity
Kernel methods consist of two modules:
1) The choice of kernel (this is non-trivial)
2) The algorithm which takes kernels as input
Modularity: Any kernel can be used with any kernel-algorithm.
some kernels:
k(x, y) = e^{-\|x - y\|^2 / c}
k(x, y) = (\langle x, y \rangle + \theta)^d
k(x, y) = \tanh(\kappa \langle x, y \rangle + \theta)
k(x, y) = \frac{1}{\|x - y\|^2 + c^2}
some kernel algorithms:
- support vector machine
- Fisher discriminant analysis
- kernel regression
- kernel PCA
- kernel CCA
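A minimal sketch of the kernels listed above as plain functions (the function and parameter names are my own), plus a generic Gram-matrix builder, so that any kernel can be plugged into any kernel algorithm:

```python
import numpy as np

# the kernels listed above; c, theta, kappa, d are free parameters
def rbf(x, y, c=1.0):
    return np.exp(-np.sum((x - y) ** 2) / c)

def polynomial(x, y, theta=1.0, d=3):
    return (np.dot(x, y) + theta) ** d

def sigmoid(x, y, kappa=1.0, theta=0.0):
    return np.tanh(kappa * np.dot(x, y) + theta)

def inverse_quadratic(x, y, c=1.0):
    return 1.0 / (np.sum((x - y) ** 2) + c ** 2)

def gram(X, kernel):
    # modularity: the same Gram-matrix builder serves every kernel algorithm
    return np.array([[kernel(xi, xj) for xj in X] for xi in X])

X = np.random.default_rng(2).normal(size=(5, 3))
for k in (rbf, polynomial, sigmoid, inverse_quadratic):
    print(k.__name__, gram(X, k).shape)
```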
What is a Proper Kernel?
Definition: A finitely positive semi-definite function k : X \times X \to \mathbb{R}
is a symmetric function of its arguments for which the matrices formed
by restriction to any finite subset of points are positive semi-definite:
\alpha^T K \alpha \geq 0 \quad \forall \alpha
Theorem: A function k : X \times X \to \mathbb{R} can be written
as k(x, y) = \langle \Phi(x), \Phi(y) \rangle, where \Phi(x) is a feature map
x \to \Phi(x) \in F, iff k(x, y) satisfies the positive semi-definiteness property.
Relevance: We can now check whether k(x, y) is a proper kernel using
only properties of k(x, y) itself, i.e. without the need to know the feature map!
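A hedged numerical illustration of the definition (not a proof, my own sketch): restrict a candidate kernel to a random finite set of points and check that the resulting matrix has no significantly negative eigenvalues.

```python
import numpy as np

def rbf(x, y, c=1.0):
    return np.exp(-np.sum((x - y) ** 2) / c)

def min_eigenvalue(X, kernel):
    # eigenvalues of the kernel matrix restricted to the finite subset X;
    # a proper kernel gives min eigenvalue >= 0 (up to round-off)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    return np.linalg.eigvalsh(K).min()

X = np.random.default_rng(3).normal(size=(50, 4))
print(min_eigenvalue(X, rbf))                            # >= 0: consistent with PSD
print(min_eigenvalue(X, lambda x, y: -np.dot(x, y)))     # < 0: not a proper kernel
```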
Reproducing Kernel Hilbert Spaces
The proof of the above theorem proceeds by constructing a very special
feature map (note that more than one feature map may give rise to the same kernel):
\Phi : x \to \Phi(x) = k(x, \cdot)
i.e. we map to a function space.
definition of the function space:
f(\cdot) = \sum_{i=1}^{m} \alpha_i k(x_i, \cdot) \quad \text{for any } m, \{x_i\}
\langle f, g \rangle = \sum_{i=1}^{m} \sum_{j=1}^{m'} \alpha_i \beta_j k(x_i, x'_j), \quad \text{where } g(\cdot) = \sum_{j=1}^{m'} \beta_j k(x'_j, \cdot)
\langle f, f \rangle = \sum_{i,j} \alpha_i \alpha_j k(x_i, x_j) \geq 0 \quad (\text{finitely positive semi-definite})
reproducing property:
\langle f, \Phi(x) \rangle = \langle f, k(x, \cdot) \rangle = \Big\langle \sum_i \alpha_i k(x_i, \cdot),\, k(x, \cdot) \Big\rangle = \sum_i \alpha_i k(x_i, x) = f(x)
\Rightarrow \langle \Phi(x), \Phi(y) \rangle = k(x, y)
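A small numerical sketch of this construction (my own illustration): represent f by its support points and coefficients, define the inner product through the kernel, and check the reproducing property \langle f, k(x, \cdot) \rangle = f(x).

```python
import numpy as np

def rbf(x, y, c=1.0):
    return np.exp(-np.sum((x - y) ** 2) / c)

def f_eval(alpha, Xs, x, kernel=rbf):
    # f(x) = sum_i alpha_i k(x_i, x)
    return sum(a * kernel(xi, x) for a, xi in zip(alpha, Xs))

def inner(alpha, Xs, beta, Zs, kernel=rbf):
    # <f, g> = sum_i sum_j alpha_i beta_j k(x_i, z_j)
    return sum(a * b * kernel(xi, zj)
               for a, xi in zip(alpha, Xs) for b, zj in zip(beta, Zs))

rng = np.random.default_rng(4)
Xs = rng.normal(size=(5, 2))
alpha = rng.normal(size=5)
x = rng.normal(size=2)

# reproducing property: <f, k(x, .)> = f(x)
print(inner(alpha, Xs, [1.0], [x]))
print(f_eval(alpha, Xs, x))          # same value
```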
Mercer’s Theorem
Theorem: X is compact and k(x, y) is a symmetric continuous function s.t.
(T_k f)(\cdot) = \int k(\cdot, x) f(x)\, dx   is a positive semi-definite operator: T_k \succeq 0,
i.e.
\int\!\!\int k(x, y) f(x) f(y)\, dx\, dy \geq 0 \quad \forall f \in L_2(X).
Then there exists an orthonormal basis of eigen-functions \{\phi_i\} of T_k such that
k(x, y) = \sum_{i=1}^{\infty} \lambda_i \phi_i(x) \phi_i(y)
Hence: k(x, y) is a proper kernel.
Note: Here we construct feature vectors in L_2, whereas the RKHS
construction was in a function space.
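A finite-sample analogue of this decomposition (my own sketch, not Mercer's theorem itself): eigendecompose the Gram matrix and use the eigenvectors scaled by the square roots of the eigenvalues as feature vectors, so that their inner products reproduce the kernel values on the sample.

```python
import numpy as np

def rbf(x, y, c=1.0):
    return np.exp(-np.sum((x - y) ** 2) / c)

X = np.random.default_rng(5).normal(size=(40, 3))
K = np.array([[rbf(xi, xj) for xj in X] for xi in X])

# K = V diag(lam) V^T; feature vectors Phi = V diag(sqrt(lam)) satisfy Phi Phi^T = K
lam, V = np.linalg.eigh(K)
lam = np.clip(lam, 0.0, None)          # guard against tiny negative round-off
Phi = V * np.sqrt(lam)

print(np.allclose(Phi @ Phi.T, K, atol=1e-8))   # True
```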
Learning Kernels
• All information is tunneled through the Gram-matrix information
bottleneck.
• The real art is to pick an appropriate kernel.
e.g. take the RBF kernel: k(x, y) = e^{-\|x - y\|^2 / c}
if c is very small: G = I (all data are dissimilar): over-fitting
if c is very large: G = 1 (all data are very similar): under-fitting
We need to learn the kernel. Here are some ways to combine kernels to improve them:
\alpha k_1(x, y) + \beta k_2(x, y) = k(x, y), \quad \alpha, \beta \geq 0   (kernels form a cone)
k_1(x, y)\, k_2(x, y) = k(x, y)   (products, hence any polynomial of kernels with positive coefficients)
k_1(\Phi(x), \Phi(y)) = k(x, y)   (for any feature map \Phi)
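A small sketch of the bandwidth effect and of conic kernel combination (parameter values are purely illustrative):

```python
import numpy as np

def rbf(x, y, c=1.0):
    return np.exp(-np.sum((x - y) ** 2) / c)

def gram(X, kernel):
    return np.array([[kernel(xi, xj) for xj in X] for xi in X])

X = np.random.default_rng(6).normal(size=(5, 3))

print(np.round(gram(X, lambda a, b: rbf(a, b, c=1e-3)), 2))  # ~ identity: all points look dissimilar
print(np.round(gram(X, lambda a, b: rbf(a, b, c=1e3)), 2))   # ~ all-ones: all points look the same

# conic combination of two kernels is again a kernel
combined = lambda a, b: 0.7 * rbf(a, b, c=0.5) + 0.3 * rbf(a, b, c=5.0)
print(np.linalg.eigvalsh(gram(X, combined)).min() >= -1e-10)  # PSD check: True
```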
Stability of Kernel Algorithms
Our objective for learning is to improve generalization performance:
cross-validation, Bayesian methods, generalization bounds, ...
Call E_S[f(x)] = 0 a pattern in a sample S.
Is this pattern also likely to be present in new data: E_P[f(x)] \approx 0 ?
We can use concentration inequalities (McDiarmid's theorem) to prove that:
Theorem: Let S = \{x_1, ..., x_\ell\} be an IID sample from P and define
the sample mean of f(x) as \bar{f} = \frac{1}{\ell} \sum_{i=1}^{\ell} f(x_i). Then it follows that:
P\left( \|\bar{f} - E_P[f]\| \leq \frac{R}{\sqrt{\ell}} \left(2 + \sqrt{2 \ln(1/\delta)}\right) \right) \geq 1 - \delta, \qquad R = \sup_x \|f(x)\|
(i.e. the probability that the sample mean and the population mean differ by less than this amount is at least 1 - \delta, independent of P!)
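A quick worked instance of the deviation term (the numbers are illustrative only, not from the slides):

```python
import numpy as np

def mean_deviation_bound(R, l, delta):
    # R/sqrt(l) * (2 + sqrt(2 ln(1/delta))): holds with probability >= 1 - delta
    return R / np.sqrt(l) * (2.0 + np.sqrt(2.0 * np.log(1.0 / delta)))

# e.g. features bounded by R = 1, 1000 samples, 95% confidence
print(mean_deviation_bound(R=1.0, l=1000, delta=0.05))   # ~0.14
```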
Rademacher Complexity
Problem: we only checked the generalization performance for a
single, fixed pattern f(x).
What if we want to search over a function class F?
Intuition: we need to incorporate the complexity of this function class.
Rademacher complexity captures the ability of the function class to
fit random noise (\sigma_i = \pm 1, uniformly distributed):
\hat{R}_\ell(F) = E_\sigma \Big[ \sup_{f \in F} \Big| \frac{2}{\ell} \sum_{i=1}^{\ell} \sigma_i f(x_i) \Big| \;\Big|\; x_1, ..., x_\ell \Big]   (empirical RC)
R_\ell(F) = E_S E_\sigma \Big[ \sup_{f \in F} \Big| \frac{2}{\ell} \sum_{i=1}^{\ell} \sigma_i f(x_i) \Big| \Big]
(figure: two functions f_1, f_2 from F fitting random \pm 1 noise at the sample points x_i)
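A Monte Carlo sketch of the empirical Rademacher complexity for a small, hand-picked function class (a toy example of my own):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(-1, 1, size=50)                      # fixed sample x_1..x_l
F = [np.sin, np.cos, np.tanh, lambda t: t ** 2]      # a tiny function class

def empirical_rc(F, x, n_draws=5000):
    l = len(x)
    vals = np.column_stack([f(x) for f in F])        # f(x_i) for every f in F
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=l)      # Rademacher noise
        total += np.abs(sigma @ vals).max()          # sup_f | sum_i sigma_i f(x_i) |
    return 2.0 / l * total / n_draws

print(empirical_rc(F, x))
```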
Generalization Bound
Theorem: Let f be a function in F which maps to [0, 1] (e.g. loss functions).
Then, with probability at least 1 - \delta over random draws of a sample of size \ell,
every f satisfies:
E_P[f(x)] \leq \hat{E}_{data}[f(x)] + R_\ell(F) + \sqrt{\frac{\ln(2/\delta)}{2\ell}}
           \leq \hat{E}_{data}[f(x)] + \hat{R}_\ell(F) + 3 \sqrt{\frac{\ln(2/\delta)}{2\ell}}
Relevance: The expected pattern E[f] = 0 will also be present in a new
data set, if the last two terms are small:
- the complexity of the function class F is small
- the number of training data is large
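A worked numeric instance of the second inequality (all numbers are illustrative assumptions):

```python
import numpy as np

def generalization_bound(emp_mean, emp_rc, l, delta):
    # E_P[f] <= empirical mean + empirical Rademacher complexity + 3*sqrt(ln(2/delta)/(2l))
    return emp_mean + emp_rc + 3.0 * np.sqrt(np.log(2.0 / delta) / (2.0 * l))

# e.g. empirical loss 0.10, empirical RC 0.05, 1000 samples, delta = 0.05
print(generalization_bound(0.10, 0.05, l=1000, delta=0.05))   # ~0.28
```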
Linear Functions (in feature space)
Consider the function class:
F_B = \{ f : x \to \langle w, \Phi(x) \rangle, \; \|w\| \leq B \}, \quad \text{with} \quad k(x, y) = \langle \Phi(x), \Phi(y) \rangle,
and a sample S = \{x_1, ..., x_\ell\}.
Then the empirical RC of F_B is bounded by:
\hat{R}_\ell(F_B) \leq \frac{2B}{\ell} \sqrt{tr(K)}
Relevance: Since \{ x \to \sum_{i=1}^{\ell} \alpha_i k(x_i, x), \; \alpha^T K \alpha \leq B^2 \} \subseteq F_B,
it follows that if we control the norm \alpha^T K \alpha = \|w\|^2 in kernel algorithms, we control
the complexity of the function class (regularization).
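A numerical check of this bound (my own sketch): for F_B the supremum has a closed form, sup_{\|w\| \leq B} |\sum_i \sigma_i \langle w, \Phi(x_i) \rangle| = B \sqrt{\sigma^T K \sigma}, so the empirical RC can be estimated by Monte Carlo and compared with (2B/\ell)\sqrt{tr(K)}.

```python
import numpy as np

def rbf(x, y, c=1.0):
    return np.exp(-np.sum((x - y) ** 2) / c)

rng = np.random.default_rng(8)
X = rng.normal(size=(60, 2))
K = np.array([[rbf(xi, xj) for xj in X] for xi in X])
l, B = len(X), 1.0

# Monte Carlo estimate: R_hat = (2B/l) E_sigma[ sqrt(sigma^T K sigma) ]
draws = [rng.choice([-1.0, 1.0], size=l) for _ in range(3000)]
rc_estimate = 2 * B / l * np.mean([np.sqrt(s @ K @ s) for s in draws])

rc_bound = 2 * B / l * np.sqrt(np.trace(K))   # the bound from the slide

print(rc_estimate, rc_bound)                  # estimate <= bound
```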
Margin Bound (classification)
Theorem: Choose c > 0 (the margin).
F: f(x, y) = -y\, g(x), \quad y = \pm 1
S = \{(x_1, y_1), ..., (x_\ell, y_\ell)\} an IID sample
\delta \in (0, 1): probability of violating the bound.
P_P[\, y \neq \mathrm{sign}(g(x)) \,] \;\leq\; \frac{1}{c\,\ell} \sum_{i=1}^{\ell} \xi_i \;+\; \frac{4}{c\,\ell} \sqrt{tr(K)} \;+\; 3 \sqrt{\frac{\ln(2/\delta)}{2\ell}}
(probability of misclassification)
\xi_i = (c - y_i g(x_i))_+   (slack variable)
(f)_+ = f \text{ if } f \geq 0 \text{ and } 0 \text{ otherwise}
Relevance: We can bound our classification error on new samples. Moreover, we have a
strategy to improve generalization: choose the margin c as large as possible such
that all samples are correctly classified with margin c, i.e. \xi_i = 0 (e.g. support vector machines).
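A sketch computing the three terms of this bound for a hypothetical linear scorer g and margin c (scorer, data, and all parameter values are my own illustration):

```python
import numpy as np

def margin_bound(g_values, y, K, c, delta):
    # g_values: g(x_i) on the training sample, y: labels in {-1, +1}
    l = len(y)
    xi = np.clip(c - y * g_values, 0.0, None)              # slack variables (c - y_i g(x_i))_+
    slack_term = xi.sum() / (c * l)
    trace_term = 4.0 / (c * l) * np.sqrt(np.trace(K))
    conf_term = 3.0 * np.sqrt(np.log(2.0 / delta) / (2.0 * l))
    return slack_term + trace_term + conf_term

rng = np.random.default_rng(9)
X = rng.normal(size=(100, 2))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=100))           # toy labels
g_values = X[:, 0]                                          # a hypothetical scorer g(x) = x_1
K = X @ X.T                                                 # linear-kernel Gram matrix
print(margin_bound(g_values, y, K, c=1.0, delta=0.05))
```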