On Kernels, Margins, and Low-dimensional Mappings
or
Kernels versus features
Nina Balcan (CMU), Avrim Blum (CMU), Santosh Vempala (MIT)
Generic problem
- Given a set of images, want to learn a linear separator to distinguish men from women.
- Problem: the pixel representation is no good.
Old-style advice:
- Pick a better set of features!
- But seems ad hoc. Not scientific.
New-style advice:
- Use a kernel! K(x,y) = φ(x)·φ(y), where φ is an implicit, high-dimensional mapping.
- Sounds more scientific. Many algorithms can be “kernelized”. Use the “magic” of the implicit high-dim’l space, and don’t pay for it if a large-margin separator exists.
- E.g., K(x,y) = (x·y + 1)^m corresponds to φ: (n-dim’l space) → (n^m-dim’l space).
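As a concrete instance of this (a standard expansion, not spelled out on the slide): for n = 2 and m = 2,

\[
K(x,y) = (x\cdot y + 1)^2 = (x_1 y_1 + x_2 y_2 + 1)^2 = \phi(x)\cdot\phi(y),
\qquad
\phi(x) = \left(x_1^2,\; x_2^2,\; \sqrt{2}\,x_1 x_2,\; \sqrt{2}\,x_1,\; \sqrt{2}\,x_2,\; 1\right),
\]

so the implicit space is just the space of all monomials of degree at most 2 in the coordinates of x.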
Main point of this work:
Can view the new method as a way of conducting the old method.
- Given a kernel [as a black-box program K(x,y)] and access to typical inputs [samples from D],
- Claim: can run K and reverse-engineer an explicit (small) set of features, such that if K is good [∃ a large-margin separator in φ-space for D,c], then this is a good feature set [∃ an almost-as-good separator].
“You give me a kernel, I give you a set of features.”
- E.g., sample z1,...,zd from D. Given x, define xi = K(x,zi). (See the sketch after this slide.)
Implications:
- Practical: an alternative to kernelizing the algorithm.
- Conceptual: view the kernel as a (principled) way of doing feature generation, i.e., as a similarity function, rather than as the “magic power of an implicit high-dimensional space”.
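A minimal sketch of this recipe in Python, treating K purely as a black box; the RBF kernel, the toy data, and the scikit-learn LinearSVC are illustrative choices, not part of the paper:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

def K(x, y):
    # Black-box kernel; an RBF kernel is used here purely for illustration.
    return np.exp(-np.sum((x - y) ** 2))

# "Typical inputs": unlabeled samples from D (a toy 2-D distribution here).
d = 50
Z = rng.normal(size=(d, 2))                    # landmark points z_1, ..., z_d

def features(x):
    """Mapping #1: x -> (K(x, z_1), ..., K(x, z_d))."""
    return np.array([K(x, z) for z in Z])

# Labeled data for a target that is nonlinear in the original space.
X_train = rng.normal(size=(200, 2))
y_train = (np.linalg.norm(X_train, axis=1) > 1.0).astype(int)

F_train = np.array([features(x) for x in X_train])
clf = LinearSVC().fit(F_train, y_train)        # ordinary linear separator on the new features
print("training accuracy:", clf.score(F_train, y_train))
```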
Basic setup, definitions
- Instance space X.
- Distribution D, target c. Use P = (D,c).
- K(x,y) = φ(x)·φ(y).
- P is separable with margin γ in φ-space if ∃ w s.t. Pr_{(x,ℓ)∼P}[ℓ(w·φ(x)) < γ] = 0 (normalizing |w| = 1, |φ(x)| = 1).
- Error ε at margin γ: replace “0” with “ε”.
[Figure: points labeled +/− in X under P = (D,c), mapped by φ to a space with separator w.]
Goal is to use K to get a mapping to a low-dim’l space.
Idea: Johnson-Lindenstrauss lemma
- If P is separable with margin γ in φ-space, then with prob. 1−δ, a random linear projection down to a space of dimension d = O((1/γ²) log[1/(δε)]) will have a linear separator of error < ε. [AV]
- If the projection vectors are r1, r2, ..., rd, then can view xi = φ(x)·ri as features.
- Problem: this uses φ. Can we do it directly, using K as a black box, without computing φ?
[Figure: φ maps the labeled points of P = (D,c) from X into φ-space.]
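For concreteness, a minimal sketch of this JL-style projection when φ is available explicitly (Gaussian projection vectors are a standard choice; the dimensions are illustrative). The point of what follows in the talk is to get the same effect using K alone:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 10_000, 200                      # explicit phi-space dimension, projected dimension

# Random projection vectors r_1, ..., r_d (rows of R), scaled to preserve dot products on average.
R = rng.normal(size=(d, N)) / np.sqrt(d)

phi_x = rng.normal(size=N)              # stand-in for an explicitly computed phi(x)
x_features = R @ phi_x                  # x_i = phi(x) . r_i
```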
3 methods (from simplest to best)
1. Draw d examples z1,...,zd from D. Use F(x) = (K(x,z1), ..., K(x,zd)). [So, “xi” = K(x,zi).] For d = (8/ε)[1/γ² + ln 1/δ], if P was separable with margin γ in φ-space, then whp this will be separable with error ε (but this method doesn’t preserve the margin).
2. Same d, but a little more complicated. Separable with error ε at margin γ/2.
3. Combine (2) with a further projection as in the JL lemma. Get d with log dependence on 1/ε, rather than linear. So, can set ε ≪ 1/d.
All these methods need access to D, unlike JL. Can this be removed? We show NO for generic K, but it may be possible for natural K.
Actually, the argument is pretty easy...
(though we did try a lot of things first that didn’t work...)
Key fact
Claim: If ∃ a perfect w of margin γ in φ-space, then if we draw z1,...,zd ∈ D for d ≥ (8/ε)[1/γ² + ln 1/δ], whp (1−δ) there exists w′ in span(φ(z1),...,φ(zd)) of error ≤ ε at margin γ/2.
Proof: Let S = examples drawn so far. Assume |w| = 1, |φ(z)| = 1 ∀ z.
- w_in = proj(w, span(S)), w_out = w − w_in.
- Say w_out is large if Pr_z(|w_out·φ(z)| ≥ γ/2) ≥ ε; else small.
- If small, then done: w′ = w_in.
- Else, the next z has at least ε probability of improving S: |w_out|² ← |w_out|² − (γ/2)².
- Since |w_out|² ≤ 1 initially, this can happen at most 4/γ² times. ∎
So....
- If we draw z1,...,zd ∈ D for d = (8/ε)[1/γ² + ln 1/δ], then whp there exists w′ in span(φ(z1),...,φ(zd)) of error ≤ ε at margin γ/2.
- So, for some w′ = a1φ(z1) + ... + adφ(zd), Pr_{(x,ℓ)∼P}[sign(w′·φ(x)) ≠ ℓ] ≤ ε.
- But notice that w′·φ(x) = a1K(x,z1) + ... + adK(x,zd).
⇒ the vector (a1,...,ad) is an ε-good separator in the feature space xi = K(x,zi).
- But the margin is not preserved, because of the lengths of the target and of the examples.
How to preserve margin? (mapping #2)
- We know ∃ w′ in span(φ(z1),...,φ(zd)) of error ≤ ε at margin γ/2.
- So, given a new x, just want to do an orthogonal projection of φ(x) into that span (this preserves dot products with vectors in the span and only decreases |φ(x)|, so it only increases the margin).
- Run K(zi,zj) for all i,j = 1,...,d. Get the matrix M.
- Decompose M = UᵀU.
- (Mapping #2) = (mapping #1)·U⁻¹. ∎
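A minimal numpy sketch of this step, using only black-box kernel calls; the helper name, the Cholesky route to M = UᵀU, and the small ridge added for numerical stability are my own choices:

```python
import numpy as np

def make_mapping2(Z, K, ridge=1e-10):
    """Build mapping #2 from landmarks Z and a black-box kernel K.

    M[i,j] = K(z_i, z_j) is decomposed as M = U^T U (via Cholesky, M = L L^T with U = L^T).
    Mapping #2 sends x to (K(x,z_1), ..., K(x,z_d)) U^{-1}; dot products between mapped
    points then equal dot products of the orthogonal projections of phi(x) onto
    span(phi(z_1), ..., phi(z_d)).
    """
    d = len(Z)
    M = np.array([[K(zi, zj) for zj in Z] for zi in Z])
    L = np.linalg.cholesky(M + ridge * np.eye(d))   # small ridge for numerical stability

    def F2(x):
        kx = np.array([K(x, z) for z in Z])         # mapping #1
        return np.linalg.solve(L, kx)               # = U^{-T} kx, i.e. the row vector kx U^{-1}
    return F2
```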
How to improve dimension?
- The current mapping gives d = (8/ε)[1/γ² + ln 1/δ].
- Johnson-Lindenstrauss gives d = O((1/γ²) log 1/(δε)).
- JL is nice because we can have ε ≪ 1/d. Good if the algorithm wants the data to be perfectly separable. (Learning a separator of margin γ can be done in time poly(1/γ), but if no perfect separator exists, minimizing error is NP-hard.)
- Answer: just combine the two...
[Figure: φ maps X into R^N; F1 maps into R^{d1}; JL projects into R^d; the positive (x) and negative (o) examples remain separable at each stage.]
Mapping #3
- Do JL(mapping2(x)).
- JL says: fix y, w. A random projection M down to a space of dimension O((1/γ²) log 1/δ′) will, with prob. (1−δ′), preserve the margin of y up to ±γ/4.
- Use δ′ = εδ.
⇒ For all y, Pr_M[failure on y] < εδ,
⇒ Pr_{D,M}[failure on y] < εδ,
⇒ Pr_M[fail on prob. mass ε] < δ.
- So, we get the desired dimension (# features), though the sample complexity remains as in mapping #2.
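A minimal sketch of this composition, reusing make_mapping2 from the previous sketch; the Gaussian projection matrix and the target dimension d3 are illustrative choices:

```python
import numpy as np

def make_mapping3(Z, K, d3, rng=None):
    """Mapping #3: a JL random projection applied on top of mapping #2."""
    rng = rng or np.random.default_rng()
    F2 = make_mapping2(Z, K)                     # mapping #2, from the previous sketch
    # Random Gaussian projection, scaled so dot products are preserved in expectation.
    R = rng.normal(size=(d3, len(Z))) / np.sqrt(d3)
    return lambda x: R @ F2(x)
```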
Lower bound (on necessity of access to D)
For an arbitrary black-box kernel K, can’t hope to convert to a small feature space without access to D.
- Consider X = {0,1}ⁿ, a random X′ ⊂ X of size 2^{n/2}, D = uniform over X′.
- c = an arbitrary function (so learning is hopeless).
- But we have this magic kernel K(x,y) = φ(x)·φ(y):
  φ(x) = (1, 0) if x ∉ X′;
  φ(x) = (−1/2, √3/2) if x ∈ X′, c(x) = pos;
  φ(x) = (−1/2, −√3/2) if x ∈ X′, c(x) = neg.
- P is separable with margin √3/2 in φ-space.
- But, without access to D, whp all attempts at running K(x,y) will give an answer of 1.
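A toy instantiation of this construction (the small n, the random choice of X′, and the labels are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 16                                           # {0,1}^n with a small n, for illustration

# Hidden random subset X' of size 2^{n/2}, with arbitrary labels (unknown to the learner).
Xprime = {tuple(rng.integers(0, 2, n)) for _ in range(2 ** (n // 2))}
labels = {x: rng.choice([+1, -1]) for x in Xprime}

def phi(x):
    """Implicit 2-D embedding defining the 'magic' kernel."""
    x = tuple(x)
    if x not in Xprime:
        return np.array([1.0, 0.0])
    if labels[x] == +1:
        return np.array([-0.5,  np.sqrt(3) / 2])
    return np.array([-0.5, -np.sqrt(3) / 2])

def K(x, y):
    return float(phi(x) @ phi(y))

# On X', same-label pairs give K = 1 and opposite-label pairs give K = -1/2, so
# D = uniform over X' is separable with margin sqrt(3)/2 (e.g., by w = (0, 1)) in phi-space.
# But X' occupies only a 2^{-n/2} fraction of {0,1}^n, so queries made without samples
# from D almost surely miss X', and then K(x, y) = 1 no matter what the target is.
x, y = rng.integers(0, 2, n), rng.integers(0, 2, n)
print(K(x, y))                                   # almost certainly 1.0
```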
Open Problems
- For specific, natural kernels, like K(x,y) = (1 + x·y)^m: is there an efficient (probability distribution over) mapping that is good for any P = (c,D) for which the kernel is good?
- I.e., an efficient analog of JL for these kernels.
- Or, at least, can these mappings be constructed using less sample complexity (fewer accesses to D)?