Features, Kernels, and
Similarity functions
Avrim Blum
Machine learning lunch 03/05/07
Suppose you want to…
use learning to solve some classification problem. E.g., given a set of images, learn a rule to distinguish men from women.
- The first thing you need to do is decide what you want as features.
- Or, for algs like SVM and Perceptron, can use a kernel function, which provides an implicit feature space. But then what kernel to use?
- Can Theory provide any help or guidance?
Plan for this talk
Discuss a few ways theory might be of help:
- Algorithms designed to do well in large feature spaces when only a small number of features are actually useful. So you can pile a lot on when you don't know much about the domain.
- Kernel functions. Standard theoretical view, plus a new one that may provide more guidance. Bridge between "implicit mapping" and "similarity function" views. Talk about quality of a kernel in terms of more tangible properties. [work with Nina Balcan]
- Combining the above. Using kernels to generate explicit features.
A classic conceptual question
- How is it possible to learn anything quickly when there is so much irrelevant information around?
- Must there be some hard-coded focusing mechanism, or can learning handle it?
A classic conceptual question
Let's try a very simple theoretical model.
- Have n boolean features. Labels are + or -.
  1001101110  +
  1100111101  +
  0111010111  -
- Assume the distinction is based on just one feature.
- How many prediction mistakes do you need to make before you've figured out which one it is?
- Can take majority vote over all possibilities consistent with data so far. Each mistake crosses off at least half. O(log n) mistakes total.
- log(n) is good: doubling n only adds 1 more mistake.
- Can't do better (consider log(n) random strings with random labels; whp there is a consistent feature in hindsight).
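A minimal Python sketch (my own illustration, not from the talk) of the majority-vote / halving scheme above, for the class "label = a single feature (or its negation)": keep every hypothesis consistent with the data seen so far, predict by majority vote, and discard the inconsistent ones after each example.

def halving_single_feature(examples):
    """examples: list of (x, y), x a tuple of 0/1 features, y in {+1, -1}.
    Returns the number of online prediction mistakes."""
    n = len(examples[0][0])
    # Hypothesis (i, s): predict s if feature i is 1, else -s.  2n hypotheses in all.
    alive = {(i, s) for i in range(n) for s in (+1, -1)}
    mistakes = 0
    for x, y in examples:
        votes = sum(s if x[i] == 1 else -s for (i, s) in alive)
        pred = 1 if votes >= 0 else -1
        if pred != y:
            mistakes += 1          # each mistake kills at least half the voters
        # Keep only hypotheses that got this example right.
        alive = {(i, s) for (i, s) in alive
                 if (s if x[i] == 1 else -s) == y}
    return mistakes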
A classic conceptual question
What about more interesting classes of functions (not just target = a single feature)?
Littlestone's Winnow algorithm [MLJ 1988]
- Motivated by the question: what if target is an OR of r << n features?
  100101011001101011  +    (target: x4 ∨ x7 ∨ x10)
- Majority vote scheme over all n^r possibilities would make O(r log n) mistakes but is totally impractical. Can you do this efficiently?
- Winnow is a simple efficient algorithm that meets this bound.
- More generally, if there exists an LTF such that
  positives satisfy w1x1 + w2x2 + … + wnxn ≥ c,
  negatives satisfy w1x1 + w2x2 + … + wnxn ≤ c − γ,  (W = ∑i |wi|)
  then # mistakes = O((W/γ)^2 log n).
- E.g., if target is a "k of r" function, get O(r^2 log n). Key point: still only log dependence on n.
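A quick worked instance of the bound above (my own filling-in, not from the slides): a "k of r" target is the LTF with weight 1 on each of the r relevant features and threshold c = k, so

\[
\sum_{j=1}^{r} x_{i_j} \;\ge\; k \ \ \text{(positives)}, \qquad
\sum_{j=1}^{r} x_{i_j} \;\le\; k-1 \ \ \text{(negatives)},
\]

giving \(W = \sum_i |w_i| = r\) and margin \(\gamma = 1\), hence \(O\big((W/\gamma)^2 \log n\big) = O(r^2 \log n)\) mistakes.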
Littlestone's Winnow algorithm [MLJ 1988]
How does it work? Balanced version:
- Maintain weight vectors w+ and w−. Initialize all weights to 1. Classify based on whether w+·x or w−·x is larger. (Features are ≥ 0.)
- If we make a mistake on a positive x, then for each xi = 1:
  wi+ ← (1+ε)wi+,  wi− ← (1−ε)wi−.
- And vice-versa for a mistake on a negative x.
Other properties:
- Can show this approximates maxent constraints.
- In the other direction, [Ng'04] shows that maxent with L1 regularization gets Winnow-like bounds.
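A minimal Python sketch of the balanced Winnow update just described (my own illustration; the variable names and the ε value are not from the talk):

def balanced_winnow(examples, eps=0.1):
    """examples: list of (x, y), x a list of 0/1 features, y in {+1, -1}."""
    n = len(examples[0][0])
    w_pos = [1.0] * n   # w+
    w_neg = [1.0] * n   # w-
    mistakes = 0
    for x, y in examples:
        s_pos = sum(wp * xi for wp, xi in zip(w_pos, x))
        s_neg = sum(wn * xi for wn, xi in zip(w_neg, x))
        pred = 1 if s_pos >= s_neg else -1
        if pred != y:
            mistakes += 1
            for i, xi in enumerate(x):
                if xi == 1:                    # only active features get updated
                    if y == 1:                 # mistake on a positive example
                        w_pos[i] *= (1 + eps)
                        w_neg[i] *= (1 - eps)
                    else:                      # mistake on a negative example
                        w_pos[i] *= (1 - eps)
                        w_neg[i] *= (1 + eps)
    return w_pos, w_neg, mistakes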
Practical issues
- On a batch problem, may want to cycle through the data, each time with a smaller ε.
- Can also do a margin version: update if just barely correct.
- If want to output a likelihood, the natural choice is
  e^(w+·x) / [e^(w+·x) + e^(w−·x)].
  Can extend to multiclass too.
- William & Vitor have a paper with some other nice practical adjustments.
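In code, the likelihood output above is just a softmax over the two Winnow scores (a sketch, with a max-shift added for numerical stability; not code from the talk):

import math

def winnow_likelihood(w_pos, w_neg, x):
    """Returns P(label = +1 | x) = e^(w+.x) / [e^(w+.x) + e^(w-.x)]."""
    s_pos = sum(wp * xi for wp, xi in zip(w_pos, x))
    s_neg = sum(wn * xi for wn, xi in zip(w_neg, x))
    m = max(s_pos, s_neg)                  # subtract max before exponentiating
    e_pos = math.exp(s_pos - m)
    e_neg = math.exp(s_neg - m)
    return e_pos / (e_pos + e_neg)

Adding more weight vectors, one per class, extends this to the multiclass case mentioned above.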
Winnow versus Perceptron/SVM
Winnow is similar at a high level to Perceptron updates. What's the difference?
- Suppose data is linearly separable by w·x = 0 with |w·x| ≥ γ.
- For Perceptron, mistakes/samples bounded by O((L2(w) L2(x) / γ)^2).
- For Winnow, mistakes/samples bounded by O((L1(w) L∞(x) / γ)^2 log n).
- For boolean features, L∞(x) = 1; L2(x) can be sqrt(n).
- If the target is sparse and examples are dense, Winnow is better. E.g., x random in {0,1}^n, f(x) = x1. Perceptron: O(n) mistakes.
- If the target is dense (most features are relevant) and examples are sparse, then Perceptron wins.
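To make the sparse-target comparison concrete, here is an illustrative back-of-the-envelope calculation (my own normalization: w is the indicator of the r relevant features, γ = 1, constants dropped):

import math

def perceptron_bound(n, r, gamma=1.0):
    # (L2(w) * L2(x) / gamma)^2  with  L2(w) = sqrt(r), L2(x) <= sqrt(n)
    return (math.sqrt(r) * math.sqrt(n) / gamma) ** 2

def winnow_bound(n, r, gamma=1.0):
    # (L1(w) * Linf(x) / gamma)^2 * log n  with  L1(w) = r, Linf(x) = 1
    return (r / gamma) ** 2 * math.log(n)

print(perceptron_bound(10_000, 5))   # ~50000: grows linearly with n
print(winnow_bound(10_000, 5))       # ~230:   grows only with log n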
OK, now on to kernels…
Generic problem
- Given a set of images, want to learn a linear separator to distinguish men from women.
- Problem: pixel representation no good.
One approach:
- Pick a better set of features! But seems ad-hoc.
Instead:
- Use a kernel! K(x,y) = φ(x)·φ(y), where φ is an implicit, high-dimensional mapping.
- Perceptron/SVM only interact with data through dot-products, so can be "kernelized". If data is separable in φ-space by a large L2 margin, don't have to pay for it.
Kernels
- E.g., the kernel K(x,y) = (1 + x·y)^d for the case of n=2, d=2 corresponds to the implicit mapping:
[Figure: the same X/O data plotted in the original (x1, x2) coordinates and in the mapped (z1, z2, z3) coordinates, where it becomes linearly separable.]
Kernels
- Perceptron/SVM only interact with data through dot-products, so can be "kernelized". If data is separable in φ-space by a large L2 margin, don't have to pay for it.
- E.g., K(x,y) = (1 + x·y)^d: φ maps the n-diml space to an ~n^d-diml space.
- E.g., K(x,y) = e^(−(x−y)^2).
- Conceptual warning: You're not really "getting all the power of the high-dimensional space without paying for it". The margin matters.
- E.g., K(x,y) = 1 if x = y, K(x,y) = 0 otherwise. Corresponds to a mapping where every example gets its own coordinate. Everything is linearly separable but there is no generalization.
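To make the "only interacts through dot-products" point concrete, here is a generic kernelized Perceptron sketch (an illustration, not code from the talk): the weight vector in φ-space is represented implicitly by the examples the algorithm has erred on, so any kernel K can be plugged in.

def poly_kernel(x, y, d=2):
    return (1 + sum(a * b for a, b in zip(x, y))) ** d

def kernel_perceptron(examples, K=poly_kernel, epochs=5):
    """examples: list of (x, y) with y in {+1, -1}.  Returns the mistake set;
    the prediction on a new x is sign(sum_i y_i K(x_i, x))."""
    mistakes = []
    for _ in range(epochs):
        for x, y in examples:
            score = sum(yi * K(xi, x) for xi, yi in mistakes)
            if y * score <= 0:          # mistake (or score 0): store the example
                mistakes.append((x, y))
    return mistakes

def predict(mistakes, x, K=poly_kernel):
    return 1 if sum(yi * K(xi, x) for xi, yi in mistakes) >= 0 else -1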
Question: do we need the
notion of an implicit space to
understand what makes a
kernel helpful for learning?
Focus on batch setting
- Assume examples drawn from some probability distribution:
  - Distribution D over x, labeled by target function c.
  - Or distribution P over (x, l).
  - Will call P (or (c,D)) our "learning problem".
- Given labeled training data, want the algorithm to do well on new data.
Something funny about theory of kernels
- On the one hand, operationally a kernel is just a similarity function: K(x,y) ∈ [-1,1], with some extra requirements. [Here I'm scaling to |φ(x)| = 1.]
- And in practice, people think of a good kernel as a good measure of similarity between data points for the task at hand.
- But Theory talks about margins in an implicit high-dimensional φ-space: K(x,y) = φ(x)·φ(y).
I want to use ML to classify protein
structures and I’m trying to decide on
a similarity fn to use. Any help?
It should be pos. semidefinite, and should
result in your data having a large margin
separator in implicit high-diml space you
probably can’t even calculate.
Umm… thanks, I guess.
Something funny about theory of kernels
- Theory talks about margins in an implicit high-dimensional φ-space: K(x,y) = φ(x)·φ(y).
- Not great for intuition (do I expect this kernel or that one to work better for me?).
- Can we connect better with the idea of a good kernel being one that is a good notion of similarity for the problem at hand?
- Motivation [BBV]: If margin γ in φ-space, then can pick Õ(1/γ^2) random examples y1,…,yn ("landmarks"), and do the mapping x → [K(x,y1),…,K(x,yn)]. Whp data in this space will be approximately linearly separable.
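A sketch of the [BBV] landmark mapping just described (the function names and the way landmarks are drawn here are illustrative):

import random

def landmark_features(x, landmarks, K):
    # Represent x by its similarity to each landmark: [K(x, y1), ..., K(x, yn)].
    return [K(x, y) for y in landmarks]

def make_landmark_dataset(labeled_data, unlabeled_pool, K, n_landmarks):
    landmarks = random.sample(unlabeled_pool, n_landmarks)   # roughly O(1/gamma^2) of them
    mapped = [(landmark_features(x, landmarks, K), y) for x, y in labeled_data]
    return mapped, landmarks   # then train any explicit linear separator on `mapped`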
Goal: notion of "good similarity function" that…
1. Talks in terms of more intuitive properties (no implicit high-diml spaces, no requirement of positive-semidefiniteness, etc.)
2. If K satisfies these properties for our given problem, then it has implications for learning.
3. Is broad: includes the usual notion of a "good kernel" (one that induces a large-margin separator in φ-space).
If so, then this can help with designing the K.
[Recent work with Nina, with extensions by Nati Srebro]
Proposal satisfying (1) and (2):
- Say we have a learning problem P (distribution D over examples labeled by unknown target f).
- Sim fn K: (x,y) → [-1,1] is (ε,γ)-good for P if at least a 1−ε fraction of examples x satisfy:
  E_{y~D}[K(x,y) | l(y)=l(x)] ≥ E_{y~D}[K(x,y) | l(y)≠l(x)] + γ
- Q: how could you use this to learn?
How to use it
At least a 1−ε prob mass of x satisfy:
  E_{y~D}[K(x,y) | l(y)=l(x)] ≥ E_{y~D}[K(x,y) | l(y)≠l(x)] + γ
- Draw S+ of O((1/γ^2) ln(1/δ^2)) positive examples.
- Draw S− of O((1/γ^2) ln(1/δ^2)) negative examples.
- Classify x based on which gives the better score.
- Hoeffding: for any given "good x", the prob of error over the draw of S+, S− is at most δ^2.
- So, at most a δ chance our draw is bad on more than a δ fraction of the "good x".
- With prob ≥ 1−δ, error rate ≤ ε + δ.
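In code, the classifier described above is just a comparison of average similarities (a minimal sketch, not code from the talk):

def avg_sim(K, x, sample):
    return sum(K(x, y) for y in sample) / len(sample)

def classify(K, x, S_plus, S_minus):
    # Predict the class whose sample x is more similar to on average.
    return 1 if avg_sim(K, x, S_plus) >= avg_sim(K, x, S_minus) else -1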
But not broad enough
[Figure: two clusters of positives at 30° angles and a cluster of negatives.]
- K(x,y) = x·y has a good separator but doesn't satisfy the defn. (Half of the positives are more similar to negatives than to a typical positive.)
But not broad enough
[Figure: same example as above.]
- Idea: would work if we didn't pick y's from the top-left.
- Broaden to say: OK if ∃ a large region R s.t. most x are on average more similar to y ∈ R of the same label than to y ∈ R of the other label. (Even if we don't know R in advance.)
Broader defn…
- Say K: (x,y) → [-1,1] is an (ε,γ)-good similarity function for P if there exists a weighting function w(y) ∈ [0,1] s.t. at least a 1−ε fraction of x satisfy:
  E_{y~D}[w(y)K(x,y) | l(y)=l(x)] ≥ E_{y~D}[w(y)K(x,y) | l(y)≠l(x)] + γ
- Can still use for learning:
  - Draw S+ = {y1,…,yn}, S− = {z1,…,zn}, n = Õ(1/γ^2).
  - Use to "triangulate" the data:
    x → [K(x,y1), …, K(x,yn), K(x,z1), …, K(x,zn)].
  - Whp, there exists a good separator in this space: w = [w(y1),…,w(yn), −w(z1),…,−w(zn)].
Broader defn…
- Say K: (x,y) → [-1,1] is an (ε,γ)-good similarity function for P if there exists a weighting function w(y) ∈ [0,1] s.t. at least a 1−ε fraction of x satisfy:
  E_{y~D}[w(y)K(x,y) | l(y)=l(x)] ≥ E_{y~D}[w(y)K(x,y) | l(y)≠l(x)] + γ
- Whp, there exists a good separator in this space: w = [w(y1),…,w(yn), −w(z1),…,−w(zn)].
- So, take a new set of examples, project to this space, and run your favorite linear separator learning algorithm.*
*Technically bounds are better if we adjust the definition to penalize examples more that fail the inequality badly…
Broader defn…
Algorithm
- Draw S+ = {y1,…,yd}, S− = {z1,…,zd}, d = O((1/γ^2) ln(1/δ^2)). Think of these as "landmarks".
- Use to "triangulate" the data:
  x → [K(x,y1), …, K(x,yd), K(x,z1), …, K(x,zd)].
- Guarantee: with prob. ≥ 1−δ, there exists a linear separator of error ≤ ε + δ at margin γ/4.
- Actually, the margin is good in both the L1 and L2 senses.
- This particular approach requires wasting examples for use as the "landmarks". But could use unlabeled data for this part.
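A sketch of the triangulation step above, with a plain Perceptron standing in for "your favorite linear separator learning algorithm" (the learner choice and names are mine, not the talk's):

def triangulate(x, pos_landmarks, neg_landmarks, K):
    # Map x to its similarities to the positive and negative landmarks.
    return [K(x, y) for y in pos_landmarks] + [K(x, z) for z in neg_landmarks]

def perceptron(train, epochs=10):
    w = [0.0] * len(train[0][0])
    for _ in range(epochs):
        for x, y in train:
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

def learn_with_similarity(labeled, pos_landmarks, neg_landmarks, K):
    mapped = [(triangulate(x, pos_landmarks, neg_landmarks, K), y) for x, y in labeled]
    return perceptron(mapped)   # any linear-separator learner would do here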
Interesting property of definition
- An (ε,γ)-good kernel [at least a 1−ε fraction of x have margin ≥ γ] is an (ε′,γ′)-good sim fn under this definition.
- But our current proofs suffer a penalty: ε′ = ε + ε_extra, γ′ = γ^3 ε_extra. Nati Srebro has improved this to γ^2, which is tight, and extended it to hinge-loss.
- So, at a qualitative level, can have a theory of similarity functions that doesn't require implicit spaces.
Approach we're investigating
With Nina & Mugizi:
- Take a problem where the original features are already pretty good, plus you have a couple of reasonable similarity functions K1, K2, …
- Take some unlabeled data as landmarks, use to enlarge the feature space: K1(x,y1), K2(x,y1), K1(x,y2), …
- Run Winnow on the result.
- Can prove guarantees if some convex combination of the Ki is good.
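A sketch of the feature-enlarging step (illustrative names; not code from the talk): concatenate the original features with Ki(x, yj) over the unlabeled landmarks, then feed the result to Winnow, which tolerates the many added features because its mistake bound grows only logarithmically with the dimension.

def enlarged_features(x, landmarks, sims):
    """x: original feature vector; landmarks: unlabeled points; sims: [K1, K2, ...]."""
    extra = [K(x, y) for y in landmarks for K in sims]
    return list(x) + extra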
Open questions
- This view gives some sufficient conditions for a similarity function to be useful for learning, but doesn't have direct implications for plugging the function directly into an SVM, say.
- Can one define other interesting, reasonably intuitive, sufficient conditions for a similarity function to be useful for learning?