Features, Kernels, and
Similarity functions
Avrim Blum
Machine learning lunch 03/05/07
Suppose you want to…
use learning to solve some classification problem.
E.g., given a set of images, learn a
rule to distinguish men from women.
The first thing you need to do is decide what
you want as features.
Or, for algs like SVM and Perceptron, can use a
kernel function, which provides an implicit
feature space. But then what kernel to use?
Can Theory provide any help or guidance?
Plan for this talk
Discuss a few ways theory might be of help:
Algorithms designed to do well in large feature
spaces when only a small number of features are
actually useful.
So you can pile a lot on when you don’t know much
about the domain.
Kernel functions. Standard theoretical view,
plus new one that may provide more guidance.
Bridge between “implicit mapping” and “similarity
function” views. Talk about quality of a kernel in
terms of more tangible properties. [work with Nina
Balcan]
Combining the above. Using kernels to generate
explicit features.
A classic conceptual question
How is it possible to learn anything quickly when
there is so much irrelevant information around?
Must there be some hard-coded focusing
mechanism, or can learning handle it?
A classic conceptual question
Let’s try a very simple theoretical model.
Have n boolean features. Labels are + or -.
1001101110 +
1100111101 +
0111010111 -
Assume distinction is based on just one feature.
How many prediction mistakes do you need to
make before you’ve figured out which one it is?
Can take majority vote over all possibilities consistent
with data so far. Each mistake crosses off at least
half. O(log n) mistakes total.
log(n) is good: doubling n only adds 1 more mistake.
Can’t do better (consider log(n) random strings with
random labels. Whp there is a consistent feature in
hindsight).
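To make the halving argument concrete, here is a minimal Python sketch, under the assumption that each example is a list of n booleans and the label really is the value of one unknown feature (the function names are my own, not from the talk):

def halving_predictor(n):
    # Majority vote over all single-feature hypotheses still consistent with
    # the data seen so far; each mistake eliminates at least half of them,
    # so at most about log2(n) mistakes are made in total.
    alive = set(range(n))
    def predict_then_update(x, label):
        votes_for_true = sum(1 for i in alive if x[i])
        prediction = (2 * votes_for_true >= len(alive))   # majority vote
        # after the label is revealed, keep only the features that agree with it
        alive.intersection_update({i for i in alive if x[i] == label})
        return prediction
    return predict_then_update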
A classic conceptual question
What about more interesting classes of functions
(not just target a single feature)?
Littlestone’s Winnow algorithm [MLJ 1988]
Motivated by the question: what if the target is an OR of r << n features? E.g., f(x) = x4 ∨ x7 ∨ x10:
100101011001101011  +
Majority vote scheme over all n^r possibilities would make O(r log n) mistakes, but is totally impractical. Can you do this efficiently?
Winnow is simple efficient algorithm that meets
this bound.
More generally, if there exists an LTF such that
positives satisfy w1x1 + w2x2 + … + wnxn ≥ c,
negatives satisfy w1x1 + w2x2 + … + wnxn ≤ c − γ    (W = Σi |wi|),
then # mistakes = O((W/γ)² log n).
E.g., if target is “k of r” function, get O(r2 log n).
Key point: still only log dependence on n.
Littlestone’s Winnow algorithm [MLJ 1988]
How does it work? Balanced version:
Maintain weight vectors w+ and w-; initialize all weights to 1. Classify based on whether w+·x or w-·x is larger. (Have x ≥ 0.)
If make a mistake on a positive x, then for each xi=1: wi+ ← (1+ε)wi+, wi- ← (1-ε)wi-.
And vice-versa for a mistake on a negative x. (A code sketch follows after this slide.)
Other properties:
Can show this approximates maxent constraints.
In other direction, [Ng’04] shows that maxent with L1
regularization gets Winnow-like bounds.
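A minimal Python sketch of the balanced version just described (the class name and the default value of ε are my own choices; the talk gives only the update rule):

class BalancedWinnow:
    def __init__(self, n, eps=0.1):
        # one positive and one negative weight per feature, all initialized to 1
        self.eps = eps
        self.w_plus = [1.0] * n
        self.w_minus = [1.0] * n

    def predict(self, x):
        # classify by whether w+ . x or w- . x is larger (x is a 0/1 vector)
        s_plus = sum(w * xi for w, xi in zip(self.w_plus, x))
        s_minus = sum(w * xi for w, xi in zip(self.w_minus, x))
        return s_plus >= s_minus

    def update(self, x, label):
        # multiplicative update on the active features, applied only on a mistake
        if self.predict(x) == label:
            return
        up, down = 1 + self.eps, 1 - self.eps
        if not label:            # mistake on a negative x: demote w+, promote w-
            up, down = down, up
        for i, xi in enumerate(x):
            if xi:
                self.w_plus[i] *= up
                self.w_minus[i] *= down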
Practical issues
On batch problem, may want to cycle through data, each time with smaller ε.
Can also do margin version: update if just barely
correct.
If want to output a likelihood, natural is e^(w+·x) / [e^(w+·x) + e^(w-·x)]. Can extend to multiclass too.
William & Vitor have paper with some other nice
practical adjustments.
Winnow versus Perceptron/SVM
Winnow is similar at high level to Perceptron
updates. What’s the difference?
Suppose data is linearly separable by w·x = 0 with |w·x| ≥ γ.
For Perceptron, mistakes/samples bounded by O((L2(w)·L2(x)/γ)²).
For Winnow, mistakes/samples bounded by O((L1(w)·L∞(x)/γ)² log n).
For boolean features, L∞(x)=1; L2(x) can be sqrt(n).
If target is sparse, examples dense, Winnow is better.
E.g., x random in {0,1}^n, f(x)=x1. Perceptron: O(n) mistakes.
If target is dense (most features are relevant) and
examples are sparse, then Perceptron wins.
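To see the scaling difference concretely, here is a back-of-the-envelope calculation for the f(x)=x1 example above; constants and the bias coordinate are ignored, so treat it as illustrative only:

import math

n = 10_000                                         # number of boolean features
# Winnow: target is sparse, so L1(w) ~ 1, and Linf(x) = 1 for boolean x
winnow_bound = (1 * 1 / 1) ** 2 * math.log(n)      # ~ log n mistakes
# Perceptron: L2(w) ~ 1, but a dense random x has L2(x) ~ sqrt(n)
perceptron_bound = (1 * math.sqrt(n) / 1) ** 2     # ~ n mistakes
print(f"Winnow ~ {winnow_bound:.0f}, Perceptron ~ {perceptron_bound:.0f}")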
OK, now on to kernels…
Generic problem
Given a set of images, want to learn a linear separator to distinguish men from women.
Problem: pixel representation no good.
One approach:
Pick a better set of features! But seems ad-hoc.
Instead:
Use a Kernel! K(x,y) = Φ(x)·Φ(y), where Φ is an implicit, high-dimensional mapping.
Perceptron/SVM only interact with data through dot-products, so can be "kernelized". If data is separable in Φ-space by a large L2 margin, don't have to pay for it.
Kernels
E.g., the kernel K(x,y) = (1+x·y)^d for the case of n=2, d=2, corresponds to the implicit mapping:
[Figure: the same X/O data plotted in the original (x1, x2) space, where it is not linearly separable, and in the mapped (z1, z2, z3) space, where it is.]
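A quick numerical check of this correspondence for n=2, d=2, using one standard choice of explicit feature map (the coordinate ordering and √2 scaling are the usual textbook convention, not spelled out on the slide):

import math, random

def K(x, y):
    # polynomial kernel (1 + x . y)^2
    return (1 + sum(a * b for a, b in zip(x, y))) ** 2

def phi(x):
    # explicit 6-dimensional map whose dot product reproduces K for n = 2
    x1, x2 = x
    r2 = math.sqrt(2)
    return [1.0, x1 * x1, x2 * x2, r2 * x1, r2 * x2, r2 * x1 * x2]

x = [random.random(), random.random()]
y = [random.random(), random.random()]
assert abs(K(x, y) - sum(a * b for a, b in zip(phi(x), phi(y)))) < 1e-9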
Kernels
Perceptron/SVM only interact with data through dot-products, so can be "kernelized". If data is separable in Φ-space by a large L2 margin, don't have to pay for it.
E.g., K(x,y) = (1 + x·y)^d: Φ maps an n-dim'l space into an n^d-dim'l space.
E.g., K(x,y) = e^(-||x-y||²).
Conceptual warning: You’re not really “getting all the
power of the high dimensional space without paying
for it”. The margin matters.
E.g., K(x,y)=1 if x=y, K(x,y)=0 otherwise.
Corresponds to mapping where every example gets
its own coordinate. Everything is linearly separable
but no generalization.
Question: do we need the
notion of an implicit space to
understand what makes a
kernel helpful for learning?
Focus on batch setting
Assume examples drawn from some probability
distribution:
Distribution D over x, labeled by target function c.
Or distribution P over (x, l)
Will call P (or (c,D)) our “learning problem”.
Given labeled training data, want algorithm to do
well on new data.
Something funny about theory of kernels
On the one hand, operationally a kernel is just a similarity function: K(x,y) ∈ [-1,1], with some extra requirements. [Here I'm scaling to |Φ(x)| = 1.]
And in practice, people think of a good kernel as a good
measure of similarity between data points for the task
at hand.
But Theory talks about margins in an implicit high-dimensional Φ-space: K(x,y) = Φ(x)·Φ(y).
I want to use ML to classify protein
structures and I’m trying to decide on
a similarity fn to use. Any help?
It should be pos. semidefinite, and should
result in your data having a large margin
separator in implicit high-diml space you
probably can’t even calculate.
Umm… thanks, I guess.
Something funny about theory of kernels
Theory talks about margins in an implicit high-dimensional Φ-space: K(x,y) = Φ(x)·Φ(y).
Not great for intuition (do I expect this
kernel or that one to work better for me)
Can we connect better with idea of a good
kernel being one that is a good notion of
similarity for the problem at hand?
- Motivation [BBV]: If margin γ in Φ-space, then can pick Õ(1/γ²) random examples y1,…,yn ("landmarks"), and do the mapping x ↦ [K(x,y1),…,K(x,yn)]. Whp data in this space will be apx linearly separable.
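A sketch of that landmark construction (the helper name is mine; per [BBV], num_landmarks would be Õ(1/γ²)):

import random

def landmark_map(K, unlabeled_pool, num_landmarks):
    # pick random "landmark" examples and map each point x to its vector of
    # similarities to them: x -> [K(x, y1), ..., K(x, yn)]; if the data has an
    # L2 margin in the implicit phi-space, the mapped data is whp approximately
    # linearly separable
    landmarks = random.sample(unlabeled_pool, num_landmarks)
    return lambda x: [K(x, y) for y in landmarks]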
Goal: notion of “good similarity function” that…
1. Talks in terms of more intuitive properties (no implicit high-dim'l spaces, no requirement of positive-semidefiniteness, etc.)
2. If K satisfies these properties for our given problem, then it has implications for learning.
3. Is broad: includes the usual notion of a "good kernel" (one that induces a large-margin separator in Φ-space).
If so, then this can help with designing the K.
[Recent work with Nina, with extensions by Nati Srebro]
Proposal satisfying (1) and (2):
Say have a learning problem P (distribution D over
examples labeled by unknown target f).
Sim fn K: (x,y) → [-1,1] is (ε,γ)-good for P if at least a 1-ε fraction of examples x satisfy:
Ey~D[K(x,y) | ℓ(y)=ℓ(x)] ≥ Ey~D[K(x,y) | ℓ(y)≠ℓ(x)] + γ
Q: how could you use this to learn?
How to use it
At least a 1-ε prob mass of x satisfy:
Ey~D[K(x,y) | ℓ(y)=ℓ(x)] ≥ Ey~D[K(x,y) | ℓ(y)≠ℓ(x)] + γ
Draw S+ of O((1/γ²) ln(1/δ²)) positive examples.
Draw S- of O((1/γ²) ln(1/δ²)) negative examples.
Classify x based on which gives the better score.
Hoeffding: for any given "good x", prob of error over the draw of S+, S- is at most δ².
So, at most a δ chance our draw is bad on more than a δ fraction of "good x".
With prob ≥ 1-δ, error rate ≤ ε + δ.
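A direct sketch of this rule (names are mine; the sample sizes above are left to the caller):

def avg_similarity_classifier(K, S_plus, S_minus):
    # classify x by whether its average similarity to the positive sample or to
    # the negative sample is larger; if K is (eps, gamma)-good, the
    # Hoeffding/Markov argument above bounds the error by eps + delta whp
    def classify(x):
        score_plus = sum(K(x, y) for y in S_plus) / len(S_plus)
        score_minus = sum(K(x, y) for y in S_minus) / len(S_minus)
        return score_plus >= score_minus
    return classify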
But not broad enough
[Figure: unit-length examples with the positives split into two clusters (30° angles shown) and the negatives in a third direction.]
K(x,y) = x·y has a good separator but doesn't satisfy the defn (half of the positives are more similar to negatives than to a typical positive).
But not broad enough
Idea: would work if we didn't pick y's from the top-left.
Broaden to say: OK if ∃ a large region R s.t. most x are on average more similar to y ∈ R of the same label than to y ∈ R of the other label (even if we don't know R in advance).
Broader defn…
Say K: (x,y) → [-1,1] is an (ε,γ)-good similarity function for P if there exists a weighting function w(y) ∈ [0,1] s.t. at least a 1-ε frac. of x satisfy:
Ey~D[w(y)K(x,y) | ℓ(y)=ℓ(x)] ≥ Ey~D[w(y)K(x,y) | ℓ(y)≠ℓ(x)] + γ
Can still use for learning:
Draw S+ = {y1,…,yn}, S- = {z1,…,zn}, n = Õ(1/γ²).
Use to "triangulate" data: x ↦ [K(x,y1), …, K(x,yn), K(x,z1), …, K(x,zn)].
Whp, exists a good separator in this space: w = [w(y1),…,w(yn), -w(z1),…,-w(zn)].
Broader defn…
Say K: (x,y) → [-1,1] is an (ε,γ)-good similarity function for P if there exists a weighting function w(y) ∈ [0,1] s.t. at least a 1-ε frac. of x satisfy:
Ey~D[w(y)K(x,y) | ℓ(y)=ℓ(x)] ≥ Ey~D[w(y)K(x,y) | ℓ(y)≠ℓ(x)] + γ
So, take a new set of examples, project to this space, and run your favorite linear separator learning algorithm.*
Whp, exists a good separator in this space: w = [w(y1),…,w(yn), -w(z1),…,-w(zn)].
*Technically bounds are better if adjust definition to penalize examples more that fail the inequality badly…
Broader defn…
Algorithm
Draw S+ = {y1,…,yd}, S- = {z1,…,zd}, d = O((1/γ²) ln(1/δ²)).
Think of these as “landmarks”.
Use to “triangulate” data:
x ↦ [K(x,y1), …, K(x,yd), K(x,z1), …, K(x,zd)].
Guarantee: with prob. ≥ 1-δ, exists a linear separator of error ≤ ε + δ at margin γ/4.
Actually, margin is good in both L1 and L2 senses.
This particular approach requires wasting examples for
use as the “landmarks”. But could use unlabeled data
for this part.
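Putting this algorithm into code, with scikit-learn's LinearSVC standing in for whichever linear-separator learner one prefers (that particular choice, and the function names, are mine):

from sklearn.svm import LinearSVC

def triangulate(K, pos_landmarks, neg_landmarks):
    # x -> [K(x,y1), ..., K(x,yd), K(x,z1), ..., K(x,zd)]
    def transform(x):
        return ([K(x, y) for y in pos_landmarks] +
                [K(x, z) for z in neg_landmarks])
    return transform

def learn_with_similarity(K, pos_landmarks, neg_landmarks, X_train, y_train):
    # project fresh labeled examples into the landmark space and learn a linear
    # separator there; the guarantee says a low-error, large-margin one exists whp
    transform = triangulate(K, pos_landmarks, neg_landmarks)
    Z = [transform(x) for x in X_train]
    clf = LinearSVC().fit(Z, y_train)
    return lambda x: clf.predict([transform(x)])[0]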
Interesting property of definition
An (ε,γ)-good kernel [at least a 1-ε fraction of x have margin ≥ γ] is an (ε',γ')-good sim fn under this definition.
But our current proofs suffer a penalty: ε' = ε + ε_extra, γ' = γ³·ε_extra.
Nati Srebro has improved the γ³ to γ², which is tight, and extended it to hinge-loss.
So, at qualitative level, can have theory of
similarity functions that doesn’t require
implicit spaces.
Approach we’re investigating
With Nina & Mugizi:
Take a problem where original features already
pretty good, plus you have a couple reasonable
similarity functions K1, K2,…
Take some unlabeled data as landmarks, use to
enlarge feature space K1(x,y1), K2(x,y1), K1(x,y2),…
Run Winnow on the result.
Can prove guarantees if some convex
combination of the Ki is good.
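A sketch of the feature-enlargement step just described (the interleaving order of the Ki(x, yj) columns and the names are my own; Winnow itself would be the balanced version sketched earlier):

def enlarged_features(x, x_features, sims, landmarks):
    # x_features: the original ("already pretty good") feature vector for x
    # sims:       the candidate similarity functions K1, K2, ...
    # landmarks:  unlabeled examples y1, y2, ... used only as reference points
    # Returns [original features, K1(x,y1), K2(x,y1), K1(x,y2), ...], which is
    # then fed to Winnow; if some convex combination of the Ki is good, a sparse
    # linear separator exists over these extra features.
    extra = [K(x, y) for y in landmarks for K in sims]
    return list(x_features) + extra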
Open questions
This view gives some sufficient conditions for a similarity function to be useful for learning, but it doesn't have direct implications for plugging the similarity function straight into an SVM, say.
Can one define other interesting, reasonably
intuitive, sufficient conditions for a similarity
function to be useful for learning?