
On a Theory of Similarity Functions for Learning and Clustering
Avrim Blum
Carnegie Mellon University
This talk is based on joint work with Nina Balcan, Nati Srebro, and Santosh Vempala.
Theory and Practice of Computational Learning, 2009
2-minute version
• Suppose we are given a set of images, and want to learn a rule to distinguish men from women. Problem: pixel representation not so good.
• A powerful technique for such settings is to use a kernel: a special kind of pairwise similarity function K(·,·).
• But, theory in terms of implicit mappings.
Q: Can we develop a theory that just views K as a measure of similarity? Develop a more general and intuitive theory of when K is useful for learning?
2-minute version
• Suppose we are given a set of images, and want to learn a rule to distinguish men from women. Problem: pixel representation not so good.
• A powerful technique for such settings is to use a kernel: a special kind of pairwise similarity function K(·,·).
• But, theory in terms of implicit mappings.
Q: What if we only have unlabeled data (i.e., clustering)? Can we develop a theory of properties that are sufficient to be able to cluster well?
2-minute version
• Suppose we are given a set of images, and want to learn a rule to distinguish men from women. Problem: pixel representation not so good.
• A powerful technique for such settings is to use a kernel: a special kind of pairwise similarity function K(·,·).
• But, theory in terms of implicit mappings.
Develop a kind of PAC model for clustering.
Part 1: On similarity functions for learning
Theme of this part
• A theory of natural sufficient conditions for similarity functions to be useful for classification learning problems.
Doesn't require PSD or implicit spaces, but includes the notion of a large-margin kernel.
At a formal level, can even allow you to learn more (can define classes of functions with no large-margin kernel, even if we allow substantial hinge loss, but that do have a good similarity function under this notion).
Kernels
• We have a lot of great algorithms for learning linear separators (perceptron, SVM, …). But a lot of the time, data is not linearly separable.
– “Old” answer: use a multi-layer neural network.
– “New” answer: use a kernel function!
• Many algorithms only interact with the data via dot-products.
– So, let's just re-define the dot-product.
– E.g., K(x,y) = (1 + x·y)^d.
• K(x,y) = φ(x)·φ(y), where φ() is an implicit mapping into an n^d-dimensional space.
– The algorithm acts as if the data is in “φ-space”. This allows it to produce a non-linear curve in the original space.
Example
E.g., for n=2, d=2, the kernel K(x,y) = (x·y)² corresponds to an implicit mapping φ into a 3-dimensional φ-space.
[Figure: the same data plotted in the original space (axes x₁, x₂) and in φ-space (axes z₁, z₂, z₃); a non-linear boundary between the X's and O's in the original space becomes a linear separator in φ-space.]
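To make the implicit mapping concrete, here is a minimal sketch (not from the slides) that numerically checks the standard explicit mapping φ(x) = (x₁², x₂², √2·x₁x₂) for the kernel K(x,y) = (x·y)²; the helper names are mine.

```python
import numpy as np

# Sketch (not from the slides): verify that for n=2, d=2, the kernel
# K(x,y) = (x.y)^2 equals an explicit dot-product phi(x).phi(y),
# using the standard mapping phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2).

def K(x, y):
    return np.dot(x, y) ** 2

def phi(x):
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

rng = np.random.default_rng(0)
for _ in range(5):
    x, y = rng.normal(size=2), rng.normal(size=2)
    assert np.isclose(K(x, y), np.dot(phi(x), phi(y)))
print("K(x,y) = phi(x).phi(y) for all sampled pairs")
```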
Moreover, generalize well if good margin
• If the data is linearly separable by a large margin in φ-space, then good sample complexity.
If the margin is γ in φ-space (with |φ(x)| ≤ 1), then a sample size of only Õ(1/γ²) suffices to get confidence in generalization. [No dependence on the dimension.]
[Figure: positive and negative points in φ-space separated by a margin γ.]
• Kernels are useful in practice for dealing with many, many different kinds of data.
Limitations of the Current Theory
In practice: kernels are constructed by viewing them as measures of similarity.
Existing theory: in terms of margins in implicit spaces. Not best for intuition.
The kernel requirement rules out many natural similarity functions.
Alternative, perhaps more general, theoretical explanation?
A notion of a good similarity function that is:
[Balcan-Blum, ICML 2006] [Balcan-Blum-Srebro, MLJ 2008] [Balcan-Blum-Srebro, COLT 2008]
1) In terms of natural direct quantities.
• no implicit high-dimensional spaces
• no requirement that K(x,y) = φ(x)·φ(y)
If K has the property, K can be used to learn well.
2) Is broad: includes the usual notion of a good kernel (one that has a large-margin separator in φ-space).
3) Even formally allows you to do more.
[Diagram: the main notion contains both good kernels and the first attempt.]
A First Attempt
P: distribution over labeled examples (x, l(x)).
Goal: output a classification rule good for P.
K is good if most x are on average more similar to points y of their own type than to points y of the other type.
K is (ε,γ)-good for P if a 1−ε prob. mass of x satisfy:
E_{y~P}[K(x,y) | l(y)=l(x)]  ≥  E_{y~P}[K(x,y) | l(y)≠l(x)] + γ
(average similarity to points of the same label vs. average similarity to points of the opposite label, with gap γ)
A First Attempt
K is (,)-good for P if a 1- prob. mass of x satisfy:
Ey~P[K(x,y)|l(y)=l(x)] ¸ Ey~P[K(x,y)|l(y)l(x)]+
Algorithm
• Draw sets S+, S- of positive and negative examples.
• Classify x based on average similarity to S+ versus to S-.
0.4
S+
-1
1
x
1
0.5
S-
A First Attempt
K is (,)-good for P if a 1- prob. mass of x satisfy:
Ey~P[K(x,y)|l(y)=l(x)] ¸ Ey~P[K(x,y)|l(y)l(x)]+
Algorithm
• Draw sets S+, S- of positive and negative examples.
• Classify x based on average similarity to S+ versus to S-.
Theorem If |S+| and |S-| are ((1/2) ln(1/’)), then with
probability ¸ 1-, error · +’.
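A minimal sketch of this "first attempt" rule (the toy similarity and data below are my assumptions, not the authors' code): classify x by comparing its average similarity to a positive sample versus a negative sample.

```python
import numpy as np

# Sketch: classify x by average similarity to S_plus versus S_minus.
def classify(x, S_plus, S_minus, K):
    """Return +1 if x is on average more similar to S_plus than to S_minus."""
    avg_pos = np.mean([K(x, y) for y in S_plus])
    avg_neg = np.mean([K(x, y) for y in S_minus])
    return 1 if avg_pos >= avg_neg else -1

# Example usage with a toy similarity (a normalized dot product is an assumption):
K = lambda x, y: np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-12)
S_plus  = [np.array([1.0, 0.2]), np.array([0.9, 0.1])]
S_minus = [np.array([-1.0, 0.3]), np.array([-0.8, -0.2])]
print(classify(np.array([0.7, 0.0]), S_plus, S_minus, K))  # expected: +1
```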
A First Attempt: Not Broad Enough
E_{y~P}[K(x,y) | l(y)=l(x)]  ≥  E_{y~P}[K(x,y) | l(y)≠l(x)] + γ
[Figure: positives in two clumps with 30° angles marked, negatives below; some positives are more similar to − than to the typical +.]
Similarity function K(x,y) = x·y: the average similarity to the opposite label is ½, versus ½·1 + ½·(−½) = ¼ to one's own label.
• K has a large-margin separator, yet does not satisfy our definition.
A First Attempt: Not Broad Enough
E_{y~P}[K(x,y) | l(y)=l(x)]  ≥  E_{y~P}[K(x,y) | l(y)≠l(x)] + γ
[Figure: the same example, with a non-negligible set R of points highlighted.]
Broaden: ∃ a non-negligible set R s.t. most x are on average more similar to the points y ∈ R of the same label than to the points y ∈ R of the other label.
[Even if we do not know R in advance.]
Broader Definition
K is (, , )-good if 9 a set R of “reasonable” y (allow probabilistic)
s.t. 1- fraction of x satisfy: (technically  hinge loss)
Ey~P[K(x,y)|l(y)=l(x), R(y)] ¸ Ey~P[K(x,y)|l(y)l(x), R(y)]+
At least  prob. mass of reasonable positives & negatives.
Algorithm
• Draw S={y1, , yd} set of landmarks.
Re-represent data.
x ! F(x) = [K(x,y1), …,K(x,yd)].
P
d
F
R
F(P)
• If enough landmarks (d=(1/2  )), then with high prob. there
exists a good L1 large margin linear separator.
w=[0,0,1/n+,1/n+,0,0,0,-1/n-,0,0]
Broader Definition
K is (, , )–good if 9 a set R of “reasonable” y (allow probabilistic)
s.t. 1- fraction of x satisfy: (technically  hinge loss)
Ey~P[K(x,y)|l(y)=l(x), R(y)] ¸ Ey~P[K(x,y)|l(y)l(x), R(y)]+
At least  prob. mass of reasonable positives & negatives.
Algorithm
du=Õ(1/(2 ))
dl=O((1/(2²acc))ln du)
• Draw S={y1, , yd} set of landmarks.
Re-represent data.
P
X
O
X X
O
X
X
O O O
x ! F(x) = [K(x,y1), …,K(x,yd)]
F(P)
d
F
R
X
XX X
X
O
O O
O O
• Take a new set of labeled examples, project to this space, and run a
good L1 linear separator alg. (e.g., Winnow etc).
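A minimal sketch of this landmark pipeline. The details are assumptions of mine, not the slides' code: the Gaussian similarity, the toy data, and the use of L1-regularized logistic regression as a convenient stand-in for an L1 algorithm such as Winnow.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Sketch: re-represent each point by its similarities to d landmarks,
# then learn a sparse (L1) linear separator in that space.

def landmark_features(X, landmarks, K):
    """F(x) = [K(x, y1), ..., K(x, yd)] for each row x of X."""
    return np.array([[K(x, y) for y in landmarks] for x in X])

# Toy data and a toy similarity (the Gaussian similarity is an assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
labels = (X[:, 0] * X[:, 1] > 0).astype(int)          # XOR-like labels
K = lambda x, y: np.exp(-np.sum((x - y) ** 2))

landmarks = X[rng.choice(len(X), size=30, replace=False)]  # unlabeled draw
F = landmark_features(X, landmarks, K)

clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
clf.fit(F, labels)
print("training accuracy:", clf.score(F, labels))
```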
Kernels and Similarity Functions
[Diagram: large-margin kernels are a subset of good similarity functions.]
Theorem: K is a good kernel ⇒ K is also a good similarity function (but γ gets squared).
If K has margin γ in the implicit space, then for any ε, K is (ε, γ², ε)-good in our sense.
Kernels and Similarity Functions
[Diagram: large-margin kernels are a subset of good similarity functions.]
Theorem: K is a good kernel ⇒ K is also a good similarity function (but γ gets squared).
Can also show a separation.
Theorem: There exist a class C and a distribution D s.t. ∃ a similarity function with large γ for all f in C, but no large-margin kernel function exists.
Kernels and Similarity Functions
Theorem: For any class C of pairwise uncorrelated functions, ∃ a similarity function good for all f in C, but no such good kernel function exists.
• In principle, one should be able to learn from O(ε⁻¹ log(|C|/δ)) labeled examples.
• Claim 1: can define a generic (0, 1, 1/|C|)-good similarity function achieving this bound. (Assume D is not too concentrated.)
• Claim 2: there is no (ε,γ)-good kernel in hinge loss, even if ε = 1/2 and γ = 1/|C|^{1/2}. So the margin-based sample complexity is d = Ω(|C|).
Learning with Multiple Similarity Functions
• Let K1, …, Kr be similarity functions s.t. some (unknown) convex combination of them is (ε,γ)-good.
Algorithm
• Draw a set S = {y1, …, yd} of landmarks. Concatenate features:
F(x) = [K1(x,y1), …, Kr(x,y1), …, K1(x,yd), …, Kr(x,yd)].
• Run the same L1 optimization algorithm as before in this new feature space.
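A minimal sketch (assumed details, not the authors' code) of the concatenated representation: for each landmark, record the value of every similarity function, giving a d·r-dimensional feature vector for the same L1 learner.

```python
import numpy as np

# Sketch: F(x) = [K1(x,y1), ..., Kr(x,y1), ..., K1(x,yd), ..., Kr(x,yd)].
def multi_landmark_features(X, landmarks, K_list):
    feats = []
    for x in X:
        row = [K(x, y) for y in landmarks for K in K_list]
        feats.append(row)
    return np.array(feats)

# Example with two hypothetical similarity functions.
K1 = lambda x, y: np.exp(-np.sum((x - y) ** 2))   # Gaussian similarity
K2 = lambda x, y: float(np.dot(x, y))             # dot-product similarity
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
landmarks = X[:10]
F = multi_landmark_features(X, landmarks, [K1, K2])
print(F.shape)  # -> (50, 20): 10 landmarks times 2 similarity functions
```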
Learning with Multiple Similarity Functions
• Let K1, …, Kr be similarity functions s.t. some (unknown) convex combination of them is (ε,γ)-good.
Algorithm
• Draw a set S = {y1, …, yd} of landmarks. Concatenate features:
F(x) = [K1(x,y1), …, Kr(x,y1), …, K1(x,yd), …, Kr(x,yd)].
Guarantee: Whp the induced distribution F(P) in R^{2dr} has a separator of error ≤ ε + ε_acc at L1 margin at least γ/4.
Sample complexity is roughly the same as before: it only increases by a log(r) factor!
Learning with Multiple Similarity Functions
• Let K1, …, Kr be similarity functions s.t. some (unknown) convex combination of them is (ε,γ)-good.
Algorithm
• Draw a set S = {y1, …, yd} of landmarks. Concatenate features:
F(x) = [K1(x,y1), …, Kr(x,y1), …, K1(x,yd), …, Kr(x,yd)].
Guarantee: Whp the induced distribution F(P) in R^{2dr} has a separator of error ≤ ε + ε_acc at L1 margin at least γ/4.
Proof: Imagine the mapping F_o(x) = [K_o(x,y1), …, K_o(x,yd)] for the good similarity function K_o = α1·K1 + … + αr·Kr.
Consider w_o = (w1, …, wd) of L1 norm 1 and margin γ/4.
The vector w = (α1·w1, α2·w1, …, αr·w1, …, α1·wd, α2·wd, …, αr·wd) also has L1 norm 1 and satisfies w·F(x) = w_o·F_o(x).
Learning with Multiple Similarity Functions
• Because the property is defined in terms of L1, there is no change in the margin!
– Only a log(r) penalty for concatenating feature spaces.
– With L2, the margin would drop by a factor of r^{1/2}, giving an O(r) penalty in sample complexity.
• The algorithm is also very simple (just concatenate).
• Alternative algorithm: do a joint optimization:
– solve for K_o = (α1·K1 + … + αr·Kr) and a vector w_o s.t. w_o has a good L1 margin in the space defined by F_o(x) = [K_o(x,y1), …, K_o(x,yd)];
– the bound also holds here since the capacity is only lower;
– but we don’t know how to do this efficiently…
Part 2: Can we use this angle to help think about clustering?
Clustering comes up in many places
• Given a set of documents or search results, cluster them by topic.
• Given a collection of protein sequences, cluster them by function.
• Given a set of images of people, cluster by who is in them.
• …
Can model clustering like this:
• Given a data set S of n objects. [e.g., news articles]
• There is some (unknown) “ground truth” clustering C1*, C2*, …, Ck*. [e.g., sports, politics]
• Goal: produce a hypothesis clustering C1, C2, …, Ck that matches the target as much as possible.
[Minimize # mistakes, up to renumbering of indices.]
Problem: no labeled data!
But: do have a measure of similarity…
Can model clustering like this:
• Given a data set S of n objects. [e.g., news articles]
• There is some (unknown) “ground truth” clustering C1*, C2*, …, Ck*. [e.g., sports, politics]
• Goal: produce a hypothesis clustering C1, C2, …, Ck that matches the target as much as possible.
[Minimize # mistakes, up to renumbering of indices.]
Problem: no labeled data!
But: do have a measure of similarity…
What conditions on a similarity measure would be enough to allow one to cluster well?
Contrast with the more standard approach to clustering analysis:
• View the similarity/distance info as “ground truth”.
• Analyze the ability of algorithms to achieve different optimization criteria (min-sum, k-means, k-median, …).
• Or assume a generative model, like a mixture of Gaussians.
• Here, no generative assumptions. Instead: given the data, how powerful a K do we need to be able to cluster it well?
What conditions on a similarity measure would be enough to allow one to cluster well?
Here is a condition that trivially works. Suppose K has the property that:
• K(x,y) > 0 for all x,y such that C*(x) = C*(y).
• K(x,y) < 0 for all x,y such that C*(x) ≠ C*(y).
If we have such a K, then clustering is easy.
Now, let's try to make this condition a little weaker….
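As a sanity check, here is a minimal sketch (not from the slides; the matrix and helper are hypothetical) of why the trivial condition suffices: the connected components of the "K > 0" graph are exactly the target clusters.

```python
import numpy as np

# Sketch: if K(x,y) > 0 exactly within target clusters and K(x,y) < 0 across
# them, the connected components of the "K > 0" graph recover the clusters.
def cluster_by_sign(K_matrix):
    """Return a cluster id per point, via union-find on pairs with K > 0."""
    n = len(K_matrix)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if K_matrix[i][j] > 0:
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]

# Toy example: two clusters {0,1,2} and {3,4}.
K_matrix = np.array([
    [ 1.0,  0.5,  0.3, -0.4, -0.7],
    [ 0.5,  1.0,  0.6, -0.2, -0.5],
    [ 0.3,  0.6,  1.0, -0.6, -0.3],
    [-0.4, -0.2, -0.6,  1.0,  0.8],
    [-0.7, -0.5, -0.3,  0.8,  1.0],
])
print(cluster_by_sign(K_matrix))  # points 0-2 share one id, points 3-4 another
```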
What conditions on a similarity measure would be enough to allow one to cluster well?
Suppose K has the property that all x are more similar to all points y in their own cluster than to any y' in other clusters.
• Still a very strong condition. Problem: the same K can satisfy this for two very different clusterings of the same data!
[Figure: documents about baseball, basketball, math, and physics.]
Let’s weaken our goals a bit…
• OK to produce a hierarchical clustering (tree) such that the target clustering is approximately some pruning of it.
– E.g., in the case from the last slide:
[Tree: "all documents" splits into "sports" and "science"; "sports" splits into "baseball" and "basketball"; "science" splits into "math" and "physics".]
– Can view this as saying “if any of these clusters is too broad, just click and I will split it for you”.
• Or, OK to output a small # of clusterings such that at least one has low error (like list-decoding), but we won't talk about this one today.
Then you can start getting somewhere….
1. “All x more similar to all y in their own cluster than to any y' from any other cluster”
is sufficient to get a hierarchical clustering such that the target is some pruning of the tree. (Kruskal's / single-linkage works.)
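For concreteness, a minimal sketch of building such a tree with off-the-shelf single linkage. The toy similarity matrix and the conversion of similarities to dissimilarities as (max similarity − K) are my assumptions, not the slides'.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Sketch: under strict separation, single-linkage yields a tree whose
# prunings include the target clustering.
K_matrix = np.array([        # toy symmetric similarity matrix, 2 true clusters
    [1.0, 0.9, 0.8, 0.1, 0.0],
    [0.9, 1.0, 0.7, 0.2, 0.1],
    [0.8, 0.7, 1.0, 0.0, 0.2],
    [0.1, 0.2, 0.0, 1.0, 0.9],
    [0.0, 0.1, 0.2, 0.9, 1.0],
])
D = K_matrix.max() - K_matrix           # turn similarities into dissimilarities
np.fill_diagonal(D, 0.0)                # squareform needs zero self-distance
Z = linkage(squareform(D), method="single")  # Kruskal-style merge tree
print(Z)                                # each row: the two clusters merged + distance
```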
Then you can start getting somewhere….
1. “All x more similar to all y in their own cluster than to any y' from any other cluster”
is sufficient to get a hierarchical clustering such that the target is some pruning of the tree. (Kruskal's / single-linkage works.)
2. Weaker condition: the ground truth is “stable”:
For all clusters C, C', for all A⊆C, A'⊆C': A and A' are not both more similar on average to each other than to the rest of their own clusters.
(Plus technical conditions at the boundary. View K(x,y) as the attraction between x and y.)
Sufficient to get a good tree using the average single-linkage algorithm.
Analysis for slightly simpler version
Assume for all target clusters C, C', all A⊂C, A'⊆C', we have
K(A, C−A) > K(A, A'),
where K(A,B) denotes Avg_{x∈A, y∈B}[K(x,y)], and say K is symmetric.
Algorithm: average single-linkage
• Like Kruskal's, but at each step merge the pair of clusters whose average similarity is highest.
Analysis: (all clusters made are laminar w.r.t. the target)
• Failure iff we merge C1, C2 s.t. C1 ⊂ C and C2 ∩ C = ∅.
Analysis for slightly simpler version
Assume for all target clusters C, C', all A⊂C, A'⊆C', we have
K(A, C−A) > K(A, A'),
where K(A,B) denotes Avg_{x∈A, y∈B}[K(x,y)], and say K is symmetric.
[Figure: a bad merge of C1 (inside C) with C2 (outside C), and a witness C3 inside C.]
Algorithm: average single-linkage
• Like Kruskal's, but at each step merge the pair of clusters whose average similarity is highest.
Analysis: (all clusters made are laminar w.r.t. the target)
• Failure iff we merge C1, C2 s.t. C1 ⊂ C and C2 ∩ C = ∅.
• But there must exist C3 ⊂ C at least as similar to C1 as the average, i.e., K(C1, C3) ≥ K(C1, C−C1) > K(C1, C2), so the algorithm would merge C1 with C3 instead. Contradiction.
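A minimal sketch of average single-linkage itself (details assumed, not the authors' code): start with singletons and repeatedly merge the pair of current clusters with the highest average pairwise similarity, recording the merge tree.

```python
import numpy as np

# Sketch: greedy average-linkage agglomeration on a similarity matrix.
def average_linkage_tree(K_matrix):
    """K_matrix[i][j] = similarity of points i and j. Returns the merge list."""
    clusters = [[i] for i in range(len(K_matrix))]
    merges = []

    def avg_sim(A, B):
        return np.mean([K_matrix[i][j] for i in A for j in B])

    while len(clusters) > 1:
        # Find the pair of current clusters with the highest average similarity.
        a, b = max(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: avg_sim(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[a], clusters[b]))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

# Example on a toy 5-point similarity matrix with two natural clusters.
K_matrix = np.array([
    [1.0, 0.9, 0.8, 0.1, 0.0],
    [0.9, 1.0, 0.7, 0.2, 0.1],
    [0.8, 0.7, 1.0, 0.0, 0.2],
    [0.1, 0.2, 0.0, 1.0, 0.9],
    [0.0, 0.1, 0.2, 0.9, 1.0],
])
print(average_linkage_tree(K_matrix))
```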
More sufficient properties:
3. “All x more similar to all y in their own cluster than to any y' from any other cluster”, but with added noisy data.
– Noisy data can ruin bottom-up algorithms, but we can show that a generate-and-test style algorithm works.
– Create a collection of plausible clusters.
– Use a series of pairwise tests to remove/shrink clusters until they are consistent with a tree.
More sufficient properties:
3. “All x more similar to all y in their own cluster than to any y' from any other cluster”, but with added noisy data.
4. Implicit assumptions made by the optimization approach:
“Any approximately-optimal (e.g., k-median) solution is close (in terms of how points are clustered) to the target.”
[Nina Balcan's talk on Saturday]
Can also analyze inductive setting
Assume for all C, C', all A⊂C, A'⊆C', we have
K(A, C−A) > K(A, A') + γ,
but we only see a small sample S.
Can use “regularity”-type results of [AFKK] to argue that whp, a reasonable-size S will give good estimates of all desired quantities.
Once S is hierarchically partitioned, we can insert new points as they arrive.
Like a PAC model for clustering
• A property is a relation between the target and the similarity information (data). Like a data-dependent concept class in learning.
• Given data and a similarity function K, a property induces a “concept class” C of all clusterings c such that (c, K) is consistent with the property.
• Tree model: we want a tree T s.t. the set of prunings of T forms an ε-cover of C.
• In the inductive model, we want this with probability 1−δ.
Summary (part II)
• Exploring the question: what does an algorithm need in order to cluster well?
• What natural properties allow a similarity measure to be useful for clustering?
– To get a good theory, it helps to relax what we mean by “useful for clustering”.
– The user can then decide how specific they want to be in each part of the domain.
• Analyze a number of natural properties and prove guarantees on algorithms able to use them.
Wrap-up
• A tour through learning and clustering by similarity functions.
– A user with some knowledge of the problem domain comes up with a pairwise similarity measure K(x,y) that makes sense for the given problem.
– The algorithm uses this (together with labeled data in the case of learning) to find a good solution.
• Goals of a theory:
– Give guidance to the similarity-function designer (what properties to shoot for?).
– Understand what properties are sufficient for learning/clustering, and by what algorithms.
• For learning, we get a theory of kernels without the need for “implicit spaces”.
• For clustering, this “reverses” the usual view. It suggests giving the algorithm some slack (tree vs. partitioning).
• A lot of interesting questions are still open in these areas.