Efficient classification for metric data

Lee-Ad Gottlieb (Hebrew U.)
Aryeh Kontorovich (Ben Gurion U.)
Robert Krauthgamer (Weizmann Institute)
Classification problem

A fundamental problem in learning:

Point space X
Probability distribution P on X × {−1,+1}
Learner observes a sample S of n points (x,y) drawn i.i.d. from P
Wants to predict the labels of other points in X
Produces a hypothesis h: X → {−1,+1} with
empirical error err_S(h) = (1/n) |{(x,y) in S : h(x) ≠ y}|
and true error err(h) = P{(x,y) : h(x) ≠ y}

Goal: true error close to empirical error, uniformly over h, in probability
Generalization bounds

How do we upper bound the true error?

Use a generalization bound. Roughly speaking (and with high probability),
true error ≤ empirical error + (complexity of h)/n

More complex classifier ↔ "easier" to fit arbitrary data

VC-dimension: the size of the largest point set
that can be shattered by the hypothesis class
Popular approach for classification

Assume the points are in Euclidean space!

Pros:
Existence of an inner product
Efficient algorithms (SVM)
Good generalization bounds (max margin)

Cons:
Many natural settings are non-Euclidean
Euclidean structure is a strong assumption

Recent popular focus: metric space data
Metric space

(X,d) is a metric space if
X = a set of points
d(·,·) = a distance function that is
nonnegative,
symmetric,
and satisfies the triangle inequality

[Figure: road distances between Haifa, Tel Aviv, and Beer Sheva (95 km, 113 km, 208 km) illustrate a metric]

inner product → norm
norm → metric
but the reverse implications do not hold
Classification for metric data?

Advantage: often much more natural
a much weaker assumption
strings (e.g. under edit distance; see the sketch below)
images (earthmover distance)

Problem: no vector representation
no notion of dot product (and hence no kernel)

What to do?
Invent a kernel (e.g. embed into Euclidean space)? Possibly high distortion!
Use some NN heuristic? The NN classifier has infinite VC-dimension!
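To make the "strings" example concrete, here is a minimal sketch (mine, not from the talk) of the Levenshtein edit distance, a metric on strings that is nonnegative, symmetric, and satisfies the triangle inequality, yet comes with no natural vector representation:

def edit_distance(s, t):
    """Levenshtein distance: minimum number of insertions,
    deletions, and substitutions turning s into t."""
    m, n = len(s), len(t)
    # dp[j] holds the distance between the current prefix of s and t[:j]
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = min(dp[j] + 1,                          # delete s[i-1]
                      dp[j - 1] + 1,                      # insert t[j-1]
                      prev + (s[i - 1] != t[j - 1]))      # substitute (or match)
            prev, dp[j] = dp[j], cur
    return dp[n]

# example: edit_distance("kitten", "sitting") = 3
print(edit_distance("kitten", "sitting"))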
Preliminaries: Lipschitz constant

The Lipschitz constant L of a function f: X → R measures its smoothness.

It is the smallest value L satisfying |f(xi) − f(xj)| ≤ L·d(xi,xj) for all points xi, xj in X.
It is denoted ‖f‖Lip.

Suppose a hypothesis h: S → {−1,+1} is consistent with the sample S.
The Lipschitz constant of h is determined by the closest pair of differently labeled points;
equivalently, it is at least 2/d(S+,S−). (See the sketch below.)
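A minimal sketch (my illustration, assuming a user-supplied metric dist) of this quantity for a labeled sample:

def lipschitz_constant_of_labeling(points, labels, dist):
    """Smallest Lipschitz constant of the labeling h(x_i) = y_i in {-1,+1}:
    equal to 2 / d(S+, S-), where d(S+, S-) is the distance between the
    closest pair of oppositely labeled sample points."""
    pos = [p for p, y in zip(points, labels) if y == +1]
    neg = [p for p, y in zip(points, labels) if y == -1]
    d_opposite = min(dist(p, q) for p in pos for q in neg)  # d(S+, S-)
    return 2.0 / d_opposite

# toy usage on the real line
print(lipschitz_constant_of_labeling([0.0, 1.0, 3.0], [-1, -1, +1],
                                     lambda a, b: abs(a - b)))  # prints 1.0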
Preliminaries: Lipschitz extension

Lipschitz extension: a classic problem in analysis.
Given a function f: S → R for S ⊆ X, extend f to all of X without
increasing the Lipschitz constant.

Example: points on the real line with
f(1) = 1
f(−1) = −1

(figure credit: A. Oberman)
Classification for metric data

A powerful framework for metric classification was introduced
by von Luxburg & Bousquet (vLB, JMLR '04).

Construction of h on S: the natural hypotheses (classifiers) to consider
are maximally smooth Lipschitz functions.

Evaluation of h on X: the problem of evaluating h at new points of X
reduces to finding a Lipschitz function consistent with h on S,
i.e. the Lipschitz extension problem.

For example,
f(x) = mini [f(xi) + 2d(x, xi)/d(S+,S−)]
over all (xi,yi) in S.

Evaluation of h reduces to exact nearest neighbor search,
a strong theoretical motivation for the NNS classification heuristic.
(See the sketch below.)
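Here is a minimal brute-force sketch (mine, assuming a user-supplied metric dist) of this extension-based classifier; it is meant only to illustrate the formula and its reduction to two nearest-neighbor queries, not the efficient data structures discussed later:

def train_lipschitz_classifier(points, labels, dist):
    """Precompute d(S+, S-), the distance between the two classes."""
    pos = [p for p, y in zip(points, labels) if y == +1]
    neg = [p for p, y in zip(points, labels) if y == -1]
    d_star = min(dist(p, q) for p in pos for q in neg)
    return pos, neg, d_star

def classify(x, model, dist):
    """Evaluate sgn(f(x)) with f(x) = min_i [y_i + 2 d(x, x_i) / d(S+, S-)].
    The minimum splits by label, so only the nearest positive and the
    nearest negative sample points matter (two nearest-neighbor queries)."""
    pos, neg, d_star = model
    f_pos = +1 + 2 * min(dist(x, p) for p in pos) / d_star
    f_neg = -1 + 2 * min(dist(x, q) for q in neg) / d_star
    f = min(f_pos, f_neg)
    return +1 if f >= 0 else -1

# toy usage on the real line
pts, ys = [0.0, 1.0, 3.0, 4.0], [-1, -1, +1, +1]
model = train_lipschitz_classifier(pts, ys, lambda a, b: abs(a - b))
print(classify(2.4, model, lambda a, b: abs(a - b)))  # prints 1 (label +1)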
Two new directions

The framework of [vLB '04] leaves open two further questions:

Constructing h: handling noise
Bias-variance tradeoff:
which sample points in S should h ignore?

Evaluating h on X:
in an arbitrary metric space, exact NNS requires Θ(n) time.
Can we do better?
Doubling dimension

Definition: the ball B(x,r) = all points within distance r of x.
The doubling constant of a metric M is the minimum value λ > 0
such that every ball can be covered by λ balls of half the radius.

First used by [Assouad '83], algorithmically by [Clarkson '97].
The doubling dimension is ddim(M) = log2 λ(M).
A metric is doubling if its doubling dimension is constant.
Euclidean: ddim(Rd) = O(d).

Packing property of doubling spaces:
a set with diameter diam and minimum
inter-point distance a contains at most
(diam/a)^O(ddim) points.

[Figure: covering a ball by balls of half the radius; here λ ≥ 7.]
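A rough empirical illustration (my sketch, not from the talk, assuming a user-supplied metric dist): for a finite point set, a greedy maximal (r/2)-separated subset of a ball B(x,r) covers the ball with half-radius balls, so its size gives a crude upper estimate of the doubling constant at that scale.

def half_radius_cover_size(points, center, r, dist):
    """Greedily pick a maximal (r/2)-separated subset of B(center, r);
    by maximality its (r/2)-balls cover B(center, r), so its size upper
    bounds the number of half-radius balls needed for this ball."""
    ball = [p for p in points if dist(p, center) <= r]
    centers = []
    for p in ball:
        if all(dist(p, c) > r / 2 for c in centers):
            centers.append(p)
    return len(centers)

def doubling_constant_estimate(points, radii, dist):
    """Crude upper estimate: max, over all centers and the probed radii,
    of the greedy half-radius cover size."""
    return max(half_radius_cover_size(points, x, r, dist)
               for x in points for r in radii)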
Applications of doubling dimension

Major application to databases:
database/network structures and tasks analyzed via the doubling dimension.

Nearest neighbor search structures [KL '04, HM '06, BKL '06, CG '06]:
recall that exact NNS requires Θ(n) time in an arbitrary metric space,
but there exists a linear-size structure that supports approximate
nearest neighbor search in 2^O(ddim) log n time.
Image recognition (vision) [KG --]
Spanner construction [GGN '06, CG '06, DPP '06, GR '08a, GR '08b]
Distance oracles [Tal '04, Sli '05, HM '06, BGRKL '11]
Clustering [Tal '04, ABS '08, FM '10]
Routing [KSW '04, Sli '05, AGGM '06, KRXY '07, KRX '08]

Further applications:
Travelling Salesperson [Tal '04]
Embeddings [Ass '84, ABN '08, BRS '07, GK '11]
Machine learning [BLL '09, KKL '10, KKL --]

Note: the above algorithms can be extended to nearly-doubling spaces [GK '10].

Message: this is an active line of research…
Our dual use of doubling dimension

Interestingly, the doubling dimension contributes in two different areas:

Statistical: function complexity.
We bound the complexity of the hypothesis in terms of the doubling
dimension of X and the Lipschitz constant of the classifier h.

Computational: efficient approximate NNS.
Statistical contribution

We provide generalization bounds for Lipschitz functions on
spaces with low doubling dimension.

vLB provided similar bounds using covering numbers and Rademacher averages.

Fat-shattering analysis:
if L-Lipschitz functions shatter a set, then its
inter-point distances are at least 2/L;
by the packing property, such a set has at most (diam·L)^O(ddim) points.
This is the fat-shattering dimension
of the classifier class on the space, and is
a good measure of its complexity.
Statistical contribution

[BST '99]:

For any f that classifies a sample of size n correctly, we have with
probability at least 1−δ
P{(x, y) : sgn(f(x)) ≠ y} ≤ (2/n) (d log(34en/d) log2(578n) + log(4/δ)).

Likewise, if f is correct on all but k examples, we have with probability at
least 1−δ
P{(x, y) : sgn(f(x)) ≠ y} ≤ k/n + [(2/n) (d ln(34en/d) log2(578n) + ln(4/δ))]^(1/2).

In both cases, d is bounded by the fat-shattering dimension,
d ≤ (diam·L)^ddim + 1.

Done with the statistical contribution… on to the computational contribution.
(A numerical sketch of these bounds follows below.)
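A small numerical sketch (mine) that simply plugs values into the two bounds above; the parameters diam, L, ddim, n, k, and δ below are illustrative, not from the talk, and the unspecified "log" in the first bound is taken here as natural log.

from math import log, log2, e, sqrt

def fat_shattering_dim(diam, L, ddim):
    # d ≤ (diam * L)^ddim + 1
    return (diam * L) ** ddim + 1

def bound_consistent(n, d, delta):
    # P{sgn(f(x)) != y} ≤ (2/n)(d·log(34en/d)·log2(578n) + log(4/δ))
    return (2.0 / n) * (d * log(34 * e * n / d) * log2(578 * n) + log(4 / delta))

def bound_with_errors(n, k, d, delta):
    # P{sgn(f(x)) != y} ≤ k/n + sqrt((2/n)(d·ln(34en/d)·log2(578n) + ln(4/δ)))
    return k / n + sqrt((2.0 / n) * (d * log(34 * e * n / d) * log2(578 * n)
                                     + log(4 / delta)))

# illustrative numbers (not from the talk)
d = fat_shattering_dim(diam=1.0, L=10.0, ddim=2)
print(bound_consistent(n=10**6, d=d, delta=0.05))
print(bound_with_errors(n=10**6, k=100, d=d, delta=0.05))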
Computational contribution

Evaluation of h for new points in X:
the Lipschitz extension function
f(x) = mini [yi + 2d(x, xi)/d(S+,S−)]
requires exact nearest neighbor search, which can be expensive!

New tool: (1+ε)-approximate nearest neighbor search,
in 2^O(ddim) log n + O(ε^−ddim) time
[KL '04, HM '06, BKL '06, CG '06].

If we evaluate f(x) using an approximate NNS, we can show that the
result agrees with (the sign of) at least one of
g(x) = (1+ε) f(x) + ε
e(x) = (1+ε) f(x) − ε.
Note that g(x) ≥ f(x) ≥ e(x).

g(x) and e(x) have Lipschitz constant (1+ε)L, so they, and hence the
approximately evaluated function, generalize well.
Final problem: bias-variance tradeoff

Which sample points in S should h ignore?

If f is correct on all but k examples, we have with probability at least 1−δ
P{(x, y) : sgn(f(x)) ≠ y} ≤ k/n + [(2/n) (d ln(34en/d) log2(578n) + ln(4/δ))]^(1/2),
where d ≤ (diam·L)^ddim + 1.
Structural Risk Minimization

Algorithm:
Fix a target Lipschitz constant L.
Locate all pairs of points from S+ and S− whose distance is less than 2/L
(no L-Lipschitz function can fit both endpoints of such a pair): O(n^2) candidate pairs.
At least one point of each such pair has to be taken as an error.
Goal: remove as few points as possible.

This is a minimum vertex cover problem:
NP-complete in general,
but it admits a 2-approximation in O(E) time.

Here the conflict graph is bipartite (all edges go between S+ and S−), and
minimum vertex cover on a bipartite graph is
equivalent to maximum matching (Kőnig's theorem),
which admits an exact solution in O(n^2.376) randomized time [MS '04].
(A matching sketch follows below.)
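A minimal sketch (mine) of this step: build the bipartite conflict graph for a candidate L and compute the minimum number of points to discard as the size of a maximum matching, which equals the minimum vertex cover by Kőnig's theorem. Simple augmenting-path matching is used here instead of the faster randomized algorithm cited above.

def min_points_to_discard(pos, neg, dist, L):
    """Conflict edges join oppositely labeled points at distance < 2/L.
    The minimum number of points to remove equals the minimum vertex cover
    of this bipartite graph, which equals the maximum matching size (Kőnig)."""
    # adjacency: for each positive point, the indices of conflicting negatives
    adj = [[j for j, q in enumerate(neg) if dist(p, q) < 2.0 / L]
           for p in pos]
    match_of_neg = [-1] * len(neg)  # which positive each negative is matched to

    def augment(i, seen):
        for j in adj[i]:
            if not seen[j]:
                seen[j] = True
                if match_of_neg[j] == -1 or augment(match_of_neg[j], seen):
                    match_of_neg[j] = i
                    return True
        return False

    matching = 0
    for i in range(len(pos)):
        if augment(i, [False] * len(neg)):
            matching += 1
    return matching

This simple version runs in O(V·E) time; the O(n^2.376) figure on the slide refers to the algebraic matching algorithm of [MS '04].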
Efficient SRM

Algorithm:
For each of the O(n^2) candidate values of L
(one candidate per distance between an oppositely labeled pair of points):
run the matching algorithm to find the minimum error,
and evaluate the generalization bound for this value of L.
Total: O(n^4.376) randomized time.

Better algorithm:
Binary search over the O(n^2) candidate values of L.
For each value:
run the greedy 2-approximation,
approximating the minimum error in O(n^2 log n) time,
and evaluate the approximate generalization bound for this value of L.
(A sketch follows below.)
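A sketch (mine) of a simplified variant: it scans all candidate values of L instead of binary-searching, uses the greedy 2-approximate vertex cover to estimate the number of errors k for each candidate, and returns the candidate minimizing a user-supplied bound(k, L), such as the generalization bound sketched earlier.

def greedy_vertex_cover_size(edges):
    """Classic 2-approximation: take both endpoints of an arbitrary
    uncovered edge until no edges remain. Returns the cover size,
    at most twice the minimum number of errors."""
    covered = set()
    size = 0
    for u, v in edges:
        if u not in covered and v not in covered:
            covered.update([u, v])
            size += 2
    return size

def select_lipschitz_constant(pos, neg, dist, bound):
    """Try each candidate L (one per opposite-label pairwise distance) and
    return the candidate minimizing bound(k, L), where k is the approximate
    number of sample points that must be treated as errors."""
    dists = sorted({dist(p, q) for p in pos for q in neg})
    best = None
    for d in dists:
        L = 2.0 / d
        edges = [(('+', i), ('-', j))
                 for i, p in enumerate(pos) for j, q in enumerate(neg)
                 if dist(p, q) < d]  # i.e. distance < 2/L
        k = greedy_vertex_cover_size(edges)
        value = bound(k, L)
        if best is None or value < best[0]:
            best = (value, L, k)
    return best

Scanning all candidates is slower than the binary search described on the slide; it is shown here only because it is simpler.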
Conclusion

Results:
Generalization bounds for Lipschitz classifiers in doubling spaces
Efficient evaluation of the Lipschitz extension hypothesis using approximate NNS
Efficient Structural Risk Minimization

Continuing research: continuous labels
Risk bound via the doubling dimension
Classifier h determined via an LP
Faster LP via low-hop, low-stretch spanners [GR '08a, GR '08b]: fewer
constraints, and each variable appears in a bounded number of constraints
Application: earthmover distance

[Figure: earthmover distance between two point sets S and T]
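A hedged illustration (mine, not from the talk): for two point sets of equal size with uniform weights and, for illustration, Euclidean ground distance, the earthmover distance reduces to a minimum-cost perfect matching, which scipy.optimize.linear_sum_assignment solves directly.

import numpy as np
from scipy.optimize import linear_sum_assignment

def earthmover_uniform(S, T):
    """Earthmover (1-Wasserstein) distance between two equal-size point
    sets with uniform weights: average cost of the optimal assignment."""
    S, T = np.asarray(S, dtype=float), np.asarray(T, dtype=float)
    assert len(S) == len(T)
    cost = np.linalg.norm(S[:, None, :] - T[None, :, :], axis=-1)  # pairwise distances
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean()

# toy example: two small planar point sets
print(earthmover_uniform([[0, 0], [1, 0]], [[0, 1], [1, 1]]))  # prints 1.0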