Efficient classification for metric data
Lee-Ad Gottlieb (Hebrew U.), Aryeh Kontorovich (Ben Gurion U.), Robert Krauthgamer (Weizmann Institute)
Classification problem
A fundamental problem in learning:
Point space X
Probability distribution P on X × {−1,1}
Learner observes a sample S of n points (x,y) drawn i.i.d. from P
Wants to predict labels of other points in X
Produces hypothesis h: X → {−1,1} with
empirical error = the fraction of sample points (x,y) in S with h(x) ≠ y,
and true error = P{(x,y) : h(x) ≠ y}
Goal: empirical error → true error, uniformly over h, in probability
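As a concrete illustration of these two quantities (toy data invented for this example, not from the slides), the empirical error is simply the fraction of sample points the hypothesis mislabels:

```python
# Toy illustration: empirical error of a hypothesis h on a labeled sample S,
# i.e. the fraction of sample points that h mislabels.
def empirical_error(h, S):
    return sum(1 for (x, y) in S if h(x) != y) / len(S)

# Example: a threshold classifier on the real line, with one "noisy" point.
h = lambda x: 1 if x >= 0 else -1
S = [(-2.0, -1), (-0.5, -1), (0.3, 1), (1.7, 1), (-0.1, 1)]
print(empirical_error(h, S))  # 0.2 (the point -0.1 is mislabeled)
```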
Generalization bounds
How do we upper bound the true error?
Use a generalization bound. Roughly speaking (and whp)
true error ≤ empirical error + (complexity of h)/n
More complex classifier ↔ “easier” to fit to arbitrary data
VC-dimension: the size of the largest point set that can be shattered by the hypothesis class
Popular approach for classification
Assume the points are in Euclidean space!
Pros: existence of an inner product; efficient algorithms (SVM); good generalization bounds (max margin)
Cons: many natural settings are non-Euclidean; Euclidean structure is a strong assumption
Recent popular focus: metric space data
Metric space
(X,d) is a metric space if
X = a set of points
d(·,·) = a distance function that is nonnegative, symmetric, and satisfies the triangle inequality
(Map example: Haifa to Tel Aviv 95 km, Tel Aviv to Beer Sheva 113 km, Haifa to Beer Sheva 208 km ≤ 95 + 113.)
inner product → norm → metric, but the reverse implications don't hold
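As a small sanity check (an illustrative sketch, using the road distances read off the map example above), one can verify that a finite distance matrix satisfies these axioms:

```python
# Sketch: check that a finite distance matrix D satisfies the metric axioms.
def is_metric(D):
    n = len(D)
    for i in range(n):
        if D[i][i] != 0:
            return False
        for j in range(n):
            if D[i][j] < 0 or D[i][j] != D[j][i]:   # nonnegative, symmetric
                return False
            for k in range(n):
                if D[i][j] > D[i][k] + D[k][j]:     # triangle inequality
                    return False
    return True

# Road distances (km) between Haifa, Tel Aviv, Beer Sheva (from the map example).
D = [[0, 95, 208],
     [95, 0, 113],
     [208, 113, 0]]
print(is_metric(D))  # True: in particular 208 <= 95 + 113
```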
Classification for metric data?
Advantage: often much more natural, and a much weaker assumption
Examples: strings, images (earthmover distance)
Problem: no vector representation
No notion of dot-product (and no kernel)
What to do?
Invent a kernel (e.g., embed into Euclidean space)? Possibly high distortion!
Use some nearest-neighbor heuristic? The NN classifier has infinite VC-dimension!
Preliminaries: Lipschitz constant
The Lipschitz constant L of a function f: X → R measures its smoothness.
It is the smallest value L satisfying |f(xi) − f(xj)| ≤ L·d(xi, xj) for all points xi, xj in X.
Denoted by ‖f‖_Lip.
Suppose a hypothesis h: S → {−1,1} is consistent with the sample S.
The Lipschitz constant of h is determined by the closest pair of differently labeled points; equivalently, it is 2/d(S+, S−).
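A minimal sketch of this computation (generic distance function d, helper names invented here): the Lipschitz constant of a consistent ±1 labeling is 2 divided by the smallest distance between oppositely labeled sample points.

```python
# Sketch: Lipschitz constant of a consistent {-1,+1} labeling of a sample,
# namely 2 / d(S+, S-), where d(S+, S-) is the closest opposite-label distance.
def lipschitz_constant(S_plus, S_minus, d):
    d_min = min(d(p, q) for p in S_plus for q in S_minus)  # d(S+, S-)
    return 2.0 / d_min

# Example on the real line.
d = lambda a, b: abs(a - b)
print(lipschitz_constant([1.0, 2.0], [-1.0, -0.5], d))  # 2 / 1.5 ≈ 1.33
```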
Preliminaries: Lipschitz extension
Lipschitz extension: a classic problem in analysis.
Given a function f: S → R for S ⊆ X, extend f to all of X without increasing the Lipschitz constant.
Example: points on the real line with f(1) = 1 and f(−1) = −1.
credit: A. Oberman
Classification for metric data
A powerful framework for metric classification was introduced by von Luxburg & Bousquet (vLB, JMLR '04).
Construction of h on S: the natural hypotheses (classifiers) to consider are maximally smooth Lipschitz functions.
Estimation of h on X: evaluating h at new points of X reduces to finding a Lipschitz function consistent with h, the Lipschitz extension problem.
For example, f(x) = min over (xi, yi) in S of [f(xi) + 2d(x, xi)/d(S+, S−)].
Evaluation of h reduces to exact Nearest Neighbor Search
Strong theoretical motivation for the NNS classification heuristic
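A brute-force sketch of this construction (helper names invented here; a real implementation would replace the linear scan with the nearest neighbor search machinery discussed below):

```python
# Sketch: the Lipschitz extension classifier
#   f(x) = min_i [ y_i + 2 d(x, x_i) / d(S+, S-) ],
# evaluated by a linear scan over the sample, and thresholded at 0.
def lipschitz_extension_classifier(S, d):
    S_plus = [x for (x, y) in S if y == +1]
    S_minus = [x for (x, y) in S if y == -1]
    d_min = min(d(p, q) for p in S_plus for q in S_minus)  # d(S+, S-)

    def f(x):
        return min(y + 2.0 * d(x, xi) / d_min for (xi, y) in S)

    return lambda x: 1 if f(x) >= 0 else -1

# Example on the real line.
d = lambda a, b: abs(a - b)
S = [(-2.0, -1), (-1.0, -1), (1.0, +1), (2.0, +1)]
h = lipschitz_extension_classifier(S, d)
print(h(0.4), h(-0.4))  # 1 -1
```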
Two new directions
The framework of [vLB ‘04] leaves open two further questions:
Constructing h: handling noise
Bias-Variance tradeoff
Which sample points in S should h ignore?
Evaluating h on X: in an arbitrary metric space, exact NNS requires Θ(n) time (e.g., a query q at distance ≈1 from every sample point). Can we do better?
Doubling dimension
Definition: the ball B(x,r) = all points within distance r of x.
The doubling constant λ(M) of a metric M is the minimum value λ such that every ball can be covered by λ balls of half the radius.
First used by [Assouad '83], algorithmically by [Clarkson '97].
The doubling dimension is ddim(M) = log₂ λ(M).
A metric is doubling if its doubling dimension is constant.
Euclidean: ddim(R^d) = O(d).
Packing property of doubling spaces: a set with diameter diam and minimum inter-point distance a contains at most (diam/a)^O(ddim) points.
(In the illustration, λ ≥ 7.)
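To make the definition concrete, here is a rough brute-force sketch (illustrative only, not one of the data structures cited below) that upper-bounds the doubling constant of a finite point set by greedily covering every ball with half-radius balls centered at data points:

```python
import math

# Rough sketch: greedy upper bound on the doubling constant of a finite set.
# For every ball B(x, r), cover it greedily with balls of radius r/2 centered
# at data points, and record the largest number of such balls ever needed.
def doubling_constant_estimate(points, d):
    lam = 1
    for x in points:
        for r in sorted({d(x, p) for p in points if d(x, p) > 0}):
            uncovered = [p for p in points if d(x, p) <= r]   # the ball B(x, r)
            covers = 0
            while uncovered:
                c = uncovered[0]                               # new half-radius center
                uncovered = [p for p in uncovered if d(c, p) > r / 2.0]
                covers += 1
            lam = max(lam, covers)
    return lam

# Example: points on a line; the doubling dimension estimate is log2(lambda).
pts = [float(i) for i in range(8)]
d = lambda a, b: abs(a - b)
lam = doubling_constant_estimate(pts, d)
print(lam, math.log2(lam))
```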
Applications of doubling dimension
A major application area is databases: database/network structures and tasks are analyzed via the doubling dimension.
Recall that exact NNS requires Θ(n) time in an arbitrary metric space.
There exists a linear-size structure that supports approximate nearest neighbor search in time 2^O(ddim) log n.
Nearest neighbor search structure [KL ‘04, HM ’06, BKL ’06, CG ‘06]
Image recognition (Vision) [KG --]
Spanner construction [GGN ‘06, CG ’06, DPP ‘06, GR ‘08a, GR ‘08b]
Distance oracles [Tal ’04, Sli ’05, HM ’06, BGRKL ‘11]
Clustering [Tal ‘04, ABS ‘08, FM ‘10]
Routing [KSW ‘04, Sli ‘05, AGGM ‘06, KRXY ‘07, KRX ‘08]
Further applications
Travelling Salesperson [Tal ‘04]
Embeddings [Ass ‘84, ABN ‘08, BRS ‘07, GK ‘11]
Machine learning [BLL ‘09, KKL ‘10, KKL --]
Note: Above algorithms can be extended to nearly-doubling spaces [GK ‘10]
Message: This is an active line of research…
Our dual use of doubling dimension
Interestingly, the doubling dimension contributes in two different areas:
Statistical: Function complexity
We bound the complexity of the hypothesis in terms of the doubling
dimension of X and the Lipschitz constant of the classifier h
Computational: efficient approximate NNS
Statistical contribution
We provide generalization bounds for Lipschitz functions on
spaces with low doubling dimension
vLB provided similar bounds using covering numbers and Rademacher
averages
Fat-shattering analysis:
If L-Lipschitz functions shatter a set, then its inter-point distances are at least 2/L.
By the packing property, such a set has at most (diam·L)^O(ddim) points.
This is the fat-shattering dimension of the classifier on the space, and is a good measure of its complexity.
Statistical contribution
[BST ‘99]:
For any f that classifies a sample of size n correctly, we have with probability at least 1−δ:
P{(x, y) : sgn(f(x)) ≠ y} ≤ (2/n) (d log(34en/d) log₂(578n) + log(4/δ)).
Likewise, if f is correct on all but k examples, we have with probability at least 1−δ:
P{(x, y) : sgn(f(x)) ≠ y} ≤ k/n + [(2/n) (d ln(34en/d) log₂(578n) + ln(4/δ))]^(1/2).
In both cases, d is bounded by the fat-shattering dimension: d ≤ (diam·L)^ddim + 1.
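As a worked illustration (all numbers below are invented; only the two formulas are taken from this slide and the previous one), one can plug the fat-shattering estimate into the second bound:

```python
import math

# d <= (diam * L)^ddim + 1: the fat-shattering estimate from the previous slide.
def fat_shattering_dim(diam, L, ddim):
    return (diam * L) ** ddim + 1

# The [BST '99]-style bound: empirical error k/n plus a complexity term.
def risk_bound(n, k, d, delta):
    complexity = (2.0 / n) * (d * math.log(34 * math.e * n / d) * math.log2(578 * n)
                              + math.log(4.0 / delta))
    return k / n + math.sqrt(complexity)

# Illustrative numbers only: diam = 1, L = 10, ddim = 3, n = 10^7 samples,
# 1% sample errors, confidence parameter delta = 0.05.
d = fat_shattering_dim(diam=1.0, L=10.0, ddim=3)
print(risk_bound(n=10_000_000, k=100_000, d=d, delta=0.05))
```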
Done with the statistical contribution … On to the
computational contribution.
Computational contribution
Evaluation of h for new points in X:
Lipschitz extension function f(x) = min over (xi, yi) in S of [yi + 2d(x, xi)/d(S+, S−)].
Requires exact nearest neighbor search, which can be expensive!
New tool: (1+ε)-approximate nearest neighbor search, in time 2^O(ddim) log n + ε^−O(ddim) [KL '04, HM '06, BKL '06, CG '06].
If we evaluate f(x) using approximate NNS, we can show that the result agrees with (the sign of) at least one of
g(x) = (1+ε) f(x) + ε
e(x) = (1+ε) f(x) − ε.
Note that g(x) ≥ f(x) ≥ e(x), and g and e differ by 2ε.
g(x) and e(x) have Lipschitz constant (1+ε)L, so they, and the approximate function, generalize well.
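A minimal sketch of the evaluation step (helper names invented; `approx_dist` stands in for the (1+ε)-approximate NNS structures cited above, assumed to return a distance within a (1+ε) factor of the true nearest-neighbor distance):

```python
# Sketch: classify a new point using (1+eps)-approximate nearest-neighbor
# distances to S+ and S-.  With exact distances this computes
#   f(x) = min( +1 + 2 d(x,S+)/d(S+,S-),  -1 + 2 d(x,S-)/d(S+,S-) );
# with approximate distances, the slide argues the resulting sign agrees with
# that of g(x) = (1+eps) f(x) + eps or of e(x) = (1+eps) f(x) - eps.
def classify_with_approx_nns(x, S_plus, S_minus, d, approx_dist):
    d_min = min(d(p, q) for p in S_plus for q in S_minus)   # d(S+, S-)
    dist_plus = approx_dist(x, S_plus)    # within (1+eps) of d(x, S+)
    dist_minus = approx_dist(x, S_minus)  # within (1+eps) of d(x, S-)
    f_approx = min(+1 + 2.0 * dist_plus / d_min,
                   -1 + 2.0 * dist_minus / d_min)
    return 1 if f_approx >= 0 else -1
```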
Final problem: the bias-variance tradeoff
Which sample points in S should h ignore?
If f is correct on all but k examples, we have with probability at least 1−δ:
P{(x, y) : sgn(f(x)) ≠ y} ≤ k/n + [(2/n) (d ln(34en/d) log₂(578n) + ln(4/δ))]^(1/2),
where d ≤ (diam·L)^ddim + 1.
Structural Risk Minimization
Algorithm:
Fix a target Lipschitz constant L.
Locate all pairs of points from S+ and S− whose distance is less than 2/L: O(n²) possible pairs.
At least one point of each such pair must be counted as an error.
Goal: remove as few points as possible.
This is minimum vertex cover: NP-complete in general, but it admits a 2-approximation in O(|E|) time.
Here the graph is bipartite, so minimum vertex cover is equivalent to maximum matching (König's theorem) and admits an exact solution in O(n^2.376) randomized time [MS '04].
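A self-contained sketch of this step (simple augmenting-path matching, O(V·E), rather than the faster randomized algorithm cited): by König's theorem, the maximum matching size in the bipartite conflict graph equals the minimum number of sample points that must be declared errors.

```python
# Sketch: minimum number of errors needed to make the sample L-Lipschitz
# realizable.  Conflict graph: edges between p in S+ and q in S- with
# d(p, q) < 2/L.  Maximum matching size = minimum vertex cover size (Konig).
def min_errors(S_plus, S_minus, d, L):
    adj = [[j for j, q in enumerate(S_minus) if d(p, q) < 2.0 / L]
           for p in S_plus]
    match_of = [-1] * len(S_minus)   # S+ index currently matched to each S- point

    def augment(i, seen):            # try to match S+ point i via an augmenting path
        for j in adj[i]:
            if j not in seen:
                seen.add(j)
                if match_of[j] == -1 or augment(match_of[j], seen):
                    match_of[j] = i
                    return True
        return False

    return sum(augment(i, set()) for i in range(len(S_plus)))
```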
Efficient SRM
Algorithm:
For each of the O(n²) candidate values of L:
run the matching algorithm to find the minimum number of errors,
and evaluate the generalization bound for this value of L.
Total: O(n^4.376) randomized time.
Better algorithm:
Binary search over the O(n²) candidate values of L; for each value,
run the greedy 2-approximation to approximate the minimum number of errors (O(n² log n) time overall),
and evaluate the approximate generalization bound for this value of L (see the sketch below).
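A sketch of the overall search, reusing `min_errors`, `fat_shattering_dim`, and `risk_bound` from the earlier sketches (`diam` and `ddim` are assumed known or estimated); the faster variant from this slide would replace the exhaustive loop with a binary search and `min_errors` with the greedy 2-approximation:

```python
# Sketch of SRM over the O(n^2) candidate Lipschitz constants, each of the
# form 2/d(p, q) for some oppositely labeled pair of sample points.
def srm(S_plus, S_minus, d, diam, ddim, delta=0.05):
    n = len(S_plus) + len(S_minus)
    candidates = sorted({2.0 / d(p, q) for p in S_plus for q in S_minus})
    best = None
    for L in candidates:                          # exhaustive; the slides binary-search
        k = min_errors(S_plus, S_minus, d, L)     # or the greedy 2-approximation
        dim = fat_shattering_dim(diam, L, ddim)
        bound = risk_bound(n, k, dim, delta)
        if best is None or bound < best[0]:
            best = (bound, L, k)
    return best   # (bound value, chosen Lipschitz constant L, sample errors k)
```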
Conclusion
Results:
Generalization bounds for Lipschitz classifiers in doubling spaces
Efficient evaluation of the Lipschitz extension hypothesis using
approximate NNS
Efficient Structural Risk Minimization
Continuing research: Continuous labels
Risk bound via the doubling dimension
Classifier h determined via an LP
Faster LP via low-hop, low-stretch spanners [GR '08a, GR '08b]: fewer constraints, and each variable appears in a bounded number of constraints.
Application: earthmover distance
(Figure: the earthmover distance between point sets S and T.)