Data reduction for weighted and outlier-resistant clustering
Leonard J. Schulman, Caltech
joint with
Dan Feldman MIT
Talk outline
• Clustering-type problems:
  – k-median
  – weighted k-median
  – k-median with m outliers (small m)
  – k-median with penalty (clustering with many outliers)
  – k-line median
• Unifying framework: tame loss functions
• Core-sets, a.k.a. ε-approximations
• Common existence proof and algorithm
Voronoi regions have spherical boundaries
k-Median with penalty
k-Median with penalty: good for outliers
[Figures: 2-median clustering of a data set; the same data set plus an outlier; the same data clustered with the h-robust loss function.]
Related work and our results

[Table, flattened in the transcript. Columns: Problem, Approx., Time, Reference. Each prior result (Charikar et al., SODA'01; K. Chen, SODA'08; Har-Peled, FSTTCS'06; F[eldman], Fiat, Sharir, FOCS'06) is paired with a row for our result.]
Why are all these problems in the same paper?
In each case the objective function is a suitably tame “loss function”.
The loss in representing a point p by a center c is:
• k-median: D(p) = dist(p, c)
• Weighted k-median: D(p) = w · dist(p, c)
• Robust k-median: D(p) = min{h, dist(p, c)}
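For concreteness, a minimal Python sketch of these three losses (Euclidean distance; w and h as on the slides; an illustration, not code from the talk):

```python
import math

def dist(p, c):
    """Euclidean distance between point p and center c (sequences of coordinates)."""
    return math.sqrt(sum((pi - ci) ** 2 for pi, ci in zip(p, c)))

def loss_kmedian(p, c):
    return dist(p, c)

def loss_weighted_kmedian(p, c, w):
    return w * dist(p, c)

def loss_robust_kmedian(p, c, h):
    # A point farther than h from its center pays only h,
    # so outliers cannot dominate the objective.
    return min(h, dist(p, c))
```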
What qualifies as a “tame” loss function?
Log-Log Lipschitz (LgLgLp) condition on the loss function
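Roughly: D is LgLgLp with constant ρ if D is nondecreasing and z ↦ log D(e^z) is a ρ-Lipschitz function of z; equivalently, D(r·x) ≤ r^ρ · D(x) for all r ≥ 1. Below, a small numerical check of this property for the robust loss min{h, x} (with ρ = 1); an illustrative sketch, not code from the talk:

```python
# Numerically check the LgLgLp condition D(r*x) <= r**rho * D(x), r >= 1,
# on a grid of scales x and ratios r. Illustrative sanity check only.

def is_lglglp(D, rho, xs, rs, tol=1e-9):
    """Test D(r*x) <= r**rho * D(x) for all sampled x and ratios r >= 1."""
    return all(D(r * x) <= r ** rho * D(x) + tol for x in xs for r in rs)

h = 1.0
robust = lambda x: min(h, x)                  # h-robust loss
xs = [10 ** e for e in range(-4, 5)]          # distances across many scales
rs = [1.0, 1.5, 2.0, 10.0, 100.0]             # ratios r >= 1
print(is_lglglp(robust, rho=1, xs=xs, rs=rs)) # True
```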
Many examples of LgLgLp loss functions: Robust M-estimators in Statistics
[Figure: graphs of robust M-estimator loss functions; figure: Z. Zhang]
Classic Data Reduction
Same notion for LgLgLp loss functions
k-clustering core-set for loss D
Weighted k-clustering core-set for loss D
Handling centers with arbitrary weights is the "hard part"
Our main technical result
1. For every LgLgLp loss function D on a metric space and every set P of n points, there is a weighted (D,k)-core-set S of size |S| = O(log² n).
   (In more detail: |S| = (d · k^O(k) / ε²) · log² n for P in R^d; for finite metrics, d = log n.)
2. S can be computed in time O(n).
Sensitivity
[Langberg and S, SODA'11] The sensitivity of a point p ∈ P determines how important it is to include p in a core-set:
s(p) = max_C  D_W(p,C) / Σ_{q ∈ P} D_W(q,C)
Why this works: if s(p) is small, then p has many "surrogates" in the data; we can take any one of them for the core-set.
If s(p) is large, then there is some C for which p alone contributes a significant fraction of the loss, so we need to include p in any core-set.
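Purely for intuition, a brute-force rendering of this definition (not from the talk; enumerating queries is exactly what the real algorithm avoids). Here a query C is a tuple of k candidate centers, and D(q,C) is the loss of q with respect to its nearest center in C:

```python
from itertools import combinations

def sensitivity(p, P, queries, D):
    """s(p) = max over queries C of D(p,C) / sum_{q in P} D(q,C),
    where D(q,C) = min over centers c in C of the loss D(q,c)."""
    def loss(q, C):
        return min(D(q, c) for c in C)
    return max(loss(p, C) / sum(loss(q, C) for q in P) for C in queries)

def all_k_subsets(P, k):
    """One simple (exponentially large) query family: all k-subsets of the data."""
    return list(combinations(P, k))
```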
Total sensitivity
The total sensitivity T(P) is the sum of the sensitivities of all the points:
T(P) = Σ_{p ∈ P} s(p)
The total sensitivity of the problem is the maximum of T(P) over all input sets P.
Total sensitivity ~ n: cannot have small core-sets.
Total sensitivity constant or polylog: there may exist small core-sets.
Small total sensitivity → small core-set
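In the sensitivity framework this implication is realized by importance sampling: draw points with probability proportional to (upper bounds on) their sensitivities, and reweight so that the weighted loss of the sample is an unbiased estimate of the loss of P. A minimal sketch, assuming the bounds s(p) are already available:

```python
import random

def sensitivity_sample(P, s, m):
    """Sample m points of P with probability s(p)/T and weight each
    sample by 1/(m * prob), so the expected weighted loss of the
    sample equals the loss of P for every query."""
    T = sum(s(p) for p in P)                  # total sensitivity
    probs = [s(p) / T for p in P]
    idx = random.choices(range(len(P)), weights=probs, k=m)
    return [(P[i], 1.0 / (m * probs[i])) for i in idx]   # (point, weight)
```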
The main thing we need to do in order to produce a small core-set for weighted k-median:
For each p ∈ P, compute a good upper bound on s(p), in amortized O(1) time per point.
(The upper bounds should be good enough that their sum over all of P is small.)
Algorithm for computing sensitivities
Recursive-Robust-Median(P, k)
• Input:
  – A set P of n points in a metric space
  – An integer k ≥ 1
• Output:
  – A subset Q ⊆ P of Ω(n/k^k) points
We prove that any two points in Q can serve as each other's surrogates w.r.t. any query. Hence each point p ∈ Q has sensitivity s(p) = O(1/|Q|).
Outer loop: call Recursive-Robust-Median(P, k), then set P := P − Q. Repeat until P is empty.
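A structural sketch of this outer loop, treating recursive_robust_median as a black box with the stated guarantee (its internals are on the next slides); names are illustrative:

```python
# Repeatedly peel off a set Q of mutual surrogates, record the
# sensitivity bound O(1/|Q|) for each of its points, and continue
# on P - Q. Points are assumed hashable (e.g. tuples).
def sensitivity_upper_bounds(P, k, recursive_robust_median):
    bounds = {}
    remaining = list(P)
    while remaining:
        Q = recursive_robust_median(remaining, k)  # |Q| = Omega(|P| / k**k)
        for p in Q:
            bounds[p] = 1.0 / len(Q)               # s(p) = O(1/|Q|)
        Qset = set(Q)
        remaining = [p for p in remaining if p not in Qset]  # P := P - Q
    return bounds
```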
Total sensitivity bound: T = O(k^k · log n), the number of calls to Recursive-Robust-Median, since each call's set Q contributes Σ_{p ∈ Q} O(1/|Q|) = O(1) to T.
The algorithm to find the Ω(n/k^k)-size set Q:
[Figures: Recursive-Robust-Median illustration; c* marks the approximate median.]
A detail
Actually it's more complicated than described, because we can't afford to look for a (1+ε)-approximation, or even a 2-approximation, to the best k-median of any b·n points (b constant).
Instead we look for a bicriteria approximation: a 2-approximation of the best k-median of any b·n/2 points. A linear-time algorithm for this appears in [F, Langberg STOC'11].
High-level intuition for the correctness of Recursive-Robust-Median
Consider any p in the “output” set Q.
If for all queries C, D(p,C) is small, then p has low sensitivity.
If there is a query C for which D(p,C) is large, then in that query all points of Q are assigned to the same center c ∈ C, and are closer to each other than to c; so they are surrogates for one another.
Thank you
Appendices
Many examples of LgLgLp loss functions: Robust M-estimators in Statistics
…
M-estimators: Huber, "fair", Cauchy, Geman–McClure, Welsch, Tukey, Andrews