Testing of clustering
Article by: Noga Alon, Seannie Dar, Michal Parnas and Dana Ron
Presented by: Nir Eitan

What will I talk about?
- General definition of clustering, and motivations
- Being (k,b)-clusterable
- Sublinear property testers
- Solving for a general metric
- A better result for a specific metric and cost function

Motivation
- What is a clustering problem?
  - Cluster analysis, or clustering, is the assignment of a set of observations into subsets (called clusters), so that observations in the same cluster are similar in some sense.
  - It is a method of unsupervised learning.

Motivation
- What is it used for?
  - Image segmentation, object recognition, face detection
  - Social network analysis
  - Bioinformatics, e.g. grouping sequences into gene families
  - Crime analysis
  - Market research
  - And many more

Clustering
- Being (k,b)-clusterable:
  - Input: a set X of n d-dimensional points
  - Output: can X be partitioned into k subsets so that the cost of each is at most b?
- Different cost measures (illustrated in the sketch below):
  - Radius cost
  - Diameter cost

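As a concrete illustration (my own, not from the paper), here is a minimal Python sketch of the two cost measures for a single cluster. It assumes points are tuples of floats and uses the Euclidean distance; note the radius cost here restricts the center to be an input point, a common simplification.

```python
import math
from itertools import combinations

def dist(p, q):
    # Euclidean (L2) distance between two d-dimensional points.
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def diameter_cost(cluster):
    # Diameter cost: the largest pairwise distance within the cluster.
    return max((dist(p, q) for p, q in combinations(cluster, 2)), default=0.0)

def radius_cost(cluster):
    # Radius cost: the smallest r such that some cluster point is within
    # distance r of every other point (center restricted to input points;
    # in general the optimal center need not be an input point).
    return min(max(dist(c, p) for p in cluster) for c in cluster)
```

A set is then (k,b)-clusterable if it can be split into k parts whose chosen cost is at most b each.
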
Hardness
- How hard is it?
  - NP-complete! (for both cost measures, when d > 1)
  - For a general metric, it is hard to approximate the cost of an optimal clustering to within a factor of 2.
  - Clustering under the diameter cost can be solved in time (O(n))^(dk²), since optimal clusters have disjoint convex hulls.

Sublinearity
- We would like a sublinear tester that tells us whether the input is (k,b)-clusterable or far from it: property testing.
- Input: a set X of n d-dimensional points
- Output:
  - If X is (k,b)-clusterable, accept.
  - If X is ε-far from being (k,(1+β)b)-clusterable, reject with probability at least 2/3.
- Being ε-far means there is no such clustering even after removing any εn points.

Testers covered in the article
- Solving for general metrics with β = 1
- L2 metric, radius cost: can be solved for β = 0 (no approximation)
- L2 metric, diameter cost: can be solved with O(p(d,k)·β^(-2d)) samples
- Lower bounds
I will focus on the first and the third.

Testing of clustering under a general metric
- We show an algorithm with β = 1, for radius clustering.
- It assumes the triangle inequality holds.
- Idea: find representatives, i.e. points whose pairwise distances are all greater than 2b.
- Algorithm (sketched in code below):
  - Maintain a representative list, and greedily try to add valid points to it (choosing the points uniformly and independently).
  - Do this for up to m iterations. If at any stage |rep| > k, reject; otherwise accept.

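A minimal Python sketch of this greedy tester (my own rendering, not the authors' code). `dist` is an assumed metric function satisfying the triangle inequality, and samples are drawn uniformly with replacement; m = 6k/ε matches the analysis on the next slides.

```python
import random

def test_clusterable(X, k, b, eps, dist):
    # Greedy representative tester for the radius cost (beta = 1).
    m = int(6 * k / eps) + 1
    reps = []
    for _ in range(m):
        x = random.choice(X)                     # uniform, independent sample
        if all(dist(x, r) > 2 * b for r in reps):
            reps.append(x)                       # x is a new representative
            if len(reps) > k:
                return False, reps               # reject: k+1 points pairwise > 2b apart
    return True, reps                            # accept
```
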
Testing of clustering under a general metric: analysis
- Case 1: X is (k,b)-clusterable. The algorithm always accepts: by the triangle inequality, any two points in the same radius-b cluster are at distance at most 2b, so the representative list holds at most one point per cluster and never exceeds k.
- Case 2: X is ε-far from being (k,2b)-clusterable.
  - There are more than εn candidate representatives at every stage.
  - So the probability of sampling a candidate at each stage is at least ε.
  - We can apply a Chernoff bound to m samples of Bernoulli trials with p = ε.

Testing of clustering under a general metric: analysis
- Case 2, continued:
  - Take m = 6k/ε.
  - The expected number of representatives after m iterations is greater than mε = 6k. The algorithm fails (wrongly accepts) only if fewer than k = (1/6)mε are found.
  - A Chernoff bound gives failure probability < 1/3: Pr[Σ Xᵢ < (1-γ)pm] < exp(-γ²pm/2).
  - Running time: O(mk) = O(k²/ε).
  - The same can be done for the diameter cost.

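Plugging in the slide's numbers (a worked step added here): with p = ε and m = 6k/ε we have pm = 6k, and failing means fewer than k = (1/6)pm representatives, i.e. γ = 5/6, so

$$
\Pr\Big[\sum_i X_i < k\Big] < \exp\!\Big(-\tfrac{1}{2}\big(\tfrac{5}{6}\big)^2 \cdot 6k\Big) = e^{-25k/12} \le e^{-25/12} \approx 0.12 < \tfrac{1}{3}.
$$
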
Finding a clustering under a general metric
- Finding an approximately good clustering: if the set is (k,b)-clusterable, return t ≤ k clusters of radius at most 2b that, w.h.p., leave at most εn points outside.
- Use the same algorithm as before, and return the representative list.
- The probability that more than εn points fall outside the enlarged radius is < 1/3.

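In code terms, the hypothetical `test_clusterable` sketch above already returns the representative list, so the same call doubles as an approximate clustering procedure:

```python
accepted, centers = test_clusterable(X, k, b, eps, dist)
# If X is (k,b)-clusterable then, w.h.p., assigning each point to its
# nearest center covers all but at most eps*n points within radius 2b.
```
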
L2 metric – diameter clustering
- Can achieve any 0 < β < 1.
- Proof stages:
  - Prove for d = 1
  - Prove for d = 2, k = 1
  - Prove for any d ≥ 2, k = 1
  - Prove for any d and k

1-dimensional clustering
- Can be solved deterministically in polynomial time.
- There is no real difference between the diameter and radius costs in one dimension.
- A sublinear algorithm with β = 0 will be shown here (sketched in code below):
  - Select m = Θ((k/ε)·log(k/ε)) random points, uniformly and independently.
  - Check whether they can be (k,b)-clustered.

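A minimal sketch of this tester (illustrative constants; the greedy left-to-right sweep is the standard exact check for 1-D (k,b)-clusterability under the diameter cost):

```python
import math
import random

def is_kb_clusterable_1d(points, k, b):
    # Greedy sweep: sort, and open a new length-b interval whenever the
    # next point does not fit in the current one. This is optimal in 1-D.
    clusters, start = 0, None
    for p in sorted(points):
        if start is None or p - start > b:
            clusters += 1
            start = p
    return clusters <= k

def test_1d(X, k, b, eps):
    # m = Theta((k/eps) * log(k/eps)); the leading constant 1 is illustrative.
    ratio = max(math.e, k / eps)
    m = math.ceil((k / eps) * math.log(ratio))
    sample = [random.choice(X) for _ in range(m)]   # uniform with replacement
    return is_kb_clusterable_1d(sample, k, b)
```
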
1-dimensional clustering
- If X is (k,b)-clusterable, clearly any subset is (k,b)-clusterable as well, and the algorithm accepts.
- Lemma: let X be ε-far from being (k,b)-clusterable. Then there exist k non-intersecting segments, each of length 2b, such that:
  - there are at least εn/(k+1) points of X between every two consecutive segments,
  - as well as to the left of the leftmost and to the right of the rightmost segment.

[Figure: the case k = 4, with at least εn/(k+1) points in each of the k+1 gaps around the segments]

1-dimensional clustering
- A balls-and-bins analysis shows that, with probability greater than 2/3, the sample contains a point from each of those k+1 gaps. Such k+1 points are pairwise separated by segments of length 2b, so they cannot be covered by k clusters, and the algorithm rejects in this case.

2-dimensional clustering with L2
- We show a sublinear algorithm, with sample size depending on β, for d = 2, the L2 metric, and the diameter cost.
- Algorithm: take m samples, and check whether they form a (k,b)-clustering (the k = 1 check is sketched below).
- Start with k = 1 (one cluster).

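For k = 1 under the diameter cost, the check on the sample is just a pairwise-distance test (a minimal sketch; `math.dist` is the built-in Euclidean distance):

```python
import math
from itertools import combinations

def sample_is_one_cluster(sample, b):
    # The sample is (1,b)-clusterable under the diameter cost iff
    # every pairwise L2 distance is at most b.
    return all(math.dist(p, q) <= b for p, q in combinations(sample, 2))
```
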
Some definitions
- C_x denotes the disk of radius b centered at x.
- I(T) denotes the intersection of all disks C_x over points x in T.
- A(R) denotes the area of a region R.
- U_j denotes the union of all sampled points up to phase j.
- A point is influential with respect to I(U_j) if it causes a significant decrease in the area of I(U_j): more than (βb)²/2.

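To make these definitions concrete, here is a purely illustrative Monte Carlo sketch (not part of the paper, which uses these quantities only in the analysis) that estimates A(I(T)) for d = 2 and tests the influence condition:

```python
import math
import random

def area_of_intersection(T, b, trials=200_000, seed=0):
    # Monte Carlo estimate of A(I(T)): I(T) lies inside the radius-b disk
    # around any point of T, so sample uniformly in the disk around T[0]
    # and count samples that fall inside every disk C_x for x in T.
    rng = random.Random(seed)
    cx, cy = T[0]
    hits = 0
    for _ in range(trials):
        while True:  # rejection-sample a uniform point in the disk around T[0]
            x, y = cx + rng.uniform(-b, b), cy + rng.uniform(-b, b)
            if (x - cx) ** 2 + (y - cy) ** 2 <= b * b:
                break
        if all((x - px) ** 2 + (y - py) ** 2 <= b * b for px, py in T):
            hits += 1
    return math.pi * b * b * hits / trials

def is_influential(point, T, b, beta):
    # Influential w.r.t. I(T): adding the point shrinks the area by more
    # than (beta * b)**2 / 2 (up to Monte Carlo error).
    drop = area_of_intersection(T, b) - area_of_intersection(T + [point], b)
    return drop > 0.5 * (beta * b) ** 2
```
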
2-dimensional clustering with L2
- Divide the m samples into phases.
- For phase = 1 to p = 2π/β²:
  - choose ln(3p)/ε points, uniformly and independently.
- Claim: for X that is ε-far from being (k,(1+β)b)-clusterable, at every phase j there are at least εn influential points with respect to I(U_(j-1)).
  - This will be proved using the next lemmas.

Geometric claim
- Let C be a circle of radius at most b. Let s and t be any two points on C, and let o be a point on the segment connecting s and t such that dist(s,o) ≥ b. Consider the line perpendicular at o to the line through s and t, and let w be its closer intersection point with the circle C. Then dist(w,o) ≥ dist(o,t)/2.
[Figure: circle C through s and t = (α,η), with o on the chord st, the perpendicular l′ to the line l through s and t at o, and w the nearer intersection of l′ with C]

Lemma
- Let T be any finite subset of R². Then for every x, y in I(T) such that x is non-influential with respect to T, dist(x,y) ≤ (1+β)b.
  - The geometric claim is used to prove this.
  - Reminder: a point is influential if it reduces the area by more than (βb)²/2.

2 dimensions - conclusion
- This means that if X is ε-far from being (k,(1+β)b)-clusterable, there are at least εn influential points at each stage.
- Given the sample size, the probability of hitting an influential point in every phase is at least 2/3 (by a union bound).
- If an influential point is sampled in each phase, then by the end of the sampling the area of I(T) must have decreased by more than p·(βb)²/2 = πb², i.e. A(I(T)) would have to drop below zero. This is impossible, so I(T) becomes empty and the algorithm must reject.
- For d = 2, the sample size is m = Θ((1/ε)·(1/β)²·log(1/β)).
- Running time: O(m²).

Getting to higher dimensions
- In the general case the sample size needed is Θ((1/ε)·d^(3/2)·log(1/β)·(2/β)^d).
- Define an influential point as one that reduces the volume by more than (βb)^d·V_(d-1)/(d·2^(d-1)), where V_d denotes the volume of the d-dimensional unit ball.
- Number of phases: d·V_d·(2/β)^d/(2·V_(d-1)) (see the consistency check below).
- For every plane containing the line xy, the same geometric argument as before can be used, giving a cone with base of volume (h/2)^(d-1)·V_(d-1), which yields h ≤ βb as needed.

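As a sanity check on these quantities (worked out here): the number of phases times the per-phase volume decrease is exactly the volume V_d·b^d of a radius-b ball, so if every phase removes an influential point, I(T) is emptied:

$$
\frac{d\,V_d\,(2/\beta)^d}{2\,V_{d-1}} \cdot \frac{(\beta b)^d\,V_{d-1}}{d\,2^{d-1}}
= \frac{V_d\,(2/\beta)^d\,(\beta b)^d}{2^{d}}
= V_d\,b^d .
$$
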
Getting to higher k
- For general k, the sample size needed is m = Θ((k²·log k/ε)·d·(2/β)^(2d)).
- The running time is exponential in k and d.
- The idea is roughly the same as before; now take p(k) = k·(p+1) phases, where p is the number of phases used for k = 1.
- An influential point is now a point which is influential for all current clusters (with the same threshold value as for k = 1).

Getting to higher k
- So can we set the number of samples in every phase to ln(3p(k))/ε, as before?
  - The answer is no, as there are multiple possible influential partitions.
  - An influential partition is a k-partition of all influential points found up to the given phase.

Getting to higher k
- Consider all the possible partitions of the samples taken up to phase j.
- The total number of possible influential partitions after phase j is at most k^j.
- Take a different sample size for every phase (computed in the sketch below):
  - m_j = ((j-1)·ln k + ln(3p(k)))/ε.
  - A union bound then gives the needed result.
  - Summing over all m_j gives m = Θ((k²·log k/ε)·d·(2/β)^(2d)).
- Again we get that A(I(T)) would have to drop below zero, so the algorithm rejects w.h.p.

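A small sketch of this sample-size schedule (illustrative; `p1` stands for the number of phases used in the k = 1 case):

```python
import math

def phase_sample_sizes(k, eps, p1):
    # p(k) = k * (p1 + 1) phases; m_j grows with j so that a union bound
    # over the up-to-k^(j-1) influential partitions alive at phase j works.
    pk = k * (p1 + 1)
    return [((j - 1) * math.log(k) + math.log(3 * pk)) / eps
            for j in range(1, pk + 1)]
```
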
Thank you for listening