A System for Outlier Detection and Cluster Repair

Ying Liu
Dr. Sprague
Oct 21, 2005

A data set

Clustering algorithms can generate bad clusters

hMETIS (k=6)
hMETIS (k=20)
BIRCH
BIRCH (k=20)

Factors affecting clustering results

Outliers
Inappropriate parameter values
Drawbacks of the clustering algorithms themselves

Factors affecting outlier detection results

Distributions
Boundary between outlier group and microcluster
Nested outliers

Two steps of cluster repair

Outlier/outlier group detection for each cluster: separate points which are not supposed to be together.
Merge density-connected points: merge points which should be together.

Clusters generated by a clustering algorithm
Outlier detection of different clusters
Merge similar points from different clusters

Step 1: Cluster Repair
Outlier Detection and Evaluation by Network Flow

Network Flow: Maximum Flow/Minimum Cut

Ford-Fulkerson (1962)
The maximum flow problem is to find a flow f for which the total flow is maximum.
The total flow can be measured at the sink, or it can be measured at any cut separating the source from the sink.

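For reference, the two quantities on this slide can be written out as follows. The notation is the standard textbook one and is not taken from the slides: c(u,v) is an edge capacity, f(u,v) the flow on an edge, s the source, t the sink, and (S,T) a partition of the vertices with s in S and t in T.

```latex
|f| \;=\; \sum_{v} f(s,v) - \sum_{v} f(v,s)
    \;=\; \sum_{u \in S,\, v \in T} f(u,v) \;-\; \sum_{u \in S,\, v \in T} f(v,u),
\qquad
c(S,T) \;=\; \sum_{u \in S,\, v \in T} c(u,v),
\qquad
\max_{f} |f| \;=\; \min_{(S,T)} c(S,T)
```

The middle equality is why the total flow can be read off at any cut, and the last one is the max-flow/min-cut theorem that the outlier detection step relies on.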
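Outlier detection: Maximum flow/Minimum cut

Example network (flow/capacity on each edge):

Edge      Flow/Capacity
s -> a    19/19
s -> c    12/13
a -> b    12/12
a -> c    7/10
c -> b    9/9
c -> d    10/11
d -> b    7/7
d -> t    3/3
b -> t    28/30

Flow paths:
s->a->b->t: 12
s->c->d->t: 3
s->c->b->t: 9
s->a->c->d->b->t: 7

maximum flow = minimum cut = 12 + 3 + 9 + 7 = 31
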
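The example can be checked with a small max-flow routine. The sketch below uses the Edmonds-Karp variant of Ford-Fulkerson (shortest augmenting paths found by breadth-first search), which is a choice made here and not necessarily the variant used in the talk. The edge capacities are read from the flow/capacity labels of the example, and the function name is hypothetical.

```python
from collections import deque

def max_flow_min_cut(capacity, s, t):
    """Edmonds-Karp. Returns (max flow value, source side of a minimum cut)."""
    # Residual capacities, with a zero-capacity reverse entry for every edge.
    residual = {}
    for (u, v), c in capacity.items():
        residual.setdefault(u, {})
        residual.setdefault(v, {})
        residual[u][v] = residual[u].get(v, 0) + c
        residual[v].setdefault(u, 0)

    total = 0
    while True:
        # Breadth-first search over edges that still have residual capacity.
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v, r in residual[u].items():
                if r > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            # No augmenting path left: the vertices reached from s form
            # the source side of a minimum cut.
            return total, set(parent)
        # Walk back from t to collect the path, then push its bottleneck.
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
        total += bottleneck

# Capacities of the example network.
edges = {
    ("s", "a"): 19, ("s", "c"): 13, ("a", "b"): 12,
    ("a", "c"): 10, ("c", "b"): 9,  ("c", "d"): 11,
    ("d", "b"): 7,  ("d", "t"): 3,  ("b", "t"): 30,
}
value, source_side = max_flow_min_cut(edges, "s", "t")
print(value, sorted(source_side))
```

The routine reports a flow of 31 with {s, a, c, d} on the source side, so the cut edges are a->b, c->b, d->b and d->t, whose capacities 12 + 9 + 7 + 3 add up to the same 31.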
Outlier detection by network flow

1. Compute the k nearest neighbors of each point in a cluster of data.
2. For the data of a cluster, set up the network.
3. Begin at a random vertex as the source/sink s, and choose its farthest vertex as the sink/source t.
4. Use the Maximum-Flow/Minimum-Cut algorithm to find the flow from source to sink, get the cut separating s and t, and use the smaller side as the candidate outlier or outlier group.
5. Remove the candidate outlier or outlier group from the graph.
6. Select the next source and go back to step 3 until the stop criterion is met.
7. Adjusting: coarsen the graph and adjust the maximum flow.

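Steps 1 to 6 can be sketched in a few dozen lines. This is an illustration under several assumptions, not the system's implementation: distances are Euclidean, "farthest" means farthest in Euclidean distance, the capacity is the (100 / (1 + dist))^4 form given on the network setup slide, the stop criterion is a user-supplied number of groups, the starting vertex is passed in so the run is repeatable, and the rule for choosing the next source is made up. Step 7 (coarsening and adjusting) is left out, and all function names are hypothetical.

```python
import math
from collections import deque

def knn_network(points, k):
    """Symmetric k-nearest-neighbor graph with capacity (100 / (1 + dist)) ** 4."""
    cap = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        for j in others[:k]:
            c = (100.0 / (1.0 + math.dist(p, points[j]))) ** 4
            cap[i][j] = cap[j][i] = c
    return cap

def source_side_of_min_cut(cap, alive, s, t):
    """Edmonds-Karp on the vertices in `alive`; source side of a minimum s-t cut."""
    residual = {u: {v: c for v, c in cap[u].items() if v in alive} for u in alive}
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, r in residual[u].items():
                if r > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return set(parent)  # everything still reachable from s
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck

def detect_outlier_groups(points, k, n_groups, first_source=0):
    cap = knn_network(points, k)                      # steps 1 and 2
    alive = set(range(len(points)))
    groups, s = [], first_source
    while len(groups) < n_groups and len(alive) > 1:  # step 6: stop criterion
        # Step 3: the sink is the remaining vertex farthest from the source.
        t = max(alive - {s}, key=lambda j: math.dist(points[s], points[j]))
        # Step 4: the smaller side of the minimum cut is the candidate.
        side = source_side_of_min_cut(cap, alive, s, t)
        other = alive - side
        candidate = side if len(side) <= len(other) else other
        groups.append(sorted(candidate))
        alive -= candidate                            # step 5
        if s not in alive:
            s = t  # hypothetical rule for "select the next source"
    return groups

# A 3 x 3 grid of close points plus one distant point (index 9).
points = [(x, y) for x in range(3) for y in range(3)] + [(10, 10)]
groups = detect_outlier_groups(points, k=3, n_groups=1)
print(groups)
```

On this toy input the three edges reaching the distant point have far smaller capacity than any edge inside the grid, so the minimum cut isolates index 9 and it is returned as the first candidate.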
Loosely connected clusters
[Figure: clusters No. 20, 1, 19, 2 and 10]

Experiments (setting up the network)

Cluster No. 20: 591 points
7 nearest neighbors
Network: 591 points, 5028 edges

Setting up the network

Compute the k nearest neighbors, and make sure all vertices are connected.
Compute the capacity between two vertices from the distance:

$$c = \frac{1}{1 + \mathit{dist}} \times 100, \qquad \text{Capacity} = c^{4}$$

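A short sketch of the capacity formula shows how strongly it favors short edges. The function name is hypothetical, and Euclidean distance is an assumption, since the slides do not name the metric.

```python
import math

def edge_capacity(p, q):
    """Capacity of the edge between p and q: c = 100 / (1 + dist), capacity = c ** 4."""
    dist = math.dist(p, q)  # Euclidean distance (assumption)
    c = 100.0 / (1.0 + dist)
    return c ** 4

near = edge_capacity((0.0, 0.0), (1.0, 0.0))  # dist = 1, c = 50
far = edge_capacity((0.0, 0.0), (9.0, 0.0))   # dist = 9, c = 10
print(near, far, near / far)
```

A ninefold increase in distance lowers the capacity by a factor of 625 here, so edges that reach a distant point are by far the cheapest ones to cut.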
Experiment result

Loop      Max Flow
No. 4     1267
No. 1     1269
No. 3     3256
No. 5     3937
No. 8     5939
No. 7     7717
No. 14    8962
No. 9     10148
No. 10    16194
No. 2     16533
No. 13    17793
No. 6     25378
No. 11    63797
No. 12    160515
No. 15    359560
No. 17    427908
No. 16    1307310

Experiment (adjusting)

18 vertices, 66 edges

Loop      Cut               Max Flow
No. 1     vertex 4          1267
No. 2     vertex 1          1269
No. 3     vertex 3          3256
No. 4     vertex 5          3937
No. 5     vertex 8          5939
No. 6     vertex 7, 9, 10   16531
No. 7     vertex 2          16533
No. 8     vertex 13         17793
No. 9     vertex 14         20261
No. 10    vertex 6          25378
No. 11    vertex 11         52498
No. 12    vertex 12         160515
No. 13    vertex 15         359560
No. 14    vertex 17         427908
No. 15    vertex 16         1307310

Stop criteria

Users input the number of outliers or outlier groups they want.
Use the maximum flow as the stop condition.

$$c = \frac{100}{1 + \mathit{dist}}, \qquad \text{Capacity} = c^{4}$$

$$D_{flow} = \frac{100}{\sqrt[4]{\dfrac{\mathit{max\_flow}}{\#\mathit{cross\_edge}}}} - 1$$

Stop when $D_{flow} \le D_{avg}$, where $D_{avg}$ = average distance of the remaining data.

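The flow-based criterion inverts the capacity formula: dividing the maximum flow by the number of cut edges gives an average capacity per cut edge, and D_flow is the distance that would produce that capacity. A minimal sketch, with hypothetical function names; the "less than or equal" comparison is an assumption about the direction of the test.

```python
def d_flow(max_flow, n_cross_edges):
    """Distance equivalent of the average capacity of the edges crossing the cut."""
    avg_capacity = max_flow / n_cross_edges
    return 100.0 / avg_capacity ** 0.25 - 1.0

def should_stop(max_flow, n_cross_edges, d_avg):
    """Stop once the cut edges are no longer than the average distance (assumed <=)."""
    return d_flow(max_flow, n_cross_edges) <= d_avg

# Two cut edges of length 4 each have capacity (100 / 5) ** 4 = 160000.
example = d_flow(2 * 160000.0, 2)
print(example)
```

For this cut D_flow comes back as approximately 4, the length of the cut edges, so the loop would stop if the remaining data had an average distance of 5 and continue if it were 3.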
Outlier Degree
Experiment (20 clusters)
[Figure: the data set with clusters labeled 1 to 20]

Step 2: Cluster Repair
Merge Density Connected Points
Merge density-connected microclusters by flexible parameters of DBSCAN
[Figure: the data set with clusters labeled 1 to 20]

Flexible parameters of DBSCAN

Get the average distance d of every microcluster from each point's k nearest neighbors.
[Figures: No. 10 cluster, No. 19 cluster, No. 20 cluster]

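One way to read "average distance d of a microcluster" is the mean, over all points of the microcluster, of the distances to each point's k nearest neighbors within it. The sketch below implements that reading; the function name and the use of Euclidean distance are assumptions.

```python
import math

def average_knn_distance(points, k):
    """Mean distance from each point to its k nearest neighbors in the same microcluster."""
    total, count = 0.0, 0
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        for d in dists[:k]:
            total += d
            count += 1
    return total / count if count else 0.0

line = [(0, 0), (1, 0), (2, 0), (3, 0)]
print(average_knn_distance(line, 1), average_knn_distance(line, 2))
```

For four evenly spaced points the value is 1.0 with k = 1 and 1.25 with k = 2, because the two end points must reach two steps away for their second neighbor.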
DBSCAN
DBSCAN with flexible Eps

The original DBSCAN uses the least dense Eps-neighborhood as the global Eps and sets MinPts = 4.
We use the average distance of each microcluster as its Eps.

When running DBSCAN, points in different microclusters use different Eps values.

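A minimal sketch of DBSCAN with a per-point Eps follows. It is a brute-force illustration and not the system's code: the function name is hypothetical, the neighborhood of a point is taken with that point's own Eps (an assumption, since the slides do not say how two different Eps values interact), and each neighborhood includes the point itself.

```python
import math

NOISE = -1

def dbscan_flexible_eps(points, eps, min_pts=4):
    """DBSCAN where eps[i] is the radius used for point i."""
    n = len(points)
    labels = [None] * n

    def neighbors(i):
        # Points within eps[i] of point i, including i itself.
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps[i]]

    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        labels[i] = cluster
        stack = [j for j in seeds if j != i]
        while stack:
            j = stack.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reach = neighbors(j)
            if len(reach) >= min_pts:  # core point: keep expanding
                stack.extend(reach)
        cluster += 1
    return labels

dense = [(0, 0), (1, 0), (0, 1), (1, 1)]
sparse = [(20, 20), (25, 20), (20, 25), (25, 25)]
points = dense + sparse + [(100, 100)]
flexible = dbscan_flexible_eps(points, [1.5] * 4 + [7.5] * 4 + [1.5])
global_eps = dbscan_flexible_eps(points, [1.5] * 9)
print(flexible)
print(global_eps)
```

With the per-microcluster radii both groups are recovered and only the isolated point is noise, whereas a single Eps of 1.5 tuned to the dense group labels every point of the sparse group as noise.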
Kd tree

Use a kd-tree to find buckets that contain more than two microclusters from different clusters of the original clustering result.

No. 125 bucket
MinPts = 4 for dim = 2
[Figure: point p and its Eps-neighborhood]
Search the rectangle (x+Eps, y+Eps, x-Eps, y-Eps) with an R*-tree. When Eps equals the average distance between points, the neighborhood of point p is very likely to contain 3 extra points besides p itself.

No. 125 bucket
(a) MinPts = 5
(b) MinPts = 5
Other controversial buckets
No. 119 bucket
No. 113 bucket
No. 114 bucket
If x% of the points of a microcluster are merged into another microcluster, then merge these two microclusters. Since the proportion of merged points of these microclusters in these buckets exceeds 90%, microclusters 24 and 28 are merged.

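The merge rule can be sketched as a simple proportion test. The function names are hypothetical, microclusters are represented as sets of point ids, and the 0.9 default is only a placeholder chosen because the reported proportions exceed 90%; the slides leave x unspecified.

```python
def merged_fraction(microcluster, absorbed):
    """Share of a microcluster's points that DBSCAN assigned to the other microcluster."""
    return len(microcluster & absorbed) / len(microcluster)

def should_merge(microcluster, absorbed, x=0.9):
    """Merge the two microclusters when at least a fraction x of the points moved."""
    return merged_fraction(microcluster, absorbed) >= x

small = set(range(20))          # point ids of one microcluster
mostly = set(range(19))         # 19 of its 20 points were absorbed
half = set(range(10))           # only 10 of its 20 points were absorbed
print(should_merge(small, mostly), should_merge(small, half))
```

A microcluster that loses 19 of its 20 points to a neighbor is merged into it, while one that loses only half of its points is kept separate.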
Repair of clusters No. 20, 19 and 10

The 20 clusters after repair

Conclusion

Repair clusters from two aspects:
Removing points which are loosely connected to the clusters by outlier/outlier group detection;
Merging points which are density-connected by DBSCAN with flexible Eps.

Analyzed the microclusters of interest.
Found the relationship among outliers, outlier groups and main clusters.

Questions

MinPts in high-dimensional data

For 3-d, MinPts = 5; for 4-d, MinPts = 6?
For some outlier-group microclusters, MinPts could be very high, because their border points include points of neighboring dense microclusters within their Eps. How can each microcluster's MinPts be used as a reference?