Transcript Part 2c

Clustering algorithms: Part 2c
Agglomerative clustering (AC)
Pasi Fränti
25.3.2014
Speech & Image Processing Unit
School of Computing
University of Eastern Finland
Joensuu, FINLAND
Agglomerative clustering
Categorization by cost function
Single link
– Minimize distance of nearest vectors
Complete link
– Minimize distance of two furthest vectors
Ward’s method (we focus on this)
– Minimize mean square error
– In Vector Quantization, known as the Pairwise Nearest Neighbor (PNN) method
Pseudo code
PNN(X, M) → C, P
s_i ← {x_i}  ∀i ∈ [1, N];
m ← N;
REPEAT
    (s_a, s_b) ← NearestClusters();
    MergeClusters(s_a, s_b);
    m ← m − 1;
UNTIL m = M;
Pseudo code
PNN(X, M) → C, P
FOR i←1 TO N DO                         // O(N)
    p[i] ← i; c[i] ← x[i];
m ← N;
REPEAT                                   // N times
    a, b ← FindSmallestMergeCost();      // O(N²)
    MergeClusters(a, b);
    m ← m − 1;
UNTIL m = M;

T(N) = O(N³)
Ward’s method
[Ward 1963: Journal of American Statistical Association]

Merge cost:
d(a, b) = (n_a · n_b) / (n_a + n_b) · ||c_a − c_b||²

Local optimization strategy:
(a, b) = arg min_{i,j ∈ [1,N], i ≠ j} d(i, j)
Nearest neighbor search:
1. Find the cluster pair to be merged
2. Update of NN pointers
Example of distance calculations
[Figure: three clusters with sizes n_a = 1, n_b = 9, n_c = 3; centroid distances ||c_a − c_b|| = 6 and ||c_b − c_c|| = 5]

MergeCost(a, b) = (1·9)/(1+9) · 6² = (9/10) · 36 = 32.40
MergeCost(b, c) = (9·3)/(9+3) · 5² = (27/12) · 25 = 56.25

Although b and c are closer (5 < 6), merging a and b is cheaper because cluster a contains only a single vector.
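The computation can be verified with a few lines of Python (a minimal sketch; the one-dimensional centroid positions are chosen only to reproduce the distances 6 and 5 from the figure):

import numpy as np

def merge_cost(na, ca, nb, cb):
    # Ward's merge cost: (na*nb)/(na+nb) * ||ca - cb||^2
    ca, cb = np.asarray(ca, dtype=float), np.asarray(cb, dtype=float)
    return na * nb / (na + nb) * np.sum((ca - cb) ** 2)

print(merge_cost(1, [0.0], 9, [6.0]))   # (9/10) * 36 = 32.40
print(merge_cost(9, [0.0], 3, [5.0]))   # (27/12) * 25 = 56.25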
Example of the overall process
[Figure: snapshots of the merging sequence at M = 5000, 4999, 4998, …, 50, …, 16, 15]
Detailed example of the process
[Figure panels showing the clustering from 25 down to 15 clusters]

Clusters     25    24    23    22    21    20    19    18    17    16    15
MSE (×10⁹)  1.01  1.03  1.06  1.09  1.12  1.16  1.19  1.23  1.26  1.30  1.34
Storing distance matrix
• Maintain the distance matrix and update rows for the changed cluster only!
• Number of distance calculations reduces from O(N²) to O(N) for each step.
• Search of the minimum pair still requires O(N²) time → still O(N³) in total.
• It also requires O(N²) memory.
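A minimal Python sketch of this variant (an illustration, not the original implementation; partition bookkeeping is omitted). After each merge only the merged cluster’s row of costs is recomputed, but the minimum search still scans all pairs:

import numpy as np

def pnn_distance_matrix(X, M):
    X = np.asarray(X, dtype=float)
    centroids = [x.copy() for x in X]     # one cluster per vector
    sizes = [1] * len(X)
    active = set(range(len(X)))

    def cost(i, j):
        d = centroids[i] - centroids[j]
        return sizes[i] * sizes[j] / (sizes[i] + sizes[j]) * (d @ d)

    # Full merge-cost matrix, stored as a dict over index pairs.
    D = {(i, j): cost(i, j) for i in active for j in active if i < j}
    while len(active) > M:
        a, b = min(D, key=D.get)          # O(N^2) scan per step
        centroids[a] = (sizes[a] * centroids[a] + sizes[b] * centroids[b]) \
                       / (sizes[a] + sizes[b])
        sizes[a] += sizes[b]
        active.remove(b)
        # Drop entries touching a or b, recompute a's row only: O(N) costs.
        D = {p: c for p, c in D.items() if a not in p and b not in p}
        D.update({(min(a, i), max(a, i)): cost(a, i) for i in active if i != a})
    return [centroids[i] for i in active]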
Heap structure for fast search
[Kurita 1991: Pattern Recognition]
[Figure: the pairwise merge costs stored both in an array indexed 1…N and in a heap keyed by cost]

• Search reduces from O(N²) to O(log N).
• In total: O(N² log N)
Store nearest neighbor (NN) pointers
[Fränti et al., 2000: IEEE Trans. Image Processing]
[Figure: clusters a–g, each with a pointer to its nearest neighbor]

Time complexity reduces: O(N³) → Ω(N²)
Pseudo code
PNN(X, M) → C, P
FOR i←1 TO N DO                          // O(N)
    p[i] ← i; c[i] ← x[i];
FOR i←1 TO N DO                          // O(N²)
    NN[i] ← FindNearestCluster(i);
REPEAT                                    // O(N) iterations
    a ← SmallestMergeCost(NN);            // O(N)
    b ← NN[a];
    MergeClusters(C, P, NN, a, b);
    UpdatePointers(C, NN);                // O(N)
UNTIL m = M;

Full implementation: http://cs.uef.fi/pages/franti/research/pnn.txt
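Below is a runnable Python sketch of the NN-pointer idea (an illustration under the assumptions above: Ward’s merge cost, centroids only, no partition output). After a merge, only clusters whose pointer became stale, i.e. pointed to one of the merged clusters, recompute their nearest neighbor:

import numpy as np

def pnn_nn_pointers(X, M):
    X = np.asarray(X, dtype=float)
    n = len(X)
    centroids = {i: X[i].copy() for i in range(n)}
    sizes = {i: 1 for i in range(n)}

    def cost(i, j):
        d = centroids[i] - centroids[j]
        return sizes[i] * sizes[j] / (sizes[i] + sizes[j]) * (d @ d)

    def nearest(i):
        return min((j for j in centroids if j != i), key=lambda j: cost(i, j))

    NN = {i: nearest(i) for i in centroids}          # O(N^2) initialization

    while len(centroids) > M:
        a = min(NN, key=lambda i: cost(i, NN[i]))    # O(N) scan of NN table
        b = NN[a]
        centroids[a] = (sizes[a] * centroids[a] + sizes[b] * centroids[b]) \
                       / (sizes[a] + sizes[b])
        sizes[a] += sizes[b]
        del centroids[b], sizes[b], NN[b]
        if len(centroids) == 1:
            break                                    # no neighbors left
        # Only pointers to a or b are stale; on average tau clusters per step.
        for i in list(NN):
            if i == a or NN[i] in (a, b):
                NN[i] = nearest(i)
    return list(centroids.values())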
Example with NN pointers
[Virmajoki 2004: Pairwise Nearest Neighbor Method Revisited]

[Figure: input data, seven points a–g plotted on a 10×10 grid]

cluster     a      b      c      d      e      f      g  |  min   NN
a          --    2.0   16.0    2.5    4.0    8.0   16.0  |  2.0    b
b         2.0     --   10.0    2.5   10.0   18.0   26.0  |  2.0    a
c        16.0   10.0     --   22.5   36.0   40.0   32.0  | 10.0    b
d         2.5    2.5   22.5     --    4.5   14.5   30.5  |  2.5    a
e         4.0   10.0   36.0    4.5     --    4.0   20.0  |  4.0    a
f         8.0   18.0   40.0   14.5    4.0     --    8.0  |  4.0    e
g        16.0   26.0   32.0   30.5   20.0    8.0     --  |  8.0    f
Example: Step 1
Merge a and b (smallest merge cost 2.0).

[Figure: clustering after step 1, with merged cluster ab]

cluster    ab      c      d      e      f      g  |  min   NN
ab         --   16.7    2.7    8.7   16.7   27.3  |  2.7    d
c        16.7     --   22.5   36.0   40.0   32.0  | 16.7   ab
d         2.7   22.5     --    4.5   14.5   30.5  |  2.7   ab
e         8.7   36.0    4.5     --    4.0   20.0  |  4.0    f
f        16.7   40.0   14.5    4.0     --    8.0  |  4.0    e
g        27.3   32.0   30.5   20.0    8.0     --  |  8.0    f
Example: Step 2
Merge ab and d (smallest merge cost 2.7).

[Figure: clustering after step 2, with merged cluster abd]

cluster   abd      c      e      f      g  |  min   NN
abd        --   23.1    8.1   19.1   35.1  |  8.1    e
c        23.1     --   36.0   40.0   32.0  | 23.1   abd
e         8.1   36.0     --    4.0   20.0  |  4.0    f
f        19.1   40.0    4.0     --    8.0  |  4.0    e
g        35.1   32.0   20.0    8.0     --  |  8.0    f
Example: Step 3
Merge e and f (smallest merge cost 4.0).

[Figure: clustering after step 3, with merged cluster ef]

cluster   abd      c     ef      g  |  min   NN
abd        --   23.1   19.3   35.1  | 19.3   ef
c        23.1     --   49.3   32.0  | 23.1   abd
ef       19.3   49.3     --   17.3  | 17.3    g
g        35.1   32.0   17.3     --  | 17.3   ef
Example: Step 4
Merge ef and g (smallest merge cost 17.3).

[Figure: clustering after step 4, with merged cluster efg]

cluster   abd      c    efg  |  min   NN
abd        --   23.1   30.8  | 23.1    c
c        23.1     --   48.7  | 23.1   abd
efg      30.8   48.7     --  | 30.8   abd
Example: Final
Merge abd and c (smallest merge cost 23.1).

[Figure: final clustering with two clusters, abcd and efg]
Time complexities of the variants

                              Original     With heap      With NN
                              method       structure      pointers
Initialization phase:         O(N)         O(N²)          O(N²)
Single merge phase:
 – Find two nearest           O(N²)        O(1)           O(N)
 – Merge the clusters         O(1)         O(1)           O(1)
 – Recalculate distances      O(1)         O(N)           O(N)
 – Update data structures     O(1)         O(N log N)     O(N)
Merge phases in total:        O(N³)        O(N² log N)    O(N²)
Algorithm in total:           O(N³)        O(N² log N)    O(N²)
Number of neighbors (τ)
[Figure: normalized frequency of the number of neighbors τ (range 0–40). Averages: BIRCH1 = 5.1, House = 7.0, Bridge = 12.1]
Processing time comparison
[Figure: time in seconds (log scale) vs. training set size N = 512…4096, comparing the original method against our method with NN pointers]
Algorithm:
Lazy-PNN
T. Kaukoranta, P. Fränti and O. Nevalainen, "Vector
quantization by lazy pairwise nearest neighbor method",
Optical Engineering, 38 (11), 1862-1868, November 1999
Monotony property of merge cost
[Kaukoranta et al., Optical Engineering, 1999]

Merge cost values are monotonically increasing:
d(S_a, S_b) ≤ d(S_a, S_c) ≤ d(S_b, S_c)  ⟹  d(S_a, S_c) ≤ d(S_{a+b}, S_c)

[Figure: the merged centroid c_{a+b} lies on the segment between c_a and c_b, at proportions n_b/(n_a+n_b) from c_a and n_a/(n_a+n_b) from c_b]
Lazy variant of the PNN
• Store merge costs in a heap.
• Update a merge cost value only when it appears at the top of the heap.
• Processing time reduces by about 35%.
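In code this is the familiar stale-entry heap pattern. A sketch, assuming hypothetical helpers is_alive, current_cost and stamp that report whether a cluster still exists, the up-to-date merge cost of a pair, and a change counter for that pair. By the monotony property a refreshed cost can only grow, so re-inserting it keeps the heap minimum correct:

import heapq

def pop_min_merge(heap, is_alive, current_cost, stamp):
    # Entries are (cost, timestamp, a, b); stale ones are re-evaluated
    # lazily, only when they reach the top of the heap.
    while heap:
        cost, t, a, b = heapq.heappop(heap)
        if not (is_alive(a) and is_alive(b)):
            continue                          # pair no longer exists: discard
        if t == stamp(a, b):                  # up to date: true minimum found
            return a, b, cost
        heapq.heappush(heap, (current_cost(a, b), stamp(a, b), a, b))
    raise ValueError("heap exhausted")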
Method            Ref.   Time complexity       Additional data structure   Space compl.
Trivial PNN       [10]   O(d·N³)               –                           O(N)
Distance matrix   [6]    O(d·N² + N³)          Distance matrix             O(N²)
Kurita’s method   [5]    O(d·N² + N²·log N)    Dist. matrix + heap         O(N²)
τ-PNN             [1]    O(τ·d·N²)             NN-table                    O(N)
Lazy-PNN          [4]    O(τ·d·N²)             NN-table                    O(N)
Combining PNN and K-means
[Figure: codebook size axis from N down to M. Standard PNN merges all the way from N to M. The combined variant first reduces N → M0 by random selection, runs PNN from M0 down to M, and then fine-tunes the result by K-means (GLA).]
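A hypothetical pipeline sketch (the choice of M0 and the kmeans callback are assumptions, not from the slides; pnn_nn_pointers refers to the sketch shown earlier):

import random

def pnn_plus_kmeans(X, M, M0, kmeans):
    # Hybrid from the figure: random selection N -> M0, PNN M0 -> M,
    # then K-means (GLA) fine-tuning initialized with the PNN codebook.
    sample = random.sample(list(X), M0)       # M < M0 < N
    codebook = pnn_nn_pointers(sample, M)     # agglomerative phase
    return kmeans(X, init=codebook)           # refinement on the full data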
Algorithm:
Iterative shrinking
P. Fränti and O. Virmajoki
“Iterative shrinking method for clustering problems“
Pattern Recognition, 39 (5), 761-765, May 2006.
Agglomerative clustering based on merging
[Figure: clusters S1–S5 before and after a cluster merge. Legend: code vectors and data vectors of the clusters to be merged (x); remaining code vectors and other data vectors (+)]
Agglomeration based on cluster removal
[Fränti and Virmajoki, Pattern Recognition, 2006]
[Figure: clusters S1–S5 before and after a cluster removal. Legend: code vector and data vectors of the cluster to be removed (x); remaining code vectors and other data vectors (+)]
Merge versus removal
[Figure: PNN (merge-based) and IS (removal-based) compared after the third and fourth steps]
Pseudo code of iterative shrinking (IS)
IS(X, M) → C, P
m ← N;
FOR ∀i ∈ [1, m]:
    c_i ← x_i;
    p_i ← i;
    n_i ← 1;
FOR ∀i ∈ [1, m]:
    q_i ← FindSecondNearestCluster(C, x_i);
REPEAT
    CalculateRemovalCosts(C, P, Q, d);
    a ← SelectClusterToBeRemoved(d);
    RemoveCluster(P, Q, a);
    UpdateCentroids(C, P, a);
    UpdateSecondaryPartitions(C, P, Q, a);
    m ← m − 1;
UNTIL m = M.
Cluster removal in practice

Find the secondary cluster of each vector:
q_i = arg min_{1 ≤ j ≤ m, j ≠ p_i}  ( n_j / (n_j + 1) ) · ||x_i − c_j||²

Calculate the removal cost for every vector:
D_i = ( n_{q_i} / (n_{q_i} + 1) ) · ||x_i − c_{q_i}||²  −  ||x_i − c_a||²
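A direct Python transcription of the two formulas (a sketch; part[i] holds the primary cluster p_i of vector x_i, and the removal cost of a cluster accumulates over its own vectors):

import numpy as np

def removal_costs(X, centroids, sizes, part):
    X = np.asarray(X, dtype=float)
    m = len(centroids)
    d = np.zeros(m)
    for i, x in enumerate(X):
        p = part[i]                                  # primary cluster of x_i
        best = np.inf
        for j in range(m):                           # secondary cluster q_i
            if j == p:
                continue
            c = sizes[j] / (sizes[j] + 1) * np.sum((x - centroids[j]) ** 2)
            best = min(best, c)
        # increase in distortion if x_i's own cluster p were removed
        d[p] += best - np.sum((x - centroids[p]) ** 2)
    return d   # d[a]: total removal cost of cluster a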
Partition updates
[Figure: clusters S1–S13 when one code vector is removed. Legend: code vector to be removed vs. remaining code vectors; data vectors affected by the minimum update (x), by the standard update (x ∪ y), and by the extensive update (x ∪ y ∪ z); other data vectors (+)]
Complexity analysis

Number of vectors per cluster, summed over all removal steps:
N/N + N/(N−1) + … + N/M = N · (1/N + 1/(N−1) + … + 1/M)

If we iterate until M = 1:
N · (1/N + … + 1/2 + 1/1) = O(N · log N)

Adding the processing time per vector, the total number of vector updates is of order
N · (log N − log M) = N · log(N/M)
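A quick numeric check of the harmonic-sum bound (N = 5000 and M = 16 are borrowed from the earlier example):

import math

N, M = 5000, 16
total = sum(N / m for m in range(M, N + 1))   # N/M + N/(M+1) + ... + N/N
print(total, N * math.log(N / M))             # both ~2.9e4, i.e. Θ(N·log(N/M))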
Algorithm:
PNN with kNN-graph
P. Fränti, O. Virmajoki and V. Hautamäki,
"Fast agglomerative clustering using a k-nearest neighbor
graph". IEEE Trans. on Pattern Analysis and Machine
Intelligence, 28 (11), 1875-1881, November 2006
Agglomerative clustering with kNN graph

AgglomerativeClustering(X, M) → S
Step 1: Construct the k-NN graph.
Step 2: REPEAT
    Step 2.1: Find the best pair (s_a, s_b) to be merged.
    Step 2.2: Merge the pair (s_a, s_b) → s_ab; resolve the k-NN list for s_ab.
    Step 2.3: Remove the obsolete cluster s_b.
    Step 2.4: Find the neighbors of s_ab.
    Step 2.5: Update distances for the neighbors of s_ab.
UNTIL |S| = M;
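For Step 1, a brute-force Python sketch of the graph construction (the paper creates the graph faster, e.g. by divide-and-conquer; this O(N² log N) version only illustrates the structure, and assumes k < N):

import numpy as np

def knn_graph(C, k):
    # Row i of the result lists the k nearest clusters of cluster i.
    C = np.asarray(C, dtype=float)
    d2 = np.sum((C[:, None, :] - C[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)                 # no self-edges
    return np.argsort(d2, axis=1)[:, :k]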
Example of 2NN graph
[Figure: a 2NN graph over clusters a–k]

Example of 4NN graph
[Figure: a 4NN graph over the same clusters]
Graph using doubly-linked lists
[Figure: adjacency lists of the clusters before merging a and b, and the updated lists of the merged cluster a+b after the merge; links to the obsolete cluster b are removed]
Effect on calculations

Number of steps per stage, theoretical and observed:

                   Theoretical                                Observed
STAGE              τ-PNN    Single link      Double link      τ-PNN    Single link  Double link
Find pair          N        1                1                 8 357        3            3
Merge              N        k² + log N       k² + k + log N    8 367      200          305
Remove last        N        k + log N        log N             8 349      102           45
Find neighbors     N        kN               k                 8 357   41 769          204
Update costs       N(1+τ)   τ + (τ/k)·log N  τ + (τ/k)·log N  48 538      198          187
TOTAL              O(τN²)   O(kN²)           O(N log N)       81 970   42 274          746
Processing time as function of k
[Figure: processing time in seconds on Bridge as a function of k (number of neighbors in the graph), k = 2…20, divided into graph creation by divide-and-conquer and agglomeration]
Time-distortion comparison
[Figure: MSE vs. time in seconds on Miss America. Compared: K-means (fast), K-means, Graph-PNN (1) with the graph created by MPS, and Graph-PNN (2) with the graph created by divide-and-conquer (D-n-C). Reference results: τ-PNN (229 s) and Trivial-PNN (>9999 s); MSE = 5.36]
Conclusions
• Simple to implement, good clustering quality
• The straightforward algorithm is slow: O(N³)
• A fast, exact (yet simple) algorithm exists: O(τ·N²)
• Beyond this is possible:
  – O(τ·N·log N) complexity
  – Complicated graph data structure
  – Compromises the exactness of the merge
Literature
1. P. Fränti, T. Kaukoranta, D.-F. Shen and K.-S. Chang, "Fast and memory efficient implementation of the exact PNN", IEEE Trans. on Image Processing, 9 (5), 773-777, May 2000.
2. P. Fränti, O. Virmajoki and V. Hautamäki, "Fast agglomerative clustering using a k-nearest neighbor graph", IEEE Trans. on Pattern Analysis and Machine Intelligence, 28 (11), 1875-1881, November 2006.
3. P. Fränti and O. Virmajoki, "Iterative shrinking method for clustering problems", Pattern Recognition, 39 (5), 761-765, May 2006.
4. T. Kaukoranta, P. Fränti and O. Nevalainen, "Vector quantization by lazy pairwise nearest neighbor method", Optical Engineering, 38 (11), 1862-1868, November 1999.
5. T. Kurita, "An efficient agglomerative clustering algorithm using a heap", Pattern Recognition, 24 (3), 205-209, 1991.
Literature
6. J. Shanbehzadeh and P.O. Ogunbona, "On the computational complexity of the LBG and PNN algorithms", IEEE Transactions on Image Processing, 6 (4), 614-616, April 1997.
7. O. Virmajoki, P. Fränti and T. Kaukoranta, "Practical methods for speeding-up the pairwise nearest neighbor method", Optical Engineering, 40 (11), 2495-2504, November 2001.
8. O. Virmajoki and P. Fränti, "Fast pairwise nearest neighbor based algorithm for multilevel thresholding", Journal of Electronic Imaging, 12 (4), 648-659, October 2003.
9. O. Virmajoki, Pairwise Nearest Neighbor Method Revisited, PhD thesis, Computer Science, University of Joensuu, 2004.
10. J.H. Ward, "Hierarchical grouping to optimize an objective function", Journal of the American Statistical Association, 58, 236-244, 1963.