Clustering-Lisboa-2010.ppt

Robust methodologies for partition clustering
Paulo Lisboa
Terence Etchells, Ian Jarman and Simon Chambers
Computing and Mathematical Sciences
Liverpool John Moores University
Overview
Partition clustering - critique
Decomposition of the covariance matrix
Landscape mapping of cluster solutions
Validation for two synthetic data sets and metabolic sub-typing
Bioinformatics
Nottingham Tenovus Primary Breast Carcinoma Series
Consecutive series of 1,944 cases of primary operable invasive breast cancer (n=1,076 with all markers present)
Patients presenting during 1986-98
Protein expression comprising 25 immunohistochemical markers related to tumour malignancy, derived through high-throughput protein expression using tissue microarrays (TMA)
Abd El-Rehim et al, Int J Cancer, 116, 340-350, 2005.
Partition clustering – relevance to bioinformatics
[Figure: original data projected onto the space of cluster means, then onto 2D using scatter matrices. Points are marked by cluster; the 25 markers are shown as numbered directions, with p53, CK 5/6, BRCA1, ER, C-erbB-2 and PgR labelled by name.]
Partition clustering – open issues
Identify a suitable algorithm:
Model-based or model-free?
Hierarchical, K-means, PAM?
Return {Sa,...,Sz} solutions
Validate & interpret each solution
K-means
i. Assume #K
ii. Initialise #N?
iii. Sort by optimality?
iv. Select best for #K?
v. Select #K(s)?
vi. Single cluster or ensemble?
[Figure: original data projected onto the space of cluster means, then onto 2D using scatter matrices, with points marked by cluster and the 25 markers shown as numbered directions.]
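Separation index: Decomposition of the scatter matrix
Scatter matrices
\[ S_T = \sum_{i=1}^{N} (X_i - m)^T \cdot (X_i - m) \]
\[ S_W = \sum_{j=1}^{N_c} \sum_{i=1}^{N_j} (X_i - m_j)^T \cdot (X_i - m_j) \]
\[ S_B = \sum_{j=1}^{N_c} N_j \, (m_j - m)^T \cdot (m_j - m) \]
\[ S_T = S_W + S_B \]
Here X_i is a data point written as a row vector, m is the global mean, m_j and N_j are the mean and size of cluster j, and N_c is the number of clusters. The inner sum in S_W runs over the N_j points of cluster j.
[Figure: two clusters with within-cluster scatters S_W1 and S_W2 and between-cluster scatter S_B.]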
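The decomposition S_T = S_W + S_B can be checked numerically. The following is a minimal sketch, assuming NumPy and a made-up three-cluster data set; the function name `scatter_matrices` and the test data are illustrative assumptions, not part of the slides.

```python
import numpy as np

def scatter_matrices(X, labels):
    """Return (S_T, S_W, S_B) for the rows of X grouped by integer labels."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    m = X.mean(axis=0)                          # global mean
    centred = X - m
    S_T = centred.T @ centred                   # total scatter
    d = X.shape[1]
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for j in np.unique(labels):
        Xj = X[labels == j]
        mj = Xj.mean(axis=0)                    # mean of cluster j
        S_W += (Xj - mj).T @ (Xj - mj)          # within-cluster scatter
        S_B += len(Xj) * np.outer(mj - m, mj - m)  # between-cluster scatter
    return S_T, S_W, S_B

# Hypothetical data: three Gaussian blobs in 3-D (not the data from the talk).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 1.0, size=(50, 3)) for c in (-3.0, 0.0, 3.0)])
labels = np.repeat([0, 1, 2], 50)

S_T, S_W, S_B = scatter_matrices(X, labels)
decomposition_holds = np.allclose(S_T, S_W + S_B)
```

The identity holds for any labelling of the data, not only for a good clustering, which is why the ratio of S_B to S_W, rather than either matrix alone, is used to measure separation.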
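Separation index: Decomposition of the scatter matrix
Invariant separation matrix and index
\[ M = S_W^{-1} \cdot S_B \]
\[ J = \mathrm{tr}\left( S_W^{-1} \cdot S_B \right) \]
Under an invertible linear transformation of the data (row-vector convention):
\[ \tilde{X} = X \cdot A \]
\[ \tilde{S}_{(\cdot)} = A^T \cdot S_{(\cdot)} \cdot A \]
\[ \tilde{M} = A^{-1} \cdot M \cdot A \]
\[ \tilde{J} = \mathrm{tr}\left( A^{-1} \cdot M \cdot A \right) = \mathrm{tr}(M) = J \]
For an orthogonal A (a rotation), A^{-1} = A^T, so the last two lines read A^T.M.A.
[Figure: two clusters with within-cluster scatters S_W1 and S_W2 and between-cluster scatter S_B.]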
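The invariance of J = tr(S_W^{-1}.S_B) under an invertible linear map can be illustrated numerically. This is a sketch only, assuming NumPy; the data, the matrix `A` and the function name `separation_index` are hypothetical choices made for the example.

```python
import numpy as np

def separation_index(X, labels):
    """J = tr(S_W^{-1} S_B) for the rows of X grouped by labels."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    m = X.mean(axis=0)
    d = X.shape[1]
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for j in np.unique(labels):
        Xj = X[labels == j]
        mj = Xj.mean(axis=0)
        S_W += (Xj - mj).T @ (Xj - mj)
        S_B += len(Xj) * np.outer(mj - m, mj - m)
    # Solve S_W M = S_B instead of forming the inverse explicitly.
    return float(np.trace(np.linalg.solve(S_W, S_B)))

# Hypothetical data: three Gaussian blobs in 3-D.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 1.0, size=(40, 3)) for c in (-2.0, 0.0, 2.0)])
labels = np.repeat([0, 1, 2], 40)

# An arbitrary invertible (non-orthogonal) matrix, determinant 5.
A = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 3.0],
              [1.0, 0.0, 1.0]])

J = separation_index(X, labels)
J_transformed = separation_index(X @ A, labels)   # same value up to rounding
```

Because J does not change when the variables are rescaled or mixed linearly, solutions can be ranked by J without first agreeing on a normalisation of the markers.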
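N.B. If |S_T| = 0 → Project onto subspace of cohort means
\[ X \rightarrow X_C = \sum_{i=1}^{N_c} \left( X \cdot \hat{a}_i^T \right) \hat{a}_i \]
\[ \hat{M} = \hat{S}_W^{-1} \cdot \hat{S}_B, \qquad \hat{J} = \mathrm{tr}\left( \hat{S}_W^{-1} \cdot \hat{S}_B \right) \]
[Figure: basis vectors a_1, a_2, a_3 of the subspace of cohort means, and the transformed vectors ã_1, ã_2, ã_3.]
Theorem: Ĵ is invariant to dimensionality reduction under Mahalanobis rotations
\[ \tilde{X} = X \cdot \Sigma_X^{-1/2}, \qquad \tilde{X}_C = \sum_{i=1}^{N_c} \left( \tilde{X} \cdot \tilde{a}_i^T \right) \tilde{a}_i \]
where
\[ \Sigma_X = \frac{S_T}{N-1} = U \cdot D_X \cdot U^T, \qquad \Sigma_X^{-1/2} = U \cdot D_X^{-1/2} \cdot U^T \]
\[ \tilde{M} = \tilde{S}_W^{-1} \cdot \tilde{S}_B, \qquad \tilde{J} = \mathrm{tr}\left( \tilde{S}_W^{-1} \cdot \tilde{S}_B \right) = J \]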
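The theorem can be illustrated with a small numerical experiment: whiten the data with the Mahalanobis transform, project onto the subspace spanned by the cohort means, and compare J before and after. This is a sketch under stated assumptions, not the authors' implementation: it uses NumPy, hypothetical 5-D data with three cohorts, the centred cohort means as the spanning set, and an SVD to obtain an orthonormal basis.

```python
import numpy as np

def separation_index(X, labels):
    """J = tr(S_W^{-1} S_B) for the rows of X grouped by labels."""
    m, d = X.mean(axis=0), X.shape[1]
    S_W, S_B = np.zeros((d, d)), np.zeros((d, d))
    for j in np.unique(labels):
        Xj = X[labels == j]
        mj = Xj.mean(axis=0)
        S_W += (Xj - mj).T @ (Xj - mj)
        S_B += len(Xj) * np.outer(mj - m, mj - m)
    return float(np.trace(np.linalg.solve(S_W, S_B)))

# Hypothetical data: three cohorts in 5-D.
rng = np.random.default_rng(2)
centres = rng.normal(scale=3.0, size=(3, 5))
X = np.vstack([rng.normal(c, 1.0, size=(60, 5)) for c in centres])
labels = np.repeat([0, 1, 2], 60)

# Mahalanobis (whitening) transform: X_tilde = (X - mean) Sigma^{-1/2}.
Xc = X - X.mean(axis=0)
Sigma = Xc.T @ Xc / (len(X) - 1)
evals, U = np.linalg.eigh(Sigma)               # columns of U are eigenvectors
Xw = Xc @ (U @ np.diag(evals ** -0.5) @ U.T)

# Orthonormal basis of the subspace spanned by the centred cohort means.
# Three centred means span at most two dimensions, so keep Nc - 1 directions.
means = np.array([Xw[labels == j].mean(axis=0) for j in range(3)])
_, _, Vt = np.linalg.svd(means, full_matrices=False)
basis = Vt[: len(means) - 1].T                 # shape (5, 2)
X_reduced = Xw @ basis                         # coordinates in the subspace

J_full = separation_index(X, labels)           # original 5-D data
J_reduced = separation_index(X_reduced, labels)  # 2-D projection after whitening
```

The whitening step matters: projecting the raw data onto the means subspace without it would in general change J, whereas after whitening the total scatter is proportional to the identity and the directions discarded carry no between-cluster scatter.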
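K-means clustering
\[ \tilde{x} \in C_i : \quad i = \arg\min_j \left\| \tilde{x} - \tilde{p}_j \right\| \]
\[ \tilde{p}_j = \langle \tilde{x} \rangle : \{ \tilde{x} \in C_j \} \]
Each point is assigned to the cluster with the nearest prototype, and each prototype is the mean of the points assigned to it.
[Equation: normalisation of x to x̃.]
Adaptive Resonance Theory (ART) clustering
[Equations: match criterion between the normalised input x̃ and the prototype p̃_i, and the relation between the squared distance ‖x̃ − p̃‖² and the dot product x̃.p̃.]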
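The K-means rule (assign each point to its nearest prototype, then move each prototype to the mean of its members) can be sketched as follows. This is a plain Lloyd-style iteration written with NumPy for illustration; the initialisation from randomly chosen data points, the stopping rule and the test data are assumptions, not details taken from the slides.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means: returns (labels, prototypes)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Assumed initialisation: k distinct data points chosen at random.
    prototypes = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest prototype for every point.
        dist = np.linalg.norm(X[:, None, :] - prototypes[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Update step: prototype j becomes the mean of cluster j
        # (an empty cluster keeps its previous prototype).
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else prototypes[j] for j in range(k)])
        if np.allclose(new, prototypes):
            break
        prototypes = new
    return labels, prototypes

# Hypothetical data: two well-separated blobs in 2-D.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-10.0, 1.0, size=(40, 2)),
               rng.normal(10.0, 1.0, size=(40, 2))])
labels, prototypes = kmeans(X, k=2)
```

Different seeds can lead to different local optima, which is the reason the deck asks how many initialisations to run and how to choose among the resulting solutions.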
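Concordance measure
Contingency table of cluster membership for two solutions, one with N clusters (rows) and one with M clusters (columns):

Cluster membership    1      …    M
1                     O11    …    O1M
…                     …      …    …
N                     ON1    …    ONM

\[ \chi^2 = \sum_{i=1}^{N} \sum_{j=1}^{M} \frac{\left( O_{ij} - E_{ij} \right)^2}{E_{ij}} \]
\[ CV = \sqrt{ \frac{\chi^2}{n \cdot \min(N-1, M-1)} } \]
O_ij is the observed number of cases in cluster i of the first solution and cluster j of the second, E_ij is the count expected if the two solutions were independent (row total × column total / n), and n is the total number of cases. CV is Cramér's V.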
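As an illustration of the concordance measure, the sketch below computes Cramér's V for two label vectors by building the contingency table O, the expected counts E and the chi-squared statistic. It assumes NumPy and that each solution has at least two clusters; the function name `cramers_v` and the toy labels are made up for the example.

```python
import numpy as np

def cramers_v(labels_a, labels_b):
    """Cramér's V between two cluster labelings of the same cases."""
    _, a_idx = np.unique(np.asarray(labels_a).ravel(), return_inverse=True)
    _, b_idx = np.unique(np.asarray(labels_b).ravel(), return_inverse=True)
    # Observed counts O[i, j]: cases in cluster i of A and cluster j of B.
    O = np.zeros((a_idx.max() + 1, b_idx.max() + 1))
    np.add.at(O, (a_idx, b_idx), 1)
    n = O.sum()
    # Expected counts under independence: row total * column total / n.
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / n
    chi2 = ((O - E) ** 2 / E).sum()
    return float(np.sqrt(chi2 / (n * min(O.shape[0] - 1, O.shape[1] - 1))))

# Same partition with permuted cluster names: perfect concordance.
v_same = cramers_v([0, 0, 1, 1, 2, 2], [1, 1, 2, 2, 0, 0])
# Two labelings that are statistically independent: no concordance.
v_indep = cramers_v([0, 0, 1, 1], [0, 1, 0, 1])
```

The measure does not depend on how the clusters are numbered, so two runs that find the same partition under different labels score 1.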
Optimality principle
Reproducibility with
Best Separation - max(J)
Best Concordance – max(CV)
under repeated initialisations
i. N initialisations
ii. Sort by J
iii. Select top p%
iv. Calculate pairwise CV
v. Retain med(CV)
vi. Plot (J, med_CV)
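Steps i to vi can be sketched end to end as below. This is an illustrative reading of the procedure rather than the authors' code: it assumes NumPy, a simple K-means, and that "med(CV)" means, for each retained solution, the median of its Cramér's V against the other retained solutions. All function names, parameter defaults and the test data are hypothetical.

```python
import numpy as np

def kmeans(X, k, rng, n_iter=50):
    p = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = np.linalg.norm(X[:, None] - p[None], axis=2).argmin(axis=1)
        p = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                      else p[j] for j in range(k)])
    return labels

def separation_index(X, labels):
    m, d = X.mean(axis=0), X.shape[1]
    S_W, S_B = np.zeros((d, d)), np.zeros((d, d))
    for j in np.unique(labels):
        Xj = X[labels == j]
        mj = Xj.mean(axis=0)
        S_W += (Xj - mj).T @ (Xj - mj)
        S_B += len(Xj) * np.outer(mj - m, mj - m)
    return float(np.trace(np.linalg.solve(S_W, S_B)))

def cramers_v(a, b):
    _, ai = np.unique(a, return_inverse=True)
    _, bi = np.unique(b, return_inverse=True)
    O = np.zeros((ai.max() + 1, bi.max() + 1))
    np.add.at(O, (ai, bi), 1)
    n = O.sum()
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / n
    chi2 = ((O - E) ** 2 / E).sum()
    return float(np.sqrt(chi2 / (n * min(O.shape[0] - 1, O.shape[1] - 1))))

def seco_points(X, k, n_init=20, top_fraction=0.2, seed=0):
    """Return (J, median CV) pairs for the top solutions with k clusters."""
    rng = np.random.default_rng(seed)
    runs = [kmeans(X, k, rng) for _ in range(n_init)]        # i. N initialisations
    scored = sorted(((separation_index(X, lab), lab) for lab in runs),
                    key=lambda t: t[0], reverse=True)        # ii. sort by J
    top = scored[:max(2, int(round(top_fraction * n_init)))] # iii. top p%
    points = []
    for i, (J, lab) in enumerate(top):
        cvs = [cramers_v(lab, other)                         # iv. pairwise CV
               for j, (_, other) in enumerate(top) if j != i]
        points.append((J, float(np.median(cvs))))            # v. retain med(CV)
    return points                                            # vi. plot (J, med_CV)

# Hypothetical data: three blobs in 2-D.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(c, 1.0, size=(50, 2))
               for c in ([-6, 0], [0, 6], [6, 0])])
points = seco_points(X, k=3)
```

Repeating the call for each candidate number of clusters and plotting all returned pairs on one set of axes gives a landscape of the kind described in the deck, where solutions towards the top right are both well separated and reproducible.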
Synthetic data (10 cohorts)
[Figure: "Artificial Data Showing 10 Cluster Allocations". 3D scatter plot on axes X, Y and Z, with points marked as Cluster 1 to Cluster 10.]
Synthetic data (10 cohorts)
[Figure: "Clustering Performance: Median Cramer V of individual Clusters Vs Invariant J value". Solutions for 5 to 10 clusters, with the best 20 solutions for each number of clusters highlighted; x axis: Median Cramer V Concordance Value (0.7 to 1); y axis: Invariant J Value (9 to 19).]
Synthetic data (10 cohorts)
[Table: mean (x, y, z), covariance matrix entries (i,j) and size N for each of the original cohorts C1 to C10; total N = 1,076. Legible means: C1 (-0.799, -1.011, -3.336), C2 (-0.441, -0.569, -2.331), C3 (0.649, -0.344, -4.154).]
[Table: cross-tabulation of the original cohorts against the solution with 8 clusters.]
Synthetic data – mixing structure (Sammon Map)
[Figure: Sammon map of the 10 cohorts, numbered 1 to 10.]
Synthetic data – Visualisation in data space
[Figure: visualisation of the synthetic data cluster solutions in data space, annotated with cluster sizes; panels labelled Max J, SeCo and Max Cv.]
Synthetic data (10 cohorts)
[Figure: six panels, "Cramer V measures for 5 Clusters" to "Cramer V measures for 10 Clusters". x axis: Median Cramer V of Solution (0.7 to 1); y axis: Cramer V of Best cf Solution (0.7 to 1).]
Marginal distributions
[Figure: histograms of expression value (x axis) against frequency (y axis) for six markers: p53, cerbb2, cerbb4, PgR, ER and muc1.]
Landscape map (SeCo)
[Figure: "Clustering Performance: Median Cramer V of individual Clusters Vs Invariant J value". Solutions for 2 to 15 clusters; x axis: Median Cramer V Concordance Value (0.65 to 1); y axis: Invariant J Value (2 to 18).]
Stability index (Cv)
[Figure: "Box Plots of the Exhaustive Cramer V values". x axis: Number of Cluster Centres (2 to 15); y axis: Cramer V values (0 to 1); blue = median, green = mean.]
[Table: cross-tabulation of cluster membership in the two 8-cluster solutions A and B, with row and column totals; n = 1,076.]
Cluster hierarchy (1)
[Figure: hierarchy diagram linking clusters across solutions with different numbers of clusters. Nodes are labelled by cluster and size (for example "C1, 781", "C3, 459", "C2, 373"), with further numeric annotations on the links.]
Cluster hierarchy (2)
[Figure: hierarchy diagram linking clusters across solutions with different numbers of clusters. Nodes are labelled by cluster and size (for example "C1, 781", "C3, 459", "C2, 373"), with further numeric annotations on the links.]
Solution A
[Figure: 3D scatter plots of the original data projected onto the first three eigenvectors of the scatter matrix in the original domain (axes: projection onto axis 1, axis 2 and axis 3). Points are marked by cluster, 1 to 8, and the markers are shown as numbered labels.]
Solution B
[Figure: the same projection with points marked by the clusters of solution B.]
Sub-type profiling
[Figures, three slides: bar-chart marker profiles on a 0 to 300 scale for individual clusters, grouped under "Clusters A" and "Clusters B".]
Markers: muc2, e-cad, cerbb2, ck5/6, p53, pgr, ck14, actin, p-cad1, ck19, ck7/8, ck18, gcdfp, er, ar, chromo, synapto, egfr, p63, fhit, nbrca1, cerbb4, cerbb3, muc1, muc1 co
Panels (m4): Cluster 1 of 8, Cluster 2 of 8, Cluster 3 of 8, Cluster 4 of 8, Cluster 5 of 8, Cluster 6 of 8
Panels (m9): Cluster 1 of 8, Cluster 2 of 8, Cluster 4 of 8, Cluster 5 of 8, Cluster 6 of 8, Cluster 8 of 8
Sub-type labels: Luminal N, Luminal New 2, Luminal A, HER2, Basal muc1 -, Basal muc1 +, Basal p53 -, Basal p53 +
Consistency with consensus clustering
CoRe 5 Clusters Solution
[Tables: "Cluster A * Clusters in Green et al (2007) Crosstabulation" and "Cluster B * Clusters in Green et al (2007) Crosstabulation" (counts). The clusters in Green et al (2007) are labelled C1 to C6 and NC; grand total 663.]
Molecular sub-typing
Summary
Partition clustering - critique
Decomposition of the covariance matrix
Landscape mapping of cluster solutions
Validation for two synthetic data sets and metabolic sub-typing
Ferrara data (n=633)
Variables: er, pr, PROLIND, neu, P53
[Figure: "Clustering Performance: Median Cramer V of individual Clusters Vs Invariant J value". Solutions for 2 to 10 clusters; x axis: Median Cramer V Concordance Value (0 to 1); y axis: Invariant J Value (0 to 30).]
Ferrara data (n=633)
[Figure: "Box Plots of the Exhaustive Cramer V values". x axis: Number of Cluster Centres (2 to 10); y axis: Cramer V values (0 to 1); blue = median, green = mean.]
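Ferrara data (n=633)
Cross-tabulation of the clusters of Ambrogi et al [7] (rows) against the SeCo method clusters (columns):

                    SeCo method
Ambrogi et al [7]     1     2     3     4     5   Total
1                   213    13     0     4    26     256
2                     0   203     0     1     3     207
3                     0     1    68     0    22      91
4                     0     2     0    77     0      79
Total               213   219    68    82    51     633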
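As a worked example, Cramér's V can be computed directly from the counts in this cross-tabulation. The sketch assumes NumPy and that the four rows are the Ambrogi et al clusters and the five columns the SeCo clusters; since Cramér's V is symmetric, the orientation does not affect the result.

```python
import numpy as np

# Counts from the Ferrara cross-tabulation, without the Total row and column.
O = np.array([[213,  13,  0,  4, 26],
              [  0, 203,  0,  1,  3],
              [  0,   1, 68,  0, 22],
              [  0,   2,  0, 77,  0]], dtype=float)

n = O.sum()                                          # total number of cases
E = np.outer(O.sum(axis=1), O.sum(axis=0)) / n       # expected counts
chi2 = ((O - E) ** 2 / E).sum()                      # chi-squared statistic
cv = float(np.sqrt(chi2 / (n * min(O.shape[0] - 1, O.shape[1] - 1))))
```

With these counts the value comes out at roughly 0.92, consistent with the strongly diagonal structure of the table: most of the disagreement comes from SeCo cluster 5, which is split between rows 1 and 3.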
Ferrara data (n=633)
[Figure: five bar-chart panels, JMU Cluster 1/5 to JMU Cluster 5/5, showing er, pr, PROLIND, neu and P53 on a 0 to 100 scale.]
Ferrara data (n=633)
[Figure: two 3D scatter plots of the original data projected onto the first three eigenvectors of the scatter matrix in the original domain (axes: projection onto axis 1, axis 2 and axis 3), with points marked by cluster and clusters 3, 4 and 5 labelled.]
Ferrara data (n=633)
Cluster 1 (213), Cluster 2 (219), Cluster 3 (68), Cluster 4 (82), Cluster 5 (51)
[Figure: 3D scatter plot of the five clusters on the axes er, Pr and P53, each on a 0 to 100 scale.]