Information Theoretic Clustering, Co-clustering and Matrix Approximations
Inderjit S. Dhillon, University of Texas, Austin
Data Mining Seminar Series, Mar 26, 2004
Joint work with A. Banerjee, J. Ghosh, Y. Guan, S. Mallela, S. Merugu & D. Modha
Clustering: Unsupervised Learning
Grouping together of “similar” objects
Hard Clustering -- each object belongs to a single cluster
Soft Clustering -- each object is probabilistically assigned to clusters
Contingency Tables
Let X and Y be discrete random variables taking values in {1, 2, …, m} and {1, 2, …, n} respectively.
p(X, Y) denotes the joint probability distribution; if not known, it is often estimated from co-occurrence data (see the sketch below).
Application areas: text mining, market-basket analysis, analysis of browsing behavior, etc.
Key Obstacles in Clustering Contingency Tables
High dimensionality, sparsity, noise
Need for robust and scalable algorithms
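As a small illustration of this setup (not part of the talk), a NumPy sketch of estimating p(X, Y) from a co-occurrence matrix by normalizing counts; the count values are made up:

```python
import numpy as np

# Hypothetical word-document co-occurrence counts (rows indexed by X, columns by Y).
counts = np.array([[2, 2, 0, 0],
                   [0, 0, 3, 1],
                   [1, 1, 1, 1]], dtype=float)

# Estimate the joint distribution p(X, Y) by normalizing the counts.
p_xy = counts / counts.sum()

# Marginals p(X) and p(Y) are obtained by summing out the other variable.
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)
```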
Co-Clustering
Simultaneously:
Cluster rows of p(X, Y) into k disjoint groups
Cluster columns of p(X, Y) into l disjoint groups
Key goal is to exploit the “duality” between row and column clustering to overcome sparsity and noise
Co-clustering Example for Text Data
Co-clustering clusters both words and documents simultaneously, using the underlying co-occurrence frequency matrix.
[Figure: a word-by-document matrix with rows grouped into word clusters and columns grouped into document clusters.]
Co-clustering and Information Theory
View the “co-occurrence” matrix as a joint probability distribution over row & column random variables X and Y.
We seek a “hard-clustering” of both rows (X → X̂) and columns (Y → Ŷ) such that the “information” in the compressed matrix is maximized.
Information Theory Concepts
Entropy of a random variable X with probability distribution p:

$$H(p) = -\sum_x p(x) \log p(x)$$

The Kullback-Leibler (KL) Divergence or “Relative Entropy” between two probability distributions p and q:

$$KL(p, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$$

Mutual Information between random variables X and Y:

$$I(X, Y) = \sum_x \sum_y p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$$
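These three quantities are easy to compute directly from a distribution; here is a minimal NumPy sketch (illustrative, not the speaker's code), with zero-probability terms contributing nothing by convention:

```python
import numpy as np

def entropy(p):
    """H(p) = -sum_x p(x) log p(x); zero-probability terms contribute nothing."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def kl_divergence(p, q):
    """KL(p, q) = sum_x p(x) log(p(x) / q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

def mutual_information(p_xy):
    """I(X, Y) = KL(p(x, y) || p(x) p(y)) for a joint distribution given as a matrix."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    return kl_divergence(p_xy.ravel(), (p_x * p_y).ravel())
```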
“Optimal” Co-Clustering
Find X̂ and Ŷ taking values in {1, 2, …, k} and {1, 2, …, l} such that the mutual information is maximized:

$$\max \; I(\hat{X}, \hat{Y})$$

where X̂ = R(X) is a function of X alone and Ŷ = C(Y) is a function of Y alone.
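A small sketch of this objective, under the assumption that the joint distribution is given as a matrix p_xy and the maps R and C are represented as integer assignment arrays row_map and col_map (illustrative names, not from the talk):

```python
import numpy as np

def compressed_joint(p_xy, row_map, col_map, k, l):
    """p(x_hat, y_hat): total p(x, y) mass falling in each co-cluster."""
    R = np.eye(k)[row_map]      # m x k row-cluster indicator matrix for R(X)
    C = np.eye(l)[col_map]      # n x l column-cluster indicator matrix for C(Y)
    return R.T @ p_xy @ C

def coclustering_objective(p_xy, row_map, col_map, k, l):
    """Mutual information I(X_hat, Y_hat) of the compressed matrix -- the quantity to maximize."""
    p_hat = compressed_joint(p_xy, row_map, col_map, k, l)
    px = p_hat.sum(axis=1, keepdims=True)
    py = p_hat.sum(axis=0, keepdims=True)
    nz = p_hat > 0
    return np.sum(p_hat[nz] * np.log(p_hat[nz] / (px * py)[nz]))
```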
Related Work
Distributional Clustering: Pereira, Tishby & Lee (1993); Baker & McCallum (1998)
Information Bottleneck: Tishby, Pereira & Bialek (1999); Slonim, Friedman & Tishby (2001); Berkhin & Becher (2002)
Probabilistic Latent Semantic Indexing: Hofmann (1999); Hofmann & Puzicha (1999)
Non-Negative Matrix Approximation: Lee & Seung (2000)
Information-Theoretic Co-clustering
Lemma: The “loss in mutual information” equals

$$I(X, Y) - I(\hat{X}, \hat{Y}) = KL\big(p(x, y) \,\|\, q(x, y)\big) = H(\hat{X}, \hat{Y}) + H(X \mid \hat{X}) + H(Y \mid \hat{Y}) - H(X, Y)$$

p is the input distribution; q is an approximation to p:

$$q(x, y) = p(\hat{x}, \hat{y})\, p(x \mid \hat{x})\, p(y \mid \hat{y}), \qquad x \in \hat{x},\ y \in \hat{y}$$

It can be shown that q(x, y) is a maximum entropy approximation subject to the cluster constraints.
[Slide example: a 6×6 joint distribution p(x, y) with entries .05 and .04, its row-cluster conditionals p(x | x̂) and column-cluster conditionals p(y | ŷ), the co-cluster distribution p(x̂, ŷ), and the resulting maximum entropy approximation q(x, y).]
The number of parameters that determine q(x, y) is (m - k) + (kl - 1) + (n - l).
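A sketch of constructing q(x, y) from hard row and column assignments, using the same illustrative representation as above (a joint matrix plus integer assignment arrays) and assuming every row, column and cluster has nonzero probability:

```python
import numpy as np

def max_entropy_approximation(p_xy, row_map, col_map, k, l):
    """q(x, y) = p(x_hat, y_hat) p(x | x_hat) p(y | y_hat) for x in x_hat, y in y_hat."""
    R = np.eye(k)[row_map]                     # m x k row-cluster indicators
    C = np.eye(l)[col_map]                     # n x l column-cluster indicators
    p_x = p_xy.sum(axis=1)                     # row marginals p(x)
    p_y = p_xy.sum(axis=0)                     # column marginals p(y)
    p_hat = R.T @ p_xy @ C                     # co-cluster distribution p(x_hat, y_hat)
    p_x_given_xhat = p_x / (R.T @ p_x)[row_map]    # p(x | x_hat) within x's own cluster
    p_y_given_yhat = p_y / (C.T @ p_y)[col_map]    # p(y | y_hat) within y's own cluster
    # q sums to 1, preserves the marginals of p and the co-cluster statistics,
    # and is determined by (m - k) + (kl - 1) + (n - l) parameters.
    return (p_x_given_xhat[:, None]
            * p_hat[np.ix_(row_map, col_map)]
            * p_y_given_yhat[None, :])
```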
Decomposition Lemma
The following lemma reveals the answer:

$$KL\big(p(x, y) \,\|\, q(x, y)\big) = \sum_{\hat{x}} \sum_{x \in \hat{x}} p(x)\, KL\big(p(y \mid x) \,\|\, q(y \mid \hat{x})\big)$$

where $q(y \mid \hat{x}) = p(\hat{y} \mid \hat{x})\, p(y \mid \hat{y})$ for $y \in \hat{y}$ may be thought of as the “prototype” of row cluster $\hat{x}$.

Similarly,

$$KL\big(p(x, y) \,\|\, q(x, y)\big) = \sum_{\hat{y}} \sum_{y \in \hat{y}} p(y)\, KL\big(p(x \mid y) \,\|\, q(x \mid \hat{y})\big)$$
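The row-wise form of the lemma can be evaluated directly; this illustrative sketch computes the right-hand side, which should agree numerically with KL(p(x, y) || q(x, y)) and hence with the loss in mutual information (assuming no empty clusters):

```python
import numpy as np

def rowwise_loss(p_xy, row_map, col_map, k, l):
    """sum_{x_hat} sum_{x in x_hat} p(x) KL( p(y|x) || q(y|x_hat) ),
       with q(y|x_hat) = p(y_hat|x_hat) p(y|y_hat); assumes no empty clusters."""
    R, C = np.eye(k)[row_map], np.eye(l)[col_map]
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
    p_hat = R.T @ p_xy @ C                                    # p(x_hat, y_hat)
    p_yhat_given_xhat = p_hat / p_hat.sum(axis=1, keepdims=True)
    p_y_given_yhat = p_y / (C.T @ p_y)[col_map]               # p(y | y_hat)
    # Row-cluster "prototypes" q(y | x_hat), one row per row cluster.
    q_y_given_xhat = p_yhat_given_xhat[:, col_map] * p_y_given_yhat[None, :]
    p_y_given_x = p_xy / p_x[:, None]
    loss = 0.0
    for x in range(p_xy.shape[0]):
        prototype = q_y_given_xhat[row_map[x]]
        nz = p_y_given_x[x] > 0
        loss += p_x[x] * np.sum(p_y_given_x[x, nz] * np.log(p_y_given_x[x, nz] / prototype[nz]))
    return loss
```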
Co-Clustering Algorithm
[Step 1] Start with an initial co-clustering (R, C) and compute the corresponding approximation q.
[Step 2] For every row x, assign it to the row cluster that minimizes KL(p(y | x) || q^[i, i](y | x̂)); update (R, C) and recompute q.
[Step 3] For every column y, assign it to the column cluster that minimizes KL(p(x | y) || q^[i+1, i](x | ŷ)); update (R, C) and recompute q.
[Step 4] Repeat Steps 2 and 3 until convergence.
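A compact NumPy sketch of the alternating updates, continuing the illustrative representation used above; it runs a fixed number of iterations with a small epsilon for numerical safety rather than the convergence test an actual implementation would use, and is not the authors' code:

```python
import numpy as np

def itcc(p_xy, k, l, n_iters=20, seed=0):
    """Alternate row and column reassignments, each time moving an object to the
       cluster whose prototype is closest in KL divergence (Steps 2 and 3 above)."""
    rng = np.random.default_rng(seed)
    m, n = p_xy.shape
    row_map = rng.integers(k, size=m)            # initial row clustering R
    col_map = rng.integers(l, size=n)            # initial column clustering C
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
    eps = 1e-12                                  # numerical safety for logs and divisions

    for _ in range(n_iters):
        R, C = np.eye(k)[row_map], np.eye(l)[col_map]
        p_hat = R.T @ p_xy @ C                   # current p(x_hat, y_hat)

        # Row step: prototypes q(y | x_hat) = p(y_hat | x_hat) p(y | y_hat).
        p_y_given_yhat = p_y / ((C.T @ p_y)[col_map] + eps)
        q_y_xhat = (p_hat / (p_hat.sum(axis=1, keepdims=True) + eps))[:, col_map] \
                   * p_y_given_yhat[None, :]
        p_y_given_x = p_xy / (p_x[:, None] + eps)
        logs = np.log((p_y_given_x[:, None, :] + eps) / (q_y_xhat[None, :, :] + eps))
        row_map = np.sum(p_y_given_x[:, None, :] * logs, axis=2).argmin(axis=1)

        # Column step: prototypes q(x | y_hat) = p(x_hat | y_hat) p(x | x_hat).
        R = np.eye(k)[row_map]
        p_hat = R.T @ p_xy @ C
        p_x_given_xhat = p_x / ((R.T @ p_x)[row_map] + eps)
        q_x_yhat = (p_hat / (p_hat.sum(axis=0, keepdims=True) + eps))[row_map, :] \
                   * p_x_given_xhat[:, None]
        p_x_given_y = (p_xy / (p_y[None, :] + eps)).T
        logs = np.log((p_x_given_y[:, None, :] + eps) / (q_x_yhat.T[None, :, :] + eps))
        col_map = np.sum(p_x_given_y[:, None, :] * logs, axis=2).argmin(axis=1)

    return row_map, col_map
```

In practice the fixed iteration count would be replaced by a convergence check on the loss in mutual information, which the Main Theorem on the next slide guarantees is non-increasing.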
Properties of Co-clustering Algorithm
Main Theorem: Co-clustering “monotonically” decreases the loss in mutual information
Co-clustering converges to a local minimum
Can be generalized to multi-dimensional contingency tables
q can be viewed as a “low complexity” non-negative matrix approximation
q preserves the marginals of p and the co-cluster statistics
Implicit dimensionality reduction at each step helps overcome sparsity & high-dimensionality
Computationally economical
[Slide example, continued: successive iterations of the co-clustering algorithm on the same 6×6 matrix, showing how the row and column assignments, the conditionals p(x | x̂) and p(y | ŷ), the co-cluster distribution p(x̂, ŷ), and the approximation q(x, y) are updated step by step.]
Applications -- Text Classification
Assigning class labels to text documents
Training and Testing Phases
[Diagram: a document collection grouped into classes Class-1, …, Class-m forms the training data; a classifier learns from the training data and assigns a class label to each new document.]
Feature Clustering (dimensionality reduction)
Feature Selection
[Diagram: a document's bag-of-words vector over m words is reduced to k selected words, Word#1, …, Word#k.]
Select the “best” words; throw away the rest
Frequency-based pruning
Information-criterion-based pruning
Feature Clustering
[Diagram: a document's bag-of-words vector over m words is mapped to k word clusters, Cluster#1, …, Cluster#k.]
Do not throw away words; cluster words instead
Use the word clusters as features
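A minimal sketch of the contrast with feature selection: every word's count is folded into the count of its word cluster (the word-to-cluster assignment is assumed to come from the divisive/distributional clustering step; all names are illustrative):

```python
import numpy as np

def cluster_features(doc_word_counts, word_to_cluster, k):
    """Turn an (n_docs x n_words) bag-of-words matrix into an (n_docs x k) matrix of
       word-cluster counts; no words are thrown away."""
    indicator = np.eye(k)[word_to_cluster]   # n_words x k cluster-membership indicator
    return doc_word_counts @ indicator       # sum each document's counts within each cluster
```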
Experiments
Data sets:
20 Newsgroups data: 20 classes, 20,000 documents
Classic3 data set: 3 classes (cisi, med and cran), 3893 documents
Dmoz Science HTML data: 49 leaves in the hierarchy, 5000 documents with 14538 words; available at http://www.cs.utexas.edu/users/manyam/dmoz.txt
Implementation details: Bow – for indexing, co-clustering, clustering and classifying
Results (20Ng)
Classification accuracy on 20 Newsgroups data with a 1/3-2/3 test-train split
Divisive Clustering beats feature-selection algorithms by a large margin
The effect is more significant at lower numbers of features
Results (Dmoz)
Classification accuracy on Dmoz data with a 1/3-2/3 test-train split
Divisive Clustering is better at lower numbers of features
Note the contrasting behavior of Naïve Bayes and SVMs
Results (Dmoz)
Naïve Bayes on Dmoz data with only 2% training data
Note that Divisive Clustering achieves a higher maximum than IG, with a significant 13% increase
Divisive Clustering performs better than IG when less training data is available
Hierarchical Classification
[Diagram: a class hierarchy with Science at the root; Math, Physics and Social Science as internal nodes; and Number Theory, Logic, Mechanics, Quantum Theory, Economics and Archeology as leaves.]
• Flat classifier builds a classifier over the leaf classes in the above hierarchy
• Hierarchical classifier builds a classifier at each internal node of the hierarchy
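A schematic sketch of the hierarchical classifier, using scikit-learn's MultinomialNB as a stand-in for the Naïve Bayes model trained at each internal node; the hierarchy dictionary and all names are illustrative, not the talk's implementation. X is assumed to be an array of per-document word (or word-cluster) counts and leaf_labels the documents' leaf classes:

```python
from sklearn.naive_bayes import MultinomialNB

# Hypothetical class hierarchy mirroring the slide; leaves are the final labels.
children = {"Science": ["Math", "Physics", "Social Science"],
            "Math": ["Number Theory", "Logic"],
            "Physics": ["Mechanics", "Quantum Theory"],
            "Social Science": ["Economics", "Archeology"]}
parent = {c: p for p, kids in children.items() for c in kids}

def path_from_root(leaf):
    """Internal nodes visited from the root down to (but excluding) the leaf."""
    path, node = [], leaf
    while node in parent:
        node = parent[node]
        path.append(node)
    return list(reversed(path))

def train_hierarchical(X, leaf_labels):
    """Train one Naive Bayes model per internal node, predicting which child to follow."""
    classifiers = {}
    for node in children:
        idx, targets = [], []
        for i, leaf in enumerate(leaf_labels):
            path = path_from_root(leaf) + [leaf]
            if node in path[:-1]:
                idx.append(i)
                targets.append(path[path.index(node) + 1])
        classifiers[node] = MultinomialNB().fit(X[idx], targets)
    return classifiers

def predict_hierarchical(x, classifiers, root="Science"):
    """Route a single document's feature vector down the tree until a leaf is reached."""
    node = root
    while node in children:
        node = classifiers[node].predict(x.reshape(1, -1))[0]
    return node
```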
Results (Dmoz)
• Hierarchical Classifier (Naïve Bayes at each node)
• Hierarchical Classifier: 64.54% accuracy at just 10 features (Flat achieves 64.04% accuracy at 1000 features)
• Hierarchical Classifier improves accuracy to 68.42% from the 64.42% (maximum) achieved by flat classifiers
[Plot: classification accuracy vs. number of features on the Dmoz data for the Hierarchical, Flat(DC) and Flat(IG) classifiers.]
Anecdotal Evidence
Cluster 10 Divisive Clustering (rec.sport.hockey)
team game play hockey season boston chicago pit van nhl
Cluster 9 Divisive Clustering (rec.sport.baseball)
hit runs baseball base ball greg morris ted pitcher hitting
Cluster 12 Agglomerative Clustering (rec.sport.hockey and rec.sport.baseball)
team detroit hockey pitching games hitter players rangers baseball nyi league morris player blues nhl shots pit vancouver buffalo ens
Top few words in clusters obtained by the Divisive and Agglomerative approaches on 20 Newsgroups data
Co-Clustering Results (CLASSIC3)
Co-Clustering (0.9835):
992     4     8
 40  1452     1
  4     7  1387

1-D Clustering (0.821):
847   142    44
 41   275   954
 86   405  1099
Results – Binary (subset of 20Ng data)
Binary (co-clustering 0.852, 1-D clustering 0.67)
Co-clustering:      1-D Clustering:
207   43            179   71
 31  219             94  156

Binary_subject (co-clustering 0.946, 1-D clustering 0.648)
Co-clustering:      1-D Clustering:
234   16            178   72
 11  239            104  146
Precision – 20Ng data
Dataset            Co-clustering   1-D clustering   IB-Double   IDC
Binary             0.98            0.64             0.70        0.85
Binary_subject     0.96            0.67             -           -
Multi5             0.87            0.34             0.5         0.88
Multi5_subject     0.89            0.37             -           -
Multi10            0.56            0.17             0.35        0.55
Multi10_subject    0.54            0.19             -           -
Results: Sparsity (Binary_subject data)
Results (Monotonicity)
Conclusions
Information-theoretic approach to clustering, co-clustering and matrix approximation
Implicit dimensionality reduction at each step to overcome sparsity & high-dimensionality
The theoretical approach has the potential of extending to other problems:
Multi-dimensional co-clustering
MDL to choose the number of co-clusters
Generalized co-clustering by Bregman divergences
More Information
Email: [email protected]
Papers are available at: http://www.cs.utexas.edu/users/inderjit
“Divisive Information-Theoretic Feature Clustering for Text Classification”, Dhillon, Mallela & Kumar, Journal of Machine Learning Research (JMLR), March 2003 (also KDD, 2002).
“Information-Theoretic Co-clustering”, Dhillon, Mallela & Modha, KDD, 2003.
“Clustering with Bregman Divergences”, Banerjee, Merugu, Dhillon & Ghosh, SIAM Data Mining Proceedings, April 2004.
“A Generalized Maximum Entropy Approach to Bregman Co-clustering & Matrix Approximation”, Banerjee, Dhillon, Ghosh, Merugu & Modha, working manuscript, 2004.