
Information Theoretic Clustering, Co clustering and Matrix Approximations

Inderjit S. Dhillon, University of Texas at Austin

Data Mining Seminar Series, Mar 26, 2004
Joint work with A. Banerjee, J. Ghosh, Y. Guan, S. Mallela, S. Merugu & D. Modha

Clustering: Unsupervised Learning

• Grouping together of “similar” objects
• Hard clustering: each object belongs to a single cluster
• Soft clustering: each object is probabilistically assigned to clusters

Contingency Tables

• Let X and Y be discrete random variables
  • X and Y take values in {1, 2, …, m} and {1, 2, …, n} respectively
  • p(X, Y) denotes the joint probability distribution; if not known, it is often estimated from co-occurrence data (see the sketch below)
• Application areas: text mining, market-basket analysis, analysis of browsing behavior, etc.

Key Obstacles in Clustering Contingency Tables

• High dimensionality, sparsity, noise
• Need for robust and scalable algorithms
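To make the estimation step concrete, here is a minimal NumPy sketch (the toy count matrix is invented for illustration, not taken from the talk) of turning a word-document co-occurrence table into an empirical joint distribution and its marginals:

```python
import numpy as np

# Toy word-by-document co-occurrence counts (rows = words X, columns = documents Y).
counts = np.array([
    [4, 4, 0, 0, 1],
    [4, 4, 0, 0, 1],
    [0, 0, 4, 4, 1],
    [0, 0, 4, 4, 1],
], dtype=float)

# Empirical joint distribution p(X, Y): normalize the counts so they sum to 1.
p_xy = counts / counts.sum()

# Marginals follow by summing out the other variable.
p_x = p_xy.sum(axis=1)   # p(X)
p_y = p_xy.sum(axis=0)   # p(Y)
```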

Co-Clustering

• Simultaneously
  • cluster rows of p(X, Y) into k disjoint groups
  • cluster columns of p(X, Y) into l disjoint groups
• Key goal is to exploit the “duality” between row and column clustering to overcome sparsity and noise

Co-clustering Example for Text Data

• Co-clustering clusters both words and documents simultaneously, using the underlying co-occurrence frequency matrix

[Figure: a word-by-document matrix whose rows are grouped into word clusters and whose columns are grouped into document clusters]

Co-clustering and Information Theory

• View the “co-occurrence” matrix as a joint probability distribution over the row and column random variables X and Y (with clustered versions X̂ and Ŷ)
• We seek a “hard clustering” of both rows and columns such that the “information” in the compressed matrix is maximized

Information Theory Concepts

• Entropy of a random variable X with probability distribution p:

  H(p) = − Σ_x p(x) log p(x)

• The Kullback-Leibler (KL) divergence or “relative entropy” between two probability distributions p and q:

  KL(p, q) = Σ_x p(x) log ( p(x) / q(x) )

• Mutual information between random variables X and Y:

  I(X, Y) = Σ_x Σ_y p(x, y) log ( p(x, y) / ( p(x) p(y) ) )
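A minimal NumPy sketch of these three quantities (natural logarithms; the function names are my own, not from the talk):

```python
import numpy as np

def entropy(p):
    """H(p) = -sum_x p(x) log p(x); 0 log 0 is treated as 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) log(p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

def mutual_information(p_xy):
    """I(X, Y) = KL(p(x, y) || p(x) p(y))."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    return kl_divergence(p_xy.ravel(), (p_x * p_y).ravel())
```

With these helpers, mutual_information(p_xy) gives I(X, Y) for the toy joint distribution built earlier.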

“Optimal” Co-Clustering

• Seek row and column clusterings X̂ = R(X) and Ŷ = C(Y), taking values in {1, 2, …, k} and {1, 2, …, l} respectively, such that the mutual information I(X̂, Ŷ) is maximized
  • X̂ = R(X) is a function of X alone
  • Ŷ = C(Y) is a function of Y alone

Related Work

• Distributional Clustering
  • Pereira, Tishby & Lee (1993); Baker & McCallum (1998)
• Information Bottleneck
  • Tishby, Pereira & Bialek (1999); Slonim, Friedman & Tishby (2001); Berkhin & Becher (2002)
• Probabilistic Latent Semantic Indexing
  • Hofmann (1999); Hofmann & Puzicha (1999)
• Non-Negative Matrix Approximation
  • Lee & Seung (2000)

Information-Theoretic Co-clustering

Lemma: The “loss in mutual information” equals

  I(X, Y) − I(X̂, Ŷ) = KL( p(x, y) || q(x, y) )
                     = H(X̂, Ŷ) + H(X | X̂) + H(Y | Ŷ) − H(X, Y)

• p is the input distribution
• q is an approximation to p:

  q(x, y) = p(x̂, ŷ) p(x | x̂) p(y | ŷ),   x ∈ x̂, y ∈ ŷ

• Can be shown that q(x, y) is a maximum entropy approximation to p subject to the cluster constraints.
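As a rough illustration of how q might be formed from hard row and column assignments (the indicator-matrix construction and function names are my own; the algorithm used in the talk is given later):

```python
import numpy as np

def coclustering_approximation(p_xy, row_labels, col_labels, k, l):
    """q(x,y) = p(xhat,yhat) p(x|xhat) p(y|yhat) for hard row/column clusterings.
    Assumes every row/column cluster receives non-zero probability mass."""
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)
    R = np.eye(k)[row_labels]                 # (m, k): R[x, xhat] = 1 iff row x is in cluster xhat
    C = np.eye(l)[col_labels]                 # (n, l): column-cluster indicators
    p_hat = R.T @ p_xy @ C                    # co-cluster joint p(xhat, yhat)
    p_xhat = p_hat.sum(axis=1)                # p(xhat)
    p_yhat = p_hat.sum(axis=0)                # p(yhat)
    p_x_given_xhat = R * (p_x / p_xhat[row_labels])[:, None]   # (m, k), zero outside own cluster
    p_y_given_yhat = C * (p_y / p_yhat[col_labels])[:, None]   # (n, l)
    return p_x_given_xhat @ p_hat @ p_y_given_yhat.T           # (m, n)

# Loss in mutual information = KL(p || q), with kl_divergence from the earlier sketch:
# q = coclustering_approximation(p_xy, row_labels, col_labels, k, l)
# loss = kl_divergence(p_xy.ravel(), q.ravel())
```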

[Worked example (slides): a 6×6 joint distribution p(x, y) with entries 0, .04 and .05 is co-clustered; the slides display the intermediate quantities p(x | x̂), p(x̂, ŷ) and p(y | ŷ), and the resulting maximum entropy approximation q(x, y) (entries such as .054, .042, .036, .028).]

• #parameters that determine q(x, y): (m − k) + (kl − 1) + (n − l)

Decomposition Lemma

• The following lemma reveals the answer:

  KL( p(x, y) || q(x, y) ) = Σ_x̂ Σ_{x ∈ x̂} p(x) KL( p(y | x) || q(y | x̂) )

  where q(y | x̂) = p(y | ŷ) p(ŷ | x̂) for y ∈ ŷ. Note that q(y | x̂) depends only on the row cluster x̂, so it acts as a “prototype” of that row cluster.

• Similarly,

  KL( p(x, y) || q(x, y) ) = Σ_ŷ Σ_{y ∈ ŷ} p(y) KL( p(x | y) || q(x | ŷ) )

  where q(x | ŷ) = p(x | x̂) p(x̂ | ŷ) for x ∈ x̂ is a “prototype” of column cluster ŷ.
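For a numerical sanity check of the first identity, here is a short sketch that builds the row-cluster prototypes q(y | x̂) and evaluates the right-hand side (it reuses kl_divergence and coclustering_approximation from the earlier sketches; the helper names are mine):

```python
import numpy as np

def row_cluster_prototypes(p_xy, row_labels, col_labels, k, l):
    """q(y | xhat) = p(yhat | xhat) p(y | yhat): one distribution over y per row cluster."""
    R = np.eye(k)[row_labels]
    C = np.eye(l)[col_labels]
    p_hat = R.T @ p_xy @ C                                      # p(xhat, yhat)
    p_yhat_given_xhat = p_hat / p_hat.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0)
    p_y_given_yhat = C * (p_y / p_hat.sum(axis=0)[col_labels])[:, None]   # (n, l)
    return p_yhat_given_xhat @ p_y_given_yhat.T                 # (k, n)

def loss_via_row_decomposition(p_xy, row_labels, prototypes):
    """Right-hand side: sum over rows x of p(x) * KL(p(y|x) || q(y | xhat(x))).
    Assumes every row has positive mass."""
    p_x = p_xy.sum(axis=1)
    return sum(p_x[x] * kl_divergence(p_xy[x] / p_x[x], prototypes[row_labels[x]])
               for x in range(p_xy.shape[0]))

# This should match KL(p || q) computed directly from coclustering_approximation.
```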

Co-Clustering Algorithm

• [Step 1] Start with an initial co-clustering (R, C); set i = 0 and compute the induced approximation q^(i,i)
• [Step 2] For every row x, assign it to the row cluster x̂ that minimizes KL( p(y | x) || q^(i,i)(y | x̂) ); with the new row clustering R (columns unchanged), compute q^(i+1,i)
• [Step 3] For every column y, assign it to the column cluster ŷ that minimizes KL( p(x | y) || q^(i+1,i)(x | ŷ) ); with the new column clustering C, compute q^(i+1,i+1)
• [Step 4] Set i = i + 1 and repeat Steps 2 and 3 until the decrease in the objective becomes negligible
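A compact NumPy sketch of this alternating procedure, under my own implementation choices (random initialization, a small epsilon to guard against empty clusters, and a fixed iteration count rather than an explicit convergence test):

```python
import numpy as np

def _pairwise_kl(P, Q):
    """D[i, j] = KL(P[i] || Q[j]) for row-stochastic P (a x n) and Q (b x n)."""
    logP = np.log(np.where(P > 0, P, 1.0))          # treat 0 log 0 as 0
    return (P * logP).sum(axis=1)[:, None] - P @ np.log(Q + 1e-300).T

def itcc(p_xy, k, l, n_iters=20, seed=0):
    """Alternately reassign each row (column) to the row (column) cluster whose
    prototype is nearest in KL divergence."""
    rng = np.random.default_rng(seed)
    m, n = p_xy.shape
    rows = rng.integers(k, size=m)                  # initial row clustering R
    cols = rng.integers(l, size=n)                  # initial column clustering C
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
    eps = 1e-12

    for _ in range(n_iters):
        # Row step: prototypes q(y | xhat) = p(yhat | xhat) p(y | yhat).
        R, C = np.eye(k)[rows], np.eye(l)[cols]
        p_hat = R.T @ p_xy @ C                      # p(xhat, yhat)
        p_y_given_yhat = C * (p_y / (p_hat.sum(axis=0)[cols] + eps))[:, None]
        q_y_xhat = (p_hat / (p_hat.sum(axis=1, keepdims=True) + eps)) @ p_y_given_yhat.T
        rows = _pairwise_kl(p_xy / (p_x[:, None] + eps), q_y_xhat).argmin(axis=1)

        # Column step: prototypes q(x | yhat) = p(xhat | yhat) p(x | xhat).
        R = np.eye(k)[rows]
        p_hat = R.T @ p_xy @ C
        p_x_given_xhat = R * (p_x / (p_hat.sum(axis=1)[rows] + eps))[:, None]
        q_x_yhat = (p_hat / (p_hat.sum(axis=0, keepdims=True) + eps)).T @ p_x_given_xhat.T
        cols = _pairwise_kl((p_xy / (p_y[None, :] + eps)).T, q_x_yhat).argmin(axis=1)

    return rows, cols

# Hypothetical usage on the toy joint distribution from earlier:
# word_clusters, doc_clusters = itcc(p_xy, k=2, l=2)
```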

Properties of Co-clustering Algorithm

• Main Theorem: co-clustering “monotonically” decreases the loss in mutual information
• Co-clustering converges to a local minimum
• Can be generalized to multi-dimensional contingency tables
• q can be viewed as a “low-complexity” non-negative matrix approximation
• q preserves the marginals of p and the co-cluster statistics
• Implicit dimensionality reduction at each step helps overcome sparsity & high dimensionality
• Computationally economical

[Worked example, continued (slides): successive iterations of the algorithm on the 6×6 example, showing the updated row and column clusterings together with the intermediate p(x | x̂), p(x̂, ŷ), p(y | ŷ) and the approximation q(x, y) after each step, until q(x, y) closely matches p(x, y).]

Applications -- Text Classification

• Assigning class labels to text documents
• Training and testing phases

[Figure: a document collection is grouped into classes (Class-1, …, Class-m) to form the training data; a classifier learns from the training data and assigns a class to each new document]

Feature Clustering (dimensionality reduction)

• Feature selection: starting from the bag-of-words vector of a document, select the “best” words and throw away the rest
  • Frequency-based pruning
  • Information-criterion-based pruning
• Feature clustering: do not throw away words; cluster words instead and use the word clusters (Cluster#1, …, Cluster#k) as features (see the sketch below)
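As a small illustration of the second option, here is a sketch that collapses a documents-by-words count matrix onto word-cluster features (word_to_cluster would come from a word clustering such as the co-clustering sketch earlier; the names here are hypothetical):

```python
import numpy as np

def cluster_features(doc_word_counts, word_to_cluster, n_clusters):
    """Collapse a documents-by-words count matrix into documents-by-word-clusters
    features by summing the counts of all words assigned to each cluster."""
    features = np.zeros((doc_word_counts.shape[0], n_clusters))
    for w, c in enumerate(word_to_cluster):
        features[:, c] += doc_word_counts[:, w]
    return features

# Hypothetical usage, with word_clusters taken from the itcc sketch above:
# X_train_reduced = cluster_features(X_train_counts, word_clusters, n_clusters=50)
```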

Experiments

• Data sets
  • 20 Newsgroups data: 20 classes, 20,000 documents
  • Classic3 data set: 3 classes (cisi, med and cran), 3893 documents
  • Dmoz Science HTML data: 49 leaves in the hierarchy, 5000 documents with 14538 words; available at http://www.cs.utexas.edu/users/manyam/dmoz.txt
• Implementation details
  • Bow toolkit used for indexing, co-clustering, clustering and classifying

Results (20Ng)

• Classification accuracy on 20 Newsgroups data with a 1/3-2/3 test-train split
• Divisive clustering beats feature selection algorithms by a large margin
• The effect is more significant at lower numbers of features

Results (Dmoz)

• Classification accuracy on Dmoz data with a 1/3-2/3 test-train split
• Divisive clustering is better at lower numbers of features
• Note the contrasting behavior of Naïve Bayes and SVMs

Results (Dmoz)

• Naïve Bayes on Dmoz data with only 2% training data
• Divisive clustering achieves a higher maximum than IG (Information Gain), a significant 13% increase
• Divisive clustering performs better than IG when training data is scarce

Hierarchical Classification

[Figure: an example topic hierarchy with nodes such as Science, Math, Physics and Social Science, and leaves such as Number Theory, Logic, Mechanics, Quantum Theory, Economics and Archeology]

• A flat classifier builds a classifier over the leaf classes of the hierarchy
• A hierarchical classifier builds a classifier at each internal node of the hierarchy

Results (Dmoz)

• Hierarchical classifier: Naïve Bayes at each internal node
• The hierarchical classifier reaches 64.54% accuracy with just 10 features (the flat classifier needs 1000 features to reach 64.04%)
• The hierarchical classifier improves accuracy to 68.42% from the 64.42% maximum achieved by the flat classifiers

[Figure: accuracy vs. number of features on the Dmoz data for the Hierarchical, Flat (DC) and Flat (IG) classifiers]

Anecdotal Evidence

Top few words, sorted within clusters obtained by the divisive and agglomerative approaches on 20 Newsgroups data:

• Divisive clustering, Cluster 10 (rec.sport.hockey): team, game, play, hockey, season, boston, chicago, pit, van, nhl
• Divisive clustering, Cluster 9 (rec.sport.baseball): hit, runs, baseball, base, ball, greg, morris, ted, pitcher, hitting
• Agglomerative clustering, Cluster 12 (rec.sport.hockey and rec.sport.baseball): team, detroit, hockey, pitching, games, hitter, players, rangers, baseball, nyi, league, morris, player, blues, nhl, shots, pit, vancouver, buffalo, ens

Co-Clustering Results (CLASSIC3)

• Precision on the CLASSIC3 collection (3893 documents): co-clustering 0.9835 vs. 0.821 for 1-D clustering

[Table: 3×3 confusion matrices for co-clustering and 1-D clustering; co-clustering assigns 992, 1452 and 1387 documents to the dominant cluster of each class, with only a handful of off-diagonal errors, while 1-D clustering misassigns far more documents]

Results – Binary (subset of 20Ng data)

• Binary subset: co-clustering precision 0.852 vs. 0.67 for 1-D clustering
• Binary_subject subset: co-clustering precision 0.946 vs. 0.648 for 1-D clustering

[Table: 2×2 confusion matrices for co-clustering and 1-D clustering on each subset; co-clustering concentrates the two classes on the diagonal (207 and 219 of 500 documents for Binary; 234 and 239 for Binary_subject), while 1-D clustering misassigns far more documents]

Precision – 20Ng data

Precision of each method on 20 Newsgroups subsets:

Dataset           Co-clustering   1-D clustering   IB-Double   IDC
Binary            0.98            0.64             0.70        0.85
Binary_subject    0.96            0.67             -           -
Multi5            0.87            0.34             0.5         0.88
Multi5_subject    0.89            0.37             -           -
Multi10           0.56            0.17             0.35        0.55
Multi10_subject   0.54            0.19             -           -

Results: Sparsity (Binary_subject data)

[Plots shown on slides]

Results (Monotonicity)

[Plot shown on slide: the objective decreases monotonically over the iterations]

Conclusions

• Information-theoretic approach to clustering, co-clustering and matrix approximation
• Implicit dimensionality reduction at each step to overcome sparsity & high dimensionality
• The theoretical approach has the potential of extending to other problems:
  • Multi-dimensional co-clustering
  • MDL to choose the number of co-clusters
  • Generalized co-clustering with Bregman divergences

More Information

• Email: [email protected]
• Papers are available at: http://www.cs.utexas.edu/users/inderjit
• “Divisive Information-Theoretic Feature Clustering for Text Classification”, Dhillon, Mallela & Kumar, Journal of Machine Learning Research (JMLR), March 2003 (also KDD 2002)
• “Information-Theoretic Co-clustering”, Dhillon, Mallela & Modha, KDD 2003
• “Clustering with Bregman Divergences”, Banerjee, Merugu, Dhillon & Ghosh, SIAM Data Mining Proceedings, April 2004
• “A Generalized Maximum Entropy Approach to Bregman Co-clustering & Matrix Approximation”, Banerjee, Dhillon, Ghosh, Merugu & Modha, working manuscript, 2004