Fully Automatic Cross-Associations
Deepayan Chakrabarti (CMU)
Spiros Papadimitriou (CMU)
Dharmendra Modha (IBM)
Christos Faloutsos (CMU and IBM)
Problem Definition
[Figure: a matrix of Customers and Products, rearranged into Customer Groups and Product Groups]
Simultaneously group customers and products,
or, documents and words,
or, users and preferences …
Problem Definition
Desiderata:
1. Simultaneously discover row and column groups
2. Fully Automatic: No “magic numbers”
3. Scalable to large graphs
Cross-Associations ≠ Co-clustering!
Information-theoretic co-clustering:
1. Lossy Compression.
2. Approximates the original matrix, while trying to minimize KL-divergence.
3. The number of row and column groups must be given by the user.
Cross-Associations:
1. Lossless Compression.
2. Always provides complete information about the matrix, for any number of row and column groups.
3. The number of row and column groups is chosen automatically using the MDL principle.
Related Work
K-means and variants: Choosing the number of clusters
“Frequent itemsets”: User must specify “support”
Information Retrieval: Choosing the number of “concepts”
Graph Partitioning: Number of partitions; measure of imbalance between clusters
Dimensionality curse
What makes a cross-association “good”?
[Figure: two groupings of a matrix into row groups and column groups, side by side ("versus"). Why is this better?]
Better Clustering:
1. Similar nodes are grouped together
2. As few groups as necessary
A few, homogeneous blocks implies Good Compression
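Main Idea
Good Compression implies Better Clustering
[Figure: a binary matrix divided into row groups and column groups; block i holds $n_i^1$ ones and $n_i^0$ zeros]
$p_i^1 = n_i^1 / (n_i^1 + n_i^0)$
Total Encoding Cost = $\sum_i (n_i^1 + n_i^0) \, H(p_i^1)$ [Code Cost] + $\sum_i$ (cost of describing $n_i^1$ and $n_i^0$) [Description Cost]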
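The Code Cost term above can be sketched in a few lines. This is only an illustration: it assumes H is the binary Shannon entropy in bits, and the names `binary_entropy`, `code_cost`, `row_groups` and `col_groups` are hypothetical, not taken from the slides.

```python
import numpy as np

def binary_entropy(p):
    """Binary Shannon entropy H(p) in bits, with H(0) = H(1) = 0 (assumed form of H)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return float(-p * np.log2(p) - (1.0 - p) * np.log2(1.0 - p))

def code_cost(matrix, row_groups, col_groups):
    """Sum over blocks i of (n_i^1 + n_i^0) * H(p_i^1), in bits."""
    matrix = np.asarray(matrix)
    row_groups = np.asarray(row_groups)
    col_groups = np.asarray(col_groups)
    total = 0.0
    for r in np.unique(row_groups):
        for c in np.unique(col_groups):
            # Block i: the cells whose row is in group r and column is in group c.
            block = matrix[np.ix_(row_groups == r, col_groups == c)]
            size = block.size            # n_i^1 + n_i^0
            p1 = block.sum() / size      # p_i^1 = n_i^1 / (n_i^1 + n_i^0)
            total += size * binary_entropy(p1)
    return total

# Toy matrix with two homogeneous diagonal blocks of ones.
M = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]])
print(code_cost(M, [0, 0, 0, 0], [0, 0, 0, 0]))  # one block, half ones: 16 * H(0.5)
print(code_cost(M, [0, 0, 1, 1], [0, 0, 1, 1]))  # four homogeneous blocks
```

With a single block the toy matrix costs 16 bits, while the grouping that matches its structure costs 0 bits, since every block is then all ones or all zeros. The Description Cost term is not modeled here.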
Examples
Total Encoding Cost = $\sum_i (n_i^1 + n_i^0) \, H(p_i^1)$ [Code Cost] + $\sum_i$ (cost of describing $n_i^1$ and $n_i^0$) [Description Cost]
One row group, one column group: Code Cost high, Description Cost low
m row groups, n column groups: Code Cost low, Description Cost high
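A short worked version of the two extremes, for an m-by-n binary matrix. It assumes H is the binary Shannon entropy, $H(p) = -p \log_2 p - (1-p) \log_2 (1-p)$; the slides do not give the exact form of the per-block Description Cost, so only the number of blocks to describe is compared.

```latex
\begin{align*}
k = l = 1:\quad
  & \text{Code Cost} = (n_1^1 + n_1^0)\, H(p_1^1) = mn \, H(p_1^1),
  && \text{1 block to describe} \\
k = m,\ l = n:\quad
  & \text{Code Cost} = \sum_{i=1}^{mn} 1 \cdot H(p_i^1) = 0
    \ \ \text{since every } p_i^1 \in \{0, 1\},
  && mn \text{ blocks to describe}
\end{align*}
```

In the first case the whole burden falls on the Code Cost; in the second every block is a single cell, so the Code Cost vanishes but a pair $(n_i^1, n_i^0)$ must be described for each of the mn blocks.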
What makes a cross-association “good”?
[Figure: two groupings of a matrix into row groups and column groups, side by side ("versus"). Why is this better?]
Total Encoding Cost = $\sum_i (n_i^1 + n_i^0) \, H(p_i^1)$ [Code Cost: low] + $\sum_i$ (cost of describing $n_i^1$ and $n_i^0$) [Description Cost: low]
Algorithms
[Figure: sequence of intermediate cross-associations]
k=1, l=2 → k=2, l=2 → k=2, l=3 → k=3, l=3 → k=3, l=4 → k=4, l=4 → k=4, l=5 → k = 5 row groups, l = 5 col groups
Algorithms
Start with initial matrix → Lower the encoding cost by alternating "Find good groups for fixed k and l" and "Choose better values for k and l" → Final cross-associations (k=5, l=5)
Fixed k and l
Swaps:
for each row:
    swap it to the row group which minimizes the code cost
Ditto for column swaps
… and repeat …
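A minimal sketch of one pass of row swaps, under stated assumptions: each candidate move is scored by brute force, recomputing the whole code cost, which is simple but not the efficient update a scalable implementation would need. The function names and the toy matrix are hypothetical.

```python
import numpy as np

def code_cost(matrix, row_groups, col_groups):
    """Sum over blocks of (n1 + n0) * H(p1), in bits (binary entropy assumed)."""
    total = 0.0
    for r in np.unique(row_groups):
        for c in np.unique(col_groups):
            block = matrix[np.ix_(row_groups == r, col_groups == c)]
            n1 = int(block.sum())
            n0 = block.size - n1
            for n in (n1, n0):
                if n > 0:
                    # (n1 + n0) * H(p1) = -n1*log2(p1) - n0*log2(p0)
                    total -= n * np.log2(n / block.size)
    return float(total)

def swap_rows(matrix, row_groups, col_groups, k):
    """One pass: move each row to the row group that gives the lowest code cost."""
    row_groups = row_groups.copy()
    for x in range(matrix.shape[0]):
        costs = []
        for g in range(k):
            row_groups[x] = g  # tentatively place row x in group g
            costs.append(code_cost(matrix, row_groups, col_groups))
        row_groups[x] = int(np.argmin(costs))  # keep the cheapest placement
    return row_groups

M = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]])
cols = np.array([0, 0, 1, 1])
start = np.array([0, 1, 0, 1])          # deliberately mixed-up row groups
after = swap_rows(M, start, cols, k=2)
print(after.tolist(), code_cost(M, after, cols))
```

Because a row's current group is always among the candidates, a pass can never raise the code cost. On the toy matrix, a single pass separates rows 0-1 from rows 2-3 and brings the code cost down from 16 bits to 0. Column swaps would be the same routine applied to the transposed matrix.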
Choosing k and l
Split:
1. Find the row group R with the maximum entropy per row
2. Choose the rows in R whose removal reduces the entropy per row in R
3. Send these rows to the new row group, and set k=k+1
Similar for column groups too.
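The three split steps above can be sketched as follows. Two points are assumptions of this sketch rather than statements from the slides: "entropy per row" is taken to be the code cost of a row group's blocks divided by its number of rows, and each row is tested for removal on its own against the unchanged group R. All names are hypothetical.

```python
import numpy as np

def entropy_per_row(matrix, rows, col_groups):
    """Code cost (bits) of the blocks formed by `rows`, divided by len(rows)."""
    if len(rows) == 0:
        return 0.0
    bits = 0.0
    for c in np.unique(col_groups):
        block = matrix[np.ix_(rows, col_groups == c)]
        n1 = int(block.sum())
        for n in (n1, block.size - n1):
            if n > 0:
                bits -= n * np.log2(n / block.size)
    return float(bits) / len(rows)

def split_row_group(matrix, row_groups, col_groups):
    """Return new row groups with one extra group (the caller sets k = k + 1)."""
    row_groups = row_groups.copy()
    labels = np.unique(row_groups)
    members = {g: np.flatnonzero(row_groups == g) for g in labels}
    # Step 1: the row group R with the maximum entropy per row.
    worst = max(labels, key=lambda g: entropy_per_row(matrix, members[g], col_groups))
    rows = members[worst]
    base = entropy_per_row(matrix, rows, col_groups)
    # Step 2: rows whose removal (tested one at a time) lowers R's entropy per row.
    movers = [x for x in rows
              if entropy_per_row(matrix, rows[rows != x], col_groups) < base]
    # Step 3: send those rows to a new row group.
    row_groups[np.array(movers, dtype=int)] = labels.max() + 1
    return row_groups

M = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1]])
cols = np.array([0, 0, 1, 1])
rows_before = np.array([0, 0, 0, 0])    # a single row group, k = 1
rows_after = split_row_group(M, rows_before, cols)
print(rows_after.tolist())
```

On the toy matrix only the last row differs from the rest, so it is the only row whose removal lowers the entropy per row, and it alone moves to the new group. A run of swaps would normally follow each split.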
Algorithms
Start with initial matrix → Lower the encoding cost by alternating "Find good groups for fixed k and l" (Swaps) and "Choose better values for k and l" (Splits) → Final cross-associations (k=5, l=5)
Experiments
“Customer-Product” graph with Zipfian sizes, no noise: k = 5 row groups, l = 5 col groups
“Caveman” graph with Zipfian cave sizes, noise=10%: k = 6 row groups, l = 8 col groups
“White Noise” graph: k = 2 row groups, l = 3 col groups
“CLASSIC” graph of documents & words: k=15, l=19 (axes: Documents, Words)
“GRANTS” graph of documents & words: k=41, l=28 (axes: NSF Grant Proposals, Words in abstract)
“Who-trusts-whom” graph from epinions.com: k=18, l=16 (axes: Epinions.com user, Epinions.com user)
“Clickstream” graph of users and websites: k=15, l=13 (axes: Users, Webpages)
Experiments
[Plot: Time (secs) versus Number of non-zeros, with curves for Splits and Swaps]
Linear in the number of “ones”: Scalable
Conclusions
Desiderata:
1. Simultaneously discover row and column groups
2. Fully Automatic: No “magic numbers”
3. Scalable to large graphs
Experiments
“Caveman” graph with Zipfian cave sizes, no noise: k = 5 row groups, l = 5 col groups
Aim
[Figure: a matrix with k = 5 row groups and l = 5 col groups]
Given any binary matrix, a “good” cross-association will have low cost.
But how can we find such a cross-association?
Main Idea
Good Compression implies Better Clustering
Total Encoding Cost = $\sum_i \text{size}_i \cdot H(p_i)$ [Code Cost] + Cost of describing cross-associations [Description Cost]
Minimize the total cost
Main Idea
Good Compression implies Better Clustering
How well does a cross-association compress the matrix?
1. Encode the matrix in a lossless fashion
2. Compute the encoding cost
3. Low encoding cost → good compression → good clustering