Fully Automatic Cross-Associations
Deepayan Chakrabarti (CMU), Spiros Papadimitriou (CMU), Dharmendra Modha (IBM), Christos Faloutsos (CMU and IBM)
Problem Definition
Simultaneously group customers and products; or documents and words; or users and preferences... (Customers → Customer Groups, Products → Product Groups.)

Desiderata:
1. Simultaneously discover row and column groups
2. Fully automatic: no "magic numbers"
3. Scalable to large graphs

Cross-Associations ≠ Co-clustering!
Information-theoretic co-clustering:
1. Lossy compression.
2. Approximates the original matrix, while trying to minimize KL-divergence.
3. The number of row and column groups must be given by the user.
Cross-Associations:
1. Lossless compression.
2. Always provides complete information about the matrix, for any number of row and column groups.
3. The number of row and column groups is chosen automatically, using the MDL principle.

Related Work
- K-means and variants: dimensionality curse; choosing the number of clusters
- "Frequent itemsets": the user must specify the "support"
- Information Retrieval: choosing the number of "concepts"
- Graph Partitioning: choosing the number of partitions; measure of imbalance between clusters

What makes a cross-association "good"?
1. Similar nodes are grouped together
2. As few groups as necessary
A few, homogeneous blocks imply good compression.

Main Idea: good compression implies better clustering.

Binary matrix: given row groups and column groups, each (row group, column group) pair defines a block i with ni1 ones and ni0 zeros, and density pi1 = ni1 / (ni1 + ni0). Then:

Total Encoding Cost = Σi (ni1 + ni0) * H(pi1)            [Code Cost]
                    + Cost of describing ni1 and ni0     [Description Cost]

Examples:
- One row group, one column group: Code Cost high, Description Cost low.
- m row groups, n column groups (one per row and column): Code Cost low, Description Cost high.
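The code-cost term above can be sketched in a few lines of Python. This is a hypothetical illustration under an assumed dense-matrix layout, not the authors' implementation; the function names are my own.

```python
import numpy as np

def block_code_bits(n1, n0):
    """Code cost in bits for one block with n1 ones and n0 zeros:
    (n1 + n0) * H(p1), where p1 = n1 / (n1 + n0)."""
    n = n1 + n0
    if n == 0 or n1 == 0 or n0 == 0:
        return 0.0  # empty or perfectly homogeneous blocks cost nothing
    p1 = n1 / n
    return -n * (p1 * np.log2(p1) + (1 - p1) * np.log2(1 - p1))

def code_cost(matrix, row_groups, col_groups):
    """Total code cost: sum of per-block entropy costs over all
    (row group, column group) blocks of a binary matrix."""
    total = 0.0
    for r in np.unique(row_groups):
        for c in np.unique(col_groups):
            block = matrix[np.ix_(row_groups == r, col_groups == c)]
            n1 = int(block.sum())
            total += block_code_bits(n1, block.size - n1)
    return total
```

For a block-diagonal matrix, the correct grouping yields homogeneous blocks and hence zero code cost, while lumping everything into one group pays the full entropy of the mixed block.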
- The right choice of row and column groups: both Code Cost and Description Cost low.

Total Encoding Cost = Σi (ni1 + ni0) * H(pi1) + Cost of describing ni1 and ni0.

Algorithms
Search over k row groups and l column groups, alternately growing each: k=1, l=2 → k=2, l=2 → k=2, l=3 → k=3, l=3 → k=3, l=4 → k=4, l=4 → k=4, l=5 → ... → k=5, l=5.

Outer loop: start with the initial matrix, then alternate two steps, lowering the encoding cost each time, until convergence to the final cross-associations:
1. Find good groups for fixed k and l.
2. Choose better values for k and l.

Fixed k and l: Swaps
For each row, swap it to the row group which minimizes the code cost. Ditto for column swaps... and repeat.

Choosing k and l: Splits
1. Find the row group R with the maximum entropy per row.
2. Choose the rows in R whose removal reduces the entropy per row in R.
3. Send these rows to the new row group, and set k = k + 1.
Similarly for column groups.
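The row-swap pass just described can be sketched as follows. This is a hypothetical sketch under an assumed dense-matrix layout (function name and `eps` clipping are my own, not the authors' code): each row is reassigned to the row group whose current block densities encode it most cheaply.

```python
import numpy as np

def row_swap_pass(matrix, row_groups, col_groups, k, eps=1e-9):
    """One 'swap' pass over the rows: move each row to the row group
    that minimizes its code length under that group's block densities."""
    l = int(col_groups.max()) + 1
    col_idx = [np.flatnonzero(col_groups == c) for c in range(l)]
    # density of ones in each (row group, column group) block
    p = np.zeros((k, l))
    for g in range(k):
        rows = np.flatnonzero(row_groups == g)
        for c in range(l):
            block = matrix[np.ix_(rows, col_idx[c])]
            p[g, c] = block.mean() if block.size else 0.0
    pc = np.clip(p, eps, 1 - eps)             # avoid log(0)
    m = np.array([len(ci) for ci in col_idx])  # columns per column group
    new_groups = row_groups.copy()
    for i in range(matrix.shape[0]):
        x = np.array([matrix[i, ci].sum() for ci in col_idx])
        # code length (bits) of row i under each candidate row group
        costs = -(x * np.log2(pc) + (m - x) * np.log2(1 - pc)).sum(axis=1)
        new_groups[i] = int(np.argmin(costs))
    return new_groups
```

A column-swap pass is symmetric (run the same procedure on the transpose), and the two are repeated until the code cost stops decreasing.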
Algorithms (summary)
- Find good groups for fixed k and l: Swaps.
- Choose better values for k and l: Splits.

Experiments
- "Customer-Product" graph with Zipfian sizes, no noise: k = 5 row groups, l = 5 column groups.
- "Caveman" graph with Zipfian cave sizes, no noise: k = 5, l = 5.
- "Caveman" graph with Zipfian cave sizes, noise = 10%: k = 6, l = 8.
- "White Noise" graph: k = 2, l = 3.
- "CLASSIC" graph of documents and words: k = 15, l = 19.
- "GRANTS" graph of NSF grant proposals and words in abstracts: k = 41, l = 28.
- "Who-trusts-whom" graph from epinions.com (Epinions.com user vs. Epinions.com user): k = 18, l = 16.
- "Clickstream" graph of users and webpages: k = 15, l = 13.
- Scalability: running time of both splits and swaps is linear in the number of non-zeros ("ones").

Conclusions
Desiderata met:
- Simultaneously discover row and column groups.
- Fully automatic: no "magic numbers".
- Scalable to large graphs.

Aim
Given any binary matrix, a "good" cross-association will have low cost. But how can we find such a cross-association?

Main Idea: good compression implies better clustering.
Total Encoding Cost = Σi sizei * H(pi)   [Code Cost]
                    + Cost of describing the cross-associations   [Description Cost]
Minimize the total cost.

How well does a cross-association compress the matrix? Encode the matrix in a lossless fashion and compute the encoding cost. Low encoding cost → good compression → good clustering.
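The split step from the Algorithms slides can be sketched as follows. This is a hypothetical sketch, not the authors' code; it reads "choose the rows whose removal reduces the entropy per row" in a simplified way, moving the rows whose own code length exceeds the group average into the new group.

```python
import numpy as np

def binary_entropy(p):
    """H(p) in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def split_row_groups(matrix, row_groups, col_groups, eps=1e-9):
    """One 'split' step: find the row group with the maximum entropy per
    row, then send its costlier-than-average rows to a new row group
    (so k becomes k + 1)."""
    l = int(col_groups.max()) + 1
    k = int(row_groups.max()) + 1
    col_idx = [np.flatnonzero(col_groups == c) for c in range(l)]
    # entropy per row of each group: sum over its blocks of
    # (n1 + n0) * H(p1), divided by the group's row count
    per_row = np.zeros(k)
    for g in range(k):
        rows = np.flatnonzero(row_groups == g)
        cost = 0.0
        for ci in col_idx:
            block = matrix[np.ix_(rows, ci)]
            if block.size:
                cost += block.size * binary_entropy(block.mean())
        per_row[g] = cost / max(len(rows), 1)
    worst = int(np.argmax(per_row))
    rows = np.flatnonzero(row_groups == worst)
    # code length of each row under the worst group's block densities
    pc = np.array([np.clip(matrix[np.ix_(rows, ci)].mean(), eps, 1 - eps)
                   for ci in col_idx])
    m = np.array([len(ci) for ci in col_idx])
    row_cost = np.zeros(len(rows))
    for j, i in enumerate(rows):
        x = np.array([matrix[i, ci].sum() for ci in col_idx])
        row_cost[j] = -(x * np.log2(pc) + (m - x) * np.log2(1 - pc)).sum()
    new_groups = row_groups.copy()
    new_groups[rows[row_cost > row_cost.mean()]] = k  # new group index = k
    return new_groups
```

Column splits are symmetric. In the full algorithm, each split is followed by swap passes, and a split is kept only if it lowers the total (code + description) cost, in line with the MDL principle.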