Fast Effective Clustering for Graphs and Documents
William W. Cohen
Machine Learning Dept. and Language Technologies Institute
School of Computer Science, Carnegie Mellon University
Joint work with: Frank Lin and Ramnath Balasubramanyan

Introduction: trends in machine learning (1)
• Supervised learning: given data (x1,y1), …, (xn,yn), learn to predict y from x
  – y is a real number or a member of a small set
  – x is a (sparse) vector
• Semi-supervised learning: given data (x1,y1), …, (xk,yk), xk+1, …, xn, learn to predict y from x
• Unsupervised learning: given data x1, …, xn, find a “natural” clustering

Introduction: trends in machine learning (2)
• Supervised learning: given data (x1,y1), …, (xn,yn), learn to predict y from x
  – y is a real number or a member of a small set
  – x is a (sparse) vector
  – x’s are all i.i.d., independent of each other
  – y depends only on the corresponding x
• Structured learning: x’s and/or y’s are related to each other

Introduction: trends in machine learning (2)
• Structured learning: x’s and/or y’s are related to each other
  – General: x’s and y’s are in two parallel 1-d arrays
    • x’s are words in a document, y is the POS tag
    • x’s are words, y=1 if x is part of a company name
    • x’s are DNA codons, y=1 if x is part of a gene
    • …
  – More general: x’s are nodes in a graph, y’s are labels for these nodes

Examples of classification in graphs
• x is a web page, edge is a hyperlink, y is topic
• x is a word, edge is co-occurrence in similar contexts, y is semantics (distributional clustering)
• x is a protein, edge is an interaction, y is subcellular location
• x is a person, edge is an email message, y is organization
• x is a person, edge is friendship, y=1 if x smokes
• …
• x, y are anything; an edge from x1 to x2 indicates similarity between x1 and x2
Examples: Zachary’s karate club, political books, protein-protein interactions, …

Political blog network
[Adamic & Glance, “Divided They Blog: …”, 2004]

Outline
• Spectral methods
  – Variant: Power Iteration Clustering [Lin & Cohen, ICML 2010]
  – Variant: PIC for document clustering
• Stochastic block models
  – Mixed-membership sparse block model [Parkinnen et al, 2007]
  – Variants: BlockLDA with entropic regularization, BlockLDA with annotated documents

This talk:
• Typical experiments:
  – For networks with known “true” labels …
  – can unsupervised learning recover these labels?
Spectral Clustering: Graph = Matrix
[Figure: a small example graph with nodes A–J and its adjacency matrix; rows and columns are nodes, with a 1 wherever two nodes are linked.]

Spectral Clustering: Graph = Matrix
Transitively Closed Components = “Blocks”
[Figure: the same adjacency matrix with nodes sorted by cluster, so the groups {A,B,C}, {D,E,F}, and {G,H,I,J} appear as dense diagonal blocks.]
Of course we can’t see the “blocks” unless the nodes are sorted by cluster…

Spectral Clustering: Graph = Matrix
Vector = Node Weight
[Figure: the matrix M alongside a vector v assigning a weight to each node, e.g. A=3, B=2, C=3, …]

Spectral Clustering: Graph = Matrix
M*v1 = v2 “propagates weights from neighbors”
[Figure: multiplying M by v1 replaces each node’s weight with the sum of its neighbors’ weights, e.g. A ← 2*1 + 3*1 + 0*…, B ← 3*1 + 3*1, C ← 3*1 + 2*1.]

Spectral Clustering: Graph = Matrix
W*v1 = v2 “propagates weights from neighbors”
W: normalized so columns sum to 1
[Figure: the same propagation with the normalized matrix W, e.g. A ← 2*.5 + 3*.5 + 0*.3, B ← 3*.3 + 3*.5, C ← 3*.33 + 2*.5.]

Spectral Clustering: Graph = Matrix
W*v1 = v2 “propagates weights from neighbors”
W v = λ v : v is an eigenvector of W with eigenvalue λ
Q: How do I pick v to be an eigenvector for a block-stochastic matrix?

Spectral Clustering: Graph = Matrix
W*v1 = v2 “propagates weights from neighbors”
W v = λ v : v is an eigenvector of W with eigenvalue λ
[Figure: the spectrum of W, with eigenvalues λ1, λ2, λ3, λ4 and eigenvectors e1, e2, e3, and an “eigengap” separating them from λ5, λ6, λ7, …]
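To make the propagation step concrete, here is a minimal numpy sketch (not from the slides): it builds a small three-block graph in the spirit of the A–J example, forms the normalized matrix W = D^-1 A (the row-normalized view of the slides’ column-normalized W), and performs one “propagate weights from neighbors” multiplication. The exact edge set is an assumption for illustration.

```python
import numpy as np

# Illustrative 10-node graph in the spirit of the slides' example:
# three groups {A,B,C}, {D,E,F}, {G,H,I,J}, each fully connected inside.
# (This exact edge set is an assumption, not the slides' precise graph.)
nodes = list("ABCDEFGHIJ")
blocks = [("A", "B", "C"), ("D", "E", "F"), ("G", "H", "I", "J")]

A = np.zeros((len(nodes), len(nodes)))
for block in blocks:
    for i in block:
        for j in block:
            if i != j:
                A[nodes.index(i), nodes.index(j)] = 1

# Degree vector d and the normalized matrix W = D^-1 A
# (each row of W sums to 1, i.e. a random-walk / block-stochastic matrix;
# assumes no isolated nodes).
d = A.sum(axis=1)
W = A / d[:, None]

# "Propagate weights from neighbors": one step replaces each node's weight
# with the normalized sum of its neighbors' weights.
v1 = d.copy()          # e.g. start from node degrees
v2 = W @ v1
print(dict(zip(nodes, v2)))
```

Repeating this multiplication is exactly the power iteration the rest of the section builds on.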
[Shi & Meila, 2002]
Spectral Clustering: Graph = Matrix
W*v1 = v2 “propagates weights from neighbors”
W v = λ v : v is an eigenvector of W with eigenvalue λ
[Figure: nodes plotted in the space of the eigenvectors e1, e2, e3; the three clusters (x’s, y’s, z’s) fall into well-separated groups. From Shi & Meila, 2002.]

Spectral Clustering: Graph = Matrix
W*v1 = v2 “propagates weights from neighbors”
W v = λ v : v is an eigenvector of W with eigenvalue λ
If W is connected but roughly block diagonal with k blocks, then
• the “top” eigenvector is a constant vector
• the next k eigenvectors are roughly piecewise constant, with “pieces” corresponding to blocks

Spectral clustering:
• Find the top k+1 eigenvectors v1, …, vk+1
• Discard the “top” one
• Replace every node a with the k-dimensional vector xa = <v2(a), …, vk+1(a)>
• Cluster with k-means

Spectral Clustering: Pros and Cons
• Elegant, and well-founded mathematically
• Tends to avoid local minima
  – Optimal solution to a relaxed version of the mincut problem (Normalized Cut, aka NCut)
• Works quite well when relations are approximately transitive (like similarity, social connections)
• Expensive for very large datasets
  – Computing eigenvectors is the bottleneck
  – Approximate eigenvector computation not always useful
• Noisy datasets sometimes cause problems
  – Picking the number of eigenvectors and k is tricky
  – “Informative” eigenvectors need not be in the top few
  – Performance can drop suddenly from good to terrible

Experimental results: best-case assignment of class labels to clusters
[Figure: clustering accuracy under the best-case cluster-to-label assignment.]

Spectral Clustering: Graph = Matrix
M*v1 = v2 “propagates weights from neighbors”
[Figure: iterating the multiplication; after another step the weights become, e.g., A=5, B=6, C=5, …]

Repeated averaging with neighbors as a clustering method
• Pick a vector v0 (maybe at random)
• Compute v1 = W v0
  – i.e., replace v0[x] with the weighted average of v0[y] for the neighbors y of x
• Plot v1[x] for each x
• Repeat for v2, v3, …
• Variants are widely used for semi-supervised learning
  – clamping of labels for nodes with known labels
• Without clamping, this converges to a constant vt
• What are the dynamics of this process?

Repeated averaging with neighbors on a sample problem…
• Create a graph, connecting all points in the 2-D initial space to all other points
• Weighted by distance
• Run power iteration for 10 steps
• Plot node id x vs v10(x)
• Nodes are ordered by actual cluster number
[Figures: over successive iterations the plotted values within the blue, green, and red clusters become locally constant; the differences between clusters shrink from “larger” to “smaller” to “very small”, but the clusters remain separable.]

PIC: Power Iteration Clustering
Run power iteration (repeated averaging with neighbors) with early stopping
– v0: random start, or “degree matrix” D, or …
– Easy to implement and efficient
– Very easily parallelized
– Experimentally, often better than traditional spectral methods
– Surprising, since the embedded space is 1-dimensional!
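As a rough sketch of the PIC recipe above (and not the authors’ released code), the following Python/numpy implementation runs the truncated power iteration from a degree-based start and then clusters the resulting 1-dimensional embedding with k-means. The simple tolerance-based stopping rule stands in for the acceleration-based rule of Lin & Cohen (ICML 2010); the threshold and helper names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def power_iteration_clustering(A, k, max_iter=1000, tol=1e-5):
    """PIC sketch: embed nodes by truncated power iteration, cluster with k-means.

    A: symmetric (n x n) affinity/adjacency matrix (no isolated nodes).
    k: number of clusters.
    The stopping rule below is a simplification of the one in Lin & Cohen (2010).
    """
    d = A.sum(axis=1)
    W = A / d[:, None]                      # W = D^-1 A, row-stochastic
    v = d / d.sum()                         # "degree" start (random also works)
    prev_delta = None
    for _ in range(max_iter):
        v_new = W @ v
        v_new = v_new / np.abs(v_new).sum() # keep the vector from shrinking
        delta = np.abs(v_new - v)
        # Stop early, once the per-node velocity has (nearly) stopped changing.
        if prev_delta is not None and np.abs(delta - prev_delta).max() < tol:
            v = v_new
            break
        prev_delta, v = delta, v_new
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(v.reshape(-1, 1))
    return v, labels
```

On a block-structured toy graph like the one sketched earlier, the nodes of each block end up at nearly constant values of v, which is why k-means on a single dimension is enough.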
Experiments
• “Network” problems: natural graph structure
  – PolBooks: 105 political books, 3 classes, linked by co-purchase
  – UMBCBlog: 404 political blogs, 2 classes, blogroll links
  – AGBlog: 1222 political blogs, 2 classes, blogroll links
• “Manifold” problems: cosine distance between classification instances
  – Iris: 150 flowers, 3 classes
  – PenDigits01, PenDigits17: 200 handwritten digits, 2 classes (0 vs 1, or 1 vs 7)
  – 20ngA: 200 docs, misc.forsale vs soc.religion.christian
  – 20ngB: 400 docs, misc.forsale vs soc.religion.christian
  – 20ngC: 20ngB + 200 docs from talk.politics.guns
  – 20ngD: 20ngC + 200 docs from rec.sport.baseball

Experimental results: best-case assignment of class labels to clusters
[Figure: clustering accuracy of PIC and spectral methods on these datasets under the best-case cluster-to-label assignment.]

Outline
• Spectral methods
  – Variant: Power Iteration Clustering [Lin & Cohen, ICML 2010]
    • Experiments
    • Analysis
  – Variant: PIC for document clustering
• Stochastic block models
  – Mixed-membership sparse block model [...]
  – Variants: BlockLDA etc.

Analysis: why is this working?
[Figures with annotations: “L2 distance scaling?”, “differences might cancel?”, and the “noise” terms.]

Analysis: why is this working?
• If
  – the eigenvectors e2, …, ek are approximately piecewise constant on blocks;
  – λ2, …, λk are “large” and λk+1, … are “small”
    • e.g., if the matrix is block-stochastic;
  – the ci’s for v0 are bounded;
  – for any a, b from distinct blocks there is at least one ei with ei(a) − ei(b) “large”
• Then there exists an R so that
  – spec(a,b) small ⇒ R·pic(a,b) small

Analysis: why is this working?
• Sum of differences vs. sum of squared differences
• “soft” eigenvector selection
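Writing out the step these analysis slides rely on, in the slides’ own notation (ei are eigenvectors of W with eigenvalues λi, ci the coefficients of the start vector v0); this is a sketch of the standard power-iteration expansion, not the paper’s full argument.

```latex
\begin{align*}
v^{0} &= c_1 e_1 + c_2 e_2 + \dots + c_n e_n \\
v^{t} = W^{t} v^{0}
      &= c_1 \lambda_1^{t} e_1 + c_2 \lambda_2^{t} e_2 + \dots + c_n \lambda_n^{t} e_n \\
v^{t}(a) - v^{t}(b)
      &= \sum_{i \ge 2} c_i \lambda_i^{t} \bigl(e_i(a) - e_i(b)\bigr)
\end{align*}
% The constant top eigenvector e_1 cancels in the difference; the "noise" terms
% with small \lambda_i decay fastest; what survives is a \lambda_i^t-weighted
% ("soft") selection of the informative eigenvectors, which ties the PIC
% distance pic(a,b) to the spectral distance spec(a,b) up to a factor R.
```

This is also why early stopping matters: run too long and everything collapses toward the constant e1; stop at the right time and the surviving terms give the “soft” eigenvector selection the slide names.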
Outline
• Spectral methods
  – Variant: Power Iteration Clustering [Lin & Cohen, ICML 2010]
  – Variant: PIC for document clustering
• Stochastic block models
  – Mixed-membership sparse block model [...]
  – Variants: BlockLDA with entropic regularization, BlockLDA with annotated documents

Motivation: Experimental Datasets are…
• “Network” problems: natural graph structure
  – PolBooks: 105 political books, 3 classes, linked by co-purchase
  – UMBCBlog: 404 political blogs, 2 classes, blogroll links
  – AGBlog: 1222 political blogs, 2 classes, blogroll links
  – Also: Zachary’s karate club, citation networks, ...
• “Manifold” problems: cosine distance between all pairs of classification instances (gets expensive fast)
  – Iris: 150 flowers, 3 classes
  – PenDigits01, PenDigits17: 200 handwritten digits, 2 classes (0 vs 1, or 1 vs 7)
  – 20ngA: 200 docs, misc.forsale vs soc.religion.christian
  – 20ngB: 400 docs, misc.forsale vs soc.religion.christian
  – …

Lazy computation of distances and normalizers
• Recall PIC’s update is
  – vt = W * vt-1 = D^-1 A * vt-1
  – …where D is the [diagonal] degree matrix: D = A*1, with 1 a column vector of 1’s
• My favorite distance metric for text is length-normalized TFIDF:
  – Definition: A(i,j) = <vi, vj> / (||vi|| * ||vj||), where <u,v> is the inner product and ||u|| is the L2 norm
  – Let N(i,i) = ||vi||, and N(i,j) = 0 for i ≠ j
  – Let F(i,k) = TFIDF weight of word wk in document vi
  – Then: A = N^-1 F^T F N^-1

Lazy computation of distances and normalizers
Equivalent to using TFIDF/cosine on all pairs of examples, but requires only sparse matrices
• Recall PIC’s update is
  – vt = W * vt-1 = D^-1 A * vt-1
  – …where D is the [diagonal] degree matrix: D = A*1
  – Let F(i,k) = TFIDF weight of word wk in document vi
  – Compute N(i,i) = ||vi||, and N(i,j) = 0 for i ≠ j
  – Don’t compute A = N^-1 F^T F N^-1
  – Let D(i,i) = the i-th entry of N^-1 F^T F N^-1 * 1, where 1 is an all-1’s vector
    • Computed as D = N^-1 (F^T (F (N^-1 * 1))) for efficiency
  – New update:
    • vt = D^-1 A * vt-1 = D^-1 N^-1 F^T F N^-1 * vt-1
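A minimal scipy.sparse sketch of this lazy update (an illustration, not the authors’ code): only the sparse TFIDF matrix F and the diagonal normalizers are ever materialized, never the dense document-document similarity matrix A. Here F is stored documents × words, so the cosine-similarity product appears as F Fᵀ; the slides write the same product with F oriented the other way.

```python
import numpy as np
import scipy.sparse as sp

def lazy_pic_step(F, v):
    """One PIC update v_t = D^-1 N^-1 F^T F N^-1 v_{t-1}, without forming A.

    F: sparse (n_docs x n_words) TFIDF matrix (assumes no empty documents).
    v: current (n_docs,) PIC vector.
    """
    # N^-1: inverse per-document L2 norms, kept as a vector (diagonal of N^-1).
    n_inv = 1.0 / np.sqrt(np.asarray(F.multiply(F).sum(axis=1)).ravel())
    # Degrees under cosine similarity: D = N^-1 F F^T N^-1 1,
    # computed right-to-left so every intermediate is just a vector.
    ones = np.ones(F.shape[0])
    deg = n_inv * (F @ (F.T @ (n_inv * ones)))
    # The same right-to-left trick for the update itself.
    return (n_inv * (F @ (F.T @ (n_inv * v)))) / deg
```

Each update costs two sparse matrix-vector products instead of building an n×n dense similarity matrix, which is what makes PIC feasible on the RCV1 subset described next.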
Experimental results
• RCV1 text classification dataset
  – 800k+ newswire stories
  – Category labels from an industry vocabulary
  – Took single-label documents and categories with at least 500 instances
  – Result: 193,844 documents, 103 categories
• Generated 100 random category pairs
  – Each is all documents from two categories
  – Range in size and difficulty
  – Pick category 1, with m1 examples
  – Pick category 2 such that 0.5*m1 < m2 < 2*m1

Results
• NCUTevd: NCut with exact eigenvectors
• NCUTiram: implicitly restarted Arnoldi method
• No statistically significant differences between NCUTevd and PIC
[Figures: accuracy and running time of NCUTevd, NCUTiram, and PIC on the 100 RCV1 category pairs.]

Outline
• Spectral methods
  – Variant: Power Iteration Clustering [Lin & Cohen, ICML 2010]
  – Variant: PIC for document clustering
• Stochastic block models
  – Mixed-membership sparse block model [Parkinnen et al, 2007]
  – Variants: BlockLDA with entropic regularization, BlockLDA with annotated documents

Question: How to model this?
[Figure: a protein-protein interaction matrix; entry (i,j), indexed by protein 1 and protein 2, marks whether p1 and p2 interact.]

MMSBM of Airoldi et al
1. Draw K^2 Bernoulli distributions
2. Draw a θi for each protein
3. For each entry i,j in the matrix:
   a) Draw zi* from θi
   b) Draw z*j from θj
   c) Draw mij from the Bernoulli associated with the pair of z’s

Question: How to model this? …we prefer:
Sparse block model of Parkinnen et al, 2007
1. Draw K multinomial distributions β (these define the “blocks”)
2. For each row in the link relation:
   a) Draw a class pair (zL*, z*R)
   b) Draw a protein i from the left multinomial associated with zL*
   c) Draw a protein j from the right multinomial associated with z*R
   d) Add (i,j) to the link relation

Learning method: Gibbs sampling
• Pick random cluster labels (z1, z2) for each link
• Repeat until convergence:
  – For each link (e1, e2):
    • Re-estimate Pr(e1 | Z1 = z1) and Pr(e2 | Z2 = z2) from the current clusterings (easy to update!)
    • Re-estimate Pr(Z1 = z1, Z2 = z2 | ·) from the current clusterings (easy to update!)
    • Re-assign (e1, e2) to (z1’, z2’) randomly according to these estimates

Gibbs sampler for sparse block model
Sampling the class pair for a link:
[Equation: the sampling probability is proportional to (probability of the class pair in the link corpus) × (probability of the two entities in their respective classes).]
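A compact sketch of that sampler (illustrative only; the smoothing hyperparameters alpha and gamma and all names are assumptions, not from the paper): each link keeps a class pair, and the pair is resampled in proportion to how common the pair is in the link corpus times how likely each endpoint is under its class.

```python
import numpy as np

def gibbs_sparse_block_model(links, n_entities, K, n_iters=200,
                             alpha=1.0, gamma=0.1, rng=None):
    """Sketch of Gibbs sampling for a sparse block model over links (i, j)."""
    rng = rng or np.random.default_rng(0)
    z = [tuple(rng.integers(K, size=2)) for _ in links]  # random class pair per link
    pair_counts = np.zeros((K, K))                        # how often each class pair occurs
    left_counts = np.zeros((K, n_entities))               # entity counts per left class
    right_counts = np.zeros((K, n_entities))              # entity counts per right class
    for (i, j), (zl, zr) in zip(links, z):
        pair_counts[zl, zr] += 1
        left_counts[zl, i] += 1
        right_counts[zr, j] += 1

    for _ in range(n_iters):
        for idx, (i, j) in enumerate(links):
            zl, zr = z[idx]
            # Remove this link's current assignment from the counts.
            pair_counts[zl, zr] -= 1
            left_counts[zl, i] -= 1
            right_counts[zr, j] -= 1
            # P(class pair) * P(entity i | left class) * P(entity j | right class)
            p = ((pair_counts + alpha)
                 * (left_counts[:, i] + gamma)[:, None]
                 / (left_counts.sum(axis=1) + gamma * n_entities)[:, None]
                 * (right_counts[:, j] + gamma)[None, :]
                 / (right_counts.sum(axis=1) + gamma * n_entities)[None, :])
            p = p.ravel() / p.sum()
            new = rng.choice(K * K, p=p)
            zl, zr = divmod(new, K)
            z[idx] = (zl, zr)
            pair_counts[zl, zr] += 1
            left_counts[zl, i] += 1
            right_counts[zr, j] += 1
    return z
```

Because reassigning one link only touches a few count cells, both factors really are “easy to update”, as the slide notes.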
How do these methods compare?
[Comparison figures, two slides.]

Also model entity-annotated text….
[Figure: plate diagram in which, for each of L documents, topic assignments z generate both the N words and the M protein annotations.]
Example abstract: “Vac1p coordinates Rab and phosphatidylinositol 3-kinase signaling in Vps45p-dependent vesicle docking/fusion at the endosome. The vacuolar protein sorting (VPS) pathway of Saccharomyces cerevisiae mediates transport of vacuolar protein precursors from the late Golgi to the lysosome-like vacuole. Sorting of some vacuolar proteins occurs via a prevacuolar endosomal compartment, and mutations in a subset of VPS genes (the class D VPS genes) interfere with the Golgi-to-endosome transport step. Several of the encoded proteins, including Pep12p/Vps6p (an endosomal target (t) SNARE) and Vps45p (a Sec1p homologue), bind each other directly [1]. Another of these proteins, Vac1p/Pep7p/Vps19p, associates with Pep12p and binds phosphatidylinositol 3-phosphate (PI(3)P), the product of the Vps34 phosphatidylinositol 3-kinase (PI 3-kinase) ……”
Protein annotations: PEP7, VPS45, VPS34, PEP12, VPS21

BlockLDA: jointly modeling entity-entity links and entity-annotated text
Entity distributions shared between “blocks” and “topics”
[Figure: plate diagram of the joint model.]

Another Performance Test
• Goal: predict “functional categories” of proteins
  – 15 categories at the top level (e.g., metabolism, cellular communication, cell fate, …)
  – Proteins have 2.1 categories on average
  – Method for predicting categories:
    • Run with 15 topics
    • Using held-out labeled data, associate each topic with its closest category
    • If a category has n true members, pick the top n proteins by probability of membership in the associated topic
  – Metric: F1, Precision, Recall

Performance
[Figure: F1, precision, and recall on the functional-category prediction task.]

Another test: manual evaluation of topics by experts
Joint work with Katie Rivard (MLD), John Woolford, Jelena Jakovljevic (CMU Biology)
Evaluation
• Topics from BlockLDA: trained on yeast publications + protein-protein interaction networks
• Topics from plain vanilla LDA: trained on only yeast publications
• Evaluate topics by asking:
  – Is the topic meaningful?
  – If so:
    • Which of the top 10 words are consistent with the topic’s meaning?
    • Which of the top 10 genes? Top 10 papers?

Let’s ask people who know: Yeast Biologists
Evaluate topics by asking:
• Are the top words for a topic meaningful?
• Are the top papers for a topic meaningful?
• Are the top genes for a topic meaningful?

Outline
• Spectral methods
  – Variant: Power Iteration Clustering [Lin & Cohen, ICML 2010]
  – Variant: PIC for document clustering
• Stochastic block models
  – Mixed-membership sparse block model [Parkinnen et al, 2007]
  – Variants [in progress]: BlockLDA with entropic regularization, BlockLDA with annotated documents

BlockLDA: adding regularization terms
BlockLDA + regularization
• “Pseudo-observe” low entropy for the role assignment to nodes ⇒ slightly mixed membership
• A similar idea balances cluster sizes

Outline
• Spectral methods
  – Variant: Power Iteration Clustering [Lin & Cohen, ICML 2010]
  – Variant: PIC for document clustering
• Stochastic block models
  – Mixed-membership sparse block model [Parkinnen et al, 2007]
  – Variants: BlockLDA with entropic regularization, BlockLDA with annotated documents

Conclusions
• Two new methods
  – PIC (~= fast spectral clustering)
    • Fast, robust
    • Easily extends to bipartite graphs (e.g., document-term graphs)
  – BlockLDA (~= mixed-membership block models ++)
    • Slower (longer convergence)
    • More flexible (mixed-membership) model
    • Easier to extend to use side information

Thanks to…
• NIH/NIGMS
• NSF
• Google
• Microsoft LiveLabs