Fast Effective Clustering
for Graphs and Documents
William W. Cohen
Machine Learning Dept. and Language Technologies Institute
School of Computer Science
Carnegie Mellon University
Joint work with: Frank Lin and
Ramnath Balasubramanyan
Introduction: trends in machine learning (1)
• Supervised learning: given data
(x1,y1),…,(xn,yn), learn to predict y from x
– y is a real number or member of small set
– x is a (sparse) vector
• Semi-supervised learning: given data
(x1,y1),…,(xk,yk),xk+1,…,xn learn to predict
y from x
• Unsupervised learning: given data x1,…,xn
find a “natural” clustering
Introduction: trends in machine learning (2)
• Supervised learning: given data
(x1,y1),…,(xn,yn), learn to predict y from x
– y is a real number or member of small set
– x is a (sparse) vector
– x’s are all i.i.d., independent of each other
– y depends only on the corresponding x
• Structured learning: x’s and/or y’s are
related to each other
Introduction: trends in machine learning (2)
• Structured learning: x’s and/or y’s are
related to each other
– General: x and y are in two parallel 1d arrays
• x’s are words in a document, y is POS tag
• x’s are words, y=1 if x is part of a company name
• x’s are DNA codons, y=1 if x is part of a gene
• …
– More general: x’s are nodes in a graph, y’s are
labels for these nodes
Examples of classification in graphs
• x is a web page, edge is hyperlink, y is topic
• x is a word, edge is co-occurrence in similar contexts,
y is semantics (distributional clustering)
• x is a protein, edge is interaction, y is subcellular
location
• x is a person, edge is email message, y is organization
• x is a person, edge is friendship, y=1 if x smokes
• …
• x,y are anything, edge from x1 to x2 indicates
similarity between x1 and x2
Examples: Zachary’s karate club,
political books, protein-protein
interactions, ….
Political blog network
Adamic & Glance
“Divided They Blog:…” 2004
Outline
• Spectral methods
– Variant: Power Iteration Clustering [Lin &
Cohen, ICML 2010]
– Variant: PIC for document clustering
• Stochastic block models
– Mixed-membership sparse block model
[Parkkinen et al, 2007]
– Variants: BlockLDA with entropic
regularization, BlockLDA with annotated
documents
This talk:
• Typical experiments:
– For networks with known “true” labels …
– can unsupervised learning recover these labels?
Spectral Clustering: Graph = Matrix
[Figure: a small example graph on nodes A through J shown alongside its adjacency matrix; each edge (i, j) puts a 1 in the corresponding matrix entry.]
Spectral Clustering: Graph = Matrix
Transitively Closed Components = “Blocks”
[Figure: the adjacency matrix of the example graph with nodes ordered by cluster, so each transitively closed component shows up as a dense diagonal block.]
Of course we can’t see the “blocks” unless the nodes are sorted by cluster…
Spectral Clustering: Graph = Matrix
Vector = Node → Weight
[Figure: the adjacency matrix M alongside a vector v that assigns a weight to each node, e.g. v(A)=3, v(B)=2, v(C)=3, …]
Spectral Clustering: Graph = Matrix
M*v1 = v2 “propagates weights from neighbors”
[Figure: M * v1 = v2 worked out on the example graph; each entry of v2 is the sum of the v1 weights of that node’s neighbors, e.g. with v1(A)=3, v1(B)=2, v1(C)=3 we get v2(A) = v1(B) + v1(C) = 2 + 3.]
Spectral Clustering: Graph = Matrix
W*v1 = v2 “propagates weights from neighbors”
W * v1 = v2, where W is M normalized so columns sum to 1
[Figure: the same multiplication with the normalized matrix W (entries like .5 and .3 instead of 1), so each entry of v2 is a weighted average of the neighbors’ v1 weights, e.g. v2(A) = .5*2 + .5*3.]
Spectral Clustering: Graph = Matrix
W*v1 = v2 “propagates weights from neighbors”
W v = λ v : v is an eigenvector with eigenvalue λ
Q: How do I pick v to be an eigenvector for a block-stochastic matrix?
Spectral Clustering: Graph = Matrix
W*v1 = v2 “propagates weights from neighbors”
W v = λ v : v is an eigenvector with eigenvalue λ
[Figure: the eigenvalues λ1, λ2, λ3, λ4, λ5,… of W in decreasing order, with the “eigengap” separating the top few eigenvalues (and their eigenvectors e1, e2, e3) from the rest. From Shi & Meila, 2002.]
Spectral Clustering: Graph = Matrix
W*v1 = v2 “propagates weights from neighbors”
W v = λ v : v is an eigenvector with eigenvalue λ
[Figure: nodes plotted by their coordinates in the top non-constant eigenvectors; points from the same cluster (the x’s, y’s, and z’s) fall into separate groups. From Shi & Meila, 2002.]
Spectral Clustering: Graph = Matrix
W*v1 = v2 “propagates weights from neighbors”
W v = λ v : v is an eigenvector with eigenvalue λ
If W is connected but roughly block diagonal with k blocks, then
• the “top” eigenvector is a constant vector
• the next k eigenvectors are roughly piecewise constant, with “pieces” corresponding to blocks
Spectral clustering:
• Find the top k+1
eigenvectors v1,…,vk+1
• Discard the “top” one
• Replace every node a
with k-dimensional vector
xa = <v2(a),…,vk+1 (a) >
• Cluster with k-means
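As a concrete illustration of this recipe, here is a minimal sketch (assuming a dense, symmetric adjacency matrix A and the availability of numpy and scikit-learn; the helper name spectral_cluster is ours, not from the talk):

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_cluster(A, k):
    """Sketch of the recipe above: find the top k+1 eigenvectors of
    W = D^-1 A, discard the constant one, and run k-means on the rest."""
    d = A.sum(axis=1)
    # W = D^-1 A is similar to the symmetric S = D^-1/2 A D^-1/2,
    # so its spectrum can be found with a symmetric eigensolver.
    S = A / np.sqrt(np.outer(d, d))
    vals, vecs = np.linalg.eigh(S)              # ascending eigenvalues
    top = vecs[:, np.argsort(-vals)[:k + 1]]    # top k+1 eigenvectors of S
    top = top / np.sqrt(d)[:, None]             # convert to eigenvectors of W
    X = top[:, 1:]                              # drop the constant "top" one
    return KMeans(n_clusters=k, n_init=10).fit_predict(X)
```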
Spectral Clustering: Pros and Cons
• Elegant, and well-founded mathematically
• Tends to avoid local minima
– Optimal solution to relaxed version of mincut problem
(Normalized cut, aka NCut)
• Works quite well when relations are approximately
transitive (like similarity, social connections)
• Expensive for very large datasets
– Computing eigenvectors is the bottleneck
– Approximate eigenvector computation not always useful
• Noisy datasets sometimes cause problems
– Picking number of eigenvectors and k is tricky
– “Informative” eigenvectors need not be in top few
– Performance can drop suddenly from good to terrible
Experimental results:
best-case assignment of class labels to clusters
Spectral Clustering: Graph = Matrix
M*v1 = v2 “propagates weights from neighbors”
[Figure: M * v1 = v2 on the example graph; each node’s new weight is the sum of its neighbors’ old weights, e.g. v2(A) = 5, v2(B) = 6, v2(C) = 5.]
Repeated averaging with neighbors as a
clustering method
• Pick a vector v0 (maybe at random)
• Compute v1 = Wv0
– i.e., replace v0[x] with weighted average of v0[y] for the
neighbors y of x
• Plot v1[x] for each x
• Repeat for v2, v3, …
• Variants widely used for semi-supervised learning
– clamping of labels for nodes with known labels
• Without clamping, will converge to constant vt
• What are the dynamics of this process?
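A small sketch of this process (the matrix A, the helper name repeated_averaging, and the random start are illustrative assumptions, not the talk’s exact code):

```python
import numpy as np

def repeated_averaging(A, t, v0=None, seed=0):
    """Repeat v <- W v with W = D^-1 A, keeping every iterate so the
    dynamics can be plotted."""
    rng = np.random.default_rng(seed)
    W = A / A.sum(axis=1, keepdims=True)   # row-normalize: weighted averaging
    v = rng.random(A.shape[0]) if v0 is None else v0
    history = [v]
    for _ in range(t):
        v = W @ v                          # replace v[x] with the weighted
        history.append(v)                  # average of v over x's neighbors
    return history
```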
Repeated averaging with neighbors on a sample problem…
• Create a graph, connecting all points in the 2-D initial space to all other points
• Weighted by distance
• Run power iteration for 10 steps
• Plot node id x vs v10(x)
• nodes are ordered by actual cluster number
[Figure: three clusters of 2-D points (blue, green, red) and the resulting plot of v10(x) against node id x, with nodes ordered by their true cluster.]
Repeated averaging with neighbors on a sample problem…
[Figure: the same plot after more and more iterations; the spread of values becomes very small, but nodes from the blue, green, and red clusters remain visibly grouped.]
PIC: Power Iteration Clustering
run power iteration (repeated averaging w/
neighbors) with early stopping
– V0: random start, or “degree matrix” D, or …
– Easy to implement and efficient
– Very easily parallelized
– Experimentally, often better than traditional spectral methods
– Surprising since the embedded space is 1-dimensional!
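A minimal sketch of PIC along these lines (the stopping threshold and the helper name pic are assumptions for illustration; see the ICML 2010 paper for the exact algorithm):

```python
import numpy as np
from sklearn.cluster import KMeans

def pic(A, k, max_iter=1000, eps=1e-5):
    """Power iteration with early stopping, then k-means on the resulting
    one-dimensional embedding."""
    W = A / A.sum(axis=1, keepdims=True)
    v = A.sum(axis=1) / A.sum()                # start from the normalized degrees
    prev_delta = None
    for _ in range(max_iter):
        v_new = W @ v
        v_new = v_new / np.abs(v_new).sum()    # keep the scale fixed
        delta = np.abs(v_new - v)
        # stop early, when the change in delta (the "acceleration") is tiny
        if prev_delta is not None and np.abs(delta - prev_delta).max() < eps / len(v):
            v = v_new
            break
        v, prev_delta = v_new, delta
    return KMeans(n_clusters=k, n_init=10).fit_predict(v.reshape(-1, 1))
```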
Experiments
• “Network” problems: natural graph structure
– PolBooks: 105 political books, 3 classes, linked by copurchaser
– UMBCBlog: 404 political blogs, 2 classes, blogroll links
– AGBlog: 1222 political blogs, 2 classes, blogroll links
• “Manifold” problems: cosine distance between
classification instances
– Iris: 150 flowers, 3 classes
– PenDigits01,17: 200 handwritten digits, 2 classes (0-1 or 1-7)
– 20ngA: 200 docs, misc.forsale vs soc.religion.christian
– 20ngB: 400 docs, misc.forsale vs soc.religion.christian
– 20ngC: 20ngB + 200 docs from talk.politics.guns
– 20ngD: 20ngC + 200 docs from rec.sport.baseball
Experimental results:
best-case assignment of class labels to clusters
Outline
• Spectral methods
– Variant: Power Iteration Clustering [Lin &
Cohen, ICML 2010]
• Experiments
• Analysis
– Variant: PIC for document clustering
• Stochastic block models
– Mixed-membership sparse block model [...]
– Variants: BlockLDA etc
Analysis: why is this working?
[Equation figure: comparing the spectral (L2) distance between two nodes’ embeddings with the PIC difference; callouts on the terms ask about scaling, whether differences might cancel, and label the “noise” terms.]
Analysis: why is this working?
• If
– eigenvectors e2,…,ek are approximately
piecewise constant on blocks;
– λ2,…, λk are “large” and λk+1,… are “small”;
• e.g., if matrix is block-stochastic
– the ci’s for v0 are bounded;
– for any a,b from distinct blocks there is at
least one ei with ei(a)-ei(b) “large”
• Then there exists an R so that
– spec(a,b) small → R*pic(a,b) small
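Sketching why, in the notation above (a reconstruction of the standard power-iteration expansion; the precise constants and bounds are in the paper):

```latex
\[
  v^{t} = W^{t} v^{0}
        = c_1 e_1 + \sum_{i=2}^{k} c_i \lambda_i^{t} e_i
          + \sum_{i>k} c_i \lambda_i^{t} e_i
\]
\[
  v^{t}(a) - v^{t}(b) \approx \sum_{i=2}^{k} c_i \lambda_i^{t}
  \bigl[ e_i(a) - e_i(b) \bigr]
\]
% e_1 is constant so it cancels, and the i > k ("noise") terms decay quickly;
% what remains is a weighted combination of the same eigenvector differences
% that spectral clustering uses.
```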
Analysis: why is this working?
• Sum of differences vs sum-of-squared differences
• “soft” eigenvector selection
Outline
• Spectral methods
– Variant: Power Iteration Clustering [Lin &
Cohen, ICML 2010]
– Variant: PIC for document clustering
• Stochastic block models
– Mixed-membership sparse block model [...]
– Variants: BlockLDA with entropic
regularization, BlockLDA with annotated
documents
Motivation: Experimental Datasets are…
• “Network” problems: natural graph structure
– PolBooks: 105 political books, 3 classes, linked by copurchaser
– UMBCBlog: 404 political blogs, 2 classes, blogroll links
– AGBlog: 1222 political blogs, 2 classes, blogroll links
– Also: Zachary’s karate club, citation networks, ...
• “Manifold” problems: cosine distance between all pairs of classification instances (gets expensive fast)
– Iris: 150 flowers, 3 classes
– PenDigits01,17: 200 handwritten digits, 2 classes (0-1 or 1-7)
– 20ngA: 200 docs, misc.forsale vs soc.religion.christian
– 20ngB: 400 docs, misc.forsale vs soc.religion.christian
– …
Lazy computation of distances and normalizers
• Recall PIC’s update is
– vt = W * vt-1 = D^-1 A * vt-1
– …where D is the [diagonal] degree matrix: D = A*1, and 1 is a column vector of 1’s
• My favorite distance metric for text is length-normalized TFIDF:
– Def’n: A(i,j) = <vi, vj> / (||vi|| * ||vj||), where <u,v> is the inner product and ||u|| is the L2 norm
– Let N(i,i) = ||vi|| … and N(i,j) = 0 for i != j
– Let F(i,k) = TFIDF weight of word wk in document vi
– Then: A = N^-1 F F^T N^-1
Lazy computation of distances and normalizers
(Equivalent to using TFIDF/cosine on all pairs of examples, but requires only sparse matrices)
• Recall PIC’s update is
– vt = W * vt-1 = D^-1 A * vt-1
– …where D is the [diagonal] degree matrix: D = A*1
– Let F(i,k) = TFIDF weight of word wk in document vi
– Compute N(i,i) = ||vi|| … and N(i,j) = 0 for i != j
– Don’t compute A = N^-1 F F^T N^-1
– Let D(i,i) = [N^-1 F F^T N^-1 * 1](i), where 1 is an all-1’s vector
• Computed as D = N^-1 (F (F^T (N^-1 * 1))) for efficiency
– New update:
• vt = D^-1 A * vt-1 = D^-1 N^-1 F F^T N^-1 * vt-1
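A minimal sketch of this lazy update with sparse matrices (assuming scipy.sparse, a document-by-word TFIDF matrix F with rows as documents, and the helper name lazy_pic_update; in practice the norms and d would be computed once, outside the PIC loop):

```python
import numpy as np
import scipy.sparse as sp

def lazy_pic_update(F, v):
    """One PIC step v <- D^-1 A v for the cosine/TFIDF similarity
    A = N^-1 F F^T N^-1, computed lazily: A is never materialized,
    only sparse products with F are taken."""
    # N(i,i) = ||v_i||, the L2 norm of document i's TFIDF row
    norms = np.sqrt(np.asarray(F.multiply(F).sum(axis=1)).ravel())
    Ninv = sp.diags(1.0 / norms)
    ones = np.ones(F.shape[0])
    # D(i,i) = [N^-1 F F^T N^-1 * 1](i), evaluated right-to-left for efficiency
    d = Ninv @ (F @ (F.T @ (Ninv @ ones)))
    # v_t = D^-1 N^-1 F F^T N^-1 v_{t-1}
    return (Ninv @ (F @ (F.T @ (Ninv @ v)))) / d
```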
Experimental results
• RCV1 text classification dataset
– 800k+ newswire stories
– Category labels from industry vocabulary
– Took single-label documents and categories with at
least 500 instances
– Result: 193,844 documents, 103 categories
• Generated 100 random category pairs
– Each is all documents from two categories
– Range in size and difficulty
– Pick category 1, with m1 examples
– Pick category 2 such that 0.5*m1 < m2 < 2*m1
Results
• NCUTevd: NCut with exact eigenvectors
• NCUTiram: NCut with the implicitly restarted Arnoldi method
• No stat. signif. diffs between NCUTevd and PIC
Outline
• Spectral methods
– Variant: Power Iteration Clustering [Lin &
Cohen, ICML 2010]
– Variant: PIC for document clustering
• Stochastic block models
– Mixed-membership sparse block model
[Parkkinen et al, 2007]
– Variants: BlockLDA with entropic
regularization, BlockLDA with annotated
documents
Question: How to model this?
[Figure: a protein-protein interaction matrix, indexed by protein 1 on one axis and protein 2 on the other; a point at (p1, p2) means p1, p2 do interact.]
MMSBM of Airoldi et al
1. Draw K^2 Bernoulli distributions
2. Draw a θi for each protein
3. For each entry i,j in the matrix:
a) Draw zi* from θi
b) Draw z*j from θj
c) Draw mij from the Bernoulli associated with the pair of z’s.
Question: How to model this?
Sparse block model of Parkkinen et al, 2007 (the one we prefer…)
1. Draw K multinomial distributions β (these define the “blocks”)
2. For each row in the link relation:
a) Draw a class pair (zL, zR)
b) Draw a protein i from the left multinomial associated with zL
c) Draw a protein j from the right multinomial associated with zR
d) Add (i, j) to the link relation
[Figure: the protein-protein interaction matrix again, with interacting pairs (p1, p2) grouped into blocks by their class pair.]
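To make the generative process concrete, here is a small sketch of sampling a synthetic link relation from it (K, the number of proteins and links, and the Dirichlet hyperparameters are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
K, n_proteins, n_links = 3, 50, 200

# K multinomial distributions over proteins -- these define the "blocks"
beta = rng.dirichlet(np.ones(n_proteins) * 0.1, size=K)
# a distribution over (left block, right block) class pairs
pair_probs = rng.dirichlet(np.ones(K * K)).reshape(K, K)

links = []
for _ in range(n_links):
    zL, zR = np.unravel_index(rng.choice(K * K, p=pair_probs.ravel()), (K, K))
    i = rng.choice(n_proteins, p=beta[zL])   # left protein from block zL
    j = rng.choice(n_proteins, p=beta[zR])   # right protein from block zR
    links.append((i, j))                     # add (i, j) to the link relation
```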
Learning method: Gibbs sampling
• Pick random cluster labels (z1,z2) for
each link
• Repeat until convergence:
– For each link (e1,e2):
• Re-estimate Pr(e1|Z1=z1), Pr(e2|Z2=z2) from current clusterings (easy to update!)
• Re-estimate Pr(Z1,Z2=z1,z2|.) from current clusterings (easy to update!)
• Re-assign (e1,e2) to (z1’, z2’) randomly according to these estimates
Gibbs sampler for sparse block model
[Equation figure: sampling the class pair for a link; the conditional probability factors into the probability of the class pair in the link corpus times the probability of the two entities in their respective classes.]
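A minimal collapsed-Gibbs sketch of this sampler (the smoothing constants alpha and gamma, and all variable names, are assumptions for illustration; the talk’s own sampler may differ in details):

```python
import numpy as np

def gibbs_sample(links, n_entities, K, n_iters=100, alpha=1.0, gamma=0.1, seed=0):
    rng = np.random.default_rng(seed)
    links = np.asarray(links)
    z = rng.integers(K, size=(len(links), 2))     # random (z1, z2) per link
    pair_counts = np.zeros((K, K))                # links per class pair
    ent_counts = np.zeros((2, K, n_entities))     # entity counts per side/class
    for n, (i, j) in enumerate(links):
        pair_counts[z[n, 0], z[n, 1]] += 1
        ent_counts[0, z[n, 0], i] += 1
        ent_counts[1, z[n, 1], j] += 1
    for _ in range(n_iters):
        for n, (i, j) in enumerate(links):
            z1, z2 = z[n]
            # remove this link's current assignment from the counts
            pair_counts[z1, z2] -= 1
            ent_counts[0, z1, i] -= 1
            ent_counts[1, z2, j] -= 1
            # P(class pair) * P(entity i | left class) * P(entity j | right class)
            p = (pair_counts + alpha) \
                * ((ent_counts[0, :, i] + gamma) /
                   (ent_counts[0].sum(axis=1) + gamma * n_entities))[:, None] \
                * ((ent_counts[1, :, j] + gamma) /
                   (ent_counts[1].sum(axis=1) + gamma * n_entities))[None, :]
            flat = rng.choice(K * K, p=(p / p.sum()).ravel())
            z1, z2 = np.unravel_index(flat, (K, K))
            z[n] = z1, z2
            pair_counts[z1, z2] += 1
            ent_counts[0, z1, i] += 1
            ent_counts[1, z2, j] += 1
    return z
```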
How do these methods compare?
Also model entity-annotated text….
[Figure: plate diagram of the model; per-document parameters generate topic assignments z for both the English words and the protein (“prot”) annotations (plates N, L, M).]
Vac1p coordinates Rab and phosphatidylinositol 3-kinase
signaling in Vps45p-dependent vesicle docking/fusion at
the endosome. The vacuolar protein sorting (VPS) pathway
of Saccharomyces cerevisiae mediates transport of vacuolar
protein precursors from the late Golgi to the lysosome-like
vacuole. Sorting of some vacuolar proteins occurs via a
prevacuolar endosomal compartment and mutations in a
subset of VPS genes (the class D VPS genes) interfere with
the Golgi-to-endosome transport step. Several of the encoded
proteins, including Pep12p/Vps6p (an endosomal target (t)
SNARE) and Vps45p (a Sec1p homologue), bind each other
directly [1]. Another of these proteins, Vac1p/Pep7p/Vps19p,
associates with Pep12p and binds phosphatidylinositol 3phosphate (PI(3)P), the product of the Vps34
phosphatidylinositol 3-kinase (PI 3-kinase) ......
Protein annotations:
PEP7, VPS45, VPS34, PEP12, VPS21
BlockLDA: jointly modeling entity-entity links and entity-annotated text
Entity distributions shared between “blocks” and “topics”
Another Performance Test
• Goal: predict “functional categories” of
proteins
– 15 categories at top-level (e.g., metabolism, cellular
communication, cell fate, …)
– Proteins have 2.1 categories on average
– Method for predicting categories:
• Run with 15 topics
• Using held-out labeled data, associate topics with closest
category
• If category has n true members, pick top n proteins by
probability of membership in associated topic.
– Metric: F1, Precision, Recall
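A small sketch of this evaluation (the inputs topic_probs, true_cats, and topic_to_cat, i.e. the topic-to-category matching learned on the held-out labeled data, are assumed here):

```python
import numpy as np

def evaluate(topic_probs, true_cats, topic_to_cat):
    """topic_probs: (n_proteins x n_topics) membership probabilities,
    true_cats: protein id -> set of true category ids,
    topic_to_cat: topic index -> closest category id."""
    n_proteins, n_topics = topic_probs.shape
    tp = fp = fn = 0
    for t in range(n_topics):
        cat = topic_to_cat[t]
        members = {p for p, cats in true_cats.items() if cat in cats}
        n = len(members)                                      # n true members
        predicted = set(np.argsort(-topic_probs[:, t])[:n])   # top n by P(topic)
        tp += len(predicted & members)
        fp += len(predicted - members)
        fn += len(members - predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```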
Performance
Another test: manual evaluation of
topics by experts
Evaluation
• Topics from BlockLDA: trained on yeast publications + protein-protein interaction networks
• Topics from plain vanilla LDA: trained on only yeast publications
Joint with Katie Rivard (MLD), John Woolford, Jelena Jakovljevic (CMU Biology)
Evaluate topics by asking:
• is the topic meaningful?
• if so:
– which of the top 10 words are consistent with the topic’s meaning?
– which of the top 10 genes? top 10 papers?
Let’s ask people who know yeast: Biologists
Evaluate topics by asking -
• are the top words for a topic meaningful?
• are the top papers for a topic meaningful?
• are the top genes for a topic meaningful?
Outline
• Spectral methods
– Variant: Power Iteration Clustering [Lin &
Cohen, ICML 2010]
– Variant: PIC for document clustering
• Stochastic block models
– Mixed-membership sparse block model
[Parkkinen et al, 2007]
– Variants [in progress]: BlockLDA with entropic regularization, BlockLDA with annotated documents
BlockLDA: adding regularization terms
BlockLDA + regularization
• “Pseudo-observe” low entropy for role assignment to nodes → slightly mixed membership
• Similar idea balances cluster sizes
Outline
• Spectral methods
– Variant: Power Iteration Clustering [Lin &
Cohen, ICML 2010]
– Variant: PIC for document clustering
• Stochastic block models
– Mixed-membership sparse block model
[Parkkinen et al, 2007]
– Variants: BlockLDA with entropic
regularization, BlockLDA with annotated
documents
Conclusions
• Two new methods
– PIC (~= fast spectral clustering)
• Fast, robust
• Easily extends to bipartite graphs (e.g.,
document-term graphs)
– BlockLDA (~= mixed-membership block
models ++)
• Slower – longer convergence
• More flexible (mixed-membership) model
• Easier to extend to use side information
Thanks to…
• NIH/NIGMS
• NSF
• Google
• Microsoft LiveLabs