Transcript cluster.ppt
Similarity and clustering
Motivation
• Problem: Query word could be ambiguous:
– Eg: Query “Star” retrieves documents about astronomy, plants, animals, etc.
– Solution: Visualisation
• Clustering document responses to queries along lines of
different topics.
• Problem 2: Manual construction of topic
hierarchies and taxonomies
– Solution:
• Preliminary clustering of large samples of web
documents.
• Problem 3: Speeding up similarity search
– Solution: Clustering
• Restrict the search for documents similar to a query to
most representative cluster(s).
Example
Scatter/Gather, a text clustering system, can separate salient topics in the response to keyword queries. (Image courtesy of Hearst)
Clustering
• Task: Evolve measures of similarity to cluster a collection of documents/terms into groups, such that similarity within a cluster is larger than across clusters.
• Cluster Hypothesis: Given a 'suitable' clustering of a
collection, if the user is interested in document/term d/t, he is
likely to be interested in other members of the cluster to
which d/t belongs.
• Similarity measures
– Represent documents by TFIDF vectors
– Distance between document vectors
– Cosine of angle between document vectors
• Issues
– Large number of noisy dimensions
– Notion of noise is application dependent
Top-down clustering
• k-Means: Repeat…
– Choose k arbitrary ‘centroids’
– Assign each document to nearest centroid
– Recompute centroids
• Expectation maximization (EM):
– Pick k arbitrary ‘distributions’
– Repeat:
• Find probability that document d is generated
from distribution f for all d and f
• Estimate distribution parameters from weighted
contribution of documents
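A minimal sketch of the hard k-means loop above (illustrative, not from the slides; documents are assumed to be rows of a dense TFIDF matrix and distance is Euclidean). The EM variant would replace the hard nearest-centroid assignment with per-distribution posteriors.

import numpy as np

def kmeans(X, k, iters=20, seed=0):
    # X: (n_docs, n_terms) TFIDF matrix; k: number of clusters
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # arbitrary initial centroids
    for _ in range(iters):
        # assign each document to the nearest centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        # recompute each centroid as the mean of its assigned documents
        for j in range(k):
            if (assign == j).any():
                centroids[j] = X[assign == j].mean(axis=0)
    return assign, centroids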
Choosing `k’
• Mostly problem driven
• Could be ‘data driven’ only when either
– Data is not sparse
– Measurement dimensions are not too noisy
• Interactive
– Data analyst interprets results of structure
discovery
Choosing ‘k’ : Approaches
• Hypothesis testing:
– Null Hypothesis (Ho): Underlying density is a
mixture of ‘k’ distributions
– Require regularity conditions on the mixture
likelihood function (Smith’85)
• Bayesian Estimation
– Estimate posterior distribution on k, given data
and prior on k.
– Difficulty: Computational complexity of integration
– Autoclass algorithm of (Cheeseman’98) uses
approximations
– (Diebolt’94) suggests sampling techniques
Choosing ‘k’ : Approaches
• Penalised Likelihood
– To account for the fact that L_k(D) is a non-decreasing function of k.
– Penalise the number of parameters
– Examples: Bayesian Information Criterion (BIC), Minimum Description Length (MDL), MML.
– Assumption: Penalised criteria are asymptotically
optimal (Titterington 1985)
• Cross Validation Likelihood
– Find ML estimate on part of training data
– Choose k that maximises the average of the M cross-validated average likelihoods on held-out data D_test
– Cross Validation techniques: Monte Carlo Cross
Validation (MCCV), v-fold cross validation (vCV)
Similarity and clustering
Motivation
• Problem: Query word could be ambiguous:
– Eg: Query “Star” retrieves documents about astronomy, plants, animals, etc.
– Solution: Visualisation
• Clustering document responses to queries along lines of
different topics.
• Problem 2: Manual construction of topic
hierarchies and taxonomies
– Solution:
• Preliminary clustering of large samples of web
documents.
• Problem 3: Speeding up similarity search
– Solution: Clustering
• Restrict the search for documents similar to a query to
most representative cluster(s).
Example
Scatter/Gather, a text clustering system, can separate salient topics in the response to keyword queries. (Image courtesy of Hearst)
Clustering
• Task: Evolve measures of similarity to cluster a collection of documents/terms into groups, such that similarity within a cluster is larger than across clusters.
• Cluster Hypothesis: Given a 'suitable' clustering of
a collection, if the user is interested in
document/term d/t, he is likely to be interested in
other members of the cluster to which d/t belongs.
• Collaborative filtering: Clustering of two or more types of objects that have a bipartite relationship
Clustering (contd)
• Two important paradigms:
– Bottom-up agglomerative clustering
– Top-down partitioning
• Visualisation techniques: Embedding of
corpus in a low-dimensional space
• Characterising the entities:
– Internally : Vector space model, probabilistic
models
– Externally: Measure of similarity/dissimilarity
between pairs
• Learning: Supplement stock algorithms with
experience with data
Clustering: Parameters
• Similarity measure (eg: cosine similarity): s(d_1, d_2)
• Distance measure (eg: Euclidean distance): \delta(d_1, d_2)
• Number “k” of clusters
• Issues
– Large number of noisy dimensions
– Notion of noise is application dependent
Clustering: Formal specification
• Partitioning Approaches
– Bottom-up clustering
– Top-down clustering
• Geometric Embedding Approaches
– Self-organization map
– Multidimensional scaling
– Latent semantic indexing
• Generative models and probabilistic
approaches
– Single topic per document
– Documents correspond to mixtures of multiple
topics
Partitioning Approaches
• Partition document collection into k clusters \{D_1, D_2, \ldots, D_k\}
• Choices:
– Minimize intra-cluster distance \sum_i \sum_{d_1, d_2 \in D_i} \delta(d_1, d_2)
– Maximize intra-cluster semblance \sum_i \sum_{d_1, d_2 \in D_i} s(d_1, d_2)
• If cluster representations D_i are available
– Minimize \sum_i \sum_{d \in D_i} \delta(d, D_i)
– Maximize \sum_i \sum_{d \in D_i} s(d, D_i)
• Soft clustering
– d assigned to D_i with 'confidence' z_{d,i}
– Find z_{d,i} so as to minimize \sum_i \sum_{d \in D_i} z_{d,i}\, \delta(d, D_i) or maximize \sum_i \sum_{d \in D_i} z_{d,i}\, s(d, D_i)
• Two ways to get partitions – bottom-up clustering and top-down clustering
Bottom-up clustering (HAC)
• Initially G is a collection of singleton groups,
each with one document d
• Repeat
– Find \Gamma, \Delta in G with the max similarity measure s(\Gamma \cup \Delta)
– Merge group \Gamma with group \Delta
• For each \Gamma keep track of the best \Delta
• Use the above info to plot the hierarchical merging process (DENDROGRAM)
• To get the desired number of clusters: cut across any level of the dendrogram
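An illustrative sketch of this bottom-up loop, under the assumption that documents are given as L2-normalised profile vectors so that group self-similarity is the average pairwise cosine similarity; the recorded merge history doubles as the dendrogram.

import numpy as np

def hac(P, k):
    # P: (n_docs, n_terms) matrix of L2-normalised document profiles
    groups = [[i] for i in range(len(P))]     # initially G is a collection of singletons
    history = []                              # merge history -> dendrogram
    S = P @ P.T                               # pairwise cosine similarities
    def self_sim(g):
        idx = np.array(g)
        m = len(idx)
        if m < 2:
            return 0.0
        sub = S[np.ix_(idx, idx)]
        return (sub.sum() - m) / (m * (m - 1))          # average over distinct pairs
    while len(groups) > k:
        best, pair = -np.inf, None
        for i in range(len(groups)):                    # find the pair with max s(merged group)
            for j in range(i + 1, len(groups)):
                s = self_sim(groups[i] + groups[j])
                if s > best:
                    best, pair = s, (i, j)
        i, j = pair
        history.append((groups[i], groups[j]))          # record the merge
        groups[i] = groups[i] + groups[j]
        del groups[j]
    return groups, history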
Dendogram
A dendrogram presents the progressive, hierarchy-forming merging process pictorially.
Similarity measure
• Typically s(\Gamma \cup \Delta) decreases with increasing number of merges
• Self-Similarity
– Average pairwise similarity between documents in \Gamma:
  s(\Gamma) = \frac{1}{\binom{|\Gamma|}{2}} \sum_{d_1, d_2 \in \Gamma} s(d_1, d_2)
– s(d_1, d_2) = inter-document similarity measure (say cosine of TFIDF vectors)
– Other criteria: Maximum/Minimum pairwise similarity between documents in the clusters
Computation
Un-normalized group profile: \hat{p}(\Gamma) = \sum_{d \in \Gamma} p(d)
Can show:
s(\Gamma) = \frac{\langle \hat{p}(\Gamma), \hat{p}(\Gamma) \rangle - |\Gamma|}{|\Gamma|\,(|\Gamma| - 1)}
s(\Gamma \cup \Delta) = \frac{\langle \hat{p}(\Gamma \cup \Delta), \hat{p}(\Gamma \cup \Delta) \rangle - |\Gamma \cup \Delta|}{|\Gamma \cup \Delta|\,(|\Gamma \cup \Delta| - 1)}
\langle \hat{p}(\Gamma \cup \Delta), \hat{p}(\Gamma \cup \Delta) \rangle = \langle \hat{p}(\Gamma), \hat{p}(\Gamma) \rangle + \langle \hat{p}(\Delta), \hat{p}(\Delta) \rangle + 2 \langle \hat{p}(\Gamma), \hat{p}(\Delta) \rangle
O(n^2 \log n) algorithm with O(n^2) space
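A small sketch of how the cached inner products make merges cheap (hypothetical helper names; the self-similarity expression follows the reconstruction above, so only the unnormalised profiles need to be added when two groups merge).

import numpy as np

def self_similarity(p_hat, size):
    # s(Gamma) from the cached inner product <p_hat(Gamma), p_hat(Gamma)> and group size |Gamma|
    return (p_hat @ p_hat - size) / (size * (size - 1))

def merge(p_hat_g, size_g, p_hat_d, size_d):
    # <p_hat(G u D), p_hat(G u D)> = <pG,pG> + <pD,pD> + 2<pG,pD>, so adding profiles suffices
    p_hat = p_hat_g + p_hat_d
    size = size_g + size_d
    return p_hat, size, self_similarity(p_hat, size)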
Similarity
• s(d_1, d_2) = \frac{\langle g(c(d_1)),\, g(c(d_2)) \rangle}{\| g(c(d_1)) \|\, \| g(c(d_2)) \|}, where \langle \cdot, \cdot \rangle is the inner product
• Normalized document profile: p(d) = \frac{g(c(d))}{\| g(c(d)) \|}
• Profile for document group \Gamma: p(\Gamma) = \frac{\sum_{d \in \Gamma} p(d)}{\| \sum_{d \in \Gamma} p(d) \|}
Switch to top-down
• Bottom-up
– Requires quadratic time and space
• Top-down or move-to-nearest
– Internal representation for documents as well as
clusters
– Partition documents into `k’ clusters
– 2 variants
• “Hard” (0/1) assignment of documents to clusters
• “soft” : documents belong to clusters, with fractional
scores
– Termination
• when assignment of documents to clusters ceases to
change much OR
• When cluster centroids move negligibly over successive
iterations
Top-down clustering
• Hard k-Means: Repeat…
– Choose k arbitrary ‘centroids’
– Assign each document to nearest centroid
– Recompute centroids
• Soft k-Means :
– Don’t break close ties between document assignments to
clusters
– Don’t make documents contribute to a single cluster which
wins narrowly
• Contribution for updating cluster centroid \mu_c from document d is related to the current similarity between \mu_c and d:
  \Delta\mu_c \propto \frac{\exp(-|d - \mu_c|^2)}{\sum_{\gamma} \exp(-|d - \mu_\gamma|^2)}\,(d - \mu_c)
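A sketch of one soft update pass (illustrative; the Gaussian-style weights follow the expression above, and the learning rate eta is a hypothetical knob rather than anything prescribed by the slides).

import numpy as np

def soft_kmeans_step(X, centroids, eta=0.1):
    # X: (n_docs, dim) documents; centroids: (k, dim) current cluster centroids
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)   # |d - mu_c|^2
    w = np.exp(-d2)
    w /= w.sum(axis=1, keepdims=True)       # each document spreads its contribution over clusters
    for c in range(len(centroids)):
        # move mu_c towards each document in proportion to that document's weight for c
        centroids[c] += eta * (w[:, c:c + 1] * (X - centroids[c])).mean(axis=0)
    return centroids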
Seeding `k’ clusters
• Randomly sample O(\sqrt{kn}) documents
• Run bottom-up group average
clustering algorithm to reduce to k
groups or clusters : O(knlogn) time
• Iterate assign-to-nearest O(1) times
– Move each document to nearest cluster
– Recompute cluster centroids
• Total time taken is O(kn)
• Non-deterministic behavior
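An illustrative sketch of this seeding scheme, reusing the hac sketch from the bottom-up slide (which assumes normalised profiles); the sqrt(kn) sample size and the constant number of assign-to-nearest passes follow the slide.

import numpy as np

def seed_and_cluster(X, k, passes=3, seed=0):
    rng = np.random.default_rng(seed)
    m = int(np.ceil(np.sqrt(k * len(X))))                    # sample O(sqrt(kn)) documents
    sample = X[rng.choice(len(X), size=m, replace=False)]
    groups, _ = hac(sample, k)                               # bottom-up clustering of the sample
    centroids = np.array([sample[g].mean(axis=0) for g in groups])
    for _ in range(passes):                                  # O(1) assign-to-nearest iterations
        assign = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        for j in range(k):
            if (assign == j).any():
                centroids[j] = X[assign == j].mean(axis=0)   # recompute cluster centroids
    return assign, centroids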
Choosing `k’
• Mostly problem driven
• Could be ‘data driven’ only when either
– Data is not sparse
– Measurement dimensions are not too noisy
• Interactive
– Data analyst interprets results of structure
discovery
Choosing ‘k’ : Approaches
• Hypothesis testing:
– Null Hypothesis (Ho): Underlying density is a
mixture of ‘k’ distributions
– Require regularity conditions on the mixture
likelihood function (Smith’85)
• Bayesian Estimation
– Estimate posterior distribution on k, given data
and prior on k.
– Difficulty: Computational complexity of integration
– Autoclass algorithm of (Cheeseman’98) uses
approximations
– (Diebolt’94) suggests sampling techniques
Choosing ‘k’ : Approaches
• Penalised Likelihood
– To account for the fact that L_k(D) is a non-decreasing function of k.
– Penalise the number of parameters
– Examples: Bayesian Information Criterion (BIC), Minimum Description Length (MDL), MML.
– Assumption: Penalised criteria are asymptotically
optimal (Titterington 1985)
• Cross Validation Likelihood
– Find ML estimate on part of training data
– Choose k that maximises the average of the M cross-validated average likelihoods on held-out data D_test
– Cross Validation techniques: Monte Carlo Cross
Validation (MCCV), v-fold cross validation (vCV)
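A sketch of the penalised-likelihood route using scikit-learn's Gaussian mixture (a library choice assumed here, not prescribed by the slides): fit a mixture for each candidate k and keep the one with the lowest BIC. A cross-validation variant would instead average held-out log-likelihoods over MCCV or v-fold splits.

import numpy as np
from sklearn.mixture import GaussianMixture

def choose_k_bic(X, k_values=range(2, 11)):
    scores = {}
    for k in k_values:
        gm = GaussianMixture(n_components=k, random_state=0).fit(X)
        scores[k] = gm.bic(X)          # penalised likelihood: lower BIC is better
    best_k = min(scores, key=scores.get)
    return best_k, scores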
Visualisation techniques
• Goal: Embedding of corpus in a low-dimensional space
• Hierarchical Agglomerative Clustering (HAC)
– lends itself easily to visualisation
• Self-Organization map (SOM)
– A close cousin of k-means
• Multidimensional scaling (MDS)
– minimize the distortion of interpoint distances in
the low-dimensional embedding as compared to
the dissimilarity given in the input data.
• Latent Semantic Indexing (LSI)
– Linear transformations to reduce number of
dimensions
Self-Organization Map (SOM)
• Like soft k-means
– Determine association between clusters and documents
– Associate a representative vector c with each cluster and
iteratively refine c
• Unlike k-means
– Embed the clusters in a low-dimensional space right from
the beginning
– Large number of clusters can be initialised even if eventually
many are to remain devoid of documents
• Each cluster can be a slot in a square/hexagonal grid.
• The grid structure defines the neighborhood N(c) for
each cluster c
• Also involves a proximity function h(c, \gamma) between clusters c and \gamma
SOM : Update Rule
• Like Neural network
– Data item d activates neuron (closest cluster) c_d as well as the neighborhood neurons N(c_d)
– Eg: Gaussian neighborhood function h(c, \gamma) = \exp\left(-\frac{\|\mu_c - \mu_\gamma\|^2}{2\sigma^2(t)}\right)
– Update rule for node \mu_\gamma under the influence of d:
  \mu_\gamma(t+1) = \mu_\gamma(t) + \eta(t)\, h(\gamma, c_d)\, (d - \mu_\gamma(t))
– where \sigma^2(t) is the neighborhood width and \eta(t) is the learning rate parameter
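A sketch of one SOM update under these assumptions (a square lattice of cells, and hypothetical sigma/eta decay schedules).

import numpy as np

def som_update(W, grid, d, t, eta0=0.5, sigma0=2.0):
    # W: (n_cells, dim) cluster vectors; grid: (n_cells, 2) lattice coordinates; d: one document vector
    c_d = ((W - d) ** 2).sum(axis=1).argmin()           # winning cell = closest cluster
    sigma2 = (sigma0 / (1 + t)) ** 2                    # shrinking neighbourhood width
    eta = eta0 / (1 + t)                                # decaying learning rate
    h = np.exp(-((grid - grid[c_d]) ** 2).sum(axis=1) / (2 * sigma2))   # Gaussian neighbourhood
    W += eta * h[:, None] * (d - W)                     # pull the winner and its neighbours towards d
    return W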
SOM : Example I
SOM computed from over a million documents taken from 80 Usenet newsgroups. Light areas have a high density of documents.
SOM: Example II
Another example of SOM at work: the sites listed in the Open Directory
have been organized within a map of Antarctica at http://antarcti.ca/.
Multidimensional Scaling (MDS)
• Goal
– “Distance preserving” low dimensional embedding of
documents
• Symmetric inter-document distances d_{ij}
– Given a priori or computed from internal representation
• Coarse-grained user feedback
– User provides similarity \hat{d}_{ij} between documents i and j
– With increasing feedback, prior distances are overridden
• Objective: Minimize the stress of embedding
  stress = \frac{\sum_{i,j} (\hat{d}_{ij} - d_{ij})^2}{\sum_{i,j} d_{ij}^2}
MDS: issues
• Stress not easy to optimize
• Iterative hill climbing
1. Points (documents) assigned random coordinates by external heuristic
2. Points moved by small distance in direction of locally decreasing stress
• For n documents
– Each takes O(n) time to be moved
– Totally O(n^2) time per relaxation
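A rough sketch of this relaxation (illustrative; each point is nudged by a small finite-difference move whenever that locally decreases its share of the stress, so one sweep over the n points costs O(n^2)).

import numpy as np

def point_stress(Y, D, i):
    # stress terms involving point i only: O(n) to evaluate
    e = np.sqrt(((Y - Y[i]) ** 2).sum(axis=1))
    return ((D[i] - e) ** 2).sum()

def relax(D, k=2, sweeps=50, step=0.01, seed=0):
    rng = np.random.default_rng(seed)
    Y = rng.normal(size=(len(D), k))            # random initial coordinates (external heuristic)
    for _ in range(sweeps):
        for i in range(len(D)):
            for dim in range(k):
                base = point_stress(Y, D, i)
                Y[i, dim] += step               # try a small move; keep it only if stress drops
                if point_stress(Y, D, i) > base:
                    Y[i, dim] -= 2 * step
                    if point_stress(Y, D, i) > base:
                        Y[i, dim] += step       # neither direction helped; undo
    return Y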
FastMap [Faloutsos ’95]
• No internal representation of documents available
• Goal
– Find a projection from an 'n'-dimensional space to a space with a smaller number 'k' of dimensions
• Iterative projection of documents along lines of maximum spread
• Each 1D projection preserves distance information
Best line
• Pivots for a line: two points (a and b)
that determine it
• Avoid exhaustive checking by picking
pivots that are far apart
• First coordinate x_1 of point x on the “best line” (a, b):
  x_1 = \frac{d_{a,x}^2 + d_{a,b}^2 - d_{b,x}^2}{2\, d_{a,b}}
Iterative projection
• For i = 1 to k
1. Find the next (i-th) “best” line
   – A “best” line is one which gives maximum variance of the point-set in the direction of the line
2. Project points on the line
3. Project points on the “hyperspace” orthogonal to the above line
Projection
• Purpose
– To correct inter-point distances d_{x,y} between points (x', y') by taking into account the components (x_1, y_1) already accounted for by the first pivot line:
  d'_{x', y'} = \sqrt{d_{x,y}^2 - (x_1 - y_1)^2}
• Project recursively up to 1-D space
• Time: O(nk)
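A compact sketch of the FastMap recursion described over the last three slides (illustrative; pivots are picked by the usual pick-farthest heuristic, and all bookkeeping is done on squared distances).

import numpy as np

def fastmap(D, k):
    # D: (n, n) symmetric inter-document distance matrix; returns an (n, k) embedding
    n = len(D)
    X = np.zeros((n, k))
    D2 = np.asarray(D, dtype=float) ** 2
    for i in range(k):
        a = 0
        b = int(D2[a].argmax())                 # pivots far apart, without exhaustive checking
        a = int(D2[b].argmax())
        if D2[a, b] == 0:
            break                               # no spread left to project
        x = (D2[a] + D2[a, b] - D2[b]) / (2 * np.sqrt(D2[a, b]))   # coordinate on line (a, b)
        X[:, i] = x
        # distances in the hyperplane orthogonal to the pivot line
        D2 = np.maximum(D2 - (x[:, None] - x[None, :]) ** 2, 0.0)
    return X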
Issues
• Detecting noise dimensions
– Bottom-up dimension composition too slow
– Definition of noise depends on application
• Running time
– Distance computation dominates
– Random projections
– Sublinear time w/o losing small clusters
• Integrating semi-structured information
– Hyperlinks, tags embed similarity clues
– A link is worth a ? words
• Expectation maximization (EM):
– Pick k arbitrary ‘distributions’
– Repeat:
• Find probability that document d is generated
from distribution f for all d and f
• Estimate distribution parameters from weighted
contribution of documents
Extended similarity
• Where can I fix my scooter?
• A great garage to repair your
2-wheeler is at …
• auto and car co-occur often
• Documents having related
words are related
• Useful for search and clustering
• Two basic approaches
– Hand-made thesaurus
(WordNet)
– Co-occurrence and
associations
[Illustration: example documents in which “car” and “auto” co-occur]
Latent semantic indexing
[Diagram: SVD of the t × d term-document matrix A into U, D, V; terms (e.g., “car”, “auto”) and documents are mapped to k-dimensional vectors]
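A sketch of the reduction in the diagram (assumes a dense term-document TFIDF matrix; a real system would use a sparse, truncated SVD). Documents and terms both end up as k-dimensional vectors, so related words such as "car" and "auto" can land close together.

import numpy as np

def lsi(A, k):
    # A: (terms, docs) TFIDF matrix; keep the k largest singular triples
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]
    doc_vectors = (np.diag(sk) @ Vtk).T        # each document as a k-dim vector
    term_vectors = Uk                          # each term as a k-dim vector
    return term_vectors, sk, doc_vectors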
Collaborative recommendation
• People=record, movies=features
• People and features to be clustered
– Mutual reinforcement of similarity
• Need advanced models
[Preference table: people (Lyle, Ellen, Jason, Fred, Dean, Karen) × movies (Batman, Rambo, Andre, Hiver, Whispers, StarWars)]
From Clustering methods in collaborative filtering, by Ungar and Foster
A model for collaboration
• People and movies belong to unknown
classes
• P_k = probability a random person is in class k
• P_l = probability a random movie is in class l
• P_{kl} = probability of a class-k person liking a class-l movie
• Gibbs sampling: iterate
– Pick a person or movie at random and assign to a
class with probability proportional to Pk or Pl
– Estimate new parameters
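An illustrative Gibbs-style sketch of this model (hypothetical names; R is a 0/1 person-by-movie "liked" matrix and K, L are the numbers of person and movie classes).

import numpy as np

def gibbs_collab(R, K, L, sweeps=500, seed=0):
    rng = np.random.default_rng(seed)
    n_p, n_m = R.shape
    zp = rng.integers(K, size=n_p)                 # person class assignments
    zm = rng.integers(L, size=n_m)                 # movie class assignments
    for _ in range(sweeps):
        # estimate P_k, P_l, P_kl from the current assignments (with light smoothing)
        Pk = (np.bincount(zp, minlength=K) + 1) / (n_p + K)
        Pl = (np.bincount(zm, minlength=L) + 1) / (n_m + L)
        Pkl = np.array([[(R[zp == k][:, zm == l].sum() + 1) /
                         ((zp == k).sum() * (zm == l).sum() + 2)
                         for l in range(L)] for k in range(K)])
        # pick a person or movie at random and reassign its class
        if rng.random() < 0.5:
            i = rng.integers(n_p)
            logp = np.log(Pk) + np.array([
                (R[i] * np.log(Pkl[k, zm]) + (1 - R[i]) * np.log(1 - Pkl[k, zm])).sum()
                for k in range(K)])
            p = np.exp(logp - logp.max())
            zp[i] = rng.choice(K, p=p / p.sum())
        else:
            j = rng.integers(n_m)
            logp = np.log(Pl) + np.array([
                (R[:, j] * np.log(Pkl[zp, l]) + (1 - R[:, j]) * np.log(1 - Pkl[zp, l])).sum()
                for l in range(L)])
            p = np.exp(logp - logp.max())
            zm[j] = rng.choice(L, p=p / p.sum())
    return zp, zm, Pkl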
Aspect Model
• Metric data vs Dyadic data vs Proximity data vs
Ranked preference data.
• Dyadic data : domain with two finite sets of
objects
• Observations : Of dyads X and Y
• Unsupervised learning from dyadic data
• Two sets of objects: X = \{x_1, \ldots, x_n\}, \; Y = \{y_1, \ldots, y_m\}
Aspect Model (contd)
• Two main tasks
– Probabilistic modeling:
• learning a joint or conditional probability model over X \times Y
– structure discovery:
• identifying clusters and data hierarchies.
Aspect Model
• Statistical models
– Empirical co-occurrence frequencies
• Sufficient statistics
– Data sparseness:
• Empirical frequencies either 0 or significantly
corrupted by sampling noise
– Solution
• Smoothing
– Back-off method [Katz’87]
– Model interpolation with held-out data [JM’80, Jel’85]
– Similarity-based smoothing techniques [ES’92]
• Model-based statistical approach: a principled
approach to deal with data sparseness
Aspect Model
• Model-based statistical approach: a principled
approach to deal with data sparseness
– Finite Mixture Models [TSM’85]
– Latent class [And’97]
– Specification of a joint probability distribution for
latent and observable variables [Hofmann’98]
• Unifies
– statistical modeling
• Probabilistic modeling by marginalization
– structure detection (exploratory data analysis)
• Posterior probabilities by Bayes’ rule on latent space of
structures
Aspect Model
• S = (x^n, y^n)_{1 \le n \le N}: realisation of an underlying sequence of random variables (X^n, Y^n)_{1 \le n \le N}
• 2 assumptions
– All co-occurrences in sample S are i.i.d.
– X^n, Y^n are independent given A^n
• P(c) are the mixture components
Aspect Model: Latent classes
Increasing degree of restriction on the latent space:
• Aspect model: A^n \leftrightarrow (X^n, Y^n)_{1 \le n \le N}, \quad A = \{a_1, \ldots, a_K\}
• One-sided clustering: \{C(X^n), Y^n\}_{1 \le n \le N}, \quad C = \{c_1, \ldots, c_K\}
• Hierarchical clustering: \{C(X^n), Y^n\}_{1 \le n \le N}, \quad C = \{c_1, \ldots, c_K\}
• Two-sided clustering: \{C(X^n), D(Y^n)\}_{1 \le n \le N}, \quad C = \{c_1, \ldots, c_K\}, \; D = \{d_1, \ldots, d_L\}
Aspect Model
• Symmetric:
P(S, a) = \prod_{n=1}^{N} P(x^n, y^n, a^n) = \prod_{n=1}^{N} P(a^n)\, P(x^n \mid a^n)\, P(y^n \mid a^n)
P(S) = \prod_{x \in X} \prod_{y \in Y} P(x, y)^{n(x,y)} = \prod_{x \in X} \prod_{y \in Y} \Big[ \sum_{a \in A} P(a)\, P(x \mid a)\, P(y \mid a) \Big]^{n(x,y)}
• Asymmetric:
P(S, a) = \prod_{n=1}^{N} P(x^n, y^n, a^n) = \prod_{n=1}^{N} P(a^n)\, P(x^n \mid a^n)\, P(y^n \mid a^n)
P(S) = \prod_{x \in X} \prod_{y \in Y} P(x, y)^{n(x,y)} = \prod_{x \in X} P(x)^{n(x)} \prod_{y \in Y} \Big[ \sum_{a \in A} P(a \mid x)\, P(y \mid a) \Big]^{n(x,y)}
Clustering vs Aspect
• Clustering model
– Constrained aspect model: P(a \mid x, c) = P\{A^n = a \mid X^n = x, C(x) = c\}
• For flat clustering: c_k \leftrightarrow a_k, so P(a \mid x, c) = \delta_{ac}
• For hierarchical clustering: a_k \uparrow c_k, so \delta_{ac} is relaxed to P(a \mid x, c)
– Group structure on object spaces as against partitioning the observations
– Notation
• P(.) : are the parameters
• P{.}: are posteriors
Hierarchical Clustering model
One-sided clustering:
P(S) = \prod_{x \in X} \prod_{y \in Y} P(x, y)^{n(x,y)} = \prod_{x \in X} P(x)^{n(x)} \prod_{y \in Y} \Big[ \sum_{a \in A} P(a \mid x)\, P(y \mid a) \Big]^{n(x,y)}
     = \prod_{x \in X} P(x)^{n(x)} \sum_{c \in C} P(c) \prod_{y \in Y} \big[ P(y \mid c) \big]^{n(x,y)}
Hierarchical clustering:
P(S) = \prod_{x \in X} \prod_{y \in Y} P(x, y)^{n(x,y)} = \prod_{x \in X} P(x)^{n(x)} \prod_{y \in Y} \Big[ \sum_{a \in A} P(a \mid x)\, P(y \mid a) \Big]^{n(x,y)}
     = \prod_{x \in X} P(x)^{n(x)} \sum_{c \in C} P(c) \prod_{y \in Y} \Big[ \sum_{a \in A} P(a \mid x, c)\, P(y \mid a) \Big]^{n(x,y)}
Comparison of E’s
• Aspect model
P\{A^n = a \mid X^n = x, Y^n = y; \theta\} = \frac{P(a)\, P(x \mid a)\, P(y \mid a)}{\sum_{a' \in A} P(a')\, P(x \mid a')\, P(y \mid a')}
• One-sided aspect model
P\{C(x) = c \mid S_x, \theta\} = \frac{P(c) \prod_{y \in Y} [P(y \mid c)]^{n(x,y)}}{\sum_{c' \in C} P(c') \prod_{y \in Y} [P(y \mid c')]^{n(x,y)}}
• Hierarchical aspect model
P\{C(x) = c \mid S, \theta\} = \frac{P(c) \prod_{y \in Y} \big[ \sum_{a \in A} P(y \mid a)\, P(a \mid x, c) \big]^{n(x,y)}}{\sum_{c' \in C} P(c') \prod_{y \in Y} \big[ \sum_{a \in A} P(y \mid a)\, P(a \mid x, c') \big]^{n(x,y)}}
P\{A^n = a \mid X^n = x, Y^n = y, C(x) = c; \theta\} = \frac{P(a \mid x, c)\, P(y \mid a)}{\sum_{a' \in A} P(a' \mid x, c)\, P(y \mid a')}
Tempered EM (TEM)
• Additively (on the log scale) discount the likelihood part in Bayes' formula:
P\{A^n = a \mid X^n = x, Y^n = y; \theta\} = \frac{P(a)\, [P(x \mid a)\, P(y \mid a)]^{\beta}}{\sum_{a' \in A} P(a')\, [P(x \mid a')\, P(y \mid a')]^{\beta}}
1. Set \beta = 1 and perform EM until the performance on held-out data deteriorates (early stopping).
2. Decrease \beta, e.g., by setting \beta \leftarrow \eta \beta with some rate parameter \eta < 1.
3. As long as the performance on held-out data improves, continue TEM iterations at this value of \beta.
4. Stop when decreasing \beta does not yield further improvements; otherwise go to step (2).
5. Perform some final iterations using both training and held-out data.
M-Steps
1. Aspect model
P(x \mid a) = \frac{\sum_{n: x^n = x} P(a \mid x^n, y^n; \theta')}{\sum_{n=1}^{N} P(a \mid x^n, y^n; \theta')} ,\qquad
P(y \mid a) = \frac{\sum_{n: y^n = y} P(a \mid x^n, y^n; \theta')}{\sum_{n=1}^{N} P(a \mid x^n, y^n; \theta')}
2. Asymmetric
P(x) = \frac{n(x)}{N} ,\qquad
P(a \mid x) = \frac{\sum_{y} n(x, y)\, P(a \mid x, y; \theta')}{\sum_{y'} n(x, y')\, P(a \mid x, y'; \theta')} ,\qquad
P(y \mid a) = \frac{\sum_{x} n(x, y)\, P(a \mid x, y; \theta')}{\sum_{x', y} n(x', y)\, P(a \mid x', y; \theta')}
3. Hierarchical x-clustering
P(x) = \frac{n(x)}{N} ,\qquad
P(y \mid a) = \frac{\sum_{x} n(x, y)\, P\{a \mid x, y; \theta'\}}{\sum_{x, y'} n(x, y')\, P\{a \mid x, y'; \theta'\}}
4. One-sided x-clustering
P(x) = \frac{n(x)}{N} ,\qquad
P(y \mid c) = \frac{\sum_{x} n(x, y)\, P\{C(x) = c \mid S; \theta'\}}{\sum_{x} n(x)\, P\{C(x) = c \mid S; \theta'\}}
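A sketch of EM for the asymmetric aspect model, combining the E-step from the "Comparison of E's" slide with the asymmetric M-step above (illustrative; Nxy is the co-occurrence count matrix n(x, y), and beta < 1 gives a tempered variant in the spirit of TEM).

import numpy as np

def aspect_em(Nxy, K, iters=50, beta=1.0, seed=0):
    # Nxy: (n_x, n_y) counts n(x, y); K: number of aspects
    rng = np.random.default_rng(seed)
    n_x, n_y = Nxy.shape
    Pa_x = rng.random((n_x, K)); Pa_x /= Pa_x.sum(axis=1, keepdims=True)   # P(a | x)
    Py_a = rng.random((K, n_y)); Py_a /= Py_a.sum(axis=1, keepdims=True)   # P(y | a)
    Px = Nxy.sum(axis=1) / Nxy.sum()                                       # P(x) = n(x) / N
    for _ in range(iters):
        # E-step: P{a | x, y} proportional to P(a|x) * P(y|a)^beta (beta = 1 is plain EM)
        post = Pa_x[:, :, None] * (Py_a[None, :, :] ** beta)               # shape (x, a, y)
        post /= post.sum(axis=1, keepdims=True) + 1e-12
        # M-step: expected counts n(x, y) * P{a | x, y} re-estimate the parameters
        exp_counts = Nxy[:, None, :] * post
        Pa_x = exp_counts.sum(axis=2)
        Pa_x /= Pa_x.sum(axis=1, keepdims=True) + 1e-12
        Py_a = exp_counts.sum(axis=0)
        Py_a /= Py_a.sum(axis=1, keepdims=True) + 1e-12
    return Px, Pa_x, Py_a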
Example Model
[Hofmann and Popat CIKM 2001]
• Hierarchy of document categories
Example Application
Topic Hierarchies
• To overcome the sparseness problem in topic hierarchies with a large number of classes
• Sparseness Problem: Small number of
positive examples
• Topic hierarchies to reduce variance in
parameter estimation
• Automatically differentiate: make use of term distributions estimated for more general, coarser text aspects to provide better, smoothed estimates of class-conditional term distributions
• Convex combination of term distributions in a Hierarchical Mixture Model:
  P(w \mid c) = \sum_{a \uparrow c} P(a \mid c)\, P(w \mid a)
  where a \uparrow c refers to all inner nodes a above the terminal class node c.
Topic Hierarchies
(Hierarchical X-clustering)
• X = document, Y = word
P(y \mid a) = \frac{\sum_{n: y^n = y} P\{a \mid x^n, y^n; \theta'\}}{\sum_{n=1}^{N} P\{a \mid x^n, y^n; \theta'\}}
            = \frac{\sum_{x} n(x, y)\, P\{a \mid x, y; \theta'\}}{\sum_{x, y'} n(x, y')\, P\{a \mid x, y'; \theta'\}}
            = \frac{\sum_{c(x) \uparrow a} n(c(x), y)\, P\{a \mid c(x), y; \theta'\}}{\sum_{c(x) \uparrow a,\, y'} n(c(x), y')\, P\{a \mid c(x), y'; \theta'\}}
P\{a \mid x, y, c(x); \theta\} = P\{a \mid y, c(x); \theta\} = \frac{P(a \mid c(x))\, P(y \mid a)}{\sum_{a' \uparrow c} P(a' \mid c(x))\, P(y \mid a')}
P\{a \mid c(x); \theta\} = \frac{\sum_{y} n(y, c)\, P(a \mid y, c(x))}{\sum_{a'} \sum_{y} n(y, c)\, P(a' \mid y, c(x))}
P(x) = \frac{n(x)}{N}
P\{C(x) = c \mid S, \theta\} = \frac{P(c) \prod_{y \in Y} \big[ \sum_{a \uparrow c} P(y \mid a)\, P(a \mid c(x)) \big]^{n(x, y)}}{\sum_{c' \in C} P(c') \prod_{y \in Y} \big[ \sum_{a \uparrow c'} P(y \mid a)\, P(a \mid c'(x)) \big]^{n(x, y)}}
Document Classification Exercise
• Modification of Naïve Bayes
P(w \mid c) = \sum_{a \uparrow c} P(a \mid c)\, P(w \mid a)
P(c \mid x) = \frac{P(c) \prod_{y_i \in x} P(y_i \mid c)}{\sum_{c'} P(c') \prod_{y_i \in x} P(y_i \mid c')}
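An illustrative sketch of this modified classifier (hypothetical inputs: P_w_a[a] is a term distribution per inner node a, P_a_c[c] holds the mixing weights over the ancestors of class c, and priors[c] = P(c)).

import numpy as np

def smoothed_class_term_dist(P_w_a, P_a_c, c):
    # P(w | c) = sum over ancestors a of c of P(a | c) P(w | a)
    return sum(P_a_c[c][a] * P_w_a[a] for a in P_a_c[c])

def classify(doc_terms, classes, priors, P_w_a, P_a_c):
    # doc_terms: list of term indices y_i occurring in document x
    scores = {}
    for c in classes:
        p_w_c = smoothed_class_term_dist(P_w_a, P_a_c, c)
        scores[c] = np.log(priors[c]) + sum(np.log(p_w_c[y] + 1e-12) for y in doc_terms)
    z = max(scores.values())
    probs = {c: np.exp(s - z) for c, s in scores.items()}
    total = sum(probs.values())
    return {c: p / total for c, p in probs.items()}        # P(c | x)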
Mixture vs Shrinkage
• Shrinkage [McCallum Rosenfeld AAAI’98]: Interior
nodes in the hierarchy represent
coarser views of the data which are
obtained by a simple pooling scheme of
term counts
• Mixture : Interior nodes represent
abstraction levels with their
corresponding specific vocabulary
– Predefined hierarchy [Hofmann and Popat CIKM 2001]
– Creation of hierarchical model from unlabeled data
[Hofmann IJCAI’99]
Mixture Density Networks (MDN)
[Bishop CM ’94 Mixture Density Networks]
• broad and flexible class of distributions that
are capable of modeling completely general
continuous distributions
• superimpose simple component densities with
well known properties to generate or
approximate more complex distributions
• Two modules:
– Mixture models: Output has a distribution given as
mixture of distributions
– Neural Network: Outputs determine parameters of
the mixture model
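A minimal numpy sketch of the two modules (illustrative; a single hidden layer predicts the parameters of a one-dimensional Gaussian mixture, and training would minimise the negative log-likelihood, e.g., with a GEM-style or gradient update).

import numpy as np

def mdn_forward(x, W1, b1, W2, b2, m):
    # the network output determines the parameters of an m-component Gaussian mixture over t
    h = np.tanh(x @ W1 + b1)
    out = h @ W2 + b2                                   # (batch, 3m): mixing logits, means, log-scales
    logits, mu, log_sigma = out[:, :m], out[:, m:2 * m], out[:, 2 * m:]
    pi = np.exp(logits - logits.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)                 # mixing coefficients sum to 1
    return pi, mu, np.exp(log_sigma)

def mdn_nll(t, pi, mu, sigma):
    # negative log-likelihood of scalar targets t under the predicted mixture
    comp = pi * np.exp(-0.5 * ((t[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return -np.log(comp.sum(axis=1) + 1e-12).mean()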
MDN: Example
A conditional mixture density network with Gaussian component densities
MDN
• Parameter Estimation :
– Using Generalized EM (GEM) algo to speed
up.
• Inference
– Even for a linear mixture, closed form
solution not possible
– Use of Monte Carlo Simulations as a
substitute
Document model
• Vocabulary V, term w_i; document d represented by c(d) = \big( f(w_i, d) \big)_{w_i \in V}
• f(w_i, d) is the number of times w_i occurs in document d
• Most f's are zeroes for a single document
• Monotone component-wise damping function g, such as log or square-root: g(c(d)) = \big( g(f(w_i, d)) \big)_{w_i \in V}
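A small sketch of this representation (illustrative; square-root damping is shown, since log damping would need g(1 + f) to handle zero counts).

import numpy as np
from collections import Counter

def profile(doc_tokens, vocab_index, damp=np.sqrt):
    # c(d): term counts f(w_i, d) over the vocabulary, mostly zero for a single document
    c = np.zeros(len(vocab_index))
    for w, f in Counter(doc_tokens).items():
        if w in vocab_index:
            c[vocab_index[w]] = f
    return damp(c)          # g(c(d)) = (g(f(w_i, d))) for w_i in V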