Transcript cluster.ppt

Similarity and clustering
Motivation
• Problem: A query word could be ambiguous:
– E.g., the query “Star” retrieves documents about astronomy, plants, animals, etc.
– Solution: Visualisation
• Clustering document responses to queries along lines of
different topics.
• Problem 2: Manual construction of topic
hierarchies and taxonomies
– Solution:
• Preliminary clustering of large samples of web
documents.
• Problem 3: Speeding up similarity search
– Solution:
Clustering
• Restrict the search for documents similar to a query to the most representative cluster(s).
2
Example
Scatter/Gather, a text clustering system, can separate salient topics in the response to keyword queries. (Image courtesy of Hearst)
Clustering
3
Clustering
• Task: Evolve measures of similarity to cluster a collection of documents/terms into groups, such that similarity within a cluster is larger than similarity across clusters.
• Cluster Hypothesis: Given a ‘suitable’ clustering of a collection, if the user is interested in document/term d/t, he is likely to be interested in other members of the cluster to which d/t belongs.
• Similarity measures
– Represent documents by TFIDF vectors
– Distance between document vectors
– Cosine of angle between document vectors
• Issues
– Large number of noisy dimensions
– Notion of noise is application dependent
Clustering
4
Clustering
• Task : Evolve measures of similarity to cluster a
collection of documents/terms into groups within
which similarity within a cluster is larger than across
clusters.
• Cluster Hypothesis: Given a ‘suitable’ clustering of a collection, if the user is interested in document/term d/t, he is likely to be interested in other members of the cluster to which d/t belongs.
• Collaborative filtering: Clustering of two or more sets of objects that have a bipartite relationship
Clustering
12
Clustering (contd)
• Two important paradigms:
– Bottom-up agglomerative clustering
– Top-down partitioning
• Visualisation techniques: Embedding of
corpus in a low-dimensional space
• Characterising the entities:
– Internally : Vector space model, probabilistic
models
– Externally: Measure of similarity/dissimilarity
between pairs
• Learning: Supplement stock algorithms with
experience with data
Clustering
13
Clustering: Parameters
• Similarity measure (e.g., cosine similarity): $s(d_1, d_2)$
• Distance measure (e.g., Euclidean distance): $\delta(d_1, d_2)$
• Number “k” of clusters
• Issues
– Large number of noisy dimensions
– Notion of noise is application dependent
Clustering
14
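As a concrete illustration of the two measures above, here is a minimal sketch using scikit-learn's TF-IDF vectorizer; the toy corpus is made up.

```python
# Cosine similarity and Euclidean distance between TF-IDF document vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

docs = ["star astronomy telescope", "star movie actor", "telescope lens optics"]
X = TfidfVectorizer().fit_transform(docs)      # documents as TF-IDF vectors

rho = cosine_similarity(X)        # similarity measure s(d1, d2)
delta = euclidean_distances(X)    # distance measure delta(d1, d2)
print(rho.round(2))
print(delta.round(2))
```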
Clustering: Formal specification
• Partitioning Approaches
– Bottom-up clustering
– Top-down clustering
• Geometric Embedding Approaches
– Self-organization map
– Multidimensional scaling
– Latent semantic indexing
• Generative models and probabilistic
approaches
– Single topic per document
– Documents correspond to mixtures of multiple
topics
Clustering
15
Partitioning Approaches
• Partition document collection into k clusters $\{D_1, D_2, \ldots, D_k\}$
• Choices:
– Minimize intra-cluster distance $\sum_i \sum_{d_1, d_2 \in D_i} \delta(d_1, d_2)$
– Maximize intra-cluster semblance $\sum_i \sum_{d_1, d_2 \in D_i} s(d_1, d_2)$
• If cluster representations $D_i$ are available
– Minimize $\sum_i \sum_{d \in D_i} \delta(d, D_i)$
– Maximize $\sum_i \sum_{d \in D_i} s(d, D_i)$
• Soft clustering
– d assigned to $D_i$ with ‘confidence’ $z_{d,i}$
– Find $z_{d,i}$ so as to minimize $\sum_i \sum_{d \in D_i} z_{d,i}\,\delta(d, D_i)$ or maximize $\sum_i \sum_{d \in D_i} z_{d,i}\,s(d, D_i)$
• Two ways to get partitions: bottom-up clustering and top-down clustering
Clustering
16
Bottom-up clustering(HAC)
• Initially G is a collection of singleton groups, each with one document d
• Repeat
– Find Γ, Δ in G with the max similarity measure s(Γ ∪ Δ)
– Merge group Γ with group Δ
• For each Γ keep track of the best Δ
• Use the above info to plot the hierarchical merging process (dendrogram)
• To get the desired number of clusters: cut across any level of the dendrogram
Clustering
17
Dendrogram
A dendrogram presents the progressive, hierarchy-forming merging process pictorially.
Clustering
18
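A hedged sketch of the bottom-up (HAC) procedure and the dendrogram cut described on the last two slides, using SciPy's average-link ("group average") clustering on TF-IDF vectors; the corpus is made up.

```python
# Agglomerative clustering of documents, then a cut of the dendrogram.
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.cluster.hierarchy import linkage, fcluster

docs = ["star galaxy nebula", "galaxy telescope", "rose tulip garden", "tulip pollen bee"]
X = TfidfVectorizer().fit_transform(docs).toarray()

Z = linkage(X, method="average", metric="cosine")   # repeated best-first merges
labels = fcluster(Z, t=2, criterion="maxclust")     # cut the dendrogram into 2 clusters
print(labels)
# scipy.cluster.hierarchy.dendrogram(Z) would plot the merging process (needs matplotlib)
```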
Similarity measure
• Typically s(Γ ∪ Δ) decreases with increasing number of merges
• Self-Similarity
– Average pairwise similarity between documents in Γ:
$s(\Gamma) = \dfrac{1}{\binom{|\Gamma|}{2}} \sum_{d_1, d_2 \in \Gamma} s(d_1, d_2)$
– $s(d_1, d_2)$ = inter-document similarity measure (say, cosine of TFIDF vectors)
– Other criteria: maximum/minimum pairwise similarity between documents in the clusters
Clustering
19
Computation
Un-normalized group profile: $\hat{p}(\Gamma) = \sum_{d \in \Gamma} \hat{p}(d)$
Can show:
$s(\Gamma) = \dfrac{\langle \hat{p}(\Gamma), \hat{p}(\Gamma)\rangle - |\Gamma|}{|\Gamma|\,(|\Gamma| - 1)}$
$s(\Gamma \cup \Delta) = \dfrac{\langle \hat{p}(\Gamma \cup \Delta), \hat{p}(\Gamma \cup \Delta)\rangle - (|\Gamma| + |\Delta|)}{(|\Gamma| + |\Delta|)\,(|\Gamma| + |\Delta| - 1)}$
$\langle \hat{p}(\Gamma \cup \Delta), \hat{p}(\Gamma \cup \Delta)\rangle = \langle \hat{p}(\Gamma), \hat{p}(\Gamma)\rangle + \langle \hat{p}(\Delta), \hat{p}(\Delta)\rangle + 2\langle \hat{p}(\Gamma), \hat{p}(\Delta)\rangle$
$O(n^2 \log n)$ algorithm with $n^2$ space
Clustering
20
Similarity
$s(d_1, d_2) = \dfrac{\langle g(c(d_1)),\, g(c(d_2))\rangle}{\|g(c(d_1))\|\;\|g(c(d_2))\|}$, where $\langle\cdot,\cdot\rangle$ is the inner product.
Normalized document profile: $p(d) = \dfrac{g(c(d))}{\|g(c(d))\|}$
Profile for document group $\Gamma$: $p(\Gamma) = \dfrac{\sum_{d \in \Gamma} p(d)}{\big\|\sum_{d \in \Gamma} p(d)\big\|}$
Clustering
21
Switch to top-down
• Bottom-up
– Requires quadratic time and space
• Top-down or move-to-nearest
– Internal representation for documents as well as
clusters
– Partition documents into `k’ clusters
– 2 variants
• “Hard” (0/1) assignment of documents to clusters
• “soft” : documents belong to clusters, with fractional
scores
– Termination
• when assignment of documents to clusters ceases to
change much OR
• When cluster centroids move negligibly over successive
iterations
Clustering
22
Top-down clustering
• Hard k-Means: Repeat…
– Choose k arbitrary ‘centroids’
– Assign each document to nearest centroid
– Recompute centroids
• Soft k-Means :
– Don’t break close ties between document assignments to
clusters
– Don’t make documents contribute to a single cluster which
wins narrowly
• Contribution for updating cluster centroid $\mu_c$ from document d is related to the current similarity between $\mu_c$ and d:
$\Delta\mu_c \propto \dfrac{\exp(-|d - \mu_c|^2)}{\sum_{\gamma} \exp(-|d - \mu_\gamma|^2)}\,(d - \mu_c), \qquad \mu_c \leftarrow \mu_c + \Delta\mu_c$
Clustering
23
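A minimal sketch of a soft k-means loop of the kind described above; the stiffness parameter beta, the number of iterations, and the synthetic 2-D data are illustrative assumptions, not from the slides.

```python
import numpy as np

def soft_kmeans(X, k, beta=2.0, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), k, replace=False)]            # k arbitrary centroids
    for _ in range(iters):
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # |d - mu_c|^2
        w = np.exp(-beta * d2)
        w /= w.sum(axis=1, keepdims=True)                    # fractional assignments
        mu = (w.T @ X) / w.sum(axis=0)[:, None]              # weighted centroid update
    return mu, w

X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 4])
centroids, resp = soft_kmeans(X, k=2)
print(centroids.round(2))
```

Setting beta very large recovers the "hard" 0/1 assignment variant.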
Seeding ‘k’ clusters
• Randomly sample $O(\sqrt{kn})$ documents
• Run bottom-up group-average clustering algorithm to reduce to k groups or clusters: $O(kn \log n)$ time
• Iterate assign-to-nearest O(1) times
– Move each document to nearest cluster
– Recompute cluster centroids
• Total time taken is O(kn)
• Non-deterministic behavior
Clustering
24
Choosing `k’
• Mostly problem driven
• Could be ‘data driven’ only when either
– Data is not sparse
– Measurement dimensions are not too noisy
• Interactive
– Data analyst interprets results of structure
discovery
Clustering
25
Choosing ‘k’ : Approaches
• Hypothesis testing:
– Null Hypothesis (H₀): Underlying density is a mixture of ‘k’ distributions
– Require regularity conditions on the mixture
likelihood function (Smith’85)
• Bayesian Estimation
– Estimate posterior distribution on k, given data
and prior on k.
– Difficulty: Computational complexity of integration
– Autoclass algorithm of (Cheeseman’98) uses
approximations
– (Diebolt’94) suggests sampling techniques
Clustering
26
Choosing ‘k’ : Approaches
• Penalised Likelihood
– To account for the fact that the likelihood $L_k(D)$ is a non-decreasing function of k
– Penalise the number of parameters
– Examples: Bayesian Information Criterion (BIC), Minimum Description Length (MDL), MML
– Assumption: Penalised criteria are asymptotically optimal (Titterington 1985)
• Cross-Validation Likelihood
– Find ML estimate on part of training data
– Choose k that maximises the average of the M cross-validated likelihoods on held-out data $D_{test}$
– Cross-validation techniques: Monte Carlo Cross Validation (MCCV), v-fold cross validation (vCV)
Clustering
27
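A hedged illustration of the penalised-likelihood route to choosing k: fit Gaussian mixtures by EM for several k and keep the k with the lowest BIC. This uses scikit-learn's GaussianMixture (not AutoClass), and the data are synthetic.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)),
               rng.normal(5, 1, (100, 2)),
               rng.normal((0, 6), 1, (100, 2))])     # 3 synthetic clusters

# BIC penalises the extra parameters of larger k
bic = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
       for k in range(1, 7)}
best_k = min(bic, key=bic.get)
print(bic, best_k)
```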
Visualisation techniques
• Goal: Embedding of corpus in a low-dimensional space
• Hierarchical Agglomerative Clustering (HAC)
– lends itself easily to visualisation
• Self-Organization map (SOM)
– A close cousin of k-means
• Multidimensional scaling (MDS)
– minimize the distortion of interpoint distances in
the low-dimensional embedding as compared to
the dissimilarity given in the input data.
• Latent Semantic Indexing (LSI)
– Linear transformations to reduce number of
dimensions
Clustering
28
Self-Organization Map (SOM)
• Like soft k-means
– Determine association between clusters and documents
– Associate a representative vector $\mu_c$ with each cluster and iteratively refine $\mu_c$
• Unlike k-means
– Embed the clusters in a low-dimensional space right from
the beginning
– Large number of clusters can be initialised even if eventually
many are to remain devoid of documents
• Each cluster can be a slot in a square/hexagonal grid.
• The grid structure defines the neighborhood N(c) for
each cluster c
• Also involves a proximity function $h(c, \gamma)$ between clusters $\gamma$ and c
Clustering
29
SOM : Update Rule
• Like Neural network
– Data item d activates neuron (closest cluster) $c_d$ as well as the neighborhood neurons $N(c_d)$
– E.g., Gaussian neighborhood function:
$h(c, \gamma) = \exp\!\Big(-\dfrac{\|c - \gamma\|^2}{2\sigma^2(t)}\Big)$
– Update rule for node $\gamma$ under the influence of d is:
$\mu_\gamma(t+1) = \mu_\gamma(t) + \eta(t)\,h(\gamma, c_d)\,(d - \mu_\gamma(t))$
– where $\sigma^2(t)$ is the neighborhood width and $\eta(t)$ is the learning rate parameter
Clustering
30
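A minimal sketch of the SOM update rule above for clusters arranged on a square grid; the grid size, vector dimension, and the sigma(t) and eta(t) schedules are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
grid = np.array([(i, j) for i in range(10) for j in range(10)])   # cluster positions c
mu = rng.random((len(grid), 50))                                  # representative vectors

def som_step(d, t, sigma0=3.0, eta0=0.5):
    sigma2 = (sigma0 * np.exp(-t / 100.0)) ** 2                   # shrinking neighborhood width
    eta = eta0 * np.exp(-t / 100.0)                               # decaying learning rate
    c_d = np.argmin(((mu - d) ** 2).sum(axis=1))                  # winning neuron for document d
    h = np.exp(-((grid - grid[c_d]) ** 2).sum(axis=1) / (2 * sigma2))
    mu[:] = mu + eta * h[:, None] * (d - mu)                      # mu(t+1) = mu(t) + eta*h*(d - mu)

for t, d in enumerate(rng.random((200, 50))):                     # stream of document vectors
    som_step(d, t)
```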
SOM : Example I
SOM computed from over a million documents taken from 80 Usenet newsgroups. Light areas have a high density of documents.
Clustering
31
SOM: Example II
Another example of SOM at work: the sites listed in the Open Directory have been organized within a map of Antarctica at http://antarcti.ca/.
Clustering
32
Multidimensional Scaling(MDS)
• Goal
– “Distance preserving” low dimensional embedding of
documents
• Symmetric inter-document distances $d_{ij}$
– Given a priori or computed from internal representation
• Coarse-grained user feedback
– User provides similarity $\hat{d}_{ij}$ between documents i and j
– With increasing feedback, prior distances are overridden
• Objective: Minimize the stress of embedding
$\text{stress} = \dfrac{\sum_{i,j}\big(\hat{d}_{ij} - d_{ij}\big)^2}{\sum_{i,j} d_{ij}^2}$
Clustering
33
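A hedged example of the stress objective above, using scikit-learn's (SMACOF-based) MDS as a stand-in for the iterative stress minimisation; the target distances come from synthetic points.

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = rng.random((30, 20))
D = np.sqrt(((X[:, None] - X[None, :]) ** 2).sum(-1))     # target inter-document distances

emb = MDS(n_components=2, dissimilarity="precomputed", random_state=0).fit_transform(D)
D_hat = np.sqrt(((emb[:, None] - emb[None, :]) ** 2).sum(-1))

stress = ((D_hat - D) ** 2).sum() / (D ** 2).sum()        # stress of the embedding
print(round(float(stress), 4))
```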
MDS: issues
• Stress not easy to optimize
• Iterative hill climbing
1. Points (documents) assigned random coordinates by external heuristic
2. Points moved by small distance in direction of locally decreasing stress
• For n documents
– Each takes O(n) time to be moved
– Totally $O(n^2)$ time per relaxation
Clustering
34
Fast Map [Faloutsos ’95]
• No internal representation of documents available
• Goal
– find a projection from an ‘n’-dimensional space to a space with a smaller number ‘k’ of dimensions
• Iterative projection of documents along lines of maximum spread
• Each 1-D projection preserves distance information
Clustering
35
Best line
• Pivots for a line: two points (a and b)
that determine it
• Avoid exhaustive checking by picking
pivots that are far apart
• First coordinate $x_1$ of point x on the “best line” (a, b):
$x_1 = \dfrac{d_{a,x}^2 + d_{a,b}^2 - d_{b,x}^2}{2\,d_{a,b}}$
Clustering
36
Iterative projection
• For i = 1 to k
1. Find the next (i-th) “best” line; a “best” line is one which gives maximum variance of the point-set in the direction of the line
2. Project points on the line
3. Project points on the “hyperspace” orthogonal to the above line
Clustering
37
Projection
• Purpose
– To correct inter-point distances $d_{x,y}$ between points by taking into account the components $(x_1, y_1)$ already accounted for by the first pivot line:
$d'^{\,2}_{x',y'} = d_{x,y}^2 - (x_1 - y_1)^2$
• Project recursively up to 1-D space
• Time: O(nk)
Clustering
38
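A sketch of FastMap as outlined on the last three slides: pick far-apart pivots, compute the 1-D coordinate x1 = (d_ax² + d_ab² − d_bx²)/(2 d_ab), then recurse on the residual distances d'² = d² − (x1 − y1)². The pivot heuristic, data, and k are illustrative assumptions.

```python
import numpy as np

def choose_pivots(D):
    a = 0
    b = int(np.argmax(D[a]))          # point far from a
    a = int(np.argmax(D[b]))          # point far from b: cheap stand-in for exhaustive search
    return a, b

def fastmap(D, k):
    n = len(D)
    coords = np.zeros((n, k))
    D2 = D.astype(float) ** 2
    for i in range(k):
        a, b = choose_pivots(np.sqrt(D2))
        dab2 = D2[a, b]
        if dab2 == 0:
            break
        x = (D2[a] + dab2 - D2[b]) / (2 * np.sqrt(dab2))   # projection on pivot line (a, b)
        coords[:, i] = x
        D2 = D2 - (x[:, None] - x[None, :]) ** 2           # distances in the orthogonal hyperplane
        D2[D2 < 0] = 0
    return coords

pts = np.random.default_rng(0).random((10, 5))
D = np.sqrt(((pts[:, None] - pts[None, :]) ** 2).sum(-1))
print(fastmap(D, k=2).round(3))
```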
Issues
• Detecting noise dimensions
– Bottom-up dimension composition too slow
– Definition of noise depends on application
• Running time
– Distance computation dominates
– Random projections
– Sublinear time w/o losing small clusters
• Integrating semi-structured information
– Hyperlinks, tags embed similarity clues
– A link is worth a ? words
Clustering
39
• Expectation maximization (EM):
– Pick k arbitrary ‘distributions’
– Repeat:
• Find probability that document d is generated
from distribution f for all d and f
• Estimate distribution parameters from weighted
contribution of documents
Clustering
40
Extended similarity
• Where can I fix my scooter?
• A great garage to repair your
2-wheeler is at …
• auto and car co-occur often
• Documents having related
words are related
• Useful for search and clustering
• Two basic approaches
– Hand-made thesaurus
(WordNet)
– Co-occurrence and
associations
[Illustration: documents containing “car” and documents containing “auto”, linked by frequent co-occurrence, suggesting the association car ≈ auto]
Clustering
41
Latent semantic indexing
[Diagram: the term-by-document matrix A (terms such as “car” and “auto” vs. documents d) is factored by SVD into U, D, V and truncated to rank k, giving a k-dim vector for each document and term]
Clustering
42
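A hedged LSI sketch: a truncated SVD of the TF-IDF term-document matrix, so that documents using related words such as "car" and "auto" land near each other in the k-dimensional space. The corpus and k are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["car engine repair", "auto engine garage", "star galaxy telescope"]
X = TfidfVectorizer().fit_transform(docs)          # documents x terms

lsi = TruncatedSVD(n_components=2, random_state=0)  # rank-k SVD (k = 2 here)
doc_vecs = lsi.fit_transform(X)                     # k-dim document vectors
term_vecs = lsi.components_.T                       # k-dim term vectors
print(doc_vecs.round(3))
```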
Collaborative recommendation
• People=record, movies=features
• People and features to be clustered
– Mutual reinforcement of similarity
• Need advanced models
[Example ratings table: viewers Lyle, Ellen, Jason, Fred, Dean, Karen vs. movies Batman, Rambo, Andre, Hiver, Whispers, StarWars]
From Clustering methods in collaborative filtering, by Ungar and Foster
Clustering
43
A model for collaboration
• People and movies belong to unknown
classes
• Pk = probability a random person is in class k
• Pl = probability a random movie is in class l
• Pkl = probability of a class-k person liking a
class-l movie
• Gibbs sampling: iterate
– Pick a person or movie at random and assign to a
class with probability proportional to Pk or Pl
– Estimate new parameters
Clustering
44
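A rough sketch, under simplifying assumptions (a binary "likes" matrix, fixed class counts, and only the person-resampling step shown; movies are handled symmetrically), of the Gibbs iteration described above. All data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
R = rng.integers(0, 2, size=(6, 6))            # people x movies, 1 = liked
K, L = 2, 2
zp = rng.integers(0, K, size=R.shape[0])       # current person classes
zm = rng.integers(0, L, size=R.shape[1])       # current movie classes

for _ in range(500):
    # estimate P_kl from the current assignments (with add-one style smoothing)
    Pkl = np.full((K, L), 0.5)
    for k in range(K):
        for l in range(L):
            block = R[zp == k][:, zm == l]
            if block.size:
                Pkl[k, l] = (block.sum() + 1.0) / (block.size + 2.0)
    # resample one random person's class, proportional to the likelihood of their row
    i = int(rng.integers(R.shape[0]))
    loglik = np.array([(R[i] * np.log(Pkl[k, zm]) +
                        (1 - R[i]) * np.log(1 - Pkl[k, zm])).sum() for k in range(K)])
    p = np.exp(loglik - loglik.max())
    zp[i] = rng.choice(K, p=p / p.sum())

print(zp, Pkl.round(2))
```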
Aspect Model
• Metric data vs Dyadic data vs Proximity data vs
Ranked preference data.
• Dyadic data : domain with two finite sets of
objects
• Observations : Of dyads X and Y
• Unsupervised learning from dyadic data
• Two sets of objects
$X = \{x_1, \ldots, x_i, \ldots, x_n\}, \quad Y = \{y_1, \ldots, y_i, \ldots, y_n\}$
Clustering
45
Aspect Model (contd)
• Two main tasks
– Probabilistic modeling:
• learning a joint or conditional probability model over $X \times Y$
– structure discovery:
• identifying clusters and data hierarchies.
Clustering
46
Aspect Model
• Statistical models
– Empirical co-occurrence frequencies
• Sufficient statistics
– Data sparseness:
• Empirical frequencies either 0 or significantly corrupted by sampling noise
– Solution
• Smoothing
– Back-off method [Katz’87]
– Model interpolation with held-out data [JM’80, Jel’85]
– Similarity-based smoothing techniques [ES’92]
• Model-based statistical approach: a principled
approach to deal with data sparseness
Clustering
47
Aspect Model
• Model-based statistical approach: a principled
approach to deal with data sparseness
– Finite Mixture Models [TSM’85]
– Latent class [And’97]
– Specification of a joint probability distribution for
latent and observable variables [Hoffmann’98]
• Unifies
– statistical modeling
• Probabilistic modeling by marginalization
– structure detection (exploratory data analysis)
• Posterior probabilities by Bayes’ rule on latent space of structures
Clustering
48
Aspect Model
• $S = (x^n, y^n)_{1 \le n \le N}$: realisation of an underlying sequence of random variables $S = (X^n, Y^n)_{1 \le n \le N}$
• 2 assumptions
– All co-occurrences in sample S are iid
– $X^n, Y^n$ are independent given $A^n$
• P(a) are the mixture components
Clustering
49
Aspect Model: Latent classes
In order of increasing degree of restriction on the latent space:
– $A^n$, $(X^n, Y^n)_{1 \le n \le N}$, with $A = \{a_1, \ldots, a_K\}$
– $\{C(X^n), Y^n\}_{1 \le n \le N}$, with $C = \{c_1, \ldots, c_K\}$
– $\{C(X^n), Y^n\}_{1 \le n \le N}$, with $C = \{c_1, \ldots, c_K\}$
– $\{C(X^n), D(Y^n)\}_{1 \le n \le N}$, with $C = \{c_1, \ldots, c_K\}$, $D = \{d_1, \ldots, d_L\}$
Clustering
50
Aspect Model
Symmetric:
$P(S, a) = \prod_{n=1}^{N} P(x^n, y^n, a^n) = \prod_{n=1}^{N} P(a^n)\,P(x^n \mid a^n)\,P(y^n \mid a^n)$
$P(S) = \prod_{x \in X}\prod_{y \in Y} P(x, y)^{n(x, y)} = \prod_{x \in X}\prod_{y \in Y}\Big[\sum_{a \in A} P(a)\,P(x \mid a)\,P(y \mid a)\Big]^{n(x, y)}$
Asymmetric:
$P(S, a) = \prod_{n=1}^{N} P(x^n, y^n, a^n) = \prod_{n=1}^{N} P(a^n)\,P(x^n \mid a^n)\,P(y^n \mid a^n)$
$P(S) = \prod_{x \in X}\prod_{y \in Y} P(x, y)^{n(x, y)} = \prod_{x \in X} P(x)\prod_{y \in Y}\Big[\sum_{a \in A} P(a \mid x)\,P(y \mid a)\Big]^{n(x, y)}$
Clustering
51
Clustering vs Aspect
• Clustering model
– Constrained aspect model:
$P(a \mid x, c) = P\{A^n = a \mid X^n = x, C(x) = c\} = \delta_{ac}$
• For flat clustering: $c_k = a_k$, i.e. $\delta_{ac}$
• For hierarchical clustering: $a_k \uparrow c_k$, i.e. $\delta_{ac}\,P(a \mid x, c)$
– Group structure on object spaces, as against partitioning the observations
– Notation
• P(·) are the parameters
• P{·} are posteriors
Clustering
52
Hierarchical Clustering model
One-sided clustering:
$P(S) = \prod_{x \in X}\prod_{y \in Y} P(x, y)^{n(x, y)} = \prod_{x \in X} P(x)\prod_{y \in Y}\Big[\sum_{a \in A} P(a \mid x)\,P(y \mid a)\Big]^{n(x, y)}$
$\quad = \prod_{x \in X} P(x)\sum_{c \in C} P(c)\prod_{y \in Y}\Big[\sum_{a \in A} P(a \mid x, c)\,P(y \mid a)\Big]^{n(x, y)} = \sum_{c \in C} P(c)\prod_{x \in X}\big[P(x)\big]^{n(x)}\prod_{y \in Y}\big[P(y \mid a)\big]^{n(x, y)}$
Hierarchical clustering:
$P(S) = \prod_{x \in X}\prod_{y \in Y} P(x, y)^{n(x, y)} = \prod_{x \in X} P(x)\prod_{y \in Y}\Big[\sum_{a \in A} P(a \mid x)\,P(y \mid a)\Big]^{n(x, y)}$
$\quad = \prod_{x \in X} P(x)\sum_{c \in C} P(c)\prod_{y \in Y}\Big[\sum_{a \in A} P(a \mid x, c)\,P(y \mid a)\Big]^{n(x, y)}$
Clustering
53
Comparison of E’s
• Aspect model:
$P\{A^n = a \mid X^n = x, Y^n = y; \theta\} = \dfrac{P(a)\,P(x \mid a)\,P(y \mid a)}{\sum_{a' \in A} P(a')\,P(x \mid a')\,P(y \mid a')}$
• One-sided aspect model:
$P\{C(x) = c \mid S_x, \theta\} = \dfrac{P(c)\prod_{y \in Y}\big[P(y \mid c)\big]^{n(x, y)}}{\sum_{c' \in C} P(c')\prod_{y \in Y}\big[P(y \mid c')\big]^{n(x, y)}}$
• Hierarchical aspect model:
$P\{C(x) = c \mid S, \theta\} = \dfrac{P(c)\prod_{y \in Y}\big[\sum_{a \in A} P(y \mid a)\,P(a \mid x, c)\big]^{n(x, y)}}{\sum_{c' \in C} P(c')\prod_{y \in Y}\big[\sum_{a \in A} P(y \mid a)\,P(a \mid x, c')\big]^{n(x, y)}}$
$P\{A^n = a \mid X^n = x, Y^n = y, C(x) = c; \theta\} = \dfrac{P(a \mid x, c)\,P(y \mid a)}{\sum_{a' \in A} P(a' \mid x, c)\,P(y \mid a')}$
Clustering
54
Tempered EM (TEM)
• Additively (on the log scale) discount the likelihood part in Bayes’ formula:
$P\{A^n = a \mid X^n = x, Y^n = y; \theta\} = \dfrac{P(a)\,\big[P(x \mid a)\,P(y \mid a)\big]^{\beta}}{\sum_{a' \in A} P(a')\,\big[P(x \mid a')\,P(y \mid a')\big]^{\beta}}$
1. Set $\beta \leftarrow 1$ and perform EM until the performance on held-out data deteriorates (early stopping).
2. Decrease $\beta$, e.g., by setting $\beta \leftarrow \eta\beta$ with some rate parameter $\eta < 1$.
3. As long as the performance on held-out data improves, continue TEM iterations at this value of $\beta$.
4. Stop on $\beta$, i.e., when decreasing $\beta$ does not yield further improvements; otherwise go to step (2).
5. Perform some final iterations using both training and held-out data.
Clustering
55
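A minimal sketch of the tempered E-step above; the arrays P(a), P(x|a), P(y|a) and the value of beta are assumed to come from an aspect-model fit such as the EM sketch after the M-step slide below.

```python
import numpy as np

def tempered_posterior(Pa, Px_a, Py_a, beta=0.8):
    # P(a|x,y) proportional to P(a) * [P(x|a) P(y|a)]^beta, for all (x, y) pairs
    post = Pa[:, None, None] * (Px_a[:, :, None] * Py_a[:, None, :]) ** beta
    return post / (post.sum(0, keepdims=True) + 1e-12)
```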
M-Steps
1. Aspect model:
$P(x \mid a) = \dfrac{\sum_{n: x^n = x} P(a \mid x^n, y^n; \theta')}{\sum_{n=1}^{N} P(a \mid x^n, y^n; \theta')} = \dfrac{\sum_{y} n(x, y)\,P(a \mid x, y; \theta')}{\sum_{x', y} n(x', y)\,P(a \mid x', y; \theta')}$
$P(y \mid a) = \dfrac{\sum_{n: y^n = y} P(a \mid x^n, y^n; \theta')}{\sum_{n=1}^{N} P(a \mid x^n, y^n; \theta')}$, and $P(a)$ analogously.
2. Asymmetric:
$P(x) = \dfrac{n(x)}{N}$, $\quad P(a \mid x) = \dfrac{\sum_{y} n(x, y)\,P(a \mid x, y; \theta')}{\sum_{y'} n(x, y')\,P(a \mid x, y'; \theta')}$, $\quad P(y \mid a)$ as above.
3. Hierarchical x-clustering:
$P(x) = \dfrac{n(x)}{N}$, $\quad P(y \mid a) = \dfrac{\sum_{x} n(x, y)\,P\{a \mid x, y; \theta'\}}{\sum_{x, y'} n(x, y')\,P\{a \mid x, y'; \theta'\}}$
4. One-sided x-clustering:
$P(x) = \dfrac{n(x)}{N}$, $\quad P(y \mid c) = \dfrac{\sum_{x} n(x, y)\,P\{C(x) = c \mid S; \theta'\}}{\sum_{x} n(x)\,P\{C(x) = c \mid S; \theta'\}}$
Clustering
56
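A compact, illustrative implementation of the symmetric aspect-model EM (the E-step from the "Comparison of E's" slide together with the aspect M-steps above); the co-occurrence counts and the number of aspects are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n_xy = rng.integers(0, 5, size=(8, 12)).astype(float)   # n(x, y) co-occurrence counts
X, Y, A = n_xy.shape[0], n_xy.shape[1], 3

Pa = np.full(A, 1.0 / A)                                         # P(a)
Px_a = rng.random((A, X)); Px_a /= Px_a.sum(1, keepdims=True)    # P(x|a)
Py_a = rng.random((A, Y)); Py_a /= Py_a.sum(1, keepdims=True)    # P(y|a)

for _ in range(100):
    # E-step: P(a|x,y) proportional to P(a) P(x|a) P(y|a)
    post = Pa[:, None, None] * Px_a[:, :, None] * Py_a[:, None, :]
    post /= post.sum(0, keepdims=True) + 1e-12
    # M-step: re-estimate parameters from expected counts n(x,y) P(a|x,y)
    na = n_xy[None, :, :] * post
    Pa = na.sum((1, 2)) / n_xy.sum()
    Px_a = na.sum(2) / (na.sum((1, 2))[:, None] + 1e-12)
    Py_a = na.sum(1) / (na.sum((1, 2))[:, None] + 1e-12)

print(Pa.round(3))
```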
Example Model
[Hofmann and Popat CIKM 2001]
• Hierarchy of document categories
Clustering
57
Example Application
Clustering
58
Topic Hierarchies
• To overcome sparseness problem in topic
hierarchies with large number of classes
• Sparseness Problem: Small number of
positive examples
• Topic hierarchies to reduce variance in parameter estimation
– Automatically differentiate
– Make use of term distributions estimated for more general, coarser text aspects to provide better, smoothed estimates of class-conditional term distributions
– Convex combination of term distributions in a Hierarchical Mixture Model:
$P(w \mid c) = \sum_{a \uparrow c} P(a \mid c)\,P(w \mid a)$
where $a \uparrow c$ refers to all inner nodes a above the terminal class node c.
Clustering
59
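A hedged sketch of the convex combination above, P(w|c) = Σ over ancestors a of c of P(a|c) P(w|a), on a made-up three-node path; the vocabulary size, node names, and mixing weights are illustrative assumptions.

```python
import numpy as np

vocab_size = 5
P_w_given_a = {                        # term distributions at inner nodes and the leaf
    "root":    np.full(vocab_size, 1.0 / vocab_size),
    "science": np.array([0.4, 0.3, 0.1, 0.1, 0.1]),
    "physics": np.array([0.7, 0.1, 0.1, 0.05, 0.05]),
}
path = ["root", "science", "physics"]          # nodes above (and including) class c = "physics"
P_a_given_c = np.array([0.2, 0.3, 0.5])        # mixture weights P(a|c), sum to 1

# smoothed class-conditional term distribution
P_w_given_c = sum(w * P_w_given_a[a] for w, a in zip(P_a_given_c, path))
print(P_w_given_c.round(3), P_w_given_c.sum())
```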
Topic Hierarchies
(Hierarchical X-clustering)
• X = document, Y = word
$P(y \mid a) = \dfrac{\sum_{n: y^n = y} P\{a \mid x^n, y^n; \theta'\}}{\sum_{n=1}^{N} P\{a \mid x^n, y^n; \theta'\}} = \dfrac{\sum_{x} n(x, y)\,P\{a \mid x, y; \theta'\}}{\sum_{x, y'} n(x, y')\,P\{a \mid x, y'; \theta'\}} = \dfrac{\sum_{x:\, a \uparrow c(x)} n(c(x), y)\,P\{a \mid c(x), y; \theta'\}}{\sum_{x:\, a \uparrow c(x),\, y'} n(c(x), y')\,P\{a \mid c(x), y'; \theta'\}}$
$P\{a \mid x, y, c(x); \theta\} = P\{a \mid y, c(x); \theta\} = \dfrac{P(a \mid x, c)\,P(y \mid a)}{\sum_{a' \in A} P(a' \mid x, c)\,P(y \mid a')} = \dfrac{P(a \mid c)\,P(y \mid a)}{\sum_{a' \uparrow c} P(a' \mid c(x))\,P(y \mid a')}$
$P\{a \mid c(x); \theta\} = \dfrac{\sum_{y} n(y, c)\,P(a \mid y, c(x))}{\sum_{y} n(y, c)\sum_{a'} P(a' \mid y, c(x))}$
$P(x) = \dfrac{n(x)}{N}$
$P\{C(x) = c \mid S, \theta\} = \dfrac{P(c)\prod_{y \in Y}\big[\sum_{a \in A} P(y \mid a)\,P(a \mid x, c(x))\big]^{n(x, y)}}{\sum_{c' \in C} P(c')\prod_{y \in Y}\big[\sum_{a \in A} P(y \mid a)\,P(a \mid x, c'(x))\big]^{n(x, y)}} = \dfrac{P(c)\prod_{y \in Y}\big[\sum_{a \uparrow c} P(y \mid a)\,P(a \mid c(x))\big]^{n(x, y)}}{\sum_{c' \in C} P(c')\prod_{y \in Y}\big[\sum_{a \uparrow c'} P(y \mid a)\,P(a \mid c'(x))\big]^{n(x, y)}}$
Clustering
60
Document Classification Exercise
• Modification of Naïve Bayes
$P(w \mid c) = \sum_{a \uparrow c} P(a \mid c)\,P(w \mid a)$
$P(c \mid x) = \dfrac{P(c)\prod_{y_i \in x} P(y_i \mid c)}{\sum_{c'} P(c')\prod_{y_i \in x} P(y_i \mid c')}$
Clustering
61
Mixture vs Shrinkage
• Shrinkage [McCallum Rosenfeld AAAI’98]: Interior
nodes in the hierarchy represent
coarser views of the data which are
obtained by simple pooling scheme of
term counts
• Mixture : Interior nodes represent
abstraction levels with their
corresponding specific vocabulary
– Predefined hierarchy [Hofmann and Popat CIKM 2001]
– Creation of hierarchical model from unlabeled data
[Hofmann IJCAI’99]
Clustering
62
Mixture Density Networks(MDN)
[Bishop CM ’94 Mixture Density Networks]
• A broad and flexible class of distributions that are capable of modeling completely general continuous distributions
• Superimpose simple component densities with well-known properties to generate or approximate more complex distributions
• Two modules:
– Mixture models: Output has a distribution given as
mixture of distributions
– Neural Network: Outputs determine parameters of
the mixture model
Clustering
63
MDN: Example
A conditional mixture density network with Gaussian component densities
Clustering
64
MDN
• Parameter Estimation :
– Using the Generalized EM (GEM) algorithm to speed up
• Inference
– Even for a linear mixture, closed form
solution not possible
– Use of Monte Carlo Simulations as a
substitute
Clustering
65
Document model
• Vocabulary V, term $w_i$, document d represented by $c(d) = \big(f(w_i, d)\big)_{w_i \in V}$
• $f(w_i, d)$ is the number of times $w_i$ occurs in document d
• Most f’s are zeroes for a single document
• Monotone component-wise damping function g, such as log or square-root:
$g(c(d)) = \big(g(f(w_i, d))\big)_{w_i \in V}$
Clustering
66
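A minimal sketch of the damped document profile g(c(d)) described above, using square-root damping; the toy document is made up.

```python
import numpy as np
from collections import Counter

doc = "car auto car garage repair car".split()
vocab = sorted(set(doc))
counts = Counter(doc)

c = np.array([counts[w] for w in vocab], dtype=float)   # f(w_i, d); mostly zeros in general
g = np.sqrt(c)                                          # component-wise damping g
print(dict(zip(vocab, g.round(2))))
```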