Clustering with Bregman Divergences
Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon,
Joydeep Ghosh
Presented by
Rohit Gupta
CSci 8980: Machine Learning
Outline
• Bregman Divergences – Basics and Examples
• Bregman Information
• Bregman Hard Clustering
• The Exponential Family and connection to Bregman
Divergence
• Bregman Soft Clustering
• Experiments and Results
• Conclusions
Bregman Hard and Soft Clustering
• Most existing parametric clustering methods partition the
data into a pre-specified number of partitions, with a cluster
representative corresponding to every partition/cluster
Hard Clustering – disjoint partitioning of the data such
that each data point belongs to exactly one of the
partitions
Soft Clustering – each data point has a certain
probability of belonging to each of the partitions
Hard Clustering can be seen as Soft Clustering when
probabilities are either 0 or 1
Distortion or Loss Functions
• Squared Euclidean distance is the most commonly used
loss function
Extensive literature
Easy to use – leads to simple calculations
Not appropriate for some domains
Difficult to compute for sparse data (missing
dimensions)
Example: Iterative K-means algorithm
• Question: How to choose a distortion/loss function for a
given problem?
Bregman Divergences
• Ref: Definition 1 in the paper:
Let $\phi : S \to \mathbb{R}$
be a differentiable and strictly convex function on a convex set $S$
The Bregman divergence $d_\phi$ is defined as:
$d_\phi(x, y) = \phi(x) - \phi(y) - \langle x - y, \nabla\phi(y) \rangle$
• Examples:
Squared distance
Relative Entropy (KL divergence)
Itakura Saito distance
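The definition and the three examples above can be sketched in a few lines of Python. This is only an illustration: the helper name `bregman`, the tuple names, and the sample vectors are assumptions made here, and the generating functions $\phi$ are the standard ones for each divergence (squared norm, negative entropy, Burg entropy).

```python
import math

def bregman(phi, grad_phi, x, y):
    """d_phi(x, y) = phi(x) - phi(y) - <x - y, grad_phi(y)>, vectors as lists."""
    inner = sum((xi - yi) * gi for xi, yi, gi in zip(x, y, grad_phi(y)))
    return phi(x) - phi(y) - inner

# phi(x) = ||x||^2  ->  squared Euclidean distance
sq_norm = (lambda x: sum(v * v for v in x),
           lambda x: [2 * v for v in x])
# phi(p) = sum_j p_j log p_j  ->  relative entropy (KL) on the probability simplex
neg_entropy = (lambda p: sum(v * math.log(v) for v in p),
               lambda p: [math.log(v) + 1 for v in p])
# phi(x) = -sum_j log x_j  ->  Itakura-Saito distance (positive entries only)
burg = (lambda x: -sum(math.log(v) for v in x),
        lambda x: [-1.0 / v for v in x])

# Hypothetical sample points; both sum to 1 so the KL case applies.
x, y = [0.2, 0.8], [0.5, 0.5]
d_sq = bregman(*sq_norm, x, y)      # equals ||x - y||^2
d_kl = bregman(*neg_entropy, x, y)  # equals sum_j x_j log(x_j / y_j)
d_is = bregman(*burg, x, y)         # equals sum_j (x_j/y_j - log(x_j/y_j) - 1)
```

Only the pair (`phi`, `grad_phi`) changes between divergences, which is what lets a single clustering routine serve all of them.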
Few Take Home Points on Bregman Divergence
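1. $d_\phi(x, y) \neq d_\phi(y, x)$ in general
(Not symmetric, and the triangle inequality does not hold in general)
2. $d_\phi(x, y) > 0$ if $x \neq y$
$d_\phi(x, y) = 0$ if $x = y$
3. Three Point Property
$d_\phi(x, y) = d_\phi(z, y) + d_\phi(x, z) - \langle x - z, \nabla\phi(y) - \nabla\phi(z) \rangle$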
4. Strictly convex in the first argument but not necessarily so
in the second argument
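The asymmetry and the three point property can be checked numerically. The sketch below assumes one particular generating function, $\phi(x) = \sum_j (x_j \log x_j - x_j)$ (the generalized I-divergence), and three arbitrary positive vectors; none of these choices come from the slides.

```python
import math

def bregman(phi, grad_phi, x, y):
    """d_phi(x, y) = phi(x) - phi(y) - <x - y, grad_phi(y)>."""
    inner = sum((a - b) * g for a, b, g in zip(x, y, grad_phi(y)))
    return phi(x) - phi(y) - inner

# Assumed example: phi(x) = sum_j (x_j log x_j - x_j), gradient log x_j.
def phi(x):
    return sum(v * math.log(v) - v for v in x)

def grad(x):
    return [math.log(v) for v in x]

# Hypothetical test points with positive entries.
x, y, z = [1.0, 4.0], [2.0, 3.0], [0.5, 2.5]

d_xy = bregman(phi, grad, x, y)
d_yx = bregman(phi, grad, y, x)   # differs from d_xy: no symmetry

# Three point property:
# d(x, y) = d(x, z) + d(z, y) - <x - z, grad(y) - grad(z)>
correction = sum((a - c) * (gy - gz)
                 for a, c, gy, gz in zip(x, z, grad(y), grad(z)))
rhs = bregman(phi, grad, x, z) + bregman(phi, grad, z, y) - correction
```

Here `d_xy` and `rhs` agree up to floating-point rounding, while `d_xy` and `d_yx` do not.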
Bregman Information
• Bregman Information of a random variable X is given by
$I_\phi(X) = \min_{s \in S} E[d_\phi(X, s)]$
• The optimal vector that achieves the minimal value will
be called Bregman representative of X
• For squared loss, the minimum loss is the variance
$E[\|X - E[X]\|^2]$
• Best predictor of the random variable is the mean
Bregman Information
• Bregman Information is the minimum loss, attained by the
Bregman representative
$\arg\min_{s \in S} E[d_\phi(X, s)]$
• Points to note:
The representative defined above always exists
It is uniquely determined
It does not depend on the choice of Bregman
divergence
The expectation of the random variable X, $E[X]$, is the
minimizer
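A small numerical check of the last two points: for three different Bregman divergences, a grid search over candidate representatives lands on the arithmetic mean each time. The data values, grid resolution, and function names are assumptions chosen for illustration, and the data is kept positive so that all three divergences are defined.

```python
import math

def d_squared(x, s):          # phi(x) = x^2
    return (x - s) ** 2

def d_i_divergence(x, s):     # phi(x) = x log x - x
    return x * math.log(x / s) - x + s

def d_itakura_saito(x, s):    # phi(x) = -log x
    return x / s - math.log(x / s) - 1

def expected_loss(d, data, s):
    """Empirical E[d(X, s)] under the uniform measure on the data."""
    return sum(d(x, s) for x in data) / len(data)

def best_on_grid(d, data, grid):
    """Grid point with the smallest expected loss."""
    return min(grid, key=lambda s: expected_loss(d, data, s))

data = [1.0, 2.0, 4.0, 9.0]                    # hypothetical 1-D sample
mean = sum(data) / len(data)
grid = [0.5 + 0.01 * i for i in range(1000)]   # candidates in [0.5, 10.49]

minimizers = []
for d in (d_squared, d_i_divergence, d_itakura_saito):
    minimizers.append(best_on_grid(d, data, grid))

# For squared loss the Bregman information is the variance of the sample.
bregman_info_squared = expected_loss(d_squared, data, mean)
```

Each entry of `minimizers` sits within one grid step of `mean`, although the minimum loss itself (the Bregman information) differs from one divergence to the next.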
Bregman Hard Clustering
• This problem is posed as a quantization problem that
involves minimizing the loss in Bregman information
• Very similar to squared-distance-based iterative K-means,
except that the distortion function can be any Bregman
Divergence
• Expected Bregman Divergence of the data points from their
Bregman representatives is minimized
• Procedure:
Initialize the representatives
Assign points to them
Re-estimate the representatives
Bregman Hard Clustering
• Algorithm:
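Initialize $\{\mu_h\}_{h=1}^{k}$
While (not converged)
Step 1: Assign each data point $x$ to the nearest cluster $X_h$ such that
$h = \arg\min_{h'} d_\phi(x, \mu_{h'})$
Step 2: Re-estimate the representatives
$\mu_h = \frac{1}{n_{X_h}} \sum_{x \in X_h} x$, where $n_{X_h}$ is the number of points in $X_h$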
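The two steps above can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the stopping rule (assignments stop changing), the handling of empty clusters, and the toy data are all assumptions.

```python
import math

def d_squared(x, mu):
    return sum((a - b) ** 2 for a, b in zip(x, mu))

def d_i_divergence(x, mu):
    # Generalized I-divergence; requires strictly positive entries.
    return sum(a * math.log(a / b) - a + b for a, b in zip(x, mu))

def bregman_hard_clustering(points, init_reps, divergence, max_iter=100):
    reps = [list(r) for r in init_reps]
    assign = None
    for _ in range(max_iter):
        # Step 1: assign each point to the representative with smallest divergence.
        new_assign = [min(range(len(reps)), key=lambda h: divergence(x, reps[h]))
                      for x in points]
        if new_assign == assign:      # assumed convergence test
            break
        assign = new_assign
        # Step 2: each representative becomes the arithmetic mean of its cluster.
        for h in range(len(reps)):
            members = [x for x, a in zip(points, assign) if a == h]
            if members:               # assumption: an empty cluster keeps its old representative
                reps[h] = [sum(col) / len(members) for col in zip(*members)]
    return assign, reps

# Hypothetical toy data: two well-separated groups of positive 2-D points.
points = [[1.0, 1.2], [0.8, 1.0], [1.1, 0.9],
          [5.0, 5.2], [5.3, 4.9], [4.8, 5.1]]
labels, reps = bregman_hard_clustering(points, [points[0], points[3]], d_i_divergence)
```

Passing `d_squared` instead of `d_i_divergence` gives ordinary K-means; the re-estimation step is the same mean in both cases.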
Take home points
• Exhaustiveness: Bregman hard clustering algorithm works
for all Bregman divergences and in fact only for Bregman
Divergences
Arithmetic mean is the best predictor for Bregman
Divergences only
Possible to design clustering algorithms based on
distortion functions that are not Bregman divergences, but
in that case, cluster representative would not be the
arithmetic mean or the expectation
• Linear Separators: Clusters obtained are separated by
hyperplanes
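The linear-separator claim follows directly from the definition of a Bregman divergence. A short derivation, where $\mu_1$ and $\mu_2$ stand for any two cluster representatives (notation assumed here):

```latex
\begin{align*}
d_\phi(x, \mu_1) - d_\phi(x, \mu_2)
  &= \bigl[\phi(x) - \phi(\mu_1) - \langle x - \mu_1, \nabla\phi(\mu_1) \rangle\bigr]
   - \bigl[\phi(x) - \phi(\mu_2) - \langle x - \mu_2, \nabla\phi(\mu_2) \rangle\bigr] \\
  &= \langle x, \nabla\phi(\mu_2) - \nabla\phi(\mu_1) \rangle
   + \phi(\mu_2) - \phi(\mu_1)
   + \langle \mu_1, \nabla\phi(\mu_1) \rangle
   - \langle \mu_2, \nabla\phi(\mu_2) \rangle
\end{align*}
```

The $\phi(x)$ terms cancel, so the difference is affine in $x$. The set of points equally close to both representatives is therefore a hyperplane.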
Take home points
• Scalability: Each iteration of Bregman hard clustering
algorithm is linear in the number of data points and the
number of desired clusters
• Applicability to mixed data types: Allows choosing
different Bregman divergences that are meaningful and
appropriate for different subsets of features
• Also guarantees that the objective function will
monotonically decrease till convergence
Exponential families and Bregman Divergences
• [Forster & Warmuth] remarked that the log-likelihood of the
density of an exponential family distribution can be written as
follows:
$\log(p_{(\psi,\theta)}(x)) = -d_\phi(x, \mu(\theta)) + \log(b_\phi(x))$
Here $b_\phi$ is a uniquely determined function, $\phi$ is the Legendre conjugate of $\psi$,
$\mu$ is the expectation parameter and $\theta$ is the corresponding natural parameter
• Points to note:
$\psi$ is the cumulant function and it determines the exponential family
$\theta$ fixes the distribution in the family
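As a concrete instance of this identity, the Poisson family can be checked numerically. The sketch assumes the standard pairing for the Poisson case, $\phi(x) = x \log x - x$ with expectation parameter $\mu = \lambda$, which gives $b_\phi(x) = x^x e^{-x} / x!$. The function names and the value of `lam` are illustrative.

```python
import math

def poisson_log_pmf(x, lam):
    """log of e^{-lam} lam^x / x!"""
    return -lam + x * math.log(lam) - math.lgamma(x + 1)

def d_phi(x, mu):
    # Bregman divergence for phi(x) = x log x - x, with the convention 0 log 0 = 0.
    xlog = x * math.log(x / mu) if x > 0 else 0.0
    return xlog - x + mu

def log_b_phi(x):
    # log b_phi(x) = x log x - x - log(x!), independent of the parameter.
    xlogx = x * math.log(x) if x > 0 else 0.0
    return xlogx - x - math.lgamma(x + 1)

def max_identity_gap(lam, xs):
    """Largest |log p(x) - (-d_phi(x, mu) + log b_phi(x))| over xs, with mu = lam."""
    return max(abs(poisson_log_pmf(x, lam) - (-d_phi(x, lam) + log_b_phi(x)))
               for x in xs)

gap = max_identity_gap(3.5, range(15))   # lam = 3.5 is an arbitrary choice
```

The gap is zero up to rounding, and `log_b_phi` never touches `lam`. This is why maximizing the likelihood over the parameter amounts to minimizing the divergence.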
Bregman Soft Clustering
• Problem is posed as a parameter estimation problem for
mixture models based on exponential family distributions
• EM algorithm is used to design Bregman Soft Clustering
algorithm
• Maximizing log likelihood of data in the EM algorithm
would be equivalent to minimizing the Bregman Divergence in
the Bregman Soft Clustering algorithm (refer to the previous
slide)
• Every regular exponential family has a corresponding
Bregman Divergence
Bregman Soft Clustering
• Algorithm:
Initialize $\{\pi_h, \mu_h\}_{h=1}^{k}$
While (not converged)
Step 1: Expectation step
Compute the posterior probability for all $x$, $h$
$p(h \mid x) \propto \pi_h \exp(-d_\phi(x, \mu_h)) \, b_\phi(x)$
Step 2: Maximization step
Recompute the parameters for all $h$, such that the Bregman Divergence is minimized
$\pi_h = \frac{1}{n} \sum_x p(h \mid x)$
$\mu_h = \frac{\sum_x p(h \mid x) \, x}{\sum_x p(h \mid x)}$
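The expectation and maximization steps above can be sketched as follows. This is a minimal illustration under several assumptions: a fixed number of iterations replaces the convergence test, the divergence is half the squared Euclidean distance (corresponding to unit-variance spherical Gaussians), and the function names and toy data are made up for the example. The factor $b_\phi(x)$ is the same for every $h$, so it cancels when the posterior is normalized.

```python
import math

def d_half_squared(x, mu):
    # phi(x) = ||x||^2 / 2
    return 0.5 * sum((a - b) ** 2 for a, b in zip(x, mu))

def bregman_soft_clustering(points, init_priors, init_reps, divergence, n_iter=50):
    priors = list(init_priors)
    reps = [list(r) for r in init_reps]
    k, n, dim = len(reps), len(points), len(points[0])
    post = []
    for _ in range(n_iter):
        # E-step: p(h|x) proportional to pi_h * exp(-d(x, mu_h)).
        post = []
        for x in points:
            w = [priors[h] * math.exp(-divergence(x, reps[h])) for h in range(k)]
            total = sum(w)   # may underflow to 0 for very distant points (not handled here)
            post.append([v / total for v in w])
        # M-step: update priors and posterior-weighted means.
        for h in range(k):
            mass = sum(p[h] for p in post)
            priors[h] = mass / n
            reps[h] = [sum(p[h] * x[j] for p, x in zip(post, points)) / mass
                       for j in range(dim)]
    return priors, reps, post

# Hypothetical toy data: two well-separated groups of 2-D points.
points = [[1.0, 1.2], [0.8, 1.0], [1.1, 0.9],
          [5.0, 5.2], [5.3, 4.9], [4.8, 5.1]]
priors, reps, post = bregman_soft_clustering(
    points, [0.5, 0.5], [points[0], points[3]], d_half_squared)
```

On data this well separated the posteriors are nearly 0 or 1, so the result is close to what hard clustering would produce.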
Experiments and Results
• Question: How does the quality of clustering depend on
the appropriateness of the Bregman divergence?
• Experiments performed on synthetic data showed that cluster
quality is better when the matching Bregman divergence is used
than when a non-matching one is used
• Experiment 1:
Three 1-dimensional datasets of 100 samples each are
generated based on mixture models of Gaussian, Poisson,
and Binomial distributions respectively
The datasets were clustered using three versions of
Bregman hard clustering corresponding to different
Bregman divergences
Experiments and Results
Mutual information is used to compare the results
Table 3 in the paper shows large numbers along the
diagonal, which shows the importance of using the
appropriate Bregman divergence
• Experiment 2:
Similar to experiment 1, except that this is for multidimensional data.
Table 4 in the paper shows the results, which again
indicate the same observation as above
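For readers who want to score a clustering against known labels in the same spirit, mutual information between two labelings can be computed as below. The slides do not say whether the score was normalized; the normalization by $\sqrt{H(a) H(b)}$ used here is one common convention and is an assumption, as are the function names and the tiny label vectors.

```python
import math
from collections import Counter

def mutual_information(labels_a, labels_b):
    """Empirical mutual information (in nats) between two labelings."""
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mi = 0.0
    for (a, b), n_ab in joint.items():
        mi += (n_ab / n) * math.log(n_ab * n / (count_a[a] * count_b[b]))
    return mi

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def normalized_mutual_information(labels_a, labels_b):
    # Assumed convention: divide by the geometric mean of the two entropies.
    denom = math.sqrt(entropy(labels_a) * entropy(labels_b))
    return mutual_information(labels_a, labels_b) / denom if denom > 0 else 0.0

truth = [0, 0, 1, 1]        # hypothetical ground-truth classes
relabeled = [1, 1, 0, 0]    # same partition, cluster ids swapped
unrelated = [0, 1, 0, 1]    # partition independent of the truth

nmi_same = normalized_mutual_information(truth, relabeled)
mi_unrelated = mutual_information(truth, unrelated)
```

The score depends only on the partition, not on the cluster ids, so `relabeled` scores as a perfect match while `unrelated` carries no information about `truth`.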
Conclusions
• Hard and Soft clustering algorithms are presented that
minimize the loss function based on Bregman Divergences
• It was shown that there is a one-to-one mapping between
regular exponential families and regular Bregman
Divergences, which helped in formulating the soft clustering
algorithm
• A connection between Bregman divergences and Shannon’s rate
distortion theory is also established
• Experiments on synthetic data showed the importance of
choosing the right Bregman divergence for the corresponding
exponential family