DENCLUE 2.0: Fast Clustering based
on Kernel Density Estimation
Alexander Hinneburg
Martin-Luther-University Halle-Wittenberg, Germany
Hans-Henning Gabriel
101tec GmbH, Halle, Germany
Overview
• Density-based clustering and DENCLUE 1.0
• Hill climbing as EM-algorithm
• Identification of local maxima
• Applications of general EM-acceleration
• Experiments
Density-Based Clustering
• Assumption
– clusters are regions of high density in the data space
• How to estimate density?
– parametric models
• mixture models
– non-parametric models
• histogram
• kernel density estimation
Kernel Density Estimation
• Idea
– influence of a data point is modeled by a kernel
– density is the normalized sum of all kernels
– smoothing parameter h
Gaussian kernel: K(u) = (2\pi)^{-d/2} \exp(-\|u\|^2 / 2)
Density estimate: \hat{p}(x) = \frac{1}{N h^d} \sum_{t=1}^{N} K\left(\frac{x - x_t}{h}\right)
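As an illustration, a minimal NumPy sketch of the Gaussian kernel density estimate above; the function and parameter names are my own, not from the talk:

import numpy as np

def gaussian_kde(x, data, h):
    # data: (N, d) array of data points; x: (d,) query point; h: smoothing parameter
    n, d = data.shape
    u2 = np.sum(((x - data) / h) ** 2, axis=1)     # squared scaled distances
    k = (2 * np.pi) ** (-d / 2) * np.exp(-u2 / 2)  # Gaussian kernel values
    return k.sum() / (n * h ** d)                  # normalized sum of all kernels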
DENCLUE 1.0 Framework
• Clusters are defined by local maxima of
the density estimate
– find all maxima by hill climbing
Gradient: \nabla \hat{p}(x) = \frac{1}{N h^{d+2}} \sum_{t=1}^{N} K\left(\frac{x - x_t}{h}\right)(x_t - x)
Hill climbing (const. step size \delta): x^{(l+1)} = x^{(l)} + \delta \, \frac{\nabla \hat{p}(x^{(l)})}{\|\nabla \hat{p}(x^{(l)})\|}
• Problem
– the constant step size \delta (sketched below)
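A sketch of this DENCLUE 1.0-style ascent under the definitions above; delta and steps are illustrative defaults, not values from the talk:

import numpy as np

def gradient(x, data, h):
    # gradient of the Gaussian kernel density estimate at x
    n, d = data.shape
    u2 = np.sum(((x - data) / h) ** 2, axis=1)
    k = (2 * np.pi) ** (-d / 2) * np.exp(-u2 / 2)
    return (k[:, None] * (data - x)).sum(axis=0) / (n * h ** (d + 2))

def hill_climb_fixed_step(x, data, h, delta=0.1, steps=100):
    # constant step of length delta along the normalized gradient;
    # delta must be tuned by hand
    for _ in range(steps):
        g = gradient(x, data, h)
        x = x + delta * g / np.linalg.norm(g)
    return x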
Problem of const. Step Size
• Not efficient
– many unnecessarily small steps
• Not effective
– does not converge to a local maximum, only comes close
• Example
New Hill Climbing Approach
• General approach
– differentiate the density estimate and set the gradient to zero
– no closed-form solution, but the equation can be rearranged into a fixed-point iteration (below)
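For the Gaussian kernel, setting the gradient to zero can be rearranged into a fixed-point equation (a reconstruction of the step the slide alludes to):

\nabla \hat{p}(x) = 0
\iff
x = \frac{\sum_{t=1}^{N} K\left(\frac{x - x_t}{h}\right) x_t}{\sum_{t=1}^{N} K\left(\frac{x - x_t}{h}\right)}

which suggests iterating

x^{(l+1)} = \frac{\sum_{t=1}^{N} K\left(\frac{x^{(l)} - x_t}{h}\right) x_t}{\sum_{t=1}^{N} K\left(\frac{x^{(l)} - x_t}{h}\right)}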
New DENCLUE 2.0 Hill Climbing
• Efficient
– automatically adjusted step size at no extra cost
• Effective
– converges to a local maximum (proof follows)
• Example
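A minimal sketch of this fixed-point iteration for the Gaussian kernel; names and the iteration count are my own:

import numpy as np

def hill_climb_denclue2(x, data, h, steps=100):
    # each step moves x to the kernel-weighted mean of the data,
    # so the step size adapts automatically
    for _ in range(steps):
        u2 = np.sum(((x - data) / h) ** 2, axis=1)
        k = np.exp(-u2 / 2)  # the (2*pi)^(-d/2) factor cancels in the ratio
        x = (k[:, None] * data).sum(axis=0) / k.sum()
    return x

Each iteration costs the same N kernel evaluations as a single density evaluation, which is why the adjusted step size comes at no extra cost.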
Proof of Convergence
• Cast the problem of maximizing the kernel density
as maximizing the likelihood of a mixture model
• Introduce a hidden variable indicating which kernel component generated the point
Proof of Convergence
• The complete likelihood is maximized by the EM algorithm
E-step: q_t = \frac{K\left(\frac{x^{(l)} - x_t}{h}\right)}{\sum_{s=1}^{N} K\left(\frac{x^{(l)} - x_s}{h}\right)}
M-step: x^{(l+1)} = \sum_{t=1}^{N} q_t \, x_t
• This also maximizes the original likelihood,
which is the kernel density estimate
• Starting the EM with x^{(0)} = x performs the hill climbing for the data point x
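Reading the fixed-point step above as EM, a sketch of the E/M split for a single query point (assuming Gaussian kernels; names are mine):

import numpy as np

def em_step(x, data, h):
    # one EM iteration of the hill climbing for a single point x
    u2 = np.sum(((x - data) / h) ** 2, axis=1)
    k = np.exp(-u2 / 2)
    q = k / k.sum()                           # E-step: posterior q_t of kernel t
    return (q[:, None] * data).sum(axis=0)    # M-step: posterior-weighted mean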
Identification of local Maxima
• The EM algorithm iterates until
– an end point x* is reached, detected when
– the sum of the k last step sizes drops below a threshold \epsilon
• Assumption
– the true local maximum lies in a ball of radius \epsilon around x*
• Points whose end points lie closer than 2\epsilon to each other
belong to the same maximum M
• In case of a non-unique assignment,
do a few extra EM iterations
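A sketch of this merging rule under the 2\epsilon threshold reconstructed above; the array layout and names are my own:

import numpy as np

def group_end_points(endpoints, eps):
    # endpoints: (N, d) array of hill-climbing end points x*;
    # end points closer than 2*eps share the same local maximum
    labels = -np.ones(len(endpoints), dtype=int)
    n_clusters = 0
    for i, e in enumerate(endpoints):
        if labels[i] >= 0:
            continue
        near = (np.linalg.norm(endpoints - e, axis=1) < 2 * eps) & (labels < 0)
        labels[near] = n_clusters
        n_clusters += 1
    return labels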
Acceleration
• Sparse EM
– update only the p% of points with the largest posteriors
– saves the remaining (100-p)% of kernel computations after the first iteration
• Data Reduction
– use only p% of the data as representative points
– random sampling (RS)
– k-means
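A sketch of the sparse-EM idea; the caching scheme and names are illustrative, not from the talk. After a full first iteration, only the kernels of the points with the largest posteriors are recomputed:

import numpy as np

def sparse_em_step(x, data, h, k_cached, p=0.2):
    # recompute kernel values only for the fraction p of data points
    # with the largest cached posteriors; reuse the cached rest
    m = max(1, int(p * len(data)))
    top = np.argsort(k_cached)[-m:]           # indices of largest posteriors
    u2 = np.sum(((x - data[top]) / h) ** 2, axis=1)
    k = k_cached.copy()
    k[top] = np.exp(-u2 / 2)                  # refresh only the top-m kernels
    x_new = (k[:, None] * data).sum(axis=0) / k.sum()
    return x_new, k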
Experiments
• Comparison of DENCLUE 1.0 (FS) vs. 2.0 (SSA)
• 16-dim. artificial data
• both methods are tuned to find the correct clustering
Experiments
• Comparison of acceleration methods
Experiments
• Clustering quality (normalized mutual information, NMI) vs. sample size (RS)
Experiments
• Cluster quality (NMI) of DENCLUE 2.0 (SSA), the
acceleration methods, and k-means on real data
(sample sizes 0.8, 0.4, 0.2)
Conclusion
• New hill climbing for DENCLUE
• Automatic step size adjustment
• Convergence proof by reduction to EM
• Allows the application of general EM accelerations
• Future work
– automatic setting of smoothing parameter h
(so far tuned manually)
Thank you for your attention!