Relevant Overlapping Subspace Clusters on Categorical Data

Relevant Overlapping Subspace
Clusters on CATegorical Data
(ROCAT)
Xiao He1, Jing Feng1, Bettina Konte1, Son T.Mai1, Claudia Plant2
1: University of Munich, 2: Helmholtz Zentrum München, Technische Universität München
{he, feng, konte, mtson}@dbs.ifi.lmu.de, [email protected]
Presented by George Hodulik
Motivation of ROCAT algorithm
• Subspace clusters are more common in data than full-dimensional
clusters
• Most current subspace clustering algorithms have at least one of the
following problems:
• Heavily depend on input parameters
• Produce many redundancies
• Partition based (subspace clusters cannot overlap)
• Require fault-tolerant data
• Only relevant for numerical data
• Greatly affected by outliers
Use data compression as a measurement of
similarity – Minimum Description Length (MDL)
• MDL Principle: The subspace clusters that compress the data
optimally will be the most relevant subspace clusters.
[Figure: coding cost of the same data set under three models. Subspace clustering: 6076.5 bits; Full-D clustering: 6147.1 bits; No clustering: 6670.9 bits. Legend: subspace cluster Ci, non-clustered area.]
Shannon Entropy as a measurement of MDL
• Shannon Entropy is the lower bound of lossless compression
• We do not need to actually compress the data, so we will use
Shannon Entropy as a measurement of MDL
Entropy of an attribute Aj
Entropy of subspace cluster Ci
We want to minimize the sum of the coding cost of each cluster, the non-clustered
area, and the model description of the subspace clusters.
This minimization will give us the most relevant subspace clusters.
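A minimal sketch of the entropy measurement, assuming the coding cost of a block of data is the number of objects times the Shannon entropy of each attribute, summed over attributes. The function names and the toy rows are illustrative assumptions, and the model-description term of the full cost is left out.

```python
import math
from collections import Counter

def attribute_entropy(values):
    """Shannon entropy (bits per object) of one categorical attribute."""
    n = len(values)
    if n == 0:
        return 0.0
    return sum(-(c / n) * math.log2(c / n) for c in Counter(values).values())

def coding_cost(rows, attributes):
    """Bits to code `rows` on `attributes`: sum over attributes of n * H(A_j).
    Assumption: the model-description cost is ignored in this sketch."""
    n = len(rows)
    return sum(n * attribute_entropy([r[a] for r in rows]) for a in attributes)

# Toy data (hypothetical): "color" is constant, "shape" is evenly split.
rows = [
    {"color": "red", "shape": "round"},
    {"color": "red", "shape": "square"},
    {"color": "red", "shape": "round"},
    {"color": "red", "shape": "square"},
]
print(attribute_entropy([r["color"] for r in rows]))
print(coding_cost(rows, ["color", "shape"]))
```

A constant attribute costs nothing to code, which is why grouping objects that agree on several attributes lowers the total cost.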
ROCAT Algorithm
• Input: Data set D
• Output: List of subspace clusters in D
• 3 phases:
• Searching
• Combining
• Reassigning
Searching : Find subspace clusters
• Keep finding the best pure subspace cluster until the Shannon
Entropy of the data set no longer decreases
Searching : Find best pure cluster
A pure subspace cluster is one in which every object has the same value for each
attribute of the subspace.
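The purity condition can be checked directly. The sketch below only tests whether a given set of objects is pure on a given set of attributes; it is not the FindBestPure search itself, and the name `is_pure` and the toy rows are assumptions made for illustration.

```python
def is_pure(rows, object_ids, attributes):
    """True if all listed objects share a single value on every listed attribute."""
    return all(
        len({rows[i][a] for i in object_ids}) <= 1
        for a in attributes
    )

# Toy data (hypothetical): objects 0-2 agree on "color" but not on "shape".
rows = [
    {"color": "red", "shape": "round"},
    {"color": "red", "shape": "square"},
    {"color": "red", "shape": "round"},
    {"color": "blue", "shape": "round"},
]
print(is_pure(rows, [0, 1, 2], ["color"]))
print(is_pure(rows, [0, 1, 2], ["color", "shape"]))
```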
Algorithm FindBestPure
How FindBestPure works
Combining Phase
For each pair of clusters Ci and Cj that
overlap, split/combine them as shown,
choosing the option which minimizes
the Shannon entropy of the data set.
Reassigning phase
• For each subspace cluster Ci:
• Find each object o that matches the (attribute, value) description of Ci
• Add o to Ci or remove o from Ci if doing so reduces the Shannon Entropy
• Then, for each Ci that was changed, try adding attributes to Ci and keep
each addition that decreases the Shannon Entropy. Trying attributes in order of
their Shannon Entropy makes this more efficient.
• Repeat both steps until nothing changes.
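The object step of this phase can be sketched as follows, under a simplified cost: cells inside the cluster and cells outside it are coded separately on the cluster's attributes, and the model-description cost is ignored. The attribute-adding step is omitted. All names (`entropy_bits`, `total_cost`, `reassign`) and the toy data are assumptions for illustration, not the authors' implementation.

```python
import math
from collections import Counter

def entropy_bits(values):
    """Total bits to code a list of categorical values: n * H(values)."""
    n = len(values)
    return sum(-c * math.log2(c / n) for c in Counter(values).values())

def total_cost(rows, members, attrs):
    """Simplified cost: on cluster attributes, code member and non-member
    cells separately. Assumption: model-description cost is ignored."""
    cost = 0.0
    for a in rows[0]:
        if a in attrs:
            inside = [rows[i][a] for i in members]
            outside = [rows[i][a] for i in range(len(rows)) if i not in members]
            cost += entropy_bits(inside) + entropy_bits(outside)
        else:
            cost += entropy_bits([r[a] for r in rows])
    return cost

def reassign(rows, members, description):
    """Toggle membership of objects matching `description` while the cost drops."""
    members = set(members)
    attrs = set(description)
    changed = True
    while changed:
        changed = False
        for i, r in enumerate(rows):
            if all(r[a] == v for a, v in description.items()):
                trial = members ^ {i}  # add i if absent, remove it if present
                if total_cost(rows, trial, attrs) < total_cost(rows, members, attrs):
                    members, changed = trial, True
    return members

# Toy data (hypothetical): object 2 matches the description but starts outside.
rows = [{"x": v, "z": "k"} for v in ["a", "a", "a", "b", "c", "b"]]
print(reassign(rows, {0, 1}, {"x": "a"}))
```

Each accepted change strictly lowers the cost, so the loop terminates.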
Runtime Complexity
• N objects, M attributes
• Searching Phase = O(M^2 * N)
• Combining Phase = O(g^2 * M * N)
• g << N,M is the number of subspace clusters found in Searching phase
• Close to O(M*N)
• Reassigning Phase = O(i * (M * N))
• i is the number of iterations in the reassigning phase until convergence
• Normally converges very fast, so close to O(M * N)
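Adding the three phase bounds gives an overall estimate. This is a back-of-the-envelope combination of the figures above, assuming g^2 and i are both small compared with M:

```latex
\begin{align*}
T(N, M) &= O(M^2 N) + O(g^2 M N) + O(i\, M N) \\
        &= O\big(M N\,(M + g^2 + i)\big) \\
        &\approx O(M^2 N) \quad \text{when } g^2 \ll M \text{ and } i \ll M .
\end{align*}
```

Under that assumption the searching phase dominates the total runtime.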
Comparable performance on synthetic data
[Figure: two panels, cluster quality (F-Measure) and subspace cluster quality (F-Measure)]
Comparable scalability on synthetic data
52 attributes used on left, 960 objects on right
Robustness against outliers
Real world Data – Congressional Votes
Survey with 16 attributes, 435
instances, 2 classes (Democratic and
Republican)
ROCAT produces very pure classes
and notes outliers, while DHCC takes
no notice of outliers, and MTV is
overwhelmed by outliers.
SUBCAD also performs well, but it
should be noted that its subspace
clusters span only 3 dimensions,
while ROCAT’s span 12 dimensions.
Real world data - Mushrooms
8124 records, 22 categorical
attributes, 2 classes (edible
and poisonous)
Nearly all ROCAT clusters
have a very high purity (cluster 15
being the only one that is not
pure), while all other methods show
significant impurity.
Notice that MTV has decent
precision, but leaves hundreds
of mushrooms unclassified
in the Noise category.
Real world data - Splice
3190 instances, 60 attributes, 3 classes (EI Exon/Intron, IE
Intron/Exon, Neither).
ROCAT and DHCC produce quite pure results, while all
others perform relatively poorly.
Again, MTV performs well but is very sensitive to outliers.
Real world data – overall precision
ROCAT significantly outperforms almost all
other methods with respect to precision.
Recall that SUBCAD subspace clusters in Vote
have much lower dimensionality than
ROCAT’s.
Recall that MTV in Mushroom fails to classify
hundreds of samples.
DHCC and ROCAT both perform well on Splice.
Conclusions
• ROCAT is a notable algorithm for finding non-redundant overlapping
subspace clusters in categorical data, with no parameters, and
without being negatively affected by outliers.
• Data compression is an intuitive way to represent similarity
• The combining phase seems redundant, since the reassigning phase
also removes redundancy and does so more completely.
• No single algorithm is a fix-all (yet). Some algorithms had results as
good as or better than ROCAT's for certain data sets.
Thank you!
• Questions?