Bayesian Hierarchical Clustering
Paper by K. Heller and Z. Ghahramani
ICML 2005
Presented by Hao-Wei Yeh
Outline
Background - Traditional Methods
Bayesian Hierarchical Clustering (BHC)
Basic ideas
Dirichlet Process Mixture Model (DPM)
Algorithm
Experimental results
Conclusion
Background
Traditional Method
Hierarchical Clustering
Given: data points
Output: a tree (a series of clusters)
Leaves: data points
Internal nodes: nested clusters
Examples
Evolutionary tree of living organisms
Internet newsgroups
Newswire documents
Traditional Hierarchical Clustering
Bottom-up agglomerative algorithm
Closeness is based on a given distance measure
(e.g., Euclidean distance between cluster means), as sketched below.
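For concreteness, a minimal SciPy sketch of this procedure (the toy data and the cut threshold are illustrative assumptions; note the threshold must be picked by hand, which is exactly the limitation discussed next):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: two loose groups of 2-D points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(5, 1, (10, 2))])

# Bottom-up merging; "centroid" linkage uses the Euclidean distance
# between cluster means as the closeness measure.
Z = linkage(X, method="centroid")

# The tree itself gives no guidance on where to cut; we must choose.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```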
Traditional Hierarchical Clustering (cont’d)
Limitations
No guide to choosing the correct number of clusters, or where to prune the tree.
Distance metric selection (especially for data such as images or sequences)
Evaluation (Probabilistic model)
How do we evaluate how good a result is?
How do we compare to other models?
How do we make predictions and cluster new data with an existing hierarchy?
BHC
Bayesian Hierarchical Clustering
Basic ideas:
Use marginal likelihoods to decide which clusters to merge
P(data to be merged came from the same mixture component)
vs. P(data to be merged came from different mixture components)
Generative Model: Dirichlet Process Mixture Model (DPM)
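Spelled out (a sketch in the paper's notation; the prior πk and the hypotheses H1 and H2 are defined on the following slides):

```latex
% Posterior probability that the candidate cluster D_k = D_i \cup D_j
% was generated as a single mixture component:
\[
  r_k \;=\;
  \frac{\pi_k \, p(\mathcal{D}_k \mid \mathcal{H}_1)}
       {\pi_k \, p(\mathcal{D}_k \mid \mathcal{H}_1)
        + (1 - \pi_k)\, p(\mathcal{D}_i \mid T_i)\, p(\mathcal{D}_j \mid T_j)}
\]
```

Merging when this quantity is high replaces the ad hoc distance measure of traditional agglomerative clustering.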
Dirichlet Process Mixture Model (DPM)
Formal Definition
Different Perspectives
Infinite version of Mixture Model (Motivation and Problems)
Stick-breaking Process (what the generated distribution looks like)
Chinese Restaurant Process, Polya urn scheme
Benefits
Conjugate prior
Unlimited clusters
"Rich-get-richer": does it really work? It depends!
Pitman-Yor process, Uniform Process, …
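To make the rich-get-richer behavior concrete, a minimal sketch of Chinese Restaurant Process table assignments (the function name, seed, and α value are illustrative):

```python
import random

def crp(n, alpha, seed=0):
    """Sample a partition of n points from a CRP with concentration alpha."""
    random.seed(seed)
    tables = []  # tables[t] = number of customers seated at table t
    for i in range(n):
        # Join existing table t with probability tables[t] / (i + alpha)
        # (rich get richer); open a new table w.p. alpha / (i + alpha).
        weights = tables + [alpha]
        t = random.choices(range(len(weights)), weights=weights)[0]
        if t == len(tables):
            tables.append(1)
        else:
            tables[t] += 1
    return tables

print(crp(100, alpha=1.0))  # typically a few big tables and many small ones
```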
BHC Algorithm - Overview
Same as traditional
One-pass, bottom-up method
Initializes each data point in its own cluster, and iteratively merges pairs of clusters.
Difference
Uses a statistical hypothesis test to choose which clusters to merge.
BHC Algorithm - Concepts
Two hypotheses to compare:
1. All data was generated i.i.d. from the same probabilistic model with unknown parameters.
2. Data has two or more clusters in it.
Hypothesis H1
Probability of the data under H1:
p(Dk|H1) = ∫ p(Dk|θ) p(θ|β) dθ = ∫ [ ∏_{x in Dk} p(x|θ) ] p(θ|β) dθ
p(θ|β): prior over the parameters θ
Dk: data in the two trees to be merged
The integral is tractable with a conjugate prior.
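The slides do not fix a concrete likelihood; as one standard conjugate example (a sketch assuming binary data with a Beta(a, b) prior on a Bernoulli parameter), the integral collapses to a ratio of Beta functions:

```python
from math import lgamma, exp

def log_marginal_bernoulli(heads, tails, a=1.0, b=1.0):
    """log p(Dk|H1) = log ∫ p(Dk|θ) Beta(θ; a, b) dθ for binary data.

    Closed form: B(a + heads, b + tails) / B(a, b), computed in log space.
    """
    def log_beta(x, y):
        return lgamma(x) + lgamma(y) - lgamma(x + y)
    return log_beta(a + heads, b + tails) - log_beta(a, b)

# The merged hypothesis H1 pools the data from both sub-trees:
print(exp(log_marginal_bernoulli(heads=7, tails=3)))
```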
Hypothesis H2
Probability of the data under H2:
p(Dk|H2) = p(Di|Ti) p(Dj|Tj)
Product over the two sub-trees; each factor recursively sums over the clusterings of its own sub-tree.
BHC Algorithm - Working Flow
From Bayes' rule, the posterior probability of the merged hypothesis:
rk = πk p(Dk|H1) / p(Dk|Tk), where
p(Dk|Tk) = πk p(Dk|H1) + (1 - πk) p(Di|Ti) p(Dj|Tj)
πk comes from the number of data points and the DPM concentration; the marginal likelihoods integrate out the hidden parameters of the underlying distribution.
The pair of trees with the highest rk is merged.
Natural place to cut the final tree: wherever rk < 0.5.
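Putting the pieces together, a minimal end-to-end sketch of the merge loop under the illustrative Beta-Bernoulli model above (the toy data, α = 1.0, and all names are assumptions, not the paper's reference code; the πk recursion it uses is spelled out two slides below):

```python
import itertools
from math import lgamma, log, exp

ALPHA = 1.0  # DPM concentration; an illustrative value

def logaddexp(x, y):
    m = max(x, y)
    return m + log(exp(x - m) + exp(y - m))

def log_marginal(data, a=1.0, b=1.0):
    """log p(D|H1) for binary data under a Beta(a, b)-Bernoulli model."""
    h = sum(data)
    t = len(data) - h
    lb = lambda x, y: lgamma(x) + lgamma(y) - lgamma(x + y)
    return lb(a + h, b + t) - lb(a, b)

class Node:
    def __init__(self, data, left=None, right=None):
        self.data, self.left, self.right = data, left, right
        log_h1 = log_marginal(data)
        if left is None:
            self.log_d = log(ALPHA)   # leaf: d_i = alpha, pi_i = 1
            self.log_p = log_h1       # p(D|T) = p(D|H1)
            self.r = 1.0
        else:
            # d_k = alpha*Γ(n_k) + d_i*d_j ;  pi_k = alpha*Γ(n_k) / d_k
            n = len(data)
            self.log_d = logaddexp(log(ALPHA) + lgamma(n),
                                   left.log_d + right.log_d)
            log_pi = log(ALPHA) + lgamma(n) - self.log_d
            log_merged = log_pi + log_h1
            log_split = (left.log_d + right.log_d - self.log_d
                         + left.log_p + right.log_p)
            # p(Dk|Tk) = pi_k p(Dk|H1) + (1 - pi_k) p(Di|Ti) p(Dj|Tj)
            self.log_p = logaddexp(log_merged, log_split)
            self.r = exp(log_merged - self.log_p)  # posterior merge prob rk

def bhc(points):
    """Greedy BHC: repeatedly merge the pair with the highest rk."""
    nodes = [Node([x]) for x in points]
    while len(nodes) > 1:
        i, j = max(itertools.combinations(range(len(nodes)), 2),
                   key=lambda ij: Node(nodes[ij[0]].data + nodes[ij[1]].data,
                                       nodes[ij[0]], nodes[ij[1]]).r)
        merged = Node(nodes[i].data + nodes[j].data, nodes[i], nodes[j])
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]
    return nodes[0]

root = bhc([0, 0, 0, 1, 1, 1])
print(root.r)  # rk < 0.5 at a node is the natural place to cut
```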
Tree-Consistent Partitions
Consider the tree (((1 2) 3) 4) and all 15 possible partitions of {1,2,3,4}:
(1)(2)(3)(4), (1 2)(3)(4), (1 3)(2)(4), (1 4)(2)(3), (2 3)(1)(4),
(2 4)(1)(3), (3 4)(1)(2), (1 2)(3 4), (1 3)(2 4), (1 4)(2 3),
(1 2 3)(4), (1 2 4)(3), (1 3 4)(2), (2 3 4)(1), (1 2 3 4)
(1 2) (3) (4) and (1 2 3) (4) are tree-consistent partitions.
(1)(2 3)(4) and (1 3)(2 4) are not tree-consistent partitions.
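A small sketch (not from the slides) that enumerates the tree-consistent partitions of a binary nested-tuple tree and reproduces the example above:

```python
def tree_partitions(tree):
    """Enumerate tree-consistent partitions of a binary nested-tuple tree.

    A partition is tree-consistent iff every block is the full leaf set
    of some subtree, i.e. partitions correspond to cuts of the tree.
    """
    def leaves(t):
        return [t] if not isinstance(t, tuple) else leaves(t[0]) + leaves(t[1])
    if not isinstance(tree, tuple):
        return [[[tree]]]          # a leaf: only the singleton partition
    whole = [[leaves(tree)]]       # keep this subtree as one block
    cut = [l + r                   # or cut here and combine the children's
           for l in tree_partitions(tree[0])   # tree-consistent partitions
           for r in tree_partitions(tree[1])]
    return whole + cut

for p in tree_partitions((((1, 2), 3), 4)):   # the tree from this slide
    print(p)   # the 4 tree-consistent partitions out of the 15 above
```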
Merged Hypothesis Prior (πk)
Based on the DPM (Chinese Restaurant Process perspective):
πk = P(all points in Dk belong to one cluster)
Computed bottom-up (nk = number of points in Dk; i, j the children of node k):
leaf i: di = α, πi = 1
internal k: dk = α Γ(nk) + di dj, πk = α Γ(nk) / dk
The d's accumulate the weight of all tree-consistent partitions.
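A quick worked check of the recursion for the first merge of two leaves (the α value is illustrative):

```python
from math import gamma

alpha = 1.0  # illustrative DPM concentration

# Merging two leaves (nk = 2): di = dj = alpha, so
#   dk = alpha * Γ(2) + alpha * alpha
#   πk = alpha * Γ(2) / dk = 1 / (1 + alpha)
d_k = alpha * gamma(2) + alpha * alpha
pi_k = alpha * gamma(2) / d_k
print(pi_k)  # 0.5 when alpha = 1: merge and split equally likely a priori
```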
Predictive Distribution
BHC allows us to define predictive distributions for new data points, computed by recursing down the tree and weighting each node's posterior predictive by its merge probability rk.
Note: p(x|D) != p(x|Dk, H1) even at the root, because the prediction averages over all tree-consistent partitions rather than assuming one single cluster.
Approximate Inference for the DPM Prior
BHC forms a lower bound on the marginal likelihood of an infinite mixture model by efficiently summing over an exponentially large subset of all partitions.
Idea: deterministically sum over the partitions with high probability, thereby accounting for most of the mass.
Compared to MCMC methods, this is deterministic and more efficient.
Learning Hyperparameters
α: concentration parameter
β: parameters of the base distribution G0
Learned by recursively computed gradients in an EM-like method.
To Sum Up for BHC
A statistical model for comparing merges, which also decides when to stop.
Allows defining predictive distributions for new data points.
Approximate inference for the DPM marginal likelihood.
Hyperparameters:
α: concentration parameter
β: parameters of the base distribution G0
Unique Aspects of the BHC Algorithm
A hierarchical way of organizing nested clusters, not a hierarchical generative model.
Derived from the DPM.
Hypothesis test: one cluster vs. many alternative clusterings
(rather than one vs. two clusters at each stage).
Not iterative and requires no sampling (except for learning the hyperparameters).
Results
from the experiments
Conclusion
and some take-home notes
Conclusion
Limitations of traditional methods -> what BHC provides:
No guide to choosing the number of clusters, or where to prune the tree -> natural stopping criterion
Distance metric selection -> model-based criterion
Evaluation, comparison, inference -> probabilistic model (with some useful results for DPMs)
Summary
Defines a probabilistic model of the data, and can compute the probability of a new data point belonging to any cluster in the tree.
Model-based criterion to decide on merging clusters.
Bayesian hypothesis testing is used to decide which merges are advantageous, and to decide the appropriate depth of the tree.
The algorithm can be interpreted as an approximate inference method for a DPM; it gives a new lower bound on the marginal likelihood by summing over exponentially many clusterings of the data.
Limitations
Inherent greediness
Lack of any incorporation of tree uncertainty
O(n²) complexity for building the tree
References
Main paper:
Bayesian Hierarchical Clustering, K. Heller and Z. Ghahramani, ICML 2005
Thesis:
Efficient Bayesian Methods for Clustering, Katherine Ann Heller
Other references:
Wikipedia
Paper slides:
www.ee.duke.edu/~lcarin/emag/.../DW_PD_100705.ppt
http://cs.brown.edu/courses/csci2950-p/fall2011/lectures/2011-10-13_ghosh.pdf
General ML
http://blog.echen.me/
References
Other references (cont'd):
DPM & Nonparametric Bayesian:
http://nlp.stanford.edu/~grenager/papers/dp_2005_02_24.ppt
https://www.cs.cmu.edu/~kbe/dp_tutorial.pdf
http://www.iro.umontreal.ca/~lisa/seminaires/31-10-2006.pdf
http://videolectures.net/mlss07_teh_dp/ , http://mlg.eng.cam.ac.uk/tutorials/07/ywt.pdf
http://www.cns.nyu.edu/~eorhan/notes/dpmm.pdf (Easy to read)
http://mlg.eng.cam.ac.uk/zoubin/talks/uai05tutorial-b.pdf
Heavy text:
http://stat.columbia.edu/~porbanz/reports/OrbanzTeh2010.pdf
http://www.gatsby.ucl.ac.uk/~ywteh/research/npbayes/dp.pdf
http://www.stat.uchicago.edu/~pmcc/reports/clusters.pdf
Hierarchical DPM
http://www.cs.berkeley.edu/~jordan/papers/hdp.pdf
Other methods
https://people.cs.umass.edu/~wallach/publications/wallach10alternative.pdf
Thank You for Your Attention!