Bayesian Hierarchical
Clustering
Paper by K. Heller and Z. Ghahramani
ICML 2005
Presented by HAO-WEI, YEH
Outline
 Background - Traditional Methods
 Bayesian Hierarchical Clustering (BHC)
 Basic ideas
 Dirichlet Process Mixture Model (DPM)
 Algorithm
 Experiment results
 Conclusion
Background
Traditional Method
Hierarchical Clustering
 Given : data points
 Output: a tree (series of clusters)
 Leaves : data points
 Internal nodes : nested clusters
 Examples
 Evolutionary tree of living organisms
 Internet newsgroups
 Newswire documents
Traditional Hierarchical Clustering
 Bottom-up agglomerative algorithm
 Closeness based on given distance measure
(e.g. Euclidean distance between cluster means)
Traditional Hierarchical Clustering (cont’d)
 Limitations
 No guide to choosing the correct number of clusters, or where to prune the tree.
 Distance metric selection (especially for data such as images or sequences)
 Evaluation (Probabilistic model)
 How to evaluate how good a result is?
 How to compare it to other models?
 How to make predictions and cluster new data with an existing hierarchy?
BHC
Bayesian Hierarchical Clustering
Bayesian Hierarchical Clustering
 Basic ideas:
 Use marginal likelihoods to decide which clusters to merge
 P(Data to merge were from the same mixture component)
vs. P(Data to merge were from different mixture components)
 Generative Model : Dirichlet Process Mixture Model (DPM)
Dirichlet Process Mixture Model (DPM)
 Formal Definition
 Different Perspectives
 Infinite version of Mixture Model (Motivation and Problems)
 Stick-breaking Process (what the generated distribution looks like)
 Chinese Restaurant Process, Polya urn scheme
 Benefits
 Conjugate prior
 Unlimited clusters
 “Rich-get-richer” behavior: does it really work? It depends!
 Pitman-Yor process, Uniform Process, …
BHC Algorithm - Overview
 Same as traditional
 One-pass, bottom-up method
 Initializes each data point in its own cluster, and iteratively merges pairs of clusters.
 Difference
 Uses a statistical hypothesis test to choose which clusters to merge.
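A minimal structural sketch of that one-pass loop, in Python (my own illustration; `merge_score` is a placeholder for the Bayesian hypothesis test defined on the following slides, not a function from the paper):

```python
from itertools import combinations

# Skeleton of the one-pass, bottom-up BHC loop: start with singleton clusters
# and repeatedly merge the highest-scoring pair until a single tree remains.
def bhc_skeleton(points, merge_score):
    trees = [("leaf", x) for x in points]           # each point in its own cluster
    while len(trees) > 1:
        # score every candidate pair with the hypothesis test (later slides)
        i, j = max(combinations(range(len(trees)), 2),
                   key=lambda ij: merge_score(trees[ij[0]], trees[ij[1]]))
        merged = ("node", trees[i], trees[j])       # record the merge in the tree
        trees = [t for k, t in enumerate(trees) if k not in (i, j)] + [merged]
    return trees[0]
```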
BHC Algorithm - Concepts
Two hypotheses to compare
1. All data was generated i.i.d. from the same probabilistic model with unknown parameters.
2. The data has two or more clusters in it.
Hypothesis H1
Probability of the data under H1:
p(Dk | H1) = ∫ p(Dk | θ) p(θ | β) dθ
 p(θ | β) : prior over the parameters θ, with hyperparameters β
 Dk : data in the two trees to be merged
 Integral is tractable with a conjugate prior
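For example (an illustration I am adding, not taken from the paper): with binary data, a Beta prior on a Bernoulli parameter makes the integral above a closed-form Beta-Bernoulli marginal.

```python
from math import lgamma, exp

def log_beta(a, b):
    # log of the Beta function B(a, b)
    return lgamma(a) + lgamma(b) - lgamma(a + b)

# Marginal likelihood p(Dk | H1) for one binary feature: integrate the
# Bernoulli parameter theta out against a Beta(a, b) prior (a, b play the
# role of the hyperparameters beta on this slide).
def bernoulli_marginal(data, a=1.0, b=1.0):
    heads = sum(data)
    tails = len(data) - heads
    return exp(log_beta(a + heads, b + tails) - log_beta(a, b))

print(bernoulli_marginal([1, 1, 1, 0]))   # p(Dk | H1) for Dk = {1, 1, 1, 0}
```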
Hypothesis H2
Probability of the data under H2:
p(Dk | H2) = p(Di | Ti) · p(Dj | Tj)
 Product over the two sub-trees Ti and Tj (each sub-tree is clustered independently)
BHC Algorithm - Working Flow
 From Bayes' rule, the posterior probability of the merged hypothesis:
rk = πk p(Dk | H1) / p(Dk | Tk), with p(Dk | Tk) = πk p(Dk | H1) + (1 − πk) p(Di | Ti) p(Dj | Tj)
 πk depends on the number of data points and the DPM concentration; p(Dk | H1) reflects the hidden features (the underlying distribution)
 The pair of trees with the highest rk is merged.
 Natural place to cut the final tree: nodes where rk < 0.5.
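Putting those pieces together as code (a sketch in my own notation; πk, the subtree marginals, and p(Dk | H1) are assumed to be computed as on the neighbouring slides):

```python
# Posterior probability of the merged hypothesis for a candidate pair (i, j).
# p_h1       : p(Dk | H1), marginal likelihood with all points in one cluster
# p_tree_i/j : p(Di | Ti), p(Dj | Tj), marginals of the two subtrees
# pi_k       : DPM-based prior on the merge (see the next slides)
def merge_posterior(pi_k, p_h1, p_tree_i, p_tree_j):
    p_tree_k = pi_k * p_h1 + (1 - pi_k) * p_tree_i * p_tree_j   # p(Dk | Tk)
    r_k = pi_k * p_h1 / p_tree_k
    return r_k, p_tree_k

# The pair with the highest r_k is merged; nodes with r_k < 0.5 mark the
# natural places to cut the finished tree.
```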
Tree-Consistent Partitions
 Consider the example tree (((1 2) 3) 4) and all 15 possible partitions of {1,2,3,4}:
(1)(2)(3)(4), (1 2)(3)(4), (1 3)(2)(4), (1 4)(2)(3), (2 3)(1)(4),
(2 4)(1)(3), (3 4)(1)(2), (1 2)(3 4), (1 3)(2 4), (1 4)(2 3),
(1 2 3)(4), (1 2 4)(3), (1 3 4)(2), (2 3 4)(1), (1 2 3 4)
 (1 2) (3) (4) and (1 2 3) (4) are tree-consistent partitions.
 (1)(2 3)(4) and (1 3)(2 4) are not tree-consistent partitions.
Merged Hypothesis Prior (πk)
 Based on DPM (CRP perspective)
 πk = P(all points in Dk belong to one cluster), relative to all other tree-consistent partitions under the DPM prior
 The d terms recursively accumulate the prior mass of all tree-consistent partitions (see the sketch below)
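A minimal sketch of that recursion, assuming the paper's CRP-based prior: leaves are initialized with d = α and π = 1, and an internal node k with nk points and children i, j uses dk = α·Γ(nk) + di·dj and πk = α·Γ(nk) / dk. The code itself is mine.

```python
from math import gamma

def merge_prior(alpha, n_k, d_i, d_j):
    # Bottom-up recursion for the merge prior pi_k of an internal node k
    # with n_k data points and children carrying d_i, d_j.
    d_k = alpha * gamma(n_k) + d_i * d_j
    pi_k = alpha * gamma(n_k) / d_k
    return pi_k, d_k

# Leaves start with d = alpha.  Merging two singleton leaves (n_k = 2):
pi_k, d_k = merge_prior(alpha=1.0, n_k=2, d_i=1.0, d_j=1.0)
print(pi_k, d_k)   # pi_k = 0.5: with alpha = 1, "one cluster" and
                   # "two clusters" are equally likely a priori (CRP)
```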
Predictive Distribution
 BHC allows us to define predictive distributions for new data points.
 Note: even at the root, p(x | D) is not the same as the single-cluster predictive p(x | Dk, H1); the prediction mixes over the nodes of the tree (see the sketch below).
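A hedged sketch of what this means (my paraphrase of the paper's predictive distribution, in my own notation): each node contributes its cluster's posterior predictive, weighted by the probability that the merge was accepted at that node but rejected at every ancestor.

```latex
% Sketch: predictive density for a new point x given the whole tree.
p(x \mid D) \;=\; \sum_{k \in \text{nodes}} \omega_k \, p(x \mid D_k, H_1^k),
\qquad
\omega_k \;=\; r_k \prod_{a \in \mathrm{ancestors}(k)} (1 - r_a)
```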
Approximate Inference for DPM prior
 BHC forms a lower bound for the marginal likelihood of an infinite mixture model by
efficiently summing over an exponentially large subset of all partitions.
 Idea: deterministically sum over the partitions with high probability, thereby accounting for
most of the mass.
 Compared to MCMC methods, this is deterministic and more efficient.
Learning Hyperparameters
 α : Concentration parameter
 β : hyperparameters defining the base distribution G0
 Learned by recursively computed gradients and an EM-like method
To Sum Up for BHC
 Statistical model for comparing clusterings that also decides when to stop merging.
 Allows defining predictive distributions for new data points.
 Approximate inference for the DPM marginal likelihood.
 Parameters
 α : Concentration parameter
 β : hyperparameters defining the base distribution G0
Unique Aspects of BHC Algorithm
 Hierarchical way of organizing nested clusters, not a hierarchical generative model.
 Derived from DPM.
 Hypothesis test: one cluster vs. many other (tree-consistent) clusterings
(compared with one vs. two clusters at each stage in traditional methods)
 Not iterative and does not require sampling (except for learning the hyperparameters).
Results
from the experiments
Conclusion
and some take home notes
Conclusion
 Limitations of traditional methods, and how BHC addresses them:
 No guide to choosing the correct number of clusters, or where to prune the tree
-> natural stopping criterion
 Distance metric selection
-> model-based criterion (marginal likelihoods under a probabilistic model)
 Evaluation, comparison, and inference
-> probabilistic model (plus some useful results for the DPM)
Summary
 Defines a probabilistic model of the data, from which we can compute the probability of a new
data point belonging to any cluster in the tree.
 Model-based criterion to decide on merging clusters.
 Bayesian hypothesis testing used to decide which merges are
advantageous, and to decide appropriate depth of tree.
 Algorithm can be interpreted as approximate inference method for a
DPM; gives new lower bound on marginal likelihood by summing over
exponentially many clusterings of the data.
Limitations
 Inherent greediness
 Lack of any incorporation of tree uncertainty
 O(n²) complexity for building the tree
References
 Main paper:
 Bayesian Hierarchical Clustering, K. Heller and Z. Ghahramani, ICML 2005
 Thesis:
 Efficient Bayesian Methods for Clustering, Katherine Ann Heller
 Other references:
 Wikipedia
 Paper Slides
 www.ee.duke.edu/~lcarin/emag/.../DW_PD_100705.ppt
 http://cs.brown.edu/courses/csci2950-p/fall2011/lectures/2011-10-13_ghosh.pdf
 General ML
 http://blog.echen.me/
References
 Other references (cont’d)
 DPM & Nonparametric Bayesian :
 http://nlp.stanford.edu/~grenager/papers/dp_2005_02_24.ppt
 https://www.cs.cmu.edu/~kbe/dp_tutorial.pdf
 http://www.iro.umontreal.ca/~lisa/seminaires/31-10-2006.pdf
 http://videolectures.net/mlss07_teh_dp/ , http://mlg.eng.cam.ac.uk/tutorials/07/ywt.pdf
 http://www.cns.nyu.edu/~eorhan/notes/dpmm.pdf (Easy to read)
 http://mlg.eng.cam.ac.uk/zoubin/talks/uai05tutorial-b.pdf
 Heavy text:
 http://stat.columbia.edu/~porbanz/reports/OrbanzTeh2010.pdf
 http://www.gatsby.ucl.ac.uk/~ywteh/research/npbayes/dp.pdf
 http://www.stat.uchicago.edu/~pmcc/reports/clusters.pdf
 Hierarchical DPM
 http://www.cs.berkeley.edu/~jordan/papers/hdp.pdf
 Other methods
 https://people.cs.umass.edu/~wallach/publications/wallach10alternative.pdf
Thank You for Your Attention!
