
Latent Variables, Mixture Models and EM

Christopher M. Bishop, Microsoft Research, Cambridge
BCS Summer School, Exeter, 2003

Overview
• K-means clustering
• Gaussian mixtures
• Maximum likelihood and EM
• Probabilistic graphical models
• Latent variables: EM revisited
• Bayesian Mixtures of Gaussians
• Variational Inference
• VIBES

Old Faithful

Old Faithful Data Set
[Scatter plot: duration of eruption (minutes) vs. time between eruptions (minutes)]

K-means Algorithm
• Goal: represent a data set in terms of K clusters, each of which is summarized by a prototype
• Initialize prototypes, then iterate between two phases:
  – E-step: assign each data point to the nearest prototype
  – M-step: update prototypes to be the cluster means
• Simplest version is based on Euclidean distance
  – re-scale Old Faithful data
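A minimal NumPy sketch of this two-phase iteration (function and variable names are illustrative; convergence checks and careful initialization are omitted):

```python
import numpy as np

def kmeans(X, K, n_iters=20, rng=np.random.default_rng(0)):
    """Simple K-means on data X of shape (N, D) with K prototypes."""
    X = np.asarray(X, dtype=float)
    # Initialize prototypes to K distinct, randomly chosen data points
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # E-step: assign each data point to its nearest prototype
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        # M-step: move each prototype to the mean of its assigned points
        for k in range(K):
            if np.any(assign == k):
                mu[k] = X[assign == k].mean(axis=0)
    return mu, assign
```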

[Figure slides: successive E-step and M-step iterations of K-means on the re-scaled Old Faithful data]

Responsibilities
• Responsibilities assign data points to clusters, such that each data point is assigned to exactly one cluster
• Example: 5 data points and 3 clusters

K-means Cost Function
• Expressed in terms of the data, the responsibilities and the prototypes (see below)
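In the usual notation, with binary responsibilities $r_{nk}$ indicating which cluster data point $\mathbf{x}_n$ is assigned to, and prototypes $\boldsymbol{\mu}_k$, this cost function can be written

```latex
J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, \lVert \mathbf{x}_n - \boldsymbol{\mu}_k \rVert^2,
\qquad
r_{nk} \in \{0,1\}, \quad \sum_{k=1}^{K} r_{nk} = 1
```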

Minimizing the Cost Function
• E-step: minimize J with respect to the responsibilities
  – assigns each data point to the nearest prototype
• M-step: minimize J with respect to the prototypes
  – each prototype is set to the mean of the points in that cluster
• Convergence is guaranteed since there is a finite number of possible settings for the responsibilities
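A standard statement of the two steps, in the notation above:

```latex
\text{E-step:}\quad
r_{nk} =
\begin{cases}
1 & \text{if } k = \arg\min_{j} \lVert \mathbf{x}_n - \boldsymbol{\mu}_j \rVert^2 \\
0 & \text{otherwise}
\end{cases}
\qquad\qquad
\text{M-step:}\quad
\boldsymbol{\mu}_k = \frac{\sum_{n} r_{nk}\, \mathbf{x}_n}{\sum_{n} r_{nk}}
```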


Limitations of K-means
• Hard assignments of data points to clusters
  – a small shift of a data point can flip it to a different cluster
• Not clear how to choose the value of K
• Solution: replace the ‘hard’ clustering of K-means with ‘soft’ probabilistic assignments
• Represent the probability distribution of the data as a Gaussian mixture model

The Gaussian Distribution
• Multivariate Gaussian, defined by a mean and a covariance
• Define the precision to be the inverse of the covariance
• In 1 dimension, the precision is the inverse of the variance
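In standard notation, with mean $\boldsymbol{\mu}$, covariance $\boldsymbol{\Sigma}$ and precision $\boldsymbol{\Lambda} = \boldsymbol{\Sigma}^{-1}$, the $D$-dimensional Gaussian density is

```latex
\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})
= \frac{1}{(2\pi)^{D/2}\, \lvert \boldsymbol{\Sigma} \rvert^{1/2}}
\exp\!\left\{ -\tfrac{1}{2}\, (\mathbf{x} - \boldsymbol{\mu})^{\mathsf{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right\}
```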

Likelihood Function
• Data set of N observations
• Assume the observed data points are generated independently
• Viewed as a function of the parameters, this is known as the likelihood function

Maximum Likelihood
• Set the parameters by maximizing the likelihood function
• Equivalently, maximize the log likelihood
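For a data set $\mathbf{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$ of independent observations, the likelihood and log likelihood take the standard forms

```latex
p(\mathbf{X} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})
= \prod_{n=1}^{N} \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}),
\qquad
\ln p(\mathbf{X} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})
= -\frac{ND}{2} \ln(2\pi) - \frac{N}{2} \ln \lvert \boldsymbol{\Sigma} \rvert
- \frac{1}{2} \sum_{n=1}^{N} (\mathbf{x}_n - \boldsymbol{\mu})^{\mathsf{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{x}_n - \boldsymbol{\mu})
```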

Maximum Likelihood Solution
• Maximizing w.r.t. the mean gives the sample mean
• Maximizing w.r.t. the covariance gives the sample covariance
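In the same notation, the maximum likelihood estimates are

```latex
\boldsymbol{\mu}_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n,
\qquad
\boldsymbol{\Sigma}_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N}
(\mathbf{x}_n - \boldsymbol{\mu}_{\mathrm{ML}})(\mathbf{x}_n - \boldsymbol{\mu}_{\mathrm{ML}})^{\mathsf{T}}
```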

Bias of Maximum Likelihood
• Consider the expectations of the maximum likelihood estimates under the Gaussian distribution
• The maximum likelihood solution systematically under-estimates the covariance
• This is an example of over-fitting
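With $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ denoting the true parameters, the standard results for these expectations are

```latex
\mathbb{E}\!\left[ \boldsymbol{\mu}_{\mathrm{ML}} \right] = \boldsymbol{\mu},
\qquad
\mathbb{E}\!\left[ \boldsymbol{\Sigma}_{\mathrm{ML}} \right] = \frac{N-1}{N}\, \boldsymbol{\Sigma}
```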

Intuitive Explanation of Over-fitting

Unbiased Variance Estimate
• Clearly we can remove the bias by using the re-scaled estimator shown below, since its expectation equals the true covariance
• Arises naturally in a Bayesian treatment (see later)
• For an infinite data set the two expressions are equal
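The re-scaled (unbiased) estimator can be written

```latex
\widetilde{\boldsymbol{\Sigma}}
= \frac{N}{N-1}\, \boldsymbol{\Sigma}_{\mathrm{ML}}
= \frac{1}{N-1} \sum_{n=1}^{N} (\mathbf{x}_n - \boldsymbol{\mu}_{\mathrm{ML}})(\mathbf{x}_n - \boldsymbol{\mu}_{\mathrm{ML}})^{\mathsf{T}},
\qquad
\mathbb{E}\!\left[ \widetilde{\boldsymbol{\Sigma}} \right] = \boldsymbol{\Sigma}
```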

Gaussian Mixtures
• Linear super-position of Gaussians
• Normalization and positivity impose constraints on the mixing coefficients (see below)
• Can interpret the mixing coefficients as prior probabilities
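In the standard notation, with mixing coefficients $\pi_k$:

```latex
p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),
\qquad
0 \le \pi_k \le 1, \quad \sum_{k=1}^{K} \pi_k = 1
```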

Example: Mixture of 3 Gaussians

Contours of Probability Distribution

Surface Plot

Sampling from the Gaussian
• To generate a data point:
  – first pick one of the components with probability given by its mixing coefficient
  – then draw a sample from that component
• Repeat these two steps for each new data point
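A short NumPy sketch of this ancestral-sampling procedure (parameter names are illustrative):

```python
import numpy as np

def sample_gmm(pi, mus, Sigmas, n_samples, rng=np.random.default_rng(0)):
    """Draw samples from a Gaussian mixture with mixing coefficients pi,
    component means mus[k] and covariances Sigmas[k]."""
    mus = np.asarray(mus, dtype=float)
    X = np.empty((n_samples, mus.shape[1]))
    labels = np.empty(n_samples, dtype=int)
    for n in range(n_samples):
        k = rng.choice(len(pi), p=pi)                      # pick a component
        X[n] = rng.multivariate_normal(mus[k], Sigmas[k])  # sample from it
        labels[n] = k
    return X, labels
```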

Synthetic Data Set

Fitting the Gaussian Mixture
• We wish to invert this process – given the data set, find the corresponding parameters:
  – mixing coefficients
  – means
  – covariances
• If we knew which component generated each data point, the maximum likelihood solution would involve fitting each component to the corresponding cluster
• Problem: the data set is unlabelled
• We shall refer to the labels as latent (= hidden) variables

Synthetic Data Set Without Labels

Posterior Probabilities
• We can think of the mixing coefficients as prior probabilities for the components
• For a given data point we can evaluate the corresponding posterior probabilities, called responsibilities
• These are given by Bayes’ theorem, as shown below
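In the usual notation, the responsibility of component $k$ for a data point $\mathbf{x}$ is

```latex
\gamma_k(\mathbf{x}) \equiv p(k \mid \mathbf{x})
= \frac{\pi_k\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}
       {\sum_{j=1}^{K} \pi_j\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}
```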

Posterior Probabilities (colour coded)

Posterior Probability Map

Maximum Likelihood for the GMM
• The log likelihood function takes the form shown below
• Note: the sum over components appears inside the log
• There is no closed-form solution for maximum likelihood
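For a data set $\mathbf{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$ this is

```latex
\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma})
= \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}
```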

Over-fitting in Gaussian Mixture Models
• Singularities in the likelihood function when a component ‘collapses’ onto a data point: the likelihood grows without bound as that component’s variance shrinks to zero
• The likelihood function gets larger as we add more components (and hence parameters) to the model
  – not clear how to choose the number K of components

Problems and Solutions
• How to maximize the log likelihood
  – solved by the expectation-maximization (EM) algorithm
• How to avoid singularities in the likelihood function
  – solved by a Bayesian treatment
• How to choose the number K of components
  – also solved by a Bayesian treatment

EM Algorithm – Informal Derivation
• Let us proceed by simply differentiating the log likelihood
• Setting the derivative with respect to a component mean equal to zero gives an update in which each mean is a weighted mean of the data, with the responsibilities as weights (see below)
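The standard form of this result, writing $\gamma(z_{nk})$ for the responsibility of component $k$ for data point $n$:

```latex
\boldsymbol{\mu}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, \mathbf{x}_n,
\qquad
N_k = \sum_{n=1}^{N} \gamma(z_{nk})
```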

EM Algorithm – Informal Derivation
• Similarly for the covariances
• For the mixing coefficients, use a Lagrange multiplier to enforce the sum-to-one constraint (results below)
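In the same notation, the standard results are

```latex
\boldsymbol{\Sigma}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\,
(\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^{\mathsf{T}},
\qquad
\pi_k = \frac{N_k}{N}
```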

EM Algorithm – Informal Derivation
• The solutions are not closed form since they are coupled
• This suggests an iterative scheme for solving them:
  – make initial guesses for the parameters
  – alternate between the following two stages:
    1. E-step: evaluate the responsibilities
    2. M-step: update the parameters using the maximum likelihood results
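A compact NumPy/SciPy sketch of this iterative scheme (a minimal illustration with no safeguards against the singularities discussed earlier; names are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=50, rng=np.random.default_rng(0)):
    """Fit a K-component Gaussian mixture to X (shape (N, D)) by EM."""
    N, D = X.shape
    # Initial guesses: equal weights, random means, shared data covariance
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)].astype(float)
    Sigma = np.array([np.cov(X, rowvar=False) for _ in range(K)])

    for _ in range(n_iters):
        # E-step: evaluate responsibilities gamma[n, k]
        dens = np.column_stack([
            pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
            for k in range(K)
        ])
        gamma = dens / dens.sum(axis=1, keepdims=True)

        # M-step: re-estimate parameters from the responsibilities
        Nk = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
        pi = Nk / N

    return pi, mu, Sigma, gamma
```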

[Figure slides: successive EM iterations of the Gaussian mixture fit]

Digression: Probabilistic Graphical Models
• Graphical representation of a probabilistic model
• Each variable corresponds to a node in the graph
• Links in the graph denote relations between variables
• Motivation:
  – visualization of models and motivation for new models
  – graphical determination of conditional independence
  – complex calculations (inference) performed using graphical operations (e.g. forward-backward for HMMs)
• Here we consider directed graphs

Example: 3 Variables
• General distribution over 3 variables
• Apply the product rule of probability twice
• Express the result as a directed graph
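For three variables $a$, $b$, $c$ (names chosen for illustration), the two applications of the product rule give

```latex
p(a, b, c) = p(c \mid a, b)\, p(b \mid a)\, p(a)
```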

General Decomposition Formula
• The joint distribution is a product of conditionals, each conditioned on its parent nodes
• Example: a particular directed graph and its corresponding factorization
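In general, for a directed graph over variables $x_1, \dots, x_K$, with $\mathrm{pa}_k$ denoting the parents of node $k$:

```latex
p(x_1, \dots, x_K) = \prod_{k=1}^{K} p\!\left(x_k \mid \mathrm{pa}_k\right)
```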

EM – Latent Variable Viewpoint
• Binary latent variables describing which component generated each data point
• Conditional distribution of the observed variable given the latent variables
• Prior distribution of the latent variables
• Marginalizing over the latent variables recovers the Gaussian mixture distribution
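In the standard 1-of-K notation, with $\mathbf{z} = (z_1, \dots, z_K)$ binary and summing to one, these distributions can be written

```latex
p(\mathbf{z}) = \prod_{k=1}^{K} \pi_k^{z_k},
\qquad
p(\mathbf{x} \mid \mathbf{z}) = \prod_{k=1}^{K} \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)^{z_k},
\qquad
p(\mathbf{x}) = \sum_{\mathbf{z}} p(\mathbf{z})\, p(\mathbf{x} \mid \mathbf{z})
= \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)
```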

Expected Value of Latent Variable
• From Bayes’ theorem, the expected value of each latent variable is the corresponding responsibility

Graphical Representation of GMM

Complete and Incomplete Data
[Figure: the data set shown with component labels (complete) and without labels (incomplete)]

Graph for Complete-Data Model

Latent Variable View of EM
• If we knew the values of the latent variables, we would maximize the complete-data log likelihood, which gives a trivial closed-form solution (fit each component to the corresponding set of data points)
• We don’t know the values of the latent variables
• However, for given parameter values we can compute the expected values of the latent variables

Expected Complete-Data Log Likelihood
• Suppose we make a guess for the parameter values (means, covariances and mixing coefficients)
• Use these to evaluate the responsibilities
• Consider the expected complete-data log likelihood, where the responsibilities are computed using the guessed parameter values
• We are implicitly ‘filling in’ the latent variables with our best guess
• Keeping the responsibilities fixed and maximizing with respect to the parameters gives the previous results
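A standard way to write this quantity, with responsibilities $\gamma(z_{nk})$ evaluated at the guessed parameter values:

```latex
\mathbb{E}_{\mathbf{Z}}\!\left[ \ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\pi}) \right]
= \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk})
\left\{ \ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}
```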

K-means Revisited
• Consider a GMM in which the components share a common, isotropic covariance
• Take the limit in which this common variance goes to zero
• The responsibilities become binary
• The expected complete-data log likelihood reduces to the (negative) K-means cost function
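A sketch of this limit, assuming shared covariances $\boldsymbol{\Sigma}_k = \epsilon \mathbf{I}$ and letting $\epsilon \to 0$:

```latex
\gamma(z_{nk}) \;\to\; r_{nk} \in \{0, 1\},
\qquad
\mathbb{E}_{\mathbf{Z}}\!\left[ \ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\pi}) \right]
\;\to\;
-\frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, \lVert \mathbf{x}_n - \boldsymbol{\mu}_k \rVert^2 + \text{const}
```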

EM in General
• Consider an arbitrary distribution q over the latent variables
• The following decomposition of the log likelihood always holds (see below)
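The decomposition, in the usual notation with latent variables $\mathbf{Z}$ and parameters $\boldsymbol{\theta}$:

```latex
\ln p(\mathbf{X} \mid \boldsymbol{\theta}) = \mathcal{L}(q, \boldsymbol{\theta}) + \mathrm{KL}(q \,\|\, p),
\qquad
\mathcal{L}(q, \boldsymbol{\theta}) = \sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \frac{p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})}{q(\mathbf{Z})},
\qquad
\mathrm{KL}(q \,\|\, p) = -\sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \frac{p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta})}{q(\mathbf{Z})}
```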

Decomposition

Optimizing the Bound
• E-step: maximize the bound with respect to q
  – equivalent to minimizing the KL divergence
  – sets q equal to the posterior distribution over the latent variables
• M-step: maximize the bound with respect to the parameters
  – equivalent to maximizing the expected complete-data log likelihood
• Each EM cycle must increase the incomplete-data likelihood unless it is already at a (local) maximum

E-step

M-step

Bayesian Inference
• Include prior distributions over the parameters
• Advantages in using conjugate priors
• Example: consider a single Gaussian over one variable
  – assume the variance is known and the mean is unknown
  – likelihood function for the mean
• Choose a Gaussian prior for the mean

Bayesian Inference for a Gaussian
• The posterior (proportional to the product of prior and likelihood) will then also be Gaussian, with mean and variance as given below
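Written out under common conventions, with known likelihood variance $\sigma^2$, prior $\mathcal{N}(\mu \mid \mu_0, \sigma_0^2)$, and $\mu_{\mathrm{ML}}$ the sample mean of the $N$ observations:

```latex
p(\mu \mid \mathbf{X}) = \mathcal{N}(\mu \mid \mu_N, \sigma_N^2),
\qquad
\mu_N = \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\, \mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\, \mu_{\mathrm{ML}},
\qquad
\frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}
```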

Bayesian Inference for a Gaussian

Bayesian Mixture of Gaussians
• Conjugate priors for the parameters:
  – Dirichlet prior for the mixing coefficients
  – Normal-Wishart prior for the means and precisions (the Wishart distribution is shown below)
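In one common parameterization (normalizing constants omitted for brevity):

```latex
\mathrm{Dir}(\boldsymbol{\pi} \mid \alpha_0) \propto \prod_{k=1}^{K} \pi_k^{\alpha_0 - 1},
\qquad
p(\boldsymbol{\mu}_k, \boldsymbol{\Lambda}_k)
= \mathcal{N}\!\left(\boldsymbol{\mu}_k \mid \mathbf{m}_0, (\beta_0 \boldsymbol{\Lambda}_k)^{-1}\right)
\mathcal{W}(\boldsymbol{\Lambda}_k \mid \mathbf{W}_0, \nu_0),
\qquad
\mathcal{W}(\boldsymbol{\Lambda} \mid \mathbf{W}, \nu)
\propto \lvert \boldsymbol{\Lambda} \rvert^{(\nu - D - 1)/2}
\exp\!\left\{ -\tfrac{1}{2} \operatorname{Tr}\!\left(\mathbf{W}^{-1} \boldsymbol{\Lambda}\right) \right\}
```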

Graphical Representation
• Parameters and latent variables appear on equal footing

Variational Inference
• As with many Bayesian models, exact inference for the mixture of Gaussians is intractable
• Approximate Bayesian inference has traditionally been based on Laplace’s method (a local Gaussian approximation to the posterior) or on Markov chain Monte Carlo
• Variational inference is an alternative, broadly applicable, deterministic approximation scheme

General View of Variational Inference
• Consider again the previous decomposition, but where the posterior is now over all latent variables and parameters
• Maximizing the bound over q would give the true posterior distribution – but this is intractable by definition

Factorized Approximation
• Goal: choose a family of distributions which are:
  – sufficiently flexible to give a good posterior approximation
  – sufficiently simple to remain tractable
• Here we consider factorized distributions
• No further assumptions are required!
• Optimal solution for one factor, keeping the remainder fixed, is given below
• The solutions are coupled, so initialize and then cyclically update
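The standard form of this result, for a factorization $q(\mathbf{Z}) = \prod_i q_i(\mathbf{Z}_i)$ over groups of variables $\mathbf{Z}_i$ (here $\mathbf{Z}$ collects all latent variables and parameters):

```latex
\ln q_j^{\ast}(\mathbf{Z}_j) = \mathbb{E}_{i \neq j}\!\left[ \ln p(\mathbf{X}, \mathbf{Z}) \right] + \text{const}
```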

Lower Bound
• The lower bound can also be evaluated explicitly
• Useful for verifying the mathematics and the code
• Also useful for model comparison


Illustration: Univariate Gaussian
• Likelihood function
• Conjugate priors
• Factorized variational distribution

Variational Posterior Distribution
• The optimal factor over the mean is a Gaussian and the optimal factor over the precision is a Gamma distribution, with coupled parameter updates

Initial Configuration

After Updating

After Updating

Converged Solution

Exact Solution
• For this very simple example there is an exact solution
• The expected precision has a closed-form expression
• Compare with the earlier maximum likelihood solution

Variational Mixture of Gaussians
• Assume a factorized posterior distribution
• This gives an optimal solution in which the factor over the mixing coefficients is a Dirichlet distribution, and the factor over the means and precisions is a Normal-Wishart distribution

Sufficient Statistics
• Small computational overhead compared to maximum likelihood EM

Variational Equations for GMM

Bound vs. K for Old Faithful Data

Bayesian Model Complexity

Sparse Bayes for Gaussian Mixture
• Instead of comparing different values of K, start with a large value and prune out excess components
• Treat the mixing coefficients as parameters and maximize the marginal likelihood [Corduneanu + Bishop, AIStats ’01]
• Gives simple re-estimation equations for the mixing coefficients
  – interleave these with the variational updates


General Variational Framework
• Currently, for each new model we must:
  – derive the variational update equations
  – write application-specific code to find the solution
• Both stages are time consuming and error prone
• Can we build a general-purpose inference engine which automates these procedures?


Lower Bound for GMM

VIBES
• Variational Inference for Bayesian Networks
• Bishop and Winn (1999)
• A general inference engine using variational methods
• Models specified graphically

Example: Mixtures of Bayesian PCA

Solution

Local Computation in VIBES
• A key observation is that, in the general solution, the update for a particular node (or group of nodes) depends only on the other nodes in its Markov blanket
• Permits a local, object-oriented implementation

Shared Hyper-parameters

Take-home Messages
• Bayesian mixture of Gaussians
  – no singularities
  – determines the optimal number of components
• Variational inference
  – effective solution for the Bayesian GMM
  – optimizes a rigorous bound
  – little computational overhead compared to EM
• VIBES
  – rapid prototyping of probabilistic models
  – graphical specification

Viewgraphs, tutorials and publications available from: http://research.microsoft.com/~cmbishop