Stochastic Gradient VB
and the Variational Auto-Encoder
Durk Kingma
Ph.D. Candidate (2nd year) advised by Max Welling
Kingma, Diederik P., and Max Welling. “Stochastic Gradient VB and the Variational Auto-Encoder.” (arXiv)
Quite similar:
Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. “Stochastic Back-propagation and Variational Inference in Deep Latent Gaussian Models.” (arXiv)
Contents
● Stochastic Variational Inference and learning
  – SGVB algorithm
● Variational auto-encoder
  – Experiments
● Reparameterizations
  – Effect on posterior correlations
Problems
General setup
● Setup:
  – x : observed variables
  – z : unobserved/latent variables
  – θ : model parameters
  – pθ(x,z) : joint PDF
    ● Factorized, differentiable
    ● Factors can be anything, e.g. neural nets
● Example: (the slide's example is an image; a sketch follows below)
● We want:
  – Fast approximate posterior inference p(z|x)
  – Learn the parameters θ (e.g. MAP estimate)
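The example on the slide is missing from this transcript. As a hedged illustration (not necessarily the example shown), a typical factorized joint of this kind with neural-net factors is:

```latex
p_\theta(x, z) = p_\theta(z)\, p_\theta(x \mid z),
\qquad
p_\theta(z) = \mathcal{N}(z;\, 0, I),
\qquad
p_\theta(x \mid z) = \mathcal{N}\!\big(x;\; \mu_\theta(z),\; \sigma_\theta^2(z)\, I\big),
```

where μθ(·) and σθ(·) are outputs of a neural network with parameters θ.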
Example
Learning
– Regular EM requires a tractable pθ(z|x)
– Monte Carlo EM (MCEM) requires sampling from the posterior (slow...)
– Mean-field VB requires closed-form solutions to certain expectations of the joint PDF
Naive pure MAP optimization approach
Overfits when the latent space is high-dimensional
Novel approach: Stochastic Gradient VB
● Optimizes a lower bound of the marginal likelihood of the data
● Scales to very large datasets
● Scales to high-dimensional latent space
● Simple
● Fast!
● Applies to almost any normalized model with continuous latent variables
The Variational Bound
● We introduce the variational approximation qφ(z|x):
  – The distribution can be almost anything (we use a Gaussian)
  – It will approximate the true (but intractable) posterior
● The marginal likelihood can be written as a KL divergence plus a variational lower bound (see the reconstruction below)
● This bound is exactly what we want to optimize! (w.r.t. φ and θ)
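The equation itself is an image on the slide; a reconstruction of the standard decomposition used in the paper, assuming the usual notation, is:

```latex
\log p_\theta(x)
  = D_{KL}\!\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big)
  + \mathcal{L}(\theta, \phi; x),
\qquad
\mathcal{L}(\theta, \phi; x)
  = \mathbb{E}_{q_\phi(z \mid x)}\!\big[\log p_\theta(x, z) - \log q_\phi(z \mid x)\big]
  \;\le\; \log p_\theta(x).
```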
“Naive” Monte Carlo estimator of the bound
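The estimator is shown as an equation image on the slide; presumably it is the standard Monte Carlo estimate of the bound above, roughly:

```latex
\mathcal{L}(\theta, \phi; x)
  \simeq \frac{1}{L} \sum_{l=1}^{L}
      \Big[ \log p_\theta\big(x, z^{(l)}\big) - \log q_\phi\big(z^{(l)} \mid x\big) \Big],
\qquad z^{(l)} \sim q_\phi(z \mid x).
```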
Problem: not appropriate for differentiation w.r.t. φ!
(Cannot differentiate through sampling process).
Recently proposed solutions (2013–2014):
– Michael Jordan / David Blei (very high variance)
– Tim Salimans (2013): only applies to exponential-family q
– Rajesh Ranganath et al., “Black Box Variational Inference”, arXiv 2014
– Andriy Mnih & Karol Gregor, “Neural Variational Inference and Learning”, arXiv 2014
Key “reparameterization trick”
Alternative way of sampling from qφ(z):
1. Choose some ε ~ p(ε) (independent of φ!)
2. Choose some z = g(φ, ε)
Such that z ~ qφ(z) (the correct distribution)
Examples (columns: qφ(z), p(ε), g(φ, ε)):
● Normal distribution: z ~ N(μ,σ);  ε ~ N(0,1);  z = μ + σ·ε
  – Also other location-scale families: Laplace, Elliptical, Student’s t, Logistic, Uniform, Triangular, ...
● Exponential: z ~ Exp(λ);  ε ~ U(0,1);  z = −log(1 − ε)/λ
  – Also any distribution with an invertible CDF: Cauchy, Logistic, Rayleigh, Pareto, Weibull, Reciprocal, Gompertz, Gumbel, Erlang, ...
● Other: z ~ logN(μ,σ);  ε ~ N(0,1);  z = exp(μ + σ·ε)
  – Also: Gamma, Dirichlet, Beta, Chi-squared, and F distributions
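As a quick sanity check of the Normal row (a minimal numpy sketch, not from the slides), drawing ε ~ N(0,1) and setting z = μ + σ·ε indeed yields z ~ N(μ, σ²):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 0.5                 # parameters phi of q_phi(z) = N(mu, sigma^2)

eps = rng.standard_normal(100_000)   # eps ~ N(0, 1), independent of phi
z = mu + sigma * eps                 # z = g(phi, eps)

print(z.mean(), z.std())             # close to 2.0 and 0.5, i.e. z ~ N(mu, sigma^2)
```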
SGVB estimator
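The estimator is again an equation image on the slide; a reconstruction from the paper, obtained by plugging the reparameterization z = g(φ, ε) into the bound, is roughly:

```latex
\widetilde{\mathcal{L}}(\theta, \phi; x)
  = \frac{1}{L} \sum_{l=1}^{L}
      \Big[ \log p_\theta\big(x, g(\phi, \varepsilon^{(l)})\big)
            - \log q_\phi\big(g(\phi, \varepsilon^{(l)}) \mid x\big) \Big],
\qquad \varepsilon^{(l)} \sim p(\varepsilon).
```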
Really simple and appropriate for differentiation w.r.t. φ and θ!
Basic SGVB Algorithm (L=1)
repeat
  [the update steps are an image on the slide; see the sketch below]
until convergence
(Torch7 / Theano)
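As a hedged, self-contained illustration of this loop (not the authors' code), the following numpy example fits qφ(z) = N(μ, σ²) to the posterior of a one-dimensional toy model p(z) = N(0,1), p(x|z) = N(x; z, 1), using single-sample (L=1) reparameterized gradients derived by hand. The exact posterior is N(x/2, 1/2), so μ should approach x/2 and σ should approach √(1/2) ≈ 0.707.

```python
import numpy as np

rng = np.random.default_rng(1)

x = 3.0                    # a single observed data point
mu, log_sigma = 0.0, 0.0   # variational parameters phi = (mu, log sigma)
lr = 0.01

for step in range(20_000):
    sigma = np.exp(log_sigma)
    eps = rng.standard_normal()        # eps ~ N(0, 1)
    z = mu + sigma * eps               # reparameterized sample z ~ q_phi(z)

    # Single-sample (L = 1) gradient of the bound
    #   E[ log p(x|z) + log p(z) - log q_phi(z) ]
    # for p(z) = N(0,1), p(x|z) = N(x; z, 1), q_phi(z) = N(mu, sigma^2),
    # obtained via the chain rule through z = mu + sigma * eps.
    dz = (x - z) - z                          # d/dz [log p(x|z) + log p(z)]
    grad_mu = dz                              # dz/dmu = 1
    grad_log_sigma = dz * eps * sigma + 1.0   # dz/dlog_sigma = eps*sigma; +1 from the entropy term

    mu += lr * grad_mu                 # stochastic gradient *ascent* on the bound
    log_sigma += lr * grad_log_sigma

print(mu, np.exp(log_sigma))   # should be close to x/2 = 1.5 and sqrt(1/2) ≈ 0.707
```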
Example: isotropic Gaussian q
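The slide's equations are images; a reconstruction for the isotropic Gaussian case used in the paper (posterior approximation, its reparameterization, and the closed-form regularization term) is:

```latex
q_\phi(z \mid x) = \mathcal{N}\big(z;\, \mu, \sigma^2 I\big),
\qquad
z = \mu + \sigma \odot \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, I),
\qquad
-D_{KL}\!\big(q_\phi(z \mid x) \,\|\, \mathcal{N}(0, I)\big)
  = \tfrac{1}{2} \sum_{j=1}^{J} \big(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\big).
```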
“Auto-Encoding” VB: an efficient on-line version of SGVB
● Special case of SGVB: a large i.i.d. dataset (large N)
  => many variational parameters to learn
● Solution:
  – Use a conditional qφ(z|x) (a neural network)
  – Avoid local parameters!
  – Doubly stochastic optimization (see the estimator sketched below)
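A reconstruction of the doubly stochastic (minibatch) estimator from the paper: both the data and the noise ε are subsampled, so for a minibatch of M points drawn from the N-point dataset,

```latex
\mathcal{L}(\theta, \phi; X)
  \simeq \frac{N}{M} \sum_{i=1}^{M} \widetilde{\mathcal{L}}\big(\theta, \phi;\, x^{(i)}\big),
\qquad \{x^{(i)}\}_{i=1}^{M} \ \text{a random minibatch of the } N \text{ data points},
```

so each update touches only M data points and one (or a few) noise samples per point.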
“Auto-encoding” Stochastic VB (L=1)
repeat
  [the minibatch update steps are an image on the slide]
until convergence
Scales to very large datasets!
Experiments with the “variational auto-encoder”
[Architecture diagram: a posterior approximation q(z|x) (neural net), x → h1 → z, with z = g(φ, ε, x) and ε ~ p(ε); and a generative model p(x|z) (neural net), z → h2 → x.]
Objective: (noisy) negative reconstruction error + regularization term
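As a hedged, self-contained numpy sketch of this architecture (not the authors' Torch7/Theano implementation; the layer sizes and weight names are made up for illustration), the following computes the single-sample objective, i.e. the (noisy) negative reconstruction error plus the regularization term, for a minibatch of binary data. Training would additionally require gradients, e.g. via Theano.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_layer(x, W, b, act=np.tanh):
    return act(x @ W + b)

D, H, Z = 784, 200, 20                 # data dim (e.g. binarized MNIST), hidden units, latent dim

# Randomly initialized weights (theta for the decoder, phi for the encoder).
params = {name: rng.normal(0, 0.01, shape) for name, shape in {
    "W_xh": (D, H), "b_h": (H,),       # encoder: x -> h1
    "W_hm": (H, Z), "b_m": (Z,),       # h1 -> mu
    "W_hs": (H, Z), "b_s": (Z,),       # h1 -> log sigma^2
    "W_zh": (Z, H), "b_h2": (H,),      # decoder: z -> h2
    "W_hy": (H, D), "b_y": (D,),       # h2 -> Bernoulli means for x
}.items()}

def bound_estimate(x, p):
    """Single-sample (L=1) estimate of the variational bound for a minibatch x of shape [M, D]."""
    # Posterior approximation q(z|x): a neural net outputs mu and log sigma^2.
    h1 = mlp_layer(x, p["W_xh"], p["b_h"])
    mu = h1 @ p["W_hm"] + p["b_m"]
    log_var = h1 @ p["W_hs"] + p["b_s"]

    # Reparameterization: z = g(phi, eps, x) with eps ~ N(0, I).
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * log_var) * eps

    # Generative model p(x|z): a neural net outputs Bernoulli means.
    h2 = mlp_layer(z, p["W_zh"], p["b_h2"])
    y = 1.0 / (1.0 + np.exp(-(h2 @ p["W_hy"] + p["b_y"])))

    # (noisy) negative reconstruction error: log p(x|z) under Bernoulli outputs.
    log_px_z = np.sum(x * np.log(y + 1e-9) + (1 - x) * np.log(1 - y + 1e-9), axis=1)
    # regularization term: -KL(q(z|x) || N(0, I)), closed form for Gaussian q.
    neg_kl = 0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var), axis=1)

    return np.mean(log_px_z + neg_kl)

x_batch = (rng.random((100, D)) < 0.1).astype(float)   # stand-in for binarized images
print(bound_estimate(x_batch, params))
```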
Experiments
Results: Marginal likelihood lower bound
Results: Marginal log-likelihood
(MCEM does not scale well to large datasets)
Robustness to high-dimensional latent space
learned 2D manifolds
learned 3D manifold
Samples from MNIST
Reparameterizations of latent variables
Reparameterization of continuous latent variables
● Alternative parameterization of latent variables. Choose some:
  – ε ~ p(ε)
  – z = g(φ, ε) (invertible)
  such that z | pa ~ p(z | pa) (the correct distribution)
● z's become deterministic given ε's
● ε's are a priori independent
● Large difference in posterior dependencies and efficiency
[Figure: centered form vs. non-centered form (a neural net with injected noise)]
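The figure is not in the transcript; for a Gaussian latent variable the two forms are roughly (a reconstruction, assuming f denotes the dependence on the parents pa(z)):

```latex
\text{centered:}\quad z \mid \mathrm{pa}(z) \sim \mathcal{N}\!\big(f(\mathrm{pa}(z)),\, \sigma^2\big)
\qquad\Longleftrightarrow\qquad
\text{non-centered:}\quad z = f(\mathrm{pa}(z)) + \sigma \varepsilon,
\quad \varepsilon \sim \mathcal{N}(0, 1),
```

i.e. the non-centered form is a deterministic, neural-net-like function of its parents with injected noise.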
Experiment: MCMC sampling in a DBN
[Figure: samples and autocorrelation plots for both forms. Centered form: terribly slow mixing. Non-centered form: fast mixing.]
For more information and analysis see:
“Efficient Gradient-Based Inference through Transformations
between Bayes Nets and Neural Nets”
Diederik P Kingma, Max Welling
Conclusion
● SGVB: an efficient stochastic variational algorithm for inference and learning with continuous latent variables.
● Theano and pure numpy implementations: https://github.com/y0ast/Variational-Autoencoder.git (includes scikit-learn wrappers)
Thanks!
Appendix
The regular SVB gradient estimator
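The estimator itself is missing from the transcript; presumably it is the standard score-function (likelihood-ratio) gradient, which for the bound above reads roughly:

```latex
\nabla_\phi\, \mathbb{E}_{q_\phi(z)}\big[f(z)\big]
  = \mathbb{E}_{q_\phi(z)}\big[f(z)\, \nabla_\phi \log q_\phi(z)\big]
  \simeq \frac{1}{L} \sum_{l=1}^{L} f\big(z^{(l)}\big)\, \nabla_\phi \log q_\phi\big(z^{(l)}\big),
\qquad z^{(l)} \sim q_\phi(z),
```

with f(z) = log pθ(x, z) − log qφ(z); this is the high-variance estimator referred to earlier.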