Gaussian Process
Structural Equation Models
with Latent Variables
RICARDO SILVA
DEPARTMENT OF STATISTICAL SCIENCE
UNIVERSITY COLLEGE LONDON
ROBERT B. GRAMACY
STATISTICAL LABORATORY
UNIVERSITY OF CAMBRIDGE
Summary
A Bayesian approach for graphical models with
measurement error
Model: nonparametric DAG + linear measurement
model
Related literature: structural equation models (SEM), errors-in-variables regression
Applications: dimensionality reduction, density
estimation, causal inference
Evaluation: social sciences/marketing data, biological domain
Approach: Gaussian process prior + MCMC
Bayesian pseudo-inputs model + space-filling priors
An Overview of
Measurement Error Problems
Measurement Error Problems
[Figure: Calorie intake → Weight]
Measurement Error Problems
[Figure: Calorie intake → Reported calorie intake; Calorie intake → Weight]
Notation corner: latent vs. observed variables
Error-in-variables Regression
[Figure: Calorie intake → Reported calorie intake; Calorie intake → Weight]
Reported calorie intake = Calorie intake + error
Weight = f(Calorie intake) + error
Task: estimate error and f()
Error estimation can be treated separately
Caveat emptor: outrageously hard in theory
If errors are Gaussian, the best (!) rate of convergence is
O((1/log N)²), N being the sample size
Don’t panic
(Fan and Truong, 1993)
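To make the errors-in-variables setup above concrete, here is a minimal simulation sketch in Python/NumPy (not from the talk; the function f, the noise scales and the sample size are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500

# Latent "true" calorie intake: never observed directly.
calorie = rng.normal(loc=2200.0, scale=300.0, size=N)

# Observed proxy: reported calorie intake = calorie intake + error.
reported_calorie = calorie + rng.normal(scale=250.0, size=N)

# Outcome: weight = f(calorie intake) + error, with an illustrative nonlinear f.
def f(c):
    return 50.0 + 0.01 * c + 1e-5 * (c - 2200.0) ** 2

weight = f(calorie) + rng.normal(scale=3.0, size=N)

# Estimation task: recover f() and the error scales from (reported_calorie, weight)
# only.  Regressing weight on reported_calorie directly is biased because the
# regressor itself is measured with error.
```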
Error in Response/Density Estimation
[Figure: Calorie intake → Weight, each measured with error: Reported calorie intake and Reported weight]
Multiple Indicator Models
[Figure: Calorie intake → Weight; Calorie intake measured by Self-reported calorie intake and Assisted report of calorie intake; Weight measured by Weight recorded in the morning and Weight recorded in the evening]
Chains of Measurement Error
[Figure: latent chain Calorie intake → Weight → Well-being, with indicators Reported calorie intake, Reported weight and Reported time to fall asleep]
Widely studied as Structural Equation Models
(SEMs) with latent variables
(Bollen, 1989)
Quick Sidenote: Visualization
[Figure: latent variables Industrialization Level 1960, Democratization Level 1960 and Democratization Level 1965, with indicators such as GNP and fairness of elections]
Quick Sidenote: Visualization
(Palomo et al., 2007)
Non-parametric SEM:
Model and Inference
Traditional SEM
Some assumptions:
assume a DAG structure
assume (for simplicity only) that no observed variable has children
in the DAG
Linear functional relationships:
X_i = b_i0 + X_P(i)^T B_i + ε_i
Y_j = λ_j0 + X_P(j)^T Λ_j + ε_j
Parentless vertices ~ Gaussian
Notation corner: X denotes latent variables, Y denotes observed indicators
Our Nonparametric SEM: Likelihood
Functional relationships:
X_i = f_i(X_P(i)) + ε_i
Y_j = λ_j0 + X_P(j)^T Λ_j + ε_j
where each f_i(·) belongs to some functional space.
Parentless latent variables follow a mixture of
Gaussians; error terms are Gaussian:
ε_i ~ N(0, v_i)
ε_j ~ N(0, v_j)
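As a concrete illustration of this likelihood, the following Python/NumPy sketch generates data from a two-latent-variable instance; the graph X_1 → X_2, the loadings, noise variances and kernel length-scale are illustrative assumptions, and f is a draw from a GP prior evaluated at the realized parent values:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 300

def se_kernel(a, b, ell=1.0):
    # Squared exponential kernel exp(-|a - b|^2 / ell), as used later in the talk.
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / ell)

# Parentless latent X_1: mixture of two Gaussians.
comp = rng.integers(0, 2, size=N)
x1 = np.where(comp == 0, rng.normal(-1.5, 0.5, N), rng.normal(1.5, 0.7, N))

# Structural equation X_2 = f(X_1) + eps, with f drawn from the GP prior
# (a draw from the finite-dimensional Gaussian marginal at the x1 values).
K = se_kernel(x1, x1) + 1e-8 * np.eye(N)
f_vals = np.linalg.cholesky(K) @ rng.normal(size=N)
x2 = f_vals + rng.normal(scale=0.2, size=N)

# Linear measurement model: three indicators per latent,
# Y = lambda_0 + lambda_1 * X + error.
def indicators(x, loadings, noise_sd):
    return np.column_stack([l0 + l1 * x + rng.normal(scale=noise_sd, size=x.size)
                            for (l0, l1) in loadings])

Y = np.hstack([indicators(x1, [(0.0, 1.0), (0.3, 0.8), (-0.2, 1.2)], 0.3),
               indicators(x2, [(0.0, 1.0), (0.1, 0.9), (0.4, 1.1)], 0.3)])
# Y (an N x 6 matrix) is all the model gets to see; x1, x2 and f_vals are latent.
```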
Related Ideas
GP Networks (Friedman and Nachman, 2000):
Reduces to our likelihood for Yi = “Xi”
Gaussian process latent variable model (Lawrence, 2005):
Module networks (Segal et al., 2005):
Shared non-linearities
e.g., Y_4 = λ_40 + λ_41 f(IL) + error, Y_5 = λ_50 + λ_51 f(IL) + error
Dynamic models (e.g., Ko and Fox, 2009)
Functions between different data points, symmetry
Identifiability Conditions
Given the observed marginal M(Y) and the DAG, are M(X),
{Λ}, {v} unique?
Relevance for causal inference and embedding
Embedding: interpretation of the latent variables from MCMC is
problematic if the model is unidentifiable
Causal effect estimation: not resolved from data
Note: barring possible MCMC problems, not essential for
prediction
Illustration:
Y_j = X_1 + error, for j = 1, 2, 3; Y_j = 2 X_2 + error, for j = 4, 5, 6
X_2 = 4 X_1² + error
Identifiable Model: Walkthrough
Assumed
structure
(In this model, regression coefficients are fixed for Y1 and Y4.)
Non-Identifiable Model: Walkthrough
Assumed
structure
(Nothing fixed, and all Y freely depend on both X1 and X2.)
The Identifiability Zoo
Many roads to identifiability via different sets of
assumptions
We will ignore estimation issues in this discussion!
One generic approach boils down to a reduction to
multivariate deconvolution
Y = X + error
so that the density of X can be uniquely obtained
from the (observable) density of Y and (given)
density of error
But we have to nail the measurement error
identification problem first.
Hazelton and Turlach (2009)
Our Path in The Identifiability Zoo
The assumption of three or more “pure” indicators:
[Figure: X_i with pure indicators Y_1i, Y_2i, Y_3i]
The scale, location and sign of X_i are arbitrary, so fix
Y_1i = X_i + ε_i1
It follows that the remaining linear coefficients in
Y_ji = λ_0ji + λ_1ji X_i + ε_ji are identifiable, and so is the variance of
each error term
(Bollen, 1989)
Our Path in The Identifiability Zoo
Select one pure indicator per latent variable to form
the sets Y1 ≡ (Y_11, Y_12, ..., Y_1L) and E1 ≡ (ε_11, ε_12, ..., ε_1L)
From
Y1 = X + E1
we obtain the density of X, since the Gaussian assumption
for the error terms makes the density of E1 known
Notice: since the density of X is identifiable,
identifiability of the directionality
X_i → X_j vs. X_j → X_i is achievable in theory
(Hoyer et al., 2008)
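The deconvolution step can be made explicit with characteristic functions; a minimal derivation sketch (the standard deconvolution identity, with V the error covariance identified in the previous step):

```latex
% Y_1 = X + E_1, with E_1 ~ N(0, V) independent of X and V already identified,
% so the characteristic functions factorize:
%   \varphi_{Y_1}(t) = \varphi_{X}(t)\,\varphi_{E_1}(t),
% and since \varphi_{E_1}(t) = \exp(-\tfrac{1}{2} t^\top V t) never vanishes,
\varphi_{X}(t) \;=\; \frac{\varphi_{Y_1}(t)}{\exp\!\left(-\tfrac{1}{2}\, t^\top V\, t\right)},
% which determines the density of X uniquely via Fourier inversion.
```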
Quick Sidenote: Other Paths
Three “pure indicators” per variable might not be
reasonable
Alternatives:
Two pure indicators, non-zero correlation between latent
variables
Repeated measurements (e.g., Schennach 2004)
X* = X + error
X** = X + error
Y = f(X) + error
Also related: results on detecting presence of measurement
error (Janzing et al., 2009)
For more: Econometrica, etc.
Priors: Parametric Components
Measurement model: standard linear regression
priors
e.g., Gaussian prior for coefficients, inverse gamma for
conditional variance
Could use the standard normal-gamma priors so that
measurement model parameters are marginalized
Samples using P(Y | X, f(X)) p(X, f(X))
instead of P(Y | X, f(X), Θ) p(X, f(X)) p(Θ)
In the experiments, we won’t use such normal-gamma priors,
though, because we want to evaluate mixing in general
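For concreteness, here is a sketch of the conjugate (normal-inverse-gamma) draw of one indicator's intercept, loading and error variance given the current latent values; this is a standard Bayesian linear-regression step in Python/NumPy, with prior hyperparameters (m0, V0, a0, b0) as placeholders rather than the paper's exact settings:

```python
import numpy as np

def sample_measurement_params(y, x, m0, V0, a0, b0, rng):
    """One conjugate draw of (intercept, loading) and error variance for a single
    indicator y given the current latent values x.  Sketch of a standard Bayesian
    linear-regression update; the paper's exact prior may differ."""
    Z = np.column_stack([np.ones_like(x), x])          # design: intercept + loading
    V0_inv = np.linalg.inv(V0)
    Vn_inv = V0_inv + Z.T @ Z
    Vn = np.linalg.inv(Vn_inv)
    mn = Vn @ (V0_inv @ m0 + Z.T @ y)
    an = a0 + 0.5 * y.size
    bn = b0 + 0.5 * (y @ y + m0 @ V0_inv @ m0 - mn @ Vn_inv @ mn)
    v = 1.0 / rng.gamma(shape=an, scale=1.0 / bn)      # inverse-gamma draw for the variance
    lam = rng.multivariate_normal(mn, v * Vn)          # coefficients given the variance
    return lam, v
```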
Priors: Nonparametric Components
Function f_i(X_Pa(i)): Gaussian process prior
f_i(X_Pa(i)^(1)), f_i(X_Pa(i)^(2)), ..., f_i(X_Pa(i)^(N)) are jointly Gaussian with a
particular kernel function
Computational issues:
Scales as O(N³), N being the sample size
Standard MCMC might converge poorly due to high
conditional association between latent variables
The Pseudo-Inputs Model
Hierarchical approach
Recall: in a standard GP, from {X^(1), X^(2), ..., X^(N)} we
obtain a distribution over {f(X^(1)), f(X^(2)), ..., f(X^(N))}
Predictions of "future" observations f(X*^(1)), f(X*^(2)),
..., etc. are jointly conditionally Gaussian too
Idea:
imagine you see a pseudo training set X̄
your "actual" training set {f(X^(1)), f(X^(2)), ..., f(X^(N))} is
conditionally Gaussian given X̄
however, drop all off-diagonal elements of the conditional
covariance matrix
(Snelson and Ghahramani, 2006; Banerjee et al., 2008)
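A minimal sketch of the construction just described, in Python/NumPy, assuming a 1-D input and the squared exponential kernel from later slides (function and variable names are illustrative, not the paper's code):

```python
import numpy as np

def se_kernel(a, b, ell=1.0):
    # Squared exponential kernel exp(-|a - b|^2 / ell); k(x, x) = 1.
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / ell)

def pseudo_input_conditional(x, x_bar, f_bar, ell=1.0, jitter=1e-8):
    """Conditional of the N training function values given M pseudo-inputs x_bar
    and pseudo-function values f_bar, with the off-diagonal of the conditional
    covariance dropped.  Returns the conditional means and variances."""
    K_mm = se_kernel(x_bar, x_bar, ell) + jitter * np.eye(x_bar.size)
    K_nm = se_kernel(x, x_bar, ell)
    A = np.linalg.solve(K_mm, K_nm.T).T          # K_nm @ K_mm^{-1}
    mean = A @ f_bar
    var = 1.0 - np.sum(A * K_nm, axis=1)         # diag(K_nn - K_nm K_mm^{-1} K_mn)
    return mean, np.maximum(var, 0.0)
```

Forming this approximation costs O(NM² + M³) rather than the O(N³) of the full GP.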
The Pseudo-Inputs Model: SEM Context
Standard model
Pseudo-inputs model
Bayesian Pseudo-Inputs Treatment
Snelson and Ghahramani (2006): empirical Bayes
estimator for pseudo-inputs
Pseudo-inputs rapidly amount to many more free parameters,
sometimes prone to overfitting
Here: “space-filling” prior
Let the pseudo-inputs X̄ have bounded support
Set p(X̄_i) ∝ det(D), where D is some kernel matrix over the pseudo-inputs
A priori, this "spreads" points in some hyper-cube
No fitting: pseudo-inputs are sampled too
Essentially no (asymptotic) extra cost since we have to sample
latent variables anyway
Possible mixing problems?
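A minimal sketch of this space-filling prior for one latent variable's pseudo-inputs (Python/NumPy; the squared exponential kernel, the [-1, 1] bounds and 1-D pseudo-inputs are illustrative assumptions):

```python
import numpy as np

def log_space_filling_prior(x_bar, ell=1.0, lower=-1.0, upper=1.0):
    """Unnormalized log-density of a space-filling prior: p(x_bar) proportional to
    det(D) on a bounded hyper-cube, with D the squared exponential kernel matrix
    of the pseudo-inputs (kernel and bounds are placeholders)."""
    if np.any(x_bar < lower) or np.any(x_bar > upper):
        return -np.inf                       # outside the bounded support
    D = np.exp(-(x_bar[:, None] - x_bar[None, :]) ** 2 / ell)
    sign, logdet = np.linalg.slogdet(D)
    return logdet if sign > 0 else -np.inf   # det(D) shrinks as pseudo-inputs cluster
```

For the two-point demonstration on the next slides, with X̄^(1) fixed at 0, det(D) = 1 − exp(−2 (X̄^(2))² / l), so the prior pushes X̄^(2) away from 0 to a degree governed by l.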
Demonstration
Squared exponential kernel, hyperparameter l
exp(−|x_i − x_j|² / l)
1-dimensional pseudo-input space, 2 pseudo-data
points
X̄^(1), X̄^(2)
Fix X̄^(1) at zero, sample X̄^(2)
They are NOT independent: the distribution of X̄^(2) departs from the
uniform to a degree that depends on l
Demonstration
More on Priors and Pseudo-Points
Having a prior
mitigates overfitting
"blurs" the pseudo-inputs, which theoretically leads to wider
coverage
if the number of pseudo-inputs is "insufficient," it might provide
some edge over models with fixed pseudo-inputs, but care
should be exercised
Example
Synthetic data with quadratic relationship
Predictive Samples
Sampling 150 latent points from the predictive
distribution, 2 fixed pseudo-inputs
(Average predictive log-likelihood: -4.28)
Predictive Samples
Sampling 150 latent points from the predictive
distribution, 2 fixed pseudo-inputs
(Average predictive log-likelihood: -4.47)
Predictive Samples
Sampling 150 latent points from the predictive
distribution, 2 free pseudo-inputs with priors
(Average predictive log-likelihood: -3.89)
Predictive Samples
With 3 free pseudo-inputs
(Average predictive log-likelihood: -3.61)
MCMC Updates
Metropolis-Hastings, low parent dimensionality (≤ 3
parents in our examples)
Mostly standard. Main points:
It is possible to integrate away pseudo-functions.
Sampling the function values {f(X_j^(1)), ..., f(X_j^(N))} is done in two stages:
Sample the pseudo-functions for X̄_j conditioned on everything except the function
values, using the conditional covariance of the pseudo-functions with the "true"
functions marginalized out
Then sample {f(X_j^(1)), ..., f(X_j^(N))} (all conditionally independent)
(N = number of training points, M = number of pseudo-points)
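A sketch of this two-stage draw for a single latent variable with one parent (Python/NumPy). It assumes the structural equation X_child = f(X_parent) + ε with variance v, the squared exponential kernel, and that the rest of the model has already been conditioned on; names and the exact conditioning set are illustrative, not the paper's code:

```python
import numpy as np

def sample_f_two_stage(x_child, x_parent, x_bar, v, ell, rng, jitter=1e-8):
    """Two-stage draw under the pseudo-inputs model:
    stage 1 samples the pseudo-function values with the 'true' function values
    marginalized out; stage 2 samples the N function values, which are
    conditionally independent given the pseudo-functions."""
    def se(a, b):
        return np.exp(-(a[:, None] - b[None, :]) ** 2 / ell)

    M = x_bar.size
    K_mm = se(x_bar, x_bar) + jitter * np.eye(M)
    K_nm = se(x_parent, x_bar)
    A = np.linalg.solve(K_mm, K_nm.T).T                     # K_nm K_mm^{-1}
    s = np.maximum(1.0 - np.sum(A * K_nm, axis=1), jitter)  # var of f_n given f_bar

    # Stage 1: f_bar | x_child, with f marginalized: x_n | f_bar ~ N(A_n f_bar, s_n + v).
    D_inv = 1.0 / (s + v)
    prec = np.linalg.inv(K_mm) + (A * D_inv[:, None]).T @ A
    cov = np.linalg.inv(prec)
    mean = cov @ (A.T @ (D_inv * x_child))
    f_bar = rng.multivariate_normal(mean, cov)

    # Stage 2: f_n | f_bar, x_n are independent Gaussians (two Gaussian factors).
    post_prec = 1.0 / s + 1.0 / v
    post_mean = (A @ f_bar / s + x_child / v) / post_prec
    f = post_mean + rng.normal(size=x_child.size) / np.sqrt(post_prec)
    return f_bar, f
```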
MCMC Updates
When sampling pseudo-input variable X̄_Pa(i)^(d)
Factors: pseudo-functions and “regression weights”
Metropolis-Hastings step:
Warning: for a large number of pseudo-points,
p(f̄_i^(d) | f̄_i^(\d), X̄) can be highly peaked
Alternative: propose and sample f̄_i^(d) jointly
MCMC Updates
In order to calculate the acceptance ratio iteratively,
fast submatrix updates are necessary
to obtain O(NM) cost per pseudo-point, i.e., a total of
O(NM²)
Experiments
Setup
Evaluation of Markov chain behaviour
"Objective" model evaluation via predictive log-likelihood
Quick details
Squared exponential kernel
Prior for a (and b):
mixture of Gamma (1, 20) + Gamma(20, 20)
M = 50
Synthetic Example
Our old friend
Y_j = X_1 + error, for j = 1, 2, 3; Y_j = 2 X_2 + error, for j = 4, 5, 6
X_2 = 4 X_1² + error
Synthetic Example
Visualization: comparison against GPLVM
Nonparametric factor-analysis, independent Gaussian
marginals for latent variables
GPLVM: (Lawrence, 2005)
MCMC Behaviour
Example: consumer data
Identify the factors that affect willingness to pay more to consume
environmentally friendly products
16 indicators of environmental beliefs and attitudes, measuring 4
hidden variables
X1: Pollution beliefs
X2: Buying habits
X3: Consumption habits
X4: Willingness to spend more
333 datapoints.
Latent structure
X_1 → X_2, X_1 → X_3, X_2 → X_3, X_3 → X_4
(Bartholomew et al., 2008)
MCMC Behaviour
[Figures: MCMC diagnostic plots for the sparse model ("SparseModel") and for an unidentifiable model]
Predictive Log-likelihood Experiment
Goal: compare the predictive log-likelihood of
Pseudo-input GPSEM, linear and quadratic polynomial models,
GPLVM and subsampled full GPSEM
Dataset 1: Consumer data
Dataset 2: Abalone (also found in UCI)
Postulate two latent variables, “Size” and “Weight.” Size has as
indicators the length, diameter and height of each abalone specimen,
while Weight has as indicators the four weight variables. 3000+
points.
Dataset 3: Housing (also found in UCI)
Includes indicators about features of suburbs in Boston that are
relevant for the housing market. 3 latent variables, ~400 points
Abalone: Example
[Figure: predictive distributions for the Abalone indicators V1–V8; axis ticks omitted]
Housing: Example
[Figure: predictive distributions for the Housing indicators V1–V4; axis ticks omitted]
Results
Pseudo-input GPSEM is at least an order
of magnitude faster than the "full" GPSEM
model (which is infeasible on Housing). Even
when subsampled to 300 points, the full
GPSEM is still slower.
Predictive Samples
Conclusion and Future Work
Even Metropolis-Hastings does a somewhat decent
job (for sparse models)
Potential problems with ordinal/discrete data.
Evaluation of high-dimensional models
Structure learning
Hierarchical models
Comparisons against
random projection approximations
mixture of Gaussian processes with limited mixture size
Full MATLAB code available
Acknowledgements
Thanks to Patrik Hoyer, Ed Snelson and Irini
Moustaki.
Extra References (not in the paper)
S. Banerjee, A. Gelfand, A. Finley and H. Sang
(2008). “Gaussian predictive process models for
large spatial data sets”. JRSS B.
D. Janzing, J. Peters, J. M. Mooij and B. Schölkopf
(2009). "Identifying confounders using additive noise
models". UAI.
M. Hazelton and B. Turlach (2009). “Nonparametric
density deconvolution by weighted kernel
estimators”. Statistics and Computing.
S. Schennach (2004). "Estimation of nonlinear
models with measurement error". Econometrica 72.