Transcript Slide 1

Lars Kasper, December 15th 2010
PATTERN RECOGNITION
AND MACHINE LEARNING
CHAPTER 12: CONTINUOUS LATENT VARIABLES
Relation To Other Topics
• Last weeks: Approximate Inference
• Today: Back to
  • data preprocessing
  • data representation / feature extraction
  • "model-free" analysis
  • dimensionality reduction
  • the Φ-matrix
• Link: We also have a (particularly easy) model of the underlying state of the world whose parameters we want to infer from the data
Take-home TLAs (Three-letter acronyms)
Although termed “continuous latent variables”,
we mainly deal with
• PCA (Principal Component Analysis)
• ICA (Independent Component Analysis)
• Factor analysis
General motivation/theme: “What is interesting
about my data – but hidden (latent)? …
And what is just noise?”
Importance Sampling ;-)
Publications concerning fMRI and (PCA or ICA or factor analysis)
(Source: ISI Web of Knowledge, Dec 13th, 2010)

Year   Publications   Share of total
1996      2            0.1918 %
1997      3            0.2876 %
1998      7            0.6711 %
1999     17            1.6299 %
2000     33            3.1640 %
2001     41            3.9310 %
2002     54            5.1774 %
2003     53            5.0815 %
2004     77            7.3826 %
2005     85            8.1496 %
2006     98            9.3960 %
2007    115           11.0259 %
2008    139           13.3269 %
2009    160           15.3404 %
2010    157           15.0527 %
Importance Sampling: fMRI
• Used for fMRI analysis, e.g. in the software package FSL ("MELODIC")
MELODIC Tutorial: 2nd principal component (eigenimage) and corresponding time
series of a visual block stimulation
Motivation: Low intrinsic dimensionality
• Generating hand-written digit samples by translating and
rotating one example 100 times
• High-dimensional data (100 × 100 pixels, i.e. 10,000 dimensions)
• Low degrees of freedom (1 rotation angle, 2 translations) – see the sketch below
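A minimal sketch of how such a data set could be generated (not from the slides; scipy's rotate/shift and the crude template digit are illustrative choices): each 10,000-dimensional image vector is controlled by just three parameters.

```python
import numpy as np
from scipy.ndimage import rotate, shift

rng = np.random.default_rng(0)

def make_samples(template, n=100):
    """Generate high-dimensional samples governed by only 3 degrees of freedom."""
    samples = []
    for _ in range(n):
        angle = rng.uniform(-45, 45)           # 1 rotation angle
        dx, dy = rng.uniform(-10, 10, size=2)  # 2 translations
        img = rotate(template, angle, reshape=False, order=1)
        img = shift(img, (dy, dx), order=1)
        samples.append(img.ravel())            # flatten to a 10,000-dim vector
    return np.array(samples)                   # shape (n, 10000)

template = np.zeros((100, 100))
template[30:70, 45:55] = 1.0                   # crude stand-in for a digit "1"
X = make_samples(template)
```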
Roadmap for today
• Standard PCA (heuristic): dimensionality reduction, maximum variance, minimum error
• Probabilistic PCA (Maximum Likelihood): generative probabilistic model, ML-equivalence to standard PCA
• Bayesian PCA: automatic determination of the latent space dimension
• Generalizations: relaxing equal data noise amplitude (factor analysis), relaxing Gaussianity (ICA), relaxing linearity (kernel PCA)
Heuristic PCA: Projection View
[Figure: 2D data projected onto a 1D line]
How do we simplify or compress our data (make it low-dimensional) without losing actual information?
⇒ Dimensionality reduction by projecting onto a linear subspace
Heuristic PCA: Dimensionality Reduction
High-dimensional data
• Dimension $D$
• Data points $\mathbf{x}_n$
→ Projection →
Low-dimensional subspace
• Dimension $M$
• Projected data points $\tilde{\mathbf{x}}_n$
Advantages:
• Reduced amount of data
• Might make it easier to reveal structure within the data (pattern recognition, data visualization)
Heuristic PCA: Maximum Variance View
• We want to reduce the dimensionality of our data space via a linear projection.
• But we still want to keep the projected samples as different as possible.
• A good measure for this difference is the data covariance, expressed by the matrix
$$S = \frac{1}{N} \sum_{n=1}^{N} (\mathbf{x}_n - \bar{\mathbf{x}})(\mathbf{x}_n - \bar{\mathbf{x}})^T$$
• Note: This expresses the covariance between different data dimensions, not between data points.
• We now aim to maximize the variance of the projected data in the projection space spanned by the basis vectors $\{\mathbf{u}_1, \dots, \mathbf{u}_M\}$.
($\bar{\mathbf{x}}$ – mean of all data points, $N$ – number of data points)
Maximum Variance View: The Maths
• Maximum variance formulation of the 1D projection with projection vector $\mathbf{u}_1$:
$$\text{projected variance} = \frac{1}{N} \sum_{n=1}^{N} \left( \mathbf{u}_1^T \mathbf{x}_n - \mathbf{u}_1^T \bar{\mathbf{x}} \right)^2 = \mathbf{u}_1^T S \mathbf{u}_1$$
• Constrained optimization with $\|\mathbf{u}_1\| = 1$, via the Lagrangian
$$\mathbf{u}_1^T S \mathbf{u}_1 + \lambda_1 \left( 1 - \mathbf{u}_1^T \mathbf{u}_1 \right)$$
• This leads to the best projector being an eigenvector of $S$, the data covariance matrix:
$$S \mathbf{u}_1 = \lambda_1 \mathbf{u}_1$$
• with maximum projected variance equal to the maximum eigenvalue:
$$\mathbf{u}_1^T S \mathbf{u}_1 = \lambda_1 \mathbf{u}_1^T \mathbf{u}_1 = \lambda_1$$
Heuristic PCA: Conclusion
By induction we obtain the general PCA result for maximizing the variance of the data in the projected dimensions:
The projection vectors {𝒖𝟏 , … , 𝒖𝑴 } shall
be the eigenvectors corresponding to the
largest 𝑀 eigenvalues of the data
covariance matrix 𝑆. These vectors are
called the principal components.
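For concreteness, a minimal numpy sketch of this recipe (not from the slides): eigendecompose the sample covariance matrix $S$ and project onto the $M$ leading eigenvectors.

```python
import numpy as np

def pca(X, M):
    """Heuristic PCA: project the rows of X onto the M principal components.

    X: (N, D) data matrix, M: target dimension.
    Returns the projected coordinates (N, M), the principal axes (D, M),
    and the data mean (D,).
    """
    x_bar = X.mean(axis=0)
    Xc = X - x_bar                            # centre the data
    S = Xc.T @ Xc / X.shape[0]                # data covariance matrix (D, D)
    eigvals, eigvecs = np.linalg.eigh(S)      # eigenvalues in ascending order
    U_M = eigvecs[:, ::-1][:, :M]             # M eigenvectors with largest eigenvalues
    Z = Xc @ U_M                              # coordinates in the principal subspace
    return Z, U_M, x_bar
```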
Heuristic PCA: Minimum error formulation
• By projecting, we want to lose as little information as possible, i.e. keep the projected data points $\tilde{\mathbf{x}}_n$ as similar to the raw data $\mathbf{x}_n$ as possible.
• Therefore we minimize the mean squared error
$$J = \frac{1}{N} \sum_{n=1}^{N} \left\| \mathbf{x}_n - \tilde{\mathbf{x}}_n \right\|^2, \qquad \text{where } \tilde{\mathbf{x}}_n = \sum_{d=1}^{M} \left( \mathbf{u}_d^T \mathbf{x}_n \right) \mathbf{u}_d,$$
with respect to the projection vectors $\{\mathbf{u}_1, \dots, \mathbf{u}_M\}$.
• This leads to the same result as in the maximum variance formulation (checked numerically below):
$\{\mathbf{u}_1, \dots, \mathbf{u}_M\}$ shall be the eigenvectors corresponding to the largest $M$ eigenvalues of the data covariance matrix $S$.
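A small numerical check of this equivalence (a sketch, not from the slides, assuming zero-mean data): the minimum achievable $J$ equals the sum of the discarded eigenvalues $\sum_{i=M+1}^{D} \lambda_i$.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, M = 500, 5, 2
X = rng.standard_normal((N, D)) @ rng.standard_normal((D, D))
X -= X.mean(axis=0)                              # zero-mean data

S = X.T @ X / N
eigvals = np.sort(np.linalg.eigvalsh(S))[::-1]   # descending eigenvalues
U_M = np.linalg.eigh(S)[1][:, ::-1][:, :M]       # top-M eigenvectors

X_tilde = (X @ U_M) @ U_M.T                      # projected/reconstructed points
J = np.mean(np.sum((X - X_tilde) ** 2, axis=1))  # mean squared error

print(J, eigvals[M:].sum())                      # both values should agree
```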
Example: Eigenimages
Eigenimages II
Christopher DeCoro http://www.cs.princeton.edu/cdecoro/eigenfaces/
Dimensionality Reduction
Roadmap for today
Standard PCA (heuristic) → Probabilistic PCA (Maximum Likelihood) → Bayesian PCA → Generalizations (factor analysis, ICA, kernel PCA)
Probabilistic PCA: A synthesizer’s view
$$\mathbf{x} = W\mathbf{z} + \boldsymbol{\mu} + \boldsymbol{\epsilon}$$
• $\mathbf{z} \sim N(\mathbf{0}, I_M)$ – standardized normal distribution
  • Independent latent variables with zero mean and unit variance
• $\boldsymbol{\epsilon} \sim N(\mathbf{0}, \sigma^2 I_D)$ – a spherical Gaussian
  • i.e. independent, identically distributed noise in each of the $D$ data dimensions
• Prior predictive or marginal distribution of the data points:
$$\mathbf{x} \sim N(\boldsymbol{\mu}, C) \quad \text{with} \quad C = W W^T + \sigma^2 I_D$$
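A minimal generative sketch of this "synthesizer" view (not from the slides; the dimensions and noise level are arbitrary choices): draw $\mathbf{z}$, map it linearly into data space, and add spherical noise.

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, N = 10, 3, 1000                     # data dim, latent dim, number of samples
sigma = 0.1

W = rng.standard_normal((D, M))           # linear mapping from latent to data space
mu = rng.standard_normal(D)               # data mean

Z = rng.standard_normal((N, M))           # z ~ N(0, I_M)
eps = sigma * rng.standard_normal((N, D)) # spherical Gaussian noise
X = Z @ W.T + mu + eps                    # x = W z + mu + eps, stacked row-wise

# Empirical covariance should approach C = W W^T + sigma^2 I_D
C_emp = np.cov(X, rowvar=False)
C_model = W @ W.T + sigma**2 * np.eye(D)
```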
Probabilistic PCA: ML-solution
$$\mathbf{x} = W\mathbf{z} + \boldsymbol{\mu} + \boldsymbol{\epsilon}, \qquad p(\mathbf{x}) = N(\mathbf{x} \mid \boldsymbol{\mu}, C), \qquad C = W W^T + \sigma^2 I_D$$
$$\boldsymbol{\mu}_{ML} = \bar{\mathbf{x}} \qquad \Rightarrow \text{same as in heuristic PCA}$$
$$W_{ML} = U_M \left( L_M - \sigma_{ML}^2 I \right)^{1/2} R$$
$$\sigma_{ML}^2 = \frac{1}{D-M} \sum_{i=M+1}^{D} \lambda_i$$
• $U_M$ – matrix of the first $M$ eigenvectors, $L_M$ – diagonal matrix of the corresponding eigenvalues
• $W_{ML}$ is only specified up to a rotation $R$ in latent space
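In code, the closed-form solution could look as follows (a sketch, not from the slides; the arbitrary rotation is fixed to $R = I$):

```python
import numpy as np

def ppca_ml(X, M):
    """Closed-form maximum-likelihood PPCA as on the slide, with R = I."""
    N, D = X.shape
    mu_ml = X.mean(axis=0)
    S = (X - mu_ml).T @ (X - mu_ml) / N
    eigvals, eigvecs = np.linalg.eigh(S)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # sort descending
    sigma2_ml = eigvals[M:].mean()                       # average of discarded eigenvalues
    W_ml = eigvecs[:, :M] @ np.diag(np.sqrt(eigvals[:M] - sigma2_ml))
    return mu_ml, W_ml, sigma2_ml
```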
Recap: The EM-algorithm
• The Expectation-Maximization algorithm determines the maximum-likelihood solution for our model parameters $W_{ML}$, $\sigma_{ML}^2$ (and $\mu_{ML}$) iteratively.
• Advantageous compared to a direct eigenvector decomposition if $M \ll D$, i.e. if we have considerably fewer latent variables than data dimensions
  • e.g. projection onto a very low-dimensional space $(z_1, z_2)$ for data visualization
EM-Algorithm: Expectation Step
• We consider the complete-data likelihood
$$p(X, Z) = p(X \mid Z)\, p(Z)$$
  • Maximizing the marginal likelihood $p(X \mid \mu, \sigma^2, W) = \int p(X \mid Z)\, p(Z)\, dZ$ instead would require an integration over latent space.
• E-step: The posterior distribution of the latent variables is updated and used to calculate the expected value of the complete-data log-likelihood with respect to $z$:
$$E_z[\log p(X, Z)]$$
  • keeping the current estimates of $W$, $\sigma^2$, $\mu$ fixed
($X = (\mathbf{x}_1, \dots, \mathbf{x}_N)$ – matrix of all data points; $Z = (\mathbf{z}_1, \dots, \mathbf{z}_N)$ – matrix of all continuous latent variable points)
EM-Algorithm: Maximization Step
• M-step: The calculated expectation $E_z[\log p(X, Z)]$ is now maximized with respect to $W$, $\sigma^2$, $\mu$:
$$W_{new},\, \sigma^2_{new} = \underset{W,\, \sigma^2}{\operatorname{argmax}}\; E_z[\log p(X, Z)]$$
• keeping the estimated posterior distribution of $z$ from the E-step fixed
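For concreteness, a compact numpy sketch of these two steps for probabilistic PCA (not from the slides; it follows the standard PPCA EM updates and assumes the data have already been centred, so $\mu$ is not re-estimated):

```python
import numpy as np

def ppca_em(X, M, n_iter=100, sigma2=1.0, seed=0):
    """EM for probabilistic PCA on centred data X of shape (N, D)."""
    N, D = X.shape
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((D, M))

    for _ in range(n_iter):
        # E-step: posterior moments of the latent variables
        Minv = np.linalg.inv(W.T @ W + sigma2 * np.eye(M))   # (M, M)
        Ez = X @ W @ Minv                                    # E[z_n], shape (N, M)
        Ezz_sum = N * sigma2 * Minv + Ez.T @ Ez              # sum_n E[z_n z_n^T]

        # M-step: re-estimate W and sigma^2
        W_new = (X.T @ Ez) @ np.linalg.inv(Ezz_sum)
        sigma2 = (np.sum(X**2)
                  - 2 * np.sum(Ez * (X @ W_new))
                  + np.trace(Ezz_sum @ W_new.T @ W_new)) / (N * D)
        W = W_new

    return W, sigma2
```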
EM-algorithm for ML-PCA
[Figure: physical analogy of EM for ML-PCA – data points attached by springs to a rod representing the principal subspace, alternating E- and M-steps]
• Green dots: data points, always fixed
• Expectation: the red rod is fixed; the cyan connections of the blue springs move, obeying the spring forces ($\text{Force} = -\text{const} \cdot x$, $\text{Energy} = \text{const} \cdot x^2$)
• Maximization: the cyan connections are fixed; the red rod moves (obeying the spring forces)
Roadmap for today
Standard PCA (heuristic) → Probabilistic PCA (Maximum Likelihood) → Bayesian PCA → Generalizations (factor analysis, ICA, kernel PCA)
Bayesian PCA – Finding the real dimension
$$\mathbf{x} = W\mathbf{z} + \boldsymbol{\mu} + \boldsymbol{\epsilon}$$
[Figure: Hinton diagrams of the estimated projection matrix $W$ for an $M = 10$ dimensional latent variable model and synthetic data generated from a latent model with $M = 3$]
• Maximum Likelihood: estimating $\mu$, $\sigma$, $W$
• Bayesian PCA: introducing hyperparameters and marginalizing $W$:
$$W = (\mathbf{w}_1, \dots, \mathbf{w}_{M=D}), \qquad \mathbf{w}_i \sim N(\boldsymbol{\mu}_i,\, \alpha_i^{-1} I)$$
• Columns $\mathbf{w}_i$ whose precision hyperparameter $\alpha_i$ is driven to large values are effectively switched off, which determines the latent space dimension automatically (see the read-out sketch below).
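As an illustrative read-out only (not the full Bayesian PCA algorithm, which interleaves the hyperparameter re-estimation with EM updates of $W$ under the $\alpha$-dependent prior), a sketch of how the switched-off columns could be counted; the update $\alpha_i = D / \|\mathbf{w}_i\|^2$ and the pruning threshold are assumptions of this sketch:

```python
import numpy as np

def effective_dimension(W, rel_threshold=1e-3):
    """Count the columns of W that Bayesian PCA keeps 'switched on'.

    Illustrative ARD-style read-out: each column w_i gets a precision
    alpha_i = D / ||w_i||^2; columns with a very large alpha_i (i.e. a tiny
    norm relative to the largest column) are treated as pruned.
    """
    D = W.shape[0]
    col_norms_sq = np.sum(W**2, axis=0)
    alpha = D / np.maximum(col_norms_sq, 1e-12)            # avoid division by zero
    active = col_norms_sq > rel_threshold * col_norms_sq.max()
    return int(active.sum()), alpha
```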
Roadmap for today
Standard PCA (heuristic) → Probabilistic PCA (Maximum Likelihood) → Bayesian PCA → Generalizations (factor analysis, ICA, kernel PCA)
Factor Analysis: A non-spherical PCA
$$\mathbf{x} = W\mathbf{z} + \boldsymbol{\mu} + \boldsymbol{\epsilon} \quad \text{with} \quad \boldsymbol{\epsilon} \sim N\big(\mathbf{0}, \operatorname{diag}(\Psi_1, \Psi_2, \dots, \Psi_D)\big), \qquad W = (\mathbf{w}_1, \dots, \mathbf{w}_D)$$
• Noise is still independent and Gaussian
• $\Psi_d$ – "uniquenesses"
• $\mathbf{w}_d$ – "factor loadings"
• Controversy: Do the factors (the dimensions of $\mathbf{z}$) have an interpretable meaning?
  • Problem: the marginal distribution $p(\mathbf{x})$ is invariant w.r.t. rotations of $W$:
$$\mathbf{x} \sim N(\boldsymbol{\mu}, C) \quad \text{with} \quad C = W W^T + \Psi$$
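A short sketch (not from the slides) contrasting the two noise models with scikit-learn, assuming its FactorAnalysis and PCA estimators: the per-dimension noise variances recovered by factor analysis play the role of the uniquenesses $\Psi_d$, whereas probabilistic PCA estimates a single shared $\sigma^2$.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis, PCA

rng = np.random.default_rng(0)
N, D, M = 2000, 6, 2

# Synthetic data: linear latent model with a different noise level per dimension
W = rng.standard_normal((D, M))
psi = rng.uniform(0.05, 1.0, size=D)                 # "uniquenesses"
X = rng.standard_normal((N, M)) @ W.T + rng.standard_normal((N, D)) * np.sqrt(psi)

fa = FactorAnalysis(n_components=M).fit(X)
pca = PCA(n_components=M).fit(X)

print(psi)                    # true per-dimension noise variances
print(fa.noise_variance_)     # factor analysis: one estimate per dimension
print(pca.noise_variance_)    # probabilistic PCA: a single shared estimate
```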
Independent Component Analysis (ICA)
$$\mathbf{x} = W\mathbf{z} \quad \text{with} \quad p(\mathbf{z}) = p(z_1) \cdot \ldots \cdot p(z_{D=M})$$
• Still a linear model of independent components
• No data noise component; dim(latent space) = dim(data space)
• Explicitly non-Gaussian
  • Otherwise no separation of the mixing coefficients in $W$ from the latent variables $\mathbf{z}$ would be possible
  • (rotational symmetry of the Gaussian)
• Estimation by maximization of non-Gaussianity/independence, as sketched below
  • Different criteria, e.g. kurtosis, skewness
  • Minimization of mutual information
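A brute-force sketch of the non-Gaussianity criterion (not from the slides; the uniform sources, mixing matrix and angle scan are illustrative choices): after whitening, separation reduces to finding a rotation, and the direction with maximal |kurtosis| recovers a source.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent, clearly non-Gaussian sources (uniform), linearly mixed
S = rng.uniform(-1, 1, size=(5000, 2))
A = np.array([[2.0, 1.0], [1.0, 1.5]])
X = S @ A.T

# Whiten the mixtures (zero mean, identity covariance)
X -= X.mean(axis=0)
d, E = np.linalg.eigh(np.cov(X, rowvar=False))
Xw = X @ E @ np.diag(1.0 / np.sqrt(d))

def kurtosis(y):
    """Excess kurtosis of a (roughly unit-variance) 1D projection."""
    return np.mean(y**4) - 3 * np.mean(y**2) ** 2

# After whitening, ICA reduces to a rotation: scan angles and pick the
# direction whose projection is maximally non-Gaussian (largest |kurtosis|)
angles = np.linspace(0, np.pi, 180)
kurt = [abs(kurtosis(Xw @ np.array([np.cos(a), np.sin(a)]))) for a in angles]
best = angles[int(np.argmax(kurt))]
print("estimated un-mixing direction (angle):", best)
```

In practice an iterative algorithm such as FastICA (available, e.g., in scikit-learn) would replace this angle scan.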
ICA vs PCA
Unsupervised method: no class labels!
[Figure: 2D point cloud with the 1st independent component (ICA) and the 1st principal component (PCA) overlaid]
• ICA rewards bi-modality of the projected distribution
• PCA rewards maximum variance of the projected data
Summary
• Parameter estimation: heuristic quadratic cost function (minimum-error projection); probabilistic (maximum-likelihood projection matrix); Bayesian (hyperparameters of the projection vectors)
• Generative probabilistic process in latent space: standardized normal distribution (PCA); standardized normal distribution (factor analysis); independent probabilistic process for each dimension (ICA)
• Noise in data space: spherical Gaussian (PCA); Gaussian (factor analysis); none (ICA)
• Feature mapping (latent to data space): linear (PCA, ICA, factor analysis); nonlinear (kernel PCA)
Relation To Other Topics
• Today: data preprocessing
  • Data representation / feature extraction
  • "Model-free" analysis
    • Well: NO! We have seen the model assumptions in probabilistic PCA
  • Dimensionality reduction
    • Whitening: transforming the covariance into the identity
    • Via projection onto the basis vectors carrying the most variance / leaving the smallest error
    • At least for linear models, not for kernel PCA
  • The Φ-matrix
Kernel PCA
• Instead of the sample covariance matrix, we now consider a covariance matrix in a feature space:
$$C = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n \mathbf{x}_n^T \qquad \longrightarrow \qquad C = \frac{1}{N} \sum_{n=1}^{N} \Phi(\mathbf{x}_n)\, \Phi(\mathbf{x}_n)^T$$
• As always, the kernel trick of not computing in the high-dimensional feature space $\Phi(X)$ works because the covariance matrix only needs scalar products of the $\Phi(\mathbf{x}_n)$, as sketched below.
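A compact numpy sketch of this (not from the slides; the Gaussian kernel and its width are illustrative choices): every step goes through the $N \times N$ Gram matrix of scalar products, never through the feature space itself.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Gaussian (RBF) kernel matrix k(x_n, x_m) = exp(-gamma * ||x_n - x_m||^2)."""
    sq_dists = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    return np.exp(-gamma * sq_dists)

def kernel_pca(X, M, gamma=1.0):
    """Kernel PCA: projections of the training data onto the first M
    nonlinear principal components (all coefficients live in the Gram matrix)."""
    N = X.shape[0]
    K = rbf_kernel(X, gamma)

    # Centre the (implicit) feature vectors via the Gram matrix
    one_n = np.ones((N, N)) / N
    K_c = K - one_n @ K - K @ one_n + one_n @ K @ one_n

    # Eigendecomposition of the centred Gram matrix (descending order)
    eigvals, eigvecs = np.linalg.eigh(K_c)
    eigvals, eigvecs = eigvals[::-1][:M], eigvecs[:, ::-1][:, :M]

    # Projection of training point n onto component i: sqrt(eigval_i) * a_{i,n}
    return eigvecs * np.sqrt(np.maximum(eigvals, 0))
```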
Kernel PCA – Example: Gaussian kernel
• Kernel PCA does not enable dimensionality reduction via $\tilde{\mathbf{x}}_n = \sum_{i=1}^{L<D} \left( \mathbf{x}_n^T \mathbf{u}_i \right) \mathbf{u}_i$
• $\Phi(X)$ is a manifold in feature space, not a linear subspace
• The PCA projects onto subspaces in feature space with elements $\tilde{\mathbf{x}}_n$
• These elements typically do not lie in $\Phi(X)$, so their pre-images $\Phi^{-1}(\tilde{\mathbf{x}}_n)$ will not be in data space