Lecture 12: Dimension reduction - PCA and SIR


Dimension reduction (1)
Overview
PCA
Factor Analysis
EDR space
SIR
References:
Applied Multivariate Analysis.
http://www.stat.ucla.edu/~kcli/sir-PHD.pdf
Overview
The purpose of dimension reduction:
Data simplification
Data visualization
Reduce noise (if we can assume only the dominating dimensions are signals)
Variable selection for prediction
Overview
An analogy:
Outcome variable y exists:
  Data separation: classification, regression (learning the association rule)
  Dimension reduction: SIR, class-preserving projection, partial least squares
No outcome variable:
  Data separation: clustering (learning intrinsic structure)
  Dimension reduction: PCA, MDS, Factor Analysis, ICA, NCA, ...
PCA
Explain the variance-covariance structure among a set of random variables by a few linear combinations of the variables.
Does not require normality!
PCA
First PC:
the linear combination $a_1'X$ that maximizes $\mathrm{Var}(a_1'X)$,
subject to $a_1'a_1 = 1$
$i$th PC:
$a_i'X$ that maximizes $\mathrm{Var}(a_i'X)$,
subject to $a_i'a_i = 1$ and $\mathrm{Cov}(a_i'X, a_k'X) = 0$ for all $k < i$
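As a minimal sketch of this construction in Python/numpy (my own illustration, not the lecture's code), the coefficient vectors $a_i$ can be taken as the eigenvectors of the sample covariance matrix, ordered by decreasing eigenvalue:

```python
import numpy as np

def pca(X, n_components=None):
    """PCA via eigendecomposition of the sample covariance matrix.

    X: (n, p) data matrix. Returns (eigenvalues, eigenvectors, scores),
    with eigenvalues sorted in decreasing order.
    """
    Xc = X - X.mean(axis=0)                 # center each variable
    S = np.cov(Xc, rowvar=False)            # sample covariance (p x p)
    eigvals, eigvecs = np.linalg.eigh(S)    # eigh: symmetric matrix
    order = np.argsort(eigvals)[::-1]       # sort by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    if n_components is not None:
        eigvals, eigvecs = eigvals[:n_components], eigvecs[:, :n_components]
    scores = Xc @ eigvecs                   # the PCs: projections a_i'X
    return eigvals, eigvecs, scores

# Example: 200 observations of 5 correlated variables (simulated)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
lam, A, Z = pca(X)
print(lam / lam.sum())   # proportion of variance explained by each PC
```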
PCA
Reminder of some results for random vectors
Reminder of some results for random vectors
Proof of the first (and second) point of the previous slide: for a positive definite matrix $A$ with eigenvalue-eigenvector pairs $(\lambda_1, e_1), \ldots, (\lambda_p, e_p)$, $\lambda_1 \ge \cdots \ge \lambda_p$, the maximum of $x'Ax / x'x$ over $x \ne 0$ is $\lambda_1$, attained at $x = e_1$.
Write $A = P\Lambda P'$, so $A^{1/2} = P\Lambda^{1/2}P'$, and let $y = P'x$. Then
$$\frac{x'Ax}{x'x} = \frac{x'A^{1/2}A^{1/2}x}{x'PP'x} = \frac{y'\Lambda y}{y'y} = \frac{\sum_{i=1}^{p}\lambda_i y_i^2}{\sum_{i=1}^{p} y_i^2} \le \lambda_1 .$$
For $x = e_1$ we have $y = P'e_1 = [1\ 0\ \cdots\ 0]'$, so
$$\frac{x'Ax}{x'x} = \frac{e_1'Ae_1}{e_1'e_1} = \lambda_1 ,$$
and the bound is attained.
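A quick numerical check of this maximization result, using a random symmetric positive definite matrix as a stand-in for $A$ (a hypothetical example, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(4, 4))
A = B @ B.T + 4 * np.eye(4)          # a positive definite matrix

lam, P = np.linalg.eigh(A)           # eigenvalues in increasing order
lam1, e1 = lam[-1], P[:, -1]         # largest eigenvalue and its eigenvector

# Rayleigh quotient x'Ax / x'x for many random directions never exceeds lam1
X = rng.normal(size=(10000, 4))
rq = np.einsum('ij,jk,ik->i', X, A, X) / np.einsum('ij,ij->i', X, X)
print(rq.max() <= lam1 + 1e-9)                   # True
print(np.isclose(e1 @ A @ e1 / (e1 @ e1), lam1)) # True: bound attained at e1
```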
PCA
The eigenvalues are the variance components: $\mathrm{Var}(a_k'X) = \lambda_k$.
Proportion of total variance explained by the $k$th PC: $\lambda_k / (\lambda_1 + \lambda_2 + \cdots + \lambda_p)$.
PCA
The geometrical interpretation of PCA: the PCs correspond to a rotation of the original coordinate axes so that the new axes point along the directions of maximal (and mutually uncorrelated) variation in the data.
PCA
PCA using the correlation matrix, instead of the
covariance matrix?
This is equivalent to first standardizing each X variable (mean 0, variance 1).
PCA
Using the correlation matrix avoids domination by one X variable due to scaling (unit changes), for example using inches instead of feet. Example:
$$S = \begin{bmatrix} 1 & 4 \\ 4 & 100 \end{bmatrix}, \qquad r = \begin{bmatrix} 1 & 0.4 \\ 0.4 & 1 \end{bmatrix}$$
PCA from $S$:
$\lambda_1 = 100.16,\quad e_1' = [0.040\ \ 0.999]$
$\lambda_2 = 0.84,\quad e_2' = [0.999\ \ {-0.040}]$
PCA from $r$:
$\lambda_1 = 1 + r = 1.4,\quad e_1' = [0.707\ \ 0.707]$
$\lambda_2 = 1 - r = 0.6,\quad e_2' = [0.707\ \ {-0.707}]$
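A small numpy check reproducing the eigenvalues and eigenvectors quoted above (eigenvectors are determined only up to sign):

```python
import numpy as np

S = np.array([[1.0, 4.0],
              [4.0, 100.0]])   # covariance matrix
r = np.array([[1.0, 0.4],
              [0.4, 1.0]])     # correlation matrix

for name, M in [("S", S), ("r", r)]:
    lam, E = np.linalg.eigh(M)
    order = np.argsort(lam)[::-1]          # decreasing eigenvalues
    lam, E = lam[order], E[:, order]
    print(name, np.round(lam, 2), np.round(E.T, 3))
# S: eigenvalues ~ [100.16, 0.84]; first eigenvector ~ [0.040, 0.999] (up to sign)
# r: eigenvalues ~ [1.4, 0.6];     eigenvectors ~ [0.707, +/-0.707]
```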
PCA
Selecting the number of components?
Based on the eigenvalues (% variation explained). Assumption: the small amount of variation explained by the low-ranked PCs is noise.
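A minimal sketch of this rule in numpy, assuming a hypothetical cutoff of 90% cumulative variance explained (the lecture does not prescribe a specific threshold):

```python
import numpy as np

def n_components_for(eigvals, threshold=0.90):
    """Smallest k such that the first k PCs explain >= threshold of the variance."""
    prop = np.asarray(eigvals) / np.sum(eigvals)   # proportion of variance per PC
    return int(np.searchsorted(np.cumsum(prop), threshold) + 1)

print(n_components_for([100.16, 0.84]))        # 1: the first PC already explains ~99%
print(n_components_for([3.0, 2.0, 1.0, 0.5]))  # 3 components needed for 90%
```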
Factor Analysis
If we take the first several PCs that explain most of the variation in the data, we have one form of factor model:
$$X - \mu = LF + \varepsilon$$
L: loading matrix
F: unobserved random vector (latent variables/factors)
$\varepsilon$: unobserved random vector (noise)
Factor Analysis
The orthogonal factor model assumes no correlation among the factor RVs, and no correlation between factors and noise: $E(F) = 0$, $\mathrm{Cov}(F) = I$, $\mathrm{Cov}(F, \varepsilon) = 0$, and $\mathrm{Cov}(\varepsilon) = \Psi$ is a diagonal matrix. Then
$$(X - \mu)(X - \mu)' = (LF + \varepsilon)(LF + \varepsilon)' = LFF'L' + \varepsilon F'L' + LF\varepsilon' + \varepsilon\varepsilon'$$
$$\Sigma = \mathrm{Cov}(X) = E\big[(X - \mu)(X - \mu)'\big] = L\,E(FF')\,L' + E(\varepsilon F')L' + L\,E(F\varepsilon') + E(\varepsilon\varepsilon') = LL' + \Psi$$
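A small simulation sketch of this identity, with a hypothetical loading matrix L and diagonal Psi of my own choosing (not from the lecture), checking that the sample covariance of X = LF + eps approaches LL' + Psi:

```python
import numpy as np

rng = np.random.default_rng(2)
p, m, n = 5, 2, 200_000

L = rng.normal(size=(p, m))               # hypothetical loading matrix
Psi = np.diag(rng.uniform(0.2, 1.0, p))   # diagonal noise covariance

F = rng.normal(size=(n, m))                              # Cov(F) = I
eps = rng.normal(size=(n, p)) * np.sqrt(np.diag(Psi))    # Cov(eps) = Psi
X = F @ L.T + eps                                        # mu = 0 for simplicity

# Largest entry-wise gap between Cov(X) and LL' + Psi; ~0 up to sampling error
print(np.max(np.abs(np.cov(X, rowvar=False) - (L @ L.T + Psi))))
```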
Factor Analysis
Rotations in the m-dimensional subspace defined by the factors make the solution non-unique: for any orthogonal matrix T, the loadings $L^* = LT$ with factors $F^* = T'F$ reproduce the same covariance, since $L^*L^{*\prime} + \Psi = LTT'L' + \Psi = LL' + \Psi$.
PCA is one particular solution, as the vectors are sequentially selected. The maximum likelihood estimator is another solution.
Factor Analysis
As we said, rotations within the m-dimensional subspace don't change the overall amount of variation explained.
Rotate to make the results more interpretable:
Factor Analysis
Varimax criterion:
Find T such that
$$V = \frac{1}{p}\sum_{j=1}^{m}\left[\sum_{i=1}^{p}\tilde{l}_{ij}^{\,4} - \frac{1}{p}\left(\sum_{i=1}^{p}\tilde{l}_{ij}^{\,2}\right)^{2}\right]$$
is maximized, where $\tilde{l}_{ij}$ are the rotated loadings.
V is proportional to the sum of the variances of the squared loadings. Maximizing V makes the squared loadings as spread out as possible: some are very small, and some are very large.
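A sketch of the standard iterative varimax algorithm in numpy (this particular implementation is an assumption on my part, not taken from the lecture); it repeatedly solves an orthogonal Procrustes problem until the criterion stops increasing:

```python
import numpy as np

def varimax(L, gamma=1.0, max_iter=100, tol=1e-8):
    """Varimax rotation of a loading matrix L (p x m).

    Iteratively finds an orthogonal T that increases the varimax criterion;
    returns the rotated loadings L @ T.
    """
    p, m = L.shape
    T = np.eye(m)
    d_old = 0.0
    for _ in range(max_iter):
        LT = L @ T
        # Target matrix for the orthogonal Procrustes step
        B = L.T @ (LT**3 - (gamma / p) * LT @ np.diag(np.sum(LT**2, axis=0)))
        U, s, Vt = np.linalg.svd(B)
        T = U @ Vt                       # closest orthogonal matrix to B
        d = s.sum()
        if d_old != 0 and d / d_old < 1 + tol:
            break
        d_old = d
    return L @ T
```

Applied to an unrotated loading matrix (e.g. the leading PC loadings), this tends to produce columns with a few large loadings and many near-zero ones, which is what makes the factors easier to interpret.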
Factor Analysis
Orthogonal simple factor rotation:
Rotate the orthogonal factors around the origin until the system is maximally aligned with the separate clusters of variables.
Oblique simple structure rotation:
Allow the factors to become correlated. Each factor is rotated individually to fit a cluster.
MDS
Multidimensional scaling is a dimension reduction procedure that maps the distances between observations to a lower-dimensional space.
Minimize an objective function that measures how well the low-dimensional distances reproduce the original ones, e.g. the stress
$$\sqrt{\sum_{i<j}(D_{ij}-d_{ij})^{2} \Big/ \sum_{i<j}D_{ij}^{2}}$$
D: distance in the original space
d: distance in the reduced-dimension space.
A numerical method is used for the minimization.
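A minimal sketch using scikit-learn's MDS with a precomputed distance matrix (the choice of sklearn here is mine, not the lecture's):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 10))            # 50 observations in 10 dimensions

D = squareform(pdist(X))                 # pairwise distances in the original space
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
Z = mds.fit_transform(D)                 # 2-D configuration minimizing stress

print(Z.shape, mds.stress_)              # (50, 2) and the final stress value
```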
EDR space
Now we start talking about regression. The data are {xi, yi}.
Is dimension reduction on the X matrix alone helpful here?
Possibly, if the dimension reduction preserves the essential structure of Y|X. But this is questionable.
Effective Dimension Reduction: reduce the dimension of X without losing information that is essential for predicting Y.
EDR space
The model: Y is predicted by a set of linear combinations of X,
$$Y = g(\beta_1'X,\ \beta_2'X,\ \ldots,\ \beta_K'X,\ \varepsilon).$$
If g() is known, this is not very different from a generalized linear model.
For dimension reduction purposes, is there a scheme that can work for almost any g(), without knowledge of its actual form?
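As a concrete, made-up instance of this model with K = 2 (the particular g and coefficient vectors below are my own choices for illustration), y depends on the 10-dimensional x only through two linear projections:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 1000, 10
X = rng.normal(size=(n, p))

beta1 = np.zeros(p); beta1[0] = 1.0      # first e.d.r. direction
beta2 = np.zeros(p); beta2[1] = 1.0      # second e.d.r. direction

# y = g(beta1'x, beta2'x, eps) for one particular nonlinear g
y = (X @ beta1) / (0.5 + (X @ beta2 + 1.5) ** 2) + 0.1 * rng.normal(size=n)
```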
EDR space
The general model encompasses many models as special cases: for example, the ordinary linear model $Y = \beta'X + \varepsilon$ corresponds to $K = 1$ with $g(t, \varepsilon) = t + \varepsilon$.
EDR space
Under this general model,
The space B generated by β1, β2, ..., βK is called the e.d.r. space.
Reducing to this subspace causes no loss of information for predicting Y.
Similar to factor analysis, the subspace B is identifiable, but the individual vectors aren't.
Any non-zero vector in the e.d.r. space is called an e.d.r. direction.
EDR space
This equation assumes almost the weakest form, to reflect the hope that a low-dimensional projection of a high-dimensional regressor variable contains most of the information that can be gathered from a sample of modest size.
It doesn't impose any structure on how the projected regressor variables affect the output variable.
Most regression models assume K = 1, plus additional structure on g().
EDR space
The philosophical point of Sliced Inverse Regression: the estimation of the projection directions can be a more important statistical issue than the estimation of the structure of g() itself.
After finding a good e.d.r. space, we can project the data onto this smaller space. Then we are in a better position to identify what should be pursued further: model building, response surface estimation, cluster analysis, heteroscedasticity analysis, variable selection, ...
SIR
Sliced Inverse Regression.
In regular regression, our interest is the conditional density h(Y|X). Most important are E(Y|x) and Var(Y|x).
SIR treats Y as the independent variable and X as the dependent variable:
given Y = y, what values will X take?
This takes us from a p-dimensional problem (subject to the curse of dimensionality) back to one-dimensional curve-fitting problems:
$E(x_i \mid y),\ i = 1, \ldots, p$
SIR
Find the SIR directions by conducting the eigenvalue decomposition of $\hat{V}$, the covariance matrix of the slice means of x weighted by the slice sizes, with respect to $\hat{\Sigma}$, the sample covariance of the $x_i$'s:
$$\hat{V}\,b = \lambda\,\hat{\Sigma}\,b$$
The leading eigenvectors b estimate the e.d.r. directions.
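A compact SIR sketch in numpy following this recipe (slice the range of y, compute the weighted covariance of the slice means, solve the generalized eigenproblem against the sample covariance); the number of slices and the toy example are my own choices:

```python
import numpy as np
from scipy.linalg import eigh

def sir_directions(X, y, n_slices=10, n_directions=2):
    """Sliced Inverse Regression: estimate e.d.r. directions.

    Returns the leading generalized eigenvectors of V (covariance of the
    slice means of x, weighted by slice sizes) with respect to Sigma.
    """
    n, p = X.shape
    Sigma = np.cov(X, rowvar=False)              # sample covariance of x
    xbar = X.mean(axis=0)

    # Slice the range of y into roughly equal-sized slices
    order = np.argsort(y)
    slices = np.array_split(order, n_slices)

    V = np.zeros((p, p))
    for idx in slices:
        m = X[idx].mean(axis=0) - xbar           # centered slice mean
        V += (len(idx) / n) * np.outer(m, m)     # weighted by slice size

    # Generalized eigenproblem V b = lambda Sigma b, largest eigenvalues first
    eigvals, B = eigh(V, Sigma)
    return B[:, ::-1][:, :n_directions], eigvals[::-1][:n_directions]

# Toy example: y depends on x only through its first coordinate
rng = np.random.default_rng(5)
X = rng.normal(size=(2000, 6))
y = X[:, 0] ** 3 + 0.5 * rng.normal(size=2000)
B, lam = sir_directions(X, y, n_directions=1)
print(np.round(B[:, 0] / np.linalg.norm(B[:, 0]), 2))  # ~ +/-[1, 0, 0, 0, 0, 0]
```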
SIR
An example response
surface found by SIR.
SIR and LDA
Reminder: Fisher's linear discriminant analysis seeks a projection direction that maximizes class separation.
When the underlying distributions are Gaussian (with a common covariance), it agrees with the Bayes decision rule. It seeks to maximize the ratio of the two quantities below:
Between-group variance: $a'S_B\,a$
Within-group variance: $a'S_W\,a$
SIR and LDA
The solution is the first eigenvector in the eigenvalue decomposition of $S_W^{-1}S_B$, i.e. of
$$S_B\,a = \lambda\,S_W\,a.$$
If we let the slices be defined by the class labels (one slice per class), LDA agrees with SIR up to a scaling.
Multi-class LDA
Structure-preserving dimension reduction in classification.
Within-class scatter: $S_w = \sum_{i} \sum_{a \in \text{class } i} (a - c_i)(a - c_i)'$
Between-class scatter: $S_b = \sum_{i} n_i (c_i - c)(c_i - c)'$
Mixture scatter: $S_m = \sum_{a} (a - c)(a - c)' = S_w + S_b$
a: observations, $c_i$: class centers, c: overall center
Kim et al., Pattern Recognition 2007, 40:2939
Multi-class LDA
Maximize: $J(G) = \mathrm{trace}\big((G'S_wG)^{-1}(G'S_bG)\big)$
The solutions come from the eigenvalues/eigenvectors of $S_w^{-1}S_b$.
When we have N << p, $S_w$ is singular. Let ...
Kim et al. Pattern Recognition 2007, 40:2939
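A small numpy sketch of multi-class LDA along these lines, computing S_w and S_b and taking eigenvectors of S_w^{-1} S_b; a pseudo-inverse is used here purely as a simple guard against singular S_w, which is my own simplification rather than the remedy used in the cited paper:

```python
import numpy as np

def multiclass_lda(X, labels, n_components=2):
    """Multi-class LDA: leading eigenvectors of pinv(S_w) @ S_b.

    X: (n, p) observations; labels: (n,) class labels.
    Returns the top projection directions as a (p, n_components) matrix.
    """
    c = X.mean(axis=0)                              # overall center
    p = X.shape[1]
    S_w = np.zeros((p, p))
    S_b = np.zeros((p, p))
    for g in np.unique(labels):
        Xg = X[labels == g]
        cg = Xg.mean(axis=0)                        # class center
        S_w += (Xg - cg).T @ (Xg - cg)              # within-class scatter
        S_b += len(Xg) * np.outer(cg - c, cg - c)   # between-class scatter

    # Pseudo-inverse handles a singular S_w (illustration only)
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_w) @ S_b)
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs[:, order[:n_components]].real

# Toy example: three Gaussian classes in 5 dimensions
rng = np.random.default_rng(6)
means = np.array([[0, 0, 0, 0, 0], [3, 0, 0, 0, 0], [0, 3, 0, 0, 0]], float)
X = np.vstack([rng.normal(size=(100, 5)) + m for m in means])
labels = np.repeat([0, 1, 2], 100)
G = multiclass_lda(X, labels)
print(G.shape)   # (5, 2): two directions separating the three classes
```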
Multi-class LDA
Kim et al. Pattern Recognition 2007, 40:2939