LING 696B: Mixture model and its applications in category


Dimensionality reduction, PCA
1
Curse of dimensionality

The higher the dimension, the more data is needed to draw any conclusion
Probability density estimation:
  Continuous: histograms
  Discrete: k-factorial designs
Decision rules:
  Nearest-neighbor and k-nearest neighbor
2
How to reduce dimension?

Assume we know something about the distribution
Parametric approach: assume data follow distributions within a family H
Example: counting histograms for 10-D data needs lots of bins, but knowing it's normal allows us to summarize the data in terms of sufficient statistics
  (Number of bins)^10 vs. (10 + 10*11/2 = 65)
3
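
A quick check of the parameter counts above, as a minimal Python sketch (not part of the original slides; the number of bins per dimension is an assumed, illustrative value):

```python
# Curse of dimensionality: histogram cells vs. Gaussian sufficient statistics
d = 10          # data dimension
bins = 10       # histogram bins per dimension (assumed for illustration)

histogram_cells = bins ** d              # one count per cell of the 10-D grid
gaussian_params = d + d * (d + 1) // 2   # mean vector + symmetric covariance = 10 + 55

print(histogram_cells, gaussian_params)  # 10000000000 vs. 65
```
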
Linear dimension reduction

Normality assumption is crucial for linear methods
Examples:
  Principal Components Analysis (also Latent Semantic Indexing)
  Factor Analysis
  Linear Discriminant Analysis
4
Covariance structure of multivariate Gaussian

2-dimensional example:
  Variance in each dimension
  Correlation between dimensions
No correlations --> diagonal covariance matrix, e.g. Σ = diag(σ1², σ2²)
Special case: Σ = I
  - log likelihood ∝ squared Euclidean distance to the center
5
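
A minimal NumPy/SciPy sketch of the Σ = I special case on made-up numbers (not from the slides): the negative log-likelihood differs from half the squared Euclidean distance to the center only by a constant.

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, -2.0])      # center (hypothetical values)
x = np.array([3.0, 0.5])        # a data point

neg_log_lik = -multivariate_normal(mean=mu, cov=np.eye(2)).logpdf(x)
half_sq_dist = 0.5 * np.sum((x - mu) ** 2)
const = np.log(2 * np.pi)       # normalizing constant for d = 2, Σ = I

print(np.isclose(neg_log_lik, half_sq_dist + const))   # True
```
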
Covariance structure of multivariate Gaussian

Non-zero correlations --> full covariance matrix, Cov(X1, X2) ≠ 0, i.e. Σ has non-zero off-diagonal entries
Nice property of Gaussians: closed under linear transformation
This means we can remove correlation by rotation
6
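
A small NumPy sketch of the closure property on illustrative numbers (not from the slides): if X is Gaussian with covariance Σ, then AX is Gaussian with covariance A Σ Aᵀ.

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[2.0, 1.2],
                  [1.2, 1.0]])   # correlated 2-D Gaussian
X = rng.multivariate_normal(mean=[0, 0], cov=Sigma, size=100_000)

A = np.array([[1.0, -0.5],
              [0.3,  2.0]])      # any linear transformation
Y = X @ A.T

print(np.round(np.cov(Y, rowvar=False), 2))   # empirical covariance of AX
print(np.round(A @ Sigma @ A.T, 2))           # theoretical A Σ Aᵀ -- they agree closely
```
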
Covariance structure of multivariate Gaussian

Rotation matrix: R = (w1, w2), where w1, w2 are two unit vectors perpendicular to each other
  (figure: rotation by 90 degrees; rotation by 45 degrees)
7
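
A minimal NumPy sketch of such rotation matrices (not from the slides): the columns w1, w2 are unit length and perpendicular to each other.

```python
import numpy as np

def rotation(theta_deg):
    """2-D rotation matrix R = (w1, w2) with orthonormal columns."""
    t = np.deg2rad(theta_deg)
    return np.array([[np.cos(t), -np.sin(t)],
                     [np.sin(t),  np.cos(t)]])

R90, R45 = rotation(90), rotation(45)
print(np.round(R90.T @ R90, 6))       # identity: the columns are orthonormal
print(R45 @ np.array([1.0, 0.0]))     # the x-axis rotated by 45 degrees
```
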
Covariance structure of multivariate Gaussian

Matrix diagonalization: any 2x2 covariance matrix A can be written as A = R Λ Rᵀ, where R is a rotation matrix and Λ is diagonal
Interpretation: we can always find a rotation to make the covariance look “nice” -- no correlation between dimensions
This IS PCA when applied to N dimensions
8
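
A minimal NumPy sketch of the diagonalization A = R Λ Rᵀ on a made-up 2x2 covariance matrix (not from the slides):

```python
import numpy as np

A = np.array([[2.0, 1.2],
              [1.2, 1.0]])       # a covariance matrix with correlation

eigvals, R = np.linalg.eigh(A)   # columns of R are orthonormal eigenvectors

print(np.round(R @ np.diag(eigvals) @ R.T, 6))   # reconstructs A
print(np.round(R.T @ A @ R, 6))                  # diagonal: no correlation in rotated coordinates
```
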
Computation of PCA

The new coordinates uniquely identify the rotation
  (figure: 3-D axes w1, w2, w3)
  3-D: 3 coordinates
In computation, it’s easier to identify one coordinate at a time.
Step 1: centering the data
  X <-- X - mean(X)
  Want to rotate around the center
9
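
Step 1 as a minimal NumPy sketch on simulated data (illustrative numbers, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([5.0, -3.0],
                            [[2.0, 1.2], [1.2, 1.0]], size=500)

# Step 1: center the data so the rotation is around the center
X = X - X.mean(axis=0)
print(np.round(X.mean(axis=0), 10))   # ~ (0, 0)
```
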
Computation of PCA

Step 2: finding a direction of projection that has the maximal variance
Linear projection of X onto vector w:
  Proj_w(X) = X_(N x d) * w_(d x 1)   (X centered)
  (figure: data points x projected onto the direction w)
Now measure the stretch
  This is the sample variance = Var(X*w)
10
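
Step 2 as a minimal NumPy sketch on simulated data (not from the slides): project the centered X onto a direction w and measure the stretch.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[2.0, 1.2], [1.2, 1.0]], size=10_000)
X = X - X.mean(axis=0)            # centered, shape (N, d)

w = np.array([1.0, 1.0])
w = w / np.linalg.norm(w)         # a unit-length direction

proj = X @ w                      # Proj_w(X): one number per data point
print(proj.var())                 # Var(X*w): the "stretch" along w
```
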
Computation of PCA

Step 3: formulate this as a constrained optimization problem
  Objective of optimization: Var(X*w)
  Need a constraint on w (otherwise the variance can explode): only consider the direction, not the scaling
So formally:
  argmax_{||w||=1} Var(X*w)
11
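
A brute-force sketch of the constrained problem in 2-D (simulated data, not from the slides): search over unit vectors w on the unit circle for the one with maximal Var(X*w).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[2.0, 1.2], [1.2, 1.0]], size=5_000)
X = X - X.mean(axis=0)

# Unit vectors w = (cos t, sin t) satisfy ||w|| = 1 by construction
angles = np.linspace(0, np.pi, 360)
ws = np.column_stack([np.cos(angles), np.sin(angles)])
variances = (X @ ws.T).var(axis=0)

w_best = ws[np.argmax(variances)]
print(w_best, variances.max())    # argmax_{||w||=1} Var(X*w), found numerically
```
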
Computation of PCA

Recall the single variable case: Var(a*X) = a² Var(X)
Apply to the multivariate case using matrix notation:
  Var(X*w) = wᵀ (XᵀX / N) w = wᵀ Cov(X) w   (X centered)
Cov(X) is a d x d matrix:
  Symmetric (easy)
  Positive semi-definite: for any y, yᵀ Cov(X) y ≥ 0
12
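
A quick numerical check of Var(X*w) = wᵀ Cov(X) w on simulated data (not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[2.0, 1.2], [1.2, 1.0]], size=10_000)
X = X - X.mean(axis=0)

w = np.array([0.6, 0.8])              # any unit vector
C = np.cov(X, rowvar=False)           # d x d sample covariance

print((X @ w).var(ddof=1))            # Var(X*w)
print(w @ C @ w)                      # wᵀ Cov(X) w -- the same number
```
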
Computation of PCA

Going back to the optimization problem:
  max_{||w||=1} Var(X*w)
  = max_{||w||=1} wᵀ Cov(X) w
The maximizer w1 is the eigenvector of Cov(X) with the largest eigenvalue (and the maximum value is that eigenvalue)
  The first Principal Component!
(see demo)
13
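
A minimal NumPy sketch of the eigenvector solution on simulated data (not from the slides); it can be checked against the brute-force search above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[2.0, 1.2], [1.2, 1.0]], size=10_000)
X = X - X.mean(axis=0)

C = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C)        # eigenvalues in ascending order

w1 = eigvecs[:, -1]                         # eigenvector of the largest eigenvalue
print(w1)                                   # the first principal component
print(eigvals[-1], (X @ w1).var(ddof=1))    # max variance = largest eigenvalue
```
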
More principal components

We keep looking among all the projections perpendicular to w1
Formally:
  max_{||w||=1, w ⊥ w1} wᵀ Cov(X) w
The maximizer w2 turns out to be the eigenvector corresponding to the 2nd largest eigenvalue
(see demo)
New coordinates!
14
Rotation

Can keep going until we find all projections/coordinates w1, w2, …, wd
Putting them together, we have a big matrix W = (w1, w2, …, wd)
W is called an orthogonal matrix
  This corresponds to a rotation (sometimes plus reflection) of the pancake
  After the rotation, the pancake has no correlation between dimensions (see demo)
15
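
Putting the pieces together, a minimal NumPy sketch of the full rotation W on simulated 3-D data (not from the slides): W is orthogonal, and the rotated data have a diagonal covariance.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0, 0],
                            [[3.0, 1.0, 0.5],
                             [1.0, 2.0, 0.3],
                             [0.5, 0.3, 0.2]], size=20_000)
X = X - X.mean(axis=0)

eigvals, W = np.linalg.eigh(np.cov(X, rowvar=False))
W = W[:, ::-1]                                  # columns w1, ..., wd by decreasing variance

print(np.allclose(W.T @ W, np.eye(3)))          # W is an orthogonal matrix
Y = X @ W                                       # the rotated ("new") coordinates
print(np.round(np.cov(Y, rowvar=False), 3))     # ~diagonal: no correlation left
```
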
When does dimension reduction occur?

Decomposition of the covariance matrix: Cov(X) = λ1 w1 w1ᵀ + λ2 w2 w2ᵀ + … + λd wd wdᵀ
If only the first few eigenvalues are significant, we can ignore the rest, e.g. keep only the 2-D coordinates of X
16
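
A minimal sketch of the reduction step on simulated pancake-like data (not from the slides): keep only the components with significant eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
# "Pancake" data: the third direction carries almost no variance
X = rng.multivariate_normal([0, 0, 0], np.diag([5.0, 2.0, 0.01]), size=20_000)
X = X - X.mean(axis=0)

eigvals, W = np.linalg.eigh(np.cov(X, rowvar=False))
eigvals, W = eigvals[::-1], W[:, ::-1]          # sort by decreasing eigenvalue

k = 2
Z = X @ W[:, :k]                  # the 2-D coordinates of X
X_approx = Z @ W[:, :k].T         # back-projection into the original 3-D space

print(np.round(eigvals, 3))                 # the last eigenvalue is tiny
print(np.mean((X - X_approx) ** 2))         # small reconstruction error
```
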
Measuring “degree” of reduction

(figure: pancake data in 3D)
17
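
The slide itself shows only the pancake figure; a common way to quantify the “degree” of reduction is the fraction of total variance captured by the kept components. A minimal sketch under that assumption, with made-up eigenvalues:

```python
import numpy as np

eigvals = np.array([5.0, 2.0, 0.01])    # hypothetical eigenvalues of the pancake's covariance

# Fraction of total variance explained by the first k components
explained = np.cumsum(eigvals) / eigvals.sum()
print(np.round(explained, 4))           # the first 2 components already explain ~99.9%
```
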
An application of PCA

Latent Semantic Indexing in document retrieval
  (figure: documents plotted by word counts, e.g. #market, #stock, #bonds)
  Documents as vectors of word counts
  Try to extract some “features” by linear combination of word counts
  The underlying geometry is unclear (mean? distance?)
  The meaning of the principal components is unclear (rotation?)
18
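
A toy sketch of the LSI idea (a made-up 4-document, 4-word count matrix, not from the slides): PCA/SVD of the word counts yields a few latent “features” per document.

```python
import numpy as np

# Hypothetical term-document counts: rows = documents,
# columns = counts of the words ("market", "stock", "bonds", "music")
X = np.array([[5, 4, 3, 0],
              [4, 5, 2, 0],
              [0, 1, 0, 6],
              [1, 0, 0, 5]], dtype=float)

Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2
doc_features = Xc @ Vt[:k].T      # each document as k latent "features"
print(np.round(doc_features, 2))  # finance-like and music-like documents separate on the first feature
```
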
Summary of PCA:

PCA looks for:
  A sequence of linear, orthogonal projections that reveal interesting structure in data (rotation)
Defining “interesting”:
  Maximal variance under each projection
  Uncorrelated structure after projection
19
Departure from PCA

3 directions of divergence:
  Other definitions of “interesting”?
    Linear Discriminant Analysis
    Independent Component Analysis
  Other methods of projection?
    Linear but not orthogonal: sparse coding
    Implicit, non-linear mapping
  Turning PCA into a generative model
    Factor Analysis
20
Re-thinking “interestingness”

It all depends on what you want
Linear Discriminant Analysis (LDA): supervised learning
Example: separating 2 classes
  (figure: the direction of maximal separation vs. the direction of maximal variance)
21
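
A minimal sketch of the contrast on simulated 2-class data, using Fisher's 2-class LDA direction (not from the slides): LDA picks the direction of maximal class separation, PCA the direction of maximal variance.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two classes that differ along the low-variance axis (hypothetical numbers)
cov = np.array([[4.0, 0.0], [0.0, 0.3]])
X1 = rng.multivariate_normal([0.0,  1.0], cov, size=500)
X2 = rng.multivariate_normal([0.0, -1.0], cov, size=500)

# Fisher's LDA direction for 2 classes: w ∝ Sw^-1 (mean1 - mean2)
Sw = np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False)   # within-class scatter
w_lda = np.linalg.solve(Sw, X1.mean(axis=0) - X2.mean(axis=0))
w_lda = w_lda / np.linalg.norm(w_lda)

# PCA direction: maximal variance of the pooled data, labels ignored
X = np.vstack([X1, X2])
X = X - X.mean(axis=0)
w_pca = np.linalg.eigh(np.cov(X, rowvar=False))[1][:, -1]

print(np.round(w_lda, 3))   # ~ (0, 1): separates the classes
print(np.round(w_pca, 3))   # ~ (±1, 0): direction of maximal variance
```
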
Re-thinking “interestingness”

Most high-dimensional data look like a Gaussian under linear projections
Maybe non-Gaussian is more interesting
  Independent Component Analysis
  Projection pursuit
Example: ICA projection of 2-class data
  Most unlike a Gaussian (e.g. maximize kurtosis)
22
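
A projection-pursuit-style sketch of “most unlike Gaussian” on simulated 2-class data (not from the slides): search over unit directions for the one with the largest |excess kurtosis|.

```python
import numpy as np

def excess_kurtosis(z):
    z = (z - z.mean()) / z.std()
    return np.mean(z ** 4) - 3.0          # 0 for a Gaussian

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=5_000)
# One Gaussian direction plus one bimodal (2-class, very non-Gaussian) direction
X = np.column_stack([rng.normal(size=5_000),
                     4 * labels + rng.normal(size=5_000)])
X = X - X.mean(axis=0)

angles = np.linspace(0, np.pi, 360)
ws = np.column_stack([np.cos(angles), np.sin(angles)])
scores = [abs(excess_kurtosis(X @ w)) for w in ws]

print(np.round(ws[np.argmax(scores)], 3))   # ~ (0, 1): the class-separating, most non-Gaussian direction
```
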
The “efficient coding” perspective

Sparse coding:
  Projections do not have to be orthogonal
  There can be more basis vectors than the dimension of the space
  Representation using an over-complete basis
  (figure: basis expansion of a point x onto vectors w1, w2, w3, w4)
  p << d: compact coding (PCA)
  p > d: sparse coding
23
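
A minimal sparse-coding sketch on made-up 2-D data with a 4-vector over-complete basis (not from the slides), using plain iterative soft-thresholding for an L1-penalized reconstruction:

```python
import numpy as np

# Over-complete basis for 2-D data: p = 4 basis vectors, d = 2
W = np.array([[1, 0], [0, 1], [1, 1], [1, -1]], dtype=float).T   # shape (d, p)
W = W / np.linalg.norm(W, axis=0)                                # unit-length columns w1..w4
x = np.array([2.0, 2.0])

# Sparse code: minimize 0.5*||x - W a||^2 + lam*||a||_1 by iterative soft-thresholding
lam, step = 0.1, 0.4        # step < 1 / (largest eigenvalue of W^T W)
a = np.zeros(4)
for _ in range(500):
    a = a - step * (W.T @ (W @ a - x))                          # gradient step on the quadratic
    a = np.sign(a) * np.maximum(np.abs(a) - step * lam, 0.0)    # soft-threshold (L1 prox)

print(np.round(a, 3))       # only one coefficient is clearly non-zero: a sparse code for x
print(np.round(W @ a, 3))   # reconstruction, close to x
```
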
“Interesting” can be expensive

Often faces difficult optimization problems:
  Need many constraints
  Lots of parameter sharing
  Expensive to compute, no longer an eigenvalue problem
24
PCA’s relatives: Factor Analysis

PCA is not a generative model: reconstruction error is not likelihood
  Sensitive to outliers
  Hard to build into bigger models
Factor Analysis: adding a measurement noise to account for variability
  x = Λ z + ε
    x: observation
    Λ: loading matrix (scaled PC’s)
    z: factors, spherical Gaussian N(0, I)
    ε: measurement noise, N(0, R), R diagonal
25
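
A minimal NumPy sketch of sampling from this generative model with made-up loadings and noise levels (not from the slides): the implied covariance of x is Λ Λᵀ + R.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, N = 3, 2, 10_000              # observed dimension, number of factors, sample size

Lam = rng.normal(size=(d, k))       # loading matrix (hypothetical values)
R = np.diag([0.3, 0.1, 0.2])        # diagonal measurement-noise covariance

z = rng.normal(size=(N, k))                              # factors ~ N(0, I)
eps = rng.multivariate_normal(np.zeros(d), R, size=N)    # measurement noise ~ N(0, R)
x = z @ Lam.T + eps                                      # observations: x = Λ z + ε

print(np.round(np.cov(x, rowvar=False), 2))   # empirical covariance
print(np.round(Lam @ Lam.T + R, 2))           # model covariance Λ Λᵀ + R -- close match
```
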
PCA’s relatives: Factor Analysis

Generative view: sphere --> stretch and rotate --> add noise
Learning: a version of the EM algorithm
26
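
For fitting, scikit-learn's FactorAnalysis estimates the loadings and the diagonal noise by an iterative maximum-likelihood procedure; a minimal usage sketch on data simulated as above (not from the slides; since the loadings are only identified up to a rotation, Λ Λᵀ is compared instead of Λ):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
Lam_true = np.array([[2.0, 0.0], [1.0, 1.0], [0.0, 0.5]])   # hypothetical loadings
R_true = np.array([0.3, 0.1, 0.2])                          # diagonal noise variances

z = rng.normal(size=(10_000, 2))
X = z @ Lam_true.T + rng.normal(size=(10_000, 3)) * np.sqrt(R_true)

fa = FactorAnalysis(n_components=2).fit(X)
print(np.round(fa.noise_variance_, 2))                    # ~ R_true
print(np.round(fa.components_.T @ fa.components_, 2))     # estimated Λ Λᵀ
print(Lam_true @ Lam_true.T)                              # true Λ Λᵀ -- roughly matches
```
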