Principal Component Analysis (PCA)


Principal Component Analysis (PCA)
• PCA is an unsupervised signal processing method. There
are two views of what PCA does:
• it finds projections of the data with maximum
variance;
• it does compression of the data with minimum mean
squared error.
• PCA can be achieved by certain types of neural
networks. Could the brain do something similar?
• Whitening is a related technique that transforms the
random vector such that its components are
uncorrelated and have unit variance
Thanks to Tim Marks for some slides!
Jochen Triesch, UC San Diego, http://cogsci.ucsd.edu/~triesch
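A minimal numerical sketch of the whitening idea mentioned above, assuming PCA-style whitening via the eigendecomposition of the covariance matrix (the toy data, seed, and variable names are illustrative, not from the slides):

```python
# Minimal PCA-whitening sketch (illustrative; toy data, not from the slides).
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[2.0, -1.0],
                            cov=[[3.0, 1.2], [1.2, 0.8]], size=5000)  # samples in rows

Z = X - X.mean(axis=0)                  # zero-mean data
C = np.cov(Z, rowvar=False)             # 2x2 covariance matrix
eigvals, V = np.linalg.eigh(C)          # eigenvectors in the columns of V

# Whitening: rotate onto the eigenvectors, then rescale each axis by 1/sqrt(eigenvalue)
W = Z @ V / np.sqrt(eigvals)            # whitened data
print(np.cov(W, rowvar=False))          # approximately the identity matrix
```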
Maximum Variance Projection
• Consider a random vector x with arbitrary joint pdf
• in PCA we assume that x has zero mean
• we can always achieve this by constructing x’=x-μ
• Scatter plot and projection y onto a “weight” vector w
[Figure: scatter plot of the data x and its projection y onto the weight vector w]
y = w^T x = \sum_{i=1}^{n} w_i x_i = \|w\| \, \|x\| \cos\theta
• Motivating question: y is a scalar random variable that
depends on w; for which w is the variance of y maximal?
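To make the motivating question concrete, here is a small sketch (toy 2-D Gaussian data, illustrative only) that computes the variance of the projection y = w^T x for a few choices of w:

```python
# Projection of zero-mean data onto a unit vector w; the variance depends on w
# (illustrative sketch with toy 2-D data, not from the slides).
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0.0, 0.0], [[4.0, 1.5], [1.5, 1.0]], size=10000)

def proj_var(w, X):
    w = np.asarray(w, float)
    w = w / np.linalg.norm(w)      # restrict w to unit length
    y = X @ w                      # y = w^T x for every sample
    return y.var()

print(proj_var([1, 0], X))   # variance along the x-axis
print(proj_var([0, 1], X))   # variance along the y-axis
print(proj_var([2, 1], X))   # an oblique direction: larger than either axis
```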
Example:
• consider projecting x onto the red w_r and the blue w_b, which happen to lie along the coordinate axes in this case:
y_r = w_r^T x = \sum_{i=1}^{n} w_{ri} x_i
y_b = w_b^T x = \sum_{i=1}^{n} w_{bi} x_i
• clearly, the projection
onto wb has the higher
variance, but is there
a w for which it’s even
higher?
• First, what about the mean (expected value) of y?
y = w^T x = \sum_{i=1}^{n} w_i x_i
• so we have:
E(y) = E\left(\sum_{i=1}^{n} w_i x_i\right) = \sum_{i=1}^{n} E(w_i x_i) = \sum_{i=1}^{n} w_i E(x_i) = 0
since x has zero mean, meaning that E(x_i) = 0 for all i.
• This implies that the variance of y is just:
\mathrm{var}(y) = E\left[\left(y - E[y]\right)^2\right] = E(y^2) = E\left[(w^T x)^2\right]
• Note that this gets arbitrarily big if w gets longer and longer, so from now on we restrict w to have length 1, i.e. (w^T w)^{1/2} = 1
• continuing with some linear algebra tricks:
\mathrm{var}(y) = E(y^2) = E\left[(w^T x)^2\right] = E\left[w^T x \, x^T w\right] = w^T E\left[x x^T\right] w = w^T C_x w
where Cx is the correlation matrix (or covariance
matrix) of x.
• We are looking for the vector w that maximizes this
expression
• Linear algebra tells us that the desired vector w is the
eigenvector of Cx corresponding to the largest
eigenvalue
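A quick numerical check of this claim, assuming toy 2-D Gaussian data (illustrative only): the variance w^T C_x w of a random unit direction never exceeds that of the top eigenvector of C_x:

```python
# var(y) = w^T C_x w, maximized over unit vectors by the top eigenvector of C_x
# (illustrative sketch; random directions are compared against the eigenvector).
import numpy as np

rng = np.random.default_rng(2)
X = rng.multivariate_normal([0.0, 0.0], [[4.0, 1.5], [1.5, 1.0]], size=20000)
C = np.cov(X, rowvar=False)

eigvals, eigvecs = np.linalg.eigh(C)
w_best = eigvecs[:, -1]                      # eigenvector with the largest eigenvalue

for _ in range(3):                           # a few random unit directions
    w = rng.normal(size=2)
    w /= np.linalg.norm(w)
    print("random direction:", w @ C @ w)

print("top eigenvector:  ", w_best @ C @ w_best)   # ~ largest eigenvalue, never smaller
```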
Reminder: Eigenvectors and Eigenvalues
• let A be an n×n (square) matrix. The n-component vector v is an eigenvector of A with the eigenvalue λ if:
A v = \lambda v
i.e. multiplying v by A only changes the length by a
factor λ but not the orientation.
Example:
• for the scaled identity matrix, every vector is an eigenvector with the identical eigenvalue s:
\begin{pmatrix} s & 0 \\ 0 & s \end{pmatrix} v = s v
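In code, the reminder looks like this (a hypothetical 2×2 example using NumPy's eigendecomposition; the matrices are illustrative, not from the slides):

```python
# Reminder in code: A v = lambda v (hypothetical 2x2 examples, not from the slides).
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 2.0]])          # scaled identity, s = 2
v = np.array([0.3, -1.7])           # any vector
print(A @ v, 2.0 * v)               # identical: every vector is an eigenvector with eigenvalue 2

B = np.array([[2.0, 1.0],
              [1.0, 2.0]])          # a symmetric (covariance-like) matrix
lam, V = np.linalg.eigh(B)          # real eigenvalues, orthonormal eigenvectors
print(np.allclose(B @ V[:, 0], lam[0] * V[:, 0]))   # True
print(V.T @ V)                      # ~ identity: the eigenvectors are orthonormal
```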
Notes:
• if v is an eigenvector of a matrix A, then any scaled version of v is an eigenvector of A, too
• if A is a correlation (or covariance) matrix, then all the
eigenvalues are real and nonnegative; the eigenvectors
can be chosen to be mutually orthonormal:
v_i^T v_k = \delta_{ik} = \begin{cases} 1 & \text{if } i = k \\ 0 & \text{if } i \neq k \end{cases}
(δik is the so-called Kronecker delta)
• also, if A is a correlation (or covariance) matrix, then A
is positive semidefinite:
v^T A v \geq 0 \quad \forall \, v \neq 0
The PCA Solution
• The desired direction of maximum variance is the
direction of the unit-length eigenvector v1 of Cx with the
largest eigenvalue
• Next, we want to find the direction whose projection
has the maximum variance among all directions
perpendicular to the first eigenvector v1. The solution to
this is the unit-length eigenvector v2 of Cx with the 2nd
highest eigenvalue.
• In general, the k-th direction whose projection has the
maximum variance among all directions perpendicular to
the first k-1 directions is given by the k-th unit-length
eigenvector of Cx. The eigenvectors are ordered such
that:
1  2  n1  n
• We say the k-th principal component of x is given by:
y_k = v_k^T x
What are the variances of the PCA projections?
• From above we have:
\mathrm{var}(y) = E\left[(w^T x)^2\right] = w^T C_x w
• Now utilize that w is the k-th eigenvector of Cx:
\mathrm{var}(y_k) = v_k^T C_x v_k = v_k^T (\lambda_k v_k) = \lambda_k \, v_k^T v_k = \lambda_k
where in the last step we have exploited that vk has unit
length. Thus, in words, the variance of the k-th principal
component is simply the eigenvalue of the k-th
eigenvector of the correlation/covariance matrix.
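A numerical sketch of this result, assuming toy 3-D Gaussian data (illustrative only): the sample variances of the principal components match the eigenvalues of the covariance matrix:

```python
# Check numerically that the variance of the k-th principal component equals
# the k-th eigenvalue (illustrative 3-D toy data, not from the slides).
import numpy as np

rng = np.random.default_rng(3)
C_true = np.array([[5.0, 1.0, 0.5],
                   [1.0, 2.0, 0.3],
                   [0.5, 0.3, 1.0]])
X = rng.multivariate_normal(np.zeros(3), C_true, size=50000)

C = np.cov(X, rowvar=False)
eigvals, V = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]        # sort so that lambda_1 >= lambda_2 >= ...
eigvals, V = eigvals[order], V[:, order]

Y = X @ V                                # y_k = v_k^T x for every sample
print(Y.var(axis=0, ddof=1))             # sample variances of the principal components
print(eigvals)                           # essentially the same numbers: var(y_k) = lambda_k
```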
To summarize:
• The eigenvectors v1, v2, …, vn of the covariance matrix have
corresponding eigenvalues λ1 ≥ λ2 ≥ … ≥ λn. They can be found
with standard algorithms.
• λ1 is the variance of the projection in the v1 direction, λ2 is
the variance of the projection in the v2 direction, and so on.
• The largest eigenvalue λ1 corresponds to the principal
component with the greatest variance, the next largest
eigenvalue corresponds to the principal component with the
next greatest variance, etc.
• So, which eigenvector (green or red) corresponds to the
smaller eigenvalue in this example?
PCA for the multivariate Gaussian
• We can think of the joint pdf of the multivariate Gaussian in terms of the (hyper-) ellipsoidal (hyper-) surfaces of constant values of the pdf.
• In this setting, principal component analysis (PCA) finds the directions of the axes of the ellipsoid.
• There are two ways to think about what PCA does next:
• Projects every point perpendicularly onto the axes of
the ellipsoid.
• Rotates the ellipsoid so its axes are parallel to the
coordinate axes, and translates the ellipsoid so its
center is at the origin.
Two views of what PCA does
a) projects every point x onto the axes of
the ellipsoid to give projection c.
b) rotates space so that the ellipsoid lines up
with the coordinate axes, and translates
space so the ellipsoid’s center is at the
origin.
[Figure: a point with original coordinates (x, y) maps to new coordinates (c_1, c_2)]
Transformation matrix for PCA
• Let V be the matrix whose columns are the eigenvectors
of the covariance matrix, S.
• Since the eigenvectors v_i are all normalized to have length 1 and are mutually orthogonal, the matrix V is orthonormal (V V^T = I), i.e. it is a rotation matrix
• the inverse rotation transformation is given by the transpose V^T of matrix V
V = \begin{pmatrix} v_1 & v_2 & \cdots & v_n \end{pmatrix}
\qquad
V^T = \begin{pmatrix} v_1^T \\ v_2^T \\ \vdots \\ v_n^T \end{pmatrix}
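A small check of the orthonormality claim, assuming toy 2-D data (illustrative only): V V^T = I, so the inverse of V is simply its transpose:

```python
# The eigenvector matrix V is orthonormal, so V^T undoes the rotation V
# (illustrative sketch with toy 2-D data, not from the slides).
import numpy as np

rng = np.random.default_rng(4)
X = rng.multivariate_normal(np.zeros(2), [[4.0, 1.5], [1.5, 1.0]], size=10000)
S = np.cov(X, rowvar=False)               # covariance matrix
_, V = np.linalg.eigh(S)                  # columns of V are the eigenvectors

print(np.allclose(V @ V.T, np.eye(2)))    # True: V V^T = I
print(np.allclose(V.T, np.linalg.inv(V))) # True: the inverse rotation is V^T
```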
Transformation matrix for PCA
• PCA transforms the point x (original coordinates) into the point c (new coordinates) by subtracting the mean from x (z = x − m) and multiplying the result by V^T
V^T z = \begin{pmatrix} v_1^T \\ v_2^T \\ \vdots \\ v_n^T \end{pmatrix} z = \begin{pmatrix} v_1 \cdot z \\ v_2 \cdot z \\ \vdots \\ v_n \cdot z \end{pmatrix} = c
• this is a rotation, because V^T is an orthonormal matrix
• this is a projection of z onto the PC axes, because v_1 · z is the projection of z onto the PC1 axis, v_2 · z is the projection onto the PC2 axis, etc.
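A sketch of this transform for a single point, assuming toy 2-D data (illustrative only); each entry of c is the dot product v_i · z:

```python
# Forward PCA transform of one point: subtract the mean, then multiply by V^T
# (illustrative sketch with toy data, not from the slides).
import numpy as np

rng = np.random.default_rng(5)
X = rng.multivariate_normal([3.0, -2.0], [[4.0, 1.5], [1.5, 1.0]], size=10000)
m = X.mean(axis=0)
_, V = np.linalg.eigh(np.cov(X, rowvar=False))

x = X[0]                     # one data point in original coordinates
z = x - m                    # mean-subtracted point
c = V.T @ z                  # new coordinates: projections onto the PC axes
print(c)
print([V[:, i] @ z for i in range(2)])   # the same numbers, component by component
```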
• PCA expresses the mean-subtracted point, z = x – m, as a
weighted sum of the eigenvectors vi . To see this, start
with:
V^T z = \begin{pmatrix} v_1^T \\ v_2^T \\ \vdots \\ v_n^T \end{pmatrix} z = \begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_n \end{pmatrix} = c
• multiply both sides of the equation from the left by V:
V V^T z = V c    (exploit that V is orthonormal)
z = V c    (i.e. we can easily reconstruct z from c)
z = V c = \begin{pmatrix} v_1 & v_2 & \cdots & v_n \end{pmatrix} \begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_n \end{pmatrix} = c_1 v_1 + c_2 v_2 + \dots + c_n v_n
PCA for data compression
What if you wanted to transmit someone’s height and weight, but you could only give a single number?
• Could give only height, x (leaving the uncertainty that remains when height is known)
• Could give only weight, y (leaving the uncertainty that remains when weight is known)
• Could give only c_1, the value of the first PC (leaving the uncertainty that remains when the first PC is known)
• Giving the first PC minimizes the squared error of the result.
[Figure: scatter plot of weight (y-axis) against height (x-axis)]
To compress n-dimensional data into k dimensions, order the
principal components in order of largest-to-smallest
eigenvalue, and only save the first k components.
• Equivalent view: To compress n-dimensional data into k<n
dimensions, use only the first k eigenvectors.
Using all n eigenvectors:
V^T z = \begin{pmatrix} v_1^T \\ v_2^T \\ \vdots \\ v_k^T \\ \vdots \\ v_n^T \end{pmatrix} z = \begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_k \\ \vdots \\ c_n \end{pmatrix}
Keeping only the first k eigenvectors:
\begin{pmatrix} v_1^T \\ v_2^T \\ \vdots \\ v_k^T \end{pmatrix} z = \begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_k \end{pmatrix}
• PCA approximates the mean-subtracted point, z = x – m, as
a weighted sum of only the first k eigenvectors:
z \approx \begin{pmatrix} v_1 & v_2 & \cdots & v_k \end{pmatrix} \begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_k \end{pmatrix} = c_1 v_1 + c_2 v_2 + \dots + c_k v_k
• PCA transforms the point x into the point c by subtracting the mean (z = x − m) and multiplying the result by V^T
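A compact sketch of compression to k dimensions, assuming toy 5-dimensional data (illustrative only): keep only the first k principal components, then reconstruct and measure the mean squared error:

```python
# Compress n-dimensional data to k dimensions by keeping only the first k
# principal components, then reconstruct (illustrative toy data, n=5, k=2).
import numpy as np

rng = np.random.default_rng(6)
n, k = 5, 2
B = rng.normal(size=(n, n))
X = rng.multivariate_normal(np.zeros(n), B @ B.T, size=20000)

m = X.mean(axis=0)
Z = X - m
eigvals, V = np.linalg.eigh(np.cov(Z, rowvar=False))
order = np.argsort(eigvals)[::-1]
V = V[:, order]

Vk = V[:, :k]                      # keep only the first k eigenvectors
coeffs = Z @ Vk                    # k numbers per sample instead of n
Z_hat = coeffs @ Vk.T              # reconstruction z_hat = c_1 v_1 + ... + c_k v_k
print(np.mean((Z - Z_hat) ** 2))   # mean squared error; PCA minimizes this for a given k
```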
PCA vs. Discrimination
• variance maximization and compression are different from discrimination!
• consider data stemming from two classes:
[Figure: scatter plot of data points from two classes]
• which direction has maximum variance, and which discriminates between the two clusters?
Application: PCA on face images
• 120 × 100 pixel grayscale 2D images of faces. Each image is considered to be a single point in face space (a single sample from a random vector):
x = (x_1, x_2, \dots, x_n)^T, the image pixels stacked into a tall column vector
• How many dimensions is the vector x ?
n = 120 × 100 = 12000
• Each dimension (each pixel) can take values between 0
and 255 (8 bit grey scale images).
• You cannot easily visualize points in this 12000-dimensional space!
The original images (12 out of 97)
The mean face
• Let x1, x2, …, xp be the p sample images (n-dimensional).
• Find the mean image, m, and subtract it from each
image: zi = xi – m
• Let A be the matrix whose columns are the mean-subtracted sample images:
A = \begin{pmatrix} z_1 & z_2 & \cdots & z_p \end{pmatrix} \qquad (n \times p \text{ matrix})
• Estimate the covariance matrix:
\Sigma = \mathrm{Cov}(A) = \frac{1}{p-1} A A^T
What are the dimensions of Σ ?
The transpose trick
The eigenvectors of Σ are the principal component directions.
Problem:
• Σ is an n×n (12000×12000) matrix. It’s too big for us to
find its eigenvectors numerically. In addition, only the
first p eigenvalues are useful; the rest will all be zero,
because we only have p samples.
Solution: Use the following trick.
• Eigenvectors of Σ are eigenvectors of A A^T; but instead of the eigenvectors of A A^T, the eigenvectors of A^T A are easy to find
What are the dimensions of A^T A? It’s a p×p matrix: it’s easy to find its eigenvectors since p is relatively small.
• We want the eigenvectors of A A^T, but instead we calculated the eigenvectors of A^T A. Now what?
• Let v_i be an eigenvector of A^T A, whose eigenvalue is λ_i:
(A^T A) v_i = \lambda_i v_i
A (A^T A v_i) = A (\lambda_i v_i)
A A^T (A v_i) = \lambda_i (A v_i)
• Thus if v_i is an eigenvector of A^T A, then A v_i is an eigenvector of A A^T, with the same eigenvalue.
A v_1, A v_2, …, A v_p are the eigenvectors of Σ.
u_1 = A v_1, u_2 = A v_2, …, u_p = A v_p are called eigenfaces.
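A sketch of the transpose trick, with random stand-in data in place of real face images (the sizes and variable names are illustrative only):

```python
# The transpose trick: get eigenvectors of the large n x n matrix A A^T from the
# small p x p matrix A^T A (illustrative sizes: n=1000 "pixels", p=20 "images").
import numpy as np

rng = np.random.default_rng(7)
n, p = 1000, 20
A = rng.normal(size=(n, p))            # stands in for the mean-subtracted image matrix

small = A.T @ A                        # p x p, cheap to diagonalize
lam, Vs = np.linalg.eigh(small)

U = A @ Vs                             # A v_i are eigenvectors of A A^T ("eigenfaces")
U /= np.linalg.norm(U, axis=0)         # normalize them to unit length

# Check the eigenvector property for the largest eigenvalue:
big = A @ A.T                          # n x n (built here only to verify the claim)
print(np.allclose(big @ U[:, -1], lam[-1] * U[:, -1]))   # True
```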
Eigenfaces
• Eigenfaces (the principal component directions of face
space) provide a low-dimensional representation of any
face, which can be used for:
• Face image compression and reconstruction
• Face recognition
• Facial expression recognition
The first 8 eigenfaces
PCA for low-dimensional face
representation
• To approximate a face using only k dimensions, order the
eigenfaces in order of largest-to-smallest eigenvalue,
and only use the first k eigenfaces.
U^T z = \begin{pmatrix} u_1^T \\ u_2^T \\ \vdots \\ u_k^T \end{pmatrix} z = \begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_k \end{pmatrix} = c
• PCA approximates a mean-subtracted face, z = x – m, as
a weighted sum of the first k eigenfaces:
z \approx U c = \begin{pmatrix} u_1 & u_2 & \cdots & u_k \end{pmatrix} \begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_k \end{pmatrix} = c_1 u_1 + c_2 u_2 + \dots + c_k u_k
• To approximate the actual face we have to add the mean
face m to the reconstruction of z:
x \approx z + m = U c + m = \begin{pmatrix} u_1 & u_2 & \cdots & u_k \end{pmatrix} \begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_k \end{pmatrix} + m = c_1 u_1 + c_2 u_2 + \dots + c_k u_k + m
• The reconstructed face is simply the mean face with
some fraction of each eigenface added to it
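A sketch of the full reconstruction pipeline, with random stand-in data in place of real face images (the sizes, k, and variable names are illustrative only):

```python
# Reconstruct an "image" from its first k eigenface coefficients plus the mean
# (illustrative sketch with random stand-in data; n "pixels", p sample images).
import numpy as np

rng = np.random.default_rng(8)
n, p, k = 1000, 97, 10
faces = rng.normal(size=(n, p))            # stand-in for the face image columns
m = faces.mean(axis=1, keepdims=True)      # the mean face
A = faces - m                              # mean-subtracted images

lam, Vs = np.linalg.eigh(A.T @ A)          # transpose trick (see above)
order = np.argsort(lam)[::-1][:k]          # indices of the k largest eigenvalues
U = A @ Vs[:, order]                       # the first k eigenfaces
U /= np.linalg.norm(U, axis=0)             # normalized to unit length

x = faces[:, [0]]                          # one face image
c = U.T @ (x - m)                          # its k eigenface coefficients
x_hat = U @ c + m                          # reconstruction: mean face + weighted eigenfaces
print(np.mean((x - x_hat) ** 2))           # reconstruction error shrinks as k grows
```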
The mean face
Approximating a face using the
first 10 eigenfaces
Approximating a face using the
first 20 eigenfaces
Approximating a face using the
first 40 eigenfaces
Approximating a face using all
97 eigenfaces
Notes:
• not surprisingly, the reconstructed faces approximate the originals better when more eigenfaces are used
• In the extreme case of using no eigenface at all, the
reconstruction of any face image is just the average
face
• In the other extreme of using all 97 eigenfaces, the
reconstruction is exact, since the image was in the
original training set of 97 images. For a new face image,
however, its reconstruction based on the 97 eigenfaces
will only approximately match the original.
• What regularities does each eigenface capture? See
movie.
Eigenfaces in 3D
• Blanz and Vetter (SIGGRAPH ’99). The video can be
found at:
http://graphics.informatik.uni-freiburg.de/Sigg99.html