Principal Component Analysis
Machine Learning
Last Time
• Expectation Maximization in Graphical Models
– Baum-Welch
Now
• Unsupervised Dimensionality Reduction
Curse of Dimensionality
• In (nearly) all modeling approaches, more
features (dimensions) require (a lot) more
data
– The amount of data needed is typically
exponential in the number of features
• This is clearly seen by counting the entries
needed to fill a full joint probability table.
• Geometric arguments are also made.
– Compare the volume of an inscribed hypersphere
to that of its enclosing hypercube
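The hypersphere/hypercube comparison can be made concrete. A minimal sketch (the function name and the sampled dimensions are illustrative): the fraction of a hypercube's volume occupied by its inscribed ball collapses toward zero as the dimension grows, so almost all of the cube's volume sits in its corners.

```python
import math

def sphere_to_cube_ratio(d):
    # Volume of the unit d-ball divided by the volume of its
    # enclosing hypercube (side length 2).
    ball = math.pi ** (d / 2) / math.gamma(d / 2 + 1)
    return ball / 2.0 ** d

for d in (2, 3, 10, 20):
    print(f"d={d}: {sphere_to_cube_ratio(d):.2e}")
```

At d = 2 the ratio is π/4 ≈ 0.785; by d = 20 it is on the order of 10⁻⁸.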
Dimensionality Reduction
• We’ve already seen some of this.
• Regularization attempts to reduce the number
of effective features used in linear and logistic
regression classifiers
Linear Models
• When we regularize, we optimize an objective
that penalizes weights, driving as many of
them as possible to (near) zero.
• The “effective” number of dimensions is then much
smaller than D
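The sparsifying effect of L1 regularization can be sketched in a few lines. This is a hedged illustration, not the lecture's method: the solver (ISTA, i.e. proximal gradient with soft-thresholding), the function name `lasso_ista`, and the constants are all choices made here for demonstration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 10
X = rng.normal(size=(n, d))
true_w = np.zeros(d)
true_w[:2] = [2.0, -3.0]              # only 2 of 10 features matter
y = X @ true_w + 0.1 * rng.normal(size=n)

def lasso_ista(X, y, lam=0.1, lr=0.1, iters=500):
    # L1-regularized least squares via ISTA (proximal gradient).
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        g = X.T @ (X @ w - y) / len(y)                  # squared-loss gradient
        w = w - lr * g
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)  # soft-threshold
    return w

w = lasso_ista(X, y)
```

The coefficients of the 8 irrelevant features are driven to (exactly) zero: the effective dimensionality is 2, not 10.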
Support Vector Machines
• In exemplar approaches (SVM, k-nn) each data
point can be considered to describe a dimension.
• By keeping only the instances that define the
margin (the α of every other point is driven to
zero), SVMs use only a subset of available
dimensions in their decision making.
Decision Trees
[Decision tree figure: first split on weight < 165 (leaf: 5M);
the remaining points split on height < 68 (leaves: 5F and 1F / 1M)]
• Decision Trees explicitly select split points
based on features that improve
Information Gain or Accuracy
• Features that don’t contribute sufficiently to
the classification are never used.
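The split criterion can be made concrete. A small stdlib-only sketch of entropy and Information Gain, using class counts consistent with the tree figure above (5 M on one side of the weight split; 6 F and 1 M on the other) - the counts are read off the figure, the helper names are mine:

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy (bits) of a list of class labels.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    # Entropy reduction from splitting `parent` into `left` and `right`.
    n = len(parent)
    return (entropy(parent)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))

parent = ["M"] * 6 + ["F"] * 6            # 6 males, 6 females
left = ["M"] * 5                          # weight < 165: pure leaf
right = ["F"] * 6 + ["M"]                 # weight >= 165: nearly pure
gain = information_gain(parent, left, right)
```

The parent has entropy 1 bit; the weight split recovers roughly 0.65 bits, which is why the tree chooses it first.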
Feature Spaces
• Even though a data point is described in terms
of D features, this may not be the most
compact representation of the feature space
• Even classifiers that try to use a smaller
effective feature space can suffer from the
curse-of-dimensionality
• If a feature has some discriminative power,
the dimension may remain in the effective set.
1-d data in a 2-d world
[Scatter plot: points in 2-D (x: 0–0.15, y: 250–251.4) that all lie
along a single line; the data is intrinsically 1-dimensional]
Dimensions of high variance
Identifying dimensions of variance
• Assumption: directions that show high
variance are the appropriate/useful
dimensions for representing the data.
Aside: Normalization
• Assume 2 features:
– Percentile GPA
– Height in cm.
• Which dimension
shows greater
variability?
[Scatter plot: height in cm (y: 250–285) vs. percentile GPA (x: 0–1);
height dominates the variance]
Aside: Normalization
• Assume 2 features:
– Percentile GPA
– Height in cm.
• Which dimension
shows greater
variability?
[Scatter plot: same data, height in cm (y: 250–285), x-axis now drawn 0–30]
Aside: Normalization
• Assume 2 features:
– Percentile GPA
– Height in m.
• Which dimension
shows greater
variability?
[Scatter plot: height in m vs. percentile GPA, both axes 0–1;
with comparable units, neither dimension dominates]
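The point of the three plots is that "which dimension has more variance" depends on units unless the features are normalized first. A stdlib-only sketch (the function name and the sample heights are illustrative) of z-score standardization:

```python
import statistics

def standardize(values):
    # Rescale to zero mean and unit variance (z-scores), so a feature's
    # raw unit (cm vs. m vs. percentile) no longer drives its variance.
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

heights_cm = [250.0, 260.0, 270.0, 280.0]   # illustrative values
heights_m = [2.50, 2.60, 2.70, 2.80]        # the same heights in meters

z_cm, z_m = standardize(heights_cm), standardize(heights_m)
# After standardization the two encodings are numerically identical.
```

Standardizing before PCA prevents an arbitrary choice of units from deciding which directions look "high variance."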
Principal Component Analysis
• Principal Component Analysis (PCA) identifies
the dimensions of greatest variance of a set of
data.
Eigenvectors
• Eigenvectors of a symmetric matrix are
orthogonal vectors that define a space, the
eigenspace.
• Any data point can be described as a linear
combination of eigenvectors.
• Eigenvectors of a square matrix A have the
following property: A v = λ v
• The associated λ is the eigenvalue.
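The property A v = λ v can be checked numerically. A small sketch with a symmetric matrix chosen for illustration (covariance matrices are symmetric, so the same machinery applies in PCA):

```python
import numpy as np

# A small symmetric matrix (illustrative values).
A = np.array([[4.0, 2.0],
              [2.0, 3.0]])
eigvals, eigvecs = np.linalg.eigh(A)      # columns of eigvecs are eigenvectors

# The defining property A v = lambda v holds for every eigenpair.
for lam, v in zip(eigvals, eigvecs.T):
    assert np.allclose(A @ v, lam * v)

# For a symmetric matrix, the eigenvectors are orthonormal.
assert np.allclose(eigvecs.T @ eigvecs, np.eye(2))
```

`np.linalg.eigh` is the routine for symmetric matrices; it returns real eigenvalues in ascending order.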
PCA
• Write each data point in this new space:
x = μ + Σ_j c_j v_j
• To do the dimensionality reduction,
keep only C < D dimensions.
• Each data point is now represented as a vector
of coefficients c.
Identifying Eigenvectors
• PCA is easy once we have the eigenvectors and
the mean.
• Identifying the mean is easy.
• The eigenvectors of the covariance matrix
represent a set of directions of variance.
• The eigenvalues represent the degree of
variance along each direction.
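The recipe above (mean, covariance, eigendecomposition) fits in a few lines of numpy. A sketch on synthetic data; the dimensions, scales, and the 30° rotation are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 2-D data: large variance along one direction, small along
# the other, then rotated by 30 degrees.
raw = rng.normal(size=(500, 2)) * np.array([3.0, 0.5])
theta = np.pi / 6
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = raw @ R.T

mu = X.mean(axis=0)                       # identifying the mean is easy
cov = np.cov(X, rowvar=False)             # covariance matrix of the data
eigvals, eigvecs = np.linalg.eigh(cov)    # eigh returns ascending order

order = np.argsort(eigvals)[::-1]         # sort: largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# The top eigenvector recovers the rotated high-variance direction.
alignment = abs(eigvecs[:, 0] @ np.array([np.cos(theta), np.sin(theta)]))
```

The eigenvalues come out near 9 and 0.25, matching the variances the data was generated with, and the top eigenvector points along the rotated axis.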
Eigenvectors of the Covariance Matrix
• Eigenvectors are orthonormal
• In the eigenspace the Gaussian is diagonal: zero
covariance between dimensions.
• All eigenvalues are non-negative.
• Eigenvalues are sorted in decreasing order.
• Larger eigenvalues correspond to higher variance.
Dimensionality reduction with PCA
• To convert an original data point to PCA
coordinates: c_j = (x − μ)^T v_j
• To reconstruct a point:
x̂ = μ + Σ_{j=1}^{C} c_j v_j
Tony Jebara, Columbia University
Eigenfaces
Training images {x_1, …, x_N} are encoded with the mean μ and the
top C eigenvectors (the “eigenfaces”) v_1, …, v_C:
ENCODE: c_ij = (x_i − μ)^T v_j
DECODE: x̂_i = μ + Σ_{j=1}^{C} c_ij v_j
Encoded then Decoded. Efficiency can be evaluated
with Absolute or Squared error.
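The ENCODE/DECODE pair maps directly onto two matrix products. A sketch on synthetic data (the sizes N, D, C, the random generator seed, and the noise level are illustrative assumptions, not eigenface data):

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, C = 200, 5, 2                   # N points, D raw dims, keep C
latent = rng.normal(size=(N, C))      # the data truly lives in C dims
basis = rng.normal(size=(C, D))
X = latent @ basis + 0.01 * rng.normal(size=(N, D))

mu = X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
V = eigvecs[:, np.argsort(eigvals)[::-1][:C]]   # D x C, top-C eigenvectors

codes = (X - mu) @ V          # ENCODE: c_ij = (x_i - mu)^T v_j
X_hat = mu + codes @ V.T      # DECODE: x_hat_i = mu + sum_j c_ij v_j

squared_error = np.mean((X - X_hat) ** 2)
```

Each point is compressed from D = 5 numbers to C = 2 coefficients, yet the squared reconstruction error stays near the injected noise level, because the discarded eigendirections carried almost no variance.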
Some other (unsupervised)
dimensionality reduction techniques
• Kernel PCA
• Distance Preserving Dimension Reduction
• Maximum Variance Unfolding
• Multi-Dimensional Scaling (MDS)
• Isomap
• Next Time
– Model Adaptation and Semi-supervised
Techniques
• Work on your projects.