Learning representations
By Yang Zechao. Based on the Machine Learning course: www.cs.cmu.edu/~tom/10701_sp11/lectures.shtml
Neural Nets for Face Recognition
Learning Lower Dimensional Representations
• Supervised learning of a lower-dimensional representation
  – Hidden layers in neural networks
  – Fisher linear discriminant
• Unsupervised learning of a lower-dimensional representation
  – Principal Components Analysis (PCA)
  – Independent Components Analysis (ICA)
  – Canonical Correlation Analysis (CCA)
  – Deep Belief Networks (DBN)
Principal Components Analysis
• Idea:
  – Given data points in d-dimensional space, project them into a lower-dimensional space while preserving as much information as possible
    • E.g., find the best planar approximation to 3D data
    • E.g., find the best planar approximation to 10^4-D data
  – In particular, choose the projection that minimizes the squared error in reconstructing the original data
Principal Components Analysis
• Like auto-encoding neural networks, PCA learns a re-representation of the input data that can best reconstruct it.
Principal Components Analysis
• Learned encoding is a linear function of the inputs (not logistic)
• No local-minimum problems when training!
• Given d-dimensional data X, learns a d-dimensional re-representation, where
  – the dimensions are orthogonal
  – the top k dimensions give the k-dimensional linear re-representation that minimizes reconstruction error (sum of squared errors)
Principal Components Analysis
Assume the data is a set of d-dimensional vectors, where the nth vector is $x^n = \langle x^n_1, \dots, x^n_d \rangle$. We can represent these in terms of any d orthonormal basis vectors $u_1, \dots, u_d$:
$x^n = \sum_{i=1}^{d} z^n_i u_i$, with $u_i^T u_j = \delta_{ij}$
So, PCA: given $M < d$, find the directions $u_1, \dots, u_M$ (and mean $\bar{x}$) that minimize the reconstruction error
$E_M = \sum_{n=1}^{N} \| x^n - \hat{x}^n \|^2$, where $\hat{x}^n = \bar{x} + \sum_{i=1}^{M} z^n_i u_i$
Principal Components Analysis
• Note we get zero error if M=d, so all error is due to missing components.
• Therefore, $E_M = \sum_{i=M+1}^{d} \sum_{n=1}^{N} \left( u_i^T (x^n - \bar{x}) \right)^2 = \sum_{i=M+1}^{d} u_i^T \Sigma u_i$, where the covariance matrix is $\Sigma = \sum_{n} (x^n - \bar{x})(x^n - \bar{x})^T$
Principal Components Analysis
• Minimize $E_M = \sum_{i=M+1}^{d} u_i^T \Sigma u_i$ subject to $u_i^T u_i = 1$
• Using a Lagrange multiplier $\lambda_i$, minimize $u_i^T \Sigma u_i + \lambda_i (1 - u_i^T u_i)$
• Taking the derivative and setting it to zero gives $\Sigma u_i = \lambda_i u_i$, so each $u_i$ is an eigenvector of $\Sigma$
• So $E_M = \sum_{i=M+1}^{d} \lambda_i$: the error is minimized by keeping the directions with the largest eigenvalues and discarding those with the smallest
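As a short worked step (not spelled out on the slide), setting the derivative of this Lagrangian to zero recovers the eigenvector condition:

% stationarity of L(u_i) = u_i^T \Sigma u_i + \lambda_i (1 - u_i^T u_i) with respect to u_i
\[
\frac{\partial}{\partial u_i}\Big[\, u_i^T \Sigma u_i + \lambda_i \big(1 - u_i^T u_i\big) \Big]
  = 2\,\Sigma u_i - 2\,\lambda_i u_i = 0
  \quad\Longrightarrow\quad \Sigma u_i = \lambda_i u_i .
\]
% left-multiplying by u_i^T and using u_i^T u_i = 1 gives u_i^T \Sigma u_i = \lambda_i,
% hence E_M = \sum_{i=M+1}^{d} \lambda_i.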
PCA algorithm 1
1. X ← the N × d data matrix, with one row vector $x^n$ per data point
2. X ← subtract the mean $\bar{x}$ from each row vector $x^n$ in X
3. Σ ← the covariance matrix of X
4. Find the eigenvectors and eigenvalues of Σ
5. PCs ← the M eigenvectors with the largest eigenvalues
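A minimal sketch of this algorithm in Python/NumPy (a rough illustration; the function and variable names such as pca_eig and n_components are my own, not from the slides):

import numpy as np

def pca_eig(X, n_components):
    """PCA via eigendecomposition of the covariance matrix (steps 1-5 above)."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)            # step 2: subtract the mean row vector
    cov = X_centered.T @ X_centered            # step 3: (unnormalized) covariance, d x d
    eigvals, eigvecs = np.linalg.eigh(cov)     # step 4: eigendecomposition (symmetric matrix)
    order = np.argsort(eigvals)[::-1]          # sort eigenvalues from largest to smallest
    top = order[:n_components]
    return eigvecs[:, top], eigvals[top]       # step 5: top-M eigenvectors and their eigenvalues

# example: best planar (M = 2) approximation of 3-D data
X = np.random.randn(100, 3)
pcs, eigvals = pca_eig(X, n_components=2)
Z = (X - X.mean(axis=0)) @ pcs                 # coordinates of each point in the PC plane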
Very Nice When Initial Dimension Not Too Big
• What if the data is very high-dimensional?
  – e.g., images (d ≥ 10^4)
• Problem:
  – the covariance matrix Σ is of size d × d
  – for d = 10^4, Σ has 10^8 entries
• Singular Value Decomposition (SVD) to the rescue!
  – pretty efficient algorithms available, including Matlab SVD
  – some implementations find just the top N eigenvectors
SVD
• Data X, one row per data point
• U gives the coordinates of the rows of X in the space of principal components
• S is diagonal, with $s_k > s_{k+1}$; $s_k^2$ is the kth eigenvalue
• The rows of $V^T$ are unit-length eigenvectors of $X^T X$
Singular Value Decomposition
To generate the principal components:
• Subtract the mean $\bar{x} = \frac{1}{N} \sum_{n=1}^{N} x^n$ from each data point, to create zero-centered data
• Create the matrix X with one row vector per (zero-centered) data point
• Solve the SVD: $X = U S V^T$
• Output principal components: the columns of V (= the rows of $V^T$)
  – The eigenvectors in V are sorted from largest to smallest eigenvalue
  – S is diagonal, with $s_k^2$ giving the eigenvalue for the kth eigenvector
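A sketch of the same procedure via SVD, in Python/NumPy (illustrative only; names are my own):

import numpy as np

def pca_svd(X, n_components):
    """PCA via SVD of the zero-centered data matrix, X = U S V^T."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)                             # zero-center the data
    U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)   # s is sorted, largest first
    eigenvalues = s ** 2                                        # s_k^2 = kth eigenvalue
    return Vt[:n_components], eigenvalues[:n_components]        # rows of V^T = principal directions

Vt_top, eigvals = pca_svd(np.random.randn(500, 50), n_components=10)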
Singular Value Decomposition
• To project a point (column vector x) into PC coordinates, compute $V^T x$
• If $x_i$ is the ith row of the data matrix X, then the ith row of $XV$ (= the ith row of $US$) gives the PC coordinates of $x_i$
• To project a column vector x onto the M-dimensional principal-components subspace, take just the first M coordinates of $V^T x$
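A short self-contained sketch of this projection step in Python/NumPy (toy data and names are my own):

import numpy as np

X = np.random.randn(500, 50)                   # toy data, one row per point
Xc = X - X.mean(axis=0)                        # zero-centered data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

M = 10
x = Xc[0]                                      # a single (zero-centered) point
z = Vt[:M] @ x                                 # first M coordinates of V^T x
Z_all = Xc @ Vt[:M].T                          # PC coordinates of every row (= U[:, :M] * s[:M])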
Independent Components Analysis
• PCA seeks orthogonal directions $\langle Y_1 \dots Y_M \rangle$ in feature space X that minimize reconstruction error
• ICA seeks directions $\langle Y_1 \dots Y_M \rangle$ that are most statistically independent, i.e., that minimize I(Y), the mutual information between the $Y_j$:
  $I(Y) = \sum_{j} H(Y_j) - H(Y)$
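A hedged sketch of ICA in practice with scikit-learn's FastICA on a toy blind-source-separation problem (the sources and mixing matrix below are illustrative, not from the slides):

import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.RandomState(0)
t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                                 # source 1: sinusoid
s2 = np.sign(np.sin(3 * t))                        # source 2: square wave
S = np.c_[s1, s2] + 0.05 * rng.randn(2000, 2)      # (nearly) independent sources

A = np.array([[1.0, 0.5], [0.5, 2.0]])             # illustrative mixing matrix
X = S @ A.T                                        # observed mixed signals

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)                       # estimated independent components
# PCA applied to X would instead return orthogonal directions of maximum variance,
# which in general do not recover the original sources.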
Learning Lower Dimensional Representations
• Supervised learning of a lower-dimensional representation
  – Hidden layers in neural networks
  – Fisher linear discriminant
• Unsupervised learning of a lower-dimensional representation
  – Principal Components Analysis (PCA)
  – Independent Components Analysis (ICA)
  – Canonical Correlation Analysis (CCA)
  – Deep Belief Networks (DBN)
Dimensionality reduction across multiple datasets
• Given data sets A and B, find linear projections of each into a common lower-dimensional space!
  – Generalized SVD: minimize the squared reconstruction errors of both
  – Canonical correlation analysis: maximize the correlation of A and B in the projected space
Canonical Correlation Analysis
• Measures the linear relationship between two multidimensional variables
• Finds two sets of basis vectors such that the correlation between the projections of the variables onto these basis vectors is maximized
• Determines the correlation coefficients
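A hedged sketch of CCA using scikit-learn's cross_decomposition.CCA on synthetic paired datasets A and B (the data-generating choices are my own):

import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.RandomState(0)
latent = rng.randn(200, 2)                                 # shared low-dimensional structure
A = latent @ rng.randn(2, 6) + 0.1 * rng.randn(200, 6)     # dataset A: a 6-D view
B = latent @ rng.randn(2, 4) + 0.1 * rng.randn(200, 4)     # dataset B: a 4-D view

cca = CCA(n_components=2)
A_c, B_c = cca.fit_transform(A, B)                         # projections into the common space

for k in range(2):                                         # canonical correlation coefficients
    r = np.corrcoef(A_c[:, k], B_c[:, k])[0, 1]
    print(f"canonical correlation {k + 1}: {r:.3f}")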
Learning Lower Dimensional Representations
• Supervised learning of a lower-dimensional representation
  – Hidden layers in neural networks
  – Fisher linear discriminant
• Unsupervised learning of a lower-dimensional representation
  – Principal Components Analysis (PCA)
  – Independent Components Analysis (ICA)
  – Canonical Correlation Analysis (CCA)
  – Deep Belief Networks (DBN)
Deep Belief Networks
• Problem: training networks with many hidden layers doesn't work very well
  – local minima; very slow training if initialized with zero weights
• Deep belief networks
  – autoencoder networks to learn low-dimensional encodings
  – but more layers, to learn better encodings
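As a rough illustration of the autoencoder idea these slides build on, here is a minimal deep autoencoder sketch in PyTorch. One caveat: it trains all layers jointly by backpropagation on random stand-in data, rather than using the layer-wise RBM pretraining of actual deep belief networks; the 784-1000-500-250-2 layer sizes follow the digit example on the next slides:

import torch
import torch.nn as nn

# encoder 784 -> 1000 -> 500 -> 250 -> 2, with a mirrored decoder
encoder = nn.Sequential(
    nn.Linear(784, 1000), nn.ReLU(),
    nn.Linear(1000, 500), nn.ReLU(),
    nn.Linear(500, 250), nn.ReLU(),
    nn.Linear(250, 2),
)
decoder = nn.Sequential(
    nn.Linear(2, 250), nn.ReLU(),
    nn.Linear(250, 500), nn.ReLU(),
    nn.Linear(500, 1000), nn.ReLU(),
    nn.Linear(1000, 784), nn.Sigmoid(),        # pixel intensities in [0, 1]
)
model = nn.Sequential(encoder, decoder)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

X = torch.rand(256, 784)                       # stand-in for a batch of digit images
for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(X), X)                # minimize reconstruction error
    loss.backward()
    optimizer.step()

codes = encoder(X)                             # 2-D encoding of each image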
Deep Belief Networks
[Figure] The second row is reconstructed from a 2000-1000-500-30 DBN; the third row is reconstructed from 2000-300 linear PCA.
Encoding of digit images in two dimensions
[Figure: 784-2 linear encoding (PCA) vs. 784-1000-500-250-2 DBNet]
Learning Lower Dimensional Representations
• Supervised learning of a lower-dimensional representation
  – Hidden layers in neural networks
  – Fisher linear discriminant
• Unsupervised learning of a lower-dimensional representation
  – Principal Components Analysis (PCA)
  – Independent Components Analysis (ICA)
  – Canonical Correlation Analysis (CCA)
  – Deep Belief Networks (DBN)
Fisher linear discriminant
• Objective: LDA seeks to reduce dimensionality while preserving as much of the class-discriminatory information as possible
• We seek to obtain a scalar $y$ by projecting the samples $x$ onto a line: $y = w^T x$
Fisher linear discriminant
• Of all the possible lines, we would like to select the one that maximizes the separability of the projected scalars
• In order to find a good projection vector, we need to define a measure of separation
Fisher linear discriminant
• Define the class means $m_i = \frac{1}{N_i} \sum_{x \in \omega_i} x$ and the projected class means $\tilde{m}_i = w^T m_i$
• Could choose $w$ to maximize the distance between the projected class means: $J(w) = |\tilde{m}_1 - \tilde{m}_2| = |w^T (m_1 - m_2)|$
Fisher Linear Discriminant
• For each class we define the scatter, an equivalent of the variance, as $\tilde{s}_i^2 = \sum_{y \in \omega_i} (y - \tilde{m}_i)^2$
• The Fisher linear discriminant chooses the $w$ that maximizes $J(w) = \dfrac{|\tilde{m}_1 - \tilde{m}_2|^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$ (see the sketch below)
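A hedged two-class sketch of this criterion in Python/NumPy, using the standard closed-form solution $w \propto S_W^{-1}(m_2 - m_1)$, where $S_W$ is the within-class scatter matrix (variable names and toy data are my own):

import numpy as np

def fisher_lda_direction(X1, X2):
    """Two-class Fisher linear discriminant: w ~ S_W^{-1} (m2 - m1)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)    # class means
    S1 = (X1 - m1).T @ (X1 - m1)                 # scatter of class 1
    S2 = (X2 - m2).T @ (X2 - m2)                 # scatter of class 2
    Sw = S1 + S2                                 # within-class scatter matrix
    w = np.linalg.solve(Sw, m2 - m1)             # direction maximizing J(w)
    return w / np.linalg.norm(w)

rng = np.random.RandomState(0)
X1 = rng.randn(100, 2)                           # class 1 samples
X2 = rng.randn(100, 2) + np.array([3.0, 1.0])    # class 2 samples, shifted
w = fisher_lda_direction(X1, X2)
y1, y2 = X1 @ w, X2 @ w                          # projected scalars y = w^T x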
Fisher Linear Discriminant
• Choose an (n−1)-dimensional projection for an n-class classification problem
• Use within-class covariances to determine the projection
• Minimizes a different error function (the projected within-class variances)