#### Principal Component Analysis (Machine Learning)

• Last time: Expectation Maximization in Graphical Models – Baum-Welch
• Now: Unsupervised Dimensionality Reduction

##### Curse of Dimensionality

• In (nearly) all modeling approaches, more features (dimensions) require (a lot) more data
  – Typically exponential in the number of features
• This is clearly seen from filling a probability table.
• Topological arguments are also made.
  – Compare the volume of an inscribed hypersphere to that of a hypercube.

##### Dimensionality Reduction

• We’ve already seen some of this.
• Regularization attempts to reduce the number of effective features used in linear and logistic regression classifiers.

##### Linear Models

• When we regularize, we optimize a function that ignores as many features as possible.
• The “effective” number of dimensions is much smaller than D.

##### Support Vector Machines

• In exemplar approaches (SVM, k-NN), each data point can be considered to describe a dimension.
• By selecting only those instances that maximize the margin (the α of every other instance is set to zero), SVMs use only a subset of the available dimensions in their decision making.

##### Decision Trees

(Figure: example tree splitting on weight < 165 and then height < 68, with leaves 5M, 5F, and 1F/1M.)

• Decision trees explicitly select split points based on features that improve Information Gain or Accuracy.
• Features that don’t contribute sufficiently to the classification are never used.

##### Feature Spaces

• Even though a data point is described in terms of N features, this may not be the most compact representation of the feature space.
• Even classifiers that try to use a smaller effective feature space can suffer from the curse of dimensionality.
• If a feature has some discriminative power, the dimension may remain in the effective set.

##### 1-d Data in a 2-d World

(Figure: scatter plot of points lying along a single line in a two-dimensional space – a single dimension of high variance.)

##### Identifying Dimensions of Variance

• Assumption: directions that show high variance represent the appropriate/useful dimensions with which to represent the feature set.

##### Aside: Normalization

• Assume 2 features:
  – Percentile GPA
  – Height (in cm, or alternatively in m)
• Which dimension shows greater variability?
• The answer changes with the units: with height in cm, height dominates the spread; with height in m, percentile GPA does. This is why features are normalized before comparing variance.

(Figure: the same GPA/height data plotted with height in cm and with height in m.)

##### Principal Component Analysis

• Principal Component Analysis (PCA) identifies the dimensions of greatest variance of a set of data.

##### Eigenvectors

• Eigenvectors are orthogonal vectors that define a space, the eigenspace.
• Any data point can be described as a linear combination of eigenvectors.
• Eigenvectors of a square matrix $A$ satisfy $A v = \lambda v$.
• The associated $\lambda$ is the eigenvalue.

##### PCA

• Write each data point in this new space.
• To do the dimensionality reduction, keep C < D dimensions.
• Each data point is now represented as a vector of c’s.

##### Identifying Eigenvectors

• PCA is easy once we have the eigenvectors and the mean.
• Identifying the mean is easy.
• Eigenvectors of the covariance matrix represent a set of directions of variance.
• Eigenvalues represent the degree of the variance.

##### Eigenvectors of the Covariance Matrix

• Eigenvectors are orthonormal.
• In the eigenspace, the Gaussian is diagonal – zero covariance.
• All eigenvalues are non-negative.
• Eigenvalues are sorted.
• Larger eigenvalues, higher variance (see the sketch below).
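As a concrete illustration of the last two slides, here is a minimal NumPy sketch. The toy data, array names, and dimensionality are illustrative assumptions, not part of the lecture: the sketch centers the data, forms the covariance matrix, and takes its eigendecomposition, sorting the eigenvalues in descending order.

```python
import numpy as np

# Toy data: N = 200 points in D = 5 dimensions (values are illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 0] *= 5.0                           # give one direction much higher variance

# Identifying the mean is easy.
mu = X.mean(axis=0)

# Eigenvectors of the covariance matrix are the directions of variance;
# eigenvalues give the degree of variance along each direction.
cov = np.cov(X - mu, rowvar=False)       # D x D covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: the covariance matrix is symmetric

# eigh returns eigenvalues in ascending order, so reorder them descending.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print(eigvals)         # all non-negative; larger eigenvalue = higher variance
print(eigvecs[:, 0])   # eigenvector for the direction of greatest variance
```

Using `numpy.linalg.eigh` rather than `eig` exploits the symmetry of the covariance matrix, so the eigenvalues come back real and (up to numerical error) non-negative, matching the properties listed above.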
##### Dimensionality Reduction with PCA

• To convert an original data point to its PCA representation (encode): $c_j = (x - \mu)^\top v_j$.
• To reconstruct a point (decode): $\hat{x} = \mu + \sum_{j=1}^{C} c_j v_j$.

(Slide credit: Tony Jebara, Columbia University.)

##### Eigenfaces

• Given data $\{x_1, \ldots, x_N\}$ and retained eigenvectors $v_1, \ldots, v_C$:
  – ENCODE: $c_{ij} = (x_i - \mu)^\top v_j$
  – DECODE: $\hat{x}_i = \mu + \sum_{j=1}^{C} c_{ij} v_j$, for $i = 1, \ldots, N$
• Each image is encoded and then decoded; efficiency can be evaluated with absolute or squared error. (A code sketch of this encode/decode step appears at the end of these notes.)

##### Some Other (Unsupervised) Dimensionality Reduction Techniques

• Kernel PCA
• Distance-Preserving Dimension Reduction
• Maximum Variance Unfolding
• Multidimensional Scaling (MDS)
• Isomap

##### Next Time

• Model Adaptation and Semi-supervised Techniques
• Work on your projects.
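For reference, here is a minimal, self-contained sketch of the encode/decode step from the Dimensionality Reduction with PCA and Eigenfaces slides above. The toy data and the choice of C = 2 retained eigenvectors are illustrative assumptions, not values from the lecture.

```python
import numpy as np

# Toy data standing in for {x_1, ..., x_N}; N, D, and C are illustrative choices.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # N x D
mu = X.mean(axis=0)

# Eigenvectors of the covariance matrix, sorted by descending eigenvalue.
eigvals, eigvecs = np.linalg.eigh(np.cov(X - mu, rowvar=False))
order = np.argsort(eigvals)[::-1]
V = eigvecs[:, order][:, :2]             # keep C = 2 < D eigenvectors (columns)

# ENCODE: c_ij = (x_i - mu)^T v_j  -- each point becomes a length-C vector of c's.
codes = (X - mu) @ V                     # N x C

# DECODE: x_hat_i = mu + sum_{j=1}^{C} c_ij v_j
X_hat = mu + codes @ V.T                 # N x D reconstruction

# "Encoded then decoded": evaluate with absolute or squared error.
print(np.abs(X - X_hat).mean())          # mean absolute reconstruction error
print(((X - X_hat) ** 2).mean())         # mean squared reconstruction error
```

Keeping more eigenvectors (a larger C) drives both errors toward zero; with C = D the reconstruction is exact up to floating-point error.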