Amirkabir University of Technology (Tehran Polytechnic)
Nonlinear Dimensionality
Reduction
•Nonlinear Dimensionality Reduction , John A. Lee, Michel Verleysen
1
By: sadatnejad
The goals to be reached
 Discover and extract information that lies hidden in the huge
quantity of data.
 Understand and classify the existing data
 Infer and generalize to new data
2
Dim. Reduction- Practical Motivations
 By essence, the world is multidimensional.
 Redundancy means that parameters or features that could characterize the set of various units
are not independent from each other.
 The large set of parameters or features must be summarized into a smaller set, with no or
less redundancy. This is the goal of dimensionality reduction (DR), which is one of the key tools for
analyzing high-dimensional data.
 Fields of application
 Image processing
 Processing of sensor arrays
 Multivariate data analysis
3
Theoretical Motivations
 Curse of dimensionality
1- How can we visualize high-dimensional spaces?
2- Curse of dimensionality and empty space phenomenon
4
Theoretical Motivations
1- How can we visualize high-dimensional spaces?
 Spatial data
 Temporal data
Two-dimensional representation of a
four-dimensional cube. In addition to
perspective, the color indicates the
depth in the fourth dimension.
5
•Two plots of the same temporal data. In the first
representation, data are displayed in a single coordinate system
(spatial representation).
• In the second representation, each variable is plotted in its
own coordinate system, with time as the abscissa (time
representation).
6
Theoretical Motivations
2- Curse of dimensionality and empty space
phenomenon
 The curse of dimensionality also refers to the fact that in the absence of
simplifying assumptions, the number of data samples required to estimate a
function of several variables to a given accuracy (i.e., to get a reasonably
low-variance estimate) on a given domain grows exponentially with the
number of dimensions.
 Empty space phenomenon: Because the amount of available data is
generally restricted to a few observations, high-dimensional spaces are
inherently sparse.
7
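A minimal numerical sketch (not from the book) of the empty space phenomenon: points drawn uniformly in the unit hypercube almost never fall inside the ball inscribed in it once D grows, so a fixed-size sample leaves most of the space empty.

import numpy as np

rng = np.random.default_rng(0)
n_samples = 100_000
for D in (2, 5, 10, 20):
    # Uniform points in [0, 1]^D; the inscribed ball has center 0.5 and radius 0.5.
    y = rng.uniform(size=(n_samples, D))
    inside = np.linalg.norm(y - 0.5, axis=1) <= 0.5
    print(f"D = {D:2d}: fraction inside inscribed ball = {inside.mean():.4f}")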
Hypervolume of cubes and spheres
r is the radius of the sphere.
 Surprisingly, the ratio Vsphere/Vcube tends to zero when D increases.
 As dimensionality increases, a cube becomes more and
more spiky:
 The spherical body gets smaller and smaller while the number of spikes
increases
 Now, assigning the value 1/2 to r, Vcube equals 1, leading to
 The volume of a sphere vanishes when dimensionality increases
8
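The volume formulas referred to on this slide were not transcribed; a standard reconstruction (assuming the usual Euclidean ball and cube volumes, consistent with the quantities named above) is

\[
V_{\text{sphere}}(r) = \frac{\pi^{D/2}}{\Gamma(D/2+1)}\, r^{D},
\qquad
V_{\text{cube}}(r) = (2r)^{D},
\qquad
\frac{V_{\text{sphere}}(r)}{V_{\text{cube}}(r)} = \frac{\pi^{D/2}}{2^{D}\,\Gamma(D/2+1)} \xrightarrow[D\to\infty]{} 0 .
\]

With r = 1/2, V_cube = 1, so V_sphere(1/2) itself vanishes as D grows.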
Hypervolume of a thin spherical shell
ε ≪ 1 is the thickness of the shell.
 When D increases, the ratio tends to 1, meaning that the
shell contains almost all the volume
9
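The missing ratio can be reconstructed from the volume formula above (a hedged reconstruction, since the slide equation was not transcribed): because V_sphere(r) is proportional to r^D,

\[
\frac{V_{\text{sphere}}(r) - V_{\text{sphere}}(r-\varepsilon)}{V_{\text{sphere}}(r)}
= 1 - \Bigl(1 - \frac{\varepsilon}{r}\Bigr)^{D}
\xrightarrow[D\to\infty]{} 1 .
\]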
Tail probability of isotropic Gaussian
distribution
 Where y is a D-dimensional vector
 μy its D-dimensional mean
 σ2 the isotropic (scalar) variance.
 Assuming the random vector y has zero mean and unit variance
• Where
 Because the distribution is isotropic, the equiprobable contours are spherical
10
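The density itself was not transcribed; the standard isotropic Gaussian pdf matching the quantities listed above is

\[
p(\mathbf{y}) = \frac{1}{(2\pi\sigma^{2})^{D/2}}
\exp\!\Bigl(-\frac{\|\mathbf{y}-\boldsymbol{\mu}_{\mathbf{y}}\|_{2}^{2}}{2\sigma^{2}}\Bigr),
\]

which, for zero mean and unit variance, reduces to

\[
p(\mathbf{y}) = \frac{1}{(2\pi)^{D/2}} \exp\!\Bigl(-\frac{r^{2}}{2}\Bigr),
\qquad r = \|\mathbf{y}\|_{2} .
\]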
Tail probability of isotropic Gaussian
distribution (CONT.)
 By computing r0.95 defined as the radius of a hypersphere that contains 95% of the
distribution, the value of r0.95 is such that
 Where Ssphere(r) is the surface of a D-dimensional hypersphere of radius r
 The radius r0.95 grows as the dimensionality D increases
 The distribution thus appears to become heavy-tailed; inevitably, to avoid the trap of overfitting, the training data must be sampled over a wider region of the space, and the size of the training set must therefore grow...
11
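As a numerical illustration (not the book's derivation, which integrates the density over hypersphere surfaces Ssphere(r)): for a zero-mean, unit-variance isotropic Gaussian, the squared norm of y follows a chi-square distribution with D degrees of freedom, so r_0.95 is the square root of its 0.95 quantile. A minimal sketch:

import numpy as np
from scipy.stats import chi2

# r_0.95 = radius of the hypersphere containing 95% of an isotropic,
# zero-mean, unit-variance Gaussian in D dimensions.
for D in (1, 2, 3, 5, 10, 50, 100):
    r95 = np.sqrt(chi2.ppf(0.95, df=D))
    print(f"D = {D:3d}: r_0.95 = {r95:.3f}")   # grows with D, e.g. about 1.96 for D = 1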
Some directions to be explored
 In the presence of high-dimensional data, two possibilities exist to avoid or at least
attenuate the effects of the above-mentioned phenomena.
 Relevance of the variables
 For example by computing the correlations between known pairs of input/output
 Dependencies between the variables
 Among the relevant variables, one relevant variable may bring information about another.
 The new set should obviously contain a smaller number of variables but should also preserve the interesting characteristics of the initial set.
12
Goals of projection
 The determination of a projection may also follow two different goals.
 The first and simplest one aims to just detect and eliminate the dependencies (e.g., PCA).
 The second goal of a projection is not only to reduce the dimensionality,
but also to retrieve the so-called latent variables, i.e., those that are at the
origin of the observed ones but cannot be measured directly.
 Blind source separation (BSS), in signal processing, or independent component analysis (ICA), in multivariate data analysis, are particular cases of latent variable separation.
13
About topology, spaces, and manifolds
 From a geometrical point of view, when two or more variables depend on each other,
their joint distribution does not span the whole space.
 Actually, the dependence induces some structure in the distribution, in the form of a
geometrical locus that can be seen as a kind of object in the space.
 Dimensionality reduction aims at giving a new representation of these objects while
preserving their structure.
14
Topology
 In mathematics, topology studies the properties of objects that are
preserved through deformations, twisting, and stretching.
 For example, a circle is topologically equivalent to an ellipse,
and a sphere is equivalent to an ellipsoid.
15
Topology
 The knowledge of objects does not depend on how they are represented, or embedded, in
space.
 For example, the statement, “If you remove a point from a circle, you get a (curved) line
segment” holds just as well for a circle as for an ellipse.
 In other words, topology is used to abstract the intrinsic connectivity of objects while
ignoring their detailed form. If two objects have the same topological properties, they
are said to be homeomorphic.
16
Topological space
 A topological space is a set for which a topology is specified
 For a set Y, a topology T is defined as a collection of subsets of Y that obey the following properties:
 Trivially, ∅ ∈ T and Y ∈ T.
 Whenever two sets are in T , then so is their intersection.
 Whenever two or more sets are in T, then so is their union.
17
Topological space
Geometrical view
 From a more geometrical point of view, a topological space can also be defined
using neighborhoods and Hausdorff's axioms.
 The neighborhood of a point y ∈ RD, also called an ε-neighborhood or infinitesimal open set, is often defined as the open ε-ball,
 i.e., the set of points inside a D-dimensional hollow sphere of radius ε > 0 and centered on y.
 A set containing an open neighborhood is also called a neighborhood.
18
Topological space
Geometrical view (Cont.)
 Then, a topological space is such that
 To each point y there corresponds at least one neighborhood U(y),
and U(y) contains y.
 If U(y) and V(y) are neighborhoods of the same point y, then a neighborhood W(y) exists such that W(y) ⊂ U(y) ∩ V(y).
 If z ∈ U(y), then a neighborhood V(z) of z exists such that V(z) ⊂ U(y).
 For two distinct points, two disjoint neighborhoods of these points exist.
19
Manifold
 Within this framework, a (topological) manifold M is a topological space that is locally
Euclidean, meaning that around every point of M is a neighborhood that is topologically
the same as the open unit ball in RD .
 In general, any object that is nearly “flat” on small scales is a manifold.
 For example, the Earth is spherical but looks flat on the human scale.
20
Embedding
 An embedding is a representation of a topological object (a manifold, a
graph, etc.) in a certain space, usually RD for some D, in such a way
that its topological properties are preserved.
 For example, the embedding of a manifold preserves open sets.
21
Differentiable manifold
22
Swiss roll
 The challenge of the Swiss roll consists of finding a two-
dimensional embedding that “unrolls” it, in order to avoid
superpositions of the successive turns of the spiral and to
obtain a bijective mapping between the initial and final
embeddings of the manifold.
 The Swiss roll is a smooth and connected manifold.
23
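A minimal sketch (parameter choices are illustrative, not taken from the book) of how such a data set is typically generated: a 2-D rectangle parameterized by (t, height) is rolled up into 3-D. An ideal embedding would recover coordinates equivalent to (t, height).

import numpy as np

rng = np.random.default_rng(0)
n_points = 1000
t = 1.5 * np.pi * (1.0 + 2.0 * rng.uniform(size=n_points))  # angle along the spiral
height = 21.0 * rng.uniform(size=n_points)                  # position along the roll axis

# 3-D observations of the 2-D latent coordinates (t, height).
Y = np.column_stack((t * np.cos(t), height, t * np.sin(t)))
latent = np.column_stack((t, height))
print(Y.shape, latent.shape)   # (1000, 3) (1000, 2)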
Open box
 For the open box manifold, the goal is to reduce the embedding dimensionality from three to two.
 As can be seen, the open box is connected but neither
compact (in contrast with a cube or closed box) nor smooth
(there are sharp edges and corners).
 Intuitively, it is not so obvious to guess what an embedding of the open box should look like. Would the lateral faces be stretched? Or torn? Or would the bottom face be shrunk? Actually, the open box helps to show the way each particular method behaves.
24
In practice, all DR methods work with a discrete representation of the manifold to be embedded.
25
Amirkabir University of Technology (Tehran Polytechnic)
Characteristics of an analysis method
•Nonlinear Dimensionality Reduction, John A. Lee, Michel Verleysen, Chapter 2
26
Expected Functionalities
 The analysis of high-dimensional data amounts to identifying and eliminating the
redundancies among the observed variables. This requires three main functionalities: an
ideal method should indeed be able to
 Estimate the number of latent variables.
 Embed data in order to reduce their dimensionality.
 Embed data in order to recover the latent variables.
27
Estimation of the number of latent
variables
 Sometimes latent variables are also called degrees of freedom.
 The number of latent variables is often computed from a topological point of view, by
estimating the intrinsic dimension(ality) of data.
 In contrast with a number of variables, which is necessarily an integer value, the intrinsic
dimension hides a more generic concept and may take on real values.
 Intrinsic dim. P = D: there is no structure.
 Intrinsic dim. P < D: a low intrinsic dimension indicates that a topological object or structure underlies the data set.
28
A two-dimensional manifold embedded in a three-dimensional space. The
data set contains only a finite number of points (or observations).
Without a good estimate of the intrinsic dimension, dimensionality reduction is no more than a risky bet since one does not know to what extent the dimensionality can be reduced.
29
Embedding for dimensionality
reduction
 The knowledge of the intrinsic dimension P indicates that data have some topological
structure and do not completely fill the embedding space.
 DR consists of re-embedding the data in a lower-dimensional space that would be better filled.
 The aims are both to get the most compact representation and to make any subsequent processing of the data easier.
 The main problem is how to measure or characterize the structure of a manifold in order to preserve it.
30
Possible two-dimensional embedding for the object in Fig. 2.1. The dimensionality
of the data set has been reduced from three to two.
31
Internal characteristics
 Behind the expected functionalities of an analysis method, less visible characteristics are hidden, though they play a key role. These characteristics are:
 The model that data are assumed to follow.
 The type of algorithm that identifies the model parameters
 The criterion to be optimized, which guides the algorithm.
32
Underlying model
 All methods of analysis rely on the assumption that the data sets they are fed with have been generated according to a well-defined model.
 For example, principal component analysis assumes that the dependencies between the
variables are linear
 The type of model determines the power and/or limitations of the method.
 For this model choice (linear), PCA often delivers poor results when trying to project data lying on a nonlinear subspace.
33
Fig. 2.4. Dimensionality reduction by PCA from 3 to 2 for the data set of Fig. 2.1. Obviously, data do not fit the model of PCA, and the initial rectangular distribution cannot be retrieved.
34
Algorithm
 For the same model, several algorithms can implement the desired method of
analysis.
 For example, in the case of PCA, the model parameters are computed in
closed form by using general-purpose algebraic procedures.
 Most often, these procedures work quickly, without any external
hyperparameter to tune, and are guaranteed to find the best possible solution
(depending on the criterion, see ahead).
 PCA as a batch method
 Online or adaptive PCA algorithms
35
Criterion
 The criterion probably plays the most important role among the characteristics of a method.
 For example, a well-known criterion for dimensionality reduction is the mean square
error.
 Most often the loss of information or deterioration of the data structure occurs solely in the first step (the projection, or coding), but the second step (the reconstruction, or decoding) is necessary in order to have a comparison reference.
36
Other criteria
 From a statistical point of view, a good projection preserves the variance initially observed in the raw data.
 From a more geometrical or topological point of view, a good projection preserves the structure of the object,
 for example, by preserving the pairwise distances measured between the observations in the data set.
 If the aim is latent variable separation, then the criterion can be decorrelation.
37
PCA
Data model of PCA
 The random vector y = [y1, . . . , yd, . . . , yD]T results from a linear transformation W of P unknown latent variables x = [x1, . . . , xp, . . . , xP]T.
 All latent variables are assumed to have a Gaussian distribution.
 Transformation W is constrained to be an axis change, meaning that the columns of W are orthogonal to each other and of unit norm. In other words, the D-by-P matrix W is such that WᵀW = I_P (but the permuted product WWᵀ may differ from I_D).
 Both the observed variables y and the latent ones x are centered.
 Matrix form of N observations
38
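The slide's equations for the data model were not transcribed; written out (a standard reconstruction consistent with the notation above):

\[
\mathbf{y} = \mathbf{W}\mathbf{x}, \qquad \mathbf{W}^{T}\mathbf{W} = \mathbf{I}_{P},
\]

and, stacking the N observations column by column,

\[
\mathbf{Y} = [\,\mathbf{y}(1), \dots, \mathbf{y}(N)\,] = \mathbf{W}\mathbf{X},
\qquad
\mathbf{X} = [\,\mathbf{x}(1), \dots, \mathbf{x}(N)\,].
\]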
PCA
Preprocessing
 Before determining P and W, the observations can be centered by removing the
expectation of y from each observation y(n).
 The centering can be rewritten for the entire data set as:
 In some situations it is necessary to standardize the variables, i.e., to divide each yd by its standard deviation after centering.
39
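The centering equations were not transcribed; one standard way to write them (an assumption consistent with the text) is

\[
\mathbf{y}(n) \leftarrow \mathbf{y}(n) - \frac{1}{N}\sum_{m=1}^{N}\mathbf{y}(m),
\qquad\text{or, for the whole data matrix,}\qquad
\mathbf{Y} \leftarrow \mathbf{Y} - \frac{1}{N}\,\mathbf{Y}\,\mathbf{1}_{N}\mathbf{1}_{N}^{T},
\]

where 1_N denotes a column vector of N ones.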
PCA - standardization
 The standardization could even be dangerous when some variable has a low standard
deviation.
 When a variable is zero, its standard deviation is also zero. Trivially, the division by zero must be
avoided, and the variable should be discarded.
 When noise pollutes an observed variable having a small standard deviation, the contribution of
the noise to the standard deviation may be proportionally large. This means that discovering the
dependency between that variable and the other ones can be difficult.
 The standardization can be useful but may not be achieved blindly.
 Some knowledge about the data set is necessary.
 After centering (and standardization if appropriate), the parameters P and W
can be identified by PCA.
40
Criteria leading to PCA
 PCA can be derived from several criteria, all leading to the same method
and/or results.
 Minimal reconstruction error
 Maximal preserved variance
41
PCA Criterion
Minimal reconstruction error
 The reconstruction mean square error
 WᵀW = I_P, but WWᵀ = I_D is not necessarily true. Therefore, no simplification may occur in the reconstruction error.
42
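The criterion itself was not transcribed; consistent with the Ecodec referred to on the following slides, the reconstruction mean square error of the coding/decoding pair x̂ = Wᵀy, ŷ = Wx̂ can be written as

\[
E_{\text{codec}} = E_{\mathbf{y}}\bigl\{\|\mathbf{y} - \mathbf{W}\mathbf{W}^{T}\mathbf{y}\|_{2}^{2}\bigr\}.
\]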
 However, in a perfect world, the observed vector y has been generated precisely
according to the PCA model. In this case only, y can be perfectly retrieved.
Indeed, if y = Wx, then
 The reconstruction error is zero.
 In almost all real situations, the observed variables in y are polluted by some noise or do not fully respect the linear PCA model, yielding a nonzero reconstruction error.
 The best approximation is determined by developing and minimizing the
reconstruction error.
43
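The intermediate equation was not transcribed; the step it refers to is simply (using WᵀW = I_P)

\[
\mathbf{W}\mathbf{W}^{T}\mathbf{y} = \mathbf{W}\mathbf{W}^{T}\mathbf{W}\mathbf{x} = \mathbf{W}\mathbf{I}_{P}\mathbf{x} = \mathbf{W}\mathbf{x} = \mathbf{y},
\]

so that Ecodec = 0 in this ideal case.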
 where the first term is constant. Hence, minimizing Ecodec turns out to maximize
the term
 As only a few observations y(n) are available, the latter expression is
approximated by the sample mean:
 To maximize this last expression, Y has to be factored by singular value
decomposition
44
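The equations on this slide were not transcribed; a reconstruction of the steps they describe (assuming the Ecodec written above) is

\[
E_{\text{codec}}
= E_{\mathbf{y}}\{\|\mathbf{y}\|_{2}^{2}\} - E_{\mathbf{y}}\{\|\mathbf{W}^{T}\mathbf{y}\|_{2}^{2}\},
\]

where the first term is constant, so that minimizing Ecodec amounts to maximizing the second term, approximated by the sample mean

\[
\frac{1}{N}\sum_{n=1}^{N}\|\mathbf{W}^{T}\mathbf{y}(n)\|_{2}^{2}
= \frac{1}{N}\|\mathbf{W}^{T}\mathbf{Y}\|_{F}^{2},
\qquad
\mathbf{Y} = \mathbf{V}\boldsymbol{\Sigma}\mathbf{U}^{T}\ \text{(SVD)}.
\]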
 Where V, U are unitary matrices and where Σ is a matrix with the same size as Y
but with at most D nonzero entries σd, called singular values and located on the first
diagonal of Σ.
 The D singular values are usually sorted in descending order. Substituting in the
approximation of the expectation leads to
 Since the columns of V and U are orthonormal vectors by construction, it is
easy to see that
45
 for a given P (I_D×P is a matrix made of the first P columns of the identity matrix I_D). Indeed, the above expression reaches its maximum when the P columns of W are collinear with the columns of V that are associated with the P largest singular values in Σ.
 Additionally, it can be trivially proved that Ecodec = 0 for W = V.
 Finally, P-dimensional latent variables are approximated by computing the product
46
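The expressions on this slide were not transcribed; a hedged reconstruction of the usual derivation (substituting the SVD into the objective (1/N)‖WᵀY‖²_F under the constraint WᵀW = I_P) gives the maximizer

\[
\mathbf{W} = \mathbf{V}\mathbf{I}_{D\times P} = [\,\mathbf{v}_{1}, \dots, \mathbf{v}_{P}\,],
\]

and the latent variables are then estimated as

\[
\hat{\mathbf{X}} = \mathbf{W}^{T}\mathbf{Y}
= \mathbf{I}_{P\times D}\mathbf{V}^{T}\mathbf{V}\boldsymbol{\Sigma}\mathbf{U}^{T}
= \mathbf{I}_{P\times D}\boldsymbol{\Sigma}\mathbf{U}^{T}.
\]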
Maximal preserved variance and
decorrelation
 From a statistical point of view:
 It is assumed that the latent variables in x are uncorrelated (no linear dependencies bind them).
 In practice, this means that the covariance matrix of x, defined as Cxx = E_x{xxᵀ} (provided x is centered), is diagonal.
47
Maximal preserved variance and
decorrelation (Cont.)
 After the axis change induced by W, it is very likely that the observed
variables in y are correlated, i.e., Cyy is no longer diagonal.
 The goal of PCA is then to get back the P uncorrelated latent variables in x.
 Assuming that the PCA model holds and the covariance of y is known,
48
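The covariance equation was not transcribed; under the model y = Wx it reads

\[
\mathbf{C}_{\mathbf{yy}} = E_{\mathbf{y}}\{\mathbf{y}\mathbf{y}^{T}\}
= E_{\mathbf{x}}\{\mathbf{W}\mathbf{x}\mathbf{x}^{T}\mathbf{W}^{T}\}
= \mathbf{W}\,\mathbf{C}_{\mathbf{xx}}\,\mathbf{W}^{T}.
\]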
Maximal preserved variance and
decorrelation (Cont.)
 Since WᵀW = I_P, left and right multiplications by, respectively, Wᵀ and W lead to
 The covariance matrix Cyy can be factored by eigenvalue decomposition
 Where V is a matrix of normed eigenvectors vd and Λ a diagonal matrix
containing their associated eigenvalues λd, in descending order.
49
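Written out (the slide equations were not transcribed):

\[
\mathbf{W}^{T}\mathbf{C}_{\mathbf{yy}}\mathbf{W}
= \mathbf{W}^{T}\mathbf{W}\,\mathbf{C}_{\mathbf{xx}}\,\mathbf{W}^{T}\mathbf{W}
= \mathbf{C}_{\mathbf{xx}},
\qquad
\mathbf{C}_{\mathbf{yy}} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^{T}.
\]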
Maximal preserved variance and
decorrelation (Cont.)
 Because the covariance matrix is symmetric and positive semidefinite, the eigenvectors are orthogonal and the eigenvalues are nonnegative real numbers.
 This equality holds only when the P columns of W are taken collinear with P columns of V, among the D available ones.
 If the PCA model is fully respected, then only the first P eigenvalues in Λ are strictly
larger than zero; the other ones are zero. The eigenvectors associated with these P
nonzero eigenvalues must be kept:
50
Maximal preserved variance and
decorrelation (Cont.)
 This shows that the eigenvalues in Λ correspond to the variances of the latent
variables
 In real situations, some noise may corrupt the observed variables in y. As a
consequence, all eigenvalues of Cyy are larger than zero, and the choice of P
columns in V becomes more difficult.
51
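The retained eigenvectors (referred to on the previous slide) and the resulting latent covariance were not transcribed; following the derivation above, they are

\[
\mathbf{W} = \mathbf{V}\mathbf{I}_{D\times P} = [\,\mathbf{v}_{1}, \dots, \mathbf{v}_{P}\,],
\qquad
\mathbf{C}_{\mathbf{xx}} = \mathbf{W}^{T}\mathbf{C}_{\mathbf{yy}}\mathbf{W}
= \mathbf{I}_{P\times D}\boldsymbol{\Lambda}\mathbf{I}_{D\times P}
= \operatorname{diag}(\lambda_{1}, \dots, \lambda_{P}),
\]

which is why the eigenvalues in Λ correspond to the variances of the latent variables.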
Maximal preserved variance and
decorrelation (Cont.)
 Assuming that the latent variables have larger variances than the noise, it suffices
to choose the eigenvectors associated with the largest eigenvalues.
 If the global variance of y is defined as
 Then the proposed solution is guaranteed to preserve a maximal fraction of the
global variance.
52
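The definition referred to above was not transcribed; a standard way to write it is

\[
\sigma_{\mathbf{y}}^{2}
= E_{\mathbf{y}}\{\|\mathbf{y}\|_{2}^{2}\}
= \operatorname{tr}(\mathbf{C}_{\mathbf{yy}})
= \sum_{d=1}^{D}\lambda_{d},
\]

so that keeping the P leading eigenvectors preserves the fraction \(\sum_{d=1}^{P}\lambda_{d} \big/ \sum_{d=1}^{D}\lambda_{d}\) of the global variance.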
Maximal preserved variance and
decorrelation (Cont.)
 From a geometrical point of view, the columns of V indicate the directions in RD that span the subspace of the latent variables.
 To conclude, it must be emphasized that in real situations the true covariance of
y is not known but can be approximated by the sample covariance:
53
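The sample covariance formula was not transcribed; for centered observations it is

\[
\hat{\mathbf{C}}_{\mathbf{yy}}
= \frac{1}{N}\sum_{n=1}^{N}\mathbf{y}(n)\,\mathbf{y}(n)^{T}
= \frac{1}{N}\,\mathbf{Y}\mathbf{Y}^{T}.
\]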
PCA
Intrinsic dimension estimation
 If the model of PCA is fully respected, then only the P largest eigenvalues of Cyy will depart from zero.
 The rank of the covariance matrix (the number of nonzero eigenvalues) indicates trivially the number of latent variables.
 In practice, however, all D eigenvalues are often different from zero:
 When having only a finite sample, the covariance can only be approximated.
 The data probably do not entirely respect the PCA model (presence of noise, etc.).
 making the estimation of the intrinsic dimension more difficult.
54
PCA
Intrinsic dimension estimation (Cont.)
 Normally, if the model holds reasonably well, large (significant) eigenvalues
correspond to the variances of the latent variables, while smaller (negligible)
ones are due to noise and other imperfections.
 A rather visible gap should separate the two kinds of eigenvalues.
 The gap can be visualized by plotting the eigenvalues in descending order: a
sudden fall should appear right after the Pth eigenvalue.
 If the gap is not visible, plotting minus the logarithm of the normalized
eigenvalues may help. In this plot, the intrinsic dimension is indicated by a
sudden ascent.
55
PCA
Intrinsic dimension estimation (Cont.)
 Unfortunately, when the data dimensionality D is high, there may also be numerous latent
variables showing a wide spectrum of variances.
 In the extreme, the variances of latent variables can no longer be distinguished from the
variances related to noise. In this case, the intrinsic dimension P is chosen in order to
preserve an arbitrarily chosen fraction of the global variance. Then the dimension P is
determined so as to preserve at least this fraction of variance.
 For example, if it is assumed that the latent variables bear 95% of the global variance, then P is the smallest integer such that the following inequality holds:
56
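The inequality was not transcribed; in terms of the eigenvalues it reads \(\sum_{d=1}^{P}\lambda_{d} \geq 0.95 \sum_{d=1}^{D}\lambda_{d}\). A minimal sketch (assuming the eigenvalues of the sample covariance matrix are already available) of selecting P this way:

import numpy as np

def intrinsic_dim_by_variance(eigenvalues, fraction=0.95):
    # Smallest P whose cumulative share of the total variance reaches the threshold.
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]   # descending order
    cumulative = np.cumsum(lam) / np.sum(lam)
    return int(np.searchsorted(cumulative, fraction) + 1)

# Hypothetical spectrum: two dominant latent variances plus small noise eigenvalues.
print(intrinsic_dim_by_variance([4.1, 2.3, 0.08, 0.05, 0.03, 0.02]))   # -> 2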
PCA
Intrinsic dimension estimation (Cont.)
 Sometimes the threshold is set on individual variances instead of cumulated
ones.
 The best way to set the threshold consists of finding a threshold that separates
the significant variances from the negligible ones. This turns out to be
equivalent to the visual methods.
 More complex methods exist to set the frontier between the latent and noise
subspaces, such as Akaike’s information criterion (AIC), the Bayesian
information criterion (BIC), and the minimum description length (MDL).
 These methods determine the value of P on the basis of information-theoretic
considerations.
57
Examples and limitations of PCA
 Gaussian variables and linear embedding
 In this example, the columns of the mixing matrix W are neither orthogonal nor normed; consequently, PCA cannot exactly retrieve the true latent variables.
58
PCA Result
59
Nonlinear embedding
 PCA is unable to completely reconstruct the curved object displayed in Fig. 2.7.
60
PCA Result
61
Non-Gaussian distributions
62
PCA Result
63