Transcript: Department of Computer Science
Similarities, Distances and Manifold Learning
Prof. Richard C. Wilson Dept. of Computer Science University of York
Background • Typically objects are characterised by features – Face images – SIFT features – Object spectra – ...
• If we measure n features → an n-dimensional space • The arena for our problem is an n-dimensional vector space
• Example: Eigenfaces

Background • Raw pixel values: an n-by-m image gives nm features • Feature space is the space of all n-by-m images
Background • The space of all face-like images is smaller than the space of all images • Assumption is faces lie on a smaller manifold embedded in the global space All images Face images
Manifold: a space which locally looks Euclidean
Manifold learning: finding the manifold representing the objects we are interested in. All objects should be on the manifold, non-objects outside.
Part I: Euclidean Space
• Position, Similarity and Distance
• Manifold Learning in Euclidean space
• Some famous techniques
Part II: Non-Euclidean Manifolds
• Assessing Data
• Nature and Properties of Manifolds
• Data Manifolds
• Learning some special types of manifolds
Part III: Advanced Techniques
• Methods for intrinsically curved manifolds
Thanks to Edwin Hancock, Eliza Xu and Bob Duin for contributions, and to the EU SIMBAD project for support
Part I: Euclidean Space
Position • The main arena for pattern recognition and machine learning problems is a vector space – A set of n well-defined features collected into a vector in ℝⁿ • Also defined are addition of vectors and multiplication by a scalar • Feature vector → position
Similarity • To make meaningful progress, we need a notion of similarity • Inner product: $\langle \mathbf{x}, \mathbf{y} \rangle = \sum_i x_i y_i$ • The inner product $\langle \mathbf{x}, \mathbf{y} \rangle$ can be considered to be a similarity between x and y
Induced norm • The self-similarity $\langle \mathbf{x}, \mathbf{x} \rangle$ is the square of the 'size' of x and gives rise to the induced norm, the length of x: $\|\mathbf{x}\| = \sqrt{\langle \mathbf{x}, \mathbf{x} \rangle}$ • Finally, the length of x allows the definition of a distance in our vector space as the length of the vector joining x and y: $d(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} - \mathbf{y}\| = \sqrt{\langle \mathbf{x} - \mathbf{y}, \mathbf{x} - \mathbf{y} \rangle}$
• Inner product also gets us distance
Euclidean space • If we have a vector space for features, and the usual inner product, all three are connected: position $\mathbf{x}$, similarity $\langle \mathbf{x}, \mathbf{y} \rangle$, distance $d(\mathbf{x}, \mathbf{y})$
Non-Euclidean inner product • If the inner product has the form $\langle \mathbf{x}, \mathbf{y} \rangle = \mathbf{x}^T\mathbf{y} = \sum_i x_i y_i$ then the vector space is Euclidean • Note we recover all the expected results for Euclidean space, i.e. $\|\mathbf{x}\| = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$ and $d(\mathbf{x}, \mathbf{y}) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_n - y_n)^2}$ • The inner product doesn't have to be like this; for example in Einstein's special relativity, the inner product of spacetime is $\langle \mathbf{x}, \mathbf{y} \rangle = x_1 y_1 + x_2 y_2 + x_3 y_3 - x_4 y_4$
The Golden Trio • In Euclidean space, the concepts of position (X), similarity (K) and distance (D) are elegantly connected
Point position matrix • In a normal manifold learning problem, we have a set of samples $X = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_m\}$ • These can be collected together in a matrix $X = \begin{pmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_m^T \end{pmatrix}$, one point per row • I use this convention, but others may write them vertically
Centring • A common and important operation is centring – moving the mean to the origin – Centred points behave better • This can be done with the centring matrix $C = I - J/m$, where J is the all-ones matrix: $X_c = CX$ – C is symmetric, $C = C^T$
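Below is a minimal sketch of centring in code (Python with numpy; the variable names and sizes are illustrative, not from the slides):

```python
import numpy as np

# Minimal sketch of centring with C = I - J/m, where J is the all-ones matrix.
# X holds one point per row, following the convention used in these slides.
m, n = 5, 3
X = np.random.rand(m, n)

J = np.ones((m, m))
C = np.eye(m) - J / m        # centring matrix; symmetric, so C == C.T
Xc = C @ X                   # centred points: each column now has zero mean

assert np.allclose(Xc.mean(axis=0), 0.0)
assert np.allclose(C, C.T)
```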
Position–Similarity • The similarity matrix K is defined as $K_{ij} = \langle \mathbf{x}_i, \mathbf{x}_j \rangle$ • From the definition of X, we simply get $K = XX^T$ • The Gram matrix is the similarity matrix of the centred points (from the definition of X): $K_c = CXX^TC^T = CKC$ – i.e. a centring operation on K • $K_c$ is really a kernel matrix for the points (linear kernel)
Position–Similarity • To go from K to X, we need to consider the eigendecomposition of K: $K = U\Lambda U^T$ • Since $K = XX^T$, as long as we can take the square root of Λ, we can find X as $X = U\Lambda^{1/2}$
Kernel embedding • First manifold learning method – kernel embedding – finds a Euclidean manifold from object similarities: $K = U\Lambda U^T$, $X = U\Lambda^{1/2}$ • Embeds a kernel matrix into a set of points in Euclidean space (the points are automatically centred) • K must have no negative eigenvalues, i.e. it is a kernel matrix (Mercer condition)
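A sketch of kernel embedding in Python/numpy follows; the function name and the tolerance on the Mercer check are my own choices, not from the slides:

```python
import numpy as np

def kernel_embedding(K, dim=None):
    """Embed a kernel matrix K into points X with K = X X^T.

    Eigendecompose K = U L U^T and take X = U L^{1/2}, keeping the
    top `dim` eigenvalues (all of them if dim is None).
    """
    evals, evecs = np.linalg.eigh(K)               # ascending eigenvalues
    order = np.argsort(evals)[::-1]
    evals, evecs = evals[order], evecs[:, order]
    if evals[-1] < -1e-10 * max(evals[0], 1e-30):  # Mercer condition check
        raise ValueError("K has significant negative eigenvalues")
    if dim is not None:
        evals, evecs = evals[:dim], evecs[:, :dim]
    return evecs * np.sqrt(np.clip(evals, 0.0, None))

# Round trip: centred points -> linear kernel -> points (same up to rotation).
X = np.random.rand(10, 3)
X -= X.mean(axis=0)
K = X @ X.T
Y = kernel_embedding(K, dim=3)
assert np.allclose(Y @ Y.T, K, atol=1e-8)
```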
Similarity–Distance • The matrix of squared distances $D_s$: $D_{s,ij} = d(\mathbf{x}_i, \mathbf{x}_j)^2 = \langle \mathbf{x}_i - \mathbf{x}_j, \mathbf{x}_i - \mathbf{x}_j \rangle = \langle \mathbf{x}_i, \mathbf{x}_i \rangle + \langle \mathbf{x}_j, \mathbf{x}_j \rangle - 2\langle \mathbf{x}_i, \mathbf{x}_j \rangle = K_{ii} + K_{jj} - 2K_{ij}$ • We can easily determine $D_s$ from K
Similarity–Distance • What about finding K from $D_s$? $D_{s,ij} = K_{ii} + K_{jj} - 2K_{ij}$ • Looking at this equation, we might imagine that $K = -\frac{1}{2}D_s$ is a suitable choice • But that is not centred; the relationship is actually $K = -\frac{1}{2}CD_sC$
Classic MDS • Classic Multidimensional Scaling embeds a (squared) distance matrix into Euclidean space • Using what we have so far, the algorithm is simple: 1. Compute the kernel: $K = -\frac{1}{2}CD_sC$ 2. Eigendecompose the kernel: $K = U\Lambda U^T$ 3. Embed the kernel: $X = U\Lambda^{1/2}$ • This is MDS, taking Distance D to Position X
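The three steps translate directly into a short, self-contained sketch (the function name and the random test data are illustrative):

```python
import numpy as np

def classic_mds(Ds, dim):
    """Classic MDS sketch: squared-distance matrix Ds -> embedded points.

    Steps as above: K = -1/2 C Ds C, then eigendecompose and embed.
    """
    m = Ds.shape[0]
    C = np.eye(m) - np.ones((m, m)) / m     # centring matrix
    K = -0.5 * C @ Ds @ C                   # kernel from squared distances
    evals, evecs = np.linalg.eigh(K)
    order = np.argsort(evals)[::-1][:dim]   # top `dim` eigenpairs
    return evecs[:, order] * np.sqrt(np.clip(evals[order], 0.0, None))

# Usage: recover 2D points (up to rotation/translation) from their distances.
X = np.random.rand(8, 2)
Ds = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
Y = classic_mds(Ds, dim=2)
```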
The Golden Trio • MDS takes Distance D to Position X; Kernel Embedding takes Similarity K to Position X • D and K are related by $D_{s,ij} = K_{ii} + K_{jj} - 2K_{ij}$ and $K = -\frac{1}{2}CD_sC$
Kernel methods • A kernel is a function k(i, j) which computes an inner product: $k(i, j) = \langle \mathbf{x}_i, \mathbf{x}_j \rangle$ – But without needing to know the actual points (the space is implicit) • Using a kernel function we can directly compute K without knowing X
Kernel methods • The implied space may be very high dimensional, but a true kernel will always produce a positive semidefinite K, and the implied space will be Euclidean • Many (most?) PR algorithms can be kernelized – Made to use K rather than X or D • The trick is to note that any interesting vector should lie in the space spanned by the examples we are given • Hence it can be written as a linear combination: $\mathbf{u} = \alpha_1\mathbf{x}_1 + \alpha_2\mathbf{x}_2 + \cdots + \alpha_m\mathbf{x}_m = X^T\boldsymbol{\alpha}$ • Look for α instead of u
Kernel PCA • What about PCA? PCA solves the following problem: $\mathbf{u}^* = \arg\max_{\|\mathbf{u}\|=1} \mathbf{u}^T\Sigma\mathbf{u} = \arg\max_{\|\mathbf{u}\|=1} \frac{1}{n}\mathbf{u}^TX^TX\mathbf{u}$ • Let's kernelize: $\frac{1}{n}\mathbf{u}^TX^TX\mathbf{u} = \frac{1}{n}(X^T\boldsymbol{\alpha})^TX^TX(X^T\boldsymbol{\alpha}) = \frac{1}{n}\boldsymbol{\alpha}^TXX^TXX^T\boldsymbol{\alpha} = \frac{1}{n}\boldsymbol{\alpha}^TK^2\boldsymbol{\alpha}$
Kernel PCA • $K^2$ has the same eigenvectors as K, so the eigenvectors of PCA are the same as the eigenvectors of K • The eigenvalues of PCA are related to the eigenvalues of K by $\lambda_{PCA} = \frac{1}{n}\lambda_K^2$ • Kernel PCA is a kernel embedding with an externally provided kernel matrix
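A quick numeric check of the kernelization (a sketch; the random data is purely illustrative): the top eigenvector α of K, mapped back through u = X^T α, matches the top principal direction from ordinary PCA.

```python
import numpy as np

n = 20
X = np.random.rand(n, 3)
X -= X.mean(axis=0)                      # centred data, one point per row

K = X @ X.T                              # linear kernel
alpha = np.linalg.eigh(K)[1][:, -1]      # top eigenvector of K (eigh ascends)
u_kernel = X.T @ alpha                   # back to feature space: u = X^T alpha
u_kernel /= np.linalg.norm(u_kernel)

u_pca = np.linalg.eigh(X.T @ X / n)[1][:, -1]   # top eigenvector of covariance

# Same direction, up to sign.
assert np.isclose(abs(u_kernel @ u_pca), 1.0)
```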
Kernel PCA • So kernel PCA gives the same solution as kernel embedding – the eigenvalues are modified a bit • MDS uses the kernel and kernel embedding • Kernel embedding, MDS and PCA are essentially the same thing in Euclidean space: all give the same answer for a set of points
Some useful observations • Your similarity matrix is Euclidean iff it has no negative eigenvalues (i.e. it is a kernel matrix and PSD) • By similar reasoning, your distance matrix is Euclidean iff the similarity matrix derived from it is PSD • If the feature space is small but the number of samples is large, then the covariance matrix is small and it is better to do normal PCA (on the covariance matrix) • If the feature space is large and the number of samples is small, then the kernel matrix will be small and it is better to do kernel embedding
Part II: Non-Euclidean Manifolds
Non-linear data • Much of the data in computer vision lies in a high dimensional feature space but is constrained in some way – The space of all images of a face is a subspace of the space of all possible images – The subspace is highly non-linear but low dimensional (described by a few parameters)
Non-linear data • This cannot be exploited by the linear subspace methods like PCA – These assume that the subspace is a Euclidean space as well • A classic example is the ‘swiss roll’ data:
'Flat' Manifolds • There are fundamentally different types of data; for example: • The embedding of this data into the high-dimensional space is highly curved – This is called extrinsic curvature, the curvature of the manifold with respect to the embedding space • Now imagine that this manifold was a piece of paper; you could unroll the paper into a flat plane without distorting it – No intrinsic curvature; in fact it is homeomorphic to Euclidean space
Curved manifold • This manifold is different: • It must be stretched to map it onto a plane – It has non-zero intrinsic curvature • A flatlander living on this manifold can tell that it is curved, for example by measuring the ratio of the radius to the circumference of a circle • In the first case, we might still hope to find a Euclidean embedding • We can never find a distortion-free Euclidean embedding of the second (in the sense that the distances will always have errors)
Intrinsically Euclidean Manifolds • We cannot use the previous methods on the second type of manifold, but there is still hope for the first • The manifold is embedded in Euclidean space, but Euclidean distance is not the correct way to measure distance • The Euclidean distance 'shortcuts' the manifold • The geodesic distance is the length of the shortest path along the manifold itself
Geodesics • The geodesic generalizes the concept of distance to curved manifolds – The shortest path joining two points which lies completely within the manifold • If we can correctly compute the geodesic distances, and the manifold is intrinsically flat, we should get Euclidean distances which we can plug into our Euclidean geometry machine: geodesic distances D → similarity K → position X
ISOMAP • ISOMAP is exactly such an algorithm • Approximate geodesic distances are computed for the points from a graph • Nearest-neighbours graph – For neighbours, Euclidean distance ≈ geodesic distance – For non-neighbours, geodesic distance is approximated by the shortest path in the graph • Once we have distances D, we can use MDS to find the Euclidean embedding
ISOMAP • ISOMAP: – Neighbourhood graph – Shortest path algorithm – MDS • ISOMAP is distance-preserving – embedded distances should be close to geodesic distances
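A sketch of ISOMAP along these lines (Python with scipy; it reuses the classic_mds sketch from the MDS section and assumes the neighbourhood graph is connected):

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def isomap(X, n_neighbors, dim):
    """ISOMAP sketch: neighbourhood graph -> shortest paths -> MDS."""
    m = X.shape[0]
    D = np.sqrt(np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
    # Keep only edges to the k nearest neighbours (zero entries = no edge).
    G = np.zeros_like(D)
    for i in range(m):
        nbrs = np.argsort(D[i])[1:n_neighbors + 1]
        G[i, nbrs] = D[i, nbrs]
    G = np.maximum(G, G.T)                        # symmetrise the graph
    # Approximate geodesics by shortest paths in the graph (Dijkstra).
    Dg = shortest_path(G, method="D", directed=False)
    return classic_mds(Dg ** 2, dim)              # MDS on squared geodesics
```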
Laplacian Eigenmap • The Laplacian Eigenmap is another graph-based method of embedding non-linear manifolds into Euclidean space • As with ISOMAP, form a neighbourhood graph for the datapoints • Find the graph Laplacian as follows • The adjacency matrix A is $A_{ij} = e^{-d_{ij}^2/t}$ if i and j are connected, 0 otherwise • The 'degree' matrix D is the diagonal matrix $D_{ii} = \sum_j A_{ij}$ • The normalized graph Laplacian is $L = I - D^{-1/2}AD^{-1/2}$
Laplacian Eigenmap • We find the Laplacian eigenmap embedding using the eigendecomposition of L: $L = U\Lambda U^T$ • The embedded positions are $X = D^{-1/2}U$ • Similar to ISOMAP – but structure-preserving, not distance-preserving
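A sketch of the embedding step (assuming an adjacency matrix A built as above; dropping the constant eigenvector of L is a standard choice, not spelled out on the slide):

```python
import numpy as np

def laplacian_eigenmap(A, dim):
    """Laplacian eigenmap sketch from an adjacency matrix A."""
    d = A.sum(axis=1)                              # vertex degrees
    Dis = np.diag(1.0 / np.sqrt(d))                # D^{-1/2}
    L = np.eye(A.shape[0]) - Dis @ A @ Dis         # normalized graph Laplacian
    evals, evecs = np.linalg.eigh(L)               # ascending eigenvalues
    U = evecs[:, 1:dim + 1]                        # skip the trivial eigenvector
    return Dis @ U                                 # X = D^{-1/2} U
```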
Locally-Linear Embedding • Locally-Linear Embedding is another classic method which also begins with a neighbourhood graph • We reconstruct each point i (in the original data) from a weighted sum of the neighbouring points: $\mathbf{x}_i \approx \sum_{j \ne i} W_{ij}\mathbf{x}_j$ • $W_{ij}$ is 0 for any point j not in the neighbourhood (and for i = j) • We find the weights by minimising the reconstruction error $\min \sum_i \left|\mathbf{x}_i - \sum_j W_{ij}\mathbf{x}_j\right|^2$ – Subject to the constraints that the weights are non-negative and sum to 1: $W_{ij} \ge 0$, $\sum_j W_{ij} = 1$ • Gives a relatively simple closed-form solution
Locally-Linear Embedding • These weights encode how well a point j represents a point i, and can be interpreted as the adjacency between i and j • A low-dimensional embedding is then found by finding points $\mathbf{y}_i$ which minimise the error $\min \sum_i \left|\mathbf{y}_i - \sum_j W_{ij}\mathbf{y}_j\right|^2$ • In other words, we find a low-dimensional embedding which preserves the adjacency relationships • The solution to this embedding problem turns out to be simply the eigenvectors of the matrix $M = (I - W)^T(I - W)$ • LLE is scale-free: the final points have the covariance matrix I – Unit scale
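A sketch of LLE (for simplicity this solves the weights with only the sum-to-one constraint plus a small regulariser, a common relaxation of the non-negativity constraint stated above):

```python
import numpy as np

def lle(X, n_neighbors, dim, reg=1e-3):
    """Locally-Linear Embedding sketch: weights W, then eigenvectors of M."""
    m = X.shape[0]
    D = np.sqrt(np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
    W = np.zeros((m, m))
    for i in range(m):
        nbrs = np.argsort(D[i])[1:n_neighbors + 1]
        Z = X[nbrs] - X[i]                           # neighbours relative to x_i
        G = Z @ Z.T                                  # local Gram matrix
        G += reg * np.trace(G) * np.eye(len(nbrs))   # regularise for stability
        w = np.linalg.solve(G, np.ones(len(nbrs)))
        W[i, nbrs] = w / w.sum()                     # enforce sum-to-one
    M = (np.eye(m) - W).T @ (np.eye(m) - W)
    evals, evecs = np.linalg.eigh(M)
    return evecs[:, 1:dim + 1]                       # skip the constant eigenvector
```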
Comparison • LLE might seem like quite a different process to the previous two, but it is actually very similar • We can interpret the process as producing a kernel matrix followed by scale-free kernel embedding: $K = (\lambda_{\max} - 1)I + W + W^T - W^TW$, $K = U\Lambda U^T$, $X = U$ • Summary of the three methods:
– ISOMAP: representation – neighbourhood graph; similarity matrix – from geodesic distances; embedding – $X = U\Lambda^{1/2}$
– Laplacian Eigenmap: representation – neighbourhood graph; similarity matrix – graph Laplacian; embedding – $X = D^{-1/2}U$
– LLE: representation – neighbourhood graph; similarity matrix – reconstruction weights; embedding – $X = U$
Comparison • ISOMAP is the only method which directly computes and uses the geodesic distances – The other two depend indirectly on the distances through local structure • LLE is scale-free, so the original distance scale is lost, but the local structure is preserved • Computing the necessary local dimensionality to find the correct nearest neighbours is a problem for all such methods
Non-Euclidean data • Data is Euclidean iff K is positive semidefinite • Unless you are using a kernel function, this is often not true • Why does this happen?
What type of data do I have? • Starting point: a distance matrix • However, we do not know a priori if our measurements are representable on a manifold – We will call them dissimilarities • Our starting point to answer the question "What type of data do I have?" will be a matrix of dissimilarities D between objects • Types of dissimilarities: – Euclidean (no intrinsic curvature) – Non-Euclidean, metric (curved manifold) – Non-metric (no point-like manifold representation)
Causes • Example: Chicken pieces data • Distance by alignment • Global alignment of everything could find Euclidean distances • Only local alignments are practical
Causes • Dissimilarities may also be non-metric • The data is metric if it obeys the metric conditions: 1. $D_{ij} \ge 0$ (non-negativity) 2. $D_{ij} = 0$ iff $i = j$ (identity of indiscernibles) 3. $D_{ij} = D_{ji}$ (symmetry) 4. $D_{ij} \le D_{ik} + D_{kj}$ (triangle inequality) • Reasonable dissimilarities should meet 1 & 2
Causes • Symmetry: $D_{ij} = D_{ji}$ • May not be symmetric by definition • Alignment: aligning i → j may find a better solution than aligning j → i
Causes • Triangle violations: $D_{ij} \le D_{ik} + D_{kj}$ • 'Extended objects': k may match part of i and part of j, so that $D_{ik} \approx 0$ and $D_{kj} \approx 0$ while $D_{ij} > 0$ • Finally, noise in the measurement of D can cause all of these effects
Tests (1) • Find the similarity matrix $K = -\frac{1}{2}CD_sC$ • The data is Euclidean iff K is positive semidefinite (no negative eigenvalues) – K is then a kernel, and kernel embedding gives an explicit embedding • We can then use K in a kernel algorithm
Tests (2) • Negative eigenfraction (NEF): $\mathrm{NEF} = \frac{\sum_{\lambda_i < 0}|\lambda_i|}{\sum_i |\lambda_i|}$ • Lies between 0 and 0.5
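Tests (1) and (2) are a few lines of numpy; a sketch (the PSD tolerance is my choice, and D is assumed to hold plain distances, not squared ones):

```python
import numpy as np

def euclidean_tests(D):
    """Given a distance matrix D, return (is_euclidean, NEF)."""
    m = D.shape[0]
    C = np.eye(m) - np.ones((m, m)) / m
    K = -0.5 * C @ (D ** 2) @ C                       # K = -1/2 C Ds C
    evals = np.linalg.eigvalsh(K)
    is_euclidean = evals.min() > -1e-10 * np.abs(evals).max()
    nef = np.abs(evals[evals < 0]).sum() / np.abs(evals).sum()
    return is_euclidean, nef
```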
Tests (3) • The metric conditions: 1. $D_{ij} \ge 0$ (non-negativity) 2. $D_{ij} = 0$ iff $i = j$ (identity of indiscernibles) 3. $D_{ij} = D_{ji}$ (symmetry) 4. $D_{ij} \le D_{ik} + D_{kj}$ (triangle inequality) – Check these for your data (the 4th involves checking all triples) – Metric data is embeddable on a (curved) Riemannian manifold
Corrections • If the data is non-metric or non-Euclidean, we can 'correct' it • Symmetry violations – Average: $D_{ij} \leftarrow \frac{1}{2}(D_{ij} + D_{ji})$, or take whichever of $D_{ij}$, $D_{ji}$ is appropriate • Triangle violations – Constant offset: $D_{ij} \leftarrow D_{ij} + c$ for $i \ne j$ – This will also remove non-Euclidean behaviour for large enough c • Euclidean violations – Discard negative eigenvalues • There are many other approaches*
* "On Euclidean corrections for non-Euclidean dissimilarities", Duin, Pekalska, Harol, Lee and Bunke, S+SSPR 08
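A sketch of the first two corrections (the offset c is left as a free parameter here; choosing the smallest c that repairs all triangle violations is one option):

```python
import numpy as np

def correct_dissimilarities(D, c=0.0):
    """Symmetrise by averaging, then add a constant off-diagonal offset c."""
    D = 0.5 * (D + D.T)                          # fix symmetry violations
    D = D + c * (1.0 - np.eye(D.shape[0]))       # offset off-diagonal entries
    return D
```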
Part III: Advanced techniques for non-Euclidean Embeddings
Known Manifolds • Sometimes we have data which lies on a known but non Euclidean manifold • Examples in Computer Vision – Surface normals – Rotation matrices – Flow tensors (DT-MRI) • This is not Manifold Learning, as we already know what the manifold is • What tools do we need to be able to process data like this?
– As before, distances are the key
Example: 2D direction • The direction of an edge in an image can be encoded as a unit vector $\mathbf{x} = (x_1, x_2)^T$ • The average $\bar{\mathbf{x}} = \frac{1}{n}\sum_i \mathbf{x}_i$ of the direction vectors isn't even a direction vector (not unit length), let alone the correct 'average' direction • The normal definition of mean is not correct – Because the manifold is curved
Tangent space • The tangent space ($T_P$) is the Euclidean space which is parallel to the manifold (M) at a particular point (P) • The tangent space is a very useful tool because it is Euclidean
Exponential Map • Exponential map: $\mathrm{Exp}_P : T_P \to M$ • $\mathrm{Exp}_P$ maps a point X on the tangent plane onto a point A on the manifold: $A = \mathrm{Exp}_P(X)$ – P is the centre of the mapping and is at the origin of the tangent space – The mapping is one-to-one in a local region of P – The most important property of the mapping is that distances to the centre P are preserved: $d(X, P) = d(A, P)$ – The geodesic distance on the manifold equals the Euclidean distance on the tangent plane (for distances to the centre only)
Exponential map • The log map goes the other way, from manifold to tangent plane: $\mathrm{Log}_P : M \to T_P$
Exponential Map • Example on the circle: embed the circle in the complex plane • The manifold representing the circle is the set of complex numbers with magnitude 1, which can be written $x + iy = \exp(i\theta)$ • In this case it turns out that the map is related to the usual exp and log functions; with centre $P = e^{i\theta_P}$: $X = \mathrm{Log}_P(A) = -i\log\left(e^{i\theta_A}e^{-i\theta_P}\right) = \theta_A - \theta_P$, $A = \mathrm{Exp}_P(X) = e^{i\theta_P}\exp(iX) = \exp\left(i(\theta_P + X)\right) = e^{i\theta_A}$
Intrinsic mean • The mean of a set of samples is usually defined as the sum of the samples divided by the number – This is only true in Euclidean space • A more general formula: $\bar{\mathbf{x}} = \arg\min_{\mathbf{x}} \sum_i d_g^2(\mathbf{x}, \mathbf{x}_i)$ • Minimises the squared distances from the mean to the samples (equivalent in Euclidean space)
Intrinsic mean • We can compute this intrinsic mean using the exponential map • If we knew what the mean M was, then we could use it as the centre of a map: $X_i = \mathrm{Log}_M(A_i)$ • From the properties of the Exp map, the distances are the same: $d_e(X_i, M) = d_g(A_i, M)$ • So the mean on the tangent plane is equal to the mean on the manifold
Intrinsic mean • Start with a guess at the mean and move towards the correct answer • This gives us the following algorithm – Guess at a mean $M_0$ 1. Map onto the tangent plane using the current estimate $M_k$ 2. Compute the mean on the tangent plane and map back to get the new estimate: $M_{k+1} = \mathrm{Exp}_{M_k}\!\left(\frac{1}{n}\sum_i \mathrm{Log}_{M_k}(A_i)\right)$
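For unit direction vectors (using the sphere Exp/Log maps listed later in this part), the algorithm looks like this sketch; the convergence threshold and the initial guess are my own choices:

```python
import numpy as np

def log_map(p, a):
    """Log map on the unit sphere: tangent vector at p pointing towards a."""
    theta = np.arccos(np.clip(a @ p, -1.0, 1.0))
    if theta < 1e-12:
        return np.zeros_like(p)
    return theta * (a - p * np.cos(theta)) / np.sin(theta)

def exp_map(p, x):
    """Exp map on the unit sphere: follow tangent vector x away from p."""
    theta = np.linalg.norm(x)
    if theta < 1e-12:
        return p
    return p * np.cos(theta) + (x / theta) * np.sin(theta)

def intrinsic_mean(A, n_iter=50, tol=1e-10):
    """Iterative intrinsic mean of unit vectors (one per row of A)."""
    M = A[0].copy()                                      # initial guess
    for _ in range(n_iter):
        T = np.mean([log_map(M, a) for a in A], axis=0)  # tangent-plane mean
        M = exp_map(M, T)
        if np.linalg.norm(T) < tol:                      # converged
            break
    return M
```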
Intrinsic Mean • For many manifolds, this procedure will converge to the intrinsic mean – Convergence is not always guaranteed • Other statistics and probability distributions on manifolds are problematic – We can hypothesise a normal distribution on the tangent plane, but distortions are inevitable
Some useful manifolds and maps • Some useful manifolds and their exponential maps • Directional vectors (surface normals etc.): unit vectors $\mathbf{a}$, $\langle \mathbf{a}, \mathbf{a} \rangle = 1$ • With $\theta = \cos^{-1}\langle \mathbf{a}, \mathbf{p} \rangle$: $\mathbf{x} = \frac{\theta}{\sin\theta}(\mathbf{a} - \mathbf{p}\cos\theta)$ (Log map), $\mathbf{a} = \mathbf{p}\cos\theta + \frac{\sin\theta}{\theta}\mathbf{x}$ where $\theta = \|\mathbf{x}\|$ (Exp map) • $\mathbf{a}$, $\mathbf{p}$ are unit vectors; $\mathbf{x}$ lies in an (n−1)-dimensional space
Some useful manifolds and maps • Symmetric positive definite matrices (covariance, flow tensors etc.): $\mathbf{u}^TA\mathbf{u} > 0$ for all $\mathbf{u} \ne \mathbf{0}$ • $X = P^{1/2}\log\left(P^{-1/2}AP^{-1/2}\right)P^{1/2}$ (Log map) • $A = P^{1/2}\exp\left(P^{-1/2}XP^{-1/2}\right)P^{1/2}$ (Exp map) • A is symmetric positive definite, X is just symmetric • log is the matrix log, defined as a generalized matrix function
Some useful manifolds and maps • Orthogonal matrices (rotation matrices, eigenvector matrices): $AA^T = I$ • $X = \log(P^TA)$ (Log map) • $A = P\exp(X)$ (Exp map) • A is orthogonal, X is antisymmetric ($X + X^T = 0$) • These are the matrix exp and log functions as before • In fact there are multiple solutions to the matrix log – Only one is the required real antisymmetric matrix, and it is not easy to find – The rest are complex
Embedding on $S^n$ • On $S^2$ (the surface of a sphere in 3D) the following parameterisation is well known: $\mathbf{x} = (r\sin\theta\cos\phi,\ r\sin\theta\sin\phi,\ r\cos\theta)^T$ • The distance between two points (the length of the geodesic) is $d_{xy} = r\cos^{-1}\left(\sin\theta_x\sin\theta_y\cos(\phi_x - \phi_y) + \cos\theta_x\cos\theta_y\right)$
More Spherical Geometry • On a sphere, the distance is the highlighted arc-length $d_{xy} = r\theta_{xy}$ – Much neater to use the inner product: $\langle \mathbf{x}, \mathbf{y} \rangle = r^2\cos\theta_{xy}$, so $d_{xy} = r\theta_{xy} = r\cos^{-1}\frac{\langle \mathbf{x}, \mathbf{y} \rangle}{r^2}$ – And this works in any number of dimensions
Spherical Embedding • Say we had the distances between some objects ($d_{ij}$), measured on the surface of a [hyper]sphere of dimension n • The sphere (and objects) can be embedded into an (n+1)-dimensional space – Let X be the matrix of point positions and $Z = XX^T$ – Z is a kernel matrix • But $d_{ij} = r\cos^{-1}\frac{\langle \mathbf{x}_i, \mathbf{x}_j \rangle}{r^2}$, so $Z_{ij} = \langle \mathbf{x}_i, \mathbf{x}_j \rangle = r^2\cos\frac{d_{ij}}{r}$ • We can compute Z from D and find the spherical embedding!
Spherical Embedding • But wait, we don't know what r is! • The distances D are non-Euclidean, and if we use the wrong radius, Z is not a kernel matrix – Negative eigenvalues • Use this to find the radius – Choose r to minimise the negative eigenvalues: $r^* = \arg\min_r \sum_{\lambda_i < 0} |\lambda_i(Z(r))|$
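A sketch of the radius search and embedding (the search bounds are an assumption: r must be at least $d_{\max}/\pi$ for the angles $d/r$ to be valid; kernel_embedding is the sketch from Part I, applied here with any residual negative eigenvalues clipped):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def spherical_kernel(D, r):
    """Z(r) as above: Z_ij = r^2 cos(d_ij / r)."""
    return r ** 2 * np.cos(D / r)

def best_radius(D):
    """Choose r to minimise the mass of Z(r)'s negative eigenvalues."""
    def neg_mass(r):
        evals = np.linalg.eigvalsh(spherical_kernel(D, r))
        return np.abs(evals[evals < 0]).sum()
    res = minimize_scalar(neg_mass, method="bounded",
                          bounds=(D.max() / np.pi, 10.0 * D.max()))
    return res.x

# Usage: Z = spherical_kernel(D, best_radius(D)), then embed Z as in Part I.
```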
Example: Texture Mapping • As an alternative to unwrapping an object onto a plane and texture-mapping the plane • Embed onto a sphere and texture-map the sphere (figure: plane vs. sphere texture maps)
Backup slides
Laplacian and related processes • As well as embedding objects onto manifolds, we can model many interesting processes on manifolds • Example: the way 'heat' flows across a manifold can be very informative • Heat equation: $\frac{\partial u}{\partial t} = -\nabla^2 u$ • $\nabla^2$ is the Laplacian; in 3D Euclidean space it is $\nabla^2 = \frac{\partial^2}{\partial x^2} + \frac{\partial^2}{\partial y^2} + \frac{\partial^2}{\partial z^2}$ • On a sphere it is $\nabla^2 = \frac{1}{r^2\sin\theta}\frac{\partial}{\partial\theta}\left(\sin\theta\frac{\partial}{\partial\theta}\right) + \frac{1}{r^2\sin^2\theta}\frac{\partial^2}{\partial\phi^2}$
Graph Laplacian • Given a set of datapoints on the manifold, describe them by a graph – Vertices are datapoints, edges are the adjacency relation • Adjacency matrix (for example): $A_{ij} = \exp(-d_{ij}^2/\sigma^2)$ • Degree matrix: $V_{ii} = \sum_j A_{ij}$ • Then the graph Laplacian is $L = V - A$ • The graph Laplacian approximates the manifold Laplacian
Heat Kernel • Using the graph Laplacian, we can easily implement heat flow methods on the manifold using the heat kernel • Heat equation: $\frac{d\mathbf{u}}{dt} = -L\mathbf{u}$ • Heat kernel: $H = \exp(-Lt)$ • We can diffuse a function on the manifold by $\mathbf{f}' = H\mathbf{f}$
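A sketch of heat-kernel smoothing on the graph (using the unnormalised Laplacian L = V − A from the previous slide; the function name is illustrative):

```python
import numpy as np
from scipy.linalg import expm

def heat_smooth(A, f, t):
    """Diffuse a function f over the graph for time t: f' = exp(-L t) f."""
    V = np.diag(A.sum(axis=1))       # degree matrix
    L = V - A                        # graph Laplacian
    H = expm(-L * t)                 # heat kernel
    return H @ f
```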