
Similarities, Distances and Manifold Learning

Prof. Richard C. Wilson, Dept. of Computer Science, University of York

Background • Typically objects are characterised by features – Face images – SIFT features – Object spectra – ...

• If we measure n features → n-dimensional space • The arena for our problem is an n-dimensional vector space

• Example: Eigenfaces

Background • Raw pixel values: an n by m image gives nm features • Feature space is the space of all n by m images

Background • The space of all face-like images is smaller than the space of all images • Assumption: faces lie on a smaller manifold embedded in the global space (diagram: face images as a subset of all images)

Manifold: A space which locally looks Euclidean

Manifold learning: Finding the manifold representing the objects we are interested in. All objects should be on the manifold, non-objects outside.

Part I: Euclidean Space

Position, Similarity and Distance • Manifold Learning in Euclidean space • Some famous techniques

Part II: Non-Euclidean Manifolds

Assessing Data • Nature and Properties of Manifolds • Data Manifolds • Learning some special types of manifolds

Part III: Advanced Techniques

Methods for intrinsically curved manifolds • Thanks to Edwin Hancock, Eliza Xu, Bob Duin for contributions, and support from the EU SIMBAD project

Part I: Euclidean Space

Position • The main arena for pattern recognition and machine learning problems is vector space – A set of n well-defined features collected into a vector in ℝ^n – Also defined are addition of vectors and multiplication by a scalar • Feature vector → position

Similarity • To make meaningful progress, we need a notion of similarity • Inner product: ⟨x, y⟩ = Σ_i x_i y_i • The inner-product ⟨x, y⟩ can be considered to be a similarity between x and y

Induced norm • The self-similarity ⟨x, x⟩ is the (square of) the 'size' of x and gives rise to the induced norm, or length, of x: ‖x‖ = √⟨x, x⟩ • Finally, the length of x allows the definition of a distance in our vector space as the length of the vector joining x and y: d(x, y) = ‖x − y‖ = √⟨x − y, x − y⟩ • So the inner product also gets us distance

Euclidean space • If we have a vector space for features, and the usual inner product, all three are connected: position x, y ↔ similarity ⟨x, y⟩ ↔ distance d(x, y)

non-Euclidean Inner Product • If the inner-product has the form ⟨x, y⟩ = x^T y = Σ_i x_i y_i • Then the vector space is Euclidean • Note we recover all the expected stuff for Euclidean space, i.e. ‖x‖ = √(x_1² + x_2² + ⋯ + x_n²) and d(x, y) = √((x_1 − y_1)² + (x_2 − y_2)² + ⋯ + (x_n − y_n)²) • The inner-product doesn't have to be like this; for example in Einstein's special relativity, the inner-product of spacetime is ⟨x, y⟩ = x_1 y_1 + x_2 y_2 + x_3 y_3 − x_4 y_4

The Golden Trio • In Euclidean space, the concepts of position, similarity and distance are elegantly connected (diagram: Position X – Similarity K – Distance D triangle)

Point position matrix • In a normal manifold learning problem, we have a set of samples X = {x_1, x_2, ..., x_m} • These can be collected together in a matrix X whose rows are the (transposed) sample vectors: X = [x_1^T; x_2^T; ...; x_m^T] • I use this convention, but others may write them vertically

Centreing • A common and important operation is centreing – moving the mean to the origin – Centred points behave better • This can be done with the centreing matrix C = I − J/m, where J is the all-ones matrix: X_c = CX • C is symmetric (C = C^T)
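A minimal numpy sketch of this centring operation (a sketch under my own variable names, not from the slides):

```python
import numpy as np

# Rows of X are the m sample vectors.
rng = np.random.default_rng(0)
m, n = 5, 3
X = rng.normal(size=(m, n))

J = np.ones((m, m))            # all-ones matrix
C = np.eye(m) - J / m          # centreing matrix C = I - J/m (symmetric)
Xc = C @ X                     # centred points: each feature now has mean ~0

print(np.allclose(Xc.mean(axis=0), 0.0))   # True
```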

Position-Similarity • The similarity matrix K is defined as K_ij = ⟨x_i, x_j⟩ • From the definition of X, we simply get K = XX^T • The Gram matrix is the similarity matrix of the centred points (from the definition of X): K_c = CXX^T C^T = CKC – i.e. a centring operation on K • K_c is really a kernel matrix for the points (linear kernel)

Position-Similarity • To go from K to X, we need to consider the eigendecomposition of K: K = UΛU^T = XX^T • As long as we can take the square root of Λ, we can find X as X = UΛ^{1/2}

Kernel embedding • First manifold learning method – kernel embedding: finds a Euclidean manifold from object similarities • K = UΛU^T, X = UΛ^{1/2} • Embeds a kernel matrix into a set of points in Euclidean space (the points are automatically centred) • K must have no negative eigenvalues, i.e. it is a kernel matrix (Mercer condition)
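A short numpy sketch of kernel embedding as just described; the function name and tolerance handling are my own choices:

```python
import numpy as np

def kernel_embedding(K, dim=None, tol=1e-10):
    """Embed a PSD similarity (kernel) matrix K as points X with K ~ X X^T."""
    lam, U = np.linalg.eigh(K)                 # K = U diag(lam) U^T
    order = np.argsort(lam)[::-1]              # largest eigenvalues first
    lam, U = lam[order], U[:, order]
    if lam.min() < -tol:
        raise ValueError("K has negative eigenvalues: not a valid kernel matrix")
    X = U * np.sqrt(np.clip(lam, 0.0, None))   # X = U Lambda^{1/2}
    return X if dim is None else X[:, :dim]

# Usage: a linear kernel of centred points is recovered exactly (up to rotation).
rng = np.random.default_rng(1)
Y = rng.normal(size=(6, 3))
Y -= Y.mean(axis=0)
K = Y @ Y.T
X = kernel_embedding(K)
print(np.allclose(X @ X.T, K))                 # True
```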

Similarity-Distance • The squared distance is d(x_i, x_j)² = ⟨x_i − x_j, x_i − x_j⟩ = ⟨x_i, x_i⟩ + ⟨x_j, x_j⟩ − 2⟨x_i, x_j⟩ = K_ii + K_jj − 2K_ij = D_{s,ij} • We can easily determine D_s from K

Similarity-Distance • What about finding K from D_s? D_{s,ij} = K_ii + K_jj − 2K_ij • Looking at the top equation, we might imagine that K = −½ D_s is a suitable choice • Not centred; the relationship is actually K = −½ C D_s C

Classic MDS • Classic Multidimensional Scaling embeds a (squared) distance matrix into Euclidean space • Using what we have so far, the algorithm is simple: compute the kernel K = −½ C D_s C, eigendecompose the kernel K = UΛU^T, embed the kernel X = UΛ^{1/2} • This is MDS: it takes us from distance D to position X
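A compact numpy sketch of this recipe (squared distances in, point positions out); the function name and the 1-D toy example are mine:

```python
import numpy as np

def classical_mds(Ds, dim=2):
    """Classic MDS: embed a squared-distance matrix Ds into `dim` dimensions."""
    m = Ds.shape[0]
    C = np.eye(m) - np.ones((m, m)) / m        # centreing matrix
    K = -0.5 * C @ Ds @ C                      # kernel from squared distances
    lam, U = np.linalg.eigh(K)
    order = np.argsort(lam)[::-1][:dim]
    return U[:, order] * np.sqrt(np.clip(lam[order], 0.0, None))   # X = U Lambda^{1/2}

# Usage: points on a line are recovered up to a sign flip and centring.
pts = np.array([[0.0], [1.0], [3.0], [6.0]])
Ds = (pts - pts.T) ** 2                        # squared Euclidean distances
print(np.round(classical_mds(Ds, dim=1).ravel(), 3))
```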

The Golden Trio • The triangle is completed: MDS takes us from distance D to position X, and kernel embedding takes us from similarity K to position X • The links between similarity and distance are D_{s,ij} = K_ii + K_jj − 2K_ij and K = −½ C D_s C

Kernel methods • A kernel is a function k(i, j) which computes an inner-product k(i, j) = ⟨x_i, x_j⟩ – But without needing to know the actual points (the space is implicit) • Using a kernel function we can directly compute K without knowing X

Kernel methods • The implied space may be very high dimensional, but a true kernel will always produce a positive semidefinite K and the implied space will be Euclidean • Many (most?) PR algorithms can be kernelized – Made to use K rather than X or D • The trick is to note that any interesting vector should lie in the space spanned by the examples we are given • Hence it can be written as a linear combination u = α_1 x_1 + α_2 x_2 + ⋯ + α_m x_m = X^T α • Look for α instead of u

Kernel PCA • What about PCA? PCA solves the following problem (for unit-length u): u* = arg max_u u^T Σ u = arg max_u (1/n) u^T X^T X u • Let's kernelize: (1/n) u^T X^T X u = (1/n) (X^T α)^T X^T X (X^T α) = (1/n) α^T (XX^T)(XX^T) α = (1/n) α^T K² α

Kernel PCA • K² has the same eigenvectors as K, so the eigenvectors of PCA are the same as the eigenvectors of K • Once u = X^T α is normalised to unit length, the eigenvalues of PCA are related to the eigenvalues of K by λ_PCA = λ_K / n • Kernel PCA is a kernel embedding with an externally provided kernel matrix

Kernel PCA • So kernel PCA gives the same solution as kernel embedding – The eigenvalues are modified a bit • They are essentially the same thing in Euclidean space • MDS uses the kernel and kernel embedding • MDS and PCA are essentially the same thing in Euclidean space • Kernel embedding, MDS and PCA all give the same answer for a set of points in Euclidean space
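A small numerical check of this claim, assuming centred data and a linear kernel: PCA scores from the covariance matrix should match the kernel embedding of K = XX^T up to the sign of each column:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(8, 4))
X -= X.mean(axis=0)                            # centred data, rows are samples

# Ordinary PCA: eigenvectors of the covariance matrix, then project the data.
cov = X.T @ X / X.shape[0]
w, V = np.linalg.eigh(cov)
V = V[:, np.argsort(w)[::-1]]
pca_scores = X @ V

# Kernel embedding of the linear kernel K = X X^T.
K = X @ X.T
lam, U = np.linalg.eigh(K)
order = np.argsort(lam)[::-1]
embed = U[:, order] * np.sqrt(np.clip(lam[order], 0.0, None))

# Agreement up to per-column sign flips.
print(np.allclose(np.abs(pca_scores), np.abs(embed[:, :4]), atol=1e-8))
```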

Some useful observations • Your similarity matrix is Euclidean iff it has no negative eigenvalues (i.e. it is a kernel matrix and PSD) • By similar reasoning, your distance matrix is Euclidean iff the similarity matrix derived from it is PSD • If the feature space is small but the number of samples is large, then the covariance matrix is small and it is better to do normal PCA (on the covariance matrix) • If the feature space is large and the number of samples is small, then the kernel matrix will be small and it is better to do kernel embedding

Part II: Non-Euclidean Manifolds

Non-linear data • Much of the data in computer vision lies in a high dimensional feature space but is constrained in some way – The space of all images of a face is a subspace of the space of all possible images – The subspace is highly non-linear but low dimensional (described by a few parameters)

Non-linear data • This cannot be exploited by the linear subspace methods like PCA – These assume that the subspace is a Euclidean space as well • A classic example is the ‘swiss roll’ data:

‘Flat’ Manifolds • Fundamentally different types of data exist; consider the swiss roll example • The embedding of this data into the high-dimensional space is highly curved – This is called extrinsic curvature, the curvature of the manifold with respect to the embedding space • Now imagine that this manifold was a piece of paper; you could unroll the paper into a flat plane without distorting it – No intrinsic curvature; in fact it is homeomorphic to Euclidean space

Curved manifold • This manifold is different: it must be stretched to map it onto a plane – It has non-zero intrinsic curvature • A flatlander living on this manifold can tell that it is curved, for example by measuring the ratio of the radius to the circumference of a circle • In the first case, we might still hope to find a Euclidean embedding • We can never find a distortion-free Euclidean embedding of the second (in the sense that the distances will always have errors)

Intrinsically Euclidean Manifolds • We cannot use the previous methods on the second type of manifold, but there is still hope for the first • The manifold is embedded in Euclidean space, but Euclidean distance is not the correct way to measure distance • The Euclidean distance 'shortcuts' the manifold • The geodesic manifold distance calculates the shortest path along the manifold itself

Geodesics • The geodesic generalizes the concept of distance to curved manifolds – The shortest path joining two points which lies completely within the manifold • If we can correctly compute the geodesic distances, and the manifold is intrinsically flat, we should get Euclidean distances which we can plug into our Euclidean geometry machine (geodesic distances D → similarity K → position X)

ISOMAP • ISOMAP is exactly such an algorithm • Approximate geodesic distances are computed for the points from a graph • Nearest neighbours graph – For neighbours, Euclidean distance ≈ geodesic distance – For non-neighbours, geodesic distance approximated by shortest distance in graph • Once we have distances D, we can use MDS to find the Euclidean embedding

ISOMAP • ISOMAP: – Neighbourhood graph – Shortest path algorithm – MDS • ISOMAP is distance-preserving – embedded distances should be close to geodesic distances
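A minimal ISOMAP sketch along these lines, using scipy's shortest-path routine for the geodesic approximation; the neighbourhood size and the toy data are arbitrary choices of mine:

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def isomap(X, n_neighbours=6, dim=2):
    """Sketch of ISOMAP: k-NN graph -> graph shortest paths -> classic MDS."""
    m = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

    W = np.full((m, m), np.inf)                # inf = not connected
    for i in range(m):
        nn = np.argsort(D[i])[1:n_neighbours + 1]
        W[i, nn] = D[i, nn]
    W = np.minimum(W, W.T)                     # symmetrise the neighbourhood graph

    G = shortest_path(W, method="D", directed=False)   # approximate geodesics
    if np.isinf(G).any():
        raise ValueError("graph is disconnected; increase n_neighbours")

    C = np.eye(m) - np.ones((m, m)) / m        # classic MDS on squared geodesics
    K = -0.5 * C @ (G ** 2) @ C
    lam, U = np.linalg.eigh(K)
    order = np.argsort(lam)[::-1][:dim]
    return U[:, order] * np.sqrt(np.clip(lam[order], 0.0, None))

# Toy usage: points on a rolled-up 2-D strip ('swiss roll' style).
rng = np.random.default_rng(3)
t = np.linspace(1.0, 3 * np.pi, 200)
roll = np.c_[t * np.cos(t), rng.uniform(0, 1, size=t.size), t * np.sin(t)]
Y = isomap(roll, n_neighbours=8, dim=2)
```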

Laplacian Eigenmap • The Laplacian Eigenmap is another graph-based method of embedding non-linear manifolds into Euclidean space • As with ISOMAP, form a neighbourhood graph for the datapoints • Find the graph Laplacian as follows • The adjacency matrix A has A_ij = exp(−d_ij²/t) if i and j are connected, and 0 otherwise • The 'degree' matrix D is the diagonal matrix D_ii = Σ_j A_ij • The normalized graph Laplacian is L = I − D^{−1/2} A D^{−1/2}

Laplacian Eigenmap • We find the Laplacian eigenmap embedding using the eigendecomposition of L: L = UΛU^T • The embedded positions are X = D^{−1/2} U • Similar to ISOMAP – Structure preserving, not distance preserving
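A sketch of this recipe (heat-kernel weights on a k-NN graph, normalised Laplacian, embedding X = D^{-1/2}U); the parameter values are illustrative only:

```python
import numpy as np

def laplacian_eigenmap(X, n_neighbours=6, t=1.0, dim=2):
    """Sketch of the Laplacian eigenmap using the normalised graph Laplacian."""
    m = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)    # squared distances

    A = np.zeros((m, m))                                    # weighted adjacency
    for i in range(m):
        nn = np.argsort(d2[i])[1:n_neighbours + 1]
        A[i, nn] = np.exp(-d2[i, nn] / t)
    A = np.maximum(A, A.T)                                  # symmetrise

    deg = A.sum(axis=1)
    Dm12 = np.diag(1.0 / np.sqrt(deg))
    L = np.eye(m) - Dm12 @ A @ Dm12                         # L = I - D^{-1/2} A D^{-1/2}

    lam, U = np.linalg.eigh(L)                              # ascending eigenvalues
    return Dm12 @ U[:, 1:dim + 1]                           # drop the trivial eigenvector

# Usage: Y = laplacian_eigenmap(data, n_neighbours=8, t=2.0, dim=2)
```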

Locally-Linear Embedding • Locally-linear Embedding is another classic method which also begins with a neighbourhood graph • We reconstruct each point i (in the original data) from a weighted sum of the neighbouring points: x̂_i = Σ_j W_ij x_j, where W_ij is 0 for any point j not in the neighbourhood (and for i = j) • We find the weights by minimising the reconstruction error min_W Σ_i |x_i − Σ_j W_ij x_j|² – Subject to the constraints that the weights are non-negative and sum to 1: W_ij ≥ 0, Σ_j W_ij = 1 • Gives a relatively simple closed-form solution

Locally-Linear Embedding • These weights encode how well a point j represents a point i, and can be interpreted as the adjacency between i and j • A low-dimensional embedding is found by then finding points y_i to minimise the error min Σ_i |y_i − Σ_j W_ij y_j|² • In other words, we find a low-dimensional embedding which preserves the adjacency relationships • The solution to this embedding problem turns out to be simply the eigenvectors of the matrix M = (I − W)^T (I − W) • LLE is scale-free: the final points have the covariance matrix I – Unit scale
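A compact LLE sketch following the two steps above. It uses the usual sum-to-one constraint with a small regulariser for the local weights; non-negativity is not enforced in this sketch:

```python
import numpy as np

def lle(X, n_neighbours=6, dim=2, reg=1e-3):
    """Sketch of Locally-Linear Embedding: local weights, then eigenvectors of M."""
    m = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    W = np.zeros((m, m))

    for i in range(m):
        nn = np.argsort(D[i])[1:n_neighbours + 1]
        Z = X[nn] - X[i]                          # neighbours relative to x_i
        G = Z @ Z.T                               # local Gram matrix
        G += reg * np.trace(G) * np.eye(len(nn))  # regularise (G may be singular)
        w = np.linalg.solve(G, np.ones(len(nn)))
        W[i, nn] = w / w.sum()                    # weights sum to 1

    M = (np.eye(m) - W).T @ (np.eye(m) - W)
    lam, U = np.linalg.eigh(M)                    # ascending eigenvalues
    return U[:, 1:dim + 1]                        # drop the constant eigenvector

# Usage: Y = lle(data, n_neighbours=8, dim=2)
```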

Comparison • LLE might seem like quite a different process to the previous two, but it is actually very similar • We can interpret the process as producing a kernel matrix followed by scale-free kernel embedding: K = (λ_max − 1)I + W + W^T − W^T W (i.e. λ_max I − M, up to centring by J/n), K = UΛU^T, X = U
Summary of the three methods (representation; similarity matrix; embedding):
– ISOMAP: neighbourhood graph; from geodesic distances; X = UΛ^{1/2}
– Lap. Eigenmap: neighbourhood graph; graph Laplacian; X = D^{−1/2} U
– LLE: neighbourhood graph; reconstruction weights; X = U

Comparison • ISOMAP is the only method which directly computes and uses the geodesic distances – The other two depend indirectly on the distances through local structure • LLE is scale-free, so the original distance scale is lost, but the local structure is preserved • Computing the necessary local dimensionality to find the correct nearest neighbours is a problem for all such methods

Non-Euclidean data • Data is Euclidean iff K is positive semidefinite • Unless you are using a kernel function, this is often not true • Why does this happen?

What type of data do I have? • Starting point: distance matrix • However, we do not know a priori if our measurements are representable on a manifold – We will call them dissimilarities • Our starting point to answer the question "What type of data do I have?" will be a matrix of dissimilarities D between objects • Types of dissimilarities – Euclidean (no intrinsic curvature) – Non-Euclidean, metric (curved manifold) – Non-metric (no point-like manifold representation)

Causes • Example: Chicken pieces data • Distance by alignment • Global alignment of everything could find Euclidean distances • Only local alignments are practical

Causes • Dissimilarities may also be non-metric • The data is metric if it obeys the metric conditions: 1. D_ij ≥ 0 (non-negativity) 2. D_ij = 0 iff i = j (identity of indiscernibles) 3. D_ij = D_ji (symmetry) 4. D_ij ≤ D_ik + D_kj (triangle inequality) • Reasonable dissimilarities should meet 1 & 2

Causes • Symmetry: D_ij = D_ji • May not be symmetric by definition • Alignment: aligning i onto j may find a better solution than aligning j onto i

Causes • Triangle violations: D_ij ≤ D_ik + D_kj • 'Extended objects': an extended object k can partially match both i and j, so D_ik ≈ 0 and D_kj ≈ 0 while D_ij > 0 • Finally, noise in the measurement of D can cause all of these effects

Tests(1) • Find the similarity matrix K = −½ C D_s C • The data is Euclidean iff K is positive semidefinite (no negative eigenvalues) – K is then a kernel, with an explicit embedding from kernel embedding • We can then use K in a kernel algorithm

Tests(2) • Negative eigenfraction (NEF): NEF = Σ_{λ_i < 0} |λ_i| / Σ_i |λ_i| • Between 0 and 0.5
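A sketch of Tests (1) and (2) together, assuming the input D holds plain (unsquared) dissimilarities, which are squared internally:

```python
import numpy as np

def euclidean_tests(D, tol=1e-10):
    """Return (is_euclidean, negative eigenfraction) for a dissimilarity matrix D."""
    m = D.shape[0]
    C = np.eye(m) - np.ones((m, m)) / m
    K = -0.5 * C @ (D ** 2) @ C                    # similarity from squared dissimilarities
    lam = np.linalg.eigvalsh(K)
    is_euclidean = lam.min() >= -tol               # Euclidean iff K is PSD
    nef = np.abs(lam[lam < 0]).sum() / np.abs(lam).sum()
    return is_euclidean, nef

# Usage: genuine Euclidean distances pass; corrupting one entry typically does not.
pts = np.random.default_rng(4).normal(size=(10, 3))
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
print(euclidean_tests(D))                          # (True, ~0)
D[0, 1] = D[1, 0] = 10.0                           # one pair made inconsistently far away
print(euclidean_tests(D))                          # expected: (False, NEF > 0)
```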

Tests(3) • Check the metric conditions for your data: 1. D_ij ≥ 0 (non-negativity) 2. D_ij = 0 iff i = j (identity of indiscernibles) 3. D_ij = D_ji (symmetry) 4. D_ij ≤ D_ik + D_kj (triangle inequality) – The triangle inequality involves checking all triples – Metric data is embeddable on a (curved) Riemannian manifold
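A brute-force sketch of these checks (the triangle test loops over all triples, so it is O(m^3)); the tolerances are my own choice:

```python
import numpy as np
from itertools import permutations

def check_metric(D, tol=1e-12):
    """Check the four metric conditions for a dissimilarity matrix D."""
    m = len(D)
    off_diag = ~np.eye(m, dtype=bool)
    return {
        "non-negativity": bool((D >= -tol).all()),
        "identity": bool(np.allclose(np.diag(D), 0.0) and (D[off_diag] > tol).all()),
        "symmetry": bool(np.allclose(D, D.T)),
        "triangle": all(D[i, j] <= D[i, k] + D[k, j] + tol
                        for i, j, k in permutations(range(m), 3)),
    }
```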

Corrections • If the data is non-metric or non-Euclidean, we can 'correct' it • Symmetry violations – Average: D_ij ← ½(D_ij + D_ji), or use the min/max of D_ij and D_ji as appropriate • Triangle violations – Constant offset: D_ij ← D_ij + c (i ≠ j) – This will also remove non-Euclidean behaviour for large enough c • Euclidean violations – Discard negative eigenvalues • There are many other approaches* * "On Euclidean corrections for non-Euclidean dissimilarities", Duin, Pekalska, Harol, Lee and Bunke, S+SSPR 08
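A sketch of the first two corrections: symmetrise by averaging, then add a constant offset to the off-diagonal entries. Choosing c as the largest observed triangle violation is my own heuristic for "large enough"; it guarantees metricity but not necessarily Euclidean behaviour:

```python
import numpy as np

def correct_dissimilarities(D, offset=None):
    """Symmetrise D by averaging, then add a constant c to all off-diagonal entries."""
    D = 0.5 * (D + D.T)                            # D_ij <- (D_ij + D_ji) / 2
    m = len(D)
    if offset is None:
        # Largest triangle violation: adding this much removes all violations.
        viol = [D[i, j] - D[i, k] - D[k, j]
                for i in range(m) for j in range(m) for k in range(m)
                if len({i, j, k}) == 3]
        offset = max(0.0, max(viol, default=0.0))
    return D + offset * (1.0 - np.eye(m)), offset  # D_ij <- D_ij + c for i != j
```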

Part III: Advanced techniques for non-Euclidean Embeddings

Known Manifolds • Sometimes we have data which lies on a known but non-Euclidean manifold • Examples in Computer Vision – Surface normals – Rotation matrices – Flow tensors (DT-MRI) • This is not Manifold Learning, as we already know what the manifold is • What tools do we need to be able to process data like this?

– As before, distances are the key

Example: 2D direction • Direction of an edge in an image, encoded as a unit vector x = (x_1, x_2)^T • The average of the direction vectors isn't even a direction vector (not unit length), let alone the correct 'average' direction • The normal definition of the mean, x̄ = (1/n) Σ_i x_i, is not correct – Because the manifold is curved

Tangent space • The tangent space (T_P) is the Euclidean space which is parallel to the manifold (M) at a particular point (P) • The tangent space is a very useful tool because it is Euclidean

Exponential Map • Exponential map: Exp_P : T_P → M • Exp_P maps a point X on the tangent plane onto a point A on the manifold: A = Exp_P(X) – P is the centre of the mapping and is at the origin on the tangent space – The mapping is one-to-one in a local region of P – The most important property of the mapping is that distances to the centre P are preserved: d(X, P) on T_P = d(A, P) on M – The geodesic distance on the manifold equals the Euclidean distance on the tangent plane (for distances to the centre only)

Exponential map • The log map goes the other way, from manifold to tangent plane: Log_P : M → T_P, X = Log_P(A)

Exponential Map • Example on the circle: embed the circle in the complex plane • The manifold representing the circle is the set of complex numbers with magnitude 1, which can be written x + iy = exp(iθ) • A point on the circle is P = e^{iθ_P} • In this case it turns out that the map is related to the normal exp and log functions: X = Log_P(A) = −i log(A/P) = −i log(e^{iθ_A}/e^{iθ_P}) = θ_A − θ_P, and A = Exp_P(X) = P exp(iX) = exp(i(θ_A − θ_P)) e^{iθ_P} = e^{iθ_A}

Intrinsic mean • The mean of a set of samples is usually defined as the sum of the samples divided by the number – This is only true in Euclidean space • A more general formula: x̄ = arg min_x Σ_i d_g²(x, x_i) • Minimises the distances from the mean to the samples (equivalent in Euclidean space)

Intrinsic mean • We can compute this intrinsic mean using the exponential map • If we knew what the mean M was, then we could use it as the centre of a map: X_i = Log_M(A_i) • From the properties of the Exp-map, the distances are the same: d_e(X_i, M) = d_g(A_i, M) • So the mean on the tangent plane is equal to the mean on the manifold

Intrinsic mean • Start with a guess at the mean and move towards the correct answer • This gives us the following algorithm – Guess at a mean M_0 1. Map onto the tangent plane using M_k 2. Compute the mean on the tangent plane to get the new estimate M_{k+1}: M_{k+1} = Exp_{M_k}( (1/n) Σ_i Log_{M_k}(A_i) )

Intrinsic Mean • For many manifolds, this procedure will converge to the intrinsic mean – Convergence not always guaranteed • Other statistics and probability distributions on manifolds are problematic.

– Can hypothesise a normal distribution on the tangent plane, but distortions are inevitable
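A sketch of the intrinsic-mean iteration on the simplest manifold, the unit circle, using the circle's Log/Exp maps from the earlier example (working with angles; the wrapping helper is my own):

```python
import numpy as np

def intrinsic_mean_circle(angles, n_iter=20):
    """Intrinsic mean of points on the unit circle, given as angles in radians."""
    wrap = lambda a: np.angle(np.exp(1j * a))      # wrap to (-pi, pi]
    mean = angles[0]                               # initial guess M_0
    for _ in range(n_iter):
        tangent = wrap(angles - mean)              # X_i = Log_M(A_i)
        mean = wrap(mean + tangent.mean())         # M_{k+1} = Exp_M(mean of the X_i)
    return mean

# Two directions near +/-pi: the naive average of the angles is 0 (the wrong side
# of the circle); the intrinsic mean comes out near +/-pi as it should.
print(intrinsic_mean_circle(np.array([3.0, -3.0])))
```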

Some useful manifolds and maps • Directional vectors (surface normals etc.): unit vectors a with ⟨a, a⟩ = 1 • Log map: x = (θ/sin θ)(a − p cos θ), where cos θ = ⟨a, p⟩ • Exp map: a = p cos θ + (sin θ/θ) x, where θ = ‖x‖ • a, p are unit vectors; x lies in an (n−1)-D space (the tangent plane at p)
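A direct numpy transcription of these two maps for unit vectors, with a round-trip check; the small-angle guards are my own additions:

```python
import numpy as np

def sphere_log(p, a):
    """Log map: tangent vector at p pointing towards a, length = geodesic distance."""
    cos_t = np.clip(np.dot(a, p), -1.0, 1.0)
    theta = np.arccos(cos_t)
    if theta < 1e-12:
        return np.zeros_like(p)
    return (theta / np.sin(theta)) * (a - p * cos_t)

def sphere_exp(p, x):
    """Exp map: walk along the geodesic from p in tangent direction x."""
    theta = np.linalg.norm(x)                       # theta = |x|
    if theta < 1e-12:
        return p.copy()
    return p * np.cos(theta) + (np.sin(theta) / theta) * x

# Round trip: Exp_p(Log_p(a)) recovers a, and |Log_p(a)| is the angle between p and a.
p = np.array([0.0, 0.0, 1.0])
a = np.array([1.0, 0.0, 1.0]) / np.sqrt(2)
x = sphere_log(p, a)
print(np.allclose(sphere_exp(p, x), a), np.isclose(np.linalg.norm(x), np.pi / 4))
```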

Some useful manifolds and maps • Symmetric positive definite matrices (covariance, flow tensors etc.): A such that u^T A u > 0 for all u ≠ 0 • Log map: X = P^{1/2} log(P^{−1/2} A P^{−1/2}) P^{1/2} • Exp map: A = P^{1/2} exp(P^{−1/2} X P^{−1/2}) P^{1/2} • A is symmetric positive definite, X is just symmetric • log is the matrix log, defined as a generalized matrix function
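These maps translate directly into scipy's matrix functions (sqrtm, logm, expm); the round-trip check below is a sketch rather than a production implementation:

```python
import numpy as np
from scipy.linalg import sqrtm, logm, expm, inv

def spd_log(P, A):
    """Log map for SPD matrices: X = P^{1/2} logm(P^{-1/2} A P^{-1/2}) P^{1/2}."""
    Ph = sqrtm(P)
    Phi = inv(Ph)
    return Ph @ logm(Phi @ A @ Phi) @ Ph

def spd_exp(P, X):
    """Exp map for SPD matrices: A = P^{1/2} expm(P^{-1/2} X P^{-1/2}) P^{1/2}."""
    Ph = sqrtm(P)
    Phi = inv(Ph)
    return Ph @ expm(Phi @ X @ Phi) @ Ph

# Round trip on two small SPD matrices.
P = np.array([[2.0, 0.3], [0.3, 1.0]])
A = np.array([[1.5, -0.2], [-0.2, 0.8]])
X = spd_log(P, A)                                  # symmetric tangent matrix
print(np.allclose(spd_exp(P, X), A, atol=1e-8))    # True
```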

Some useful manifolds and maps • Orthogonal matrices (rotation matrices, eigenvector matrices): A with AA^T = I • Log map: X = log(P^T A) • Exp map: A = P exp(X) • A is orthogonal, X is antisymmetric (X + X^T = 0) • These are the matrix exp and log functions as before • In fact there are multiple solutions to the matrix log – Only one is the required real antisymmetric matrix; not easy to find – The rest are complex

Embedding on S^n • On S² (the surface of a sphere in 3D) the following parameterisation is well known: x = (r sin θ cos φ, r sin θ sin φ, r cos θ)^T • The distance between two points (the length of the geodesic) is d_xy = r cos^{−1}( sin θ_x sin θ_y cos(φ_x − φ_y) + cos θ_x cos θ_y )

More Spherical Geometry • But on a sphere, the distance is the arc-length, d_xy = r θ_xy – Much neater to use the inner-product: ⟨x, y⟩ = r² cos θ_xy, so d_xy = r cos^{−1}( ⟨x, y⟩ / r² ) – And this works in any number of dimensions

Spherical Embedding • Say we had the distances between some objects (d_ij), measured on the surface of a [hyper]sphere of dimension n • The sphere (and objects) can be embedded into an (n+1)-dimensional space – Let X be the matrix of point positions; Z = XX^T is then a kernel matrix • But d_ij = r cos^{−1}( ⟨x_i, x_j⟩ / r² ), so Z_ij = ⟨x_i, x_j⟩ = r² cos(d_ij / r) • We can compute Z from D and find the spherical embedding!

Spherical Embedding • But wait, we don't know what r is! • The distances D are non-Euclidean, and if we use the wrong radius, Z is not a kernel matrix – Negative eigenvalues • Use this to find the radius – Choose r to minimise the negative eigenvalues: r* = arg min_r Σ_{λ_i(Z(r)) < 0} |λ_i(Z(r))|
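A sketch of this procedure: scan a set of candidate radii, keep the one whose Z(r) has the least negative-eigenvalue mass, then embed Z(r) like a kernel matrix. The grid search and the unit-sphere test data are my own choices:

```python
import numpy as np

def spherical_embedding(D, radii, dim=3):
    """Sketch: pick r minimising the negative eigenvalue mass of Z(r) = r^2 cos(D/r)."""
    def neg_mass(r):
        lam = np.linalg.eigvalsh(r ** 2 * np.cos(D / r))
        return np.abs(lam[lam < 0]).sum()

    r = min(radii, key=neg_mass)                   # crude 1-D search
    lam, U = np.linalg.eigh(r ** 2 * np.cos(D / r))
    order = np.argsort(lam)[::-1][:dim]
    X = U[:, order] * np.sqrt(np.clip(lam[order], 0.0, None))
    return X, r

# Usage: geodesic distances of points on a unit sphere should give r close to 1.
rng = np.random.default_rng(5)
P = rng.normal(size=(15, 3))
P /= np.linalg.norm(P, axis=1, keepdims=True)      # points on the unit sphere
D = np.arccos(np.clip(P @ P.T, -1.0, 1.0))         # geodesic distances (true r = 1)
X, r = spherical_embedding(D, radii=np.linspace(0.5, 2.0, 31))
print(round(float(r), 2))                          # expected: ~1.0
```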

Example: Texture Mapping • As an alternative to unwrapping an object onto a plane and texture-mapping the plane • Embed onto a sphere and texture-map the sphere (figures: plane vs. sphere texture maps)

Backup slides

Laplacian and related processes • As well as embedding objects onto manifolds, we can model many interesting processes on manifolds • Example: the way 'heat' flows across a manifold can be very informative • Heat equation: du/dt = ∇²u • ∇² is the Laplacian; in 3D Euclidean space it is ∇² = ∂²/∂x² + ∂²/∂y² + ∂²/∂z² • On a sphere it is ∇² = (1/(r² sin²θ)) ∂²/∂φ² + (1/(r² sin θ)) ∂/∂θ ( sin θ ∂/∂θ )

Heat flow • Heat flow allows us to do interesting things on a manifold • Smoothing: Heat flow is a diffusion process (will smooth the data) • Characterising the manifold (heat content, heat kernel coefficients...) • The Laplacian depends on the geometry of the manifold – We may not know this – It may be hard to calculate explicitly • Graph Laplacian

Graph Laplacian • Given a set of datapoints on the manifold, describe them by a graph – Vertices are datapoints, edges are the adjacency relation • Adjacency matrix (for example): A_ij = exp(−d_ij²/σ²) • Degree matrix: V_ii = Σ_j A_ij • Then the graph Laplacian is L = V − A, which plays the role of the manifold Laplacian

Heat Kernel • Using the graph Laplacian, we can easily implement heat flow methods on the manifold using the heat kernel • Heat equation: du/dt = −Lu • Heat kernel: H = exp(−Lt) • We can diffuse a function on the manifold by f' = Hf
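A short sketch of heat-kernel smoothing on a point cloud using this graph Laplacian; the Gaussian width, diffusion time and random data are arbitrary:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(6)
pts = rng.normal(size=(40, 2))
d2 = ((pts[:, None] - pts[None, :]) ** 2).sum(-1)

sigma = 1.0
A = np.exp(-d2 / sigma ** 2)                   # A_ij = exp(-d_ij^2 / sigma^2)
np.fill_diagonal(A, 0.0)
V = np.diag(A.sum(axis=1))                     # degree matrix
L = V - A                                      # graph Laplacian

t = 0.05
H = expm(-L * t)                               # heat kernel H = exp(-Lt)
f = rng.normal(size=40)                        # a noisy function on the 'manifold'
f_smooth = H @ f                               # f' = Hf, the diffused function
print(f.std() > f_smooth.std())                # diffusion reduces the variation
```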