Clustering Methods

Isomap Algorithm
http://isomap.stanford.edu/
Yuri Barseghyan
Yasser Essiarab
University of Joensuu
Dept. of Computer Science
P.O. Box 111
FIN- 80101 Joensuu
Tel. +358 13 251 7959
fax +358 13 251 7955
www.cs.joensuu.fi
Linear Methods for Dimensionality Reduction
– PCA (Principal Component Analysis): rotate data so that
principal axes lie in direction of maximum variance
– MDS (Multi-Dimensional Scaling): find coordinates that best
preserve pairwise distances
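For reference, classical (metric) MDS has a closed-form solution, sketched below. The symbols are our own notation and do not appear in the slides: D^{(2)} is the n×n matrix of squared pairwise distances, J is the centering matrix, and V_d, Λ_d hold the d largest eigenpairs of B.

```latex
J = I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^{\top}, \qquad
B = -\tfrac{1}{2}\, J\, D^{(2)} J, \qquad
B = V \Lambda V^{\top}, \qquad
Y = V_d\, \Lambda_d^{1/2}
```

The rows of Y are the low-dimensional coordinates. When the input distances are Euclidean, this embedding agrees with the PCA projection up to rotation and reflection.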
Limitations of Linear methods
• What if the data does not lie within a linear subspace?
• Do all convex combinations of the measurements generate
plausible data?
• Low-dimensional non-linear Manifold embedded in a higher
dimensional space
http://www.cs.unc.edu/Courses/comp290-090-s06/Lecturenotes/DimReduction1.pdf
Non-linear Dimensionality Reduction
• What about data that cannot be described by linear combination of
latent variables?
– Ex: swiss roll, s-curve
• In the end, linear methods do nothing more than “globally transform” (rotate/translate/scale) the data.
• Sometimes the data needs to be “unwrapped” first
[Figure label: PCA]
http://www.cs.unc.edu/Courses/comp290-090-s06/Lecturenotes/DimReduction2.pdf
Non-linear Dimensionality Reduction
• Unwrapping the data = “manifold learning”
• Assume data can be embedded on a lower-dimensional manifold
• Given data set $X = \{x_i\}_{i=1}^{n}$, find representation $Y = \{y_i\}_{i=1}^{n}$ where Y lies on a lower-dimensional manifold
• Instead of preserving global pairwise distances, non-linear
dimensionality reduction tries to preserve only the geometric
properties of local neighborhoods
Isometry
• From Mathworld: two Riemannian manifolds M and N are
isometric if there is a diffeomorphism such that the Riemannian
metric from one pulls back to the metric on the other.
For a complete Riemannian manifold:
d(x, y) = geodesic distance between x and y
• Informally, an isometry is a smooth invertible mapping that looks
locally like a rotation plus translation
• Intuitively, for 2-dimensional case, isometries include whatever
physical transformations one can perform on a sheet of paper
without introducing tears, holes, or self-intersections
Trustworthiness [2]
The trustworthiness quantifies how trustworthy a projection of a high-dimensional data set onto a low-dimensional space is.
Specifically, a projection is trustworthy if the t nearest neighbors of each data point in the low-dimensional space are also close by in the original space.
$$M(t) = 1 - \frac{2}{v t (2v - 3t - 1)} \sum_{i=1}^{v} \sum_{j \in U_t(i)} \bigl( r(i, j) - t \bigr)$$
Here v is the number of data points, and r(i, j) is the rank of the data point j in the ordering according to the distance from i in the original data space.
U_t(i) denotes the set of those data points that are among the t nearest neighbors of the data point i in the low-dimensional space but not in the original space.
The maximal value that trustworthiness can take is one.
The closer M(t) is to one, the better the low-dimensional space describes the original data.
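The definition above can be sketched in a few lines of standard-library Python. This is an illustrative implementation under our own assumptions, not code from the cited thesis: the helper `ranks_from`, the toy data, and the tie-breaking by index are hypothetical choices, and the normalization is only meaningful when t is smaller than v/2.

```python
import math

def ranks_from(points, i):
    """Return {j: rank}; rank 1 is the nearest neighbor of points[i]."""
    order = sorted(
        (j for j in range(len(points)) if j != i),
        key=lambda j: math.dist(points[i], points[j]),
    )
    return {j: r for r, j in enumerate(order, start=1)}

def trustworthiness(X_high, Y_low, t):
    """M(t) = 1 - 2 / (v t (2v - 3t - 1)) * sum_i sum_{j in U_t(i)} (r(i, j) - t)."""
    v = len(X_high)
    penalty = 0
    for i in range(v):
        r_high = ranks_from(X_high, i)  # ranks in the original space
        r_low = ranks_from(Y_low, i)    # ranks in the projection
        # U_t(i): among the t nearest in the projection but not in the original space
        U = [j for j in r_low if r_low[j] <= t and r_high[j] > t]
        penalty += sum(r_high[j] - t for j in U)
    return 1.0 - 2.0 / (v * t * (2 * v - 3 * t - 1)) * penalty

# Toy data (hypothetical): ten points on a line in 3D.
X_high = [(float(i), 0.0, 0.0) for i in range(10)]
Y_good = [(float(i),) for i in range(10)]                        # keeps the order
Y_bad = [(float(p),) for p in (0, 5, 1, 6, 2, 7, 3, 8, 4, 9)]   # scrambles it

m_good = trustworthiness(X_high, Y_good, t=2)
m_bad = trustworthiness(X_high, Y_bad, t=2)
```

A projection that keeps every neighborhood intact incurs no penalty, while the scrambled projection pulls distant points into each other's neighborhoods and scores lower.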
Several methods to learn a manifold
• Two to start:
– Isomap [Tenenbaum et al., 2000]
– Locally Linear Embedding (LLE) [Roweis and Saul, 2000]
• Recently:
– Semidefinite Embeddings (SDE) [Weinberger and Saul, 2005]
An important observation
• Small patches on a non-linear manifold look linear
• These locally linear neighborhoods can be defined in two ways
– k-nearest neighbors: find the k nearest points to a given point, under some metric. Guarantees all items are similarly represented, limits dimension to k−1
– ε-ball: find all points that lie within ε of a given point, under some metric. Best if density of items is high and every point has a sufficient number of neighbors
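The two neighborhood definitions can be compared on a tiny example. The sketch below assumes Euclidean distance as the metric; the function names and the sample points are hypothetical.

```python
import math

def knn_neighbors(points, i, k):
    """Indices of the k points closest to points[i], excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]

def eps_ball_neighbors(points, i, eps):
    """Indices of all points within distance eps of points[i], excluding i."""
    return [j for j in range(len(points))
            if j != i and math.dist(points[i], points[j]) <= eps]

# Three points close together and one isolated point.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]

knn_of_outlier = knn_neighbors(points, 3, k=2)             # always k neighbors
ball_of_outlier = eps_ball_neighbors(points, 3, eps=1.5)   # may be empty
ball_of_middle = eps_ball_neighbors(points, 1, eps=1.5)
```

The isolated point still receives k neighbors under the k-nearest-neighbor rule but none under the ε-ball rule, which is why the ε-ball works best when the sampling density is high everywhere.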
http://www.cs.unc.edu/Courses/comp290-090-s06/Lecturenotes/DimReduction1.pdf
Isomap
• Find coordinates on lower-dimensional manifold that preserve
geodesic distances instead of Euclidean distances
• Key observation: if the goal is to discover the underlying manifold, geodesic distance makes more sense than Euclidean distance
[Figure: two points on the manifold separated by a small Euclidean distance but a large geodesic distance]
http://www.cs.unc.edu/Courses/comp290-090-s06/Lecturenotes/DimReduction1.pdf
Calculating geodesic distance
• We know how to calculate Euclidean distance
• Locally linear neighborhoods mean that we can approximate
geodesic distance within a neighborhood using Euclidean distance
• A graph is constructed by connecting each point to its K nearest neighbours.
• Approximate geodesic distances are calculated by finding the length of the shortest path in the graph between points
• Use Dijkstra’s algorithm to fill in remaining distances
http://www.maths.lth.se/bioinformatics/calendar/20040527/NilssonJ_KI_27maj04.pdf
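As a small illustration of the graph-based approximation, the sketch below samples a half circle of radius 1, builds a symmetric K-nearest-neighbour graph, and compares the straight-line distance between the two endpoints with the shortest-path distance through the graph. The data and function names are hypothetical, and Floyd-Warshall is used instead of Dijkstra purely to keep the example short.

```python
import math

def knn_graph(points, k):
    """Symmetric k-NN graph as a dense matrix; math.inf marks 'no edge'."""
    n = len(points)
    W = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        W[i][i] = 0.0
        nearest = sorted((j for j in range(n) if j != i),
                         key=lambda j: math.dist(points[i], points[j]))[:k]
        for j in nearest:
            # Within a neighborhood, Euclidean distance stands in for geodesic distance.
            W[i][j] = W[j][i] = math.dist(points[i], points[j])
    return W

def all_pairs_shortest_paths(W):
    """Floyd-Warshall, O(n^3); chosen here for brevity."""
    n = len(W)
    D = [row[:] for row in W]
    for m in range(n):
        for i in range(n):
            for j in range(n):
                if D[i][m] + D[m][j] < D[i][j]:
                    D[i][j] = D[i][m] + D[m][j]
    return D

n = 21
arc = [(math.cos(math.pi * s / (n - 1)), math.sin(math.pi * s / (n - 1)))
       for s in range(n)]
D = all_pairs_shortest_paths(knn_graph(arc, k=2))

euclidean = math.dist(arc[0], arc[-1])  # chord straight across the half circle
geodesic_estimate = D[0][-1]            # path that follows the sampled curve
```

The chord across the half circle has length 2, whereas the true distance along the curve is π. On this example the graph estimate falls between the two and lies close to π.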
Dijkstra’s Algorithm
• Greedy algorithm (breadth-first search generalized to weighted edges) that computes the shortest paths from one point to all other points; edge weights must be non-negative
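A minimal sketch of Dijkstra's algorithm using a binary heap from the Python standard library is shown below. The adjacency-list format and the four-node example graph are assumptions made for illustration.

```python
import heapq
import math

def dijkstra(graph, source):
    """Shortest-path lengths from source; graph is {node: [(neighbor, weight), ...]}."""
    dist = {node: math.inf for node in graph}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)  # greedy step: closest unsettled node
        if d > dist[u]:
            continue                # stale heap entry, already improved
        for v, w in graph[u]:
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# Small undirected example graph (hypothetical weights).
graph = {
    "a": [("b", 1.0), ("c", 4.0)],
    "b": [("a", 1.0), ("c", 2.0), ("d", 5.0)],
    "c": [("a", 4.0), ("b", 2.0), ("d", 1.0)],
    "d": [("b", 5.0), ("c", 1.0)],
}
distances = dijkstra(graph, "a")
```

Running it once from every point fills in the full matrix of approximate geodesic distances.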
http://www.cs.unc.edu/Courses/comp290-090-s06/Lecturenotes/DimReduction2.pdf
Isomap Algorithm
– Compute fully-connected neighborhood of points for each item
  • Can be k nearest neighbors or ε-ball
– Calculate pairwise Euclidean distances within each neighborhood
– Use Dijkstra’s Algorithm to compute shortest path from each point to non-neighboring points
– Run MDS on resulting distance matrix
http://www.cs.unc.edu/Courses/comp290-090-s06/Lecturenotes/DimReduction2.pdf
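The four steps above can be strung together in a compact, dependency-free sketch. It is an illustration under stated assumptions rather than a reference implementation: it uses k nearest neighbors, replaces Dijkstra with Floyd-Warshall for brevity, and finds the leading eigenvectors for classical MDS by power iteration with deflation. The function name, its parameters, and the half-circle test data are hypothetical.

```python
import math

def isomap(points, k, d, iters=200):
    """Sketch of the four Isomap steps; returns n rows of d coordinates."""
    n = len(points)
    # Steps 1-2: symmetric k-NN graph, edges weighted by Euclidean distance.
    G = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        G[i][i] = 0.0
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: math.dist(points[i], points[j]))
        for j in order[:k]:
            G[i][j] = G[j][i] = math.dist(points[i], points[j])
    # Step 3: all-pairs shortest paths (Floyd-Warshall for brevity, not Dijkstra).
    for m in range(n):
        for i in range(n):
            for j in range(n):
                if G[i][m] + G[m][j] < G[i][j]:
                    G[i][j] = G[i][m] + G[m][j]
    if any(math.isinf(x) for row in G for x in row):
        raise ValueError("neighborhood graph is disconnected; try a larger k")
    # Step 4: classical MDS. Double-center the squared graph distances.
    S = [[x * x for x in row] for row in G]
    row_mean = [sum(row) / n for row in S]
    grand_mean = sum(row_mean) / n
    B = [[-0.5 * (S[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    # Leading d eigenpairs of B via power iteration with deflation.
    Y = [[0.0] * d for _ in range(n)]
    for c in range(d):
        v = [math.sin(i + c + 1.0) for i in range(n)]  # arbitrary fixed start vector
        for _ in range(iters):
            w = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        Bv = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(a * b for a, b in zip(v, Bv))  # Rayleigh quotient
        scale = math.sqrt(max(lam, 0.0))
        for i in range(n):
            Y[i][c] = scale * v[i]
            for j in range(n):
                B[i][j] -= lam * v[i] * v[j]     # deflate the found component
    return Y

# Half circle of radius 1 sampled at 21 points, unrolled to one dimension.
n = 21
arc = [(math.cos(math.pi * s / (n - 1)), math.sin(math.pi * s / (n - 1)))
       for s in range(n)]
Y = isomap(arc, k=2, d=1)
```

On this toy curve the one-dimensional coordinates preserve the order of the points along the arc, and their total span is close to the arc length π rather than the chord length 2. For real data sets a library implementation with Dijkstra and a proper eigensolver would be the practical choice.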
Isomap Algorithm [3]
Time Complexity of Algorithm
http://www.cs.rutgers.edu/~elgammal/classes/cs536/lectures/NLDR.pdf
Isomap Results
Find a 2D embedding of the 3D S-curve
http://www.cs.unc.edu/Courses/comp290-090-s06/Lecturenotes/DimReduction2.pdf
Residual Fitting Error
http://www.cs.unc.edu/Courses/comp290-090-s06/Lecturenotes/DimReduction2.pdf
Plotting the eigenvalues from MDS indicates the intrinsic dimensionality of the data: look for the dimension beyond which they level off
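One common way to quantify the fitting error, assuming the slide refers to the residual variance used in the original Isomap paper, is 1 − R², where R is the linear correlation between the graph distances and the Euclidean distances in the embedding. The sketch below computes it for two small distance matrices; the function name and the toy matrices are hypothetical.

```python
def residual_variance(D_graph, D_embed):
    """1 - R^2 between the entries of two equally sized distance matrices."""
    a = [x for row in D_graph for x in row]
    b = [x for row in D_embed for x in row]
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1.0 - (cov * cov) / (var_a * var_b)

# Three points on a line, and an embedding that distorts their distances.
line = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]]
distorted = [[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]

perfect = residual_variance(line, line)
lossy = residual_variance(line, distorted)
```

Evaluating this quantity for embeddings of increasing dimension and looking for the point where it stops decreasing gives the same kind of dimensionality estimate as the eigenvalue plot.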
Neighborhood Graph
http://www.cs.unc.edu/Courses/comp290-090-s06/Lecturenotes/DimReduction2.pdf
More Isomap Results
http://www.cs.unc.edu/Courses/comp290-090-s06/Lecturenotes/DimReduction2.pdf
Results on projecting the face dataset to two
dimensions (Trustworthiness−Continuity) [1]
More Isomap Results
http://www.cs.unc.edu/Courses/comp290-090-s06/Lecturenotes/DimReduction2.pdf
Isomap Failures
• Isomap has problems on closed manifolds of arbitrary topology
http://www.cs.unc.edu/Courses/comp290-090-s06/Lecturenotes/DimReduction2.pdf
Isomap: Advantages
• Nonlinear
• Globally optimal
– Still produces globally optimal low-dimensional Euclidean
representation even though input space is highly folded,
twisted, or curved.
• Guaranteed asymptotically to recover the true dimensionality.
Isomap: Disadvantages
• Guaranteed only asymptotically to recover the geometric structure of nonlinear manifolds
– As N increases, pairwise distances provide better approximations to geodesics by “hugging the surface” more closely
– Graph discreteness overestimates d_M(i, j)
• K must be chosen carefully: too large a K introduces “linear shortcuts” near regions of high surface curvature, while too small a K can leave the graph disconnected
• No direct way to map novel test images to the manifold space
Literature
[1] Jarkko Venna and Samuel Kaski, Nonlinear dimensionality reduction viewed as information retrieval, NIPS 2006 Workshop on Novel Applications of Dimensionality Reduction, 9 Dec 2006
http://www.cis.hut.fi/projects/mi/papers/nips06_nldrws_poster.pdf
[2] Claudio Varini, Visual Exploration of Multivariate Data in Breast Cancer by Dimensional Reduction, March 2006
http://deposit.ddb.de/cgibin/dokserv?idn=98073472x&dok_var=d1&dok_ext=pdf&filename=98073472x.pdf
[3] Yiming Wu and Kap Luk Chan, An Extended Isomap Algorithm for Learning Multi-Class Manifold, Proceedings of the 2004 International Conference on Machine Learning and Cybernetics, Aug. 2004
http://ww2.cs.fsu.edu/~ywu/PDF-files/ICMLC2004.pdf