Sparse Solutions for Kernel Machines


Sparse Solutions for Large Scale Kernel Machines
Taher Dameh
CMPT820 - Multimedia Systems
[email protected]
Dec 2nd, 2010
Outline

 Introduction
 Motivation: Kernel machines applications in multimedia content analysis and search
 Challenges in large scale kernel machines
 Previous Work
 Sub-quadratic approach to compute the sparse Gram matrix
 Results
 Conclusion and future work
Introduction

 Given a set of points, with a notion of distance between points, group the points into some number of clusters.

 We use kernel functions to compute the similarity between each pair of points to produce a Similarity (Gram) matrix, which takes O(N²) space and computation (a sketch follows below).

 Examples of kernel machines:
 Support Vector Machines SVM (formulated for 2 classes)
 Relevance Vector Machines (result in much sparser models)
 Gaussian Processes
 Fisher's Linear Discriminant Analysis LDA
 Kernel PCA
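
As a concrete illustration of the O(N²) cost above, here is a minimal sketch (not code from the talk) of building a dense Gram matrix with a Gaussian RBF kernel; the bandwidth gamma and the toy data are assumptions made for the example.

```python
import numpy as np

def rbf_gram_matrix(X, gamma=1.0):
    """Dense RBF (Gaussian) Gram matrix: K[i, j] = exp(-gamma * ||x_i - x_j||^2).

    Needs O(N^2) memory and computation, which is the bottleneck that
    the rest of the talk addresses.
    """
    sq_norms = np.sum(X ** 2, axis=1)                        # (N,)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    np.maximum(sq_dists, 0.0, out=sq_dists)                  # guard against tiny negatives
    return np.exp(-gamma * sq_dists)

# Example: 1,000 points in 16 dimensions -> a 1,000 x 1,000 Gram matrix.
X = np.random.randn(1000, 16)
K = rbf_gram_matrix(X, gamma=0.5)
print(K.shape)
```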
Kernel machines applications in multimedia content analysis and search

 Broadcast video summarization using clustering
 Document clustering
 Audio content discovery
 Searching one billion web images by content
Challenges and Sparse Solutions for Kernel Machines

 One of the significant limitations of many kernel methods is that the kernel function k(x,y) must be evaluated for all possible pairs x and y of training points, which can be computationally infeasible.

 Traditional algorithm analysis assumes that the data fits in main memory; it is unreasonable to make such an assumption when dealing with massive data sets such as multimedia data, web page repositories, and so on.

 Observing that kernel machines use radial basis functions, the Gram matrices have many values that are close to zero.

 We are developing algorithms to approximate the Gram matrix by a sparse one, filtering out the small similarities (see the sketch below).
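
A minimal sketch of what filtering out the small similarities could look like, assuming a threshold eps and SciPy's CSR format, and reusing the dense K from the earlier sketch. Note that thresholding a fully computed K still costs O(N²); avoiding that full computation is what the LSH approach in the following sections is for.

```python
import numpy as np
from scipy import sparse

def sparsify_gram(K, eps=1e-3):
    """Zero out similarities below eps and store the result in CSR format.

    With an RBF kernel most entries of K decay toward zero, so the
    thresholded matrix is typically very sparse.
    """
    K_thresholded = np.where(K >= eps, K, 0.0)
    return sparse.csr_matrix(K_thresholded)

# K as computed in the earlier RBF sketch.
K_sparse = sparsify_gram(K, eps=1e-3)
print(f"kept {K_sparse.nnz} of {K.size} entries")
```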
Previous Work

 Approximation based on the eigenspectrum of the Gram matrix
 The eigenspectrum decays rapidly, especially when the kernel function is radial basis (most of the information is stored in the first few eigenvectors)

 Sparse Bayesian learning
 Methods that lead to much sparser models:
 Relevance Vector Machines (RVM)
 Sparse kernel principal component analysis (sparse KPCA)

 Efficient implementation of computing the kernel function
 Space filling curves
 Locality Sensitive Hashing (OUR method)
Locality Sensitive Hashing

 Hash the data points so that the probability of collision is higher for close points.
 A family H = {h : S → U} is called (r1, r2, p1, p2)-sensitive if, for any v, q ∈ S:
 dist(v,q) < r1 → ProbH[h(v) = h(q)] ≥ p1
 dist(v,q) > r2 → ProbH[h(v) = h(q)] ≤ p2
 For a proper choice of k (shown later), p1 > p2 and r1 < r2 (r2 = c·r1, c > 1); we need the gap between p1 and p2 to be quite large.
 We concatenate k hash values: g(v) = {h1(v), …, hk(v)}.
 We compute the kernel function between the points that reside in the same bucket.
 Using this approach, for a hash table of size m (assuming the buckets have the same number of points), computing the Gram matrix has complexity N²/m (see the sketch below).
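
A minimal sketch of the bucketing idea, under assumptions not spelled out in the slides: random-projection (p-stable) hash functions for Euclidean distance, an assumed bucket width w, and k concatenated hash values; the kernel is then evaluated only within each bucket.

```python
import numpy as np
from collections import defaultdict

def lsh_buckets(X, k=8, w=4.0, seed=0):
    """Key each point by g(v) = (h_1(v), ..., h_k(v)), where
    h_i(v) = floor((a_i . v + b_i) / w) is a p-stable (random-projection)
    hash; points sharing the key land in the same bucket."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    A = rng.standard_normal((d, k))            # one random direction per hash
    b = rng.uniform(0.0, w, size=k)            # random offsets
    keys = np.floor((X @ A + b) / w).astype(int)
    buckets = defaultdict(list)
    for i, key in enumerate(map(tuple, keys)):
        buckets[key].append(i)
    return buckets

def per_bucket_gram(X, buckets, gamma=0.5):
    """RBF kernel evaluated only inside each bucket: roughly N^2 / m work
    for m buckets of similar size, instead of the full N^2."""
    grams = {}
    for key, idx in buckets.items():
        sub = X[idx]
        sq = np.sum(sub ** 2, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2.0 * sub @ sub.T
        grams[key] = np.exp(-gamma * np.maximum(d2, 0.0))
    return grams

# X as in the earlier sketch.
buckets = lsh_buckets(X, k=8, w=4.0)
grams = per_bucket_gram(X, buckets)
```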
Sub-quadratic approach using LSH

 Claim 1: The number of concatenated hash values k is logarithmic in the size of the dataset n and independent of the dimension d.
 Proof: Given a set P of n points in d-dimensional space and (r1, r2, p1, p2)-sensitive hash functions, and given a point q, the probability that a far point v (dist(v,q) > r2) falls into the same bucket, ProbH[g(v) = g(q)], is at most p2^k = B/n, where B is the average bucket size. Then we can find that k = log(n/B) / log(1/p2) = O(log n), independent of d (a numeric illustration follows).


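A small numeric illustration of Claim 1, solving p2^k ≤ B/n for k; the values chosen for p2 and the average bucket size B are assumptions for the example.

```python
import math

def required_k(n, B, p2):
    """Smallest integer k with p2**k <= B / n, i.e. k >= log(n/B) / log(1/p2).

    The bound grows logarithmically with the dataset size n and does not
    involve the dimension d, which is the content of Claim 1.
    """
    return math.ceil(math.log(n / B) / math.log(1.0 / p2))

for n in (10**4, 10**6, 10**8):
    print(n, required_k(n, B=100, p2=0.5))   # 7, 14, 20
```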
 Claim 2: The complexity of computing the approximated Gram matrix using locality sensitive hashing is sub-quadratic.
 Proof: With L buckets of roughly N/L points each, computing the (N/L)² Gram matrix of every bucket costs L·(N/L)² = N²/L in total; choosing k as in Claim 1 keeps the average bucket size around B, so the number of buckets grows with N and the total cost is about N·B, which is sub-quadratic in N.
FN ratio vs. memory reduction for different values of k
Affinity Propagation results for different values of k
Second stage of AP over the first-stage weighted exemplars

 N×d input vectors are split into m segments, each of size (N/m)×d
 Hashing the segments produces L bucket files
 Compute the Gram matrix of each bucket (Gram matrix size (N/L)²) and run the clustering algorithm on each bucket's Gram matrix, giving clusters with weights
 Combine the clusters with their weights
 Run the second phase of clustering over the weighted exemplars
 Final clusters
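
A schematic sketch of the two-stage pipeline above, reusing the buckets from the LSH sketch earlier and scikit-learn's AffinityPropagation with precomputed similarities; how the exemplar weights enter the second stage is simplified here (they are only carried along), an assumption rather than the talk's exact scheme.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

def rbf_sim(X, gamma=0.5):
    """Dense RBF similarity matrix for a (small) set of points."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def two_stage_ap(X, buckets, gamma=0.5):
    """Stage 1: run Affinity Propagation on each bucket's Gram matrix and keep
    the exemplars, each weighted by the size of the cluster it represents.
    Stage 2: run AP once more on the exemplars only (the weights are carried
    along; folding them into the AP preferences is left out of this sketch)."""
    exemplar_ids, weights = [], []
    for idx in buckets.values():
        idx = list(idx)
        if len(idx) < 2:                      # singleton bucket: point is its own exemplar
            exemplar_ids.extend(idx)
            weights.extend([1] * len(idx))
            continue
        S = rbf_sim(X[idx], gamma)
        ap = AffinityPropagation(affinity="precomputed", random_state=0).fit(S)
        for c, centre in enumerate(ap.cluster_centers_indices_):
            exemplar_ids.append(idx[centre])
            weights.append(int(np.sum(ap.labels_ == c)))
    # Second stage: cluster the weighted exemplars.
    E = X[exemplar_ids]
    ap2 = AffinityPropagation(affinity="precomputed", random_state=0).fit(rbf_sim(E, gamma))
    return np.array(exemplar_ids), np.array(weights), ap2.labels_

# buckets as produced by the earlier lsh_buckets sketch.
exemplars, weights, final_labels = two_stage_ap(X, buckets)
```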
Conclusion and future work

 Brute force kernel methods require O(N²) space and computation, and the assumption that the data fits in main memory no longer holds.

 Approximating the full Gram matrix by a sparse one, exploiting the radial basis property of such methods, reduces the quadratic cost to sub-quadratic.

 Using locality sensitive hashing we can find the close points and compute the kernel function only between them; we can also distribute the processing, since the bucket is the base unit.

 Future work: controlling the error as k increases, so that we can run on very large scale data while maintaining sufficient accuracy.