
KERNEL INDEPENDENT COMPONENT ANALYSIS
BY FRANCIS BACH & MICHAEL JORDAN
International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2003
Presented by Nagesh Adluru
Goal of the Paper
To perform Independent Component Analysis (ICA) in a novel way that is more accurate and more robust than existing techniques.
Concepts Involved
ICA – Independent Component Analysis
Mutual Information
F – Correlation
RKHS – Reproducing Kernel Hilbert Spaces
CCA – Canonical Correlation Analysis
KICA – Kernel ICA
KGV – Kernel Generalized Variance
ICA – Independent Component Analysis
ICA is unsupervised learning.
We have to estimate x given a set of observations of y (assumption: the components of x are independent).
So we have to estimate a de-mixing matrix W such that x = Wy.
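A minimal sketch of this data model, assuming two independent toy sources and an arbitrary 2x2 mixing matrix; the matrix, source shapes, and sample size are illustrative, not from the paper:

```python
# Sketch of the ICA data model: independent sources x, observed mixtures y = A x,
# and recovery x_hat = W y with W = inv(A) (up to permutation and scaling).
import numpy as np

rng = np.random.default_rng(0)
N = 1000                                   # number of observations

# Independent sources (one sub-Gaussian, one super-Gaussian)
x = np.vstack([rng.uniform(-1, 1, N),      # uniform: sub-Gaussian
               rng.laplace(0, 1, N)])      # Laplace: super-Gaussian

A = np.array([[1.0, 0.6],                  # unknown mixing matrix (illustrative)
              [0.4, 1.0]])
y = A @ x                                  # observed mixtures

# ICA looks for a de-mixing matrix W with x_hat = W @ y;
# with the true inverse, recovery is exact.
W_ideal = np.linalg.inv(A)
x_hat = W_ideal @ y
print(np.allclose(x_hat, x))               # True
```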
ICA – Independent Component Analysis
ICA is semi-parametric.
Because we do not know anything about the distributions of the components of x, that part of the problem is non-parametric.
But we do know that y is a 'linear combination' of the components of x, so the mixing model itself is parametric.
Hence the problem is semi-parametric, and kernel methods do well in such situations.
ICA – Independent Component Analysis
If we knew the distributions of the components of x, then we could work in that 'x-space' and find W using a gradient or fixed-point algorithm.
But not in practice! So how?
Since we are looking for independent components, we need to maximize independence, i.e. minimize mutual information.
Mutual Information
Mutual information is a quantity used to measure dependence among variables.
It is smallest (zero) exactly when the variables are independent.
So it looks promising as a contrast function to explore!
Prior work has focused on approximations to this quantity because it is hard to estimate for real-valued variables from finite samples.
Kernels offer better ways.
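As a toy illustration (not from the slides): for a jointly Gaussian pair with correlation rho, the mutual information has the closed form -1/2 * log(1 - rho^2), which vanishes exactly when rho = 0.

```python
# Closed-form mutual information (in nats) for a bivariate Gaussian:
# I(x1; x2) = -1/2 * log(1 - rho^2), zero exactly when rho = 0.
import numpy as np

def gaussian_mi(rho):
    """Mutual information of a bivariate Gaussian with correlation rho."""
    return -0.5 * np.log(1.0 - rho ** 2)

for rho in (0.0, 0.3, 0.9):
    print(rho, gaussian_mi(rho))
```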
F – Correlation
The F-correlation is defined as
ρ_F(x1, x2) = max_{f1, f2 ∈ F} corr( f1(x1), f2(x2) ).
If x1 and x2 are independent then ρ_F is zero, but it is the converse that matters here.
F – Correlation
Converse: if ρ_F(x1, x2) is zero, then x1 and x2 are independent.
Is that true?
It is true if F is a sufficiently large space of functions.
But it also holds if F is restricted to a reproducing kernel Hilbert space (RKHS) based on a Gaussian kernel.
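A small numerical illustration of why maximizing correlation over functions matters: x2 = x1^2 is uncorrelated with x1 yet completely dependent on it. The hand-picked candidate functions below stand in for the RKHS purely for illustration.

```python
# x2 = x1**2 has (near) zero linear correlation with x1, but correlating
# *nonlinear functions* of the two variables exposes the dependence.
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.standard_normal(5000)
x2 = x1 ** 2                               # deterministic, hence fully dependent

def corr(a, b):
    return abs(np.corrcoef(a, b)[0, 1])

print("linear corr:", corr(x1, x2))        # close to 0 despite dependence

candidates = [np.sin, np.cos, np.abs, np.square]
best = max(corr(f(x1), g(x2)) for f in candidates for g in candidates)
print("best corr over nonlinear features:", best)   # close to 1
```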
F – Correlation
Since the converse holds even when F is restricted to such an RKHS, a mutual-information-like contrast can be defined that is zero exactly when the two variables are independent.
RKHS – Reproducing Kernel Hilbert Spaces
Operations using kernels can be treated as operations in a Hilbert space.
The reproducing property of the kernel lets these Hilbert-space operations be carried out with ordinary computations on the data points, which is what makes the approach computationally practical.
So the correlation between the functions f can be interpreted as the correlation between the feature maps Φ, i.e. as a canonical correlation in feature space.
CCA – Canonical Correlation Analysis
CCA vs PCA:
PCA maximizes the variance of the projection of a single random vector.
CCA maximizes the correlation between projections of two or more random vectors, using the covariance blocks C_ij = cov(x_i, x_j).
CCA – Canonical Correlation Analysis
While PCA leads to an eigenvector problem, CCA leads to a generalized eigenvector problem.
(Eigenvector problem: Av = λv. Generalized eigenvector problem: Av = λBv.)
CCA can easily be kernelized and also generalized to more than two random vectors.
So the maximal correlation between the variables can be found efficiently, which is very nice.
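A minimal sketch of two-view linear CCA posed as exactly this kind of generalized eigenvalue problem, solved with scipy.linalg.eigh; the data matrices and the small ridge term eps are illustrative choices.

```python
# Two-view CCA as a generalized eigenvalue problem:
#   [0   Cxy] [a]         [Cxx  0 ] [a]
#   [Cyx  0 ] [b] = rho * [0   Cyy] [b]
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
n = 500
z = rng.standard_normal(n)                       # shared latent signal
X = np.column_stack([z + 0.1 * rng.standard_normal(n),
                     rng.standard_normal(n)])
Y = np.column_stack([z + 0.1 * rng.standard_normal(n),
                     rng.standard_normal(n)])

Xc, Yc = X - X.mean(0), Y - Y.mean(0)
Cxx = Xc.T @ Xc / n
Cyy = Yc.T @ Yc / n
Cxy = Xc.T @ Yc / n

p, q = Cxx.shape[0], Cyy.shape[0]
eps = 1e-6                                       # tiny ridge for numerical stability
A = np.block([[np.zeros((p, p)), Cxy],
              [Cxy.T, np.zeros((q, q))]])
B = np.block([[Cxx + eps * np.eye(p), np.zeros((p, q))],
              [np.zeros((q, p)), Cyy + eps * np.eye(q)]])

rho = eigh(A, B, eigvals_only=True)              # eigenvalues come in +/- pairs
print("largest canonical correlation:", rho[-1]) # close to 1 for this data
```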
CCA – Canonical Correlation Analysis
Although this kernelization of CCA helps us, its generalization to more than two variables is not an exact mutual-independence measure in terms of the F-correlation.
But that is not a limitation in practice, both because of empirical results and because mutual independence can be enforced through pairwise independence.
Kernel ICA
We saw that the F-correlation ρ_F(x1, x2) is zero if and only if x1 and x2 are independent, and also that ρ_F can be calculated using kernelized CCA.
So we now have Kernel ICA, not in the sense that the basic ICA algorithm is kernelized, but because the contrast function is computed using kernelized CCA.
KICA – Kernel ICA Algorithm
Input: W and the data samples y_1, …, y_N.
Procedure:
Estimate the source components x = Wy.
Minimize the contrast C(W) = -1/2 log λ_min, where K_1, …, K_m are the [N×N] Gram matrices computed from each component of the estimated random vector.
(Equivalent to a generalized CCA in which each of the m vectors is a single-element vector.)
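A sketch of evaluating this contrast with kernelized CCA: one centered Gaussian Gram matrix per estimated component, a regularized block matrix, and its smallest generalized eigenvalue. The kernel width, regularization constant, and exact regularized form follow the usual presentation of the method but should be treated as assumptions here.

```python
# KICA contrast via kernelized CCA: C(W) = -1/2 * log(lambda_min) of the
# regularized block eigenproblem built from per-component Gram matrices.
import numpy as np
from scipy.linalg import eigh

def centered_gram(x, sigma=1.0):
    """Centered Gaussian Gram matrix of a one-dimensional sample x of length N."""
    d = x[:, None] - x[None, :]
    K = np.exp(-d ** 2 / (2 * sigma ** 2))
    N = len(x)
    H = np.eye(N) - np.ones((N, N)) / N            # centering matrix
    return H @ K @ H

def kcca_contrast(components, kappa=1e-2, sigma=1.0):
    """Contrast for an (m x N) array of estimated components x_hat = W @ y."""
    m, N = components.shape
    K = [centered_gram(c, sigma) for c in components]   # centered Gram matrices
    R = [Ki + (N * kappa / 2) * np.eye(N) for Ki in K]  # regularized versions
    Kk = np.block([[R[i] @ R[i] if i == j else K[i] @ K[j]
                    for j in range(m)] for i in range(m)])
    Dk = np.block([[R[i] @ R[i] if i == j else np.zeros((N, N))
                    for j in range(m)] for i in range(m)])
    lam_min = eigh(Kk, Dk, eigvals_only=True)[0]        # smallest generalized eigenvalue
    return -0.5 * np.log(lam_min)

# Usage sketch: kcca_contrast(W @ y) is close to 0 when the estimated
# components are independent, and larger otherwise.
```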
KICA – Kernel ICA
The computational complexity of calculating the 'smallest' generalized eigenvalue of matrices of size mN is O(N^3). (Note: the eigenvalues are not directly related to the entries of W.)
But this can be reduced to O(M^2 N), where M is a constant < N, by exploiting special properties of the Gram matrix spectrum: its eigenvalues decay rapidly, so a low-rank approximation suffices.
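One standard way to obtain such rank-M approximations is an incomplete (pivoted) Cholesky factorization of the Gram matrix; a minimal sketch for a Gaussian kernel, with an illustrative tolerance.

```python
# Incomplete (pivoted) Cholesky: replace the N x N Gram matrix K by a rank-M
# factor G with K ~ G @ G.T, so downstream eigen-computations scale with M.
import numpy as np

def gauss_kernel_column(x, j, sigma=1.0):
    """Column j of the Gaussian Gram matrix of the 1-D sample x."""
    return np.exp(-(x - x[j]) ** 2 / (2 * sigma ** 2))

def incomplete_cholesky(x, tol=1e-6, sigma=1.0):
    N = len(x)
    diag = np.ones(N)                  # k(x_i, x_i) = 1 for the Gaussian kernel
    G = np.zeros((N, 0))
    while diag.max() > tol and G.shape[1] < N:
        j = int(np.argmax(diag))       # pivot: largest remaining diagonal entry
        col = gauss_kernel_column(x, j, sigma) - G @ G[j, :]
        new = (col / np.sqrt(diag[j]))[:, None]
        G = np.hstack([G, new])
        diag = np.maximum(diag - new[:, 0] ** 2, 0.0)
    return G                           # K ~= G @ G.T, with M = G.shape[1] columns

rng = np.random.default_rng(0)
x = rng.standard_normal(200)
G = incomplete_cholesky(x)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 2)
print(G.shape[1], np.abs(K - G @ G.T).max())   # typically M << N, error below tol
```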
KICA – Kernel ICA
The next crucial job is to find the W that minimizes C(W); that W is the de-mixing matrix.
Preferably the data is whitened first (via PCA) and W is restricted to be orthogonal: independent components are uncorrelated, so whitening removes the correlations up front and leaves only a rotation to search for.
The search for W in this restricted space (called the Stiefel manifold) can be done with a Riemannian metric, suggesting gradient-type algorithms.
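A sketch of the preprocessing and the orthogonality constraint: PCA whitening followed by parameterizing W as the exponential of a skew-symmetric matrix, which is one simple way to stay on the orthogonal group (the paper itself takes gradient steps along the Stiefel manifold).

```python
# Whitening plus an orthogonal parameterization of the de-mixing matrix W.
import numpy as np
from scipy.linalg import expm

def whiten(y):
    """PCA whitening of an (m x N) data array; assumes a full-rank covariance."""
    yc = y - y.mean(axis=1, keepdims=True)
    cov = yc @ yc.T / yc.shape[1]
    vals, vecs = np.linalg.eigh(cov)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T @ yc

def orthogonal_from_params(theta, m):
    """Map m*(m-1)//2 free parameters to an m x m orthogonal matrix via expm."""
    S = np.zeros((m, m))
    S[np.triu_indices(m, 1)] = theta
    S = S - S.T                        # skew-symmetric, so expm(S) is orthogonal
    return expm(S)

# Usage sketch: z = whiten(y); W = orthogonal_from_params(theta, m); x_hat = W @ z
```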
KICA – Kernel ICA
The problem of local minima can be addressed by using heuristics (instead of random initialization) for selecting the initial W.
It has also been shown empirically that a decent number of restarts solves the problem when a large number of samples is available.
KGV – Kernel Generalized Variance
The F-correlation is the 'smallest' generalized eigenvalue of the KCCA problem.
The idea with the KGV is to make use of the other eigenvalues as well.
The mutual-information contrast function is defined as
C_KGV(W) = -1/2 log( det(K_kappa) / det(D_kappa) ),
where K_kappa is the block matrix from the KCCA problem and D_kappa is its block-diagonal part.
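A short sketch of the KGV contrast, reusing the block matrices Kk and Dk built in the KCCA sketch above; it sums the logs of all generalized eigenvalues instead of using only the smallest one.

```python
# KGV contrast: -1/2 * sum_i log(lambda_i) = -1/2 * log det(inv(Dk) @ Kk).
import numpy as np
from scipy.linalg import eigh

def kgv_contrast(Kk, Dk):
    """KGV contrast from the KCCA block matrix Kk and its block-diagonal Dk."""
    lam = eigh(Kk, Dk, eigvals_only=True)   # all generalized eigenvalues
    return -0.5 * np.sum(np.log(lam))
```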
Simulation Results
The results on simulated data showed that KICA performs better than other ICA algorithms such as FastICA, JADE, and Imax, especially for larger numbers of components.
The simulated data were mixtures of a variety of source distributions: sub-Gaussian, super-Gaussian, and nearly Gaussian.
KICA is also robust to outliers.
Conclusions
This paper proposed novel kernel-based measures of independence.
The approach is flexible but computationally demanding (because of the additional eigenvalue computations in the search).
Questions!!