Independent Component Analysis


What is ICA?

"Independent component analysis (ICA) is a method for finding underlying factors or components from multivariate (multi-dimensional) statistical data. What distinguishes ICA from other methods is that it looks for components that are both statistically independent and non-Gaussian."

— A. Hyvärinen, J. Karhunen, E. Oja, 'Independent Component Analysis'

ICA

Blind Signal Separation (BSS) or Independent Component Analysis (ICA) is the identification & separation of mixtures of sources with little prior information.

• Applications include:
  – Audio processing
  – Medical data
  – Finance
  – Array processing (beamforming)
  – Coding
• … and most applications where Factor Analysis and PCA are currently used.

• While PCA seeks directions that represent the data best in a Σ|x₀ − x|² sense, ICA seeks directions that are most independent from each other.

• Often used for time-series separation of multiple targets.

ICA estimation principles

(from A. Hyvärinen, J. Karhunen, E. Oja, 'Independent Component Analysis')

• Principle 1: "Nonlinear decorrelation. Find the matrix W so that for any i ≠ j, the components y_i and y_j are uncorrelated, and the transformed components g(y_i) and h(y_j) are uncorrelated, where g and h are some suitable nonlinear functions."

• Principle 2: "Maximum non-Gaussianity." Find the local maxima of non-Gaussianity of a linear combination y = Wx under the constraint that the variance of y is constant. Each local maximum gives one independent component.

ICA mathematical approach

(from A. Hyvärinen, J. Karhunen, E. Oja, 'Independent Component Analysis')

"Given a set of observations of random variables x_1(t), x_2(t), …, x_n(t), where t is the time or sample index, assume that they are generated as a linear mixture of independent components: y = Wx, where W is some unknown matrix. Independent component analysis now consists of estimating both the matrix W and the y_i(t), when we only observe the x_i(t)."

The simple "Cocktail Party" Problem

• n sources s_1, s_2 and m = n observations x_1, x_2
• The mixing matrix A maps the sources to the observations:

x = As

[Figure: waveforms of the two source signals and of the two observed mixtures]
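A minimal NumPy sketch of this mixing model. The sine/square sources and the particular A below are illustrative assumptions, not the signals shown on the slide:

```python
# Cocktail-party mixing model x = A s with two synthetic sources.
import numpy as np

t = np.linspace(0, 1, 250)                 # sample index t
s1 = np.sin(2 * np.pi * 5 * t)             # source 1: sine wave
s2 = np.sign(np.sin(2 * np.pi * 3 * t))    # source 2: square wave
S = np.vstack([s1, s2])                    # sources, shape (n, T) with n = 2

A = np.array([[0.7, 0.3],                  # mixing matrix (unknown in practice)
              [0.4, 0.6]])
X = A @ S                                  # observations x = A s, shape (m, T), m = n
```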

Classical ICA (FastICA) estimation

[Figure: the observed (mixed) signals and the original source signals on the left, and the components recovered by ICA on the right, plotted over 250 samples]

Motivation

Two independent sources, mixed at two microphones:

x_1(t) = a_11 s_1 + a_12 s_2
x_2(t) = a_21 s_1 + a_22 s_2

The coefficients a_ij depend on the distances of the microphones from the speakers.

Motivation

Get the Independent Signals out of the Mixture

ICA Model (Noise Free)

• Use a statistical "latent variables" system with random variables s_k instead of time signals.
• x_j = a_j1 s_1 + a_j2 s_2 + … + a_jn s_n, for all j, i.e.

x = As

• The ICs s are latent variables and are unknown, AND the mixing matrix A is also unknown.
• Task: estimate A and s using only the observable random vector x.
• Let's assume that the number of ICs equals the number of observable mixtures, and that A is square and invertible.
• So after estimating A, we can compute W = A⁻¹ and hence s = Wx = A⁻¹x (see the estimation sketch below).
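A minimal sketch of estimating the de-mixing from the observations alone. scikit-learn's FastICA is used here as a stand-in for the FastICA package linked in the exercises; the library choice is an assumption, since the slides do not prescribe one:

```python
# Estimate W and s from X only; FastICA recovers the components only up to
# permutation, sign and scale (the ambiguities discussed below).
import numpy as np
from sklearn.decomposition import FastICA

# X: observations of shape (m, T), e.g. from the mixing sketch above
ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X.T).T      # estimated sources, shape (n, T)
A_est = ica.mixing_                   # estimated mixing matrix A
W_est = np.linalg.pinv(A_est)         # de-mixing matrix W ≈ A^-1
```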

Illustration

• Consider 2 ICs with the uniform distribution

p(s_i) = 1/(2√3) if |s_i| ≤ √3, and 0 otherwise,

i.e. zero mean and variance equal to 1.
• The mixing matrix A is

A = [ 2  3 ]
    [ 2  1 ]

• The edges of the parallelogram formed by the joint distribution of x = As point in the directions of the columns of A.
• So if we can estimate the joint pdf of x_1 and x_2 and then locate the edges, we can estimate A (see the plotting sketch below).
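A small sketch that reproduces this illustration numerically (NumPy and Matplotlib assumed; the scatter plot plays the role of the joint pdf):

```python
# Two uniform (non-Gaussian) ICs with zero mean and unit variance, mixed by A.
# The scatter of x = As is a parallelogram whose edges follow the columns of A.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, 5000))   # p(s_i) uniform, variance 1
A = np.array([[2.0, 3.0],
              [2.0, 1.0]])                                  # mixing matrix from the slide
X = A @ S

plt.scatter(X[0], X[1], s=2)
plt.xlabel("x1")
plt.ylabel("x2")
plt.title("Joint distribution of x = As")
plt.show()
```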

Restrictions

• The s_i must be statistically independent: p(s_1, s_2) = p(s_1) p(s_2).
• The s_i must have non-Gaussian distributions: if s_1 and s_2 are unit-variance Gaussians, the joint density of the mixtures is symmetric, e.g.

p(x_1, x_2) = (1 / 2π) exp( −(x_1² + x_2²) / 2 ),

so it contains no information about the directions of the columns of the mixing matrix A, and A cannot be estimated.
• If only one IC is Gaussian, the estimation is still possible.

Ambiguities

• Can't determine the variances (energies) of the ICs: both s and A are unknown, so any scalar multiplier in one of the sources can always be cancelled by dividing the corresponding column of A by it.
  – Fix the magnitudes of the ICs by assuming unit variance: E{s_i²} = 1
  – Only the ambiguity of sign remains
• Can't determine the order of the ICs: the terms can be freely permuted, because both s and A are unknown, so we can call any IC the first one.

ICA Principle (Non-Gaussian is Independent)

• The key to estimating A is non-Gaussianity.
• The distribution of a sum of independent random variables tends toward a Gaussian distribution (by the Central Limit Theorem): f(x_1) = f(s_1 + s_2) is closer to Gaussian than f(s_1) or f(s_2).
• Consider y = wᵀx = wᵀAs = zᵀs, where w is one of the rows of the matrix W.
• y is a linear combination of the s_i, with weights given by the z_i. Since a sum of two independent random variables is more Gaussian than the individual variables, zᵀs is more Gaussian than either of the s_i, AND becomes least Gaussian when it equals one of the s_i.
• So we could take w to be a vector that maximizes the non-Gaussianity of wᵀx. Such a w would correspond to a z with only one non-zero component, so we get back one of the s_i. (A small numerical check follows below.)

Measures of Non-Gaussianity

• We need a quantitative measure of non-Gaussianity for ICA estimation.
• Kurtosis: zero for a Gaussian (but sensitive to outliers):

kurt(y) = E{y⁴} − 3 (E{y²})²

• Entropy: largest for a Gaussian (among variables of equal variance):

H(y) = −∫ f(y) log f(y) dy

• Negentropy: zero for a Gaussian (but difficult to estimate):

J(y) = H(y_gauss) − H(y)

• Approximations:

J(y) ≈ (1/12) E{y³}² + (1/48) kurt(y)²

J(y) ∝ [ E{G(y)} − E{G(v)} ]²

where v is a standard Gaussian random variable and, for example:

G(y) = (1/a) log cosh(a·y)    or    G(y) = −exp(−a·y²/2)
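A minimal sketch of two of these measures on synthetic data (NumPy assumed); the negentropy approximation below is the [E{G(y)} − E{G(v)}]² form with the log cosh nonlinearity:

```python
# Excess kurtosis kurt(y) = E{y^4} - 3(E{y^2})^2 and the negentropy
# approximation J(y) ~ (E{G(y)} - E{G(v)})^2 with G(y) = (1/a) log cosh(a*y).
import numpy as np

def kurt(y):
    return np.mean(y**4) - 3 * np.mean(y**2) ** 2

def negentropy_approx(y, a=1.0, n_mc=100_000, seed=0):
    G = lambda u: np.log(np.cosh(a * u)) / a
    v = np.random.default_rng(seed).standard_normal(n_mc)   # standard Gaussian reference
    return (np.mean(G(y)) - np.mean(G(v))) ** 2

rng = np.random.default_rng(1)
y_gauss = rng.standard_normal(100_000)
y_unif = rng.uniform(-np.sqrt(3), np.sqrt(3), 100_000)      # unit variance, non-Gaussian
print(kurt(y_gauss), kurt(y_unif))                          # ~0 vs ~ -1.2
print(negentropy_approx(y_gauss), negentropy_approx(y_unif))  # ~0 vs clearly > 0
```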

Data Centering & Whitening

• Centering: work with x = x′ − E{x′}, where x′ is the raw data.
  – The ICs are then also zero mean, because E{s} = W E{x}.
  – This does not mean that ICA cannot estimate the mean; it just simplifies the algorithm.
  – After ICA, add W E{x′} back to the zero-mean ICs.
• Whitening: transform the x linearly so that the x̃ are white (uncorrelated with unit variance). This is done by eigenvalue decomposition (EVD):

x̃ = (E D^(-1/2) Eᵀ) x = E D^(-1/2) Eᵀ A s = Ã s,   where E{x xᵀ} = E D Eᵀ

  – So we only have to estimate the orthonormal matrix Ã.
  – An orthonormal matrix has n(n−1)/2 degrees of freedom, so for a large-dimensional A we have to estimate only about half as many parameters. This greatly simplifies ICA.
• Reducing the dimensionality of the data (keeping the dominant eigenvalues) while whitening also helps.

Computing the pre-processing steps for ICA

0) Centring: make the signals centred at zero:

x_i ← x_i − E[x_i]   for each i

1) Sphering: make the signals uncorrelated, i.e. apply a transform V to x such that Cov(Vx) = I, where Cov(y) = E[y yᵀ] denotes the covariance matrix:

V = E[x xᵀ]^(-1/2)   // can be done using the 'sqrtm' function in MatLab
x ← Vx               // for all t (indexes t dropped here); bold lowercase refers to a column vector, bold uppercase to a matrix

Scope: to make the remaining computations simpler. It is known that independent variables must be uncorrelated, so this can be fulfilled before proceeding to the full ICA (a NumPy sketch of both steps follows below).
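A minimal NumPy sketch of the two pre-processing steps; the eigendecomposition route below is equivalent to the matrix square root mentioned on the slide (NumPy is an assumption):

```python
# Centring and sphering (whitening) of the observation matrix.
import numpy as np

def center(X):
    """X has shape (d, T): subtract the mean of each signal x_i."""
    return X - X.mean(axis=1, keepdims=True)

def whiten(X):
    """Return Vx with Cov(Vx) = I, plus the sphering matrix V = (E D E^T)^(-1/2)."""
    C = np.cov(X)                         # E[xx^T] for centred x
    d, E = np.linalg.eigh(C)              # C = E diag(d) E^T
    V = E @ np.diag(d ** -0.5) @ E.T      # V = E D^(-1/2) E^T
    return V @ X, V

Xc = center(X)            # X from the mixing sketch above, shape (2, T)
Z, V = whiten(Xc)
print(np.cov(Z))          # ≈ identity matrix
```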

Computing the rotation step

(Aapo Hyvärinen, '97) This is based on the maximisation of an objective function G(.) which contains an approximate non-Gaussianity measure:

Obj(W) = Σ_{t=1}^{T} G(Wᵀ x_t) + Λ (WᵀW − I)

where W is the rotation transform sought and Λ is a Lagrange multiplier enforcing that W is an orthogonal transform, i.e. a rotation.

Setting the gradient to zero gives

∂Obj/∂W = X g(WᵀX)ᵀ + Λ W = 0

where g(.) is the derivative of G(.). Solve by fixed-point iterations.

Fixed Point Algorithm
Input: X
Random initialisation of W
Iterate until convergence:
  S ← WᵀX
  W ← X g(S)ᵀ
  W ← W (WᵀW)^(-1/2)
Output: W, S

The overall transform that takes X back to S is then (WᵀV). There are several options for g(.); each works best in special cases — see the FastICA software / tutorial for details. The effect of the Λ term (the (WᵀW)^(-1/2) step) is an orthogonal de-correlation. (A minimal NumPy transcription of this iteration follows below.)
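A minimal NumPy transcription of the fixed-point iteration above, run on whitened data Z (as produced by the sphering sketch); g = tanh, the derivative of G(y) = log cosh(y), is one of the standard nonlinearity choices:

```python
# Fixed-point rotation step on whitened data Z (shape (n, T)), following the
# slide: S <- W^T X,  W <- X g(S)^T,  W <- W (W^T W)^(-1/2).
import numpy as np

def fastica_rotation(Z, n_iter=200, seed=0):
    n, T = Z.shape
    W = np.random.default_rng(seed).standard_normal((n, n))    # random init of W
    for _ in range(n_iter):
        S = W.T @ Z                            # current component estimates
        W = Z @ np.tanh(S).T / T               # X g(S)^T, averaged over samples
        d, E = np.linalg.eigh(W.T @ W)         # symmetric orthogonalisation:
        W = W @ E @ np.diag(d ** -0.5) @ E.T   # W <- W (W^T W)^(-1/2)
    return W

W = fastica_rotation(Z)     # Z, V from the sphering sketch above
S_est = W.T @ Z             # estimated ICs; the overall transform back is (W^T V)
```

The full FastICA update also subtracts a term E{g′(wᵢᵀz)}·wᵢ from each column before the orthogonalisation, which makes convergence faster and more robust; the simplified form above is a direct transcription of the slide.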

Application domains of ICA

• Blind source separation (Bell & Sejnowski, Te-Won Lee, Girolami, Hyvärinen, etc.)
• Image denoising (Hyvärinen)
• Medical signal processing: fMRI, ECG, EEG (Makeig)
• Modelling of the hippocampus and visual cortex (Lorincz, Hyvärinen)
• Feature extraction, face recognition (Marni Bartlett)
• Compression, redundancy reduction
• Watermarking (D. Lowe)
• Clustering (Girolami, Kolenda)
• Time series analysis (Back, Valpola)
• Topic extraction (Kolenda, Bingham, Kaban)
• Scientific data mining (Kaban, etc.)

Image denoising

[Figure: original image, noisy image, Wiener-filtered result, and ICA-filtered result]

Noisy ICA Model

• x = As + n
• A … m×n mixing matrix
• s … n-dimensional vector of ICs
• n … m-dimensional random noise vector
• Same assumptions as for the noise-free model, provided we use measures of non-Gaussianity which are immune to Gaussian noise.
• So Gaussian moments are used as contrast functions, i.e.

J(y) ∝ [ E{G(y)} − E{G(v)} ]²   with, e.g.,   G(y) = −exp( −y² / (2c²) )

where v is a standard Gaussian variable and c is a constant.
• However, in pre-whitening the effect of the noise must be taken into account:

x̃ = (E{x xᵀ} − Σ)^(-1/2) x,   so that   x̃ = B s + ñ

where Σ is the noise covariance.
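A small sketch of this noise-adjusted (quasi-)whitening step, assuming the noise covariance Σ is known, e.g. σ²I (NumPy assumed; the noise level below is illustrative):

```python
# Quasi-whitening for the noisy model x = As + n: subtract the noise
# covariance Sigma from E{xx^T} before taking the inverse square root.
import numpy as np

def noisy_whiten(X, Sigma):
    C = np.cov(X) - Sigma                  # approximates Cov(As); must stay positive definite
    d, E = np.linalg.eigh(C)
    V = E @ np.diag(d ** -0.5) @ E.T       # (E{xx^T} - Sigma)^(-1/2)
    return V @ X, V

# Example: add isotropic Gaussian noise of known variance sigma2 to the mixtures X
sigma2 = 0.01
rng = np.random.default_rng(2)
Xn = X + np.sqrt(sigma2) * rng.standard_normal(X.shape)
Xn = Xn - Xn.mean(axis=1, keepdims=True)   # centre first
Z_noisy, V_noisy = noisy_whiten(Xn, sigma2 * np.eye(X.shape[0]))
```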

Exercise (part 1, Updated Nov 10)

• How would you efficiently calculate the PCA of data where the dimensionality d is much larger than the number of vector observations n?

• Download the Wisconsin data from the UC Irvine repository, extract the principal components from the data, compare scatter plots of the original data and of the data projected onto the principal components, and plot the eigenvalues.

Ex1. Part 2 to [email protected]

subject: Ex1 and last names

1. Given high-dimensional data, is there a way to know whether all possible projections of the data are Gaussian? Explain. What if there is some additive Gaussian noise?

Ex1. (cont.)

2. Use FastICA (easily found via Google): http://www.cis.hut.fi/projects/ica/fastica/code/dlcode.html

– Choose your favourite two songs
– Create 3 mixture matrices and mix them
– Apply FastICA to de-mix

Ex1 (cont.)

• Discuss the results:
  – What happens when the mixing matrix is symmetric?
  – Why did you get different results with different mixing matrices?
  – Demonstrate that you got close to the original files.
  – Try different nonlinearities of FastICA: which one is best, and can you see that from the data?

References

• Feature extraction (Images, Video): http://hlab.phys.rug.nl/demos/ica/
• Aapo Hyvarinen: ICA (1999): http://www.cis.hut.fi/aapo/papers/NCS99web/node11.html
• ICA demo step-by-step: http://www.cis.hut.fi/projects/ica/icademo/
• Lots of links: http://sound.media.mit.edu/~paris/ica.html
• Object-based audio capture demos: http://www.media.mit.edu/~westner/sepdemo.html
• Demo for BSS with "CoBliSS" (wav-files): http://www.esp.ele.tue.nl/onderzoek/daniels/BSS.html
• Tomas Zeman's page on BSS research: http://ica.fun-thom.misto.cz/page3.html
• Virtual Laboratories in Probability and Statistics: http://www.math.uah.edu/stat/index.html