#### Transcript: Principal Component Analysis

## Principal Component Analysis

### Philosophy of PCA

Introduced by Pearson (1901) and Hotelling (1933) to describe the variation in a set of multivariate data in terms of a set of uncorrelated variables. We typically have a data matrix of *n* observations on *p* correlated variables *x₁, x₂, …, xₚ*. PCA looks for a transformation of the *xᵢ* into *p* new variables *yᵢ* that are uncorrelated.

### The data matrix

| case | ht (x₁) | wt (x₂) | age (x₃) | sbp (x₄) | heart rate (x₅) |
|------|---------|---------|----------|----------|-----------------|
| 1    | 175     | 1225    | 25       | 117      | 56              |
| 2    | 156     | 1050    | 31       | 122      | 63              |
| …    | …       | …       | …        | …        | …               |
| n    | 202     | 1350    | 58       | 154      | 67              |

### Reduce dimension

The simplest way is to keep one variable and discard all the others: not reasonable!

Weight all variables equally: not reasonable (unless they have the same variance).

Take a weighted average based on some criterion. Which criterion?

### Let us write it first

Looking for a transformation of the data matrix **X** (*n* × *p*) such that

*Y = aᵀX = a₁X₁ + a₂X₂ + … + aₚXₚ*

where *a = (a₁, a₂, …, aₚ)ᵀ* is a column vector of weights with

*a₁² + a₂² + … + aₚ² = 1*

### One good criterion

Maximize the variance of the projection of the observations on the *Y* variables: find *a* so that

*Var(aᵀX) = aᵀ Var(X) a*

is maximal. The matrix **C** = Var(**X**) is the covariance matrix of the *Xᵢ* variables.
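The criterion above can be checked numerically. The sketch below, on hypothetical data, shows that the unit vector maximizing the projection variance *aᵀCa* is the top eigenvector of **C** (all variable names here are illustrative, not from the course):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data matrix X: n = 500 observations, p = 3 correlated variables.
n = 500
z = rng.normal(size=n)
X = np.column_stack([z + 0.1 * rng.normal(size=n),
                     2 * z + 0.1 * rng.normal(size=n),
                     rng.normal(size=n)])

C = np.cov(X, rowvar=False)          # covariance matrix C = Var(X)

def projection_variance(a, C):
    """Variance of the projection Y = a^T X, i.e. a^T C a."""
    return a @ C @ a

# The unit vector maximizing a^T C a is the top eigenvector of C.
eigvals, eigvecs = np.linalg.eigh(C)  # eigenvalues in ascending order
a_best = eigvecs[:, -1]

# Any other unit vector gives a smaller (or equal) projection variance.
a_other = np.array([1.0, 0.0, 0.0])
assert projection_variance(a_best, C) >= projection_variance(a_other, C)
```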

### Let us see it on a figure

*(Figure: the same point cloud projected onto two candidate directions, labeled "Good" and "Better".)*

### Covariance matrix

    C = ( v(x₁)      c(x₁,x₂)   ……   c(x₁,xₚ) )
        ( c(x₁,x₂)   v(x₂)      ……   c(x₂,xₚ) )
        ( ……………………………………………………………………………       )
        ( c(x₁,xₚ)   c(x₂,xₚ)   ……   v(xₚ)    )
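As a minimal sketch, the covariance matrix above can be obtained directly with numpy; the toy values are loosely modeled on the data-matrix slide and are illustrative only:

```python
import numpy as np

# Hypothetical toy data: rows are cases, columns are the variables x1..x5.
X = np.array([[175.0, 1225.0, 25.0, 117.0, 56.0],
              [156.0, 1050.0, 31.0, 122.0, 63.0],
              [202.0, 1350.0, 58.0, 154.0, 67.0],
              [180.0, 1200.0, 40.0, 130.0, 60.0]])

C = np.cov(X, rowvar=False)   # p x p covariance matrix

# Diagonal entries are the variances v(x_i); off-diagonal entries are
# the covariances c(x_i, x_j); the matrix is symmetric.
assert C.shape == (5, 5)
assert np.allclose(C, C.T)
assert np.allclose(np.diag(C), X.var(axis=0, ddof=1))
```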

### And so.. We find that

The direction of largest variance is given by the eigenvector **a₁** corresponding to the largest eigenvalue of the matrix **C**. The second vector, orthogonal (uncorrelated) to the first, is the one that has the second highest variance, which turns out to be the eigenvector corresponding to the second eigenvalue. And so on…

So PCA gives new variables *Yᵢ* that are linear combinations of the original variables (*xᵢ*):

*Yᵢ = aᵢ₁x₁ + aᵢ₂x₂ + … + aᵢₚxₚ* ;  *i = 1..p*

The new variables *Yᵢ* are derived in decreasing order of importance; they are called 'principal components'.

### Calculating eigenvalues and eigenvectors

The eigenvalues *λᵢ* are found by solving the equation

*det(C − λI) = 0*

The eigenvectors are the columns of the matrix **A** such that **C = A D Aᵀ**, where

    D = ( λ₁  0   ……  0  )
        ( 0   λ₂  ……  0  )
        ( ………………………………   )
        ( 0   0   ……  λₚ )
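In practice the decomposition **C = A D Aᵀ** is computed numerically rather than by solving the determinant equation by hand. A sketch on hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical correlated data: 200 observations on 4 variables.
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))
C = np.cov(X, rowvar=False)

# For a symmetric matrix, np.linalg.eigh returns the solutions of
# det(C - lambda*I) = 0 (ascending) and the orthonormal eigenvectors.
lams, A = np.linalg.eigh(C)
D = np.diag(lams)

# Reconstruction C = A D A^T
assert np.allclose(C, A @ D @ A.T)
# Each column of A satisfies C a = lambda a
assert np.allclose(C @ A[:, 0], lams[0] * A[:, 0])
```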

### An example

Let us take two variables with covariance *c > 0*:

    C = ( 1  c )
        ( c  1 )

    C − λI = ( 1−λ   c  )
             ( c    1−λ )

*det(C − λI) = (1 − λ)² − c²*

Solving this we find *λ₁ = 1 + c* and *λ₂ = 1 − c < λ₁*.
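The hand calculation above is easy to verify numerically (with an arbitrary illustrative value of *c*):

```python
import numpy as np

c = 0.6                       # any covariance 0 < c < 1
C = np.array([[1.0, c],
              [c, 1.0]])

lams = np.linalg.eigvalsh(C)  # ascending order: 1 - c, then 1 + c
assert np.allclose(lams, [1 - c, 1 + c])
```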

### and eigenvectors

Any eigenvector A = (a₁, a₂)ᵀ satisfies the condition **C**A = λA:

    CA = ( 1  c ) ( a₁ )  =  ( a₁ + c·a₂ )  =  λ ( a₁ )
         ( c  1 ) ( a₂ )     ( c·a₁ + a₂ )      ( a₂ )

For *λ₁ = 1 + c* this gives *a₁ = a₂*, and for *λ₂ = 1 − c* it gives *a₁ = −a₂*. Normalizing, we find

*A₁ = (1/√2)(1, 1)ᵀ  A₂ = (1/√2)(1, −1)ᵀ*
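The eigenvectors of the 2×2 example can likewise be checked numerically (same illustrative *c* as before; eigenvectors are defined up to sign):

```python
import numpy as np

c = 0.6
C = np.array([[1.0, c],
              [c, 1.0]])
lams, A = np.linalg.eigh(C)   # ascending eigenvalue order

# The eigenvector for lambda = 1 + c is proportional to (1, 1);
# the one for lambda = 1 - c is proportional to (1, -1).
v_plus = A[:, 1]   # eigenvalue 1 + c
v_minus = A[:, 0]  # eigenvalue 1 - c
assert np.isclose(abs(v_plus @ np.array([1, 1]) / np.sqrt(2)), 1.0)
assert np.isclose(abs(v_minus @ np.array([1, -1]) / np.sqrt(2)), 1.0)
```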

### PCA is sensitive to scale

If you multiply one variable by a scalar you get different results (can you show it?). This is because PCA uses the covariance matrix (and not the correlation matrix). PCA should therefore be applied to data that have approximately the same scale in each variable.
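Here is one way to show it, as a minimal sketch on hypothetical data: rescaling one variable (say, recording it in grams instead of kilograms) changes the first principal direction, because that variable's variance then dominates the covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 2))
X[:, 1] = X[:, 0] + 0.5 * rng.normal(size=300)   # correlated pair

def leading_direction(X):
    """First principal direction (sign-fixed for comparison)."""
    C = np.cov(X, rowvar=False)
    _, vecs = np.linalg.eigh(C)
    v = vecs[:, -1]
    return v if v[0] >= 0 else -v

v1 = leading_direction(X)

# Rescale the second variable by a factor of 1000:
X_scaled = X.copy()
X_scaled[:, 1] *= 1000.0
v2 = leading_direction(X_scaled)

# The first principal direction changes drastically: with covariance-based
# PCA, the rescaled variable now dominates the leading component.
assert not np.allclose(v1, v2, atol=0.1)
```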

### Interpretation of PCA

The new variables (PCs) have a variance equal to their corresponding eigenvalue:

*Var(Yᵢ) = λᵢ* for all *i = 1…p*

A small *λᵢ* means a component with small variance: the data change little in the direction of *Yᵢ*. The relative variance explained by each PC is given by *λᵢ / Σᵢ λᵢ*.
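A quick numerical check of this property, on hypothetical data: projecting the centered data on the eigenvectors gives components whose sample variances equal the eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical data with unequal variances along different directions.
X = rng.normal(size=(1000, 3)) @ np.array([[2.0, 0.5, 0.0],
                                           [0.0, 1.0, 0.3],
                                           [0.0, 0.0, 0.2]])
C = np.cov(X, rowvar=False)
lams, A = np.linalg.eigh(C)

# Project the centered data on the eigenvectors: Y = X_c A
Y = (X - X.mean(axis=0)) @ A

# The variance of each PC equals its eigenvalue ...
assert np.allclose(Y.var(axis=0, ddof=1), lams)
# ... and the relative variances explained sum to 1.
explained = lams / lams.sum()
assert np.isclose(explained.sum(), 1.0)
```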

### How many components to keep?

Keep enough PCs to have a cumulative variance explained by the PCs that is >50–70%. Kaiser criterion: keep PCs with eigenvalues >1. Scree plot: represents the ability of each PC to explain the variation in the data.
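The two selection rules can be sketched as follows; the eigenvalues here are made up for illustration (they sum to *p* = 5, as they would for a correlation matrix):

```python
import numpy as np

# Hypothetical eigenvalues of a standardized 5-variable dataset.
lams = np.array([2.4, 1.3, 0.7, 0.4, 0.2])

explained = lams / lams.sum()
cumulative = np.cumsum(explained)

# Rule 1: keep enough PCs to explain, say, 70% of the variance.
k_cumulative = int(np.searchsorted(cumulative, 0.70) + 1)

# Rule 2 (Kaiser): keep PCs with eigenvalue > 1.
k_kaiser = int((lams > 1).sum())

assert k_cumulative == 2   # 2.4/5 = 0.48, (2.4 + 1.3)/5 = 0.74
assert k_kaiser == 2
```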

### Do it graphically

### Interpretation of components

Look at the weights of the variables in each component. If

*Y₁ = 0.89X₁ + 0.15X₂ − 0.77X₃ + 0.51X₄*

then *X₁* and *X₃* have the highest weights (in absolute value) and so are the most important variables in the first PC. Also look at the correlation between the variables *Xᵢ* and the PCs: the circle of correlations.
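The coordinates plotted on the circle of correlations are simply the correlations between each original variable and each PC; a sketch on hypothetical data, verifying that every variable lies inside the unit circle when all PCs are kept:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
z = rng.normal(size=n)
# Hypothetical data: two strongly (anti-)correlated variables plus noise.
X = np.column_stack([z + 0.2 * rng.normal(size=n),
                     -z + 0.2 * rng.normal(size=n),
                     rng.normal(size=n)])

C = np.cov(X, rowvar=False)
lams, A = np.linalg.eigh(C)
Y = (X - X.mean(axis=0)) @ A[:, ::-1]   # PCs in decreasing-variance order

# Correlation of each original variable X_i with each PC Y_j.
corr = np.array([[np.corrcoef(X[:, i], Y[:, j])[0, 1]
                  for j in range(Y.shape[1])]
                 for i in range(X.shape[1])])

# Each variable's squared correlations across all PCs sum to 1,
# so every point lies on or inside the unit circle.
assert np.allclose((corr ** 2).sum(axis=1), 1.0)
```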

### Circle of correlation

### Normalized (standardized) PCA

If the variables have very heterogeneous variances, we standardize them. The standardized variables are

Xᵢ* = (Xᵢ − mean) / standard deviation

The new variables all have the same variance (1), so each variable has the same weight.
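A sketch of standardization on hypothetical data with very different scales; note that the covariance matrix of the standardized variables is the correlation matrix of the originals, which is why normalized PCA is also called correlation-matrix PCA:

```python
import numpy as np

rng = np.random.default_rng(5)
# Hypothetical data with very heterogeneous scales.
X = rng.normal(size=(400, 3)) * np.array([1.0, 100.0, 0.01])

# Standardize: X* = (X - mean) / standard deviation
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Every standardized variable has variance 1 ...
assert np.allclose(Xs.var(axis=0, ddof=1), 1.0)
# ... and the covariance matrix of X* equals the correlation matrix of X.
assert np.allclose(np.cov(Xs, rowvar=False), np.corrcoef(X, rowvar=False))
```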

### Application of PCA in Genomics

PCA is useful for finding new, more informative, uncorrelated features; it reduces dimensionality by rejecting low-variance features. Examples: analysis of expression data; analysis of metabolomics data (Ward et al., 2003).

### However

PCA is only powerful if the biological question is related to the highest variance in the dataset. If not, other techniques are more useful, such as Independent Component Analysis (ICA), introduced by Jutten in 1987.

### What is ICA?

### That looks like that

### The idea behind ICA

### How does it work?

### Rationale of ICA

Find the components *Sᵢ* that are as independent as possible, in the sense of maximizing some function *F(s₁, s₂, …, sₖ)* that measures independence. All ICs (except at most one) should be non-Normal. The variance of all ICs is 1. There is no hierarchy between the ICs.

### How to find ICs ?

There are many possible choices of objective function *F*. One is the mutual information:

    MI = ∫ f(s₁, s₂, …, sₖ) · log [ f(s₁, s₂, …, sₖ) / ( f₁(s₁) f₂(s₂) … fₖ(sₖ) ) ]

which is zero exactly when the joint density factorizes into the product of the marginals, i.e. when the components are independent. We use the kurtosis of the variables to approximate the distribution functions. The number of ICs is chosen by the user.
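As a minimal sketch (not the course's exact algorithm), the kurtosis-based FastICA fixed point can be implemented in a few lines: whiten the data, then iterate *w ← E[z(wᵀz)³] − 3w* with renormalization. On two mixed uniform (non-Normal) sources, the estimated ICs recover the originals up to sign and order:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 5000

# Two non-Normal (uniform) sources, linearly mixed.
S = rng.uniform(-1, 1, size=(n, 2))
mixing = np.array([[1.0, 0.6],
                   [0.4, 1.0]])
X = S @ mixing.T

# Whiten: zero mean, identity covariance.
Xc = X - X.mean(axis=0)
lams, E = np.linalg.eigh(np.cov(Xc, rowvar=False))
Z = Xc @ E / np.sqrt(lams)

# Kurtosis-based FastICA fixed point for one unit vector w:
#   w <- E[z (w^T z)^3] - 3 w, then renormalize.
w = rng.normal(size=2)
w /= np.linalg.norm(w)
for _ in range(200):
    w = (Z * (Z @ w)[:, None] ** 3).mean(axis=0) - 3 * w
    w /= np.linalg.norm(w)

# In the 2-D whitened space the second IC is just the orthogonal direction.
W = np.column_stack([w, [-w[1], w[0]]])
S_hat = Z @ W

# Each estimated IC should match one true source up to sign/order.
corr = np.abs(np.corrcoef(S.T, S_hat.T)[:2, 2:])
assert corr.max(axis=1).min() > 0.95
```

In practice one would use a tested implementation (e.g. the FastICA packages mentioned above) rather than this sketch.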

### Difference with PCA

ICA is not a dimensionality reduction technique. There is no single (exact) solution for the components; different algorithms are used (in R: FastICA, PearsonICA, MLICA). ICs are of course uncorrelated, but also as independent as possible. ICA is uninteresting for Normally distributed variables.

Example: Lee and Batzoglou (2003). Microarray expression data: 7070 genes in 59 normal human tissue samples (19 tissue types). We are not interested in reducing dimension but rather in looking for genes that show a tissue-specific expression profile (what makes tissue types different).

### PCA vs ICA

Hsiao et al. (2002) applied PCA and, by visual inspection, observed three clusters of 425 genes: liver-specific, brain-specific and muscle-specific. ICA identified more tissue-specific genes than PCA.