PowerPoint Presentation - Asia University


Correlation

Chapter17 p399

Semimetric distance – Pearson correlation coefficient or Covariance

$$\mathrm{Var}(x) = s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

How about higher-dimensional data?
- It is useful to have a similar measure to find out how much the dimensions vary from the mean with respect to each other.
- Covariance is measured between 2 dimensions.
- Suppose one has a 3-dimensional data set (X, Y, Z); then one can calculate Cov(X,Y), Cov(X,Z) and Cov(Y,Z).

$$\mathrm{Cov}(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

- To compare heterogeneous pairs of variables, define the correlation coefficient, or Pearson correlation coefficient, with -1 ≦ r_XY ≦ 1:

$$r_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{var}(X)\,\mathrm{var}(Y)}}$$

r_XY = -1 → perfect anticorrelation
r_XY = 0 → uncorrelated (as for independent variables)
r_XY = +1 → perfect correlation
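The definitions of variance, covariance and the Pearson correlation coefficient above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names and the toy data are assumptions made for this example, not part of the slides.

```python
import math

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Sample covariance with the n - 1 denominator used above."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with n >= 2")
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def variance(xs):
    # Var(x) is the covariance of x with itself.
    return covariance(xs, xs)

def pearson(xs, ys):
    # r_XY = Cov(X, Y) / sqrt(var(X) var(Y))
    return covariance(xs, ys) / math.sqrt(variance(xs) * variance(ys))

# Hypothetical toy data: two exact linear relations Y = aX + b.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_up = [2.0 * v + 1.0 for v in x]     # a > 0, r close to +1
y_down = [-3.0 * v + 4.0 for v in x]  # a < 0, r close to -1

print(variance(x))
print(pearson(x, y_up))
print(pearson(x, y_down))
```

Dividing by the square root of the two variances is what makes r_XY unitless, which is why it can compare heterogeneous pairs of variables.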

Semimetric distance – the squared Pearson correlation coefficient

• The Pearson correlation coefficient (PCC) is useful for examining correlations in the data.
• One may imagine an instance, for example, in which the same transcription factor (TF) can cause both enhancement and repression of expression. Genes whose profiles are strongly anti-correlated may then be just as related as genes whose profiles are strongly correlated.

• A better alternative in that case is the squared Pearson correlation coefficient, r_sq:

$$r_{sq} = r_{XY}^2 = \frac{[\mathrm{Cov}(X, Y)]^2}{\mathrm{var}(X)\,\mathrm{var}(Y)}$$

The squared PCC takes values in the range 0 ≦ r_sq ≦ 1:
0 → uncorrelated vectors
1 → perfectly correlated or anti-correlated vectors

The PCC and the squared PCC are measures of similarity.
Similarity and distance have an inverse relationship: similarity ↑ → distance ↓

d = 1 - r is typically used as a measure of distance.
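As a rough sketch of how similarity turns into distance, the snippet below computes d = 1 - r for both the plain and the squared coefficient. Applying the same rule to r_sq is an assumption of this example, and the two expression profiles are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson r; the (n - 1) factors of Cov and var cancel, so plain sums suffice."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def distance_from_r(xs, ys):
    # d = 1 - r: anti-correlated profiles end up far apart (d up to 2).
    return 1.0 - pearson(xs, ys)

def distance_from_r_squared(xs, ys):
    # d = 1 - r^2: correlated and anti-correlated profiles are both close.
    return 1.0 - pearson(xs, ys) ** 2

# Hypothetical profiles: one gene enhanced, one repressed over the same time course.
gene_up = [1.0, 2.0, 3.0, 4.0]
gene_down = [4.0, 3.0, 2.0, 1.0]

print(distance_from_r(gene_up, gene_down))
print(distance_from_r_squared(gene_up, gene_down))
```

With the squared coefficient the enhanced and repressed genes land next to each other, which matches the TF scenario described above.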

Semimetric distance – Pearson correlation coefficient or Covariance

- The resulting r_XY value will be larger than 0 if X and Y tend to increase together, below 0 if one tends to decrease as the other increases, and 0 if they are independent.

Remark: r_XY only tests whether there is a linear dependence, Y = aX + b.
- If two variables are independent → low r_XY.
- A low r_XY does not guarantee independence: the variables may or may not be independent, since the relation may be non-linear.
- A high r_XY is a sufficient but not necessary condition for variable dependence.
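The remark can be checked on a small hypothetical data set in which Y is completely determined by X, yet the relation is not linear. The values below are chosen for illustration only.

```python
import math

def pearson(xs, ys):
    """Pearson r computed from centred sums."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Y = X^2 on points symmetric around 0: fully dependent, but not linearly.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [v ** 2 for v in x]

r = pearson(x, y)
print(r)  # the positive and negative halves cancel in the covariance sum
```

Here r_XY vanishes even though Y is a function of X, so a low r_XY on its own says nothing about independence.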

Semimetric distance – the squared Pearson correlation coefficient

• To test for a non-linear relation among the data, one could make a transformation by variable substitution.
• Suppose one wants to test the relation u(v) = a v^n.
• Take the logarithm of both sides:
• log u = log a + n log v
• Set Y = log u, b = log a, and X = log v → a linear relation, Y = b + nX
→ log u correlates (n > 0) or anti-correlates (n < 0) with log v
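The substitution above can be tried on synthetic data. In this sketch the power law u = 3 v^2 (a = 3, n = 2) is a hypothetical choice, and recovering n and a from a least-squares line through the transformed points is an addition for illustration, not something the slides prescribe.

```python
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def slope_and_intercept(xs, ys):
    """Least-squares line Y = b + slope * X (assumed fitting method)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

# Hypothetical power-law data u = a * v^n with a = 3, n = 2.
v = [1.0, 2.0, 4.0, 8.0]
u = [3.0 * value ** 2 for value in v]

X = [math.log(value) for value in v]  # X = log v
Y = [math.log(value) for value in u]  # Y = log u

r_raw = pearson(v, u)  # below 1: the raw relation is not linear
r_log = pearson(X, Y)  # close to 1: linear after the substitution
n_est, b_est = slope_and_intercept(X, Y)
a_est = math.exp(b_est)  # b = log a

print(r_raw, r_log, n_est, a_est)
```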

Semimetric distance – Pearson correlation coefficient or Covariance matrix

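A covariance matrix is merely a collection of many covariances in the form of a d × d matrix, where X_1, ..., X_d denote the d dimensions and entry (i, j) is Cov(X_i, X_j):

$$C = \begin{pmatrix} \mathrm{Cov}(X_1, X_1) & \cdots & \mathrm{Cov}(X_1, X_d) \\ \vdots & \ddots & \vdots \\ \mathrm{Cov}(X_d, X_1) & \cdots & \mathrm{Cov}(X_d, X_d) \end{pmatrix}$$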
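A minimal sketch of building such a matrix for a 3-dimensional data set (X, Y, Z) is shown below. The helper names and the numbers are assumptions made for illustration; the diagonal holds the variances and the matrix is symmetric because Cov(X,Y) = Cov(Y,X).

```python
def covariance(xs, ys):
    """Sample covariance with the n - 1 denominator."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def covariance_matrix(columns):
    """d x d nested list whose (i, j) entry is the covariance of columns i and j."""
    d = len(columns)
    return [[covariance(columns[i], columns[j]) for j in range(d)] for i in range(d)]

# Hypothetical 3-dimensional data set, one list per dimension.
X = [1.0, 2.0, 3.0, 4.0]
Y = [2.0, 4.0, 6.0, 8.0]
Z = [4.0, 3.0, 2.0, 1.0]

C = covariance_matrix([X, Y, Z])
for row in C:
    print(row)
```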

Spearman’s rank correlation (SRC)

• One of the problems with using the PCC is that it is susceptible to being skewed by outliers: a single data point can result in two genes appearing to be correlated, even when all the other data points suggest that they are not.

• Spearman’s rank correlation (SRC) is a non-parametric measure of correlation that is robust to outliers.

• SRC is a measure that ignores the magnitude of the changes. The idea of the rank correlation is to transform the original values into ranks, and then to compute the correlation between the series of ranks.
• First we order the values of gene A and B in ascending order, and assign the lowest value rank 1. The SRC between A and B is defined as the PCC between ranked A and B.
• In case of ties, assign mid-ranks → if two values tie for ranks 5 and 6, assign each a rank of 5.5.
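The ranking step, including mid-ranks for ties, can be sketched as follows. The function name `mid_ranks` and the sample values are hypothetical and only serve to illustrate the rule described above.

```python
def mid_ranks(values):
    """Ascending ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the group while the next sorted value equals the current one.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Sorted positions start..end hold ranks start+1..end+1; use their mean.
        shared = (start + 1 + end + 1) / 2.0
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

# Hypothetical values: the two largest tie for ranks 5 and 6.
values = [0.1, 0.4, 0.2, 0.9, 0.9, 0.3]
print(mid_ranks(values))  # [1.0, 4.0, 2.0, 5.5, 5.5, 3.0]
```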

Spearman’s rank correlation

The SRC can be calculated by the following formula, where x_i and y_i denote the ranks of x and y respectively:

$$r_{SRC}(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\left[\sum_{i=1}^{n} (x_i - \bar{x})^2\right]\left[\sum_{i=1}^{n} (y_i - \bar{y})^2\right]}}$$

An approximate formula in case of ties (it is exact when there are no ties) is given by

$$r_{SRC}(X, Y) = 1 - \frac{6 \sum_{i=1}^{n} (x_i - y_i)^2}{n (n^2 - 1)}$$
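The two formulas above can be compared directly on rank vectors. In this sketch the ranks are hypothetical: one pair without ties, where both formulas should agree, and one pair with a mid-rank tie, where the shortcut is only approximate.

```python
import math

def src_as_pcc(rx, ry):
    """SRC as the Pearson correlation of the two rank series."""
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(rx, ry))
    sxx = sum((x - mx) ** 2 for x in rx)
    syy = sum((y - my) ** 2 for y in ry)
    return sxy / math.sqrt(sxx * syy)

def src_shortcut(rx, ry):
    """SRC via 1 - 6 * sum(d^2) / (n (n^2 - 1)), with d the rank difference."""
    n = len(rx)
    d_squared = sum((x - y) ** 2 for x, y in zip(rx, ry))
    return 1.0 - 6.0 * d_squared / (n * (n ** 2 - 1))

# Hypothetical ranks without ties: both formulas give the same value.
untied_x = [1, 2, 3, 4, 5]
untied_y = [2, 1, 4, 3, 5]
print(src_as_pcc(untied_x, untied_y), src_shortcut(untied_x, untied_y))

# Hypothetical ranks with one tie (mid-rank 2.5): the shortcut deviates slightly.
tied_x = [1, 2.5, 2.5, 4]
tied_y = [1, 2, 3, 4]
print(src_as_pcc(tied_x, tied_y), src_shortcut(tied_x, tied_y))
```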

SRC vs. PCC

Time            0.5        2          5          7          9          11
Gene A ratio    -0.76359   2.276659   2.137332   1.900334   0.932457   0.761866
Gene B ratio    -4.05957   -1.7788    -0.97433   -1.44114   -0.87574   -0.52328
Gene A rank     1          6          5          4          3          2
Gene B rank     1          2          4          3          5          6

PCC(A, B) = 0.633

SRC(A, B) = -0.086
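The figures in this comparison can be reproduced with a short script. The data are the Gene A and Gene B ratios from the table above; the helper functions are a sketch written for this example, and the simple ranking routine assumes there are no ties, which holds for these values.

```python
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ascending ranks, lowest value gets rank 1 (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result

gene_a = [-0.76359, 2.276659, 2.137332, 1.900334, 0.932457, 0.761866]
gene_b = [-4.05957, -1.7788, -0.97433, -1.44114, -0.87574, -0.52328]

rank_a = ranks(gene_a)
rank_b = ranks(gene_b)
pcc = pearson(gene_a, gene_b)
src = pearson(rank_a, rank_b)  # SRC = PCC between the ranked series

print(rank_a)  # [1, 6, 5, 4, 3, 2]
print(rank_b)  # [1, 2, 4, 3, 5, 6]
print(round(pcc, 3), round(src, 3))  # 0.633 -0.086
```

The first time point is by far the lowest value in both series, so it dominates the PCC; once magnitudes are replaced by ranks its influence shrinks and the SRC is close to zero.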

Chapter17 p401

Chapter17 p408