Transcript: PowerPoint presentation - Asia University
17
Correlation
Chapter17 p399
Semimetric distance – Pearson correlation coefficient or Covariance
$$\mathrm{Var}(x) = s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
How about higher-dimensional data?
- It is useful to have a similar measure of how much the dimensions vary from the mean with respect to each other.
- Covariance is measured between 2 dimensions.
- Suppose one has a 3-dimensional data set (X, Y, Z); then one can calculate Cov(X,Y), Cov(X,Z) and Cov(Y,Z).
$$\mathrm{Cov}(X,Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$
- To compare heterogeneous pairs of variables, define the correlation coefficient, or Pearson correlation coefficient, with -1 ≤ r_XY ≤ 1:
$$r_{XY} = \frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$$
r_XY = -1: perfect anticorrelation
r_XY = 0: independent
r_XY = +1: perfect correlation
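The three definitions above can be checked numerically. The following is a minimal sketch in plain Python; the function names and the toy data are illustrative assumptions, not part of the slides.

```python
import math

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Sample covariance with the n - 1 denominator used on the slide."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def variance(xs):
    # Var(x) is the covariance of a dimension with itself.
    return covariance(xs, xs)

def pearson(xs, ys):
    return covariance(xs, ys) / math.sqrt(variance(xs) * variance(ys))

# Toy data (assumed): two exact linear relations with opposite slopes.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_up = [2.0 * v + 1.0 for v in x]
y_down = [-3.0 * v + 4.0 for v in x]

print(variance(x))          # 2.5
print(pearson(x, y_up))     # close to +1 (perfect correlation)
print(pearson(x, y_down))   # close to -1 (perfect anticorrelation)
```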
Semimetric distance – the squared Pearson correlation coefficient
• The Pearson correlation coefficient is useful for examining correlations in the data.
• One may imagine an instance, for example, in which the same TF (transcription factor) can cause both enhancement and repression of expression.
• In that case a better alternative is the squared Pearson correlation coefficient (pcc):
$$r_{sq} = r_{XY}^2 = \frac{[\mathrm{Cov}(X,Y)]^2}{\mathrm{Var}(X)\,\mathrm{Var}(Y)}$$
The squared pcc takes values in the range 0 ≤ r_sq ≤ 1:
r_sq = 0: uncorrelated vectors
r_sq = 1: perfectly correlated or anti-correlated vectors
• The pcc and the squared pcc are measures of similarity.
• Similarity and distance have an inverse relationship: similarity↑ distance↓
• d = 1 - r is typically used as a measure of distance.
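As a rough illustration of d = 1 - r, the sketch below computes the distance with either the plain or the squared coefficient. The `squared` flag, the function names, and the two profiles are assumptions made for this example only.

```python
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def pearson_distance(xs, ys, squared=False):
    """d = 1 - r, or d = 1 - r^2 when squared=True (hypothetical switch)."""
    r = pearson(xs, ys)
    return 1.0 - (r * r if squared else r)

# Assumed toy profiles: the second is the mirror image of the first.
profile = [0.1, 0.9, 1.8, 3.2]
mirrored = [-v for v in profile]

d_plain = pearson_distance(profile, mirrored)
d_squared = pearson_distance(profile, mirrored, squared=True)
print(d_plain)    # close to 2: anti-correlated profiles are far apart
print(d_squared)  # close to 0: the squared pcc treats them as similar
```

With the plain coefficient the distance ranges from 0 to 2; with the squared coefficient it ranges from 0 to 1 and no longer separates correlation from anti-correlation.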
Semimetric distance – Pearson correlation coefficient or Covariance
- The resulting r_XY value will be larger than 0 if X and Y tend to increase together, below 0 if one tends to decrease as the other increases, and 0 if they are independent.
Remark: r_XY only tests whether there is a linear dependence, Y = aX + b
- if two variables are independent, r_XY is low
- a low r_XY may or may not mean the variables are independent; there may be a non-linear relation
- a high r_XY is a sufficient but not necessary condition for variable dependence
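A small numerical case makes the remark concrete: a variable that is fully determined by another can still have r_XY = 0 when the relation is non-linear. The sketch below uses an assumed toy relation, y = x^2 on a range symmetric about zero.

```python
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# y depends completely on x, but not linearly.
x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y = [v * v for v in x]

r = pearson(x, y)
print(r)  # 0.0, although y is a function of x
```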
Semimetric distance – the squared Pearson correlation coefficient
• To test for a non-linear relation among the data, one could make a transformation by variable substitution.
• Suppose one wants to test the relation u(v) = a v^n.
• Take the logarithm of both sides: log u = log a + n log v
• Set Y = log u, b = log a, and X = log v to obtain a linear relation, Y = b + nX.
• log u correlates (n > 0) or anti-correlates (n < 0) with log v.
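The substitution can be tried on synthetic data. In this sketch the values a = 2 and n = -1.5 are assumptions chosen for illustration, and the least-squares slope used to recover n is an addition that the slide does not mention.

```python
import math

def line_stats(xs, ys):
    """Return (slope, intercept, Pearson r) for the relation Y = b + nX."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    slope = sxy / sxx
    return slope, my - slope * mx, sxy / math.sqrt(sxx * syy)

# Synthetic power law u = a * v**n (assumed parameters).
a_true, n_true = 2.0, -1.5
v = [1.0, 2.0, 4.0, 8.0, 16.0]
u = [a_true * vi ** n_true for vi in v]

# Correlation on the raw values versus on the log-transformed values.
_, _, r_raw = line_stats(v, u)
n_est, b_est, r_log = line_stats([math.log(vi) for vi in v],
                                 [math.log(ui) for ui in u])
a_est = math.exp(b_est)  # b = log a, so a = exp(b)

print(r_raw)          # negative, but clearly weaker than -1
print(r_log)          # close to -1: log u anti-correlates with log v (n < 0)
print(n_est, a_est)   # close to -1.5 and 2.0
```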
Semimetric distance – Pearson correlation coefficient or Covariance matrix
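A covariance matrix is merely a collection of many covariances in the form of a d x d matrix:
$$C = (c_{ij}), \qquad c_{ij} = \mathrm{Cov}(X_i, X_j), \qquad i, j = 1, \dots, d$$
The diagonal entries are the variances, c_ii = Var(X_i), and the matrix is symmetric because Cov(X_i, X_j) = Cov(X_j, X_i).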
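For a 3-dimensional data set (X, Y, Z), the matrix can be assembled from pairwise covariances. This is a minimal sketch with assumed toy data and hypothetical function names.

```python
def covariance(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def covariance_matrix(dimensions):
    """Return the d x d matrix whose (i, j) entry is Cov(dimension i, dimension j)."""
    return [[covariance(a, b) for b in dimensions] for a in dimensions]

# Assumed toy data: Y rises with X, Z falls with X.
X = [1.0, 2.0, 3.0, 4.0]
Y = [2.0, 4.0, 6.0, 8.0]
Z = [4.0, 3.0, 2.0, 1.0]

C = covariance_matrix([X, Y, Z])
for row in C:
    print(["%.3f" % entry for entry in row])
```

The diagonal holds Var(X), Var(Y), Var(Z), and the off-diagonal entries hold Cov(X,Y), Cov(X,Z) and Cov(Y,Z), each appearing twice.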
Spearman’s rank correlation (SRC)
• One of the problems with using the PCC is that it is susceptible to being skewed by outliers: a single data point can result in two genes appearing to be correlated, even when all the other data points suggest that they are not.
• Spearman’s rank correlation (SRC) is a non-parametric measure of correlation that is robust to outliers.
• SRC is a measure that ignores the magnitude of the changes. The idea of the rank correlation is to transform the original values into ranks, and then to compute the correlation between the series of ranks.
• First we order the values of gene A and B in ascending order, and assign the lowest value rank 1. The SRC between A and B is defined as the PCC between ranked A and B.
• In case of ties, assign mid-ranks: if two values tie for ranks 5 and 6, assign each a rank of 5.5.
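The ranking step, including mid-ranks for ties, can be sketched as follows. The function name `midranks` and the sample values are assumptions for illustration.

```python
def midranks(values):
    """Rank values in ascending order (lowest = 1); tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the run while the next sorted value equals the current one.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Average of the 1-based positions start+1 .. end+1.
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

# Two values tie for positions 5 and 6, so both receive rank 5.5.
values = [10.0, 20.0, 30.0, 40.0, 50.0, 50.0, 70.0]
print(midranks(values))  # [1.0, 2.0, 3.0, 4.0, 5.5, 5.5, 7.0]
```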
Spearman’s rank correlation
The SRC can be calculated by the following formula, where x_i and y_i denote the ranks of x and y respectively.
$$r_{SRC}(X,Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\left[\sum_{i=1}^{n} (x_i - \bar{x})^2\right]\left[\sum_{i=1}^{n} (y_i - \bar{y})^2\right]}}$$
An approximate formula in case of ties is given by
$$r_{SRC}(X,Y) = 1 - \frac{6 \sum_{i=1}^{n} (x_i - y_i)^2}{n (n^2 - 1)}$$
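To see how the two formulas relate, the sketch below evaluates both on ranks that are assumed to be already assigned. Without ties the two agree; with a tie (mid-rank 2.5) the second formula is only approximate. The function names and rank lists are illustrative.

```python
import math

def src_from_ranks(rx, ry):
    """First formula: Pearson correlation applied to the ranks."""
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(rx, ry))
    sxx = sum((x - mx) ** 2 for x in rx)
    syy = sum((y - my) ** 2 for y in ry)
    return sxy / math.sqrt(sxx * syy)

def src_shortcut(rx, ry):
    """Second formula: 1 - 6 * sum(d_i^2) / (n (n^2 - 1))."""
    n = len(rx)
    d2 = sum((x - y) ** 2 for x, y in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# No ties: both formulas give the same value.
no_tie_x, no_tie_y = [1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 2.0, 4.0]
print(src_from_ranks(no_tie_x, no_tie_y), src_shortcut(no_tie_x, no_tie_y))

# One tie in x (two values share mid-rank 2.5): the values differ slightly.
tie_x, tie_y = [1.0, 2.5, 2.5, 4.0], [1.0, 3.0, 2.0, 4.0]
print(src_from_ranks(tie_x, tie_y), src_shortcut(tie_x, tie_y))
```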
SRC vs. PCC
Time           0.5        2          5          7          9          11
Gene A ratio   -0.76359   2.276659   2.137332   1.900334   0.932457   0.761866
Gene B ratio   -4.05957   -1.7788    -0.97433   -1.44114   -0.87574   -0.52328
Gene A rank    1          6          5          4          3          2
Gene B rank    1          2          4          3          5          6

PCC(A, B) = 0.633
SRC(A, B) = -0.086
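The two values can be reproduced from the tabulated numbers. This is a minimal sketch; the variable names are assumptions, and the ratios and ranks are copied from the table. The large negative ratio of both genes at time 0.5 pulls the PCC up, while the ranks show almost no agreement.

```python
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

gene_a = [-0.76359, 2.276659, 2.137332, 1.900334, 0.932457, 0.761866]
gene_b = [-4.05957, -1.7788, -0.97433, -1.44114, -0.87574, -0.52328]
rank_a = [1, 6, 5, 4, 3, 2]
rank_b = [1, 2, 4, 3, 5, 6]

pcc = pearson(gene_a, gene_b)   # PCC on the raw ratios
src = pearson(rank_a, rank_b)   # SRC = PCC on the ranks (no ties here)

# Cross-check with 1 - 6 * sum(d^2) / (n (n^2 - 1)); sum(d^2) = 38, n = 6.
n = len(rank_a)
d2 = sum((x - y) ** 2 for x, y in zip(rank_a, rank_b))
src_short = 1.0 - 6.0 * d2 / (n * (n * n - 1))

print(round(pcc, 3), round(src, 3), round(src_short, 3))
```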
Chapter17 p401
Chapter17 p408