Transcript: Correlation

Correlation
Rizal Maulana
Outline
Basic Concepts of Correlation
 Scatter Diagrams
 One Sample Hypothesis Testing for Correlation

Basic Concepts of Correlation

Definition 1: The covariance between two sample random
variables x and y is a measure of the linear association between
the two variables, and is defined by the formula

cov(x, y) = Σ (xi – x̄)(yi – ȳ) / (n – 1)

where x̄ and ȳ are the sample means and the sum runs over i = 1, …, n.

Observation: The covariance is similar to the variance, except
that the covariance is defined for two variables (x and y above)
whereas the variance is defined for only one variable. In fact,
cov(x, x) = var(x).

The covariance can be thought of as the sum of matches and
mismatches among the pairs of data elements for x and y.
A match occurs when both elements in the pair are on the same
side of their mean; a mismatch occurs when one element in the
pair is above its mean and the other is below its mean.
The covariance is positive when the matches outweigh the
mismatches and is negative when the mismatches outweigh the
matches.
The stronger the linear relationship, the larger the absolute value
of the covariance will be.
The size of the covariance is also influenced by the scale of the
data elements, and so in order to eliminate the scale factor the
correlation coefficient is used as the scale-free metric of linear
relationship.
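This match/mismatch view is easy to check numerically; a minimal Python sketch (the data values are made up for illustration and are not from the slides):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.2, 4.8, 5.1])

# The sign of each centered product says whether the pair (xi, yi)
# is a match (+) or a mismatch (-).
products = (x - x.mean()) * (y - y.mean())
matches = int(np.sum(products > 0))
mismatches = int(np.sum(products < 0))

# The sample covariance is the scaled sum of these products.
cov_xy = products.sum() / (len(x) - 1)
print(matches, mismatches, cov_xy)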

Definition 2: The correlation coefficient between two sample
variables x and y is a scale-free measure of linear association
between the two variables, and is given by the formula

r = cov(x, y) / (sx ∙ sy)

where sx and sy are the sample standard deviations of x and y.

Observation: The covariance can be calculated as

cov(x, y) = (Σ xi yi – n x̄ ȳ) / (n – 1)

As a result, we can also calculate the correlation coefficient as

r = (Σ xi yi – n x̄ ȳ) / ((n – 1) ∙ sx ∙ sy)
Property 1:
–1 ≤ r ≤ 1
 If r is close to 1 then x and y are positively correlated. A positive
linear correlation means that high values of x are associated
with high values of y and low values of x are associated with low
values of y.
 If r is close to -1 then x and y are negatively correlated. A
negative linear correlation means that high values of x are
associated with low values of y, and low values of x are
associated with high values of y.
 When r is close to 0 there is little linear relationship between x
and y.


Definition 3: The covariance between two random variables x
and y for a population with a discrete or continuous pdf is

cov(x, y) = E[(x – μx)(y – μy)]

where μx and μy are the population means of x and y.

Definition 4: The (Pearson’s product moment) correlation
coefficient for two variables x and y for a population with a
discrete or continuous pdf is

ρ = cov(x, y) / (σx ∙ σy)

If x and y are independent then cov(x, y) = 0.
 The following is true both for the sample and the population:

var(x + y) = var(x) + var(y) + 2 cov(x, y)

Proof (population version):
var(x + y) = E[((x – μx) + (y – μy))²]
= E[(x – μx)²] + 2 E[(x – μx)(y – μy)] + E[(y – μy)²]
= var(x) + 2 cov(x, y) + var(y)

Observation: It turns out that r is not an unbiased estimate of
ρ. A relatively unbiased estimate of ρ is given by the adjusted
correlation coefficient radj:

radj = ±√(1 – (1 – r²)(n – 1)/(n – 2))

where the sign is the same as that of r. While radj is a better
estimate of the population correlation, especially for small values
of n, for large values of n it is easy to see that radj ≈ r.
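A quick Python check of this adjustment (a sketch; the values of r and n are illustrative):

import math

def r_adj(r, n):
    # Adjusted correlation coefficient; keeps the sign of r.
    return math.copysign(math.sqrt(1 - (1 - r**2) * (n - 1) / (n - 2)), r)

print(r_adj(0.6, 10))    # noticeably smaller than r for small n
print(r_adj(0.6, 1000))  # approximately r for large n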
For constant a and random variables x, y and z, the following are
true both for the sample and population definitions of covariance:
a. cov(x, y) = cov(y, x)
b. cov(x, x) = var(x)
c. cov(a, y) = 0
d. cov(ax, y) = a ∙ cov(x, y)
e. cov(x + z, y) = cov(x, y) + cov(z, y)
These properties can be spot-checked numerically, as in the sketch below.
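A minimal numpy sketch verifying properties a–e (random illustrative data, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
x, y, z = rng.standard_normal((3, 100))
a = 2.5

def cov(u, v):
    # Sample covariance, matching Definition 1 (denominator n - 1).
    return np.sum((u - u.mean()) * (v - v.mean())) / (len(u) - 1)

assert np.isclose(cov(x, y), cov(y, x))                  # a.
assert np.isclose(cov(x, x), np.var(x, ddof=1))          # b.
assert np.isclose(cov(np.full_like(y, a), y), 0)         # c.
assert np.isclose(cov(a * x, y), a * cov(x, y))          # d.
assert np.isclose(cov(x + z, y), cov(x, y) + cov(z, y))  # e.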

If x and y are random variables and z = ax + b where a and b are
constants with a > 0, then the correlation coefficient between z
and y is the same as the correlation coefficient between x and y.
This is because cov(z, y) = cov(ax + b, y) = a ∙ cov(x, y) and
var(z) = a² ∙ var(x), and so stdev(z) = a ∙ stdev(x). Thus

r(z, y) = cov(z, y) / (stdev(z) ∙ stdev(y)) = a ∙ cov(x, y) / (a ∙ stdev(x) ∙ stdev(y)) = r(x, y)

Property 2: If x = t + e where t and e are independent random
variables, then var(x) = var(t) + var(e).
Proof
var(x) = var(t + e) = var(t) + var(e) + 2 cov(t, e)
Since ti and ei are independent, cov(t, e) = 0, and so the middle
term drops out. Thus var(x) = var(t) + var(e).
Excel Functions:
1. COVAR(R1, R2) = the population covariance between the data in
arrays R1 and R2. If R1 contains the data {x1,…,xn}, R2 contains
{y1,…,yn}, x̄ = AVERAGE(R1) and ȳ = AVERAGE(R2), then
COVAR(R1, R2) has the value

Σ (xi – x̄)(yi – ȳ) / n

This is the same as the formula given in Definition 1, except with
n – 1 replaced by n. Excel doesn’t have a sample version of the
covariance, although this can be calculated using the formula:
n * COVAR(R1, R2) / (n – 1)
2. CORREL(R1, R2) = the correlation coefficient of the data in
arrays R1 and R2. This function can be used for both the sample
and population versions of the correlation coefficient. Note that:
◦ CORREL(R1, R2) = COVAR(R1, R2) / (STDEVP(R1) * STDEVP(R2)) =
the population version of the correlation coefficient
◦ CORREL(R1, R2) = n * COVAR(R1, R2) / (STDEV(R1) * STDEV(R2) *
(n – 1)) = the sample version of the correlation coefficient
3. Excel also provides COVARIANCE.S(R1, R2) to compute the
sample covariance, as well as COVARIANCE.P(R1, R2), which is
equivalent to COVAR(R1, R2). Also, the Real Statistics
supplemental functions COVARP(R1, R2) and COVARS(R1, R2)
compute the population and sample covariances respectively.
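For readers working outside Excel, the same quantities can be computed in Python; a sketch with illustrative data (the comments map each line loosely onto the Excel function it mimics):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.5, 2.5, 2.0, 4.0])
n = len(x)

cov_p = np.mean((x - x.mean()) * (y - y.mean()))  # like COVAR / COVARIANCE.P
cov_s = cov_p * n / (n - 1)                       # like COVARIANCE.S
r = np.corrcoef(x, y)[0, 1]                       # like CORREL
print(cov_p, cov_s, r)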
Scatter Diagrams
To better visualize the association between two data sets {x1,
…, xn} and {y1, …, yn} we can employ a chart called a scatter
diagram (also called a scatter plot). This is done in Excel by
highlighting the data in the two data sets and selecting Insert >
Charts|Scatter.
 A figure (omitted from this transcript) illustrates the relationship
between a scatter diagram and the correlation coefficient (or covariance).
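Outside Excel, a scatter diagram takes only a few lines of Python; a sketch with made-up, roughly linear data:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 30)
y = 2 * x + rng.normal(0, 3, 30)  # roughly linear, so r should be fairly high

plt.scatter(x, y)
plt.xlabel("x")
plt.ylabel("y")
plt.title(f"r = {np.corrcoef(x, y)[0, 1]:.2f}")
plt.show()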

One Sample Hypothesis Testing for Correlation

As we do in Sampling Distributions, we can consider the
distribution of r over repeated samples of x and y.
We require that x and y have a joint bivariate normal distribution
or that the samples are sufficiently large.
We can think of a bivariate normal distribution as the three-dimensional
version of the normal distribution, in which any
vertical slice through the surface which graphs the distribution
results in an ordinary bell curve.
The sampling distribution of r is only symmetric when ρ = 0 (i.e.
when x and y are independent).
If ρ ≠ 0, then the sampling distribution is asymmetric, so the
following theorem does not apply and other methods of inference
must be used.

Theorem 1: Suppose ρ = 0. If x and y have a bivariate normal
distribution or if the sample size n is sufficiently large, then r has
a normal distribution with mean 0, and t = r/sr ~ T(n – 2) where

sr = √((1 – r²)/(n – 2))

Here the numerator r of the random variable t is the estimate of
ρ = 0 and sr is the standard error of r.
 Observation: If we solve the equation in Theorem 1 for r, we get

r = t/√(n – 2 + t²)

Observation: The theorem can be used to test the hypothesis
that population random variables x and y are independent i.e.
ρ = 0.
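A small Python helper for this test (a sketch, not from the slides; it needs only r and n, and uses scipy for the T(n – 2) distribution):

from scipy import stats
import math

def corr_ttest(r, n):
    # Test H0: rho = 0 via Theorem 1: t = r / s_r ~ T(n - 2).
    s_r = math.sqrt((1 - r**2) / (n - 2))
    t = r / s_r
    p_two_tail = 2 * stats.t.sf(abs(t), n - 2)
    return t, p_two_tail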
Example 1

A study is designed to check the relationship between smoking and
longevity. A sample of 15 men aged 50 years and older was taken, and the
average number of cigarettes smoked per day and the age at death were
recorded, as summarized in a table (not reproduced in this transcript).
Can we conclude from the sample that longevity is independent of smoking?
The scatter diagram for this data, together with the linear trend line that
seems to best match the data, is omitted from this transcript.

Next we calculate the correlation coefficient of the sample using the
CORREL function:
r = CORREL(R1, R2) = -.713
From the scatter diagram and the correlation coefficient, it is clear that
the population correlation is likely to be negative.
The absolute value of the correlation coefficient looks high, but is it high
enough? To determine this, we establish the following null hypothesis:
H0 : ρ = 0
Recall that ρ = 0 would mean that the two population variables are
independent.
We use t = r/sr as the test statistic where sr is as in Theorem 1. Based
on the null hypothesis, ρ = 0, we can apply Theorem 1, provided x and y
have a bivariate normal distribution.
It is difficult to check for bivariate normality, but we can at least check to
make sure that each variable is approximately normal via QQ plots.

Both samples appear to be normal, and so by Theorem 1, we know that t
has approximately a t-distribution with n – 2 = 13 degrees of freedom.
We now calculate
sr = √((1 – r²)/(n – 2)) = √((1 – (-.713)²)/13) = .194
t = r/sr = -.713/.194 = -3.67
Finally, we perform either one of the following tests:
p-value = TDIST(ABS(-3.67), 13, 2) = .00282 < .05 = α (two-tail)
tcrit = TINV(.05, 13) = 2.16 < 3.67 = |tobs|
 And so we reject the null hypothesis and conclude that there is a non-zero
correlation between smoking and longevity. In fact, it appears from the
data that increased levels of smoking reduce longevity.
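Checking Example 1 in Python, reusing the corr_ttest sketch defined after Theorem 1:

t, p = corr_ttest(-0.713, 15)
print(t, p)  # t is roughly -3.67, two-tail p roughly .0028, matching the Excel results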

Example 2

The US Census Bureau collects statistics comparing the 50 states.
A table (not reproduced in this transcript) shows the poverty rate (% of the
population below the poverty level) and the infant mortality rate (per 1,000
live births) by state. Based on this data, can we conclude that the poverty
and infant mortality rates by state are correlated?
The correlation coefficient of the sample is given by
r = CORREL(R1, R2) = .564
 Where R1 is the range containing the poverty data and R2 is the range
containing the infant mortality data.
 From the scatter diagram and the correlation coefficient, it is clear that
the population correlation is likely to be positive, and so this time we use
the following one-tail null hypothesis:
H0: ρ ≤ 0
 Based on the null hypothesis we will assume that ρ = 0 (best case), and
so as in Example 1
sr = √((1 – r²)/(n – 2)) = √((1 – .564²)/48) = .119
t = r/sr = .564/.119 = 4.737

Finally, we perform either one of the following tests:
p-value = TDIST(4.737, 48, 1) = 9.8E-06 < .05 = α (one-tail)
tcrit = TINV(.10, 48) = 1.677 < 4.737 = tobs (one-tail)
 And so we reject the null hypothesis, and conclude that poverty and
infant mortality rates are positively correlated.
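And in Python, again reusing corr_ttest (the one-tail p-value is half the two-tail value, since the observed r is positive):

t, p_two = corr_ttest(0.564, 50)
print(t, p_two / 2)  # t roughly 4.74, one-tail p on the order of 1E-05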

Observation: For samples of any given size n it turns out that r
is not normally distributed when ρ ≠ 0 (even when the population
has a normal distribution), and so we can’t use Theorem 1.
 There is a simple transformation of r, however, that gets around
this problem, and allows us to test whether ρ = ρ0 for some value
of ρ0 ≠ 0.
 Definition 1: For any r, define the Fisher transformation of r
as follows:

r′ = ½ ln((1 + r)/(1 – r))


Theorem 2: If x and y have a joint bivariate normal distribution
or n is sufficiently large, then the Fisher transformation r′ of the
correlation coefficient r for samples of size n has distribution
N(ρ′, sr′) where

ρ′ = ½ ln((1 + ρ)/(1 – ρ))   and   sr′ = 1/√(n – 3)

Corollary 1: Suppose r1 and r2 are as in the theorem, where
r1 and r2 are based on independent samples of sizes n1 and n2,
and further suppose that ρ1 = ρ2. If z is defined as follows, then
z ~ N(0, 1)

z = (r1′ – r2′)/s   where   s = √(1/(n1 – 3) + 1/(n2 – 3))
Excel Functions: Excel provides functions that calculate the
Fisher transformation and its inverse.
FISHER (r) = .5 * LN((1 + r) / (1 – r))
FISHERINV(z) = (EXP(2 * z) – 1) / (EXP(2 * z) + 1)
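In Python these are simply the hyperbolic arctangent and tangent (a quick check against the Excel formulas; note that atanh(.6) = .5 * ln(1.6/.4)):

import math
print(math.atanh(0.6))   # = FISHER(0.6), approximately 0.693
print(math.tanh(0.693))  # = FISHERINV(0.693), approximately 0.6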
 Observation: We can use Theorem 2 to test the null hypothesis
H0: ρ = ρ0. This test is very sensitive to outliers. If outliers are
present it may be better to use the Spearman rank correlation
test or Kendall’s tau test.

Example 3
Suppose we calculate r = .6 for a sample of size n = 100. Test the
following null hypothesis and find the 95% confidence interval.
H0: ρ = .7
 Observe that
r′ = FISHER(r) = FISHER(.6) = 0.693
ρ′ = FISHER(ρ) = FISHER(.7) = 0.867
sr′ = 1 / SQRT(n – 3) = 1 / SQRT(100 – 3) = 0.102


Since r′ < ρ′ we are looking at the left tail of a two-tail test
p-value = NORMDIST(r′, ρ′, sr′, TRUE) = NORMDIST(.693, .867, .102,
TRUE) = .0432 > 0.025 = α/2
r′-crit = NORMINV(α/2, ρ′, sr′) = NORMINV(.025, .867, .102) = .668 <
.693 = r′
In either case, we cannot reject the null hypothesis.
The 95% confidence interval for ρ′ is
r′ ± zcrit ∙ sr′ = 0.693 ± 1.96 ∙ 0.102 = (0.494, 0.892)
 Here zcrit = ABS(NORMSINV(.025)) = 1.96.
 The 95% confidence interval for ρ is therefore (FISHERINV(0.494),
FISHERINV(0.892)) = (.457, .712).
 Note that .7 lies in this interval, confirming our conclusion not to reject
the null hypothesis.
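Example 3 end-to-end in Python (a sketch of the same computation, with scipy.stats.norm standing in for NORMDIST/NORMINV):

import math
from scipy import stats

r, rho0, n = 0.6, 0.7, 100
r_p = math.atanh(r)         # FISHER(r), approximately 0.693
rho_p = math.atanh(rho0)    # FISHER(rho0), approximately 0.867
s = 1 / math.sqrt(n - 3)    # approximately 0.102

p_left = stats.norm.cdf(r_p, loc=rho_p, scale=s)  # left tail, approximately .043
lo, hi = stats.norm.interval(0.95, loc=r_p, scale=s)
print(p_left, math.tanh(lo), math.tanh(hi))       # CI for rho, approximately (.457, .712)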

Effect Size And Power
Until now, when we have discussed effect size we have used
some version of Cohen’s d.
 The correlation coefficient r (as well as r2) provides another
common measure of effect size.
 We now show how to calculate the power of a test of correlation
using the approach from Power of a Sample.

Example 4

A market research team is conducting a study in which they believe the
correlation between increases in product sales and marketing
expenditures is 0.35. What is the power of the one-tail test if they use a
sample of size 40 with α = .05? How big does their sample need to be to
carry out the study with α = .05 and power = .80?
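The transcript ends before working this example, but the standard approach (consistent with Theorem 2) uses the Fisher-transformation normal approximation; a hedged Python sketch, whose outputs are approximations rather than values taken from the slides:

import math
from scipy import stats

alpha, rho = 0.05, 0.35
z_alpha = stats.norm.ppf(1 - alpha)  # one-tail critical value, approximately 1.645

def power(n):
    # Under H1, r' ~ N(atanh(rho), 1/sqrt(n - 3)); under H0 the cutoff
    # sits z_alpha standard errors above atanh(0) = 0.
    return stats.norm.sf(z_alpha - math.atanh(rho) * math.sqrt(n - 3))

print(power(40))             # power roughly 0.72 for n = 40

n = 4
while power(n) < 0.80:       # smallest n achieving power >= .80
    n += 1
print(n)                     # roughly n = 50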