Module 20: Correlation

This module focuses on calculating, interpreting, and testing hypotheses about the Pearson Product Moment Correlation Coefficient.

Reviewed 05 June 05 / MODULE 20

Correlation

In Module 19, we examined how two variables, x and y, relate to each other by using the simple linear regression tool. In that context, x was the independent variable and y was the dependent variable. Typical examples of the independent variable are measures of time, such as age, whereas typical examples of the dependent variable are continuous measurements such as blood cholesterol level. The general assumption is that there is a separate normal distribution of the dependent variable y for each value of the independent variable x. Further, we need to assume that these separate normal distributions all have the same population variance.

Clearly these assumptions are quite restrictive in that we are often interested in the relationship between two variables, x and y, where it is not at all clear which should be labeled the independent variable and which the dependent one. An example is the relationship between blood cholesterol level and blood pressure level.

For this situation, we have another tool to measure and test hypotheses about the relationship between these two variables. The tool is correlation and we focus here only on what is usually called the Pearson Product Moment Correlation Coefficient. There are other measures of correlation which we will not discuss here. There are restrictions for the use of this correlation tool as well, which include the basic assumption that the x and y variables together have a joint frequency distribution which is called the bivariate normal distribution. This distribution looks like a three-dimensional bell in a manner similar to the way a normal distribution for one variable looks like a cross section of a bell.

The degree of association or correlation between two variables is measured by the correlation coefficient. This is done in a manner similar to that for other population parameters and estimates of these parameters obtained by calculating statistics from samples. That is, there is a value for the population parameter for this coefficient which is estimated by selecting a random sample and calculating the appropriate coefficient using the data from this sample. We can also use the information from the sample to test hypotheses about the population.

The population parameter for the Pearson Product Moment Correlation Coefficient is defined as

$$\rho_{xy} = \frac{\sum (x - \mu_x)(y - \mu_y)}{\sqrt{\sum (x - \mu_x)^2 \, \sum (y - \mu_y)^2}}$$

which is typically called Rho, for the Greek letter it represents. The estimate of ρ calculated from the sample data is the statistic

$$r_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \, \sum (y - \bar{y})^2}} = \frac{\sum xy - (\sum x)(\sum y)/n}{\sqrt{\left[\sum x^2 - (\sum x)^2/n\right]\left[\sum y^2 - (\sum y)^2/n\right]}}$$

Fecal Fat and Urinary Oxalate Example

Fecal Fat (g/24 hr) and Urinary Oxalate (mg/24 hr) secreted by a random sample of n = 11 persons

| Patient | x = Fecal Fat | y = Urinary Oxalate |
|---|---|---|
| 1 | 16 | 31 |
| 2 | 14 | 40 |
| 3 | 38 | 41 |
| 4 | 8 | 70 |
| 5 | 15 | 75 |
| 6 | 22 | 65 |
| 7 | 28 | 85 |
| 8 | 27 | 115 |
| 9 | 14 | 128 |
| 10 | 45 | 145 |
| 11 | 46 | 140 |
| Sum | 273 | 935 |
| Mean | 24.8 | 85.0 |

Scatter plot for Fecal Fat and Urinary Oxalate Data

[Figure: scatter plot of Urinary Oxalate (vertical axis, 0 to 160) against Fecal Fat (g/24hr) (horizontal axis, 0 to 50).]

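Calculations for Regressing Urinary Oxalate on Fecal Fat

| Person | Fecal Fat x | x² | Urinary Oxalate y | y² | xy |
|---|---|---|---|---|---|
| 1 | 16 | 256 | 31 | 961 | 496 |
| 2 | 14 | 196 | 40 | 1,600 | 560 |
| 3 | 38 | 1,444 | 41 | 1,681 | 1,558 |
| 4 | 8 | 64 | 70 | 4,900 | 560 |
| 5 | 15 | 225 | 75 | 5,625 | 1,125 |
| 6 | 22 | 484 | 65 | 4,225 | 1,430 |
| 7 | 28 | 784 | 85 | 7,225 | 2,380 |
| 8 | 27 | 729 | 115 | 13,225 | 3,105 |
| 9 | 14 | 196 | 128 | 16,384 | 1,792 |
| 10 | 45 | 2,025 | 145 | 21,025 | 6,525 |
| 11 | 46 | 2,116 | 140 | 19,600 | 6,440 |
| Sum | 273 | 8,519 | 935 | 96,451 | 25,971 |
| Mean | 24.8 | | 85.0 | | |
| Sum²/n | 6,775.36 | | 79,475.00 | | |
| SS | 1,743.64 | | 16,976.00 | | 2,766.00 |
| Variance | 174.36 | | 1,697.60 | | |
| SD | 13.2 | | 41.2 | | |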
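The totals and corrected sums of squares in this table can be checked with a short script. What follows is a minimal Python sketch rather than part of the original module; the helper name `corrected_sum` is an illustrative assumption.

```python
import math

# Data from the Fecal Fat / Urinary Oxalate example (n = 11)
fecal_fat = [16, 14, 38, 8, 15, 22, 28, 27, 14, 45, 46]             # x, g/24 hr
urinary_oxalate = [31, 40, 41, 70, 75, 65, 85, 115, 128, 145, 140]  # y, mg/24 hr


def corrected_sum(u, v):
    """Return sum(u*v) - sum(u)*sum(v)/n, which gives SS(x), SS(y), or SS(xy)."""
    n = len(u)
    return sum(a * b for a, b in zip(u, v)) - sum(u) * sum(v) / n


n = len(fecal_fat)
ss_x = corrected_sum(fecal_fat, fecal_fat)
ss_y = corrected_sum(urinary_oxalate, urinary_oxalate)
ss_xy = corrected_sum(fecal_fat, urinary_oxalate)

# The sample variance divides the sum of squares by n - 1
var_x, var_y = ss_x / (n - 1), ss_y / (n - 1)
sd_x, sd_y = math.sqrt(var_x), math.sqrt(var_y)

print(f"SS(x) = {ss_x:.2f}, SS(y) = {ss_y:.2f}, SS(xy) = {ss_xy:.2f}")
print(f"Var(x) = {var_x:.2f}, SD(x) = {sd_x:.1f}")
print(f"Var(y) = {var_y:.2f}, SD(y) = {sd_y:.1f}")
```

The same helper serves all three quantities because SS(x) and SS(y) are the special cases of SS(xy) in which a variable is paired with itself.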

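Regression Tools

$$\bar{y} = 85.0 \qquad \bar{x} = 24.8$$

$$SS(x) = \sum (x - \bar{x})^2 = 1,743.64$$

$$SS(y) = \sum (y - \bar{y})^2 = 16,976.00$$

$$SS(xy) = \sum (x - \bar{x})(y - \bar{y}) = 2,766.00$$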

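So we can calculate

$$b = \frac{SS(xy)}{SS(x)} = \frac{2,766.00}{1,743.64} = 1.59$$

$$\text{Intercept} = a = \bar{y} - b\bar{x} = 85.0 - 1.59(24.82) = 45.54$$

The straight line depicting the regression relationship of y on x is

$$\hat{y} = a + bx = 45.54 + 1.59x$$

At x = 40, the regression estimate for y is:

$$\hat{y} = 45.54 + 1.59(40) = 109.14$$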

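With this information, we can add the regression line $\hat{y} = 45.54 + 1.59x$ to the scatter plot, as shown below.

[Figure: scatter plot with the fitted line $\hat{y} = 45.54 + 1.59x$; horizontal axis Fecal Fat (g/24hr), 0 to 60; vertical axis 0 to 160. The plot is annotated with the estimate at x = 45: $\hat{y} = 45.54 + 1.59(45) = 117.09$.]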
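The slope, intercept, and fitted values of the regression line can be reproduced in a few lines. This is a minimal Python sketch under one assumption that is not stated in the slides: the slope is rounded to two decimals before the intercept is computed, which is what reproduces the quoted intercept of 45.54. The function name `predict` is illustrative.

```python
# Summary quantities from the module's calculation table
ss_x = 1743.64
ss_xy = 2766.00
mean_x = 273 / 11   # about 24.8
mean_y = 935 / 11   # 85.0

# Assumption: round the slope to two decimals first, mirroring the slide
# arithmetic. Unrounded arithmetic gives an intercept near 45.63 instead.
b = round(ss_xy / ss_x, 2)
a = mean_y - b * mean_x


def predict(x):
    """Regression estimate of y (Urinary Oxalate) at a given x (Fecal Fat)."""
    return a + b * x


print(f"b = {b:.2f}, a = {a:.2f}")
print(f"estimate at x = 40: {predict(40):.2f}")
print(f"estimate at x = 45: {predict(45):.2f}")
```

Carrying full precision throughout would be the usual choice in practice; the rounding is kept here only so the printed values line up with the slides.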

Test for regression of Urinary Oxalate on Fecal Fat

1. The hypothesis: $H_0: \beta = 0$ vs. $H_1: \beta \neq 0$

2. The α level: $\alpha = 0.05$

3. The assumptions: Random normal samples for y variable from populations defined by x-variable

4. The test statistic: ANOVA as specified by

ANOVA

| Source | df | SS | MS | F |
|---|---|---|---|---|
| Regression | 1 | b SS(xy) | SS(Reg)/1 | MS(Reg)/MS(Res) |
| Residual | n-2 | SS(Res) (a) | SS(Res)/(n-2) | |
| Total | n-1 | SS(y) | | |

(a) SS(Residual) = SS(y) - SS(Regression)

5. The critical region: Reject $H_0: \beta = 0$ if the value calculated for F > $F_{0.95}(1, 9) = 5.12$

6. The result:

SS(Reg) = b SS(xy) = 1.59 (2,766.00) = 4,397.94

SS(Total) = SS(y) = 16,976.00

SS(Res) = 16,976.00 - 4,397.94 = 12,578.06

ANOVA

| Source | df | SS | MS | F |
|---|---|---|---|---|
| Regression | 1 | 4,397.94 | 4,397.94 | 3.15 |
| Residual | 9 | 12,578.06 | 1,397.56 | |
| Total | 10 | 16,976.00 | | |

7. The conclusion: Accept $H_0: \beta = 0$ since F = 3.15 < 5.12
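The ANOVA arithmetic for this test can be laid out step by step. The sketch below is a minimal Python illustration that takes the slope b = 1.59 and the tabled critical value 5.12 directly from the module; the variable names are assumptions made for readability.

```python
n = 11               # sample size
b = 1.59             # slope as rounded in the module
ss_xy = 2766.00      # SS(xy)
ss_y = 16976.00      # SS(y), the total sum of squares
f_critical = 5.12    # F_0.95(1, 9) as tabled in the module

ss_reg = b * ss_xy             # SS(Regression), 1 degree of freedom
ss_res = ss_y - ss_reg         # SS(Residual), n - 2 degrees of freedom
ms_reg = ss_reg / 1
ms_res = ss_res / (n - 2)
f_stat = ms_reg / ms_res

reject_h0 = f_stat > f_critical

print(f"SS(Reg) = {ss_reg:.2f}, SS(Res) = {ss_res:.2f}")
print(f"MS(Res) = {ms_res:.2f}, F = {f_stat:.2f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```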

[Figure: scatter plot with the fitted line y = 45.54 + 1.59x (R² = 0.2585); horizontal axis Fecal Fat (g/24hr), 0 to 60; vertical axis 0 to 160. The plot shows how the deviation of the observation (x, y) = (45, 145) from the mean is split into two parts.]

$\hat{y}$ = Regression estimate = 45.54 + 1.59(45) = 117

C = $y - \bar{y}$ = 145 - 85 = 60: Total deviation → SS(y)

A = $\hat{y} - \bar{y}$ = 117 - 85 = 32: Explained by line → SS(Reg)

B = $y - \hat{y}$ = 145 - 117 = 28: Left over → SS(Residual)

C = A + B

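Correlation Tools

$$SS(xy) = \sum (x - \bar{x})(y - \bar{y}) = 2,766.00$$

The estimate of the correlation coefficient is:

$$r_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \, \sum (y - \bar{y})^2}} = \frac{2,766.00}{\sqrt{[1,743.64][16,976.00]}} = 0.5084$$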
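The estimate can also be computed directly from the raw observations using running sums. The following is a minimal Python sketch; the function name `pearson_r` is an illustrative assumption and not something defined in the module.

```python
import math


def pearson_r(x, y):
    """Sample Pearson correlation computed from sums of x, y, x^2, y^2, and xy."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    sum_x, sum_y = sum(x), sum(y)
    ss_xy = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n
    ss_x = sum(a * a for a in x) - sum_x ** 2 / n
    ss_y = sum(b * b for b in y) - sum_y ** 2 / n
    return ss_xy / math.sqrt(ss_x * ss_y)


fecal_fat = [16, 14, 38, 8, 15, 22, 28, 27, 14, 45, 46]
urinary_oxalate = [31, 40, 41, 70, 75, 65, 85, 115, 128, 145, 140]

r = pearson_r(fecal_fat, urinary_oxalate)
print(f"r = {r:.4f}, r^2 = {r * r:.4f}")
```

Squaring r gives the proportion of SS(y) explained by the regression line, which matches the R² = 0.2585 quoted for the fitted line earlier in the module.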

The Correlation Coefficient

r measures linear association

r has values in the range -1 ≤ r ≤ +1

r near +1 implies strong positive linear association

r near -1 implies strong negative linear association

r near 0 implies no linear association

[Figure: six example scatter plots of y against x, illustrating values of r near +1, near -1, near 0, and greater than 0.]

Correlation Hypothesis Testing

The hypothesis of interest deals with whether there is linear association between x and y. If there is no such association, we would have ρ = 0. Hence, the hypotheses of interest are $H_0: \rho = 0$ vs. $H_1: \rho \neq 0$, which we can test by using the test statistic:

$$t = r\left[\frac{n-2}{1-r^2}\right]^{1/2} \sim t(n-2)$$

Note that this calculation requires only the sample estimate r of the correlation coefficient ρ and the sample size n, and that we need to use the t distribution with n - 2 degrees of freedom.
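Because the statistic depends only on r and n, it is easy to wrap in a small function. The sketch below is a minimal Python illustration; the name `correlation_t` and the input checks are assumptions added for this example, and the values r = 0.5 and n = 27 are hypothetical rather than taken from the module.

```python
import math


def correlation_t(r, n):
    """t statistic for H0: rho = 0, referred to a t distribution with n - 2 df."""
    if n < 3:
        raise ValueError("need at least 3 observations for n - 2 > 0")
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and +1")
    return r * math.sqrt((n - 2) / (1 - r * r))


# Hypothetical illustration: r = 0.5 from a sample of n = 27
t_stat = correlation_t(0.5, 27)
print(f"t = {t_stat:.3f} with {27 - 2} degrees of freedom")
```

The sign of t follows the sign of r, so a two-sided test compares the absolute value of t with the tabled critical value.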

Test of Correlation between Urinary Oxalate and Fecal Fat, n = 11, r = 0.5084

1. The hypothesis: $H_0: \rho = 0$ vs. $H_1: \rho \neq 0$

2. The α level: $\alpha = 0.05$

3. The assumptions: Random sample from bivariate normal distribution.

4. The test statistic:

$$t = r\left[\frac{n-2}{1-r^2}\right]^{1/2} \sim t(n-2)$$

5. The critical region: Reject $H_0: \rho = 0$ if the value calculated for t is not between $\pm t_{0.975}(9) = \pm 2.262$

6. The result: r = 0.5084, n = 11

$$t = 0.5084\left[\frac{9}{1 - 0.5084^2}\right]^{1/2} = 0.5084\left[\frac{9}{0.7415}\right]^{1/2} = 1.77$$

7. The conclusion: Accept $H_0: \rho = 0$ since t = 1.77 is between $\pm t_{0.975}(9) = \pm 2.262$
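For a single x variable, the correlation t test and the regression F test are two views of the same test, since t² equals F when no intermediate rounding is applied. The following minimal Python sketch illustrates this for the Urinary Oxalate and Fecal Fat example, using the critical value 2.262 as tabled in the module; the variable names are illustrative assumptions.

```python
import math

r, n = 0.5084, 11
t_critical = 2.262          # t_0.975(9) as tabled in the module

t_stat = r * math.sqrt((n - 2) / (1 - r ** 2))
reject_h0 = abs(t_stat) > t_critical

# Squaring t gives the F statistic of the regression ANOVA without rounding.
# The module reports F = 3.15 because it uses the slope rounded to 1.59.
f_equivalent = t_stat ** 2

print(f"t = {t_stat:.2f}, reject H0: {reject_h0}")
print(f"t^2 = {f_equivalent:.2f}")
```

Both tests lead to the same decision here: the statistic falls short of its critical value, so the hypothesis of no linear association is not rejected.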

Test of Correlation for Tono-Pen vs. Goldman intraocular pressure, n = 40, r = 0.6574

1. The hypothesis: $H_0: \rho = 0$ vs. $H_1: \rho \neq 0$

2. The α level: $\alpha = 0.05$

3. The assumptions: Random sample from bivariate normal distribution.

4. The test statistic:

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

5. The rejection region: Reject $H_0: \rho = 0$ if t is not between $\pm t_{0.975}(38) = \pm 2.02$

6. The result: n = 40, r = 0.6574, r² = 0.44

$$t = \frac{0.6574\sqrt{38}}{\sqrt{1 - 0.6574^2}} = \frac{0.66\sqrt{38}}{\sqrt{0.56}} = 5.44$$

7. The conclusion: Reject $H_0: \rho = 0$ since t = 5.44 is not between ±2.02

Example: AJPH, 1995; 85: 1397-1401

Test of Correlation between infant mortality rate and gross domestic product, n = 17, r = -0.64

1. The hypothesis: $H_0: \rho = 0$ vs. $H_1: \rho \neq 0$

2. The α level: $\alpha = 0.05$

3. The assumptions: Random sample from bivariate normal distribution.

4. The test statistic:

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

5. The rejection region: Reject $H_0: \rho = 0$ if t is not between $\pm t_{0.975}(15) = \pm 2.13$

6. The result: n = 17, r = -0.64

$$t = \frac{-0.64\sqrt{15}}{\sqrt{1 - (-0.64)^2}} = \frac{-0.64\sqrt{15}}{\sqrt{0.59}} = -3.23$$

7. The conclusion: Reject $H_0: \rho = 0$ since t = -3.23 is not between $\pm t_{0.975}(15) = \pm 2.13$

Test of correlation hypothesis for life expectancy for males and females, n = 17, r = 0.67

1. The hypothesis: $H_0: \rho = 0$ vs. $H_1: \rho \neq 0$

2. The α level: $\alpha = 0.05$

3. The assumptions: Random sample from bivariate normal distribution.

4. The test statistic:

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

5. The rejection region: Reject $H_0: \rho = 0$ if t is not between $\pm t_{0.975}(15) = \pm 2.13$

6. The result: n = 17, r = 0.67

$$t = \frac{0.67\sqrt{15}}{\sqrt{1 - 0.67^2}} = 3.49$$

7. The conclusion: Reject $H_0: \rho = 0$ since t = 3.49 is not between $\pm t_{0.975}(15) = \pm 2.13$

Example: AJPH, 1997; 87: 1491-1498

Correlation between Mortality and Social Mistrust, n = 39, r = 0.79

1. The hypothesis: $H_0: \rho = 0$ vs. $H_1: \rho \neq 0$

2. The α level: $\alpha = 0.05$

3. The assumptions: Random sample from bivariate normal distribution.

4. The test statistic:

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

5. The rejection region: Reject $H_0: \rho = 0$ if t is not between $\pm t_{0.975}(37) = \pm 2.02$

6. The result: n = 39, r = 0.79

$$t = \frac{0.79\sqrt{37}}{\sqrt{1 - 0.79^2}} = \frac{0.79\sqrt{37}}{\sqrt{0.3759}} = 7.8$$

7. The conclusion: Reject $H_0: \rho = 0$ since t = 7.8 is not between $\pm t_{0.975}(37) = \pm 2.02$

Example: AJPH, 1998; 88: 1496-1502

Note: even a few outliers can have a large impact on the location of the line.

Test for Correlation between Gonorrhea rate and Chlamydia rate

1. The hypothesis: $H_0: \rho = 0$ vs. $H_1: \rho \neq 0$

2. The α level: $\alpha = 0.05$

3. The assumptions: Random sample from bivariate normal distribution.

4. The test statistic:

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

5. The rejection region: Reject $H_0: \rho = 0$ if t is not between $\pm t_{0.975}(320) \approx \pm 2.00$

6. The result: n = 322, r = 0.83

$$t = \frac{0.83\sqrt{320}}{\sqrt{1 - 0.83^2}} = 26.67$$

7. The conclusion: Reject $H_0: \rho = 0$ since t = 26.67 is not between $\pm t_{0.975}(320) = \pm 2.00$
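The five published examples in this module all follow the same steps, so they can be checked together. The following minimal Python sketch uses the sample sizes, correlations, and tabled critical values quoted above; the function name and tuple layout are illustrative assumptions. It carries full precision, so some t values differ in the first or second decimal from the slide figures, which round intermediate quantities. The decisions are unchanged.

```python
import math


def correlation_t(r, n):
    """t statistic for H0: rho = 0 with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


# (label, n, r, tabled critical value t_0.975(n - 2)) as quoted in the module
examples = [
    ("Tono-Pen vs. Goldman intraocular pressure", 40, 0.6574, 2.02),
    ("Infant mortality rate vs. gross domestic product", 17, -0.64, 2.13),
    ("Life expectancy, males vs. females", 17, 0.67, 2.13),
    ("Mortality vs. social mistrust", 39, 0.79, 2.02),
    ("Gonorrhea rate vs. Chlamydia rate", 322, 0.83, 2.00),
]

results = []
for label, n, r, t_crit in examples:
    t_stat = correlation_t(r, n)
    reject = abs(t_stat) > t_crit   # two-sided test at alpha = 0.05
    results.append((label, round(t_stat, 2), reject))
    print(f"{label}: t = {t_stat:.2f}, reject H0: {reject}")
```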