Testing if a relationship occurs between two variables

Mathematical description of the relationship between two variables using regression
Lecture 13, BIOL2608 Biometrics

Regression Analysis

A regression is a simple mathematical expression that provides an estimate of one variable from another.

Given sufficient quantitative knowledge of the processes involved, it is possible to predict the likely outcome of events.
Regression models

Model 1
[Figure: scatter of y against x]
– "Controlled" parameter (independent variable) vs. measured parameter (dependent variable)
– The independent variable (on the x-axis) must be measured with a high degree of accuracy and is not subject to random variation
– Other influential factors must be kept constant
– The dependent variable (on the y-axis) may vary randomly, and its 'error' should follow a normal distribution

Normally distributed populations of y values
[Figure: normal curves of y values at each x, centred on the regression line]
• The population of y values at each x is normally distributed, and
• The variances of the different populations of y values corresponding to different individual x values are similar
Regression models

Model 2
[Figure: scatter of x2 against x1]
– Both parameters are measured (x1 & x2, not x & y) and cannot be controlled
– Both are subject to random variation and are called random-effects factors
– Common in field studies where conditions are difficult to control
– Correlation, rather than regression, is required for bivariately normal distributions
– e.g. measurements of human arm and leg lengths
Example for model 1

Study the rate of disappearance of a pesticide in a seawater sample
– Time (independent) vs. concentration (dependent)
– Other factors such as pH and salinity must be kept constant

Study the growth rate of fish at different fixed water temperatures
– Temperature (independent) vs. growth rate (dependent)
– Other factors such as diet and feeding frequency must be kept constant
[Figure: a straight line through a point (x, y), rising c units over a run of d units, crossing the Y-axis at a]
Model: y = a + bx
Slope: coefficient b = c/d
Intercept: coefficient a
[Figure: three lines illustrating b = –ve, b = 0, and b = +ve]
Wing lengths of 13 sparrows of various ages

Age (days) X | Wing length (cm) Y
3  | 1.4
4  | 1.5
5  | 2.2
6  | 2.4
8  | 3.1
9  | 3.2
10 | 3.2
11 | 3.9
12 | 4.1
14 | 4.7
15 | 4.5
16 | 5.2
17 | 5.0

[Figure: scatter of wing length (cm, 0.0–6.0) against age (days, 0–20), with fitted line y = 0.2702x + 0.7131, R² = 0.9733]

The concept of least squares
[Figure: the same scatter with a fitted line (y = 0.2695x + 0.7284, R² = 0.9705) and a vertical deviation d from one point to the line]
The sum of the di² indicates the deviations of the points from the regression line. The best-fit line is the one that achieves the minimum sum of squared deviations (Σdi²).
Age (days) X | Wing length (cm) Y | XY
3  | 1.4 | 4.2
4  | 1.5 | 6.0
5  | 2.2 | 11.0
6  | 2.4 | 14.4
8  | 3.1 | 24.8
9  | 3.2 | 28.8
10 | 3.2 | 32.0
11 | 3.9 | 42.9
12 | 4.1 | 49.2
14 | 4.7 | 65.8
15 | 4.5 | 67.5
16 | 5.2 | 83.2
17 | 5.0 | 85.0

n = 13
mean: X = 10.0, Y = 3.4
sum: ΣX = 130.0, ΣY = 44.4, ΣXY = 514.8
sum of squares: ΣX² = 1562.0, ΣY² = 171.3
Calculation for a regression

y = a + bx
b = [Σxy – (ΣxΣy/n)] / [Σx² – (Σx)²/n]
a = ȳ – b·x̄

b = [514.8 – (130)(44.4)/13] / [1562 – (130)²/13] = 70.8/262
b = 0.270 cm/day
a = 3.415 – (0.270)(10.0) = 0.715 cm

The simple linear regression equation is
Ŷ = 0.715 + 0.270X
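As a check on the arithmetic, the sums-of-products recipe above can be run directly in a few lines of plain Python on the 13 sparrow measurements. (Computing a from the unrounded b gives 0.713; the slide's 0.715 comes from rounding b to 0.270 first.)

```python
# Least-squares slope and intercept from the raw sums, as on the slide.
X = [3, 4, 5, 6, 8, 9, 10, 11, 12, 14, 15, 16, 17]                      # age (days)
Y = [1.4, 1.5, 2.2, 2.4, 3.1, 3.2, 3.2, 3.9, 4.1, 4.7, 4.5, 5.2, 5.0]  # wing length (cm)
n = len(X)

sum_x, sum_y = sum(X), sum(Y)
sum_x2 = sum(x * x for x in X)
sum_xy = sum(x * y for x, y in zip(X, Y))

# b = [sum xy - (sum x)(sum y)/n] / [sum x^2 - (sum x)^2 / n]
b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
a = sum_y / n - b * sum_x / n          # a = mean y - b * mean x

print(f"b = {b:.3f} cm/day, a = {a:.3f} cm")   # b ≈ 0.270, a ≈ 0.713
```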
Residual = y – ŷ = y – (a + bx)
[Figure: residual plots showing positive, zero, and negative residuals about the fitted line]
Testing the significance of a regression

Source of variation        | Sum of squares (SS)                        | DF
Total (Yi – Ȳ)             | ΣYi² – (ΣYi)²/n                            | n – 1
Linear regression (Ŷi – Ȳ) | [ΣXiYi – ΣXiΣYi/n]² / [ΣXi² – (ΣXi)²/n]   | 1
Residual (Yi – Ŷi)         | total SS – regression SS                   | n – 2

Mean square (MS):
regression MS = regression SS / regression DF
residual MS = residual SS / residual DF

Compute F = regression MS / residual MS and compare with the critical F(1), 1, (n–2).

Coefficient of determination r² = regression SS / total SS
r² indicates the proportion (or %) of the total variation in Y that is explained or accounted for by the fitted regression:
e.g. r² = 0.81, i.e. the regression can explain 81% of the total variation
ANOVA testing of Ho: β = 0

Total SS = 171.3 – (44.4)²/13 = 19.66
Regression SS = [514.8 – (130)(44.4)/13]² / [1562 – (130)²/13] = 19.13
Residual SS = 19.66 – 19.13 = 0.53
Total DF = n – 1 = 12

Source of variation | SS    | DF | MS    | F     | Crit F | P
Total               | 19.66 | 12 |       |       |        |
Linear regression   | 19.13 | 1  | 19.13 | 401.1 | 4.84   | <0.001
Residual            | 0.53  | 11 | 0.048 |       |        |

Thus, reject Ho. The residual MS (0.048) is (S_y·x)², the variance of Y about the regression.
r² = 19.13/19.66 = 0.97
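The ANOVA partition can be reproduced from the same summary sums; this short sketch mirrors the table above:

```python
# ANOVA partition of the sparrow regression, from the sums on the slide.
n = 13
sum_x, sum_y = 130.0, 44.4
sum_x2, sum_y2, sum_xy = 1562.0, 171.3, 514.8

total_ss = sum_y2 - sum_y ** 2 / n                                       # ≈ 19.66
reg_ss = (sum_xy - sum_x * sum_y / n) ** 2 / (sum_x2 - sum_x ** 2 / n)   # ≈ 19.13
resid_ss = total_ss - reg_ss                                             # ≈ 0.53

resid_ms = resid_ss / (n - 2)    # (S_y·x)^2 ≈ 0.048
F = (reg_ss / 1) / resid_ms      # ≈ 401, compared with crit F(1),1,11 = 4.84
r2 = reg_ss / total_ss           # ≈ 0.97
```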
How to report the results in texts?

Source of variation | SS    | DF | MS    | F     | Crit F | P
Total               | 19.66 | 12 |       |       |        |
Linear regression   | 19.13 | 1  | 19.13 | 401.1 | 4.84   | <0.001
Residual            | 0.53  | 11 | 0.048 |       |        |

Thus, reject Ho. r² = 19.13/19.66 = 0.97

The wing length of the sparrow significantly increases with increasing age (F1,11 = 401.1, r² = 0.97, p < 0.001). 97% of the total variance in the wing length data can be explained by age.
Confidence intervals of the regression coefficients

Assume that the slopes b of a regression are normally distributed; then it is possible to fit confidence intervals (CI) to the slope:
95% CI = b ± t 0.05(2), (n–2) · Sb
where Sb = √[(S_y·x)² / (ΣXi² – (ΣXi)²/n)]

CI for an estimated value Ŷi: Ŷi ± t 0.05(2), (n–2) · SŶi
where SŶi = √{(S_y·x)² [1/n + (Xi – X̄)² / (ΣXi² – (ΣXi)²/n)]}
Standard errors of predicted values of Y (following the same example)

Ŷ = 0.715 + 0.270X, mean X = 10
If X = 13 days, then Ŷ = 4.225 cm

SŶi = √{(S_y·x)² [1/n + (Xi – X̄)² / (ΣXi² – (ΣXi)²/n)]}
    = √{(0.0477)[1/13 + (13 – 10)²/(1562 – 130²/13)]}
    = 0.073 cm

95% CI for Ŷi = Ŷi ± t 0.05(2), 11 · SŶi
             = 4.225 ± (2.201)(0.073)
             = 4.225 ± 0.161 cm
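The same calculation in Python, for the estimate at X = 13 days:

```python
import math

# SE and 95% CI of the estimated mean wing length at X = 13 days.
a, b = 0.715, 0.270
n, mean_x, sxx = 13, 10.0, 262.0
resid_ms = 0.0477          # (S_y·x)^2
t_crit = 2.201             # t_0.05(2),11

x = 13
y_hat = a + b * x                                             # 4.225 cm
se = math.sqrt(resid_ms * (1 / n + (x - mean_x) ** 2 / sxx))  # ≈ 0.073 cm
half = t_crit * se                                            # ≈ 0.161 cm
print(f"Y at {x} days = {y_hat} ± {half:.3f} cm")
```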
Inverse prediction (following the same example)

Ŷ = 0.715 + 0.270X and mean Y = 3.415. If Yi = 4.5, then
X̂ = (Yi – 0.715)/0.270 = 14.019 days

To compute the 95% CI:
t 0.05(2), 11 = 2.201
K = b² – t²Sb² = 0.270² – (2.201)²[(S_y·x)²/(ΣXi² – (ΣXi)²/n)]
  = 0.270² – (2.201)²[(0.0477)/(262)] = 0.0720

95% CI = X̄ + b(Yi – Ȳ)/K ± (t/K)·√{(S_y·x)²{[(Yi – Ȳ)²/(ΣXi² – (ΣXi)²/n)] + K(1 + 1/n)}}
       = 14.069 ± (2.201/0.072)·√{(0.0477){[(4.5 – 3.415)²/262] + 0.072(1 + 1/13)}}
       = 14.069 ± 1.912 days
Lower limit = 12.157 days; upper limit = 15.981 days
(Note that the interval is centred on 14.069, not on the point estimate 14.019.)
Important for toxicity tests
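The inverse-prediction limits can be reproduced as follows; as noted above, the interval is centred on X̄ + b(Yi – Ȳ)/K (≈ 14.07), not on the point estimate X̂ (≈ 14.02):

```python
import math

# Inverse prediction: estimate the age at which wing length Yi = 4.5 cm,
# with its 95% CI (values from the worked sparrow example).
a, b = 0.715, 0.270
mean_x, mean_y = 10.0, 3.415
n, sxx = 13, 262.0
resid_ms = 0.0477          # (S_y·x)^2
t = 2.201                  # t_0.05(2),11

yi = 4.5
x_hat = (yi - a) / b                       # point estimate ≈ 14.02 days

s_b2 = resid_ms / sxx                      # Sb^2
K = b ** 2 - t ** 2 * s_b2                 # ≈ 0.0720
centre = mean_x + b * (yi - mean_y) / K    # ≈ 14.07 days
half = (t / K) * math.sqrt(resid_ms * ((yi - mean_y) ** 2 / sxx + K * (1 + 1 / n)))
print(f"95% CI: {centre - half:.2f} to {centre + half:.2f} days")
```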
Regression with replication

[Figure: blood pressure (mm Hg, 100–170) against age (years, 0–80), with fitted line y = 1.3031x + 68.785, R² = 0.9827]

See p. 345–357, example 17.8 (Zar, 1999)
Age, X (year) | B.P., Yij (mm Hg) | XY
30 | 108 | 3240
30 | 110 | 3300
30 | 106 | 3180
40 | 125 | 5000
40 | 120 | 4800
40 | 118 | 4720
40 | 119 | 4760
50 | 132 | 6600
50 | 137 | 6850
50 | 134 | 6700
60 | 148 | 8880
60 | 151 | 9060
60 | 146 | 8760
60 | 147 | 8820
60 | 144 | 8640
70 | 162 | 11340
70 | 156 | 10920
70 | 164 | 11480
70 | 158 | 11060
70 | 159 | 11130

n = 20; ΣX = 1050, ΣY = 2744, ΣXY = 149240; ΣX² = 59100, ΣY² = 383346

Group totals ΣYij, ni, and (ΣYij)²/ni:
Age 30: 324, 3, 34992
Age 40: 482, 4, 58081
Age 50: 403, 3, 54136.33
Age 60: 736, 5, 108339.2
Age 70: 799, 5, 127680.2
Σ(ΣYij)²/ni = 383228.73
b = [149240 – (1050)(2744)/20] / [59100 – (1050)²/20] = 5180/3975 = 1.303 mm Hg/yr
Mean X = 52.5; mean Y = 137.2
a = 137.2 – (1.303)(52.5) = 68.79 mm Hg
Ŷ = 68.79 + 1.303X

Example 17.8
Total SS (DF = 20 – 1 = 19) = 383346 – (2744)²/20 = 6869.2
Regression SS (DF = 1) = [149240 – (1050)(2744)/20]² / [59100 – (1050)²/20] = (5180)²/3975 = 6750.29
Among-groups SS (DF = k – 1 = 5 – 1 = 4) = 383228.73 – (2744)²/20 = 6751.93
Within-groups SS (DF = total – among-groups = 19 – 4 = 15) = 6869.2 – 6751.93 = 117.27
Deviations-from-linearity SS (DF = among-groups – regression = 4 – 1 = 3) = 6751.93 – 6750.29 = 1.64
Ho: The population regression is linear.

Source of variation         | SS      | DF | MS   | F    | P
Total                       | 6869.2  | 19 |      |      |
Among groups                | 6751.93 | 4  |      |      |
  Linear regression         | 6750.29 | 1  |      |      |
  Deviations from linearity | 1.64    | 3  | 0.55 | 0.07 | >0.25
Within groups               | 117.27  | 15 | 7.82 |      |

Thus, accept Ho.
Ho: β = 0

Source of variation | SS      | DF | MS      | F      | P
Total               | 6869.2  | 19 |         |        |
Linear regression   | 6750.29 | 1  | 6750.29 | 1021.2 | <0.001
Residual            | 118.91  | 18 | 6.61    |        |

Thus, reject Ho. r² = 6750.29/6869.2 = 0.98
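The whole partition, including the linearity test, can be reproduced from the raw blood-pressure data:

```python
# Testing linearity with replicated X values (blood-pressure example).
groups = {
    30: [108, 110, 106],
    40: [125, 120, 118, 119],
    50: [132, 137, 134],
    60: [148, 151, 146, 147, 144],
    70: [162, 156, 164, 158, 159],
}
xs = [x for x, ys in groups.items() for _ in ys]
ys = [y for v in groups.values() for y in v]
n, k = len(ys), len(groups)

sum_x, sum_y = sum(xs), sum(ys)
sum_x2 = sum(x * x for x in xs)
sum_y2 = sum(y * y for y in ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))

total_ss = sum_y2 - sum_y ** 2 / n                                       # ≈ 6869.2
reg_ss = (sum_xy - sum_x * sum_y / n) ** 2 / (sum_x2 - sum_x ** 2 / n)   # ≈ 6750.29
among_ss = sum(sum(v) ** 2 / len(v) for v in groups.values()) - sum_y ** 2 / n
within_ss = total_ss - among_ss                                          # ≈ 117.27
dev_ss = among_ss - reg_ss                                               # ≈ 1.64

F_dev = (dev_ss / (k - 2)) / (within_ss / (n - k))   # deviations-from-linearity F ≈ 0.07
```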
Model 2 regression

The regression coefficient b′ (b prime) is obtained as S_x1/S_x2 (the ratio of the standard deviations), with its sign taken from the scatter-graph.

e.g. field observations of PCB (polychlorinated biphenyl) concentration per gram of fish compared with kidney biomass:

X1, kidney biomass (g): 11.25, 10.37, 8.82, 8.15, 8.50, 9.00, 10.15, 11.81, 10.18, 9.16
X2, PCB conc (µg/g): 4.60, 3.80, 6.50, 7.05, 9.85, 4.55, 7.20, 2.00, 5.65, 7.15

     | X1     | X2
mean | 9.739  | 5.835
sd   | 1.2074 | 2.2033
n    | 10     | 10

b′ = 1.2074/2.2033 = 0.548; the slope is negative by inspection of the scatter-graph, so b′ = –0.548
a′ = mean X1 – b′(mean X2) = 9.739 – (–0.548)(5.835) = 12.94
X̂1 = 12.94 – 0.548X2

[Figure: [PCB] (µg/g) against kidney biomass (g); the ordinary least-squares trendline shown is y = –1.3675x + 19.153, R² = 0.5616]
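A sketch of the model 2 calculation; the sign of b′ is taken here from the sign of the cross-products rather than by eye, which gives the same answer as the slide's inspection of the scatter:

```python
import statistics as st

# Model 2 regression coefficient b' = s_X1 / s_X2, sign from the covariance.
kidney = [11.25, 10.37, 8.82, 8.15, 8.50, 9.00, 10.15, 11.81, 10.18, 9.16]  # X1 (g)
pcb    = [4.60, 3.80, 6.50, 7.05, 9.85, 4.55, 7.20, 2.00, 5.65, 7.15]       # X2 (ug/g)

b_mag = st.stdev(kidney) / st.stdev(pcb)      # 1.2074 / 2.2033 ≈ 0.548
cross = sum((x - st.mean(pcb)) * (y - st.mean(kidney))
            for x, y in zip(pcb, kidney))     # negative here, so the slope is negative
b = b_mag if cross > 0 else -b_mag
a = st.mean(kidney) - b * st.mean(pcb)        # ≈ 12.94

print(f"X1-hat = {a:.2f} {b:+.3f} * X2")
```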
Key notes

– There are two regression models
– Model 1 regression is used when one of the variables is fixed so that it is measured with negligible error
– In model 1, the fixed variable is called the independent variable x and the dependent variable is designated y
– The regression equation is written y = a + bx, where a and b are the regression coefficients
– Model 2 regression is used where neither variable is fixed and both are measured with error
Comparing simple linear regression equations & multiple regression analysis

Comparing Two Slopes

Use of Student's t:
t = (b1 – b2) / S_b1–b2
S_b1–b2 = √[(S_Y·X)²p/(Σx²)1 + (S_Y·X)²p/(Σx²)2]

(S_Y·X)²p = [(residual SS)1 + (residual SS)2] / [(residual DF)1 + (residual DF)2]

Critical t: DF = n1 + n2 – 4
Test Ho: equal slopes

[Figure: pairs of regression lines illustrating differing and similar slopes]
Comparing Two Slopes – Example

Σx² = ΣXi² – (ΣXi)²/n
Σxy = ΣXiYi – ΣXiΣYi/n
Σy² = ΣYi² – (ΣYi)²/n

[Figure: two regression lines with slopes b = 2.97 and b = 2.17]

X = temperature (°C); Y = volume (ml)

    | Sample 1   | Sample 2
Σx² | 1470.8712  | 2272.4750
Σxy | 4363.1627  | 4928.8100
Σy² | 13299.5296 | 10964.0947
n   | 26         | 30

b = Σxy/Σx²
b1 = 4363.1627/1470.8712 = 2.97
b2 = 4928.8100/2272.4750 = 2.17

Residual SS = RSS = Σy² – (Σxy)²/Σx²
RSS1 = 13299.5296 – (4363.1627)²/1470.8712 = 356.7317
RSS2 = 10964.0947 – (4928.8100)²/2272.4750 = 273.9142
Residual DF = RDF = n – 2: RDF1 = 26 – 2 = 24; RDF2 = 30 – 2 = 28

(S_Y·X)²p = (RSS1 + RSS2)/(RDF1 + RDF2) = (356.7317 + 273.9142)/(24 + 28) = 12.1278

S_b1–b2 = √[(12.1278)/(1470.8712) + (12.1278)/(2272.4750)] = 0.1165

t = (2.97 – 2.17)/0.1165 = 6.867
DF = 24 + 28 = 52; critical t 0.05(2), 52 = 2.007
Reject Ho. Therefore, there is a significant difference between the two slopes (t = 6.867, DF = 52, p < 0.001).
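The pooled-slope t statistic can be checked from the summary values above:

```python
import math

# Pooled comparison of two regression slopes (summary values from the slide).
b1, b2 = 2.97, 2.17
rss1, rss2 = 356.7317, 273.9142
rdf1, rdf2 = 24, 28
sx2_1, sx2_2 = 1470.8712, 2272.4750

s2_pooled = (rss1 + rss2) / (rdf1 + rdf2)              # (S_Y·X)^2 pooled ≈ 12.128
se = math.sqrt(s2_pooled / sx2_1 + s2_pooled / sx2_2)  # ≈ 0.1165
t = (b1 - b2) / se                                     # ≈ 6.87, DF = 52
print(f"t = {t:.3f}")
```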
Testing for difference between points on the two nonparallel regression lines. We are testing whether the volumes (Y) differ between the two groups at X = 12 (Ho: same value).

Further, we need to know:
a1 = 10.57 and a2 = 24.91; mean X1 = 22.93 and mean X2 = 18.95
Then, from Ŷ = a + bX: estimated Ŷ1 = 46.21; Ŷ2 = 50.95

[Figure: the two nonparallel lines, b = 2.97 and b = 2.17]

S_Ŷ1–Ŷ2 = √{(S_Y·X)²p [1/n1 + 1/n2 + (X – X̄1)²/(Σx²)1 + (X – X̄2)²/(Σx²)2]}
        = √{(12.1278)[(1/26) + (1/30) + (12 – 22.93)²/(1470.8712) + (12 – 18.95)²/(2272.4750)]}
        = 1.45 ml

t = (46.21 – 50.95)/1.45 = –3.269 (0.001 < p < 0.002)
DF = 26 + 30 – 4 = 52; critical t 0.05(2), 52 = 2.007
Reject Ho.
Comparing Two Elevations

If there is no significant difference between the slopes, you then compare the two elevations.

[Figure: two parallel regression lines at different elevations]

For the common regression:
Sum of squares of X: Ac = (Σx²)1 + (Σx²)2
Sum of cross-products: Bc = (Σxy)1 + (Σxy)2
Sum of squares of Y: Cc = (Σy²)1 + (Σy²)2
Residual SSc = Cc – Bc²/Ac
Residual DFc = n1 + n2 – 3
Residual MS = (S_Y·X)²c = SSc/DFc
Common slope: bc = Bc/Ac

t = [(Ȳ1 – Ȳ2) – bc(X̄1 – X̄2)] / √{(S_Y·X)²c [1/n1 + 1/n2 + (X̄1 – X̄2)²/Ac]}
Comparing slopes and elevations: An example

       | Sample 1  | Sample 2
Σx²    | 1012.1923 | 1659.4333
Σxy    | 1585.3385 | 2475.4333
Σy²    | 2618.3077 | 3848.9333
n      | 13        | 15
Mean X | 54.65     | 56.93
Mean Y | 170.23    | 162.93

b = Σxy/Σx²: b1 = 1.57; b2 = 1.49
RSS = Σy² – (Σxy)²/Σx²: RSS1 = 136.2230; RSS2 = 156.2449
RDF = n – 2: RDF1 = 11; RDF2 = 13

(S_Y·X)²p = (RSS1 + RSS2)/(RDF1 + RDF2) = 12.1862

Test Ho: equal slopes
DF for the t test = 11 + 13 = 24
S_b1–b2 = 0.1392
t = (1.57 – 1.49)/0.1392 = 0.575 < critical t 0.05(2), 24 = 2.064, p > 0.50
Accept Ho.
Comparing slopes and elevations: An example (continued)

Ac = 1012.1923 + 1659.4333 = 2671.6256
Bc = 1585.3385 + 2475.4333 = 4060.7718
Cc = 2618.3077 + 3848.9333 = 6467.2410

bc = 4060.7718/2671.6256 = 1.520
SSc = 6467.2410 – (4060.7718)²/2671.6256 = 295.0185
DFc = 13 + 15 – 3 = 25
Residual MS = (S_Y·X)²c = SSc/DFc = 11.8007

Test Ho: equal elevations
t = [(170.23 – 162.93) – 1.520(54.65 – 56.93)] / √{11.8007[1/13 + 1/15 + (54.65 – 56.93)²/2671.6256]} = 8.218
> critical t 0.05(2), 25 = 2.060, p < 0.001
Reject Ho.
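The elevation test, from the same summary values:

```python
import math

# Testing equal elevations after the slopes were found not to differ.
ac = 1012.1923 + 1659.4333      # pooled sum of squares of X
bc_sum = 1585.3385 + 2475.4333  # pooled sum of cross-products
cc = 2618.3077 + 3848.9333      # pooled sum of squares of Y
n1, n2 = 13, 15

bc = bc_sum / ac                # common slope ≈ 1.520
ssc = cc - bc_sum ** 2 / ac     # ≈ 295.02
ms = ssc / (n1 + n2 - 3)        # (S_Y·X)^2 common ≈ 11.80

num = (170.23 - 162.93) - bc * (54.65 - 56.93)
den = math.sqrt(ms * (1 / n1 + 1 / n2 + (54.65 - 56.93) ** 2 / ac))
t = num / den                   # ≈ 8.22, DF = 25
print(f"t = {t:.3f}")
```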
Comparing more than two slopes

– See Section 18.4 (Zar, 1999) for details
– You can also perform an Analysis of Covariance (ANCOVA) using SPSS
ANCOVA - example

Species | Water temp (°C) | Heart beat rate (beats per minute)
A | 10.1 | 89
A | 12.2 | 94.8
A | 13.5 | 99.6
A | 11.2 | 93.8
A | 10.2 | 91
A | 9.8  | 89.2
B | 14.1 | 104.5
B | 12.3 | 103.6
B | 9.5  | 91.1
B | 11.6 | 99.6
B | 10.1 | 99.1
B | 9.4  | 88.7

– The covariate MUST be continuous, on a ratio or interval scale (e.g. temperature)
– The dependent variable(s) is/are dependent on the covariate (increasing or decreasing with it)

[Figure: heart rate (beats per min, 80–110) against temperature (°C, 9–15); species B: y = 3.089x + 63.273, R² = 0.7734; species A: y = 2.7436x + 62.263, R² = 0.9617]
ANCOVA - example (continued)

– Plot the figure
– Test Ho: equal slopes
– Then test Ho: equal elevations
ANCOVA in SPSS

1. Heart beat as dependent variable
2. Temp as covariate
3. Species as fixed factor
4. Test Ho: equal slopes using a model with an interaction: Species × Temp
5. A significant interaction indicates different slopes
6. If there is no significant interaction, remove this term from the model: a significant Species effect then indicates different elevations
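Outside SPSS, the equal-slopes step can be sketched with the pooled-slope t test from earlier in the lecture, applied to the heart-rate data (pure Python; with these small samples the residual DF is (6 – 2) + (6 – 2) = 8):

```python
import math

def slope_stats(xs, ys):
    """Return (slope, sum of squared x-deviations, residual SS, n)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sxx, sxx, syy - sxy ** 2 / sxx, n

A = ([10.1, 12.2, 13.5, 11.2, 10.2, 9.8], [89, 94.8, 99.6, 93.8, 91, 89.2])
B = ([14.1, 12.3, 9.5, 11.6, 10.1, 9.4], [104.5, 103.6, 91.1, 99.6, 99.1, 88.7])

b1, sxx1, rss1, n1 = slope_stats(*A)   # b1 ≈ 2.744 (matches the chart)
b2, sxx2, rss2, n2 = slope_stats(*B)   # b2 ≈ 3.089 (matches the chart)

s2p = (rss1 + rss2) / ((n1 - 2) + (n2 - 2))
t = (b1 - b2) / math.sqrt(s2p / sxx1 + s2p / sxx2)
print(f"t = {t:.3f}, DF = {(n1 - 2) + (n2 - 2)}")   # |t| < t_0.05(2),8 = 2.306
```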
Double log transformation

[Figure: a curved relationship of Y against X becomes a straight line when log Y is plotted against log X]

If the relationship is linear on a double-log plot, then:
log Y = k + m log X
log Y = log(10^k) + log(X^m)
log Y = log[(10^k)X^m]
Y = (10^k)X^m
Y = CX^m

[Figure: weight against length is curvilinear; ln(wt) against ln(length) is linear]

Then ln W = a + b ln L, so W = (e^a)L^b
(natural logs can also be used)

Similarly, for a physiological response (e.g. E = ammonia excretion rate) vs. body weight:
E = (e^a)W^(–b)
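A sketch of fitting Y = CX^m by double-log transformation; the data here are illustrative, generated from a known power law (C = 0.01, m = 3), so the fit should recover the true coefficients:

```python
import math

# Fit Y = C * X**m by regressing log Y on log X.
# Illustrative data from a known power law: C = 0.01, m = 3.
X = [2, 4, 6, 8, 10, 12]
Y = [0.01 * x ** 3 for x in X]

lx = [math.log10(x) for x in X]
ly = [math.log10(y) for y in Y]

n = len(X)
mx, my = sum(lx) / n, sum(ly) / n
m = (sum((a - mx) * (b - my) for a, b in zip(lx, ly))
     / sum((a - mx) ** 2 for a in lx))   # slope of the log-log line = exponent
k = my - m * mx                          # intercept = log10(C)
C = 10 ** k
print(f"Y = {C:.4f} * X^{m:.2f}")
```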
Multiple regression analysis

Simple linear regression: Y = a + bX
Models:
– Y = a + b1X1 + b2X2
– Y = a + b1X1 + b2X2 + b3X3
– Y = a + b1X1 + b2X2 + b3X3 + b4X4
– Y = a + ΣbiXi
– As in simple linear regression, there is only one effect or dependent variable (Y). However, there are several cause or independent variables chosen by the experimenter.

Multiple regression analysis

– The same assumptions as for linear regression apply, so each of the cause (independent) variables must be measured without error
– AND each of these cause variables must be independent of the others
– If they are not independent, partial correlation should be used

Multiple regression analysis

– Works in exactly the same way as linear regression, only the best-fit "line" is made up of a separate slope for each of the cause variables
– There is a single intercept, which is the value of the effect variable when all cause variables are zero

Multiple regression analysis

– A multiple regression using just two cause variables can be visualized using a 3D diagram
– If there are more cause variables, there is no way to display the relationships
– Can be done easily with SPSS; stepwise multiple regression analysis will help you determine the most important cause factor(s)
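A minimal sketch of multiple regression with two cause variables, solved through the normal equations in plain Python (the data are illustrative, built from known coefficients a = 1, b1 = 2, b2 = –0.5, which the fit should recover):

```python
# Minimal multiple regression Y = a + b1*X1 + b2*X2 via the normal equations.
X1 = [1, 2, 3, 4, 5, 6]
X2 = [2, 1, 4, 3, 6, 5]
Y = [1 + 2 * x1 - 0.5 * x2 for x1, x2 in zip(X1, X2)]

rows = [[1.0, x1, x2] for x1, x2 in zip(X1, X2)]   # design matrix with intercept column
p = 3

# Normal equations: (X'X) beta = X'Y
XtX = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
XtY = [sum(r[i] * y for r, y in zip(rows, Y)) for i in range(p)]

# Solve the 3x3 system by Gaussian elimination with back-substitution.
for i in range(p):
    piv = XtX[i][i]
    for j in range(i + 1, p):
        f = XtX[j][i] / piv
        for k in range(p):
            XtX[j][k] -= f * XtX[i][k]
        XtY[j] -= f * XtY[i]
beta = [0.0] * p
for i in range(p - 1, -1, -1):
    beta[i] = (XtY[i] - sum(XtX[i][j] * beta[j] for j in range(i + 1, p))) / XtX[i][i]

a, b1, b2 = beta
print(f"Y = {a:.2f} + {b1:.2f}*X1 + {b2:.2f}*X2")
```

In practice a package routine (e.g. SPSS's regression procedure, as the slides suggest) handles this and also reports standard errors and stepwise selection; the sketch only shows the fitting step.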