Transcript Slide 1

Inference for regression
- More details about simple linear regression
IPS chapter 10.2
© 2006 W.H. Freeman and Company
Objectives (IPS chapter 10.2)
Inference for regression—more details

Analysis of variance for regression

Calculations for regression inference

Inference for correlation
Analysis of variance for regression
The regression model is:

Data = fit + residual
yi = (β0 + β1xi) + (εi)

where the εi are independent and normally distributed N(0, σ), and σ is the same for all values of x.
It resembles an ANOVA, which also assumes equal variance, where

SST = SSmodel + SSerror and DFT = DFmodel + DFerror
For a simple linear relationship, the ANOVA tests the hypotheses
H0: β1 = 0 versus Ha: β1 ≠ 0
by comparing MSM (model) to MSE (error): F = MSM/MSE
When H0 is true, F follows the F(1, n − 2) distribution.
The p-value is the area to the right of the observed F under that distribution.
The ANOVA test and the two-sided t-test for H0: β1 = 0 yield the same p-value.
Software output for regression may provide t, F, or both, along with the p-value.
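The equivalence of the two tests can be checked numerically; here is a minimal sketch (hypothetical data, not from the slides) that builds the ANOVA quantities by hand and verifies that F equals the square of the slope t statistic:

```python
# Sketch on made-up data: for simple linear regression, F = MSM/MSE equals t^2,
# where t = b1 / SE(b1) is the slope t statistic.
import math

x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b1 = sxy / sxx                       # least-squares slope
b0 = ybar - b1 * xbar                # intercept
yhat = [b0 + b1 * xi for xi in x]    # fitted values

ssm = sum((yh - ybar) ** 2 for yh in yhat)             # model SS, DF = 1
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))   # error SS, DF = n - 2

msm = ssm / 1
mse = sse / (n - 2)
F = msm / mse                        # ANOVA F statistic

s = math.sqrt(mse)                   # regression standard deviation estimate
t = b1 / (s / math.sqrt(sxx))        # slope t statistic

print(F, t ** 2)                     # the two agree
```

Because ŷi − ȳ = b1(xi − x̄), SSM = b1²·Σ(xi − x̄)², which is why F = t² holds exactly.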
ANOVA table

Source   Sum of squares SS   DF      Mean square MS   F         P-value
Model    Σ(ŷi − ȳ)²          1       SSM/DFM          MSM/MSE   Tail area above F
Error    Σ(yi − ŷi)²         n − 2   SSE/DFE
Total    Σ(yi − ȳ)²          n − 1

SST = SSM + SSE
DFT = DFM + DFE
The regression standard deviation σ is estimated by s, calculated from the residuals ei = yi − ŷi of the n sample data points:

s² = Σ ei² / (n − 2) = Σ (yi − ŷi)² / (n − 2) = SSE / DFE = MSE

so s = √MSE. s² is an unbiased estimate of the regression variance σ².
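A short sketch (hypothetical data) of this calculation, estimating s directly from the residuals:

```python
# Sketch on made-up data: s as the square root of MSE = SSE/(n - 2).
import math

x = [1, 2, 3, 4, 5]
y = [1.8, 4.1, 5.9, 8.2, 9.8]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n
b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
      / sum((xi - xbar) ** 2 for xi in x))
b0 = ybar - b1 * xbar

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]  # e_i = y_i - yhat_i
sse = sum(e ** 2 for e in residuals)   # sum of squared residuals (SSE)
mse = sse / (n - 2)                    # divide by DFE = n - 2
s = math.sqrt(mse)                     # estimate of sigma
```

Note that least-squares residuals always sum to zero, which is why the divisor is n − 2 rather than n: two degrees of freedom are spent estimating b0 and b1.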
Coefficient of determination, r2

The coefficient of determination, r2, the square of the correlation coefficient, is the proportion of the variance in y (vertical scatter from the regression line) that can be explained by changes in x.

r2 = variation in y explained by x (i.e., by the regression line)
     / total variation in the observed y values around their mean

r2 = Σ (ŷi − ȳ)² / Σ (yi − ȳ)² = SSM / SST
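A quick numerical check (hypothetical data) that the ANOVA ratio SSM/SST and the squared correlation coefficient agree:

```python
# Sketch on made-up data: r^2 computed as SSM/SST equals the squared
# correlation coefficient.
import math

x = [1, 2, 3, 4, 5, 6]
y = [3.0, 4.8, 7.1, 8.9, 11.2, 12.8]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b1 = sxy / sxx
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]

ssm = sum((yh - ybar) ** 2 for yh in yhat)  # variation explained by the line
sst = syy                                   # total variation in y
r2_anova = ssm / sst

r = sxy / math.sqrt(sxx * syy)              # correlation coefficient
r2_corr = r ** 2                            # same number, two routes
```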
What is the relationship between
the average speed a car is
driven and its fuel efficiency?
We plot fuel efficiency (in miles
per gallon, MPG) against average
speed (in miles per hour, MPH)
for a random sample of 60 cars.
The relationship is curved.
When speed is log transformed
(log of miles per hour, LOGMPH)
the new scatterplot shows a
positive, linear relationship.
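The effect of the log transformation can be illustrated with a small sketch; the numbers below are made up (not the 60-car sample from the slides), constructed so that MPG depends exactly logarithmically on speed:

```python
# Sketch with hypothetical numbers: a curved speed-MPG relationship
# straightens out when speed is log-transformed.
import math

speed = [10, 20, 30, 40, 50, 60]                  # MPH (made up)
mpg = [1 + 2 * math.log10(v) for v in speed]      # exactly logarithmic in speed

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(xs, ys))
    sxx = sum((a - xbar) ** 2 for a in xs)
    syy = sum((b - ybar) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)

r_raw = pearson_r(speed, mpg)      # strong but < 1: the relationship is curved
r_log = pearson_r([math.log10(v) for v in speed], mpg)  # exactly linear: r = 1
```

Regressing MPG on log(speed) rather than speed is what makes the inference machinery above applicable, since the linear model then fits.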
Calculations for regression inference
To estimate the parameters of the regression, we calculate the
standard errors of the estimated regression coefficients.
The standard error of the least-squares slope b1 is:

SEb1 = s / √( Σ (xi − x̄)² )

The standard error of the intercept b0 is:

SEb0 = s √( 1/n + x̄² / Σ (xi − x̄)² )

To estimate or predict future responses, we calculate the following
standard errors.
The standard error of the mean response µy at x = x* is:

SEµ̂ = s √( 1/n + (x* − x̄)² / Σ (xi − x̄)² )

The standard error for predicting an individual response ŷ at x = x* is:

SEŷ = s √( 1 + 1/n + (x* − x̄)² / Σ (xi − x̄)² )
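These four standard errors can be computed directly from the data; a minimal sketch on hypothetical values:

```python
# Sketch on made-up data: the slope, intercept, mean-response, and
# individual-prediction standard errors for simple linear regression.
import math

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.2, 3.8, 6.1, 7.9, 10.2, 11.8, 14.1, 15.9]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))                 # regression standard deviation

se_b1 = s / math.sqrt(sxx)                   # SE of the slope
se_b0 = s * math.sqrt(1 / n + xbar ** 2 / sxx)  # SE of the intercept

xstar = 4.5                                  # a new x value (hypothetical)
se_mu = s * math.sqrt(1 / n + (xstar - xbar) ** 2 / sxx)        # mean response
se_pred = s * math.sqrt(1 + 1 / n + (xstar - xbar) ** 2 / sxx)  # individual
```

The prediction SE is always larger than the mean-response SE; in fact SEŷ² = s² + SEµ̂², since an individual response adds one more draw of the N(0, σ) noise.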
1918 flu epidemic

1918 influenza epidemic, weekly counts:

Week     # Cases diagnosed   # Deaths reported
week 1      36                  0
week 2     531                  0
week 3    4233                130
week 4    8682                552
week 5    7164                738
week 6    2229                414
week 7     600                198
week 8     164                 90
week 9      57                 56
week 10    722                 50
week 11   1517                 71
week 12   1828                137
week 13   1539                178
week 14   2416                194
week 15   3148                290
week 16   3465                310
week 17   1440                149

[Figure: line graph of # cases diagnosed (left axis, 0–10000) and # deaths reported (right axis, 0–800) per week, weeks 1–17.]

The line graph suggests that about 7 to 8% of those diagnosed with the flu died within about a week of the diagnosis. We look at the relationship between the number of deaths in a given week and the number of new diagnosed cases one week earlier.

[Figure: scatterplot of # deaths vs. # cases one week earlier; r = 0.91.]

1918 flu epidemic: Relationship between the number of deaths in a given
week and the number of new diagnosed cases one week earlier.
MINITAB - Regression Analysis: FluDeaths1 versus FluCases0

The regression equation is
FluDeaths1 = 49.3 + 0.0722 FluCases0

Predictor    Coef       SE Coef    T      P
Constant     49.29      29.85      1.65   0.121
FluCases     0.072222   0.008741   8.26   0.000

S = 85.07    R-Sq = 83.0%    R-Sq(adj) = 81.8%

(S is s = √MSE; the SE Coef column gives SEb0 and SEb1; R-Sq is r2 = SSM/SST; the P in the FluCases row is the p-value for H0: β1 = 0 vs. Ha: β1 ≠ 0.)

Analysis of Variance

Source           DF    SS       MS       F       P
Regression        1    494041   494041   68.27   0.000
Residual Error   14    101308   7236
Total            15    595349

(SSM = 494041, SST = 595349, MSE = s² = 7236.)
Inference for correlation
To test the null hypothesis of no linear association, we can also use the
correlation parameter ρ.

When x is clearly the explanatory variable, this test is equivalent to
testing the hypothesis H0: β1 = 0, since

b1 = r (sy / sx)
When there is no clear explanatory variable (e.g., arm length vs. leg length),
a regression of x on y is not any more legitimate than one of y on x. In that
case, the correlation test of significance should be used. Technically, in that
case, the test is a test of independence much like we saw in an earlier
chapter on contingency tables.
The test of significance for ρ uses the one-sample t-test for H0: ρ = 0.
We compute the t statistic for sample size n and correlation coefficient r:

t = r √(n − 2) / √(1 − r²)

This calculation turns out to be identical to the t statistic based on the slope.
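A quick check of this equivalence on the flu example, taking r ≈ 0.911 (the square root of R-Sq = 83.0% from the regression output; the slides round it to 0.91) and n = 16:

```python
# Sketch: the correlation t statistic for the flu example matches the
# slope t statistic (8.26) from the regression output.
import math

r = 0.911   # correlation between deaths and cases one week earlier (~sqrt(0.830))
n = 16      # number of (week, previous-week) pairs

t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)   # ~8.26
```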