
Statistics for Business and
Economics
Module 2: Regression and time series analysis
Spring 2010
Lecture 4: Inference about regression
Priyantha Wijayatunga, Department of Statistics, Umeå University
[email protected]
These materials are adapted from copyrighted lecture slides (© 2009 W.H.
Freeman and Company) from the homepage of the book:
The Practice of Business Statistics Using Data for Decisions: Second Edition
by Moore, McCabe, Duckworth and Alwan.
Inference about the regression model
and using the model
(Reference to the book: Chapters 10.1, 10.2 and 10.3)

• Statistical model for simple linear regression
• Estimating the regression parameters and standard errors
• Conditions for regression inference
• Confidence intervals and significance tests
• Inference about correlation
• Confidence and prediction intervals
• Analysis of variance for regression and coefficient of determination
Error Variable: Required Conditions
Linear regression model: Y = β0 + β1X + e
The error e is a critical part of the regression model.
Four requirements involving the distribution of e must be satisfied:
• The probability distribution of e is normal.
• The mean of e is zero: E(e) = 0.
• The standard deviation of e is σ for all values of x.
• The errors associated with different values of y are all independent.
Our estimated model: ŷ = b0 + b1x
(in the plotted example, ŷ = 0.125x + 41.4)
The data in a scatterplot are a random sample from a population that may
exhibit a linear relationship between x and y. Different sample → different plot.
Now we want to describe the population mean response μy as a
function of the explanatory variable x: μy = β0 + β1x.
And to assess whether the observed relationship is statistically
significant (not entirely explained by chance events due to random
sampling).
Statistical model for simple linear regression
In the population, the linear regression equation is μy = β0 + β1x.
Sample data then fit the model:
Data = fit + residual
yi = (β0 + β1xi) + (εi)
where the εi are independent and normally distributed N(0, σ).
Linear regression assumes equal variance of y
(σ is the same for all values of x).
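As a quick illustration (not from the slides), here is a minimal Python sketch that simulates data from this model and recovers the parameters by least squares. The true values echo the example line ŷ = 0.125x + 41.4; the x-range and σ are invented for the demonstration.

import numpy as np

rng = np.random.default_rng(0)
beta0, beta1, sigma = 41.4, 0.125, 5.0      # illustrative parameter values
x = rng.uniform(100, 300, size=50)          # hypothetical explanatory values
y = beta0 + beta1 * x + rng.normal(0, sigma, size=50)   # y = beta0 + beta1*x + N(0, sigma)

# Least-squares estimates of slope and intercept
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(f"b0 = {b0:.2f}, b1 = {b1:.4f}")      # close to 41.4 and 0.125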
Estimating the regression parameters
μy = β0 + β1x
The intercept β0, the slope β1, and the standard deviation σ of y are the
unknown parameters of the regression model. We rely on the random
sample data to provide unbiased estimates of these parameters.
• The value of ŷ from the least-squares regression line is really a prediction
of the mean value of y (μy) for a given value of x.
• The least-squares regression line (ŷ = b0 + b1x) obtained from sample data
is the best estimate of the true population regression line (μy = β0 + β1x).
ŷ: unbiased estimate for the mean response μy
b0: unbiased estimate for the intercept β0
b1: unbiased estimate for the slope β1
Regression standard error: s
(Standard error of estimate)
• The standard deviation of the error variable shows the dispersion
around the true line for a given x.
• If σ is big, we have big dispersion around the true line. If σ is small,
the observations tend to be close to the line; then the model fits the
data well.
• Therefore, we can use σ as a measure of the suitability of using a
linear model.
• However, we usually do not know σ.
• Therefore, we use s to estimate it.
The population standard deviation σ
for y at any given value of x represents
the spread of the normal distribution of
the εi around the mean μy.
The regression standard error, s, for n sample data points is
calculated from the residuals (yi − ŷi):

s = √( Σ residual² / (n − 2) ) = √( Σ (yi − ŷi)² / (n − 2) )

s is an unbiased estimate of the regression standard deviation σ.
Conditions for regression inference
• The observations are independent (the error terms of different
observations should be independent of each other).
• The relationship is indeed linear.
• The standard deviation of y, σ, is the same for all values of x.
• The response y varies normally around its mean; that is, the error
term should be normally distributed with mean zero.
Using residual plots to check for regression validity
The residuals (y − ŷ) give useful information about the contribution of
individual data points to the overall pattern of scatter.
We view the residuals in a residual plot:
If residuals are scattered randomly around 0 with uniform variation, this
indicates that the data fit a linear model, with normally distributed
residuals for each value of x and constant standard deviation σ.
Residuals are randomly scattered → good!
Curved pattern → the relationship is not linear.
Change in variability across the plot → σ is not equal for all values of x.
What is the relationship between
the average speed a car is
driven and its fuel efficiency?
We plot fuel efficiency (in miles
per gallon, MPG) against average
speed (in miles per hour, MPH)
for a random sample of 60 cars.
The relationship is curved.
When speed is log transformed
(log of miles per hour, LOGMPH)
the new scatterplot shows a
positive, linear relationship.
Residual plot:
The spread of the residuals is
reasonably random—no clear pattern.
The relationship is indeed linear.
But we see one low residual (3.8, −4)
and one potentially influential point
(2.5, 0.5).
Normal quantile plot for residuals:
The plot is fairly straight, supporting
the assumption of normally distributed
residuals.
→ Data okay for inference.
Checking normality of residuals
Most departures from the required conditions can be diagnosed by
residual analysis.

Standardized residual = (residual − mean of the residuals) / standard deviation of the residuals

For our case: standardized i-th residual = ei / s

Food Company example:
1st data point: when ADVER = 276, SALES = 115.0
predicted SALES = 118.0087
residual = 115.0 − 118.0087 = −3.008726
Standardized residual = −3.008726 / 16.09 = −0.1869935
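A quick Python check of this arithmetic (values as quoted above; s = 16.09196 is computed a few slides later):

residual = 115.0 - 118.0087     # observed minus predicted SALES
s = 16.09196                    # regression standard error
print(residual / s)             # ≈ -0.187, the standardized residual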
Checking normality
Non-normality of the residuals can be checked by making a
histogram of the residuals.
[Histogram of residuals for the Food Company example: Residual (−30 to 30)
on the x-axis, Frequency (0 to 3) on the y-axis.]
Heteroscedasticity: the variance of the errors is not constant
(a violation of the requirement).
Homoscedasticity: the variance of the errors is constant
(no violation of the requirement).
Non-independence of the error variable is another possible violation.
Check: plot the residuals against the predicted values of Y from the model.
[Residual plot for the Food Company example: residuals (−20 to 20)
against predicted SALES values (120 to 180).]
Confidence interval for regression parameters
Estimating the regression parameters β0, β1 is a case of one-sample
inference with unknown population standard deviation σ.
→ We rely on the t distribution, with n − 2 degrees of freedom.
A level C confidence interval for the slope, β1, is:
b1 ± t* SEb1
A level C confidence interval for the intercept, β0, is:
b0 ± t* SEb0
t* is the critical value for the t(n − 2) distribution with area C between −t* and +t*.
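A small Python sketch of the slope interval, using scipy's t quantile; the plugged-in numbers come from the ADVER/SALES example later in the deck.

from scipy import stats

def slope_ci(b1, se_b1, n, level=0.95):
    # b1 ± t* SEb1, with t* from the t(n - 2) distribution
    t_star = stats.t.ppf((1 + level) / 2, df=n - 2)
    return b1 - t_star * se_b1, b1 + t_star * se_b1

print(slope_ci(0.06806, 0.01757183, 9))   # ≈ (0.027, 0.110), as in the SPSS output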
Significance test for the slope
We can test the hypothesis H0: β1 = 0 versus a one- or two-sided alternative.
We calculate
t = b1 / SEb1
which has the t(n − 2)
distribution, to find the
p-value of the test.
Note: Software typically provides
two-sided p-values.
Standard errors of b1 and b0
To estimate the parameters of the regression, we calculate the
standard errors for the estimated regression coefficients.
The standard error of the least-squares slope b1 is:

SEb1 = s / √( Σ (xi − x̄)² )

The standard error of the intercept b0 is:

SEb0 = s √( 1/n + x̄² / Σ (xi − x̄)² )
Testing the hypothesis of no relationship
We may look for evidence of a significant relationship between
variables x and y in the population from which our data were drawn.
For that, we can test the hypothesis that the regression slope
parameter β1 is equal to zero.
H0: β1 = 0 vs. Ha: β1 ≠ 0
Since the slope b1 = r (sy / sx), testing H0: β1 = 0 also allows us to test
the hypothesis of no correlation between x and y in the population.
Note: A test of hypothesis for β0 is usually irrelevant (the intercept
corresponds to x = 0, which is often not even attainable).
Example: ADVER and SALES
Estimated model: ŜALES = 99.273 + 0.06806 ADVER

i      xi      yi       (xi − x̄)²   ŷi         (yi − ŷi)²
1      276     115.0    115600      118.0087   9.052434
2      552     135.6    4096        136.8165   1.479981
3      720     153.6    10816       148.2648   28.464554
4      648     117.6    1024        143.3584   663.494881
5      336     106.8    78400       122.0974   234.009911
6      396     150.0    48400       126.1860   567.104757
7      1056    164.4    193600      171.1613   45.714584
8      1188    190.8    327184      180.1563   113.288358
9      372     136.8    59536       124.5506   150.048384
Total  5544    1270.6   838656                 1812.658
Mean   616     141.178

sx² = (1/(n − 1)) Σ (xi − x̄)² = 838656 / 8 = 104832
Regression standard error (standard error of the estimate):

s = √( SSE / (n − 2) ) = √( 1812.658 / (9 − 2) ) = 16.09196

Standard error of b1:

SEb1 = s / √( (n − 1) sx² ) = 16.09196 / √( (9 − 1) · 104832 ) = 0.01757183

H0: β1 = 0
H1: β1 ≠ 0

t statistic: t = b1 / SEb1 = 0.06806 / 0.01757183 = 3.87

If H0 is true, the test statistic has a t-distribution with df = 9 − 2 = 7.
P-value < 2 × 0.005 = 0.01: reject H0 at the 0.01 level of significance.
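A Python sketch reproducing this worked example from the raw data in the table above; the printed values agree with the SPSS output on the next slide.

import numpy as np
from scipy import stats

adver = np.array([276, 552, 720, 648, 336, 396, 1056, 1188, 372], dtype=float)
sales = np.array([115.0, 135.6, 153.6, 117.6, 106.8, 150.0, 164.4, 190.8, 136.8])

sxx = np.sum((adver - adver.mean()) ** 2)                       # 838656
b1 = np.sum((adver - adver.mean()) * (sales - sales.mean())) / sxx
b0 = sales.mean() - b1 * adver.mean()                           # ≈ 99.2
resid = sales - (b0 + b1 * adver)
s = np.sqrt(np.sum(resid ** 2) / (len(sales) - 2))              # ≈ 16.09
se_b1 = s / np.sqrt(sxx)                                        # ≈ 0.0176
t = b1 / se_b1                                                  # ≈ 3.88
p = 2 * stats.t.sf(abs(t), df=len(sales) - 2)                   # ≈ 0.006
print(b1, s, se_b1, t, p)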
Example: ADVER and SALES
SPSS output (decimal commas as printed):

Coefficients(a)
Model 1       Unstandardized B   Std. Error   Standardized Beta   t       Sig.
(Constant)    99,201             12,080                           8,212   ,000
ADVER         ,068               ,018         ,826                3,878   ,006

Coefficients(a): 95,0% Confidence Interval for B
Model 1       Lower Bound   Upper Bound
(Constant)    70,635        127,767
ADVER         ,027          ,110

a. Dependent Variable: SALES
Using technology
Computer software runs all the computations for regression analysis.
Here is some software output for the car speed/gas efficiency example.
SPSS [output annotated with: slope, intercept, p-values for tests of
significance, confidence intervals]
The t-test for the regression slope is highly significant (p < 0.001). There is a
significant relationship between average car speed and gas efficiency.
Excel [output: the "intercept" row gives the intercept, the "logmph" row gives the slope]
SAS [output annotated with: p-values for tests of significance, confidence intervals]
Inference for correlation
To test the null hypothesis of no linear association, we also have the
choice of using the correlation parameter ρ.
• When x is clearly the explanatory variable, this test is equivalent to
testing the hypothesis H0: β1 = 0, since b1 = r (sy / sx).
• When there is no clear explanatory variable (e.g., arm length vs. leg length),
a regression of x on y is no more legitimate than one of y on x. In that
case, the correlation test of significance should be used.
• When both x and y are normally distributed, H0: ρ = 0 tests for no association
of any kind between x and y, not just linear associations.
The test of significance for ρ uses the one-sample t test for H0: ρ = 0.
We compute the t statistic for sample size n and correlation coefficient r:

t = r √(n − 2) / √(1 − r²)

The p-value is the area under t(n − 2) for values of T as extreme as t
or more in the direction of Ha.
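As a sketch, the same test in Python; the r and n below are the ADVER/SALES values used in the example that follows.

import numpy as np
from scipy import stats

def correlation_t_test(r, n):
    # t = r * sqrt(n - 2) / sqrt(1 - r^2), two-sided p-value from t(n - 2)
    t = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
    p = 2 * stats.t.sf(abs(t), df=n - 2)
    return t, p

print(correlation_t_test(0.8258, 9))   # ≈ (3.87, 0.006)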
Relationship between average car speed and fuel efficiency
[Software output: r, p-value, n]
There is a significant correlation (r is not 0) between fuel efficiency
(MPG) and the logarithm of average speed (LOGMPH).
Example: ADVER and SALES
Correlation coefficient r = 0.8258
To test H0: ρ = 0 vs. Ha: ρ ≠ 0, the value of the test statistic is

t = r √(n − 2) / √(1 − r²) = 0.8258 √(9 − 2) / √(1 − 0.8258²) = 3.87

P-value from the t distribution with df = 7: P-value < 2 × 0.005 = 0.01,
so reject H0.
Correlations (SPSS output)
                             ADVER    SALES
ADVER   Pearson Correlation  1        ,826**
        Sig. (2-tailed)               ,006
        N                    9        9
SALES   Pearson Correlation  ,826**   1
        Sig. (2-tailed)      ,006
        N                    9        9
**. Correlation is significant at the 0.01 level (2-tailed).
Confidence Interval for Regression Response
Using inference, we can also calculate a confidence interval for the
population mean μy of all responses y when x takes the value x*
(within the range of data tested).
This interval is centered on ŷ, the unbiased estimate of μy.
The true value of the population mean μy at a given value of x will be
within our confidence interval in C% of all intervals calculated from
many different random samples.
The level C confidence interval for the mean response μy at a given
value x* of x is:

ŷ ± t* SEμ̂

where t* is the critical value for the t(n − 2) distribution with area C
between −t* and +t*.
A separate confidence interval is calculated for μy along all the values
that x takes.
Graphically, the series of confidence intervals is shown as a continuous
band on either side of ŷ.
[Plot: 95% confidence interval for μy]
Prediction Interval for Regression Response
One use of regression is for predicting the value of y, ŷ, for any value
of x within the range of data tested: ŷ = b0 + b1x.
But the regression equation depends on the particular sample drawn.
More reliable predictions require statistical inference:
To estimate an individual response y for a given value of x, we use a
prediction interval.
If we sampled randomly many times, there would be many different values
of y obtained for a particular x, following N(0, σ) around the mean
response μy.
The level C prediction interval for a single observation on y when x
takes the value x* is:

ŷ ± t* SEŷ

where t* is the critical value for the t(n − 2) distribution with area C
between −t* and +t*.
The prediction interval represents mainly the error from the normal
distribution of the residuals εi.
Graphically, the series of prediction intervals is shown as a continuous
band on either side of ŷ.
[Plot: 95% prediction interval for ŷ]
• The confidence interval for μy contains, with C% confidence, the
population mean μy of all responses at a particular value of x.
• The prediction interval contains C% of all the individual values
taken by y at a particular value of x.
[Plot: 95% prediction interval for ŷ and 95% confidence interval for μy]
Estimating μy uses a narrower interval than predicting an individual in
the population (the sampling distribution is narrower than the
population distribution).
To estimate or predict future responses, we calculate the following
standard errors.
The standard error of the mean response μy is:

SEμ̂ = s √( 1/n + (x* − x̄)² / Σ (xi − x̄)² )

The standard error for predicting an individual response ŷ is:

SEŷ = s √( 1 + 1/n + (x* − x̄)² / Σ (xi − x̄)² )
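A Python sketch combining these two standard errors into the confidence and prediction intervals, under the slides' assumptions (simple linear regression, t critical value with n − 2 df):

import numpy as np
from scipy import stats

def mean_ci_and_prediction_interval(x, y, x_star, level=0.95):
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    sxx = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    b0 = y.mean() - b1 * x.mean()
    s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))      # regression standard error
    y_hat = b0 + b1 * x_star                                     # predicted response at x*
    t_star = stats.t.ppf((1 + level) / 2, df=n - 2)
    se_mu = s * np.sqrt(1.0 / n + (x_star - x.mean()) ** 2 / sxx)        # SE of mean response
    se_pred = s * np.sqrt(1 + 1.0 / n + (x_star - x.mean()) ** 2 / sxx)  # SE for an individual y
    ci = (y_hat - t_star * se_mu, y_hat + t_star * se_mu)
    pi = (y_hat - t_star * se_pred, y_hat + t_star * se_pred)
    return ci, pi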
1918 flu epidemic
1918 influenza epidemic, weekly counts:

Week      # Cases   # Deaths
week 1    36        0
week 2    531       0
week 3    4233      130
week 4    8682      552
week 5    7164      738
week 6    2229      414
week 7    600       198
week 8    164       90
week 9    57        56
week 10   722       50
week 11   1517      71
week 12   1828      137
week 13   1539      178
week 14   2416      194
week 15   3148      290
week 16   3465      310
week 17   1440      149

[Line graphs of # cases diagnosed and # deaths reported by week.]
The line graph suggests that 7 to 9% of those diagnosed with the flu
died within about a week of their diagnosis.
We look at the relationship between the number of deaths in a given
week and the number of new diagnosed cases one week earlier.
[Scatterplot: deaths in a given week vs. new diagnosed cases one week
earlier; r = 0.91]
1918 flu epidemic: Relationship between the number of deaths in a given
week and the number of new diagnosed cases one week earlier.
EXCEL
Regression Statistics
Multiple R          0.911
R Square            0.830
Adjusted R Square   0.82
Standard Error      85.07   (= s)
Observations        16

            Coefficients   St. Error      t Stat   P-value   Lower 95%   Upper 95%
Intercept   49.292         29.845         1.652    0.1209    (14.720)    113.304
FluCases0   0.072 (b1)     0.009 (SEb1)   8.263    0.0000    0.053       0.091

P-value for H0: β1 = 0 is very small → reject H0 → β1 is significantly
different from 0.
There is a significant relationship between the number of flu
cases and the number of deaths from flu a week later.
SPSS
CI for the mean weekly death count one week after 4000 flu cases are
diagnosed: μy within about 300–380 deaths.
Prediction interval for a weekly death count one week after 4000 flu
cases are diagnosed: ŷ within about 180–500 deaths.
[Plot: least-squares regression line with the 95% prediction interval
for ŷ and the 95% confidence interval for μy]
What is this?
A 90% prediction interval
for the height (above) and
a 90% prediction interval for
the weight (below) of male
children, ages 3 to 18.
Analysis of variance and coefficient of determination
The regression model splits the overall variability in y into fit and error:
Variation in the dependent variable Y
= variation explained by the independent variable
+ unexplained variation (random error)
SST = SSR + SSE
The greater the variation explained by the independent variable, the
better the model.
The coefficient of determination is a measure of the explanatory power
of the model.
Analysis of variance for regression
The regression model is:
Data = fit + residual
yi = (β0 + β1xi) + (εi)
where the εi are independent and normally distributed N(0, σ), and
σ is the same for all values of x.
For the data:

Σ (yi − ȳ)² = Σ (ŷi − ȳ)² + Σ (yi − ŷi)²
SST = SSR + SSE

SST = Σ (yi − ȳ)²
SSR = Σ (ŷi − ȳ)²   (sometimes SSR is called SSM)
SSE = Σ (yi − ŷi)²
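As a Python sketch, the decomposition computed from a fitted line's predictions; the identity SST = SSR + SSE holds for a least-squares fit with an intercept.

import numpy as np

def sums_of_squares(y, y_hat):
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    sst = np.sum((y - y.mean()) ** 2)       # total variation
    ssr = np.sum((y_hat - y.mean()) ** 2)   # variation explained by the model
    sse = np.sum((y - y_hat) ** 2)          # unexplained (residual) variation
    assert np.isclose(sst, ssr + sse)       # SST = SSR + SSE for least squares
    return sst, ssr, sse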
Analysis of variance for regression
The regression model is:
Data = fit + residual
yi = (β0 + β1xi) + (εi)
where the εi are independent and normally distributed N(0, σ), and
σ is the same for all values of x.
It resembles an ANOVA, which also assumes equal variance, where
SST = SS model + SS error
and
DFT = DF model + DF error
For a simple linear relationship, the ANOVA tests the hypotheses
H0: β1 = 0 versus Ha: β1 ≠ 0
by comparing MSM (model) to MSE (error): F = MSM/MSE
When H0 is true, F follows the F(1, n − 2) distribution.
The p-value is P(F > f).
The ANOVA test and the two-sided t test for H0: β1 = 0 yield the same p-value.
Software output for regression may provide t, F, or both, along with the p-value.
ANOVA table
Source               Sum of squares SS   DF      Mean square MS   F         P-value
Model (Regression)   Σ (ŷi − ȳ)²         1       SSM/DFM          MSM/MSE   Tail area above F
Error (Residual)     Σ (yi − ŷi)²        n − 2   SSE/DFE
Total                Σ (yi − ȳ)²         n − 1

SST = SSM + SSE
DFT = DFM + DFE
The regression standard error, s, for n sample data points is
calculated from the residuals ei = yi − ŷi:

s² = Σ ei² / (n − 2) = Σ (yi − ŷi)² / (n − 2) = SSE / DFE = MSE

s is an unbiased estimate of the regression standard deviation σ.
F-distribution: critical values
[Table of F critical values (excerpt)]
Coefficient of determination, r²
The coefficient of determination, r², the square of the correlation
coefficient, is the percentage of the variance in y (vertical scatter
from the regression line) that can be explained by changes in x.

r² = (variation in y explained by x, i.e., by the regression line)
     / (total variation in observed y values around the mean)

r² = Σ (ŷi − ȳ)² / Σ (yi − ȳ)² = SSM / SST
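The same quantity as a Python sketch, with a numeric check against the Food Company example that follows:

import numpy as np

def r_squared(y, y_hat):
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)   # SSM / SST

print(1 - 1812.658 / 5707.076)   # ≈ 0.682, which also ≈ 0.8258² = r²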
Food Company example: call X = ADVER and Y = SALES

i      xi      yi       yi − ȳ        (yi − ȳ)²    ŷi         (yi − ŷi)²
1      276     115.0    −26.177778    685.27605    118.0087   9.052434
2      552     135.6    −5.577778     31.11160     136.8165   1.479981
3      720     153.6    12.422222     154.31160    148.2648   28.464554
4      648     117.6    −23.577778    555.91160    143.3584   663.494881
5      336     106.8    −34.377778    1181.83160   122.0974   234.009911
6      396     150.0    8.822222      77.83160     126.1860   567.104757
7      1056    164.4    23.222222     539.27160    171.1613   45.714584
8      1188    190.8    49.622222     2462.36494   180.1563   113.288358
9      372     136.8    −4.377778     19.16494     124.5506   150.048384
Total  5544    1270.6                 5707.076                1812.658
Mean   616     141.178

SST = Σ (yi − ȳ)² = 5707.076
SSE = Σ (yi − ŷi)² = 1812.658
SSR = Σ (ŷi − ȳ)² = SST − SSE   (since SST = SSR + SSE)

R² = 1 − SSE/SST = 1 − 1812.658/5707.076 = 0.6823841
Food company example:
ANOVA(b) (SPSS output; decimal commas as printed)
Model 1      Sum of Squares   df   Mean Square   F        Sig.
Regression   3894,418         1    3894,418      15,039   ,006(a)
Residual     1812,658         7    258,951
Total        5707,076         8
a. Predictors: (Constant), ADVER
b. Dependent Variable: SALES
Car speed and fuel efficiency example again: fuel efficiency (MPG)
plotted against the log of average speed (LOGMPH) for a random sample
of 60 cars; the transformed relationship is positive and linear.
Using software: SPSS
r² = SSM/SST = 494/552
The ANOVA F test and the t test give the same p-value.
1918 flu epidemic (data and plots as before). The line graph suggests
that about 7 to 8% of those diagnosed with the flu died within about a
week of their diagnosis. We look at the relationship between the number
of deaths in a given week and the number of new diagnosed cases one
week earlier (r = 0.91).
MINITAB - Regression Analysis: FluDeaths1 versus FluCases0

The regression equation is
FluDeaths1 = 49.3 + 0.0722 FluCases0

Predictor   Coef       SE Coef             T      P
Constant    49.29      29.85    (SEb0)     1.65   0.121
FluCases0   0.072222   0.008741 (SEb1)     8.26   0.000

S = 85.07 (s = √MSE)   R-Sq = 83.0% (r² = SSM/SST)   R-Sq(adj) = 81.8%

Analysis of Variance
Source           DF   SS             MS       F       P
Regression       1    494041 (SSM)   494041   68.27   0.000
Residual Error   14   101308         7236 (MSE = s²)
Total            15   595349 (SST)

P-value for H0: β1 = 0; Ha: β1 ≠ 0.