Linear regression and inference


Objectives
10.1 Simple linear regression

Statistical model for linear regression

Estimating the regression parameters

Confidence interval for regression parameters

Significance test for the slope

Confidence interval for µy

Prediction intervals
Statistical model for linear regression

In the population, the linear regression equation is
y = 0 + 1x + e,
where e is the random deviation (or error) of the response
variable from the prediction formula.

Usually, we assume that e has Normal(0,σ) distribution.

β0 (y-intercept) and β1 (slope) are the parameters.
Statistical inference is conducted to draw conclusions about
the parameters.

Confidence interval and hypothesis test for β1. We especially
want to test whether the slope equals zero.


Confidence interval for 0 + 1x, given a value for x.

Prediction interval for a random y, given a value for x.
Estimating the parameters

The population linear regression equation is
y = 0 + 1x + e.

The sample fitted regression line is
ŷ = b0 + b1x.


b0 is the estimate for the intercept β0 and

b1 is the estimate for the slope β1.
We also estimate σ (the standard deviation of e), using

se = sqrt( Σ residual² / (n − 2) ) = sqrt( (n − 1)(1 − r²) / (n − 2) ) · sy.

se is a measure of the typical size of a residual y − ŷ.

We will use se to compute the standard errors we need.
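As a sketch of the computation in Python (NumPy; the x and y arrays here are made-up illustrative data, not the slides' dataset), we can check that the residual formula and the (1 − r²) shortcut for se agree:

```python
import numpy as np

# Hypothetical sample data; any two numeric arrays of equal length would do.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([0.9, 2.1, 2.8, 4.2, 5.1, 5.8, 7.2, 7.9])
n = len(x)

# Least-squares estimates: b1 (slope) and b0 (intercept).
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# se from the residuals: sqrt( sum of squared residuals / (n - 2) ).
resid = y - (b0 + b1 * x)
se_direct = np.sqrt(np.sum(resid ** 2) / (n - 2))

# Equivalent shortcut: se = sqrt( (n - 1)(1 - r^2) / (n - 2) ) * s_y.
r = np.corrcoef(x, y)[0, 1]
se_shortcut = np.sqrt((n - 1) * (1 - r ** 2) / (n - 2)) * y.std(ddof=1)

print(se_direct, se_shortcut)  # the two formulas give the same value
```

The agreement is exact because Σ(y − ŷ)² = (1 − r²)(n − 1)sy² for the least-squares line.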
Confidence interval for the slope parameter

Before we do inference for the slope parameter β1, we need
the standard error for the estimate b1:

SE_b1 = se / sqrt( (n − 1) s_x² ).

We use the t distribution, now with n – 2 degrees of freedom.

A level C confidence interval for the slope, β1, is

b1 ± t* · SE_b1.

t* is the table value for the t(n − 2) distribution with area C
between −t* and t*.

“Confidence” has the same interpretation as always.
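A minimal Python sketch of this interval (SciPy for t*; the data arrays are illustrative, not the slides' dataset). As a cross-check, SE_b1 computed this way matches the slope standard error reported by scipy.stats.linregress:

```python
import numpy as np
from scipy import stats

# Hypothetical data; in practice x and y come from your sample.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([0.9, 2.1, 2.8, 4.2, 5.1, 5.8, 7.2, 7.9])
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
se = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))

# SE_b1 = se / sqrt((n - 1) s_x^2); note (n - 1) s_x^2 = sum((x - xbar)^2).
SE_b1 = se / np.sqrt((n - 1) * x.var(ddof=1))

# 95% CI: b1 +/- t* SE_b1, with t* from the t(n - 2) distribution.
t_star = stats.t.ppf(0.975, df=n - 2)
ci = (b1 - t_star * SE_b1, b1 + t_star * SE_b1)
print(ci)
```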
Significance test for the slope parameter
We can test the hypothesis H0: β1 = m versus either a 1-sided or
a 2-sided alternative, using a t-statistic. (The primary case is
with m = 0.)

We calculate

t = (b1 − m) / SE_b1

and use the t(n − 2) distribution to find the P-value of the test.

Note: Software typically provides two-sided P-values.
Relationship between ozone and carbon pollutants
In StatCrunch: Stat-Regression-Simple Linear; choose Hypothesis Test

The fitted line is ŷ = 0.0515 + 0.005708x.

To test H0: β1 = 0 with α = 0.05, we compute

t = (b1 − m) / SE_b1 = (0.0057084 − 0.0) / 0.0012452 = 4.584.

From the t-table, using df = 28 − 2 = 26, we can see that the P-value is less than
0.0005. Since it is very small, we reject H0 and conclude the slope is not zero.
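Plugging the slide's StatCrunch numbers into Python (SciPy) reproduces this test statistic and gives the exact two-sided P-value rather than a table bound:

```python
from scipy import stats

# Numbers read off the StatCrunch output on the slide.
b1, SE_b1, m, n = 0.0057084, 0.0012452, 0.0, 28

t = (b1 - m) / SE_b1                  # test statistic, about 4.584
p = 2 * stats.t.sf(abs(t), df=n - 2)  # two-sided P-value, df = 26
print(round(t, 3), p)
```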
Relationship between ozone and carbon pollutants
In StatCrunch: Stat-Regression-Simple Linear; choose Confidence Interval

Having decided that the slope is not zero, we next estimate it with a
95% confidence interval:

b1 ± t* · SE_b1 = 0.0057084 ± 2.056 × 0.0012452 = (0.00315, 0.00827).
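The same interval can be recovered in Python (SciPy) from the slide's b1 and SE_b1, with t* computed rather than read from a table:

```python
from scipy import stats

# Slide numbers: b1 and its standard error from the StatCrunch output, n = 28.
b1, SE_b1, n = 0.0057084, 0.0012452, 28

t_star = stats.t.ppf(0.975, df=n - 2)  # about 2.056 for df = 26
lo, hi = b1 - t_star * SE_b1, b1 + t_star * SE_b1
print(lo, hi)  # matches the slide's (0.00315, 0.00827) after rounding
```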
Confidence interval for 0 + 1x
We can also calculate a confidence interval for the regression line
itself, at any choice x. Generally this is sensible as long as x is within
the range of data observed (interpolation). Extrapolation should only
be done with a great deal of caution.
The interval is centered on ŷ = b0 + b1x, but we need a standard error for this
particular estimate.
SE yˆ  s e 
1
n

(x  x )
(n
2
2
 1) s x
.
The confidence interval is then calculated in the usual fashion:
*
*
yˆ  t  SE yˆ  b0  b1 x  t  SE yˆ .
This is an estimate of the point on the line (the expected value of y) for the
given value of x.
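A Python sketch of this interval (NumPy/SciPy; the data arrays and the choice x0 = 4.5 are illustrative assumptions, not the slides' dataset):

```python
import numpy as np
from scipy import stats

# Hypothetical data; x0 is the x value at which we estimate the mean response.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([0.9, 2.1, 2.8, 4.2, 5.1, 5.8, 7.2, 7.9])
n, x0 = len(x), 4.5

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
se = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))

# SE_yhat = se * sqrt( 1/n + (x0 - xbar)^2 / ((n - 1) s_x^2) )
SE_yhat = se * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / ((n - 1) * x.var(ddof=1)))

# 95% CI for the expected value of y at x0.
y_hat = b0 + b1 * x0
t_star = stats.t.ppf(0.975, df=n - 2)
ci = (y_hat - t_star * SE_yhat, y_hat + t_star * SE_yhat)
print(ci)
```

Note the (x0 − x̄)² term: the interval is narrowest at x0 = x̄ and widens as x0 moves toward the edge of the data, which is one reason extrapolation is risky.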
Prediction interval for a new obs. y

It is often of greater interest to predict what the actual y value might be
(not just what it is expected to be). Such a prediction interval for an
actual (new) observation y must necessarily account for both the
estimation of the line and the random deviation e away from that line.

The interval is again centered on ŷ = b0 + b1x, but now we also account for the
random deviation. The prediction interval for the actual y, with given value for
x, is

ŷ ± t* · sqrt( se² + SE_ŷ² ).

The distinction between a confidence interval and a prediction interval is
whether you want to capture the expected value of y or the actual value of y.
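Computing both intervals side by side makes the distinction concrete (Python with NumPy/SciPy; same illustrative data and x0 as above, both made-up assumptions). The prediction half-width satisfies the exact identity pi² = ci² + (t*·se)², so it is always the wider of the two:

```python
import numpy as np
from scipy import stats

# Hypothetical data; x0 is the x value of interest.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([0.9, 2.1, 2.8, 4.2, 5.1, 5.8, 7.2, 7.9])
n, x0 = len(x), 4.5

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
se = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))
SE_yhat = se * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / ((n - 1) * x.var(ddof=1)))

t_star = stats.t.ppf(0.975, df=n - 2)
ci_half = t_star * SE_yhat                        # CI for the mean of y at x0
pi_half = t_star * np.sqrt(se**2 + SE_yhat**2)    # PI for a new y at x0
print(ci_half, pi_half)  # the prediction interval is wider
```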
Prediction intervals

Unlike confidence intervals, the size of the prediction interval does
not get narrower as you increase the sample size. This is because:

The confidence interval is estimating a parameter, such as the
mean, the slope, or the regression line. For example, if I am interested
in the mean grade of all people taking midterm 3 who scored 10 on
midterm 2, the CI will get narrower as the sample size grows
(because the estimators tend to get better for large sample sizes).

The prediction interval is completely different. Here we are trying to
predict the grade of a randomly selected person who scored 10 on
midterm 2. There will be a lot of variability, and it does not improve
as we increase the sample size: every individual is different (it is like
predicting the weight of someone who is 6 feet tall: even if we know
the average weight of a 6-footer, there is huge variation within
this group, so the prediction interval must be wide for us to be able
to capture the weight).

This is a fundamental difference between predicting the
measurement of an individual and estimating the mean. The mean
estimator will get better with sample size; the individual prediction won't.
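The contrast can be seen numerically with a small simulation (Python with NumPy/SciPy; the model y = 2 + 3x + Normal(0, 1) noise, the x0 = 0.5, and the seed are arbitrary choices for illustration). As n grows, the CI half-width at x0 shrinks toward 0, while the PI half-width settles near t*·σ:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def half_widths(n, x0=0.5):
    # Simulate y = 2 + 3x + e with e ~ Normal(0, 1), fit the line, and
    # return the 95% CI and PI half-widths at x0.
    x = rng.uniform(0, 1, n)
    y = 2 + 3 * x + rng.normal(0, 1, n)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    se = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))
    SE_yhat = se * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / ((n - 1) * x.var(ddof=1)))
    t_star = stats.t.ppf(0.975, df=n - 2)
    return t_star * SE_yhat, t_star * np.sqrt(se**2 + SE_yhat**2)

for n in (20, 200, 2000):
    ci, pi = half_widths(n)
    print(n, round(ci, 3), round(pi, 3))  # ci shrinks; pi levels off near 1.96
```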
Efficiency of a biofilter, by temperature
In StatCrunch: Stat-Regression-Simple Linear; choose Predict Y for X

The fitted line gives ŷ = 97.5 + 0.0757 × 16 = 98.71.

For a 95% confidence interval of the expected efficiency, with temperature = 16,
we compute

ŷ ± t* · SE_ŷ = 98.71 ± 2.042 × 0.0393 = (98.63, 98.79).

For a 95% prediction interval of the actual efficiency, with temperature = 16,
we compute

ŷ ± t* · sqrt( se² + SE_ŷ² ) = 98.71 ± 2.042 × sqrt( 0.155² + 0.0393² ) = (98.38, 99.04).