Transcript Chapter 8:

Chapter 8
Lecture Slides
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
Chapter 8:
Inference in Linear Models
Introduction
• We discussed bivariate data in Chapter 2.
• In this chapter, we learn to compute confidence
intervals and to perform hypothesis tests on the slope
and intercept of the true regression line.
• We have so far considered a single predictor, but
sometimes a single independent variable is not
enough. In these cases, we have several independent
variables that are related to a dependent variable.
• If the relationship between the independent and
dependent variables is linear, the technique of
multiple regression can be used to include all of the
independent variables in the model.
Section 8.1: Inferences Using the
Least-Squares Coefficients
• When two variables have a linear relationship,
the scatterplot tends to be clustered around a
line known as the least squares line.
• We think of the slope and intercept of the least-squares line as estimates of the slope and
intercept of the true regression line.
Vocabulary
• The linear model is yi = β0 + β1xi + εi.
• The dependent variable is yi.
• The independent variable is xi.
• The regression coefficients are β0 and β1.
• The error is εi.
• The line y = β0 + β1x is the true regression line.
• The quantities β̂0 and β̂1 are called the least-squares
coefficients and can be computed in the way we
discussed in Chapter 2.
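To make the Chapter 2 least-squares computation concrete, here is a minimal Python sketch. The data arrays are hypothetical placeholders (not the textbook's measurements), and the variable names are my own.

```python
import numpy as np

# Hypothetical (x, y) data; substitute your own measurements.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 3.6, 4.4, 5.2, 5.8])

xbar, ybar = x.mean(), y.mean()

# Least-squares slope and intercept (Chapter 2 formulas)
beta1_hat = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
beta0_hat = ybar - beta1_hat * xbar

print(beta0_hat, beta1_hat)
```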
Assumptions for Errors in Linear
Models
In the simplest situation, the following
assumptions are satisfied:
1. The errors ε1,…,εn are random and independent.
In particular, the magnitude of any error εi does
not influence the value of the next error εi+1.
2. The errors ε1,…,εn all have mean 0.
3. The errors ε1,…,εn all have the same variance,
which we denote by σ².
4. The errors ε1,…,εn are normally distributed.
Distribution
In the linear model yi = β0 + β1xi + εi, under assumptions
1 through 4, the observations y1,…,yn are independent
random variables that follow the normal distribution.
The mean and variance of yi are given by

μ_yi = β0 + β1xi   and   σ²_yi = σ².

The slope β1 represents the change in the mean of y
associated with an increase of one unit in the value of x.
More Distributions
Under assumptions 1 – 4:
• The quantities β̂0 and β̂1 are normally distributed random variables.
• The means of β̂0 and β̂1 are the true values β0 and β1, respectively.
• The standard deviations of β̂0 and β̂1 are estimated with

s_β̂0 = s √( 1/n + x̄² / Σᵢ(xᵢ − x̄)² )   and   s_β̂1 = s / √( Σᵢ(xᵢ − x̄)² ),

where

s = √[ (1 − r²) Σᵢ(yᵢ − ȳ)² / (n − 2) ]

is an estimate of the error standard deviation σ.
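As an illustration of these formulas, a small Python sketch that computes s, s_β̂0, and s_β̂1 from paired data. The helper name and the use of numpy are my own choices, not anything from the slides.

```python
import numpy as np

def lsq_uncertainties(x, y):
    """Return s, s_beta0hat, s_beta1hat for a simple linear fit."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    xbar, ybar = x.mean(), y.mean()
    sxx = np.sum((x - xbar) ** 2)

    # Least-squares coefficients
    beta1 = np.sum((x - xbar) * (y - ybar)) / sxx
    beta0 = ybar - beta1 * xbar

    # s = sqrt(SSE / (n - 2)), which equals sqrt((1 - r^2) * Syy / (n - 2))
    resid = y - (beta0 + beta1 * x)
    s = np.sqrt(np.sum(resid ** 2) / (n - 2))

    s_beta0 = s * np.sqrt(1.0 / n + xbar ** 2 / sxx)
    s_beta1 = s / np.sqrt(sxx)
    return s, s_beta0, s_beta1
```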
Example 1
For the Hooke’s law data, compute s, s_β̂1, and s_β̂0.
Notes
1. Since there is a measure of variation of x in
the denominator of both of the uncertainties
we just defined, the more spread out the x's are,
the smaller the uncertainties in β̂0 and β̂1.
2. Use caution: if the range of x values extends
beyond the range where the linear model
holds, the results will not be valid.
3. The quantities (β̂0 − β0)/s_β̂0 and (β̂1 − β1)/s_β̂1
have Student's t distributions with n – 2
degrees of freedom.
Confidence Intervals
• Level 100(1 – α)% confidence intervals for β0 and β1
are given by β̂0 ± t_{n−2,α/2} s_β̂0 and β̂1 ± t_{n−2,α/2} s_β̂1.
• A level 100(1 – α)% confidence interval for
β0 + β1x is given by β̂0 + β̂1x ± t_{n−2,α/2} s_ŷ, where

s_ŷ = s √( 1/n + (x − x̄)² / Σᵢ(xᵢ − x̄)² ).
Prediction Interval
• A level 100(1 – α)% prediction interval for
β0 + β1x is given by β̂0 + β̂1x ± t_{n−2,α/2} s_pred, where

s_pred = s √( 1 + 1/n + (x − x̄)² / Σᵢ(xᵢ − x̄)² ).
Example 1 (cont.)
Find a 95% CI for the spring constant in the
Hooke’s law data.
Example 1 (cont.)
In the Hooke’s law data, find a 99% CI for the
unloaded length of the spring.
Example 1 (cont.)
For the data in Example 1, compute a confidence
interval for the slope of the regression line.
Example 1 (cont.)
The manufacturer of the spring in the Hooke’s
law data claims that the spring constant β1 is at
least 0.215 in./lb. We have estimated the spring
constant to be 0.2046 in./lb. Can we conclude
that the manufacturer’s claim is false?
Example 1 (cont.)
For the Hooke’s law data, compute a 95% CI
for the length of a spring under a load of 1.4 lb.
COMPUTER OUTPUT
Regression Analysis: Length versus Weight

The regression equation is
Length = 5.00 + 0.205 Weight    (1)

Predictor    Coef(2)   SE Coef(3)     T(4)    P(5)
Constant     4.99971     0.02477   201.81   0.000
Weight       0.20462     0.01115    18.36   0.000

S = 0.05749 (6)   R-Sq = 94.9% (7)   R-Sq(adj) = 94.6%

Analysis of Variance (8)
Source           DF       SS       MS        F      P
Regression        1   1.1138   1.1138   337.02  0.000
Residual Error   18   0.0595   0.0033
Total            19   1.1733

Unusual Observations (9)
Obs  Weight  Length     Fit  SE Fit  Residual  St Resid
 12    2.20  5.5700  5.4499  0.0133    0.1201     2.15R
R denotes an observation with a large standardized residual

Predicted Values for New Observations (10)
New Obs     Fit  SE Fit       95.0% CI             95.0% PI
      1  5.2453  0.0150  (5.2137, 5.2769)   (5.1204, 5.3701)

Values of Predictors for New Observations (11)
New Obs  Weight
      1    1.20

Interpreting Computer Output
Interpreting Computer Output
1. This is the equation of the least-squares line.
2. Coef  The coefficients β̂0 = 4.99971 and β̂1 = 0.20462.
3. SE Coef  The standard deviations of the estimates of β0 and β1.
4. T  The values of the Student's t statistics for testing the
hypotheses β0 = 0 and β1 = 0. The t statistic is equal to the
coefficient divided by its standard deviation.
5. P  The P-values for the tests of the hypotheses β0 = 0 and
β1 = 0. The more important P-value is that for β1. If this P-value
is not small enough to reject the hypothesis that β1 = 0,
the linear model is not useful for predicting y from x.
More Computer Output Interpretation
6. S  The estimate s of the error standard deviation σ.
7. R-Sq  This is r², the square of the correlation coefficient r, also
called the coefficient of determination.
8. Analysis of Variance  This table is not so important in simple
linear regression; we will discuss it when we discuss multiple
linear regression.
9. Unusual Observations Minitab tries to alert you to data points
that may violate the assumptions 1-4.
10. Predicted Values for New Observations These are confidence
intervals and prediction intervals for values of x specified by the
user.
11. Values of Predictors for New Observations This is simply a
list of the x values for which confidence and prediction intervals
have been calculated.
Inferences on the Population
Correlation
• When we have a random sample from a population of
ordered pairs, the correlation coefficient, r, is often
called the sample correlation.
• We have the true population correlation, ρ.
• If the population of ordered pairs has a certain
distribution known as a bivariate normal
distribution, then the sample correlation can be used
to construct CI’s and perform hypothesis tests on the
population correlation.
Testing
• The null hypotheses of interest are of the form ρ = 0,
ρ ≤ 0, and ρ ≥ 0.
• The method of testing these hypotheses is based on
the test statistic U, which has a Student's t distribution
with n – 2 degrees of freedom:

U = r √(n − 2) / √(1 − r²).
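A small sketch of this test in Python, assuming a two-sided alternative ρ ≠ 0. The helper name and the choice of a two-sided P-value are my own, not the slide's.

```python
import numpy as np
from scipy import stats

def corr_test(x, y):
    """Test H0: rho = 0 using U = r*sqrt(n-2)/sqrt(1-r^2)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    r = np.corrcoef(x, y)[0, 1]                 # sample correlation
    U = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
    p_two_sided = 2 * stats.t.sf(abs(U), df=n - 2)
    return r, U, p_two_sided
```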
Section 8.2: Checking Assumptions
• We stated some assumptions for the errors.
Here we want to see if any of those
assumptions are violated.
• The single best diagnostic for least-squares
regression is a plot of residuals versus the
fitted values, sometimes called a residual plot.
More of the Residual Plot
• When the linear model is valid, and assumptions 1 – 4
are satisfied, the plot will show no substantial pattern.
There should be no curve to the plot, and the vertical
spread of the points should not vary too much over
the horizontal range of the data.
• A good-looking residual plot does not by itself prove
that the linear model is appropriate. However, a
residual plot with a serious defect does clearly
indicate that the linear model is inappropriate.
Residual Plots
Upper left: No noticeable pattern. Upper right:
Heteroscedastic. Lower left: Trend. Lower Right:
Outlier.
Residuals versus Fitted Values
If the plot of residuals versus fitted values
• Shows no substantial trend or curve, and
• Is homoscedastic, that is, the vertical spread does not
vary too much along the horizontal length of the plot,
except perhaps near the edges,
then it is likely, but not certain, that the assumptions of
the linear model hold.
However, if the residual plot does show a substantial
trend or curve, or is heteroscedastic, it is certain that
the assumptions of the linear model do not hold.
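A minimal Python/matplotlib sketch of such a residual plot for a simple linear fit. The plotting choices and function name are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

def residual_plot(x, y):
    """Fit y = beta0 + beta1*x and plot residuals against fitted values."""
    beta1, beta0 = np.polyfit(x, y, 1)            # slope, intercept
    fitted = beta0 + beta1 * np.asarray(x, float)
    resid = np.asarray(y, float) - fitted

    plt.scatter(fitted, resid)
    plt.axhline(0, linestyle="--")
    plt.xlabel("Fitted value")
    plt.ylabel("Residual")
    plt.title("Residuals versus fitted values")
    plt.show()
```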
Transformations
• If we fit the linear model y = β0 + β1x + ε and find that
the residual plot exhibits a trend or pattern, we can
sometimes fix the problem by raising x, y, or both to a
power.
• It may be the case that a model of the form
y^a = β0 + β1x^b + ε fits the data well.
• Replacing a variable with a function of itself is called
transforming the variable.
Don’t Forget
• Once the transformation has been
completed, then you must inspect the
residual plot again to see if that model is a
good fit.
• It is fine to proceed through
transformations by trial and error.
• It is important to remember that power
transformations don’t always work.
Caution
• When there are only a few points in a residual
plot, it can be hard to determine whether the
assumptions of the linear model are met.
• When one is faced with a sparse residual plot
that is hard to interpret, a reasonable thing to
do is to fit a linear model, but to consider the
results tentative, with the understanding that
the appropriateness of the model has not been
established.
Independence of Observations
• If the plot of residuals versus fitted values looks
good, then further diagnostics may be used to
check the fit of the linear model.
• One such diagnostic is a plot of the residuals versus
the order in which the observations were made.
• If there are trends in this plot, then x and y may be
varying with time. This means that the errors are not
independent. When this feature is severe, linear
regression should not be used, and the methods of
time series analysis should be used instead.
Normality Assumption
• To check that the errors are normally
distributed, a normal probability plot of the
residuals can be made.
• If the plot looks like it follows a rough straight
line, then we can conclude that the residuals
are approximately normally distributed.
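A sketch of a normal probability plot of the residuals using scipy. The helper is illustrative and assumes a simple linear fit.

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def normal_prob_plot(x, y):
    """Normal probability plot of the residuals from a simple linear fit."""
    beta1, beta0 = np.polyfit(x, y, 1)
    resid = np.asarray(y, float) - (beta0 + beta1 * np.asarray(x, float))

    # A roughly straight line suggests approximately normal residuals.
    stats.probplot(resid, dist="norm", plot=plt)
    plt.show()
```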
Comments
• Physical laws are applicable to all future
observations.
• An empirical model is valid only for the data to
which it is fit. It may or may not be useful in
predicting outcomes for subsequent observations.
• Determining whether to apply an empirical model to
a future observation requires scientific judgment
rather than statistical analysis.
Section 8.3: Multiple Regression
• The methods of simple linear regression apply
when we wish to fit a linear model relating the
value of a dependent variable y to the value
of a single independent variable x.
• There are many situations when a single
independent variable is not enough.
• In situations like this, there are several
independent variables, x1,x2,…,xp, that are
related to a dependent variable y.
p Independent Variables
• Assume that we have a sample of n items and
that on each item we have measured a
dependent variable y and p independent
variables, x1,x2,…,xp.
• The ith sampled item gives rise to the ordered
set (yi, x1i,…, xpi).
• We can then fit the multiple regression model
yi = β0 + β1x1i + … + βpxpi + εi.
Various Multiple Linear Regression
Models
• Polynomial regression model (the independent variables are
all powers of a single variable):
yi = β0 + β1xi + β2xi² + … + βpxi^p + εi
• Quadratic model (a polynomial regression model of degree 2 in the
powers and products of several variables):
yi = β0 + β1x1i + β2x2i + β3x1ix2i + β4x1i² + β5x2i² + εi
• A variable that is the product of two other variables is called an
interaction.
• These models are considered linear models, even though they contain
nonlinear terms in the independent variables. The reason is that they are
linear in the coefficients βi.
Estimating the Coefficients
• In any multiple regression model, the estimates β̂0, β̂1,…, β̂p
are computed by least squares, just as in simple linear
regression. The equation
ŷ = β̂0 + β̂1x1 + … + β̂pxp
is called the least-squares equation or fitted regression
equation.
• Now define ŷi to be the y coordinate of the least-squares
equation corresponding to the x values (x1i,…, xpi).
• The residuals are the quantities ei = yi − ŷi, which are the
differences between the observed y values and the y values
given by the equation.
• We want to compute β̂0, β̂1,…, β̂p so as to minimize the sum of
the squared residuals. This is complicated, and we rely on
computers to calculate them.
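In practice the computation is handed to software; here is a minimal sketch of the least-squares calculation using numpy's linear-algebra routines. The function name and the (n, p) layout of X are assumptions of this example.

```python
import numpy as np

def fit_multiple_regression(X, y):
    """Least-squares estimates beta0_hat, ..., betap_hat for y = X*beta + eps.

    X is an (n, p) array whose columns are the independent variables.
    """
    X = np.asarray(X, float)
    y = np.asarray(y, float)
    design = np.column_stack([np.ones(len(y)), X])   # add intercept column
    beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ beta_hat
    residuals = y - fitted
    return beta_hat, fitted, residuals
```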
Sums of Squares
• Much of the analysis in multiple regression is
based on three fundamental quantities.
• They are regression sum of squares (SSR),
the error sum of squares (SSE), and the total
sum of squares (SST).
• We defined these quantities in Chapter 7 and
they hold here as well.
• The analysis of variance identity is
SST = SSR + SSE
Assumptions of the Error Terms
Recall: Assumptions for Errors in Linear Models:
In the simplest situation, the following assumptions are
satisfied (notice that these are the same as for simple
linear regression):
1. The errors ε1,…, εn are random and independent. In
particular, the magnitude of any error εi does not
influence the value of the next error εi+1.
2. The errors ε1,…, εn all have mean 0.
3. The errors ε1,…, εn all have the same variance, which we
denote by σ².
4. The errors ε1,…, εn are normally distributed.
Mean and Variance of yi
The multiple linear regression model is
yi = β0 + β1x1i + … + βpxpi + εi.
Under assumptions 1 through 4, the observations
y1,…, yn are independent random variables that follow
the normal distribution. The mean and variance of yi are
given by

μ_yi = β0 + β1x1i + … + βpxpi   and   σ²_yi = σ².

Each coefficient βi represents the change in the mean of y
associated with an increase of one unit in the value of xi,
when the other x variables are held constant.
Statistics
• The three statistics most often used in multiple regression are
the estimated error variance s2, the coefficient of determination
R2, and the F statistic.
• We have to adjust the estimate of the error variance, since we
are now estimating p + 1 coefficients:

s² = SSE / (n − p − 1) = Σᵢ(yᵢ − ŷᵢ)² / (n − p − 1).

• The estimated variance of each least-squares coefficient is a
complicated calculation, and we rely on a computer to find it.
• In simple linear regression, the coefficient of determination,
R2, measures the goodness of fit of the linear model. The
goodness of fit statistic in multiple regression denoted by R2 is
also called the coefficient of determination. The value of R2 is
calculated in the same way as r2 in simple linear regression.
That is, R2 = SSR/SST.
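A sketch of how these summary statistics could be computed from the observed and fitted values. The helper is illustrative; it assumes the fitted values have already been obtained, for example with the fitting sketch in the previous section.

```python
import numpy as np

def regression_summaries(y, fitted, p):
    """SSE, SSR, SST, s^2, and R^2 for a fit with p independent variables."""
    y = np.asarray(y, float)
    fitted = np.asarray(fitted, float)
    n = len(y)

    sse = np.sum((y - fitted) ** 2)            # error sum of squares
    ssr = np.sum((fitted - y.mean()) ** 2)     # regression sum of squares
    sst = np.sum((y - y.mean()) ** 2)          # total sum of squares (= SSR + SSE)

    s2 = sse / (n - p - 1)                     # estimated error variance
    r2 = ssr / sst                             # coefficient of determination
    return sse, ssr, sst, s2, r2
```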
Distribution of βi
• When assumptions 1 through 4 are satisfied, the
quantity

(β̂i − βi) / s_β̂i

has a Student's t distribution with n − p − 1 degrees of
freedom.
• The number of degrees of freedom is equal to the
denominator used to compute the estimated error
variance.
• This statistic is used to compute confidence intervals
and to perform hypothesis tests, as we did with
simple linear regression.
Tests of Hypothesis
• In simple linear regression, a test of the null hypothesis β1 = 0
is almost always made. If this hypothesis is not rejected, then
the linear model may not be useful.
• The analogous test in multiple linear regression is of
H0: β1 = β2 = … = βp = 0. This is a very strong hypothesis. It
says that none of the independent variables has any linear
relationship with the dependent variable.
• The test statistic for this hypothesis is
F = (SSR/p) / (SSE/(n – p – 1)).
• This is an F statistic and its null distribution is F_{p, n–p–1}. Note
that the denominator of the F statistic is s². The subscripts p
and n – p – 1 are the degrees of freedom for the F statistic.
• Slightly different versions of the F statistic can be used to test
milder null hypotheses.
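A sketch of the overall F test under these definitions. The helper name is illustrative, and it assumes the fitted values are already available.

```python
import numpy as np
from scipy import stats

def overall_f_test(y, fitted, p):
    """F statistic and P-value for H0: beta1 = ... = betap = 0."""
    y = np.asarray(y, float)
    fitted = np.asarray(fitted, float)
    n = len(y)

    ssr = np.sum((fitted - y.mean()) ** 2)
    sse = np.sum((y - fitted) ** 2)

    F = (ssr / p) / (sse / (n - p - 1))
    p_value = stats.f.sf(F, p, n - p - 1)      # null distribution is F_{p, n-p-1}
    return F, p_value
```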
Output
The regression equation is
Goodput = 96.0 - 1.82 Speed + 0.565 Pause + 0.0247 Speed*Pause + 0.0140 Speed^2
          - 0.0118 Pause^2

Predictor        Coef      StDev      T      P
Constant       96.024      3.946  24.34  0.000
Speed         -1.8245     0.2376  -7.68  0.000
Pause          0.5652     0.2256   2.51  0.022
Speed*Pa     0.024731   0.003249   7.61  0.000
Speed^2      0.014020   0.004745   2.95  0.008
Pause^2     -0.011793   0.003516  -3.35  0.003

S = 2.942   R-Sq = 93.2%   R-Sq(adj) = 91.4%

Analysis of Variance
Source           DF       SS      MS      F      P
Regression        5  2240.49  448.10  51.77  0.000
Residual Error   19   164.46    8.66
Total            24  2404.95

Predicted Values for New Observations
New Obs     Fit  SE Fit        95% CI             95% PI
      1  74.272   1.175  (71.812, 76.732)  (67.641, 80.903)

Values of Predictors for New Observations
New Obs  Speed  Pause  Speed*Pause  Speed^2  Pause^2
      1   25.0   15.0          375      625      225
Interpreting Output
• Much of the output is analogous to that of simple
linear regression.
1. The fitted regression equation is presented near the
top of the output.
2. Below that are the coefficient estimates and their
estimated standard deviations.
3. Next to each standard deviation is the Student's t
statistic for testing the null hypothesis that the true
value of the coefficient is equal to 0.
4. The P-values for the tests are given in the next
column.
Analysis of Variance Table
5. The DF column gives the degrees of freedom. The
degrees of freedom for regression is equal to the number
of independent variables in the model. The degrees of
freedom for “Residual Error” is the number of
observations minus the number of parameters estimated. The
total degrees of freedom is the sum of the degrees of
freedom for regression and for error.
6. The next column is SS. This column gives the sums of
squares: the first is the regression sum of squares, SSR; the
second is the error sum of squares, SSE; and the third is the
total sum of squares, SST = SSR + SSE.
More on the ANOVA Table
7. The MS column gives the mean squares, which are the
sums of squares divided by their respective degrees of
freedom. Note that the mean square error is equal to the
variance estimate, s².
8. The column labeled F presents the mean square for
regression divided by the mean square for error.
9. This is the F statistic discussed earlier, which is
used for testing the null hypothesis that none of the
independent variables is related to the dependent
variable.
Using the Output
• From the output, we can use the fitted
regression equation to predict y for future
observations.
• It is also possible to calculate the residual for an
observed value of y.
• Constructing confidence intervals for the
coefficients of the independent variables is also
possible from the output.
Example 2
Use the multiple regression model to predict the
goodput for a network with speed 12 m/s and pause
time 25 s.
For the goodput data, find the residual for the point
Speed = 20, Pause = 30.
Find a 95% confidence interval for the coefficient of
Speed in the multiple regression model.
Test the null hypothesis that the coefficient of Pause is
less than or equal to 0.3.
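The first question in Example 2 can be answered directly from the fitted equation in the output shown earlier. A minimal sketch of that arithmetic is below; the coefficients are copied from that output, while the confidence-interval and hypothesis-test parts additionally require the standard errors reported there.

```python
# Point prediction of goodput at Speed = 12 m/s and Pause = 25 s,
# using the coefficients printed in the output above.
speed, pause = 12.0, 25.0
goodput_hat = (96.024
               - 1.8245 * speed
               + 0.5652 * pause
               + 0.024731 * speed * pause
               + 0.014020 * speed ** 2
               - 0.011793 * pause ** 2)
print(goodput_hat)
```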
Checking Assumptions
• It is important in multiple linear regression to test the
validity of the assumptions for errors in the linear
model.
• Check plots of residuals versus fitted values, normal
probability plots of residuals, and plots of residuals
versus the order in which the observations were
made.
• It is also a good idea to make plots of residuals versus
each of the independent variables. If the residual
plots indicate a violation of assumptions,
transformations can be tried.
Section 8.4: Model Selection
• There are many situations in which a large number of
independent variables have been measured, and we
need to decide which of them to include in the model.
• This is the problem of model selection, and it is not
an easy one.
• Good model selection rests on this basic principle
known as Occam’s razor:
“The best scientific model is the simplest model that
explains the observed data.”
• In terms of linear models, Occam’s razor implies the
principle of parsimony:
“A model should contain the smallest number of
variables necessary to fit the data.”
Some Exceptions
1. A linear model should always contain an
intercept, unless physical theory dictates
otherwise.
2. If a power x^n of a variable is included in the
model, all lower powers x, x², …, x^(n−1) should
be included as well, unless physical theory
dictates otherwise.
3. If a product xy of two variables is included in
a model, then the variables x and y should be
included separately as well, unless physical
theory dictates otherwise.
Notes
• Models that include only the variables needed
to fit the data are called parsimonious models.
• Adding a variable to a model can substantially
change the coefficients of the variables already
in the model.
Can a Variable Be Dropped?
• It often happens that one has formed a model
that contains a large number of independent
variables, and one wishes to determine
whether a given subset of them may be
dropped from the model without significantly
reducing the accuracy of the model.
• Assume that we know that the model
yi = β0 + β1x1i + … + βkxki + βk+1x(k+1)i + … + βpxpi + εi
is correct. We will call this the full model.
Null Hypothesis
• We wish to test the null hypothesis
H0: βk+1 = … = βp = 0.
• If H0 is true, the model will remain correct if
we drop the variables xk+1,…, xp, so we can
replace the full model with the following
reduced model:
yi = β0 + β1x1i + … + βkxki + εi.
Test Statistic
• To develop a test statistic for H0, we begin by
computing the error sums of squares for both
the full and reduced models.
• We call these SSfull and SSreduced, respectively.
• The number of degrees of freedom for SSfull is
n – p – 1, and for SSreduced it is n – k – 1.
• The test statistic is
f = [(SSreduced – SSfull)/(p – k)] / [SSfull/(n – p – 1)].
• If H0 is true, then f tends to be close to 1. If H0
is false, then f tends to be larger.
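A sketch of this test in Python. The function name is illustrative, and the P-value uses the standard result, not stated on the slide, that under H0 the statistic has an F distribution with p – k and n – p – 1 degrees of freedom.

```python
from scipy import stats

def partial_f_test(sse_full, sse_reduced, n, p, k):
    """F test of H0: beta_{k+1} = ... = beta_p = 0, given the error sums of
    squares of the full (p variables) and reduced (k variables) models."""
    f = ((sse_reduced - sse_full) / (p - k)) / (sse_full / (n - p - 1))
    # Standard result: under H0, f follows an F distribution with
    # p - k and n - p - 1 degrees of freedom.
    p_value = stats.f.sf(f, p - k, n - p - 1)
    return f, p_value
```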
Comments
• This method is very useful for developing
parsimonious models by removing unnecessary
variables. However, the conditions under which it is
formally correct are rarely met.
• More often, a large model is fit, some of the variables
are seen to have fairly large P-values, and the F test is
used to decide whether to drop them from the model.
• It is often the case that there is no one “correct”
model. There are several models that fit equally well.
Best Subsets Regression
• Assume that there are p independent variables, x1, x2,…, xp, that
are available to be put in the model.
• Let’s assume that we wish to find a good model that contains
exactly four independent variables.
• We can simply fit every possible model containing four of the
variables, and rank them in order of their goodness-of-fit, as
measured by the coefficient of determination, R2.
• The subset of four variables that yields the largest value of R2 is
the “best” subset of size four.
• One can repeat the process for subsets of other sizes, finding
the best subsets of size 1, 2,…, p.
• These best subsets can be examined to see which provides a
good fit, while being parsimonious.
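A brute-force sketch of best subsets regression for a fixed subset size, ranking subsets by R2. It is illustrative only (the helper name and data layout are assumptions) and is not efficient when p is large.

```python
import itertools
import numpy as np

def best_subset(X, y, size):
    """Among all subsets of `size` columns of X, return the one with largest R^2."""
    X = np.asarray(X, float)
    y = np.asarray(y, float)
    n, p = X.shape
    sst = np.sum((y - y.mean()) ** 2)

    best_r2, best_cols = -np.inf, None
    for cols in itertools.combinations(range(p), size):
        design = np.column_stack([np.ones(n), X[:, cols]])
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        sse = np.sum((y - design @ beta) ** 2)
        r2 = 1 - sse / sst                      # R^2 = SSR/SST = 1 - SSE/SST
        if r2 > best_r2:
            best_r2, best_cols = r2, cols
    return best_cols, best_r2
```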
Stepwise Regression
• This is the most widely used model selection technique.
• Its main advantage over best subsets regression is that it
is less computationally intensive, so it can be used in
situations where there are a very large number of
candidate independent variables and too many possible
subsets for every one of them to be examined.
• The user chooses two threshold P-values, αin and αout,
with αin < αout.
• The stepwise regression procedure begins with a step
called a forward selection step, in which the
independent variable with the smallest P-value is selected,
provided that P < αin.
• This variable is entered into the model, creating a model
with a single independent variable.
More on Stepwise Regression
• In the next step, the remaining variables are examined
one at a time as candidates for the second variable in
the model. The one with the smallest P-value is
added to the model, again provided that P < αin.
• Now, it is possible that adding the second variable to
the model has increased the P-value of the first variable.
In the next step, called a backward elimination step,
the first variable is dropped from the model if its P-value
has grown to exceed the value αout.
• The algorithm continues by alternating forward
selection steps with backward elimination steps.
• The algorithm terminates when no variables meet the
criteria for being added to or dropped from the model.
Notes on Model Selection
• When there is little or no physical theory to rely on,
many different models will fit the data about equally
well.
• The methods for choosing a model involve statistics,
whose values depend on the data. Therefore, if the
experiment is repeated, these statistics will come out
differently, and different models may appear to be
“best.”
• Some or all of the independent variables in a selected
model may not really be related to the dependent
variable. Whenever possible, experiments should be
repeated to test these apparent relationships.
• Model selection is an art, not a science.
Summary
• Uncertainties in the least-squares coefficients
• Confidence intervals and hypothesis tests for least-squares coefficients
• Checking assumptions
• Residuals
• Multiple regression models
• Estimating the coefficients
• Checking assumptions in multiple regression
• Confounding and collinearity
• Model selection