Statistics for Managers Using Microsoft Excel, 3/e


Statistics for Managers
Using Microsoft Excel
3rd Edition
Chapter 12
Multiple Regression
© 2002 Prentice-Hall, Inc.
Chapter Topics

The multiple regression model
Residual analysis
Testing for the significance of the regression model
Inferences on the population regression coefficients
Testing portions of the multiple regression model

Chapter Topics (continued)

The quadratic regression model
Dummy variables
Using transformations in regression models
Collinearity
Model building
Pitfalls in multiple regression and ethical considerations

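The Multiple Regression Model

The relationship between one dependent variable and two or more independent variables is a linear function.

Population model:
$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + \varepsilon_i$
where $\beta_0$ is the population Y-intercept, $\beta_1, \ldots, \beta_k$ are the population slopes, and $\varepsilon_i$ is the random error.

Sample model:
$Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \cdots + b_k X_{ki} + e_i$
where $Y_i$ is the dependent (response) variable, $X_{1i}, \ldots, X_{ki}$ are the independent (explanatory) variables, and $e_i$ is the residual.
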
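Population Multiple Regression Model

Bivariate model: $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$

[Figure: response plane over the X1 and X2 axes. The plane is $\mu_{Y|X} = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i}$ with intercept $\beta_0$ on the Y axis; the observed Y at the point $(X_{1i}, X_{2i})$ differs from the plane by $\varepsilon_i$.]
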
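Sample Multiple Regression Model

Bivariate model: $Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + e_i$

[Figure: sample regression plane over the X1 and X2 axes. The plane is $\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i}$ with intercept $b_0$ on the Y axis; the observed Y at the point $(X_{1i}, X_{2i})$ differs from the plane by the residual $e_i$.]
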
Simple and Multiple
Regression Compared


Coefficients in a simple regression pick up
the impact of that variable plus the
impacts of other variables that are
correlated with it and the dependent
variable.
Coefficients in a multiple regression net
out the impacts of other variables in the
equation.
Simple and Multiple Regression Compared: Example

Two simple regressions:
$\text{Oil} = \beta_0 + \beta_1\,\text{Temp}$
$\text{Oil} = \beta_0 + \beta_1\,\text{Insulation}$

Multiple regression:
$\text{Oil} = \beta_0 + \beta_1\,\text{Temp} + \beta_2\,\text{Insulation}$

Multiple Linear
Regression Equation
Too complicated by hand! Ouch!

Interpretation of Estimated Coefficients

Slope ($b_i$)
  The average value of Y is estimated to change by $b_i$ for each 1-unit increase in $X_i$, holding all other variables constant (ceteris paribus).
  Example: if $b_1 = -2$, then fuel oil usage (Y) is expected to decrease by an estimated 2 gallons for each 1-degree increase in temperature ($X_1$), given the inches of insulation ($X_2$).
Y-intercept ($b_0$)
  The estimated average value of Y when all $X_i = 0$.

Multiple Regression Model: Example

Develop a model for estimating heating oil used for a single-family home in the month of January, based on average temperature and amount of insulation in inches.

Oil (Gal)   Temp (°F)   Insulation (in)
275.30      40          3
363.80      27          3
164.30      40          10
40.80       73          6
94.30       64          6
230.90      34          6
366.70      9           6
300.60      8           10
237.80      23          10
121.40      63          3
31.40       65          10
203.50      41          6
441.10      21          3
323.00      38          3
52.50       58          10

Sample Multiple Regression Equation: Example

$\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \cdots + b_k X_{ki}$

Excel Output
               Coefficients
Intercept      562.1510092
X Variable 1   -5.436580588
X Variable 2   -20.01232067

$\hat{Y}_i = 562.151 - 5.437 X_{1i} - 20.012 X_{2i}$

For each one-degree increase in temperature, the estimated average amount of heating oil used decreases by 5.437 gallons, holding insulation constant.

For each one-inch increase in insulation, the estimated average use of heating oil decreases by 20.012 gallons, holding temperature constant.

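As a cross-check on the Excel coefficients, the heating oil model can be refit outside Excel. The following is a minimal pure-Python sketch that solves the normal equations $(X'X)b = X'y$; the helper names `solve` and `fit_ols` are illustrative assumptions, not part of Excel or PHStat, and the normal equations are only one of several ways to compute least-squares estimates.

```python
# Heating oil data from the example: (oil gallons, temperature °F, insulation inches)
DATA = [
    (275.30, 40, 3), (363.80, 27, 3), (164.30, 40, 10), (40.80, 73, 6),
    (94.30, 64, 6), (230.90, 34, 6), (366.70, 9, 6), (300.60, 8, 10),
    (237.80, 23, 10), (121.40, 63, 3), (31.40, 65, 10), (203.50, 41, 6),
    (441.10, 21, 3), (323.00, 38, 3), (52.50, 58, 10),
]


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix (copy)
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * p for x, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(rows):
    """Return [b0, b1, ..., bk] from the normal equations (X'X) b = X'y."""
    xs = [[1.0] + list(r[1:]) for r in rows]  # design matrix with intercept column
    ys = [r[0] for r in rows]
    p = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(p)] for i in range(p)]
    xty = [sum(x[i] * y for x, y in zip(xs, ys)) for i in range(p)]
    return solve(xtx, xty)


b0, b1, b2 = fit_ols(DATA)
print(f"Y-hat = {b0:.3f} + ({b1:.3f}) X1 + ({b2:.3f}) X2")
```

The printed coefficients should agree with the Excel output to the precision shown on the slide.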
Multiple Regression in PHStat


PHStat | regression | multiple regression …
EXCEL spreadsheet for the heating oil
example.
Venn Diagrams and Explanatory Power of Regression

[Figure: Venn diagram of two overlapping circles, Oil and Temp.]
Part of Temp outside the overlap: variations in Temp not used in explaining variation in Oil.
Part of Oil outside the overlap: variations in Oil explained by the error term (SSE).
Overlap: variations in Oil explained by Temp, or variations in Temp used in explaining variation in Oil (SSR).

Venn Diagrams and Explanatory Power of Regression (continued)

[Figure: Venn diagram of two overlapping circles, Oil and Temp.]
$r^2 = \dfrac{SSR}{SSR + SSE}$

Venn Diagrams and Explanatory Power of Regression

[Figure: Venn diagram of three overlapping circles, Oil, Temp, and Insulation.]
Part of Oil outside both other circles: variation NOT explained by Temp or Insulation (SSE).
Region shared by all three circles: overlapping variation in both Temp and Insulation is used in explaining the variation in Oil but NOT in the estimation of $\beta_1$ or $\beta_2$.

Coefficient of Multiple Determination

Proportion of total variation in Y explained by all X variables taken together:
$r^2_{Y.12\cdots k} = \dfrac{SSR}{SST} = \dfrac{\text{Explained Variation}}{\text{Total Variation}}$

Never decreases when a new X variable is added to the model
  Disadvantage when comparing models

Venn Diagrams and Explanatory Power of Regression

[Figure: Venn diagram of three overlapping circles, Oil, Temp, and Insulation.]
$r^2_{Y.12} = \dfrac{SSR}{SSR + SSE}$

Adjusted Coefficient of Multiple Determination

Proportion of variation in Y explained by all X variables, adjusted for the number of X variables used:
$r^2_{adj} = 1 - \left[\left(1 - r^2_{Y.12\cdots k}\right)\dfrac{n - 1}{n - k - 1}\right]$

Penalizes excessive use of independent variables
Smaller than $r^2_{Y.12\cdots k}$
Useful in comparing among models

Coefficient of Multiple Determination: Excel Output

Regression Statistics
Multiple R           0.982654757
R Square             0.965610371
Adjusted R Square    0.959878766
Standard Error       26.01378323
Observations         15

R Square is $r^2_{Y.12} = SSR/SST$.
Adjusted R Square is the adjusted $r^2$: it reflects the number of explanatory variables and the sample size, and it is smaller than $r^2$.

Interpretation of Coefficient of Multiple Determination

$r^2_{Y.12} = \dfrac{SSR}{SST} = .9656$
96.56% of the total variation in heating oil use can be explained by differences in temperature and amount of insulation.

$r^2_{adj} = .9599$
95.99% of the total variation in heating oil use can be explained by differences in temperature and amount of insulation, after adjusting for the number of explanatory variables and the sample size.

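The two figures above follow directly from the sums of squares in the Excel ANOVA output for the heating oil model (SSR = 228014.6263, SST = 236135.2293, n = 15, k = 2). A small Python sketch of the arithmetic, with hypothetical function names chosen only for this illustration:

```python
def r_squared(ssr, sst):
    """Coefficient of multiple determination: explained / total variation."""
    return ssr / sst


def adjusted_r_squared(r2, n, k):
    """Adjust r^2 for the number of explanatory variables k and sample size n."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


ssr, sst = 228014.6263, 236135.2293  # from the Excel ANOVA output
n, k = 15, 2                         # observations, explanatory variables

r2 = r_squared(ssr, sst)
r2_adj = adjusted_r_squared(r2, n, k)
print(f"r^2 = {r2:.4f}, adjusted r^2 = {r2_adj:.4f}")
```

Because the factor (n - 1)/(n - k - 1) exceeds 1 whenever k is at least 1, the adjusted value is always below the unadjusted one.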
Using the Model to Make Predictions

Predict the amount of heating oil used for a home if the average temperature is 30°F and the insulation is six inches.

$\hat{Y}_i = 562.151 - 5.437 X_{1i} - 20.012 X_{2i} = 562.151 - 5.437(30) - 20.012(6) = 278.969$

The predicted heating oil used is 278.97 gallons.

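The same prediction can be written as a one-line function. This is only a sketch: `predict_oil` is a hypothetical name, and the default coefficients are the rounded values from the fitted equation on the slide.

```python
def predict_oil(temp_f, insulation_in, b0=562.151, b1=-5.437, b2=-20.012):
    """Predicted January heating oil use (gallons) from the fitted equation."""
    return b0 + b1 * temp_f + b2 * insulation_in


pred = predict_oil(temp_f=30, insulation_in=6)
print(f"Predicted heating oil: {pred:.2f} gallons")
```

Predictions are only reasonable inside the range of the sample data (temperatures of roughly 8 to 73 degrees and 3 to 10 inches of insulation in this data set).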
Predictions in PHStat

PHStat | regression | multiple regression …


Check the “confidence and prediction interval
estimate” box
EXCEL spreadsheet for the heating oil
example.
Residual Plots

Residuals vs. $\hat{Y}$
  May need to transform the Y variable
Residuals vs. $X_1$
  May need to transform the $X_1$ variable
Residuals vs. $X_2$
  May need to transform the $X_2$ variable
Residuals vs. time
  May have autocorrelation

Residual Plots: Example

[Figure: two residual plots, a Temperature Residual Plot and an Insulation Residual Plot, with residuals on the vertical axis. Slide annotations: "Maybe some nonlinear relationship" and "No discernible pattern".]

Influence Analysis

Used to determine observations that have an influential effect on the fitted model
Potentially influential points become candidates for removal from the model
Criteria used are:
  The hat matrix elements $h_i$
  The Studentized deleted residuals $t_i^*$
  Cook's distance statistic $D_i$
All three criteria are complementary
  Only when all three criteria provide consistent results should an observation be removed

The Hat Matrix Element $h_i$

$h_i = \dfrac{1}{n} + \dfrac{(X_i - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}$

If $h_i > 2(k + 1)/n$, $X_i$ is an influential point.
$X_i$ may be considered a candidate for removal from the model.

The Hat Matrix Element $h_i$: Heating Oil Example

$n = 15$, $k = 2$, $2(k + 1)/n = 0.4$
No $h_i > 0.4$
No observation appears to be a candidate for removal from the model

Oil (Gal)   Temp   Insulation   h_i
275.30      40     3            0.1567
363.80      27     3            0.1852
164.30      40     10           0.1757
40.80       73     6            0.2467
94.30       64     6            0.1618
230.90      34     6            0.0741
366.70      9      6            0.2306
300.60      8      10           0.3521
237.80      23     10           0.2268
121.40      63     3            0.2446
31.40       65     10           0.2759
203.50      41     6            0.0676
441.10      21     3            0.2174
323.00      38     3            0.1574
52.50       58     10           0.2268

The Studentized Deleted Residuals $t_i^*$

$t_i^* = \dfrac{e_{(i)}}{S_{(i)}\sqrt{1 - h_i}}$

$e_{(i)}$: difference between the observed $Y_i$ and the predicted $\hat{Y}_i$ based on a model that includes all observations except observation i
$S_{(i)}$: standard error of the estimate for a model that includes all observations except observation i

An observation is considered influential if $|t_i^*| > t_{n-k-2}$
$t_{n-k-2}$ is the critical value of a two-tail test at the 10% level of significance

The Studentized Deleted Residuals $t_i^*$: Example

$n = 15$, $k = 2$, $t_{n-k-2} = t_{11} = 1.7957$
$t_{10}^*$ and $t_{13}^*$ are influential points for potential removal from the model

Obs   Oil (Gal)   Temp   Insulation   t_i*
1     275.30      40     3            -0.3772
2     363.80      27     3            0.3474
3     164.30      40     10           0.8243
4     40.80       73     6            -0.1871
5     94.30       64     6            0.0066
6     230.90      34     6            -1.0571
7     366.70      9      6            -1.1776
8     300.60      8      10           -0.8464
9     237.80      23     10           0.0341
10    121.40      63     3            -1.8536
11    31.40       65     10           1.0304
12    203.50      41     6            -0.6075
13    441.10      21     3            2.9674
14    323.00      38     3            1.1681
15    52.50       58     10           0.2432

Cook's Distance Statistic $D_i$

$D_i = \dfrac{SR_i^2\, h_i}{(k + 1)(1 - h_i)}$

$SR_i = \dfrac{e_i}{S_{YX}\sqrt{1 - h_i}}$ is the Studentized residual
The divisor $k + 1$ is the number of estimated coefficients; it equals 2 only when there is a single explanatory variable.

If $D_i > F_{k+1,\,n-k-1}$, an observation is considered influential
$F_{k+1,\,n-k-1}$ is the critical value of the F distribution at a 50% level of significance

Cook's Distance Statistic $D_i$: Heating Oil Example

$n = 15$, $k = 2$, $F_{k+1,\,n-k-1} = F_{3,12} = 0.835$
No $D_i > 0.835$
No observation appears to be a candidate for removal from the model
Using the three criteria, there is insufficient evidence for the removal of any observation from the model

Obs   Oil (Gal)   Temp   Insulation   D_i
1     275.30      40     3            0.0094
2     363.80      27     3            0.0098
3     164.30      40     10           0.0496
4     40.80       73     6            0.0041
5     94.30       64     6            0.0001
6     230.90      34     6            0.0295
7     366.70      9      6            0.1342
8     300.60      8      10           0.1328
9     237.80      23     10           0.0001
10    121.40      63     3            0.3083
11    31.40       65     10           0.1342
12    203.50      41     6            0.0094
13    441.10      21     3            0.4941
14    323.00      38     3            0.0824
15    52.50       58     10           0.0062

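The three influence criteria can be computed together for the heating oil data. The sketch below is pure Python and makes a few assumptions worth stating: leverage is taken from the general multiple-regression form $h_i = x_i'(X'X)^{-1}x_i$; the Studentized deleted residuals use the standard leave-one-out identity $t_i^* = e_i\sqrt{(n-k-2)/(SSE(1-h_i) - e_i^2)}$ instead of refitting the model n times; and the cutoffs 1.7957 and 0.835 are copied from the slides rather than computed. The helper `solve` is illustrative only.

```python
import math

# Heating oil data from the example: (oil gallons, temperature °F, insulation inches)
DATA = [
    (275.30, 40, 3), (363.80, 27, 3), (164.30, 40, 10), (40.80, 73, 6),
    (94.30, 64, 6), (230.90, 34, 6), (366.70, 9, 6), (300.60, 8, 10),
    (237.80, 23, 10), (121.40, 63, 3), (31.40, 65, 10), (203.50, 41, 6),
    (441.10, 21, 3), (323.00, 38, 3), (52.50, 58, 10),
]


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix (copy)
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * p for x, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


xs = [[1.0, temp, ins] for _, temp, ins in DATA]  # design matrix with intercept
ys = [oil for oil, _, _ in DATA]
n, p = len(xs), len(xs[0])                        # p = k + 1 coefficients
k = p - 1

xtx = [[sum(x[i] * x[j] for x in xs) for j in range(p)] for i in range(p)]
xty = [sum(x[i] * y for x, y in zip(xs, ys)) for i in range(p)]
coef = solve(xtx, xty)

resid = [y - sum(c * v for c, v in zip(coef, x)) for x, y in zip(xs, ys)]
sse = sum(e * e for e in resid)
mse = sse / (n - p)

# Leverage h_i = x_i' (X'X)^-1 x_i, obtained by solving (X'X) z = x_i
h = [sum(v * z for v, z in zip(x, solve(xtx, x))) for x in xs]

# Studentized deleted residuals via the leave-one-out identity
t_star = [e * math.sqrt((n - k - 2) / (sse * (1 - hi) - e * e))
          for e, hi in zip(resid, h)]

# Cook's distance: SR_i^2 * h_i / ((k + 1)(1 - h_i)), SR_i^2 = e_i^2 / (MSE (1 - h_i))
cooks = [(e * e / (mse * (1 - hi))) * hi / ((k + 1) * (1 - hi))
         for e, hi in zip(resid, h)]

H_CUT, T_CUT, D_CUT = 2 * (k + 1) / n, 1.7957, 0.835  # cutoffs quoted on the slides
flag_h = [i + 1 for i, v in enumerate(h) if v > H_CUT]
flag_t = [i + 1 for i, v in enumerate(t_star) if abs(v) > T_CUT]
flag_d = [i + 1 for i, v in enumerate(cooks) if v > D_CUT]
print("leverage:", flag_h, "| t*:", flag_t, "| Cook's D:", flag_d)
```

Only the Studentized deleted residuals flag any observations, so the three criteria do not agree and no observation is removed, which matches the conclusion on the slide.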
Testing for Overall Significance

Shows if there is a linear relationship between all of the X variables together and Y
Use F test statistic
Hypotheses:
  H0: $\beta_1 = \beta_2 = \cdots = \beta_k = 0$ (no linear relationship)
  H1: at least one $\beta_i \neq 0$ (at least one independent variable affects Y)
The null hypothesis is a very strong statement
Almost always reject the null hypothesis

Testing for Overall Significance (continued)

Test statistic:
$F = \dfrac{MSR}{MSE} = \dfrac{SSR(\text{all})/p}{MSE(\text{all})}$

where F has p numerator and (n - p - 1) denominator degrees of freedom

Test for Overall Significance: Excel Output Example

ANOVA
             df   SS         MS         F          Significance F
Regression   2    228014.6   114007.3   168.4712   1.65411E-09
Residual     12   8120.603   676.7169
Total        14   236135.2

Regression df: p = 2, the number of explanatory variables
Total df: n - 1
F test statistic = MSR / MSE
Significance F is the p-value

Test for Overall Significance: Example Solution

H0: $\beta_1 = \beta_2 = \cdots = \beta_p = 0$
H1: At least one $\beta_i \neq 0$
$\alpha = .05$
df = 2 and 12
Critical value: 3.89
Test statistic: F = 168.47 (Excel output)
Decision: Reject H0 at $\alpha = 0.05$
Conclusion: There is evidence that at least one independent variable affects Y.

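The F statistic in the Excel output can be reproduced from the sums of squares. A brief Python sketch, assuming the SSR and SSE values reported in the ANOVA table and taking the critical value 3.89 from the slide rather than computing it:

```python
ssr, sse = 228014.6263, 8120.603016  # regression and residual sums of squares
n, p = 15, 2                         # observations, explanatory variables

msr = ssr / p                # mean square regression
mse = sse / (n - p - 1)      # mean square error
f_stat = msr / mse

f_crit = 3.89                # F(2, 12) critical value at alpha = .05, from the slide
reject_h0 = f_stat > f_crit  # one-tailed (upper) rejection region
print(f"F = {f_stat:.2f}, reject H0: {reject_h0}")
```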
Test for Significance: Individual Variables

Shows if there is a linear relationship between the variable $X_i$ and Y
Use t test statistic
Hypotheses:
  H0: $\beta_i = 0$ (no linear relationship)
  H1: $\beta_i \neq 0$ (linear relationship between $X_i$ and Y)

t Test Statistic: Excel Output Example

               Coefficients    Standard Error   t Stat
Intercept      562.1510092     21.09310433      26.65093769
X Variable 1   -5.436580588    0.336216167      -16.16989642
X Variable 2   -20.01232067    2.342505227      -8.543127434

$t = \dfrac{b_i}{S_{b_i}}$
The t Stat for X Variable 1 is the t test statistic for $X_1$ (temperature); the t Stat for X Variable 2 is the t test statistic for $X_2$ (insulation).

t Test: Example Solution

Does temperature have a significant effect on monthly consumption of heating oil? Test at $\alpha = 0.05$.

H0: $\beta_1 = 0$
H1: $\beta_1 \neq 0$
df = 12
Critical values: -2.1788 and 2.1788 (.025 in each tail)
Test statistic: t = -16.1699
Decision: Reject H0 at $\alpha = 0.05$
Conclusion: There is evidence of a significant effect of temperature on oil consumption.

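Venn Diagrams and Estimation of Regression Model

[Figure: Venn diagram of three overlapping circles, Oil, Temp, and Insulation.]
Overlap of Oil with Temp alone: only this information is used in the estimation of $\beta_1$.
Overlap of Oil with Insulation alone: only this information is used in the estimation of $\beta_2$.
Region shared by all three circles: this information is NOT used in the estimation of $\beta_1$ or $\beta_2$.
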
Confidence Interval Estimate for the Slope

Provide the 95% confidence interval for the population slope $\beta_1$ (the effect of temperature on oil consumption).

$b_1 \pm t_{n-p-1} S_{b_1}$

               Coefficients   Lower 95%       Upper 95%
Intercept      562.151009     516.1930837     608.108935
X Variable 1   -5.4365806     -6.169132673    -4.7040285
X Variable 2   -20.012321     -25.11620102    -14.90844

$-6.169 \le \beta_1 \le -4.704$

The estimated average consumption of oil is reduced by between 4.70 and 6.17 gallons for each increase of 1°F.

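Both the t statistic and the confidence interval for the temperature slope come from the coefficient and its standard error. The following Python sketch reproduces them from the Excel output; the critical value 2.1788 (df = 12, two-tail, α = .05) is copied from the t test example rather than computed.

```python
b1, se_b1 = -5.436580588, 0.336216167  # slope and standard error for temperature
t_crit = 2.1788                        # t critical value for df = 12, from the slide

t_stat = b1 / se_b1                    # test statistic for H0: beta_1 = 0
half_width = t_crit * se_b1
lower, upper = b1 - half_width, b1 + half_width

print(f"t = {t_stat:.4f}")
print(f"95% CI for beta_1: ({lower:.3f}, {upper:.3f})")
```

Since the interval excludes zero, it leads to the same decision as the two-tail t test at the 5% level.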
Contribution of a Single Independent Variable $X_k$

Let $X_k$ be the independent variable of interest
$SSR(X_k \mid \text{all others except } X_k) = SSR(\text{all}) - SSR(\text{all others except } X_k)$
Measures the contribution of $X_k$ in explaining the total variation in Y (SST)

Example with three independent variables:
$SSR(X_1 \mid X_2 \text{ and } X_3) = SSR(X_1, X_2 \text{ and } X_3) - SSR(X_2 \text{ and } X_3)$
$SSR(X_1, X_2 \text{ and } X_3)$ comes from the ANOVA section of the regression for $\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + b_3 X_{3i}$
$SSR(X_2 \text{ and } X_3)$ comes from the ANOVA section of the regression for $\hat{Y}_i = b_0 + b_2 X_{2i} + b_3 X_{3i}$
Measures the contribution of $X_1$ in explaining SST

Coefficient of Partial Determination of $X_k$

$r^2_{Yk.\text{all others}} = \dfrac{SSR(X_k \mid \text{all others})}{SST - SSR(\text{all}) + SSR(X_k \mid \text{all others})}$

Measures the proportion of variation in the dependent variable that is explained by $X_k$ while controlling for (holding constant) the other independent variables

Example: two independent variable model
$r^2_{Y1.2} = \dfrac{SSR(X_1 \mid X_2)}{SST - SSR(X_1, X_2) + SSR(X_1 \mid X_2)}$

[Figure: Venn diagram of three overlapping circles, Oil, Temp, and Insulation, marking the region $SSR(X_1 \mid X_2)$ used in $r^2_{Y1.2}$.]

Coefficient of Partial
Determination in PHStat

PHStat | regression | multiple regression …


Check the “coefficient of partial determination” box
EXCEL spreadsheet for the heating oil
example
Contribution of a Subset of Independent Variables

Let $X_s$ be the subset of independent variables of interest
$SSR(X_s \mid \text{all others except } X_s) = SSR(\text{all}) - SSR(\text{all others except } X_s)$
Measures the contribution of the subset $X_s$ in explaining SST

Example: let $X_s$ be $X_1$ and $X_3$
$SSR(X_1 \text{ and } X_3 \mid X_2) = SSR(X_1, X_2 \text{ and } X_3) - SSR(X_2)$
$SSR(X_1, X_2 \text{ and } X_3)$ comes from the ANOVA section of the regression for $\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + b_3 X_{3i}$
$SSR(X_2)$ comes from the ANOVA section of the regression for $\hat{Y}_i = b_0 + b_2 X_{2i}$

Testing Portions of Model


Examines the contribution of a subset Xs of
explanatory variables to the relationship with Y
Null hypothesis:


Variables in the subset do not improve significantly
the model when all other variables are included
Alternative hypothesis:

At least one variable is significant
Testing Portions of Model
(continued)


Always one-tailed rejection region
Requires comparison of two regressions


One regression includes everything
Another regression includes everything except the
portion to be tested
Partial F Test for Contribution of a Subset of X Variables

Hypotheses:
  H0: Variables $X_s$ do not significantly improve the model given all other variables included
  H1: Variables $X_s$ significantly improve the model given all others included
Test statistic:
  $F = \dfrac{SSR(X_s \mid \text{all others})/m}{MSE(\text{all})}$
  with df = m and (n - p - 1)
  m = number of variables in the subset $X_s$

Partial F Test for Contribution of a Single $X_j$

Hypotheses:
  H0: Variable $X_j$ does not significantly improve the model given all others included
  H1: Variable $X_j$ significantly improves the model given all others included
Test statistic:
  $F = \dfrac{SSR(X_j \mid \text{all others})}{MSE(\text{all})}$
  with df = 1 and (n - p - 1)
  m = 1 here

Testing Portions of Model: Example

Test at the $\alpha = .05$ level to determine whether the variable of average temperature significantly improves the model given that insulation is included.

H0: $X_1$ (temperature) does not improve the model with $X_2$ (insulation) included
H1: $X_1$ does improve the model
$\alpha = .05$, df = 1 and 12
Critical value = 4.75

ANOVA (for X1 and X2)
             SS             MS
Regression   228014.6263    114007.313
Residual     8120.603016    676.716918
Total        236135.2293

ANOVA (for X2)
             SS
Regression   51076.47
Residual     185058.8
Total        236135.2

$F = \dfrac{SSR(X_1 \mid X_2)}{MSE(X_1, X_2)} = \dfrac{228{,}015 - 51{,}076}{676.717} = 261.47$

Conclusion: Reject H0; $X_1$ does improve the model.

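The partial F statistic for temperature can be rebuilt from the two ANOVA tables, and the same sums of squares give the coefficient of partial determination. A short Python sketch using only numbers reported in the Excel output; the variable names are illustrative:

```python
ssr_full = 228014.6263     # SSR(X1, X2): model with temperature and insulation
ssr_reduced = 51076.47     # SSR(X2): model with insulation only
mse_full = 676.716918      # MSE(X1, X2)
sst = 236135.2293
t_stat_x1 = -16.16989642   # t Stat for temperature from the full model

ssr_x1_given_x2 = ssr_full - ssr_reduced  # contribution of X1 given X2
f_partial = ssr_x1_given_x2 / mse_full    # m = 1, so no division by m is needed

# Coefficient of partial determination r^2_{Y1.2}
r2_partial = ssr_x1_given_x2 / (sst - ssr_full + ssr_x1_given_x2)

print(f"partial F = {f_partial:.2f}, t^2 = {t_stat_x1 ** 2:.2f}")
print(f"r^2_Y1.2 = {r2_partial:.4f}")
```

The partial F for a single variable equals the square of that variable's t statistic, up to rounding in the reported inputs.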
Testing Portions of Model in
PHStat

PHStat | regression | multiple regression …


Check the “coefficient of partial determination” box
EXCEL spreadsheet for the heating oil
example.
Do We Need to Do This for One Variable?

The F test for the inclusion of a single variable after all other variables are included in the model is IDENTICAL to the t test of the slope for that variable
The only reason to do an F test is to test several variables together

The Quadratic Regression Model

Relationship between one response variable and two or more explanatory variables is a quadratic polynomial function
Useful when the scatter diagram indicates a nonlinear relationship
Quadratic model:
  $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{1i}^2 + \varepsilon_i$
The second explanatory variable is the square of the first variable

Quadratic Regression Model (continued)

Quadratic models may be considered when the scatter diagram takes on one of the following shapes:

[Figure: four panels of Y against X1 showing curved relationships, two with $\beta_2 > 0$ and two with $\beta_2 < 0$.]

$\beta_2$ = the coefficient of the quadratic term

Testing for Significance: Quadratic Model

Testing for overall relationship
  Similar to the test for the linear model
  F test statistic = MSR / MSE
Testing the quadratic effect
  Compare the quadratic model
  $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{1i}^2 + \varepsilon_i$
  with the linear model
  $Y_i = \beta_0 + \beta_1 X_{1i} + \varepsilon_i$
Hypotheses
  H0: $\beta_2 = 0$ (no 2nd order polynomial term)
  H1: $\beta_2 \neq 0$ (2nd order polynomial term is needed)

Heating Oil Example

Determine whether a quadratic model is needed for estimating heating oil used for a single-family home in the month of January, based on average temperature and amount of insulation in inches. The data are the same 15 observations of Oil (Gal), Temp (°F), and Insulation used in the multiple regression example.

Heating Oil Example: Residual Analysis (continued)

[Figure: Temperature Residual Plot and Insulation Residual Plot, repeated from the residual analysis. Slide annotations: "Maybe some nonlinear relationship" and "No discernible pattern".]

Heating Oil Example: t Test for Quadratic Model (continued)

Testing the quadratic effect
  Compare the quadratic model in insulation
  $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{2i}^2 + \varepsilon_i$
  with the linear model
  $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$
Hypotheses
  H0: $\beta_3 = 0$ (no quadratic term in insulation)
  H1: $\beta_3 \neq 0$ (quadratic term is needed in insulation)

Example Solution

Is a quadratic term in insulation needed in the model for monthly consumption of heating oil? Test at $\alpha = 0.05$.

H0: $\beta_3 = 0$
H1: $\beta_3 \neq 0$
df = 11
Critical values: -2.2010 and 2.2010 (.025 in each tail of the t distribution)
Test statistic: t = 1.6611
Decision: Do not reject H0 at $\alpha = 0.05$
Conclusion: There is not sufficient evidence for the need to include a quadratic effect of insulation on oil consumption.

Example Solution in PHStat


PHStat | regression | multiple regression …
EXCEL spreadsheet for the heating oil
example.
Dummy Variable Models

Categorical explanatory variable (dummy variable) with two or more levels:
  Yes or no, on or off, male or female
  Coded as 0 or 1
Only intercepts are different
Assumes equal slopes across categories
The number of dummy variables needed is (number of levels - 1)
Regression model has the same form:
  $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + \varepsilon_i$

Dummy-Variable Models (with 2 Levels)

Given: $\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i}$
Y = assessed value of house
$X_1$ = square footage of house
$X_2$ = desirability of neighborhood = 0 if undesirable, 1 if desirable

Desirable ($X_2 = 1$): $\hat{Y}_i = b_0 + b_1 X_{1i} + b_2(1) = (b_0 + b_2) + b_1 X_{1i}$
Undesirable ($X_2 = 0$): $\hat{Y}_i = b_0 + b_1 X_{1i} + b_2(0) = b_0 + b_1 X_{1i}$
Same slopes

Dummy-Variable Models (with 2 Levels) (continued)

[Figure: two parallel lines of Y (assessed value) against $X_1$ (square footage). Both lines have the same slope $b_1$; the intercepts are different, $b_0 + b_2$ for one line and $b_0$ for the other.]

Interpretation of the Dummy Variable Coefficient (with 2 Levels)

Example:
$\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i} = 20 + 5 X_{1i} + 6 X_{2i}$
Y: annual salary of college graduate in thousand $
$X_1$: GPA
$X_2$: 0 if female, 1 if male

On average, male college graduates are making an estimated six thousand dollars more than female college graduates with the same GPA.

Dummy-Variable Models (with 3 Levels)

Given:
Y = assessed value of the house (1000 $)
$X_1$ = square footage of the house
Style of the house = Split-level, Ranch, Condo (3 levels; need 2 dummy variables)
$X_2$ = 1 if Split-level, 0 if not
$X_3$ = 1 if Ranch, 0 if not

$\hat{Y}_i = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3$

Interpretation of the Dummy Variable Coefficients (with 3 Levels)

Given the estimated model:
$\hat{Y}_i = 20.43 + 0.045 X_{1i} + 18.84 X_{2i} + 23.53 X_{3i}$

For Split-level ($X_2 = 1$): $\hat{Y}_i = 20.43 + 0.045 X_{1i} + 18.84$
For Ranch ($X_3 = 1$): $\hat{Y}_i = 20.43 + 0.045 X_{1i} + 23.53$
For Condo: $\hat{Y}_i = 20.43 + 0.045 X_{1i}$

With the same footage, a Split-level will have an estimated average assessed value of 18.84 thousand dollars more than a Condo.
With the same footage, a Ranch will have an estimated average assessed value of 23.53 thousand dollars more than a Condo.

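The dummy coding for the three house styles can be made concrete with a small Python sketch. The function name `assessed_value` and the 2,000 square foot example are assumptions made for illustration; the coefficients are those of the estimated model, with Condo as the baseline level (both dummies equal to 0).

```python
STYLES = ("Split-level", "Ranch", "Condo")  # 3 levels, so 2 dummy variables


def assessed_value(square_feet, style):
    """Estimated average assessed value (thousand $) from the fitted model."""
    if style not in STYLES:
        raise ValueError(f"unknown style: {style!r}")
    x2 = 1 if style == "Split-level" else 0  # dummy for Split-level
    x3 = 1 if style == "Ranch" else 0        # dummy for Ranch
    return 20.43 + 0.045 * square_feet + 18.84 * x2 + 23.53 * x3


for style in STYLES:
    print(f"{style:12s} {assessed_value(2000, style):8.2f}")
```

Whatever the square footage, the gap between a style and the Condo baseline is just that style's dummy coefficient, because the model assumes equal slopes across categories.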
Interaction Regression Model

Hypothesizes interaction between pairs of X variables
  Response to one X variable varies at different levels of another X variable
Contains two-way cross product terms
  $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{1i} X_{2i} + \varepsilon_i$
Can be combined with other models
  E.g., dummy variable model

Effect of Interaction

Given:
  $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{1i} X_{2i} + \varepsilon_i$
Without the interaction term, the effect of $X_1$ on Y is measured by $\beta_1$
With the interaction term, the effect of $X_1$ on Y is measured by $\beta_1 + \beta_3 X_2$
Effect changes as $X_2$ increases

Interaction Example

$Y = 1 + 2X_1 + 3X_2 + 4X_1X_2$
For $X_2 = 1$: $Y = 1 + 2X_1 + 3(1) + 4X_1(1) = 4 + 6X_1$
For $X_2 = 0$: $Y = 1 + 2X_1 + 3(0) + 4X_1(0) = 1 + 2X_1$

[Figure: the two lines plotted for $X_1$ from 0 to 1.5, with Y from 0 to 12.]

The effect (slope) of $X_1$ on Y does depend on the value of $X_2$.

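The way the slope of X1 shifts with X2 in this example can be checked numerically. A minimal Python sketch of the equation Y = 1 + 2X1 + 3X2 + 4X1X2; the function names are hypothetical:

```python
def y_hat(x1, x2):
    """Value of Y = 1 + 2*X1 + 3*X2 + 4*X1*X2 from the interaction example."""
    return 1 + 2 * x1 + 3 * x2 + 4 * x1 * x2


def slope_x1(x2, b1=2, b3=4):
    """Effect of a one-unit increase in X1 at a given X2: b1 + b3 * X2."""
    return b1 + b3 * x2


for x2 in (0, 1):
    # Change in Y for a one-unit step in X1, compared with b1 + b3 * X2
    step = y_hat(1, x2) - y_hat(0, x2)
    print(f"X2 = {x2}: slope of X1 = {slope_x1(x2)}, observed step = {step}")
```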
Interaction Regression Model Worksheet

Case, i   Y_i   X_1i   X_2i   X_1i X_2i
1         1     1      3      3
2         4     8      5      40
3         1     3      2      6
4         3     5      6      30
:         :     :      :      :

Multiply $X_1$ by $X_2$ to get $X_1X_2$.
Run regression with Y, $X_1$, $X_2$, $X_1X_2$.

Interpretation when there are more than Three Levels

$Y = \beta_0 + \beta_1\,\text{MALE} + \beta_2\,\text{MARRIED} + \beta_3\,\text{DIVORCED} + \beta_4\,\text{MALE}\cdot\text{MARRIED} + \beta_5\,\text{MALE}\cdot\text{DIVORCED}$

MALE = 0 if female and 1 if male
MARRIED = 1 if married; 0 if not
DIVORCED = 1 if divorced; 0 if not
MALE•MARRIED = 1 if male married; 0 otherwise = (MALE times MARRIED)
MALE•DIVORCED = 1 if male divorced; 0 otherwise = (MALE times DIVORCED)

Interpretation when there are more than Three Levels (continued)

$Y = \beta_0 + \beta_1\,\text{MALE} + \beta_2\,\text{MARRIED} + \beta_3\,\text{DIVORCED} + \beta_4\,\text{MALE}\cdot\text{MARRIED} + \beta_5\,\text{MALE}\cdot\text{DIVORCED}$

          SINGLE                MARRIED                                   DIVORCED
FEMALE    $\beta_0$             $\beta_0 + \beta_2$                       $\beta_0 + \beta_3$
MALE      $\beta_0 + \beta_1$   $\beta_0 + \beta_1 + \beta_2 + \beta_4$   $\beta_0 + \beta_1 + \beta_3 + \beta_5$

Interpreting Results

           FEMALE                MALE                                      Difference
Single     $\beta_0$             $\beta_0 + \beta_1$                       $\beta_1$
Married    $\beta_0 + \beta_2$   $\beta_0 + \beta_1 + \beta_2 + \beta_4$   $\beta_1 + \beta_4$
Divorced   $\beta_0 + \beta_3$   $\beta_0 + \beta_1 + \beta_3 + \beta_5$   $\beta_1 + \beta_5$

Main effects: MALE, MARRIED and DIVORCED
Interaction effects: MALE•MARRIED and MALE•DIVORCED

Evaluating Presence of Interaction

Hypothesize interaction between pairs of independent variables
Contains 2-way product terms
  $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{1i} X_{2i} + \varepsilon_i$
Hypotheses:
  H0: $\beta_3 = 0$ (no interaction between $X_1$ and $X_2$)
  H1: $\beta_3 \neq 0$ ($X_1$ interacts with $X_2$)

Using Transformations



Requires data transformation
Either or both independent and dependent
variables may be transformed
Can be based on theory, logic or scatter
diagrams
Inherently Linear Models

Non-linear models that can be expressed in
linear form


Can be estimated by least squares in linear form
Require data transformation
Transformed Multiplicative Model (Log-Log)

Original: $Y_i = \beta_0 X_{1i}^{\beta_1} X_{2i}^{\beta_2} \varepsilon_i$
Transformed: $\ln(Y_i) = \ln(\beta_0) + \beta_1 \ln(X_{1i}) + \beta_2 \ln(X_{2i}) + \ln(\varepsilon_i)$

[Figure: two panels of Y against $X_1$ showing the curve shapes for different ranges of $\beta_1$. Similarly for $X_2$.]

Square Root Transformation

$Y_i = \beta_0 + \beta_1 \sqrt{X_{1i}} + \beta_2 \sqrt{X_{2i}} + \varepsilon_i$

[Figure: curves of Y against $X_1$ for $\beta_1 > 0$ and $\beta_1 < 0$. Similarly for $X_2$.]

Transforms one of the above models to one that appears linear. Often used to overcome heteroscedasticity.

Linear-Logarithmic Transformation

$Y_i = \beta_0 + \beta_1 \ln(X_{1i}) + \beta_2 \ln(X_{2i}) + \varepsilon_i$

[Figure: curves of Y against $X_1$ for $\beta_1 > 0$ and $\beta_1 < 0$. Similarly for $X_2$.]

Transformed from an original multiplicative model

Exponential Transformation (Log-Linear)

Original model: $Y_i = e^{\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i}}\, \varepsilon_i$

[Figure: curves of Y against $X_1$ for $\beta_1 > 0$ and $\beta_1 < 0$.]

Transformed into: $\ln(Y_i) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \ln(\varepsilon_i)$

Interpretation of Coefficients

The dependent variable is logged
  The coefficient of the independent variable $X_k$ can be approximately interpreted as: a 1 unit change in $X_k$ leads to an estimated $100 \times b_k$ percentage change in the average of Y
The independent variable is logged
  The coefficient of the independent variable can be approximately interpreted as: a 100 percent change in $X_k$ leads to an estimated $b_k$ unit change in the average of Y

Interpretation of Coefficients (continued)

Both dependent and independent variables are logged
  The coefficient of the independent variable $X_k$ can be approximately interpreted as: a 1 percent change in $X_k$ leads to an estimated $b_k$ percentage change in the average of Y. Therefore $b_k$ is the elasticity of Y with respect to a change in $X_k$

Interpretation of Coefficients
(continued)

If both Y and X k are measured in
standardized form:
yi 


Yi  Y
Y
And
xki 
X ki   k
k
The bk ' s are called standardized coefficients

They indicate the estimated number of average
standard deviations Y will change when X k
changes by one standard deviation
Collinearity
(Multicollinearity)




High correlation between explanatory
variables
Coefficient of multiple determination
measures combined effect of the correlated
explanatory variables
No new information provided
Leads to unstable coefficients (large standard
error)

Depending on the explanatory variables
Venn Diagrams and Collinearity

[Figure: Venn diagram of three circles, Oil, Temp, and Insulation, with Temp and Insulation overlapping heavily.]
Large overlap reflects collinearity between Temp and Insulation.
Large overlap in variation of Temp and Insulation is used in explaining the variation in Oil but NOT in estimating $\beta_1$ and $\beta_2$.

Detect Collinearity (Variance Inflationary Factor)

$VIF_j$ is used to measure collinearity:
$VIF_j = \dfrac{1}{1 - R_j^2}$
$R_j^2$ = coefficient of multiple determination of the regression of $X_j$ on all the other explanatory variables

If $VIF_j > 5$, $X_j$ is highly correlated with the other explanatory variables.

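With only two explanatory variables, $R_j^2$ is just the squared correlation between them, so the VIF for the heating oil data can be computed in a few lines. This Python sketch assumes that two-variable shortcut (it does not generalize to three or more predictors, where $R_j^2$ needs a full auxiliary regression); the function name is illustrative.

```python
# Explanatory variables from the heating oil example
temp = [40, 27, 40, 73, 64, 34, 9, 8, 23, 63, 65, 41, 21, 38, 58]
insulation = [3, 3, 10, 6, 6, 6, 6, 10, 10, 3, 10, 6, 3, 3, 10]


def squared_correlation(x, z):
    """r^2 between two variables; equals R_j^2 when there is one other predictor."""
    n = len(x)
    mean_x, mean_z = sum(x) / n, sum(z) / n
    sxz = sum((a - mean_x) * (b - mean_z) for a, b in zip(x, z))
    sxx = sum((a - mean_x) ** 2 for a in x)
    szz = sum((b - mean_z) ** 2 for b in z)
    return sxz ** 2 / (sxx * szz)


r2_j = squared_correlation(temp, insulation)
vif = 1 / (1 - r2_j)
print(f"R_j^2 = {r2_j:.6f}, VIF = {vif:.4f}")
```

A VIF this close to 1 means temperature and insulation are nearly uncorrelated in the sample, far below the threshold of 5.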
Detect Collinearity in PHStat

PHStat | regression | multiple regression …
  Check the “variance inflationary factor (VIF)” box
EXCEL spreadsheet for the heating oil example
  Since there are only two explanatory variables, only one VIF is reported in the Excel spreadsheet
  No VIF is > 5
  There is no evidence of collinearity

Model Building

Goal is to develop a good model with the fewest explanatory variables
  Easier to interpret
  Lower probability of collinearity
Stepwise regression procedure
  Provides limited evaluation of alternative models
Best-subset approach
  Uses the $C_p$ statistic
  Selects model with small $C_p$ near p + 1

Model Building Flowchart

Choose $X_1, X_2, \ldots, X_p$
Run regression to find VIFs
Any VIF > 5?
  Yes, more than one: remove the variable with the highest VIF, then run the regression again
  Yes, only one: remove this X, then run the regression again
  No: run subsets regression to obtain the “best” models in terms of $C_p$
Do complete analysis
Add curvilinear term and/or transform variables as indicated
Perform predictions

Pitfalls and Ethical Considerations

To avoid pitfalls and address ethical considerations:
  Understand that interpretation of the estimated regression coefficients is performed holding all other independent variables constant
  Evaluate residual plots for each independent variable
  Evaluate interaction terms

Additional Pitfalls and Ethical Considerations (continued)

To avoid pitfalls and address ethical considerations:
  Obtain the VIF for each independent variable and remove variables that exhibit high collinearity with other independent variables before performing a significance test on each independent variable
  Examine several alternative models using best-subsets regression
  Use other methods when the assumptions necessary for least-squares regression have been seriously violated

Chapter Summary

Developed the multiple regression model
Discussed residual plots
Addressed testing the significance of the multiple regression model
Discussed inferences on population regression coefficients
Addressed testing portions of the multiple regression model

Chapter Summary (continued)

Described the quadratic regression model
Addressed dummy variables
Discussed using transformations in regression models
Described collinearity
Discussed model building
Addressed pitfalls in multiple regression and ethical considerations