Multiple Regression


Multiple Regression
• Model
• Error Term Assumptions
– Example 1: Locating a motor inn
• Goodness of Fit (R-square)
• Validity of estimates (t-stats & F-stats)
• Interpreting the regression coefficients & R-Square
1
Introduction
• In this model we extend the simple linear
regression model and allow for any number of
independent variables.
• We will also learn to detect econometric
problems.
2
Model and Required Conditions
• We allow for k independent variables to
potentially be related to the dependent variable:

y = b0 + b1x1 + b2x2 + … + bkxk + e

where y is the dependent variable, x1, …, xk are the
independent variables, b0, …, bk are the coefficients,
and e is the random error variable.
3
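As a sketch of how such a model is estimated, the coefficients b0, …, bk can be obtained by least squares. The tiny data set below is invented purely for illustration (it is not the La Quinta data); NumPy's `lstsq` is one of several ways to compute the estimates:

```python
import numpy as np

# Hypothetical data: 5 observations, k = 2 independent variables.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 6.0]])
y = np.array([7.1, 6.9, 13.2, 12.8, 19.1])

# Prepend a column of ones so the intercept b0 is estimated too.
X1 = np.column_stack([np.ones(len(X)), X])

# Least-squares estimates of b0, b1, b2.
b, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(b)          # array of [b0, b1, b2]

yhat = X1 @ b     # fitted values
```

The fitted values `yhat` are what the examples later in these slides call ŷ.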
The simple linear regression model allows for one
independent variable, "x":

y = b0 + b1x + e

Note how the straight line becomes a plane when a second
independent variable is added. The multiple linear regression
model allows for more than one independent variable:

y = b0 + b1x1 + b2x2 + e

[Figure: regression line in (x, y) vs. regression plane over axes x1 and x2]
4
• Required conditions for the error variable e
– The error e is normally distributed with mean equal
to zero and a constant standard deviation σe
(independent of the value of y); σe is unknown.
– The errors are independent.
• These conditions are required in order to
– estimate the model coefficients,
– assess the resulting model.
5
Example 1: Where to locate a new motor inn?
– La Quinta Motor Inns is planning an expansion.
– Management wishes to predict which sites are likely to
be profitable.
– Several areas where predictors of profitability can be
identified:
• Competition
• Market awareness
• Demand generators
• Demographics
• Physical quality
6
[Figure: factors determining profitability (Margin)]

Profitability (Margin) is driven by:
• Competition — Rooms: number of hotel/motel rooms within 3 miles of the site.
• Market awareness — Nearest: distance to the nearest La Quinta inn.
• Customers — Office: office space; College: college enrollment.
• Community — Income: median household income.
• Physical — Disttwn: distance to downtown.
7
– Data were collected from 100 randomly selected La Quinta
inns, and the following suggested model was run:

Margin = b0 + b1Rooms + b2Nearest + b3Office + b4College
+ b5Income + b6Disttwn + e

INN  MARGIN  ROOMS  NEAREST  OFFICE  COLLEGE  INCOME  DISTTWN
1    55.5    3203   0.1      549     8        37      12.1
2    33.8    2810   1.5      496     17.5     39      0.4
3    49      2890   1.9      254     20       39      12.2
4    31.9    3422   1        434     15.5     36      2.7
5    57.4    2687   3.4      678     15.5     32      7.9
6    49      3759   1.4      635     19       41      4
8
• Excel output

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.724611
R Square            0.525062
Adjusted R Square   0.49442
Standard Error      5.512084
Observations        100

This is the sample regression equation
(sometimes called the prediction equation):

MARGIN = 72.455 - 0.008ROOMS - 1.646NEAREST + 0.02OFFICE
         + 0.212COLLEGE - 0.413INCOME + 0.225DISTTWN

Let us assess this equation.

ANOVA
            df   SS         MS         F          Significance F
Regression   6   3123.832   520.6387   17.13581   3.03E-13
Residual    93   2825.626   30.38307
Total       99   5949.458

            Coefficients  Standard Error  t Stat     P-value   Lower 95%  Upper 95%
Intercept   72.45461      7.893104        9.179483   1.11E-14  56.78049   88.12874
ROOMS       -0.00762      0.001255        -6.06871   2.77E-08  -0.01011   -0.00513
NEAREST     -1.64624      0.632837        -2.60136   0.010803  -2.90292   -0.38955
OFFICE      0.019766      0.00341         5.795594   9.24E-08  0.012993   0.026538
COLLEGE     0.211783      0.133428        1.587246   0.115851  -0.05318   0.476744
INCOME      -0.41312      0.139552        -2.96034   0.003899  -0.69025   -0.136
DISTTWN     0.225258      0.178709        1.260475   0.210651  -0.12962   0.580138
9
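As a quick illustration, the sample regression equation can be evaluated at a candidate site. The predictor values below are invented for the example, not taken from the La Quinta data:

```python
# Coefficients from the sample regression equation in the printout.
coef = {"Intercept": 72.455, "ROOMS": -0.008, "NEAREST": -1.646,
        "OFFICE": 0.02, "COLLEGE": 0.212, "INCOME": -0.413, "DISTTWN": 0.225}

# Hypothetical site (values chosen for illustration only).
site = {"ROOMS": 3000, "NEAREST": 1.5, "OFFICE": 500,
        "COLLEGE": 15, "INCOME": 40, "DISTTWN": 5}

# Predicted operating margin for this site.
margin = coef["Intercept"] + sum(coef[v] * site[v] for v in site)
print(round(margin, 2))   # predicted MARGIN, in percent
```

Management would compare such predictions across candidate sites to find the most profitable location.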
• Testing the coefficients
– The hypotheses for each bi:
H0: βi = 0
H1: βi ≠ 0
– Test statistic:

t = (bi - βi) / sbi   (with βi = 0 under H0)
– Excel printout (d.f. = n - k - 1 = 93):

            Coefficients  Standard Error  t Stat     P-value   Lower 95%     Upper 95%
Intercept   72.45461      7.893104        9.179483   1.11E-14  56.78048735   88.12874
ROOMS       -0.00762      0.001255        -6.06871   2.77E-08  -0.010110582  -0.00513
NEAREST     -1.64624      0.632837        -2.60136   0.010803  -2.902924523  -0.38955
OFFICE      0.019766      0.00341         5.795594   9.24E-08  0.012993085   0.026538
COLLEGE     0.211783      0.133428        1.587246   0.115851  -0.053178229  0.476744
INCOME      -0.41312      0.139552        -2.96034   0.003899  -0.690245235  -0.136
DISTTWN     0.225258      0.178709        1.260475   0.210651  -0.12962198   0.580138
10
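The t statistics above can be reproduced from the printout's coefficients and standard errors. The critical value below is taken from a t table (t with 93 degrees of freedom, two-tailed, α = 0.05):

```python
# (coefficient, standard error) pairs from the Excel printout.
rows = {
    "ROOMS":   (-0.00762, 0.001255),
    "NEAREST": (-1.64624, 0.632837),
    "COLLEGE": (0.211783, 0.133428),
}

t_crit = 1.986   # t_{0.025, 93}, from a t table

for name, (b, sb) in rows.items():
    t = (b - 0) / sb   # test statistic for H0: beta_i = 0
    verdict = "reject H0" if abs(t) > t_crit else "do not reject H0"
    print(f"{name}: t = {t:.3f} -> {verdict}")
```

This matches the printout: ROOMS and NEAREST are significant at the 5% level, while COLLEGE (t ≈ 1.59) is not.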
• Standard error of estimate
– We need to estimate the standard error of estimate:

se = √( SSE / (n - k - 1) )

– Compare se to the mean value of y:
• From the printout, Standard Error = 5.5121
• Calculating the mean value of y we have ȳ = 45.739
– It seems se is not particularly small.
– Can we conclude the model does not fit the data
well?
11
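The standard error of estimate can be checked directly from the SSE in the ANOVA table:

```python
import math

SSE = 2825.626      # residual sum of squares from the ANOVA table
n, k = 100, 6       # 100 inns, 6 independent variables

# Standard error of estimate: sqrt(SSE / (n - k - 1))
s_e = math.sqrt(SSE / (n - k - 1))
print(round(s_e, 4))   # matches the printout's Standard Error of 5.5121
```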
• Coefficient of determination
– The definition is

R² = 1 - SSE / Σ(yi - ȳ)²

– From the printout, R² = 0.5251
– 52.51% of the variation in the measure of profitability is
explained by the linear regression model formulated
above.
– When adjusted for degrees of freedom,
Adjusted R² = 1 - [SSE/(n-k-1)] / [SS(Total)/(n-1)] = 49.44%
12
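Both R² figures can be verified from the ANOVA table's sums of squares:

```python
SSE, SS_total = 2825.626, 5949.458   # from the ANOVA table
n, k = 100, 6

# Coefficient of determination and its degrees-of-freedom adjustment.
r2 = 1 - SSE / SS_total
adj_r2 = 1 - (SSE / (n - k - 1)) / (SS_total / (n - 1))
print(round(r2, 4), round(adj_r2, 4))   # matches 0.5251 and 0.4944
```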
• Testing the validity of the model
– We pose the question:
Is there at least one independent variable linearly
related to the dependent variable?
– To answer the question we test the hypothesis
H0: b1 = b2 = … = bk = 0
H1: At least one bi is not equal to zero.
– If at least one bi is not equal to zero, the model is
valid.
13
• To test these hypotheses we perform an analysis
of variance procedure.
• The F test
– Construct the F statistic:

F = MSR / MSE, where MSR = SSR/k and MSE = SSE/(n - k - 1)

– [Variation in y] = SSR + SSE. A large F results from a
large SSR; then much of the variation in y is explained
by the regression model, the null hypothesis should be
rejected, and the model is valid.
– Rejection region: F > Fα,k,n-k-1
– Required conditions must be satisfied.
14
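The F statistic for Example 1 can be assembled from the ANOVA sums of squares; the critical value is the one quoted later in these slides (F table, α = 0.05):

```python
SSR, SSE = 3123.832, 2825.626   # from the ANOVA table
n, k = 100, 6

MSR = SSR / k                   # mean square due to regression
MSE = SSE / (n - k - 1)         # mean square error
F = MSR / MSE

F_crit = 2.17                   # F_{0.05, 6, 93}, from an F table
print(round(F, 2), F > F_crit)  # F ≈ 17.14, well inside the rejection region
```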
Two data points (x1, y1) and (x2, y2) of a certain sample are shown.

[Figure: the two sample points, their fitted values ŷ1 and ŷ2 on the
regression line, and the mean ȳ]

Total variation in y
= variation explained by the regression line + unexplained variation (error):

(y1 - ȳ)² + (y2 - ȳ)² = (ŷ1 - ȳ)² + (ŷ2 - ȳ)² + (y1 - ŷ1)² + (y2 - ŷ2)²
15
Example 1 - continued
• Excel provides the following ANOVA results (F = MSR/MSE):

ANOVA
            df   SS               MS               F          Significance F
Regression   6   3123.832 (SSR)   520.6387 (MSR)   17.13581   3.03382E-13
Residual    93   2825.626 (SSE)   30.38307 (MSE)
Total       99   5949.458
16
Example 1 - continued
• Excel provides the following ANOVA results

ANOVA
            df   SS         MS         F          Significance F
Regression   6   3123.832   520.6387   17.13581   3.03382E-13
Residual    93   2825.626   30.38307
Total       99   5949.458

Conclusion: There is sufficient evidence to reject the null
hypothesis in favor of the alternative hypothesis. At least one
of the bi is not equal to zero; thus, at least one independent
variable is linearly related to y. This linear regression model
is valid.

Fα,k,n-k-1 = F0.05,6,100-6-1 = 2.17
F = 17.14 > 2.17
Also, the p-value (Significance F) = 3.03382×10⁻¹³.
Clearly, α = 0.05 > 3.03382×10⁻¹³, and the null hypothesis
is rejected.
17
• Let us interpret the coefficients
– b0 = 72.5. This is the intercept, the value of y when all
the variables take the value zero. Since the data range of
all the independent variables does not cover the value
zero, do not interpret the intercept.
– b1 = -.0076. In this model, for each additional 1000
rooms within 3 miles of the La Quinta inn, the operating
margin decreases on average by 7.6% (assuming the other
variables are held constant).
18
– b2 = -1.65. In this model, for each additional mile that
the nearest competitor is from the La Quinta inn, the average
operating margin decreases by 1.65%.
– b3 = .02. For each additional 1000 sq-ft of office space,
the average increase in operating margin will be .02%.
– b4 = .21. For each additional thousand students, MARGIN
increases by .21%.
– b5 = -.41. For each additional $1000 increase in median
household income, MARGIN decreases by .41%.
– b6 = .23. For each additional mile to the downtown
center, MARGIN increases by .23% on average.
19