Lecture 23 - University of Pennsylvania

Lecture 23
• Multiple Regression (Sections 19.3-19.4)
Multiple Regression Model
• Multiple regression model:
E(y | x1,…,xk) = b0 + b1x1 + … + bkxk
y = b0 + b1x1 + b2x2 + … + bkxk + e
• Required conditions:
– The regression function is a linear function of the
independent variables x1,…,xk (the multiple regression line
does not systematically overestimate or underestimate y
for any combination of x1,…,xk).
– The error e is normally distributed.
– The standard deviation of the error is constant (se) for
all values of the x's.
– The errors are independent.
Estimating the Coefficients and
Assessing the Model, Example
• Data were collected from 100 randomly selected inns that
belong to La Quinta, and a regression was run for the
following suggested model:
Margin = b0 + b1 Rooms + b2 Nearest + b3 Office
         + b4 College + b5 Income + b6 Disttwn
Xm19-01
Margin  Number  Nearest  Office Space  Enrollment  Income  Distance
55.5    3203    4.2      549           8           37      2.7
33.8    2810    2.8      496           17.5        35      14.4
49      2890    2.4      254           20          35      2.6
31.9    3422    3.3      434           15.5        38      12.1
57.4    2687    0.9      678           15.5        42      6.9
49      3759    2.9      635           19          33      10.8
Model Assessment
• The model is assessed using three tools:
– The standard error of estimate
– The coefficient of determination
– The F-test of the analysis of variance
$s_e = \sqrt{\dfrac{SSE}{n-k-1}}$
$R^2 = 1 - \dfrac{SSE}{\sum_i (y_i - \bar{y})^2}$
Testing the Validity of the Model
• We pose the question:
Is there at least one independent variable linearly
related to the dependent variable (Are any of the
X’s useful in predicting Y)?
• To answer the question we test the hypothesis
H0: b1 = b2 = … = bk=0
H1: At least one bi is not equal to zero.
• If at least one bi is not equal to zero, the model has
some validity.
Testing the Validity of the La
Quinta Inns Regression Model
• The hypotheses are tested by an ANOVA
procedure.
Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio
Model       6   3123.8320        520.639       17.1358
Error      93   2825.6259        30.383        Prob > F
C. Total   99   5949.4579                      <.0001
Testing the Validity of the La
Quinta Inns Regression Model
[Variation in y] = SSR + SSE.
If SSR is large relative to SSE, much of the
variation in y is explained by the regression
model; the model is useful and thus, the null
hypothesis should be rejected. Thus, we reject for
large F.
$F = \dfrac{SSR/k}{SSE/(n-k-1)}$
Rejection region: $F > F_{\alpha,\,k,\,n-k-1}$
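As a check on the three assessment tools, the sketch below recomputes the standard error of estimate, the coefficient of determination, and the F ratio from the sums of squares in the La Quinta ANOVA table. The variable names are illustrative and not part of the JMP output.

```python
import math

# Sums of squares from the La Quinta ANOVA table.
n, k = 100, 6          # sample size and number of independent variables
ssr = 3123.8320        # Model (regression) sum of squares
sse = 2825.6259        # Error sum of squares
sst = ssr + sse        # total variation in y (C. Total)

df_error = n - k - 1
s_e = math.sqrt(sse / df_error)          # standard error of estimate
r_squared = 1 - sse / sst                # coefficient of determination
f_ratio = (ssr / k) / (sse / df_error)   # F statistic

print(f"s_e = {s_e:.3f}")
print(f"R^2 = {r_squared:.4f}")
print(f"F   = {f_ratio:.4f}")
```

The F value should agree with the F Ratio column of the table (17.1358), and the mean squares in the table are the two quantities being divided.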
Testing the Validity of the La
Quinta Inns Regression Model
ANOVA
             df   SS       MS      F       Significance F
Regression    6   3123.8   520.6   17.14   0.0000
Residual     93   2825.6   30.4
Total        99   5949.5
Fα,k,n-k-1 = F0.05,6,100-6-1 = 2.17
F = 17.14 > 2.17
Also, the p-value (Significance F) = 0.0000
Reject the null hypothesis.
Conclusion: There is sufficient evidence to reject
the null hypothesis in favor of the alternative hypothesis.
At least one of the bi is not equal to zero. Thus, at least
one independent variable is linearly related to y.
This linear regression model is valid.
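Relationships among se, R2, and F
SSE            se                            R2           F       Assessment of model
0              0                             1            ∞       Perfect
Small          Small                         Close to 1   Large   Good
Large          Large                         Close to 0   Small   Poor
Σ(yi − ȳ)^2    sqrt(Σ(yi − ȳ)^2 / (n-k-1))   0            0       Useless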
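The relationships among SSE, se, R2, and F can be illustrated numerically. The sketch below holds the total variation fixed at the La Quinta value and lets SSE range from 0 up to the total variation. The function name `fit_summary` and the intermediate SSE values (5% and 95% of the total) are assumptions made for illustration.

```python
import math

def fit_summary(sse, sst, n, k):
    """Return (s_e, R^2, F) for a given SSE with total variation sst held fixed."""
    df_error = n - k - 1
    s_e = math.sqrt(sse / df_error)
    r2 = 1 - sse / sst
    # F grows without bound as the fit becomes perfect (SSE = 0).
    f = math.inf if sse == 0 else ((sst - sse) / k) / (sse / df_error)
    return s_e, r2, f

sst, n, k = 5949.4579, 100, 6   # La Quinta total variation, sample size, predictors
for sse in (0.0, 0.05 * sst, 2825.6259, 0.95 * sst, sst):
    s_e, r2, f = fit_summary(sse, sst, n, k)
    print(f"SSE={sse:9.2f}  s_e={s_e:6.3f}  R^2={r2:5.3f}  F={f:8.2f}")
```

As SSE rises toward the total variation, se grows while R2 and F both fall to zero.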
Interpreting the Coefficients
• b0 = 38.14. This is the intercept, the value of y
when all the variables take the value zero. Since
the data ranges of the independent variables do
not cover the value zero, do not interpret the
intercept.
• b1 = -0.0076. In this model, for each additional
room within 3 miles of the La Quinta inn, the
operating margin decreases on average by .0076%
(assuming the other variables are held constant).
Interpreting the Coefficients
• b2 = 1.65. In this model, for each additional mile that
the nearest competitor is to a La Quinta inn, the
operating margin increases on average by 1.65% when
the other variables are held constant.
• b3 = 0.020. For each additional 1000 sq-ft of office
space, the operating margin will increase on average by
.02% when the other variables are held constant.
• b4 = 0.21. For each additional thousand students the
operating margin increases on average by .21% when
the other variables are held constant.
Interpreting the Coefficients
• b5 = 0.41. For each additional $1000 of median
household income, the operating margin increases
on average by .41% when the other variables
are held constant.
• b6 = -0.23. For each additional mile to the
downtown center, the operating margin decreases
on average by .23% when the other variables are
held constant.
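The phrase "when the other variables are held constant" can be made concrete with a small sketch: change one variable by one unit, leave the rest alone, and the fitted margin moves by exactly that variable's coefficient. The site values below are hypothetical, and `predicted_margin` is an illustrative helper using the rounded coefficients quoted on these slides.

```python
# Rounded coefficient estimates quoted on the slides.
intercept = 38.14
coef = {"Rooms": -0.0076, "Nearest": 1.65, "Office": 0.020,
        "College": 0.21, "Income": 0.41, "Disttwn": -0.23}

def predicted_margin(site):
    """Fitted operating margin (%) for a dict of site characteristics."""
    return intercept + sum(coef[name] * site[name] for name in coef)

# Hypothetical site (values invented for illustration only).
site = {"Rooms": 3000, "Nearest": 2.0, "Office": 500,
        "College": 15.0, "Income": 37, "Disttwn": 5.0}

# Same site, but the nearest competitor is one mile farther away.
moved = dict(site, Nearest=site["Nearest"] + 1)

change = predicted_margin(moved) - predicted_margin(site)
print(f"Change in predicted margin: {change:.2f}")
```

The change equals b2 regardless of the values chosen for the other variables, which is what the "slope" interpretation means in a multiple regression.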
Testing the Coefficients
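• The hypotheses for each bi are
H0: bi = 0
H1: bi ≠ 0
Test statistic: $t = \dfrac{\hat{b}_i - b_i}{s_{b_i}}$, d.f. = n - k - 1
(here $\hat{b}_i$ is the estimate and $b_i$ is the value under H0, which is zero)
• JMP printout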
Parameter Estimates
Term           Estimate    Std Error   t Ratio   Prob>|t|
Intercept      38.138575   6.992948    5.45      <.0001
Number         -0.007618   0.001255    -6.07     <.0001
Nearest        1.6462371   0.632837    2.60      0.0108
Office Space   0.0197655   0.00341     5.80      <.0001
Enrollment     0.2117829   0.133428    1.59      0.1159
Income         0.4131221   0.139552    2.96      0.0039
Distance       -0.225258   0.178709    -1.26     0.2107
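Each t Ratio in the printout is the estimate divided by its standard error, since the hypothesized value is zero. The sketch below recomputes the ratios and lists the terms whose p-values fall below 5%. The 5% level and the use of 0.0001 as a stand-in for "<.0001" are assumptions made for this illustration.

```python
# (term, estimate, std error, reported t ratio, reported Prob>|t|)
# The printout shows "<.0001" for very small p-values; 0.0001 is used here
# as an upper bound so the comparison with alpha still works.
rows = [
    ("Intercept",    38.138575, 6.992948,  5.45, 0.0001),
    ("Number",       -0.007618, 0.001255, -6.07, 0.0001),
    ("Nearest",      1.6462371, 0.632837,  2.60, 0.0108),
    ("Office Space", 0.0197655, 0.00341,   5.80, 0.0001),
    ("Enrollment",   0.2117829, 0.133428,  1.59, 0.1159),
    ("Income",       0.4131221, 0.139552,  2.96, 0.0039),
    ("Distance",     -0.225258, 0.178709, -1.26, 0.2107),
]
alpha = 0.05  # assumed significance level

significant = []
for term, estimate, std_error, t_reported, p_value in rows:
    t_ratio = estimate / std_error  # test statistic under H0: bi = 0
    if p_value < alpha:
        significant.append(term)
    print(f"{term:<13} t = {t_ratio:6.2f} (reported {t_reported:6.2f})  p = {p_value}")

print("Significant at 5%:", significant)
```

Enrollment and Distance are the two variables for which the data do not provide evidence of a linear relationship with margin, given the other variables in the model.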
Confidence Intervals for
Coefficients
• Note that the test of H0: bi = 0 is a test of whether xi
helps to predict y given x1,…,xi-1,xi+1,…,xk.
Results of the test might change as we change the other
independent variables in the model.
• A confidence interval for bi is
$b_i \pm t_{n-(\#\,b\text{'s}),\,\alpha/2}\; se(b_i)$
(the number of b's is k + 1, so the degrees of freedom are n - k - 1)
• In the La Quinta data, a 95% confidence interval for b1
(the coefficient on number of rooms) is
-.0076 ± 1.987 × .0013 = (-.0102, -.0050)
Using the Linear Regression
Equation
• The model can be used for making predictions by
– Producing a prediction interval estimate for a particular
value of y, for given values of xi.
– Producing a confidence interval estimate for the expected
value of y, for given values of xi.
• The model can be used to learn about relationships
between the independent variables xi and the
dependent variable y, by interpreting the coefficients
bi.
La Quinta Inns, Predictions
Xm19-01
• Predict the average operating margin of an inn at a
site with the following characteristics:
– 3815 rooms within 3 miles,
– Closest competitor .9 miles away,
– 476,000 sq-ft of office space,
– 24,500 college students,
– $35,000 median household income,
– 11.2 miles distance to downtown center.
MARGIN = 38.14 - 0.0076(3815) + 1.65(.9) + 0.020(476)
+ 0.21(24.5) + 0.41(35) - 0.23(11.2) = 37.1%
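The same prediction can be reproduced with the full-precision estimates from the JMP Parameter Estimates table rather than the rounded coefficients. This is a minimal sketch; the dictionary layout is an assumption, and the site values are entered in the units of the data file (office space in thousands of sq-ft, enrollment in thousands of students, income in thousands of dollars).

```python
# Full-precision estimates from the JMP Parameter Estimates table.
estimates = {
    "Intercept": 38.138575, "Number": -0.007618, "Nearest": 1.6462371,
    "Office Space": 0.0197655, "Enrollment": 0.2117829,
    "Income": 0.4131221, "Distance": -0.225258,
}

# Site characteristics from the slide, in the units of the data file.
site = {"Number": 3815, "Nearest": 0.9, "Office Space": 476,
        "Enrollment": 24.5, "Income": 35, "Distance": 11.2}

y_hat = estimates["Intercept"] + sum(estimates[v] * x for v, x in site.items())
print(f"Predicted margin: {y_hat:.3f}%")
```

Using unrounded coefficients gives a value that rounds to the same 37.1% and carries enough digits to compare against interval estimates for this site.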
Prediction Intervals and
Confidence Intervals for Mean
• Prediction interval for y given x1,…,xk:
$\hat{y} \pm t_{n-(\#\,b\text{'s})}\; se_{pred}(\hat{y})$
• Confidence interval for mean of y given x1,…,xk:
$\hat{y} \pm t_{n-(\#\,b\text{'s})}\; se_{ind}(\hat{y})$
• For the inn with the characteristics on the previous slide:
ŷ = 37.091
Confidence interval for mean = (32.970, 41.213)
Prediction interval = (25.395, 48.788)
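Both intervals have the form ŷ plus or minus a margin, so each should be centered on ŷ, and the prediction interval for a single inn should be the wider of the two. The sketch below checks this from the reported endpoints; the helper name is illustrative.

```python
y_hat = 37.091
ci_mean = (32.970, 41.213)    # confidence interval for the mean margin
pi_indiv = (25.395, 48.788)   # prediction interval for one inn's margin

def center_and_half_width(interval):
    """Return the midpoint and half-width of a (low, high) interval."""
    low, high = interval
    return (low + high) / 2, (high - low) / 2

ci_center, ci_half = center_and_half_width(ci_mean)
pi_center, pi_half = center_and_half_width(pi_indiv)

print(f"CI: center {ci_center:.3f}, half-width {ci_half:.3f}")
print(f"PI: center {pi_center:.3f}, half-width {pi_half:.3f}")
```

The prediction interval is wider because it must cover the variation of an individual inn around the mean, in addition to the uncertainty in the estimated mean itself.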
19.4 Regression Diagnostics - II
• The conditions required for the model
assessment to apply must be checked.
– Is the error variable normally distributed?
Draw a histogram of the residuals.
– Is the regression function correctly specified as a linear
function of x1,…,xk? Plot the residuals versus the x's and ŷ.
– Is the error variance constant? Plot the residuals versus ŷ.
– Are the errors independent? Plot the residuals versus the
time periods.
– Can we identify outliers?
– Is multicollinearity a problem?
Multicollinearity
• Condition in which independent variables are
highly correlated.
• Multicollinearity causes two kinds of difficulties:
– The t statistics appear to be too small.
– The b coefficients cannot be interpreted as “slopes”.
• Diagnostics:
– High correlation between independent variables
– Counterintuitive signs on regression coefficients
– Low values for t-statistics despite a significant overall
fit, as measured by the F statistic
Diagnostics: Multicollinearity
• Example 19.2: Predicting house price (Xm1902)
– A real estate agent believes that a house's selling price can be
predicted using the house size, number of bedrooms, and
lot size.
– A random sample of 100 houses was drawn and data
recorded.
Price    Bedrooms  H Size  Lot Size
124100   3         1290    3900
218300   4         2080    6600
117800   3         1250    3750
.        .         .       .
.        .         .       .
– Analyze the relationship among the four variables.
Diagnostics: Multicollinearity
• The proposed model is
PRICE = b0 + b1BEDROOMS + b2H-SIZE + b3LOTSIZE + e
Summary of Fit
RSquare                      0.559998
RSquare Adj                  0.546248
Root Mean Square Error       25022.71
Mean of Response             154066
Observations (or Sum Wgts)   100

Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio
Model       3   7.65017e10       2.5501e10     40.7269
Error      96   6.0109e+10       626135896     Prob > F
C. Total   99   1.36611e11                     <.0001

Parameter Estimates
Term         Estimate    Std Error   t Ratio   Prob>|t|
Intercept    37717.595   14176.74    2.66      0.0091
Bedrooms     2306.0808   6994.192    0.33      0.7423
House Size   74.296806   52.97858    1.40      0.1640
Lot Size     -4.363783   17.024      -0.26     0.7982

The model is valid, but no variable is significantly related
to the selling price?!
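The pattern in this output, a significant overall F-test with no significant individual t-test, is one of the multicollinearity diagnostics listed earlier. A minimal sketch of that check follows; the 5% level and the use of 0.0001 as a stand-in for "<.0001" are assumptions.

```python
# p-values read from the house price JMP output.
overall_p = 0.0001   # "Prob > F" is printed as <.0001; stored as that bound
slope_p = {"Bedrooms": 0.7423, "House Size": 0.1640, "Lot Size": 0.7982}
alpha = 0.05         # assumed significance level

overall_significant = overall_p < alpha
significant_slopes = [term for term, p in slope_p.items() if p < alpha]

# Warning sign: the model as a whole is useful, yet no single slope is.
suspect_multicollinearity = overall_significant and not significant_slopes

print("Overall fit significant:", overall_significant)
print("Significant slopes:", significant_slopes)
print("Possible multicollinearity:", suspect_multicollinearity)
```

The flag is only a prompt to look at the correlations among the independent variables; it does not by itself establish which variables are redundant.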
Diagnostics: Multicollinearity
• Multicollinearity is found to be a problem.
           Price    Bedrooms  H Size   Lot Size
Price      1
Bedrooms   0.6454   1
H Size     0.7478   0.8465    1
Lot Size   0.7409   0.8374    0.9936   1
• Multicollinearity causes two kinds of difficulties:
– The t statistics appear to be too small.
– The b coefficients cannot be interpreted as “slopes”.
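The first diagnostic, high correlation between independent variables, can be applied directly to the correlation matrix. The sketch below lists the pairs of independent variables whose correlation exceeds a cutoff. The cutoff of 0.8 is an assumption chosen for illustration, not a rule from the lecture.

```python
from itertools import combinations

# Pairwise correlations from the correlation matrix on the slide.
corr = {
    ("Price", "Bedrooms"): 0.6454, ("Price", "H Size"): 0.7478,
    ("Price", "Lot Size"): 0.7409, ("Bedrooms", "H Size"): 0.8465,
    ("Bedrooms", "Lot Size"): 0.8374, ("H Size", "Lot Size"): 0.9936,
}
predictors = ["Bedrooms", "H Size", "Lot Size"]  # Price is the dependent variable
threshold = 0.8  # assumed cutoff for "highly correlated"

# Keep only predictor pairs above the cutoff, strongest correlation first.
high = sorted(
    ((pair, corr[pair]) for pair in combinations(predictors, 2)
     if abs(corr[pair]) > threshold),
    key=lambda item: -abs(item[1]),
)

for (a, b), r in high:
    print(f"{a} vs {b}: r = {r}")
```

All three pairs are flagged, and house size and lot size are nearly collinear. This is consistent with each variable having a small t statistic even though the model as a whole is significant.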