Lecture 6-7 Notes - Wharton Statistics Department


Stat 112: Lecture 9 Notes
• Homework 3: Due next Thursday
• Prediction Intervals for Multiple Regression (Chapter 4.5)
• Multicollinearity (Chapter 4.6)
Summary of F tests
• Partial F tests are used to test whether a subset of the slopes in multiple regression are zero. The whole model F test (the test of the usefulness of the model) tests whether the slopes on all variables in multiple regression are zero, i.e., it tests whether the multiple regression is more useful for prediction than ignoring the X's and simply using the sample mean $\bar{Y}$ to predict Y.
• For testing whether one slope in multiple regression is
zero, we can use the t-test. But in fact, the partial F test for
one slope being zero is equivalent to the t-test (it gives the
same p-values and the same decisions).
• Why use the F test to test whether two or more slopes are all zero, rather than separate t-tests for each slope? The F test is more powerful. This will be illustrated later in the lecture, and in the sketch below.
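For those who want to try this outside JMP, here is a minimal sketch of a partial F test in Python on simulated data; all variable names and coefficients are made up for illustration.

```python
# Minimal sketch: a partial F test for H0: slopes on x1 and x2 are both zero,
# via statsmodels. All names and numbers here are illustrative.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({"x1": rng.normal(size=n),
                   "x2": rng.normal(size=n),
                   "x3": rng.normal(size=n)})
df["y"] = 1 + 0.4 * df.x1 + 0.4 * df.x2 + 2 * df.x3 + rng.normal(size=n)

reduced = smf.ols("y ~ x3", data=df).fit()           # model under H0
full = smf.ols("y ~ x1 + x2 + x3", data=df).fit()    # full model

# Partial F test comparing the nested models. If only one extra slope is
# tested, the F statistic equals the square of the t statistic, so the
# p-values and decisions match.
print(anova_lm(reduced, full))
```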
Prediction in Automobile Example
• The design team is planning a new car with
the following characteristics: horsepower =
200, weight = 4000 lb, cargo = 18 ft³,
seating = 5 adults.
• What is a 95% prediction interval for the
GPM1000 of this car?
Prediction with Multiple Regression Equation
• Prediction interval for individual with x1,…,xK:
$$\hat{y}_p = b_0 + b_1 x_1 + \cdots + b_K x_K$$
$$\text{PI: } \hat{y}_p \pm t_{\alpha/2,\, n-K-1} \cdot s_p, \qquad s_p = SE(y_p - \hat{y}_p)$$
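A minimal numpy/scipy sketch of this formula; it assumes the design matrix X and the new row x_new each begin with a 1 for the intercept (names are illustrative):

```python
# Sketch of the prediction-interval formula above; X is the n x (K+1)
# design matrix with a leading column of ones, x_new the new individual's row.
import numpy as np
from scipy import stats

def prediction_interval(X, y, x_new, alpha=0.05):
    n, p = X.shape                               # p = K + 1
    XtX = X.T @ X
    b = np.linalg.solve(XtX, X.T @ y)            # least squares coefficients
    resid = y - X @ b
    mse = resid @ resid / (n - p)                # estimate of sigma^2, df = n-K-1
    y_hat = x_new @ b                            # predicted value y-hat_p
    # s_p = SE(y_p - y-hat_p): new-observation error plus estimation error
    s_p = np.sqrt(mse * (1 + x_new @ np.linalg.solve(XtX, x_new)))
    t = stats.t.ppf(1 - alpha / 2, n - p)        # t_{alpha/2, n-K-1}
    return y_hat, (y_hat - t * s_p, y_hat + t * s_p)
```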
Finding Prediction Interval in JMP
• Enter a line with the independent variables x1,…,xK for the
new individual. Do not enter a y for the new individual.
• Fit the model. Because the new individual does not have a
y, JMP will not include the new individual when
calculating the least squares fit.
• Click the red triangle next to the response and click Save Columns:
– To find $\hat{y}_p$, click Predicted Values. This creates a column with $\hat{y}_p$.
– To find the 95% PI, click Indiv Confid Interval. This creates columns with the lower and upper endpoints of the 95% PI.
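For readers working outside JMP, a rough Python analogue of these steps; the CSV export and its column names are hypothetical stand-ins for the course's cars data file:

```python
# Rough analogue of the JMP workflow above. The CSV export and its column
# names are hypothetical stand-ins for the course's cars data file.
import pandas as pd
import statsmodels.formula.api as smf

cars = pd.read_csv("cars.csv")  # hypothetical export of the JMP table

fit = smf.ols("GPM1000 ~ Horsepower + Weight + Cargo + Seating",
              data=cars).fit()

new_car = pd.DataFrame({"Horsepower": [200], "Weight": [4000],
                        "Cargo": [18], "Seating": [5]})

# The obs_ci_* columns are the prediction interval for an individual
# response (what JMP's "Indiv Confid Interval" saves).
pred = fit.get_prediction(new_car).summary_frame(alpha=0.05)
print(pred[["mean", "obs_ci_lower", "obs_ci_upper"]])
```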
Prediction in Automobile Example
• The design team is planning a new car with
the following characteristics: horsepower =
200, weight = 4000 lb, cargo = 18 ft³,
seating = 5 adults.
• From JMP:
– $\hat{y}_p = 45.08$
– 95% prediction interval: (37.86, 52.31)
Multicollinearity
• DATA: A real estate agent wants to
develop a model to predict the selling price
of a home. The agent takes a random
sample of 100 homes that were recently
sold and records the selling price (y), the
number of bedrooms (x1), the size in square
feet (x2) and the lot size in square feet (x3).
Data is in houseprice.JMP.
Scatterplot Matrix
[Scatterplot matrix of Price, Bedrooms, House Size, and Lot Size]
Response Price

Summary of Fit
RSquare                     0.559998
RSquare Adj                 0.546248
Root Mean Square Error      25022.71
Mean of Response            154066
Observations (or Sum Wgts)  100

Analysis of Variance
Source    DF  Sum of Squares  Mean Square  F Ratio  Prob > F
Model      3  7.65017e10      2.5501e10    40.7269  <.0001
Error     96  6.0109e10       626135896
C. Total  99  1.36611e11
Parameter Estimates
Term        Estimate   Std Error  t Ratio  Prob>|t|
Intercept   37717.595  14176.74    2.66    0.0091
Bedrooms    2306.0808  6994.192    0.33    0.7423
House Size  74.296806  52.97858    1.40    0.1640
Lot Size    -4.363783  17.024     -0.26    0.7982
There is strong evidence that the predictors are useful: the p-value for the F test is <.0001 and $R^2 = .560$. Yet the t-tests for the individual coefficients are all insignificant. This is indicative of multicollinearity.
Note: These results illustrate how the F test is more powerful for testing whether a group of slopes in multiple regression are all zero than individual t-tests.
Multicollinearity
• Multicollinearity: Explanatory variables are
highly correlated with each other. It is often
hard to determine their individual regression
coefficients.
Multivariate Correlations
            Bedrooms  House Size  Lot Size
Bedrooms    1.0000    0.8465      0.8374
House Size  0.8465    1.0000      0.9936
Lot Size    0.8374    0.9936      1.0000
• There is very little information in the data set about what would happen if we fixed house size and changed lot size.
• Since house size and lot size are highly correlated, for a fixed house size, lot size does not vary much.
• The standard error for estimating the coefficient of lot size is large. Consequently the coefficient may not be significant.
• Similarly for the coefficient of house size.
• So, while it seems that at least one of the coefficients is significant (see the ANOVA F test), you cannot tell which one is the useful one.
Consequences of Multicollinearity
• Standard errors of regression coefficients are
large. As a result t statistics for testing the
population regression coefficients are small.
• Regression coefficient estimates are unstable. Signs of coefficients may be opposite of what is intuitively reasonable (e.g., the negative sign on lot size). Dropping or adding one variable in the regression causes large changes in the estimates of the coefficients of other variables, as the sketch below illustrates.
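A small simulation sketch of these consequences (all numbers arbitrary): two nearly collinear predictors yield inflated standard errors and insignificant t-tests even though the overall F test is highly significant.

```python
# Sketch: two nearly collinear predictors inflate the slope standard errors.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)       # corr(x1, x2) is nearly 1
y = 1 + x1 + x2 + rng.normal(size=n)

fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
print(fit.bse)       # large standard errors on both slopes
print(fit.pvalues)   # individual t-tests can be insignificant...
print(fit.f_pvalue)  # ...even though the overall F test is highly significant
```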
Detecting Multicollinearity
1. Pairwise correlations between explanatory
variables are high.
2. Large overall F-statistic for testing
usefulness of predictors but small t
statistics.
3. Variance inflation factors are large.
Variance Inflation Factors
• Variance inflation factor (VIF): Let $R_j^2$ denote the $R^2$ for the multiple regression of $x_j$ on the other x-variables. Then
$$VIF_j = \frac{1}{1 - R_j^2}$$
• Fact:
$$SD(\hat{\beta}_j) = \sqrt{\frac{MSE}{(n-1)S_{x_j}^2} \cdot VIF_j}$$
• $VIF_j$ for variable $x_j$: a measure of the increase in the variance of the coefficient on $x_j$ due to the correlation among the explanatory variables, compared to what the variance of the coefficient on $x_j$ would be if $x_j$ were independent of the other explanatory variables.
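The definition translates directly into code; here is a sketch (not JMP's implementation) that computes each $VIF_j$ by regressing $x_j$ on the remaining explanatory variables:

```python
# Sketch: VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing x_j on the rest.
import numpy as np
import statsmodels.api as sm

def vifs(X):
    """X: 2-D array of explanatory variables, one column per variable."""
    out = []
    for j in range(X.shape[1]):
        others = sm.add_constant(np.delete(X, j, axis=1))  # all columns but j
        r2_j = sm.OLS(X[:, j], others).fit().rsquared      # R_j^2
        out.append(1.0 / (1.0 - r2_j))
    return out
```

statsmodels also ships a ready-made version, variance_inflation_factor in statsmodels.stats.outliers_influence, which computes the same quantity.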
Using VIFs
• To obtain VIFs, after Fit Model, go to
Parameter Estimates, right click, click
Columns and click VIFs.
• Detecting multicollinearity with VIFs:
– Any individual VIF greater than 10 indicates
multicollinearity.
– Average of all VIFs considerably greater than 1
also indicates multicollinearity.
Summary of Fit
RSquare  0.559998

Parameter Estimates
Term        Estimate   Std Error  t Ratio  Prob>|t|  VIF
Intercept   37717.595  14176.74    2.66    0.0091    .
Bedrooms    2306.0808  6994.192    0.33    0.7423    3.5399784
House Size  74.296806  52.97858    1.40    0.1640    83.066839
Lot Size    -4.363783  17.024     -0.26    0.7982    78.841292
Problems Caused by Multicollinearity
• If interest is in predicting y, as long as pattern of
multicollinearity continues for those observations where
forecasts are desired (e.g., house size and lot size are either
both high, both medium or both small), multicollinearity is
not particularly problematic.
• If interest is in obtaining individual regression coefficients,
there is no good solution in face of multicollinearity.
• If interest is in predicting y for observations where pattern
of multicollinearity is different than that in sample (e.g.,
large house size, small lot size), no good solution (this
would be extrapolation).
Dealing with Multicollinearity
• Suffer: If prediction within the range of the data is the
only goal, not the interpretation of the coefficients,
then leave the multicollinearity alone.
• Combine: In some cases, it may be possible to combine variables to reduce multicollinearity (see the next slide and the sketch after this list).
• Omit a variable. Multicollinearity can be reduced by
removing one of the highly correlated variables.
However, if one wants to estimate the partial slope of
one variable holding fixed the other variables,
omitting a variable is not an option, as it changes the
interpretation of the slope.
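As a sketch of the "combine" option, mirroring the next slide; the file and column names are hypothetical stand-ins for the course data:

```python
# Sketch of the "combine" strategy: replace Horsepower and Weight by their
# ratio and refit. File and column names are hypothetical stand-ins.
import pandas as pd
import statsmodels.formula.api as smf

cars = pd.read_csv("cars.csv")  # hypothetical export of the JMP table
cars["HPperLb"] = cars["Horsepower"] / cars["Weight"]

combined = smf.ols("GPM1000 ~ Cargo + Seating + HPperLb", data=cars).fit()
print(combined.summary())  # one combined predictor replaces two collinear ones
```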
Combining Horsepower and Weight in Cars Data

Response GP1000MHwy
Parameter Estimates
Term        Estimate   Std Error  t Ratio  VIF
Intercept   19.100521  2.098478   9.10     .
Weight(lb)  0.0040877  0.001203   3.40     3.8589527
Cargo       0.0533     0.013787   3.87     1.3173318
Seating     0.0268912  0.428283   0.06     1.9717046
Horsepower  0.0426999  0.01567    2.73     3.4149672
Combining Horsepower and Weight into Horsepower/Weight
Parameter Estimates
Term               Estimate   Std Error  t Ratio  VIF
Intercept          15.021983  3.699961   4.06     .
Cargo              0.0544811  0.017328   3.14     1.3150096
Seating            1.5680411  0.470098   3.34     1.5011661
Horsepower/Weight  302.16217  51.41088   5.88     1.1905447
Multiple Regression Example: California Test Score Data
• The California Standardized Testing and
Reporting (STAR) data set californiastar.JMP
contains data on test performance, school
characteristics and student demographic
backgrounds from 1998-1999.
• Average Test Score is the average of the reading
and math scores for a standardized test
administered to 5th grade students.
• One interesting question: What would be the
causal effect of decreasing the student-teacher
ratio by one student per teacher?
Multiple Regression and Causal Inference
• Goal: Figure out what the causal effect on average test
score would be of decreasing student-teacher ratio and
keeping everything else in the world fixed.
• Lurking variable: A variable that is associated with both
average test score and student-teacher ratio.
• In order to figure out whether a drop in student-teacher ratio causes higher test scores, we want to compare mean test scores among schools with different student-teacher ratios but the same values of the lurking variables, i.e., we want to hold the values of the lurking variables fixed.
• If we include all of the lurking variables in the multiple
regression model, the coefficient on student-teacher
ratio represents the change in the mean of test scores
that is caused by a one unit increase in student-teacher
ratio.
Omitted Variables Bias
Response Average Test Score
Parameter Estimates
Term                   Estimate   Std Error  t Ratio  Prob>|t|
Intercept              698.93295  9.467491   73.82    <.0001
Student Teacher Ratio  -2.279808  0.479826   -4.75    <.0001

Response Average Test Score
Parameter Estimates
Term                         Estimate   Std Error  t Ratio  Prob>|t|
Intercept                    686.03225  7.411312   92.57    <.0001
Student Teacher Ratio        -1.101296  0.380278   -2.90    0.0040
Percent of English Learners  -0.649777  0.039343   -16.52   <.0001
• Schools with many English learners tend to have worse resources. The multiple regression that shows how mean test score changes when student-teacher ratio changes but percent of English learners is held fixed gives a better idea of the causal effect of the student-teacher ratio than the simple linear regression that does not hold percent of English learners fixed.
• Omitted variables bias: bias in estimating the causal effect of a variable from omitting a lurking variable from the multiple regression.
• Omitted variables bias of omitting percentage of English learners = -2.28 - (-1.10) = -1.18.
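A sketch of this short-versus-long regression comparison in code; the file and column names are assumptions, not the actual californiastar.JMP names:

```python
# Sketch: omitted variables bias = slope in the short regression minus the
# slope in the long regression. File and column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

star = pd.read_csv("californiastar.csv")  # hypothetical export of the JMP file

short_fit = smf.ols("AvgTestScore ~ StudentTeacherRatio", data=star).fit()
long_fit = smf.ols("AvgTestScore ~ StudentTeacherRatio + PctEnglishLearners",
                   data=star).fit()

bias = (short_fit.params["StudentTeacherRatio"]
        - long_fit.params["StudentTeacherRatio"])
print(bias)  # from the slides: -2.28 - (-1.10) = -1.18
```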
Key Warning About Multiple Regression
• Even if we have included many lurking variables in the multiple regression, we may have failed to include one, or may not have enough data to include one. There will then be omitted variables bias.
• The best way to study causal effects is to do
a randomized experiment.