Lecture 24 - University of Pennsylvania


Lecture 24
• Multiple Regression (Sections 19.4-19.5)
19.4 Regression Diagnostics - II
• The conditions required for the model
assessment to apply must be checked.
– Is the error variable normally distributed? Draw a histogram of the residuals.
– Is the regression function correctly specified as a linear function of x1,…,xk? Plot the residuals versus the x’s and ŷ.
– Is the error variance constant? Plot the residuals versus ŷ.
– Are the errors independent? Plot the residuals versus the time periods.
– Can we identify outliers and influential observations?
– Is multicollinearity a problem?
Influential Observation
• Influential observation: An observation is
influential if removing it would markedly change
the results of the analysis.
• In order to be influential, a point must either be an
outlier in terms of the relationship between its y
and x’s or have unusually distant x’s (high
leverage) and not fall exactly into the relationship
between y and x’s that the rest of the data follows.
Simple Linear Regression
Example
• Data in salary.jmp. Y=Weekly Salary,
X=Years of Experience.
[Figure: Bivariate Fit of Weekly Salary By Years of Experience; y-axis: Weekly Salary, x-axis: Years of Experience]
Identification of Influential
Observations
• Cook’s distance is a measure of the
influence of a point – the effect that
omitting the observation has on the
estimated regression coefficients.
• Use Save Columns, Cook’s D Influence to
obtain Cook’s Distance.
Cook’s Distance
• Rule of thumb: An observation with Cook’s distance (Di) > 1 has high influence. You may also be concerned about any observation that has Di < 1 but has a much bigger Di than any other observation.
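To make the rule of thumb concrete, here is a minimal sketch of Cook's distance for a simple linear regression, using the standard formula D_i = (e_i² / (p·MSE)) · h_i / (1 − h_i)², where h_i is the leverage of observation i and p = 2 coefficients. The data are hypothetical (they are not the values in salary.jmp), and the function name `cooks_distance` is an illustrative choice; JMP's own computation may differ in details.

```python
def cooks_distance(x, y):
    """Cook's distance for each point in a simple linear regression of y on x."""
    n = len(x)
    p = 2  # estimated coefficients: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - p)
    # Leverage: how far each x is from the mean of the x's
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    return [
        (e ** 2 / (p * mse)) * h / (1 - h) ** 2
        for e, h in zip(residuals, leverage)
    ]


# Hypothetical data: five points on a line plus one distant point off the line
years = [1, 2, 3, 4, 5, 20]
salary = [110, 120, 130, 140, 150, 100]

d = cooks_distance(years, salary)
most_influential = max(range(len(d)), key=lambda i: d[i])
for i, di in enumerate(d):
    flag = "high influence" if di > 1 else ""
    print(f"obs {i}: D = {di:.4f} {flag}")
```

In this made-up example only the last observation, which has a distant x and does not follow the pattern of the others, exceeds the threshold of 1.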
MBA GPA   Under GPA   GMAT   Work   Cook's D Influence MBA GPA
8.43      10.89       584    9      0.00032141
6.58      10.38       483    7      0.00678475
8.15      10.39       484    4      0.02884311
8.88      10.73       646    6      0.00016153
7.69      11.17       551    4      0.000077
8.55      9.69        591    4      0.00192334
8.7       10.37       625    3      0.00093674
Strategy for dealing with
influential observations/outliers
• Do the conclusions change when the observation is deleted?
– If no: Proceed with the observation included. Study the observation to see if anything can be learned.
– If yes: Is there reason to believe the case belongs to a population other than the one under investigation?
• If yes: Omit the case and proceed.
• If no: Does the case have unusually “distant” independent variables?
– If yes: Omit the case and proceed. Report conclusions for the reduced range of explanatory variables.
– If no: Not much can be said. More data are needed to resolve the questions.
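The first question in this strategy (do the conclusions change when the observation is deleted?) can be checked by fitting the model twice. The sketch below does this for a simple linear regression on hypothetical data; `fit_line` is an illustrative helper, not a JMP feature.

```python
def fit_line(x, y):
    """Least squares intercept and slope for a simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical data: the last observation is the suspect case
x = [1, 2, 3, 4, 5, 20]
y = [110, 120, 130, 140, 150, 100]

intercept_all, slope_all = fit_line(x, y)
intercept_without, slope_without = fit_line(x[:-1], y[:-1])

print(f"all observations:   slope = {slope_all:.3f}")
print(f"suspect case removed: slope = {slope_without:.3f}")
```

Here the slope changes sign when the case is removed, so the conclusions do change and the remaining questions in the strategy apply.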
Multicollinearity
• Multicollinearity: Condition in which independent
variables are highly correlated.
• Exact collinearity: Y=Weight, X1=Height in feet, X2=Height in inches (so X2 = 12 X1). Then
$\hat{Y} = 1.5 + 0.5 X_1 + 0.167 X_2$
$\hat{Y} = 1.5 + 2.5 X_1 + 0 X_2$
provide the same predictions (up to rounding, since 0.167 × 12 ≈ 2).
• Multicollinearity causes two kinds of difficulties:
– The t statistics appear to be too small.
– The b coefficients cannot be interpreted as “slopes”.
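A quick numerical check of the exact collinearity example: the sketch below evaluates both equations at a few hypothetical heights. It assumes the coefficient 0.5 multiplies height in feet and 0.167 (written here as 1/6) multiplies height in inches, which is the assignment under which the two equations agree.

```python
def predict_a(feet, inches):
    # 1.5 + 0.5 * (height in feet) + 0.167 * (height in inches); 1/6 is about 0.167
    return 1.5 + 0.5 * feet + (1 / 6) * inches


def predict_b(feet, inches):
    # 1.5 + 2.5 * (height in feet) + 0 * (height in inches)
    return 1.5 + 2.5 * feet + 0.0 * inches


heights_feet = [5.0, 5.5, 6.0, 6.25]  # hypothetical heights
gaps = []
for feet in heights_feet:
    inches = 12 * feet  # exact collinearity: one variable is a multiple of the other
    a = predict_a(feet, inches)
    b = predict_b(feet, inches)
    gaps.append(abs(a - b))
    print(f"{feet} ft: equation 1 = {a:.4f}, equation 2 = {b:.4f}")

max_gap = max(gaps)
```

Because very different coefficient pairs give the same fitted values, the data cannot distinguish between them, which is why the individual coefficients cannot be read as slopes.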
Multicollinearity Diagnostics
• Diagnostics:
– High correlation between independent variables
– Counterintuitive signs on regression
coefficients
– Low values for t-statistics despite a significant
overall fit, as measured by the F statistic.
Diagnostics: Multicollinearity
• Example 19.2: Predicting house price (Xm1902)
– A real estate agent believes that a house’s selling price can be predicted using the house size, number of bedrooms, and lot size.
– A random sample of 100 houses was drawn and data recorded.

Price    Bedrooms  H Size  Lot Size
124100   3         1290    3900
218300   4         2080    6600
117800   3         1250    3750
.        .         .       .
.        .         .       .
– Analyze the relationship among the four variables
Diagnostics: Multicollinearity
• The proposed model is
PRICE = b0 + b1 BEDROOMS + b2 H-SIZE + b3 LOTSIZE + e
Summary of Fit
RSquare                      0.559998
RSquare Adj                  0.546248
Root Mean Square Error       25022.71
Mean of Response             154066
Observations (or Sum Wgts)   100
The model is valid (the overall F test is significant), but no individual variable is significantly related to the selling price?!
Analysis of Variance
Source    DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Model     3    7.65017e10       2.5501e10     40.7269   <.0001
Error     96   6.0109e+10       626135896
C. Total  99   1.36611e11
Parameter Estimates
Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   37717.595   14176.74    2.66      0.0091
Bedrooms    2306.0808   6994.192    0.33      0.7423
House Size  74.296806   52.97858    1.40      0.1640
Lot Size    -4.363783   17.024      -0.26     0.7982
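The entries in the JMP output are linked by simple arithmetic, which the following sketch reproduces from the reported values: F Ratio = Mean Square (Model) / Mean Square (Error), RSquare = SS(Model) / SS(C. Total), Root Mean Square Error = sqrt(Mean Square Error), and each t Ratio = Estimate / Std Error. The variable names are illustrative; the numbers are copied from the output above.

```python
import math

# Values copied from the Analysis of Variance table
ss_model, df_model = 7.65017e10, 3
ss_total = 1.36611e11
ms_error = 626135896

ms_model = ss_model / df_model
f_ratio = ms_model / ms_error
r_square = ss_model / ss_total
root_mse = math.sqrt(ms_error)

# (Estimate, Std Error) pairs copied from the Parameter Estimates table
estimates = {
    "Intercept": (37717.595, 14176.74),
    "Bedrooms": (2306.0808, 6994.192),
    "House Size": (74.296806, 52.97858),
    "Lot Size": (-4.363783, 17.024),
}
t_ratios = {term: est / se for term, (est, se) in estimates.items()}

print(f"F Ratio  = {f_ratio:.4f}")
print(f"RSquare  = {r_square:.6f}")
print(f"Root MSE = {root_mse:.2f}")
for term, t in t_ratios.items():
    print(f"t Ratio for {term}: {t:.2f}")
```

The large F ratio alongside small t ratios for all three predictors is the pattern listed among the multicollinearity diagnostics.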
Diagnostics: Multicollinearity
• Multicollinearity is found to be a problem.
Correlations
           Price    Bedrooms  H Size   Lot Size
Price      1
Bedrooms   0.6454   1
H Size     0.7478   0.8465    1
Lot Size   0.7409   0.8374    0.9936   1
• Multicollinearity causes two kinds of difficulties:
– The t statistics appear to be too small.
– The b coefficients cannot be interpreted as “slopes”.
Remedying Violations of the
Required Conditions
• Nonnormality or heteroscedasticity can be
remedied using transformations on the y variable.
• The transformations can improve the linear
relationship between the dependent variable and
the independent variables.
• Many computer software systems allow us to
make the transformations easily.
Transformations, Example.
Reducing Nonnormality by
Transformations
• A brief list of transformations
» $y' = \log y$ (for $y > 0$)
• Use when $\sigma_\varepsilon$ increases with $y$, or
• Use when the error distribution is positively skewed
» $y' = y^2$
• Use when $\sigma_\varepsilon^2$ is proportional to $E(y)$, or
• Use when the error distribution is negatively skewed
» $y' = y^{1/2}$ (for $y > 0$)
• Use when $\sigma_\varepsilon^2$ is proportional to $E(y)$
» $y' = 1/y$
• Use when $\sigma_\varepsilon^2$ increases significantly when $y$ increases beyond some critical value.
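As a small illustration of the first transformation, the sketch below applies the log transformation to a hypothetical, positively skewed sample and compares a moment-based skewness measure before and after. The data and the `skewness` helper are assumptions made for illustration; in practice the transformation is applied to y before refitting the regression, and the residuals are then rechecked.

```python
import math


def skewness(values):
    """Moment-based skewness: third central moment over variance to the power 3/2."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical positively skewed values (all > 0, as the log requires)
y = [1, 2, 2, 3, 3, 4, 5, 8, 15, 40]
y_log = [math.log(v) for v in y]

skew_raw = skewness(y)
skew_log = skewness(y_log)
print(f"skewness of y:     {skew_raw:.2f}")
print(f"skewness of log y: {skew_log:.2f}")
```

For this sample the long right tail is pulled in by the log, so the skewness is noticeably smaller after the transformation.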
Durbin - Watson Test:
Are the Errors Autocorrelated?
• This test detects first order autocorrelation
between consecutive residuals in a time series
• If autocorrelation exists the error variables are not
independent
$$d = \frac{\sum_{i=2}^{n} (e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2}$$
where $e_i$ is the residual at time $i$.
The range of d is $0 \le d \le 4$.
Positive First Order
Autocorrelation
[Figure: Residuals plotted against Time]
Positive first order autocorrelation occurs when consecutive residuals tend to be similar. Then, the value of d is small (less than 2).
Negative First Order
Autocorrelation
[Figure: Residuals plotted against Time]
Negative first order autocorrelation occurs when consecutive residuals tend to markedly differ. Then, the value of d is large (greater than 2).
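The Durbin-Watson statistic is easy to compute directly from its definition: the sum of squared differences between consecutive residuals divided by the sum of squared residuals. The sketch below uses two hypothetical residual series to show a small d (similar consecutive residuals) and a large d (alternating residuals); the function name `durbin_watson` is an illustrative choice.

```python
def durbin_watson(residuals):
    """d = sum_{i=2..n} (e_i - e_{i-1})^2 / sum_{i=1..n} e_i^2"""
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals in time order
similar_neighbors = [1, 1, 1, -1, -1, -1]      # consecutive residuals tend to be similar
alternating_neighbors = [1, -1, 1, -1, 1, -1]  # consecutive residuals differ markedly

d_positive = durbin_watson(similar_neighbors)
d_negative = durbin_watson(alternating_neighbors)
print(f"similar neighbors:     d = {d_positive:.3f}")
print(f"alternating neighbors: d = {d_negative:.3f}")
```

The first series gives d below 2 and the second gives d above 2, matching the descriptions of positive and negative first order autocorrelation.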
Durbin-Watson Test in JMP
• H0: No first-order autocorrelation.
H1: First-order autocorrelation
• Use row diagnostics, Durbin-Watson test in
JMP after fitting the model.
Durbin-Watson
Durbin-Watson   Number of Obs.   AutoCorrelation   Prob<DW
0.5931403       20               0.5914            0.0002
• AutoCorrelation is an estimate of the correlation between consecutive errors.
Testing the Existence of
Autocorrelation, Example
• Example 19.3 (Xm19-03)
– How does the weather affect the sales of lift tickets in a ski
resort?
– Data on ticket sales for the past 20 years, along with the total snowfall and the average temperature during Christmas week in each year, were collected.
– The model hypothesized was
TICKETS = b0 + b1 SNOWFALL + b2 TEMPERATURE + e
– Regression analysis yielded the following results: