
Lecture 25
• Multiple Regression Diagnostics (Sections 19.4-19.5)
• Polynomial Models (Section 20.2)
19.4 Regression Diagnostics - II
• The conditions required for the model assessment to apply must be checked (a scripted sketch of these checks follows this list).
– Is the error variable normally distributed? Draw a histogram of the residuals.
– Is the regression function correctly specified as a linear function of x1,…,xk (E(εi) = 0)? Plot the residuals versus the x's and ŷ.
– Is the error variance constant? Plot the residuals versus ŷ.
– Are the errors independent? Plot the residuals versus the time periods.
– Can we identify outliers and influential observations?
– Is multicollinearity a problem?
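These diagnostic plots can also be produced outside JMP. Below is a minimal sketch in Python with statsmodels and matplotlib on simulated data; all variable names are placeholders, not the lecture's datasets.

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100
x1 = rng.uniform(0, 10, n)                      # hypothetical predictors
x2 = rng.uniform(0, 10, n)
y = 5 + 2 * x1 - x2 + rng.normal(0, 1, n)       # hypothetical response

fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

fig, ax = plt.subplots(1, 3, figsize=(12, 4))
ax[0].hist(fit.resid)                           # normality: histogram of residuals
ax[0].set_title("Histogram of residuals")
ax[1].scatter(fit.fittedvalues, fit.resid)      # linearity and constant variance
ax[1].set_title("Residuals vs fitted values")
ax[2].plot(fit.resid)                           # independence: residuals in time order
ax[2].set_title("Residuals vs time order")
plt.show()
```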
Effects of Violated Assumptions
• Curvature (E(εi) ≠ 0): the slopes βj are no longer meaningful. (Potential remedy: transformations of responses and predictors.)
• Violations of other assumptions: tests, p-values, and CIs are no longer accurate. That is, inference is invalidated. (Remedies may be difficult.)
Influential Observation
• Influential observation: An observation is
influential if removing it would markedly change
the results of the analysis.
• In order to be influential, a point must either (i) be an outlier in terms of the relationship between its y and x's, or (ii) have unusually distant x's (high leverage) and not fall exactly into the relationship between y and x's that the rest of the data follows.
Simple Linear Regression Example
• Data in salary.jmp. Y=Weekly Salary,
X=Years of Experience.
[Scatterplot: Bivariate Fit of Weekly Salary By Years of Experience]
Identification of Influential Observations
• Cook’s distance is a measure of the influence of a
point – the effect that omitting the observation has
on the estimated regression coefficients.
• Use Save Columns, Cook’s D Influence to obtain
Cook’s Distance.
• Plot Cook’s Distances: Graph, Overlay Plot, put Cook’s D Influence in Y and leave X blank (plots Cook’s D against row number).
Cook’s Distance
• Rule of thumb: Observation with Cook’s Distance
(Di) >1 has high influence. You should also be
concerned about any observation that has Di<1 but
has a much bigger Di than any other observation.
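The same computation can be scripted outside JMP; a sketch with statsmodels on simulated data (not salary.jmp), using get_influence().cooks_distance:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(0, 40, 50)                    # e.g. years of experience
y = 100 + 12 * x + rng.normal(0, 30, 50)      # e.g. weekly salary
y[0] += 400                                   # plant one outlier by hand

fit = sm.OLS(y, sm.add_constant(x)).fit()
cooks_d, _ = fit.get_influence().cooks_distance

# Rule of thumb: D_i > 1 flags high influence; also inspect any D_i that
# stands far above the rest even when it is below 1.
print(np.where(cooks_d > 1)[0])               # observations with D_i > 1
print(np.argsort(cooks_d)[-3:])               # the three largest D_i
```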
Ex. 19.2:
[Overlay Plot: Cook's D Influence Price versus Rows]
Strategy for dealing with influential observations/outliers
• Do the conclusions change when the observation is deleted?
– If no: proceed with the observation included. Study the observation to see if anything can be learned.
– If yes: is there reason to believe the case belongs to a population other than the one under investigation?
• If yes: omit the case and proceed.
• If no: does the case have unusually "distant" independent variables?
– If yes: omit the case and proceed. Report conclusions for the reduced range of explanatory variables.
– If no: not much can be said. More data are needed to resolve the questions.
Multicollinearity
• Multicollinearity: Condition in which independent
variables are highly correlated.
• Exact collinearity: Y=Weight, X1=Height in inches, X2=Height in feet. Then
Ŷ = 1.5 + 0.5X1 + 24X2
Ŷ = 1.5 + 2.5X1 + 0X2
provide the same predictions (a quick numerical check is sketched below).
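Since X1 = 12·X2, the two coefficient vectors are interchangeable; a quick check with invented heights:

```python
import numpy as np

height_ft = np.array([5.0, 5.5, 6.0, 6.5])    # invented heights in feet
height_in = 12 * height_ft                    # exact collinearity: X1 = 12*X2

pred_a = 1.5 + 0.5 * height_in + 24 * height_ft   # first coefficient vector
pred_b = 1.5 + 2.5 * height_in + 0 * height_ft    # second coefficient vector
print(np.allclose(pred_a, pred_b))                # True: identical predictions
```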
• Multicollinearity causes two kinds of difficulties:
– The t statistics appear to be too small.
– The β coefficients cannot be interpreted as "slopes".
Multicollinearity Diagnostics
• Diagnostics:
– High correlation between independent variables
– Counterintuitive signs on regression
coefficients
– Low values for t-statistics despite a significant
overall fit, as measured by the F statistic.
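The first diagnostic can be checked directly with a correlation matrix; a sketch with pandas, where all values are invented for illustration (shaped like the Example 19.2 variables):

```python
import pandas as pd

# Invented values shaped like the Example 19.2 predictors.
df = pd.DataFrame({
    "Bedrooms": [3, 4, 3, 5, 4],
    "H Size":   [1290, 2080, 1250, 2600, 2000],
    "Lot Size": [3900, 6600, 3750, 7800, 6100],
})
print(df.corr())   # high pairwise correlations signal multicollinearity
```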
Diagnostics: Multicollinearity
• Example 19.2: Predicting house price (Xm1902)
– A real estate agent believes that a house selling price can be
predicted using the house size, number of bedrooms, and
lot size.
– A random sample of 100 houses was drawn and data
recorded.

Price    Bedrooms   H Size   Lot Size
124100   3          1290     3900
218300   4          2080     6600
117800   3          1250     3750
…        …          …        …
– Analyze the relationship among the four variables
Diagnostics: Multicollinearity
• The proposed model is
PRICE = β0 + β1BEDROOMS + β2H-SIZE + β3LOTSIZE + ε

Summary of Fit
RSquare                       0.559998
RSquare Adj                   0.546248
Root Mean Square Error        25022.71
Mean of Response              154066
Observations (or Sum Wgts)    100

The model is valid, but no variable is significantly related to the selling price?!
Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Model      3    7.65017e10       2.5501e10     40.7269   <.0001
Error      96   6.0109e+10       626135896
C. Total   99   1.36611e11
Parameter Estimates
Term         Estimate    Std Error   t Ratio   Prob>|t|
Intercept    37717.595   14176.74    2.66      0.0091
Bedrooms     2306.0808   6994.192    0.33      0.7423
House Size   74.296806   52.97858    1.40      0.1640
Lot Size     -4.363783   17.024      -0.26     0.7982
Diagnostics: Multicollinearity
• Multicollinearity is found to be a problem.
           Price    Bedrooms   H Size   Lot Size
Price      1
Bedrooms   0.6454   1
H Size     0.7478   0.8465     1
Lot Size   0.7409   0.8374     0.9936   1
• Multicollinearity causes two kinds of difficulties:
– The t statistics appear to be too small.
– The β coefficients cannot be interpreted as "slopes".
Remedying Violations of the Required Conditions
• Nonnormality or heteroscedasticity can be
remedied using transformations on the y variable.
• The transformations can improve the linear
relationship between the dependent variable and
the independent variables.
• Many computer software systems allow us to
make the transformations easily.
Reducing Nonnormality by Transformations
• A brief list of transformations:
» y' = log y (for y > 0)
• Use when σε increases with y, or
• Use when the error distribution is positively skewed
» y' = y²
• Use when σε² is proportional to E(y), or
• Use when the error distribution is negatively skewed
» y' = y^(1/2) (for y > 0)
• Use when σε² is proportional to E(y)
» y' = 1/y
• Use when σε² increases significantly when y increases beyond some critical value.
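A sketch of the first rule in practice, on simulated data with multiplicative errors so the residual spread grows with y (all names are placeholders):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(1, 10, 200)
# Multiplicative errors: the residual spread grows with y, the case the
# log transformation targets.
y = np.exp(0.5 + 0.3 * x + rng.normal(0, 0.4, 200))

X = sm.add_constant(x)
raw_fit = sm.OLS(y, X).fit()           # residuals fan out as y grows
log_fit = sm.OLS(np.log(y), X).fit()   # y' = log y stabilizes the variance

# Compare residual spread at high vs low x; a ratio near 1 means the
# variance is roughly constant after the transformation.
print(np.std(raw_fit.resid[x > 7]) / np.std(raw_fit.resid[x < 4]))
print(np.std(log_fit.resid[x > 7]) / np.std(log_fit.resid[x < 4]))
```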
Durbin-Watson Test: Are the Errors Autocorrelated?
• This test detects first order autocorrelation between consecutive residuals in a time series.
• If autocorrelation exists, the error variables are not independent.
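For reference, the test statistic d used on the next two slides is the standard Durbin-Watson statistic, computed from the residuals e_t over n time periods:

```latex
d = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}
  \approx 2\,(1 - \hat{\rho}_1), \qquad 0 \le d \le 4
```

Here ρ̂1 is the estimated first-order autocorrelation of the residuals, so d near 2 indicates no first order autocorrelation, d well below 2 indicates positive autocorrelation, and d well above 2 indicates negative autocorrelation.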
Positive First Order Autocorrelation
[Plot: residuals versus time, with consecutive residuals tracking each other]
Positive first order autocorrelation occurs when consecutive residuals tend to be similar. Then, the value of d is small (less than 2).
Negative First Order Autocorrelation
[Plot: residuals versus time, alternating in sign]
Negative first order autocorrelation occurs when consecutive residuals tend to markedly differ. Then, the value of d is large (greater than 2).
Durbin-Watson Test in JMP
• H0: No first-order autocorrelation.
H1: First-order autocorrelation.
• Use Row Diagnostics, Durbin-Watson Test in JMP after fitting the model.

Durbin-Watson
Durbin-Watson     0.5931403
Number of Obs.    20
AutoCorrelation   0.5914
Prob<DW           0.0002
• The AutoCorrelation value shown is an estimate of the correlation between consecutive errors.
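Outside JMP the same statistic is available via statsmodels.stats.stattools.durbin_watson; a minimal sketch on simulated autocorrelated residuals (illustrative numbers only):

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(3)
e = np.zeros(20)
for t in range(1, 20):                 # AR(1)-style residuals, rho = 0.6
    e[t] = 0.6 * e[t - 1] + rng.normal(0, 1)

print(durbin_watson(e))                # well below 2: positive autocorrelation
```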
Testing the Existence of Autocorrelation, Example
• Example 19.3 (Xm19-03)
– How does the weather affect the sales of lift tickets in a ski
resort?
– Data on the past 20 years' sales of tickets, along with the total snowfall and the average temperature during Christmas week in each year, were collected.
– The model hypothesized was
TICKETS = β0 + β1SNOWFALL + β2TEMPERATURE + ε
– Regression analysis yielded the following results:
20.1 Introduction
• Regression analysis is one of the most commonly
used techniques in statistics.
• It is considered powerful for several reasons:
– It can cover a variety of mathematical models:
• linear relationships
• nonlinear relationships
• nominal independent variables
– It provides efficient methods for model building.
Curvature: Midterm Problem 10
[Scatterplot: Bivariate Fit of MPG City By Weight(lb), with a curved trend; the residual plot versus Weight(lb) shows systematic curvature]
Remedy I: Transformations
• Use Tukey’s Bulging Rule to choose a
transformation.
[Scatterplot: Bivariate Fit of 1/MPGCity By Weight(lb), with residual plot versus Weight(lb)]
Remedy II: Polynomial Models
y = 0 + 1x1+ 2x2 +…+ pxp + 
y = 0 + 1x + 2x2 + …+pxp + 
Quadratic Regression
[Scatterplot: Bivariate Fit of MPG City By Weight(lb) with fitted quadratic curve; residual plot versus Weight(lb)]

Parameter Estimates
Term                     Estimate    Std Error
Intercept                40.166608   0.902231
Weight(lb)               -0.006894   0.00032
(Weight(lb)-2809.5)^2    0.000003    4.634e-7
Polynomial Models with One Predictor Variable
• First order model (p = 1):
y = β0 + β1x + ε
• Second order model (p = 2):
y = β0 + β1x + β2x² + ε
[Plots: the parabola opens downward when β2 < 0 and upward when β2 > 0]
Polynomial Models with One Predictor Variable
• Third order model (p = 3):
y = β0 + β1x + β2x² + β3x³ + ε
[Plots: cubic curves with β3 < 0 and β3 > 0]
Interaction
• Two independent variables x1 and x2
interact if the effect of x1 on y is influenced
by the value of x2.
• Interaction can be brought into the multiple
linear regression model by including the
independent variable x1*x2.
• Example:
Predicted Income = 1000 + 2000·Educ + 100·IQ + 10·IQ·Educ
Interaction Cont.
• y = β0 + β1x1 + β2x2 + β3x1x2 + ε
• "Slope" for x1 = E(y | x1+1, x2) − E(y | x1, x2) = β1 + β3x2
• Predicted Income = 1000 + 2000·Educ + 100·IQ + 10·IQ·Educ
• Is the expected income increase from an
extra year of education higher for people
with IQ 100 or with IQ 130 (or is it the
same)?
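A worked check using the slope formula above: the expected income increase from an extra year of education is 2000 + 10·IQ, i.e. 2000 + 10(100) = 3000 at IQ 100 and 2000 + 10(130) = 3300 at IQ 130, so the increase is higher for IQ 130.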
Polynomial Models with Two Predictor Variables
• First order model:
y = β0 + β1x1 + β2x2 + ε
The effect of one predictor variable on y is independent of the effect of the other predictor variable on y.
[Plot: y versus x1 gives parallel lines for X2 = 1, 2, 3]
• First order model, two predictors, and interaction:
y = β0 + β1x1 + β2x2 + β3x1x2 + ε
The two variables interact to affect the value of y.
[Plot: y versus x1 gives non-parallel lines for X2 = 1, 2, 3]
Polynomial Models with Two Predictor Variables
Second order model:
y = β0 + β1x1 + β2x2 + β3x1² + β4x2² + ε
For each fixed value of X2 this is a quadratic in x1:
X2 = 3: y = [β0 + β2(3) + β4(3²)] + β1x1 + β3x1² + ε
X2 = 2: y = [β0 + β2(2) + β4(2²)] + β1x1 + β3x1² + ε
X2 = 1: y = [β0 + β2(1) + β4(1²)] + β1x1 + β3x1² + ε
[Plot: quadratic curves in x1 for X2 = 1, 2, 3]
Second order model with interaction:
y = β0 + β1x1 + β2x2 + β3x1² + β4x2² + β5x1x2 + ε
[Plot: non-parallel quadratic curves in x1 for X2 = 1, 2, 3]
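A sketch of fitting this model with statsmodels formula notation (simulated data; all names and coefficients are invented):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 200
df = pd.DataFrame({"x1": rng.uniform(0, 5, n), "x2": rng.uniform(0, 5, n)})
df["y"] = (1 + 2 * df["x1"] - df["x2"] + 0.5 * df["x1"] ** 2
           - 0.3 * df["x2"] ** 2 + 0.8 * df["x1"] * df["x2"]
           + rng.normal(0, 1, n))

# y = b0 + b1*x1 + b2*x2 + b3*x1^2 + b4*x2^2 + b5*x1*x2 + e
model = smf.ols("y ~ x1 + x2 + I(x1**2) + I(x2**2) + x1:x2", data=df).fit()
print(model.params)
```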