Transcript Document
6-3 Multiple Regression
6-3.1 Estimation of Parameters in Multiple
Regression
y_i = β_0 + β_1 x_{i1} + β_2 x_{i2} + ⋯ + β_k x_{ik} + ε_i
    = β_0 + Σ_{j=1}^{k} β_j x_{ij} + ε_i,    i = 1, 2, ⋯, n
• The least squares function is given by
  L = Σ_{i=1}^{n} ε_i² = Σ_{i=1}^{n} ( y_i − β_0 − Σ_{j=1}^{k} β_j x_{ij} )²
• The least squares estimates must satisfy
  ∂L/∂β_0 = 0  and  ∂L/∂β_j = 0,  j = 1, 2, ⋯, k
• The least squares normal equations are
  (X′X) β̂ = X′y
• The solution to the normal equations is the least squares estimator of the regression coefficients:
  β̂ = (X′X)⁻¹ X′y
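The normal equations can be solved directly. A minimal numpy sketch (the small design matrix and response below are illustrative values, not the text's example) showing that solving (X′X)β̂ = X′y agrees with a least squares routine:

```python
import numpy as np

# Illustrative data: n = 5 observations, intercept column plus k = 2 regressors.
X = np.array([[1.0,  2.0,  50.0],
              [1.0,  8.0, 110.0],
              [1.0, 11.0, 120.0],
              [1.0, 10.0, 550.0],
              [1.0,  8.0, 295.0]])
y = np.array([9.95, 24.45, 31.75, 35.00, 25.02])

# Solve the least squares normal equations (X'X) beta = X'y
beta = np.linalg.solve(X.T @ X, X.T @ y)

# The same estimates via numpy's least squares routine
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta, beta_lstsq))  # True
```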
6-3 Multiple Regression
X′X in Multiple Regression

(X′X) =
  [ n               Σ x_{1i}           ⋯   Σ x_{ki}
    Σ x_{1i}        Σ x_{1i}²          ⋯   Σ x_{1i} x_{ki}
    ⋮               ⋮                  ⋱   ⋮
    Σ x_{ki}        Σ x_{ki} x_{1i}    ⋯   Σ x_{ki}²        ]

with all sums running over i = 1, ⋯, n, and

σ² (X′X)⁻¹ =
  [ Var(β̂_0)          Cov(β̂_0, β̂_1)    ⋯   Cov(β̂_0, β̂_k)
    Cov(β̂_0, β̂_1)    Var(β̂_1)          ⋯   Cov(β̂_1, β̂_k)
    ⋮                  ⋮                  ⋱   ⋮
    Cov(β̂_0, β̂_k)    Cov(β̂_1, β̂_k)    ⋯   Var(β̂_k)        ]
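Substituting the estimate σ̂² = MSE gives the estimated covariance matrix of the β̂'s. A short numpy sketch (the data are assumed toy values for illustration):

```python
import numpy as np

# Toy design matrix (intercept column plus two regressors) and response.
X = np.array([[1.,  2.,  50.], [1.,  8., 110.], [1., 11., 120.],
              [1., 10., 550.], [1.,  8., 295.], [1.,  4., 200.]])
y = np.array([9.95, 24.45, 31.75, 35.00, 25.02, 16.86])
n, p = X.shape                        # p = k + 1 parameters

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y              # least squares estimates
resid = y - X @ beta
mse = resid @ resid / (n - p)         # estimate of sigma^2

cov_beta = mse * XtX_inv              # estimated covariance matrix of beta-hat
se_beta = np.sqrt(np.diag(cov_beta))  # standard errors = sqrt of the diagonal
```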
6-3 Multiple Regression
Adjusted R2
We can adjust the R2 to take into account the number of regressors
in the model:
R²_ADJ = ADJ RSQ = 1 − (1 − R²) · (n − 1) / (n − (k + 1))

(i) The ADJ RSQ does not always increase, like R², as k increases. ADJ RSQ is especially preferred to R² if k/n is a large fraction (greater than 10%). If k/n is small, then both measures are almost identical.
(ii) Always: ADJ RSQ ≤ R² ≤ 1
(iii) R² = 1 − SSE/SS(TOTAL)
      ADJ RSQ = 1 − MSE/MS(TOTAL)
where MS(TOTAL) = SS(TOTAL)/(n − 1) = sample variance of y.
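The adjustment is a one-line computation; using the wire bond numbers reported later in the SAS output (n = 25, k = 2, R-Square = 0.9811) it reproduces the reported Adj R-Sq:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - (k + 1))."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - (k + 1))

# Wire bond example: n = 25, k = 2, R-Square = 0.9811
print(round(adjusted_r2(0.9811, 25, 2), 4))  # 0.9794, the Adj R-Sq SAS reports
```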
6-3 Multiple Regression
6-3.2 Inferences in Multiple Regression
Test for Significance of Regression
6-3 Multiple Regression
6-3.2 Inferences in Multiple Regression
Inference on Individual Regression Coefficients
H_0: β_j = 0   vs.   H_1: β_j ≠ 0
โขThis is called a partial or marginal test
6-3 Multiple Regression
6-3.2 Inferences in Multiple Regression
Confidence Intervals on the Mean Response
and Prediction Intervals
μ_{Y|x_10, x_20, ⋯, x_k0} = β_0 + β_1 x_10 + β_2 x_20 + ⋯ + β_k x_k0
6-3 Multiple Regression
Confidence Intervals on the Mean Response
and Prediction Intervals
The response at the point of interest is
Y_0 = β_0 + β_1 x_10 + β_2 x_20 + ⋯ + β_k x_k0 + ε
and the corresponding predicted value is
Ŷ_0 = μ̂_{Y|x_10, x_20, ⋯, x_k0} = β̂_0 + β̂_1 x_10 + β̂_2 x_20 + ⋯ + β̂_k x_k0
The prediction error is Y_0 − Ŷ_0, and the standard deviation of this prediction error is
√( σ̂² + [ se( μ̂_{Y|x_10, x_20, ⋯, x_k0} ) ]² )
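This standard deviation combines the residual variance with the standard error of the estimated mean. As a quick numeric check, the numbers for observation 1 of the wire bond output later in this section (MSE = 5.23516, Std Error Mean Predict = 0.9074) can be plugged in:

```python
import math

def prediction_se(mse, se_mean):
    """Std deviation of the prediction error: sqrt(sigma2_hat + se_mean^2)."""
    return math.sqrt(mse + se_mean ** 2)

# Observation 1 of the wire bond output: MSE = 5.23516, se of the mean = 0.9074
se_pred = prediction_se(5.23516, 0.9074)
```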
6-3 Multiple Regression
6-3.2 Inferences in Multiple Regression
Confidence Intervals on the Mean Response
and Prediction Intervals
6-3 Multiple Regression
6-3.3 Checking Model Adequacy
Residual Analysis
0 < h_ii ≤ 1
Because the h_ii's are always between zero and one, a studentized residual is always larger in absolute value than the corresponding standardized residual. Consequently, studentized residuals are a more sensitive diagnostic when looking for outliers.
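The inflation can be seen numerically. In this sketch, e = 5.8409 and MSE = 5.23516 come from observation 15 of the wire bond output later in this section, while the hat diagonal h ≈ 0.074 is an inferred value backed out from its reported residual standard error (SAS does not print it directly there):

```python
import numpy as np

def standardized_and_studentized(e, mse, h):
    """Standardized residual e / sqrt(MSE) and studentized residual
    e / sqrt(MSE * (1 - h)), where h is the hat diagonal h_ii."""
    std = e / np.sqrt(mse)
    stu = e / np.sqrt(mse * (1.0 - h))
    return std, stu

# Observation 15 of the wire bond fit: e = 5.8409, MSE = 5.23516; h inferred.
std, stu = standardized_and_studentized(5.8409, 5.23516, 0.074)
print(abs(stu) > abs(std))  # True: dividing by sqrt(1 - h) < 1 inflates it
```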
6-3 Multiple Regression
6-3.3 Checking Model Adequacy
Influential Observations
• The disposition of points in the x-space is important in determining the properties of the model: R², the regression coefficients, and the magnitude of the error mean square.
• A large value of D_i implies that the ith point is influential.
• A value of D_i > 1 would indicate that the point is influential.
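Cook's D_i combines the studentized residual with the leverage h_ii. A sketch using one common algebraic form, with the values for observation 17 of the wire bond output later in this section (the hat diagonal is an inferred value, backed out from that observation's residual standard error):

```python
def cooks_d(student_resid, h, p):
    """Cook's distance D_i = (r_i^2 / p) * h_ii / (1 - h_ii), where r_i is
    the studentized residual and p = k + 1 is the number of parameters."""
    return (student_resid ** 2 / p) * h / (1.0 - h)

# Observation 17 of the wire bond fit: r = 2.201, p = 3, h inferred ~ 0.259;
# the result lands close to the 0.565 SAS reports for that observation.
d = cooks_d(2.201, 0.259, 3)
```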
6-3 Multiple Regression
Example 6-7
OPTIONS NOOVP NODATE NONUMBER LS=140;
DATA ex67;
   INPUT strength length height @@;
   label strength='Pull Strength' length='Wire length' height='Die Height';
   CARDS;
9.95 2 50 24.45 8 110 31.75 11 120 35 10 550
25.02 8 295 16.86 4 200 14.38 2 375 9.6 2 52
24.35 9 100 27.5 8 300 17.08 4 412 37 11 400
41.95 12 500 11.66 2 360 21.65 4 205 17.89 4 400
69 20 600 10.3 1 585 34.93 10 540 46.59 15 250
44.88 15 290 54.12 16 510 56.63 17 590 22.13 6 100
21.15 5 400
PROC SGSCATTER data=ex67;
   MATRIX STRENGTH LENGTH HEIGHT;
   TITLE 'Scatter Plot Matrix for Wire Bond Data';
PROC REG data=ex67;
   MODEL strength=length height/xpx r CLB CLM CLI;
   PLOT npp.*Residual.;     /* Normal probability plot */
   PLOT Residual.*Pred.;    /* Residual vs. predicted plot */
   PLOT Residual.*length;
   PLOT Residual.*height;
   TITLE 'Multiple Regression';
DATA EX67N;
   INPUT LENGTH HEIGHT @@;
   DATALINES;
11 35 5 20
DATA EX67N1;
   SET EX67 EX67N;
PROC REG DATA=EX67N1;
   MODEL STRENGTH=LENGTH HEIGHT/CLM CLI;
   TITLE 'CIs FOR MEAN RESPONSE AND FUTURE OBSERVATION';
RUN; QUIT;
6-3 Multiple Regression
The REG Procedure
Model: MODEL1

Model Crossproducts X'X X'Y Y'Y

Variable    Label           Intercept      length       height      strength
Intercept   Intercept              25         206         8294        725.82
length      Wire length           206        2396        77177       8008.47
height      Die Height           8294       77177      3531848     274816.71
strength    Pull Strength      725.82     8008.47    274816.71    27178.5316
The REG Procedure
Model: MODEL1
Dependent Variable: strength

Number of Observations Read    25
Number of Observations Used    25

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               2        5990.77122     2995.38561     572.17    <.0001
Error              22         115.17348        5.23516
Corrected Total    24        6105.94470

Root MSE          2.28805    R-Square    0.9811
Dependent Mean   29.03280    Adj R-Sq    0.9794
Coeff Var         7.88090
Parameter Estimates

Variable    Label         DF    Parameter    Standard    t Value    Pr > |t|    95% Confidence Limits
                                 Estimate       Error
Intercept   Intercept      1      2.26379     1.06007       2.14      0.0441     0.06535    4.46223
length      Wire length    1      2.74427     0.09352      29.34      <.0001     2.55031    2.93823
height      Die Height     1      0.01253     0.00280       4.48      0.0002     0.00672    0.01833
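As a cross-check, these parameter estimates can be reproduced outside SAS. A numpy sketch using the data from the CARDS block above:

```python
import numpy as np

# Wire bond data from the DATA step: (strength, length, height)
data = [(9.95,2,50),(24.45,8,110),(31.75,11,120),(35,10,550),(25.02,8,295),
        (16.86,4,200),(14.38,2,375),(9.6,2,52),(24.35,9,100),(27.5,8,300),
        (17.08,4,412),(37,11,400),(41.95,12,500),(11.66,2,360),(21.65,4,205),
        (17.89,4,400),(69,20,600),(10.3,1,585),(34.93,10,540),(46.59,15,250),
        (44.88,15,290),(54.12,16,510),(56.63,17,590),(22.13,6,100),(21.15,5,400)]
y = np.array([d[0] for d in data])
X = np.column_stack([np.ones(len(data)),
                     [d[1] for d in data],    # length
                     [d[2] for d in data]])   # height
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta, 5))  # compare with SAS: 2.26379, 2.74427, 0.01253
```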
6-3 Multiple Regression
Multiple Regression
The REG Procedure
Model: MODEL1
Dependent Variable: strength
Output Statistics
      Dependent  Predicted    Std Error
Obs    Variable      Value  Mean Predict       95% CL Mean         95% CL Predict     Residual   Std Error   Student   Cook's
                                                                                                  Residual  Residual        D
  1      9.9500     8.3787     0.9074      6.4968  10.2606      3.2740  13.4834        1.5713       2.100     0.748    0.035
  2     24.4500    25.5960     0.7645     24.0105  27.1815     20.5930  30.5990       -1.1460       2.157    -0.531    0.012
  3     31.7500    33.9541     0.8620     32.1665  35.7417     28.8834  39.0248       -2.2041       2.119    -1.040    0.060
  4     35.0000    36.5968     0.7303     35.0821  38.1114     31.6158  41.5778       -1.5968       2.168    -0.736    0.021
  5     25.0200    27.9137     0.4677     26.9437  28.8836     23.0704  32.7569       -2.8937       2.240    -1.292    0.024
  6     16.8600    15.7464     0.6261     14.4481  17.0448     10.8269  20.6660        1.1136       2.201     0.506    0.007
  7     14.3800    12.4503     0.7862     10.8198  14.0807      7.4328  17.4677        1.9297       2.149     0.898    0.036
  8      9.6000     8.4038     0.9039      6.5291  10.2784      3.3018  13.5058        1.1962       2.102     0.569    0.020
  9     24.3500    28.2150     0.8185     26.5175  29.9125     23.1754  33.2546       -3.8650       2.137    -1.809    0.160
 10     27.5000    27.9763     0.4651     27.0118  28.9408     23.1341  32.8184       -0.4763       2.240    -0.213    0.001
 11     17.0800    18.4023     0.6960     16.9588  19.8458     13.4425  23.3621       -1.3223       2.180    -0.607    0.013
 12     37.0000    37.4619     0.5246     36.3739  38.5498     32.5936  42.3301       -0.4619       2.227    -0.207    0.001
 13     41.9500    41.4589     0.6553     40.0999  42.8179     36.5230  46.3948        0.4911       2.192     0.224    0.001
 14     11.6600    12.2623     0.7689     10.6678  13.8568      7.2565  17.2682       -0.6023       2.155    -0.280    0.003
 15     21.6500    15.8091     0.6213     14.5206  17.0976     10.8921  20.7260        5.8409       2.202     2.652    0.187
 16     17.8900    18.2520     0.6785     16.8448  19.6592     13.3026  23.2014       -0.3620       2.185    -0.166    0.001
 17     69.0000    64.6659     1.1652     62.2494  67.0824     59.3409  69.9909        4.3341       1.969     2.201    0.565
 18     10.3000    12.3368     1.2383      9.7689  14.9048      6.9414  17.7323       -2.0368       1.924    -1.059    0.155
 19     34.9300    36.4715     0.7096     34.9999  37.9431     31.5034  41.4396       -1.5415       2.175    -0.709    0.018
 20     46.5900    46.5598     0.8780     44.7389  48.3807     41.4773  51.6423        0.0302       2.113    0.0143    0.000
 21     44.8800    47.0609     0.8238     45.3524  48.7694     42.0176  52.1042       -2.1809       2.135    -1.022    0.052
 22     54.1200    52.5613     0.8432     50.8127  54.3099     47.5042  57.6183        1.5587       2.127     0.733    0.028
 23     56.6300    56.3078     0.9771     54.2814  58.3342     51.1481  61.4675        0.3222       2.069     0.156    0.002
 24     22.1300    19.9822     0.7557     18.4149  21.5494     14.9850  24.9794        2.1478       2.160     0.995    0.040
 25     21.1500    20.9963     0.6176     19.7153  22.2772     16.0813  25.9112        0.1537       2.203    0.0698    0.000

Sum of Residuals                        0
Sum of Squared Residuals        115.17348
Predicted Residual SS (PRESS)   156.16295
6-3 Multiple Regression
CIs FOR MEAN RESPONSE AND FUTURE OBSERVATION
The REG Procedure
Model: MODEL1
Dependent Variable: strength Pull Strength
Number of Observations Read                  27
Number of Observations Used                  25
Number of Observations with Missing Values    2

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               2        5990.77122     2995.38561     572.17    <.0001
Error              22         115.17348        5.23516
Corrected Total    24        6105.94470

Root MSE          2.28805    R-Square    0.9811
Dependent Mean   29.03280    Adj R-Sq    0.9794
Coeff Var         7.88090

Parameter Estimates

Variable    Label         DF    Parameter    Standard    t Value    Pr > |t|
                                 Estimate       Error
Intercept   Intercept      1      2.26379     1.06007       2.14      0.0441
length      Wire length    1      2.74427     0.09352      29.34      <.0001
height      Die Height     1      0.01253     0.00280       4.48      0.0002
6-3 Multiple Regression
CIs FOR MEAN RESPONSE AND FUTURE OBSERVATION
The REG Procedure
Model: MODEL1
Dependent Variable: strength Pull Strength
Output Statistics
      Dependent  Predicted    Std Error
Obs    Variable      Value  Mean Predict       95% CL Mean         95% CL Predict     Residual
  1      9.9500     8.3787     0.9074      6.4968  10.2606      3.2740  13.4834        1.5713
  2     24.4500    25.5960     0.7645     24.0105  27.1815     20.5930  30.5990       -1.1460
  3     31.7500    33.9541     0.8620     32.1665  35.7417     28.8834  39.0248       -2.2041
  4     35.0000    36.5968     0.7303     35.0821  38.1114     31.6158  41.5778       -1.5968
  5     25.0200    27.9137     0.4677     26.9437  28.8836     23.0704  32.7569       -2.8937
  6     16.8600    15.7464     0.6261     14.4481  17.0448     10.8269  20.6660        1.1136
  7     14.3800    12.4503     0.7862     10.8198  14.0807      7.4328  17.4677        1.9297
  8      9.6000     8.4038     0.9039      6.5291  10.2784      3.3018  13.5058        1.1962
  9     24.3500    28.2150     0.8185     26.5175  29.9125     23.1754  33.2546       -3.8650
 10     27.5000    27.9763     0.4651     27.0118  28.9408     23.1341  32.8184       -0.4763
 11     17.0800    18.4023     0.6960     16.9588  19.8458     13.4425  23.3621       -1.3223
 12     37.0000    37.4619     0.5246     36.3739  38.5498     32.5936  42.3301       -0.4619
 13     41.9500    41.4589     0.6553     40.0999  42.8179     36.5230  46.3948        0.4911
 14     11.6600    12.2623     0.7689     10.6678  13.8568      7.2565  17.2682       -0.6023
 15     21.6500    15.8091     0.6213     14.5206  17.0976     10.8921  20.7260        5.8409
 16     17.8900    18.2520     0.6785     16.8448  19.6592     13.3026  23.2014       -0.3620
 17     69.0000    64.6659     1.1652     62.2494  67.0824     59.3409  69.9909        4.3341
 18     10.3000    12.3368     1.2383      9.7689  14.9048      6.9414  17.7323       -2.0368
 19     34.9300    36.4715     0.7096     34.9999  37.9431     31.5034  41.4396       -1.5415
 20     46.5900    46.5598     0.8780     44.7389  48.3807     41.4773  51.6423        0.0302
 21     44.8800    47.0609     0.8238     45.3524  48.7694     42.0176  52.1042       -2.1809
 22     54.1200    52.5613     0.8432     50.8127  54.3099     47.5042  57.6183        1.5587
 23     56.6300    56.3078     0.9771     54.2814  58.3342     51.1481  61.4675        0.3222
 24     22.1300    19.9822     0.7557     18.4149  21.5494     14.9850  24.9794        2.1478
 25     21.1500    20.9963     0.6176     19.7153  22.2772     16.0813  25.9112        0.1537
 26       .        32.8892     1.0620     30.6867  35.0918     27.6579  38.1206          .
 27       .        16.2357     0.9286     14.3099  18.1615     11.1147  21.3567          .

Sum of Residuals                        0
Sum of Squared Residuals        115.17348
Predicted Residual SS (PRESS)   156.16295
6-3 Multiple Regression
6-3.3 Checking Model Adequacy
Multicollinearity
Multicollinearity is a catch-all phrase referring to problems caused by the independent variables being correlated with each other. This can cause a number of problems:
1) Individual t-tests can be non-significant for important variables. The sign of a β̂_i can be flipped. Recall, the partial slopes measure the change in Y for a unit change in X_i holding the other X's constant. If two X's are highly correlated, this interpretation doesn't do much good.
2) The MSE can be inflated. Also the SE's of the partial slopes are inflated.
3) R² < r²_{YX_1} + r²_{YX_2} + ⋯ + r²_{YX_k}
4) Removing one X from the model may make another more significant or less significant.
6-3 Multiple Regression
6-3.3 Checking Model Adequacy
Variance Inflation Factor
The Quantity 1/(1 โ ๐
2๐๐โ ๐1 โฏ๐๐+1 ๐๐โ1 โโ๐๐ ) called the variance inflation factor is
denoted as VIF(Xj). The larger the value of VIF(Xj), the more the multicollinearity
and the larger the standard error of the ๐ฝ๐ due to having Xj in the model. A common
rule of thumb is that if VIF(Xj)>5 then multicollinearity is high. Also 10 has been
proposed (see Kutner book referenced below) as a cut off value.
Mallowโs CP
Another measure of the amount of multicollinearity is Mallowโs CP.
Assume we have a total of r variables. Suppose we fit a model with only p of the r
variables. Let SSEP be the sums of squares error from the p variable model and
MSE the mean square error from the model with all r variables. Then
๐ถ๐ = ๐๐๐ธ๐ (๐๐๐ธ โ ๐ โ 2๐ )
We want CP to be near p+1 for a good model.
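The VIF definition can be sketched directly from its formula. The synthetic data below are assumed for illustration: one pair of nearly collinear regressors and one independent regressor:

```python
import numpy as np

def vif(X, j):
    """VIF(X_j) = 1 / (1 - R^2_j), where R^2_j is the R^2 from regressing
    column j of X on the remaining regressor columns (plus an intercept)."""
    xj = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(xj)), others])
    coef, *_ = np.linalg.lstsq(A, xj, rcond=None)
    resid = xj - A @ coef
    sst = (xj - xj.mean()) @ (xj - xj.mean())
    return sst / (resid @ resid)      # equals 1 / (1 - R^2_j)

# Synthetic demo: column 2 is nearly a copy of column 1; column 3 is noise.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
X = np.column_stack([x1, x1 + 0.1 * rng.normal(size=100), rng.normal(size=100)])
print(vif(X, 0) > 5, vif(X, 2) < 2)  # high VIF for x1, low for the noise column
```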
6-3 Multiple Regression
Consider the Full Model
H_0: β_1 = ⋯ = β_5 = 0   vs.   H_1: not all zero
98.01% of the variability in the Y's is explained by the relation to the X's. The adjusted R² is 0.9746, which is very close to the R² value. This indicates no serious problems with the number of independent variables. Possible multicollinearity between units, area and size since they have large correlations. Age and parking have low correlations with price so may not be needed.
6-3 Multiple Regression
Example
OPTIONS NOOVP NODATE NONUMBER LS=100;
DATA appraise;
INPUT price units age size parking area cond$ @@;
CARDS;
90300 4 82 4635 0 4266 F 384000 20 13 17798 0 14391 G
157500 5 66 5913 0 6615 G 676200 26 64 7750 6 34144 E
165000 5 55 5150 0 6120 G 300000 10 65 12506 0 14552 G
108750 4 82 7160 0 3040 G 276538 11 23 5120 0 7881 G
420000 20 18 11745 20 12600 G 950000 62 71 21000 3 39448 G
560000 26 74 11221 0 30000 G 268000 13 56 7818 13 8088 F
290000 9 76 4900 0 11315 E 173200 6 21 5424 6 4461 G
323650 11 24 11834 8 9000 G 162500 5 19 5246 5 3828 G
353500 20 62 11223 2 13680 F 134400 4 70 5834 0 4680 E
187000 8 19 9075 0 7392 G 93600 4 82 6864 0 3840 F
110000 4 50 4510 0 3092 G 573200 14 10 11192 0 23704 E
79300 4 82 7425 0 3876 F 272000 5 82 7500 0 9542 E
PROC CORR DATA=APPRAISE;
VAR PRICE UNITS AGE SIZE PARKING AREA;
TITLE 'CORRELATIONS OF VARIABLES IN MODEL';
PROC REG DATA=APPRAISE;
MODEL PRICE=UNITS AGE SIZE PARKING AREA/R VIF;
TITLE 'ALL VARIABLES IN MODEL';
PROC REG DATA=APPRAISE;
MODEL PRICE=UNITS AGE AREA/R INFLUENCE;
TITLE 'REDUCED MODEL';
RUN; QUIT;
6-3 Multiple Regression
CORRELATIONS OF VARIABLES IN MODEL
The CORR Procedure

6 Variables: price units age size parking area

Simple Statistics

Variable     N        Mean     Std Dev         Sum    Minimum    Maximum
price       24      296193      214164     7108638      79300     950000
units       24    12.50000    12.73475   300.00000    4.00000   62.00000
age         24    52.75000    26.43655        1266   10.00000   82.00000
size        24        8702        4221      208843       4510      21000
parking     24     2.62500     5.01140    63.00000          0   20.00000
area        24       11648       10170      279555       3040      39448

Pearson Correlation Coefficients, N = 24
Prob > |r| under H0: Rho=0

             price      units        age       size    parking       area
price      1.00000    0.92207   -0.11118    0.73582    0.21385    0.96784
                       <.0001     0.6050     <.0001     0.3157     <.0001
units      0.92207    1.00000   -0.00982    0.79583    0.21290    0.87622
            <.0001                 0.9637     <.0001     0.3179     <.0001
age       -0.11118   -0.00982    1.00000   -0.18563   -0.36141    0.03090
            0.6050     0.9637                0.3852     0.0827     0.8860
size       0.73582    0.79583   -0.18563    1.00000    0.15151    0.66741
            <.0001     <.0001     0.3852                0.4797     0.0004
parking    0.21385    0.21290   -0.36141    0.15151    1.00000    0.07830
            0.3157     0.3179     0.0827     0.4797                0.7161
area       0.96784    0.87622    0.03090    0.66741    0.07830    1.00000
            <.0001     <.0001     0.8860     0.0004     0.7161
6-3 Multiple Regression
ALL VARIABLES IN MODEL
The REG Procedure
Model: MODEL1
Dependent Variable: price
Number of Observations Read    24
Number of Observations Used    24

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               5       1.033962E12    2.067924E11     177.60    <.0001
Error              18       20959224743     1164401375
Corrected Total    23       1.054921E12

Root MSE            34123    R-Square    0.9801
Dependent Mean     296193    Adj R-Sq    0.9746
Coeff Var        11.52063

Parameter Estimates

Variable     DF    Parameter    Standard    t Value    Pr > |t|    Variance
                    Estimate       Error                           Inflation
Intercept     1        93629       29874       3.13      0.0057           0
units         1   4156.17223  1532.28739       2.71      0.0143     7.52119
age           1   -856.06670   306.65871      -2.79      0.0121     1.29821
size          1      0.88901     2.96966       0.30      0.7681     3.10362
parking       1   2675.62291  1626.23661       1.65      0.1173     1.31193
area          1     15.53982     1.50259      10.34      <.0001     4.61289
6-3 Multiple Regression
ALL VARIABLES IN MODEL
The REG Procedure
Model: MODEL1
Dependent Variable: price
Output Statistics
      Dependent  Predicted    Std Error     Residual   Std Error   Student   Cook's
Obs    Variable      Value  Mean Predict                Residual  Residual        D
  1       90300     110470      12281        -20170       31837    -0.634    0.010
  2      384000     405080      23185        -21080       25037    -0.842    0.101
  3      157500     165962       9178         -8462       32866    -0.257    0.001
  4      676200     700437      25152        -24237       23061    -1.051    0.219
  5      165000     167009      10095         -2009       32596   -0.0616    0.000
  6      300000     316800      17858        -16800       29077    -0.578    0.021
  7      108750      93663      13018         15087       31543     0.478    0.006
  8      276538     246679      19376         29859       28088     1.063    0.090
  9      420000     421099      25938         -1099       22173   -0.0496    0.001
 10      950000     930242      31527         19758       13057     1.513    2.225
 11      560000     614511      17207        -54511       29467    -1.850    0.194
 12      268000     267139      18075      860.6816       28943    0.0297    0.000
 13      290000     246163      11851         43837       31999     1.370    0.043
 14      173200     190788      14200        -17588       31028    -0.567    0.011
 15      323650     290586      14788         33064       30752     1.075    0.045
 16      162500     175673      14612        -13173       30836    -0.427    0.007
 17      353500     351590      10164          1910       32575    0.0586    0.000
 18      134400     128242       9951          6158       32640     0.189    0.001
 19      187000     233552      13949        -46552       31142    -1.495    0.075
 20       93600     105832      12433        -12232       31778    -0.385    0.004
 21      110000     119509      12404         -9509       31789    -0.299    0.002
 22      573200     521561      22525         51639       25632     2.015    0.522
 23       79300     106890      12957        -27590       31567    -0.874    0.021
 24      272000     199161      13080         72839       31517     2.311    0.153

Sum of Residuals                          0
Sum of Squared Residuals        20959224743
Predicted Residual SS (PRESS)   56380131094
6-3 Multiple Regression
We have some evidence of multicollinearity, thus we must consider dropping some of the variables. Let's look at the individual tests of
H_0: β_i = 0   vs.   H_1: β_i ≠ 0,   i = 1, 2, ⋯, 5
These tests are summarized in the SAS output of PROC REG. Size is very non-significant (p-value = 0.7681) and parking is also not significant (p-value = 0.1173). There is evidence from the correlations that size is related to both units and area, so removing this variable might remove much of the multicollinearity. Parking just doesn't seem to explain much variability in price.
Let's look at a 95% confidence interval for β_4:
β̂_4 ± t_{0.025;18} · SE(β̂_4)
2675.6 ± (2.101)(1626.24)
(−741.1, 6092.3)
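The arithmetic can be checked directly; 2.101 is the tabled t value the text uses, and with the unrounded SAS estimate the upper limit rounds to 6092.3:

```python
# 95% CI for the parking slope beta_4, from the full-model SAS output
beta4, se4 = 2675.62291, 1626.23661   # Parameter Estimate, Standard Error
t = 2.101                             # t_{0.025; 18}
lower, upper = beta4 - t * se4, beta4 + t * se4
print(round(lower, 1), round(upper, 1))  # the interval contains zero
```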
6-3 Multiple Regression
A Test for the Significance of a Group
of Regressors (Partial F-Test)
Suppose that the full model has k regressors, and we are interested in testing
whether the last k-r of them can be deleted from the model. This smaller
model is called the reduced model. That is, the full model is
Y = β_0 + β_1 x_1 + β_2 x_2 + ⋯ + β_r x_r + β_{r+1} x_{r+1} + ⋯ + β_k x_k + ε
and the reduced model has β_{r+1} = β_{r+2} = ⋯ = β_k = 0, so the reduced model is
Y = β_0 + β_1 x_1 + β_2 x_2 + ⋯ + β_r x_r + ε
Then, to test the hypotheses
6-3 Multiple Regression
A Test for the Significance of a Group
of Regressors (Partial F-Test)
F = [ (SSE_R − SSE_F) / (k − r) ] / [ SSE_F / (n − k − 1) ]
where:
SSE_R = SSE for Reduced Model
SSE_F = SSE for Full Model
k − r = number of β's in H_0
For given α, we reject H_0 if:
Partial F > tabled F
with dof = k − r (numerator), n − k − 1 (denominator)
6-3 Multiple Regression
Testing H_0: β_3 = β_4 = 0
The Full model is
Y = β_0 + β_1 Units + β_2 Age + β_3 Size + β_4 Parking + β_5 Area + ε
The Reduced model is
Y = β_0 + β_1 Units + β_2 Age + β_5 Area + ε
From the SAS output we have
SSE_F = 20,959,224,743,  dfE_F = 18,  MSE_F = 1,164,401,375
SSE_R = 24,111,264,632,  dfE_R = 20
F = [ (24,111,264,632 − 20,959,224,743) / 2 ] / 1,164,401,375 = 1,576,019,945 / 1,164,401,375 = 1.35
F_{0.05;2,18} = 3.55
No evidence to reject the null hypothesis.
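The partial F statistic above can be recomputed from the two error sums of squares reported by SAS:

```python
# Partial F-test for H0: beta3 = beta4 = 0 (size, parking), from the SAS SSEs
sse_full, df_full = 20959224743, 18    # full model error SS and error df
sse_red, df_red = 24111264632, 20      # reduced model error SS and error df
mse_full = sse_full / df_full          # full-model mean square error

F = ((sse_red - sse_full) / (df_red - df_full)) / mse_full
print(round(F, 2))  # well below F_{0.05; 2, 18} = 3.55, so do not reject H0
```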
6-3 Multiple Regression
Interpreting the ๐ท๐ โs
For the apartment appraisal problem we have ȳ = 296,193.3. The β̂'s in the reduced model are
β̂_0 = 114,857.4
β̂_1 = 5,012.6 (units)
β̂_2 = −1,054.8 (age)
β̂_3 = 14.96 (area)
If one extra unit is added (all other factors held constant) the value of the complex will increase by $5,012.6. If the complex ages one more year it will lose $1,054.8 in value (all other factors held constant). If the area is increased by one square foot the value of the complex will increase by $14.96 (all other factors held constant).
Notice the potential for multicollinearity. If one more unit is added, the number of square feet would also increase. Thus the interdependency of some of the variables makes the β̂'s harder to interpret.
6-3 Multiple Regression
Notes on the Reduced Model
The Root MSE has increased in the reduced model (34,721) vs. the full model (34,123), but the standard errors of the individual β̂'s have all decreased. This is another indication that there was multicollinearity in the full model. We will be able to do more accurate inference in this reduced model.
The R² and adjusted R² have decreased by only a small amount. This also justifies dropping the two variables.
All the individual β̂'s are significantly different from zero (all p-values small). This indicates that we probably cannot remove further variables without losing some information about the Y's.
6-3 Multiple Regression
Examining the Final Model
Some final checks on the model are:
1) Residuals
2) Studentized (standardized) residuals
   The studentized residuals should be between −2 and 2 around 95% of the time. If an excessive number are greater than 2 in absolute value, or if any one studentized residual is much greater than 2, you should investigate closer.
3) Hat diagonals are the main diagonal elements of the matrix
   H = X(X′X)⁻¹X′
   We have already seen that (X′X)⁻¹ is important. The diagonal elements as well as the eigenvalues of this matrix contain much information. Each diagonal corresponds to a particular observation. Look for diagonal values that are large relative to the average p/n (a common rule of thumb flags h_ii > 2p/n).
6-3 Multiple Regression
One More Diagnostic
4) DFBETAS and DFFITS
These diagnostics investigate the influence of each observation on the values of the parameters. The parameters are first fit with all observations; call the estimates β̂_i. Next the parameters are estimated using all but the jth observation; call these estimates β̂_i[j]. The DFBETAS for the ith parameter and jth observation is calculated as
DFBETAS_{ij} = ( β̂_i − β̂_i[j] ) / SE(β̂_i)
You look for values that are much larger than the other values. This indicates that the observation is too influential in determining the value of the parameters. A combined measure, DFFITS, can also be calculated which looks at all the parameters at once.
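The leave-one-out idea can be sketched directly. Note this is a simplified sketch on synthetic data: it scales by the full-model standard error, as in the formula above, whereas SAS's DFBETAS uses the deleted-case error estimate in the denominator:

```python
import numpy as np

def dfbetas(X, y):
    """Per-parameter deletion diagnostics: change in each beta-hat when
    observation j is removed, scaled by the coefficient's full-model
    standard error (a direct leave-one-out sketch, not SAS's exact scaling)."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    mse = resid @ resid / (n - p)
    se = np.sqrt(mse * np.diag(np.linalg.inv(X.T @ X)))
    out = np.empty((n, p))
    for j in range(n):
        Xj, yj = np.delete(X, j, axis=0), np.delete(y, j)
        beta_j, *_ = np.linalg.lstsq(Xj, yj, rcond=None)
        out[j] = (beta - beta_j) / se   # one row of diagnostics per observation
    return out

# Synthetic demo data (assumed for illustration)
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(20), rng.normal(size=20)])
y = X @ np.array([1.0, 2.0]) + 0.5 * rng.normal(size=20)
D = dfbetas(X, y)
```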
6-3 Multiple Regression
REDUCED MODEL (NO SIZE AND PARKING)
The REG Procedure
Model: MODEL1
Dependent Variable: price
Number of Observations Read    24
Number of Observations Used    24

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               3        1.03081E12    3.436033E11     285.01    <.0001
Error              20       24111264632     1205563232
Corrected Total    23       1.054921E12

Root MSE            34721    R-Square    0.9771
Dependent Mean     296193    Adj R-Sq    0.9737
Coeff Var        11.72249

Parameter Estimates

Variable     DF    Parameter     Standard    t Value    Pr > |t|
                    Estimate        Error
Intercept     1       114857        17919       6.41      <.0001
units         1   5012.58292   1183.19286       4.24      0.0004
age           1  -1054.84586    274.79652      -3.84      0.0010
area          1     14.96564      1.48218      10.10      <.0001
6-3 Multiple Regression
REDUCED MODEL
The REG Procedure
Model: MODEL1
Dependent Variable: price
Output Statistics
      Dependent  Predicted    Std Error     Residual   Std Error   Student   Cook's
Obs    Variable      Value  Mean Predict                Residual  Residual        D
  1       90300     112254      12030        -21954       32570    -0.674    0.015
  2      384000     416767      13928        -32767       31805    -1.030    0.051
  3      157500     169298       9015        -11798       33530    -0.352    0.002
  4      676200     688661      21982        -12461       26877    -0.464    0.036
  5      165000     173494       8304         -8494       33714    -0.252    0.001
  6      300000     314198      10357        -14198       33141    -0.428    0.004
  7      108750      93906      12575         14844       32364     0.459    0.008
  8      276538     263679      11347         12859       32815     0.392    0.005
  9      420000     384689      13763         35311       31877     1.108    0.057
 10      950000     941108      31332          8892       14961     0.594    0.387
 11      560000     616095      17479        -56095       30001    -1.870    0.297
 12      268000     241992       9249         26008       33467     0.777    0.012
 13      290000     249139      10066         40861       33230     1.230    0.035
 14      173200     189543      12261        -16343       32484    -0.503    0.009
 15      323650     279370      10773         44280       33008     1.341    0.048
 16      162500     177167      12803        -14667       32275    -0.454    0.008
 17      353500     354439       9992     -938.6096       33252   -0.0282    0.000
 18      134400     131108       9953          3292       33264    0.0990    0.000
 19      187000     245542      11977        -58542       32590    -1.796    0.109
 20       93600     105878      12192        -12278       32510    -0.378    0.005
 21      110000     128439       9416        -18439       33420    -0.552    0.006
 22      573200     529231      22052         43969       26819     1.639    0.454
 23       79300     106417      12177        -27117       32516    -0.834    0.024
 24      272000     196225      12163         75775       32521     2.330    0.190
6-3 Multiple Regression
REDUCED MODEL
The REG Procedure
Output Statistics
                  Hat Diag       Cov              ----------------DFBETAS----------------
Obs   RStudent          H      Ratio    DFFITS    Intercept     units       age      area
  1    -0.6646     0.1201     1.2727   -0.2455      0.0255   -0.0031   -0.1666    0.0567
  2    -1.0319     0.1609     1.1764   -0.4519     -0.3370   -0.1451    0.3432    0.0915
  3    -0.3440     0.0674     1.2842   -0.0925     -0.0163    0.0211   -0.0367   -0.0002
  4    -0.4544     0.4008     1.9623   -0.3716      0.1032    0.2203   -0.0267   -0.3226
  5    -0.2459     0.0572     1.2858   -0.0606     -0.0300    0.0120   -0.0045    0.0034
  6    -0.4195     0.0890     1.2989   -0.1311      0.0062    0.0820   -0.0353   -0.0838
  7     0.4494     0.1312     1.3546    0.1746     -0.0130    0.0243    0.1154   -0.0638
  8     0.3834     0.1068     1.3328    0.1326      0.1207    0.0292   -0.0918   -0.0392
  9     1.1144     0.1571     1.1307    0.4812      0.3411    0.2414   -0.3141   -0.1954
 10     0.5845     0.8143     6.1574    1.2240     -0.4218    0.8913    0.2392   -0.4129
 11    -2.0062     0.2534     0.7625   -1.1689      0.4828    0.4972   -0.3232   -0.8502
 12     0.7692     0.0710     1.1690    0.2126      0.0686    0.1215    0.0315   -0.1349
 13     1.2466     0.0840     0.9787    0.3776     -0.0751   -0.1207    0.2293    0.0981
 14    -0.4935     0.1247     1.3330   -0.1863     -0.1807   -0.0149    0.1282    0.0485
 15     1.3706     0.0963     0.9317    0.4473      0.4080    0.0441   -0.3203   -0.0715
 16    -0.4452     0.1360     1.3632   -0.1766     -0.1731   -0.0080    0.1242    0.0421
 17    -0.0275     0.0828     1.3384   -0.0083      0.0001   -0.0053   -0.0025    0.0041
 18     0.0965     0.0822     1.3350    0.0289      0.0037   -0.0018    0.0140   -0.0055
 19    -1.9118     0.1190     0.6894   -0.7026     -0.6733    0.0295    0.5377    0.0515
 20    -0.3694     0.1233     1.3609   -0.1385      0.0130   -0.0080   -0.0934    0.0388
 21    -0.5419     0.0735     1.2463   -0.1527     -0.0977   -0.0163    0.0079    0.0616
 22     1.7175     0.4034     1.1553    1.4122      0.5731   -0.9475   -0.8374    1.1063
 23    -0.8274     0.1230     1.2151   -0.3098      0.0292   -0.0168   -0.2089    0.0855
 24     2.6607     0.1227     0.3943    0.9951     -0.2172   -0.4517    0.6229    0.3274

Sum of Residuals                          0
Sum of Squared Residuals        24111264632
Predicted Residual SS (PRESS)   37937505741