Transcript Document

6-3 Multiple Regression
6-3.1 Estimation of Parameters in Multiple
Regression
๐‘ฆ๐‘– = ๐›ฝ0 + ๐›ฝ1 ๐‘ฅ๐‘–1 + ๐›ฝ2 ๐‘ฅ๐‘–2 +โˆ™โˆ™โˆ™ +๐›ฝ๐‘˜ ๐‘ฅ๐‘–๐‘˜ + ๐œ–๐‘–
= ๐›ฝ0 +
๐‘˜
๐‘—=1 ๐›ฝ๐‘— ๐‘ฅ๐‘–๐‘—
+ ๐œ–๐‘–, ๐‘– = 1, 2, โ‹ฏ , ๐‘›
• The least squares function is given by
  L = Σ_{i=1}^{n} ε_i² = Σ_{i=1}^{n} (y_i − β0 − Σ_{j=1}^{k} β_j x_ij)²
• The least squares estimates must satisfy
  ∂L/∂β_j = 0,   j = 0, 1, ···, k
• The least squares normal equations are, in matrix form, (X′X)β̂ = X′y.
• The solution to the normal equations is the least squares estimators of the regression coefficients: β̂ = (X′X)⁻¹X′y.
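As a sketch of how the least-squares estimators are computed, the normal equations (X′X)b = X′y can be solved directly with a linear-algebra library. The data below is simulated; the coefficients loosely echo the wire bond example, and every name here is illustrative:

```python
import numpy as np

# Simulated data loosely echoing the wire bond example (illustrative only).
rng = np.random.default_rng(0)
n = 25
x1 = rng.uniform(0, 20, n)                    # e.g. wire length
x2 = rng.uniform(0, 600, n)                   # e.g. die height
y = 2.26 + 2.74 * x1 + 0.0125 * x2 + rng.normal(0, 2.3, n)

X = np.column_stack([np.ones(n), x1, x2])     # design matrix with intercept column
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # solve the normal equations (X'X)b = X'y

# A dedicated least squares routine gives the same estimates.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

In practice the dedicated routine (QR/SVD based) is numerically preferable to forming X′X explicitly, but both return the same β̂ here.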
X′X in Multiple Regression

         | n          Σ X_1i        ···   Σ X_pi      |
         | Σ X_1i     Σ X_1i²       ···   Σ X_1i X_pi |
X′X =    |   ⋮           ⋮          ⋱        ⋮        |
         | Σ X_pi     Σ X_1i X_pi   ···   Σ X_pi²     |

         | Σ Y_i      |
X′Y =    | Σ X_1i Y_i |
         |   ⋮        |
         | Σ X_pi Y_i |

with all sums over i = 1, ···, n, and Y′Y = Σ Y_i².

The matrix (X′X)⁻¹ σ_ε² is the covariance matrix of the estimated coefficients:

               | VAR(β̂0)        COV(β̂0, β̂1)   ···   COV(β̂0, β̂p) |
(X′X)⁻¹ σ_ε² = | COV(β̂0, β̂1)   VAR(β̂1)        ···   COV(β̂1, β̂p) |
               |   ⋮               ⋮            ⋱        ⋮         |
               | COV(β̂0, β̂p)   COV(β̂1, β̂p)   ···   VAR(β̂p)      |
Adjusted R2
We can adjust the R2 to take into account the number of regressors
in the model:
๐‘…2 = ๐ด๐ท๐ฝ ๐‘…๐‘†๐‘„ = 1 โˆ’ (1 โˆ’ ๐‘…2 )
๐‘›โˆ’1
๐‘› โˆ’ (๐‘˜ + 1)
(i) The ADJ RSQ does not always increase, like R2, as k increases. ADJ
RSQ is especially preferred to R2 if k/n is a large fraction (greater
than 10%). If k/n is small, then both measures are almost identical.
(ii) Always:
ADJ RSQโ‰ค ๐‘…2 โ‰ค 1
(iii) R2
= 1โˆ’ SSE/SS(TOTAL)
ADJ RSQ
= 1 โ€“ MSE/MS(TOTAL)
where MS(TOTAL)=SS(TOTAL)/(nโˆ’1) = sample variance of y.
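The adjustment formula above can be checked directly against the SAS output for the wire bond fit (R-Square 0.9811 with n = 25 and k = 2):

```python
# Adjusted R^2 as defined above: n observations, k regressors.
def adj_r_squared(r2, n, k):
    return 1.0 - (1.0 - r2) * (n - 1) / (n - (k + 1))

# Wire bond example from the SAS output: R-Square 0.9811, n = 25, k = 2.
print(round(adj_r_squared(0.9811, 25, 2), 4))  # → 0.9794, matching Adj R-Sq
```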
6-3.2 Inferences in Multiple Regression
Test for Significance of Regression
H0: β1 = β2 = ··· = βk = 0  vs.  H1: βj ≠ 0 for at least one j
The test statistic is F0 = MS_Model/MS_Error from the analysis-of-variance table; reject H0 if F0 > f_{α; k, n−(k+1)}.
Inference on Individual Regression Coefficients
H0: βj = 0  vs.  H1: βj ≠ 0
• This is called a partial or marginal test.
Confidence Intervals on the Mean Response
and Prediction Intervals
๐œ‡๐‘Œ|๐‘ฅ10,๐‘ฅ20,โ‹ฏ,๐‘ฅ๐‘˜0 = ๐›ฝ0 +๐›ฝ1 ๐‘ฅ10 + ๐›ฝ2 ๐‘ฅ20 + โ‹ฏ + ๐›ฝ๐‘˜ ๐‘ฅ๐‘˜0
The response at the point of interest is
Y0 = β0 + β1 x10 + β2 x20 + ··· + βk xk0 + ε
and the corresponding predicted value is
Ŷ0 = μ̂_{Y|x10, x20, ···, xk0} = β̂0 + β̂1 x10 + β̂2 x20 + ··· + β̂k xk0
The prediction error is Y0 − Ŷ0, and the standard deviation of this prediction error is
√( σ̂² + se(μ̂_{Y|x10, x20, ···, xk0})² )
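This standard deviation is what widens a prediction interval relative to a confidence interval on the mean. As a check, the numbers SAS reports later for the appended point (length = 11, height = 35) can be reproduced by hand: ŷ0 = 32.8892, se(μ̂) = 1.0620, MSE = 5.23516, and t_{0.025, 22} ≈ 2.074.

```python
import math

# Prediction interval: y0_hat ± t * sqrt(MSE + se(mu_hat)^2),
# using the values from the SAS output for the new point (length=11, height=35).
y0_hat, se_mean, mse, t = 32.8892, 1.0620, 5.23516, 2.074
half_width = t * math.sqrt(mse + se_mean**2)
lo, hi = y0_hat - half_width, y0_hat + half_width
print(round(lo, 2), round(hi, 2))  # close to SAS's 95% CL Predict (27.6579, 38.1206)
```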
6-3.3 Checking Model Adequacy
Residual Analysis
0 < h_ii ≤ 1
Because the h_ii's are always between zero and one, a studentized residual is always larger in magnitude than the corresponding standardized residual. Consequently, studentized residuals are a more sensitive diagnostic when looking for outliers.
Influential Observations
• The disposition of points in the x-space is important in determining the properties of the model: R², the regression coefficients, and the magnitude of the error mean square.
• A large value of D_i (Cook's distance) implies that the ith point is influential.
• A value of D_i > 1 would indicate that the point is influential.
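The leverage, residual, and influence diagnostics discussed above can be sketched from the design matrix alone. This is a minimal illustration of the standard formulas (not the internals of PROC REG); `influence_measures` is a hypothetical helper name:

```python
import numpy as np

# Hat diagonals h_ii, standardized/studentized residuals, and Cook's D_i
# from a design matrix X (with intercept column) and response y.
def influence_measures(X, y):
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T       # hat matrix
    h = np.diag(H)                             # leverages, 0 < h_ii <= 1
    e = y - H @ y                              # ordinary residuals
    mse = e @ e / (n - p)
    standardized = e / np.sqrt(mse)
    studentized = e / np.sqrt(mse * (1 - h))   # always >= |standardized| since h_ii > 0
    cooks_d = studentized**2 * h / ((1 - h) * p)
    return h, standardized, studentized, cooks_d
```

The three returned diagnostics correspond to the Hat Diag H, Student Residual, and Cook's D columns that SAS prints with the R and INFLUENCE options.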
Example 6-7
OPTIONS NOOVP NODATE NONUMBER LS=140;
DATA ex67;
INPUT strength length height @@;
label strength='Pull Strength' length='Wire length' height='Die Height';
CARDS;
9.95 2 50 24.45 8 110 31.75 11 120 35 10 550
25.02 8 295 16.86 4 200 14.38 2 375 9.6 2 52
24.35 9 100 27.5 8 300 17.08 4 412 37 11 400
41.95 12 500 11.66 2 360 21.65 4 205 17.89 4 400
69 20 600 10.3 1 585 34.93 10 540 46.59 15 250
44.88 15 290 54.12 16 510 56.63 17 590 22.13 6 100
21.15 5 400
PROC SGSCATTER data=ex67;
MATRIX STRENGTH LENGTH HEIGHT;
TITLE 'Scatter Plot Matrix for Wire Bond Data';
PROC REG data=ex67;
PLOT npp.*Residual.; /* Normal Probability Plot */
MODEL strength=length height/xpx r CLB CLM CLI;
PLOT RESIDual.*Pred.; /* Residual Plot */
TITLE 'Multiple Regression';
PLOT Residual.*length;
PLOT Residual.*height;
DATA EX67N;
INPUT LENGTH HEIGHT @@;
DATALINES;
11 35 5 20
DATA EX67N1;
SET EX67 EX67N;
PROC REG DATA=EX67N1;
MODEL STRENGTH=LENGTH HEIGHT/CLM CLI;
TITLE 'CIs FOR MEAN RESPONSE AND FUTURE OBSERVATION';
RUN; QUIT;
The REG Procedure
Model: MODEL1

Model Crossproducts X'X X'Y Y'Y

Variable    Label            Intercept      length       height       strength
Intercept   Intercept               25         206          8294         725.82
length      Wire length            206        2396         77177        8008.47
height      Die Height            8294       77177       3531848      274816.71
strength    Pull Strength       725.82     8008.47     274816.71     27178.5316
The REG Procedure
Model: MODEL1
Dependent Variable: strength

Number of Observations Read    25
Number of Observations Used    25

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               2        5990.77122     2995.38561     572.17    <.0001
Error              22         115.17348        5.23516
Corrected Total    24        6105.94470

Root MSE           2.28805    R-Square    0.9811
Dependent Mean    29.03280    Adj R-Sq    0.9794
Coeff Var          7.88090
Parameter Estimates

Variable    Label         DF    Parameter    Standard    t Value    Pr > |t|    95% Confidence Limits
                                 Estimate       Error
Intercept   Intercept      1      2.26379     1.06007       2.14      0.0441     0.06535    4.46223
length      Wire length    1      2.74427     0.09352      29.34      <.0001     2.55031    2.93823
height      Die Height     1      0.01253     0.00280       4.48      0.0002     0.00672    0.01833
Multiple Regression

The REG Procedure
Model: MODEL1
Dependent Variable: strength

Output Statistics

Obs  Dep Var  Predicted  Std Err       95% CL Mean          95% CL Predict     Residual  Std Err  Student  Cook's
     strength   Value   Mean Pred                                                         Resid    Resid      D
  1   9.9500    8.3787    0.9074    6.4968  10.2606    3.2740  13.4834    1.5713   2.100   0.748   0.035
  2  24.4500   25.5960    0.7645   24.0105  27.1815   20.5930  30.5990   -1.1460   2.157  -0.531   0.012
  3  31.7500   33.9541    0.8620   32.1665  35.7417   28.8834  39.0248   -2.2041   2.119  -1.040   0.060
  4  35.0000   36.5968    0.7303   35.0821  38.1114   31.6158  41.5778   -1.5968   2.168  -0.736   0.021
  5  25.0200   27.9137    0.4677   26.9437  28.8836   23.0704  32.7569   -2.8937   2.240  -1.292   0.024
  6  16.8600   15.7464    0.6261   14.4481  17.0448   10.8269  20.6660    1.1136   2.201   0.506   0.007
  7  14.3800   12.4503    0.7862   10.8198  14.0807    7.4328  17.4677    1.9297   2.149   0.898   0.036
  8   9.6000    8.4038    0.9039    6.5291  10.2784    3.3018  13.5058    1.1962   2.102   0.569   0.020
  9  24.3500   28.2150    0.8185   26.5175  29.9125   23.1754  33.2546   -3.8650   2.137  -1.809   0.160
 10  27.5000   27.9763    0.4651   27.0118  28.9408   23.1341  32.8184   -0.4763   2.240  -0.213   0.001
 11  17.0800   18.4023    0.6960   16.9588  19.8458   13.4425  23.3621   -1.3223   2.180  -0.607   0.013
 12  37.0000   37.4619    0.5246   36.3739  38.5498   32.5936  42.3301   -0.4619   2.227  -0.207   0.001
 13  41.9500   41.4589    0.6553   40.0999  42.8179   36.5230  46.3948    0.4911   2.192   0.224   0.001
 14  11.6600   12.2623    0.7689   10.6678  13.8568    7.2565  17.2682   -0.6023   2.155  -0.280   0.003
 15  21.6500   15.8091    0.6213   14.5206  17.0976   10.8921  20.7260    5.8409   2.202   2.652   0.187
 16  17.8900   18.2520    0.6785   16.8448  19.6592   13.3026  23.2014   -0.3620   2.185  -0.166   0.001
 17  69.0000   64.6659    1.1652   62.2494  67.0824   59.3409  69.9909    4.3341   1.969   2.201   0.565
 18  10.3000   12.3368    1.2383    9.7689  14.9048    6.9414  17.7323   -2.0368   1.924  -1.059   0.155
 19  34.9300   36.4715    0.7096   34.9999  37.9431   31.5034  41.4396   -1.5415   2.175  -0.709   0.018
 20  46.5900   46.5598    0.8780   44.7389  48.3807   41.4773  51.6423    0.0302   2.113   0.0143  0.000
 21  44.8800   47.0609    0.8238   45.3524  48.7694   42.0176  52.1042   -2.1809   2.135  -1.022   0.052
 22  54.1200   52.5613    0.8432   50.8127  54.3099   47.5042  57.6183    1.5587   2.127   0.733   0.028
 23  56.6300   56.3078    0.9771   54.2814  58.3342   51.1481  61.4675    0.3222   2.069   0.156   0.002
 24  22.1300   19.9822    0.7557   18.4149  21.5494   14.9850  24.9794    2.1478   2.160   0.995   0.040
 25  21.1500   20.9963    0.6176   19.7153  22.2772   16.0813  25.9112    0.1537   2.203   0.0698  0.000

Sum of Residuals                        0
Sum of Squared Residuals        115.17348
Predicted Residual SS (PRESS)   156.16295
CIs FOR MEAN RESPONSE AND FUTURE OBSERVATION

The REG Procedure
Model: MODEL1
Dependent Variable: strength (Pull Strength)

Number of Observations Read                   27
Number of Observations Used                   25
Number of Observations with Missing Values     2

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               2        5990.77122     2995.38561     572.17    <.0001
Error              22         115.17348        5.23516
Corrected Total    24        6105.94470

Root MSE           2.28805    R-Square    0.9811
Dependent Mean    29.03280    Adj R-Sq    0.9794
Coeff Var          7.88090

Parameter Estimates

Variable    Label         DF    Parameter    Standard    t Value    Pr > |t|
                                 Estimate       Error
Intercept   Intercept      1      2.26379     1.06007       2.14      0.0441
length      Wire length    1      2.74427     0.09352      29.34      <.0001
height      Die Height     1      0.01253     0.00280       4.48      0.0002
CIs FOR MEAN RESPONSE AND FUTURE OBSERVATION

The REG Procedure
Model: MODEL1
Dependent Variable: strength (Pull Strength)

Output Statistics

Observations 1-25 are identical to the previous fit. The two appended points, which have no observed strength, give:

Obs    Dep Var    Predicted    Std Err         95% CL Mean           95% CL Predict      Residual
                    Value      Mean Pred
 26       .        32.8892      1.0620     30.6867    35.0918    27.6579    38.1206         .
 27       .        16.2357      0.9286     14.3099    18.1615    11.1147    21.3567         .

Sum of Residuals                        0
Sum of Squared Residuals        115.17348
Predicted Residual SS (PRESS)   156.16295
6-3.3 Checking Model Adequacy
Multicollinearity
Multicollinearity is a catch-all phrase referring to problems caused by the independent variables being correlated with each other. This can cause a number of problems:
1) Individual F-tests can be non-significant for important variables. The sign of a β̂_j can be flipped. Recall that the partial slopes measure the change in Y for a unit change in X_j holding the other X's constant. If two X's are highly correlated, this interpretation doesn't do much good.
2) The MSE can be inflated. Also the SE's of the partial slopes are inflated.
3) R² < r²_{YX1} + r²_{YX2} + ··· + r²_{YXp}
4) Removing one X from the model may make another more significant or less significant.
Variance Inflation Factor
The quantity 1/(1 − R²_{Xj·X1,···,Xj−1,Xj+1,···,Xp}), called the variance inflation factor, is denoted VIF(Xj). The larger the value of VIF(Xj), the more the multicollinearity and the larger the standard error of β̂_j due to having Xj in the model. A common rule of thumb is that if VIF(Xj) > 5 then multicollinearity is high. A cut-off value of 10 has also been proposed (see Kutner's book).
Mallow's CP
Another diagnostic, often used to compare candidate subsets of regressors, is Mallow's CP. Assume we have a total of r variables. Suppose we fit a model with only p of the r variables. Let SSE_P be the sum of squares error from the p-variable model and MSE the mean square error from the model with all r variables. Then
CP = SSE_P/MSE − (n − 2(p + 1))
We want CP to be near p + 1 for a good model.
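Both diagnostics are easy to compute from ordinary least squares fits. The sketch below follows the definitions above; the helper names (`r_squared`, `vif`, `mallows_cp`) are illustrative, not a standard library API:

```python
import numpy as np

# VIF(Xj) = 1 / (1 - R^2 from regressing Xj on the remaining X's);
# CP = SSE_p / MSE_full - (n - 2(p + 1)) for a submodel with p variables.
def r_squared(X, y):
    X1 = np.column_stack([np.ones(len(y)), X])   # add intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1.0 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

def vif(X, j):
    others = np.delete(X, j, axis=1)             # all columns except Xj
    return 1.0 / (1.0 - r_squared(others, X[:, j]))

def mallows_cp(sse_p, mse_full, n, p):
    return sse_p / mse_full - (n - 2 * (p + 1))
```

For the appraisal data's reduced model discussed later (SSE_P = 24,111,264,632, full-model MSE = 1,164,401,375, n = 24, p = 3), `mallows_cp` gives about 4.7, close to p + 1 = 4.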
Consider the Full Model
H0: β1 = ··· = β5 = 0  vs.  H1: not all zero
98.01% of the variability in the Y's is explained by the relation to the X's. The adjusted R² is 0.9746, which is very close to the R² value. This indicates no serious problems with the number of independent variables. There is possible multicollinearity between units, area, and size, since they have large correlations. Age and parking have low correlations with price, so they may not be needed.
Example
OPTIONS NOOVP NODATE NONUMBER LS=100;
DATA appraise;
INPUT price units age size parking area cond$ @@;
CARDS;
90300 4 82 4635 0 4266 F 384000 20 13 17798 0 14391 G
157500 5 66 5913 0 6615 G 676200 26 64 7750 6 34144 E
165000 5 55 5150 0 6120 G 300000 10 65 12506 0 14552 G
108750 4 82 7160 0 3040 G 276538 11 23 5120 0 7881 G
420000 20 18 11745 20 12600 G 950000 62 71 21000 3 39448 G
560000 26 74 11221 0 30000 G 268000 13 56 7818 13 8088 F
290000 9 76 4900 0 11315 E 173200 6 21 5424 6 4461 G
323650 11 24 11834 8 9000 G 162500 5 19 5246 5 3828 G
353500 20 62 11223 2 13680 F 134400 4 70 5834 0 4680 E
187000 8 19 9075 0 7392 G 93600 4 82 6864 0 3840 F
110000 4 50 4510 0 3092 G 573200 14 10 11192 0 23704 E
79300 4 82 7425 0 3876 F 272000 5 82 7500 0 9542 E
PROC CORR DATA=APPRAISE;
VAR PRICE UNITS AGE SIZE PARKING AREA;
TITLE 'CORRELATIONS OF VARIABLES IN MODEL';
PROC REG DATA=APPRAISE;
MODEL PRICE=UNITS AGE SIZE PARKING AREA/R VIF;
TITLE 'ALL VARIABLES IN MODEL';
PROC REG DATA=APPRAISE;
MODEL PRICE=UNITS AGE AREA/R INFLUENCE;
TITLE 'REDUCED MODEL';
RUN; QUIT;
CORRELATIONS OF VARIABLES IN MODEL
The CORR Procedure

6 Variables: price units age size parking area

Simple Statistics

Variable      N          Mean       Std Dev           Sum       Minimum       Maximum
price        24        296193        214164       7108638         79300        950000
units        24      12.50000      12.73475     300.00000       4.00000      62.00000
age          24      52.75000      26.43655          1266      10.00000      82.00000
size         24          8702          4221        208843          4510         21000
parking      24       2.62500       5.01140      63.00000             0      20.00000
area         24         11648         10170        279555          3040         39448

Pearson Correlation Coefficients, N = 24
Prob > |r| under H0: Rho=0 (p-values in parentheses)

            price       units       age         size        parking     area
price      1.00000     0.92207    -0.11118     0.73582     0.21385     0.96784
                       (<.0001)   (0.6050)    (<.0001)    (0.3157)    (<.0001)
units      0.92207     1.00000    -0.00982     0.79583     0.21290     0.87622
           (<.0001)               (0.9637)    (<.0001)    (0.3179)    (<.0001)
age       -0.11118    -0.00982     1.00000    -0.18563    -0.36141     0.03090
           (0.6050)    (0.9637)               (0.3852)    (0.0827)    (0.8860)
size       0.73582     0.79583    -0.18563     1.00000     0.15151     0.66741
           (<.0001)    (<.0001)   (0.3852)                (0.4797)    (0.0004)
parking    0.21385     0.21290    -0.36141     0.15151     1.00000     0.07830
           (0.3157)    (0.3179)   (0.0827)    (0.4797)                (0.7161)
area       0.96784     0.87622     0.03090     0.66741     0.07830     1.00000
           (<.0001)    (<.0001)   (0.8860)    (0.0004)    (0.7161)
ALL VARIABLES IN MODEL
The REG Procedure
Model: MODEL1
Dependent Variable: price

Number of Observations Read    24
Number of Observations Used    24

Analysis of Variance

Source             DF    Sum of Squares     Mean Square    F Value    Pr > F
Model               5       1.033962E12     2.067924E11     177.60    <.0001
Error              18       20959224743      1164401375
Corrected Total    23       1.054921E12

Root MSE             34123    R-Square    0.9801
Dependent Mean      296193    Adj R-Sq    0.9746
Coeff Var         11.52063
Parameter Estimates

Variable     DF    Parameter      Standard     t Value    Pr > |t|    Variance
                    Estimate         Error                            Inflation
Intercept     1        93629         29874        3.13      0.0057           0
units         1   4156.17223    1532.28739        2.71      0.0143     7.52119
age           1   -856.06670     306.65871       -2.79      0.0121     1.29821
size          1      0.88901       2.96966        0.30      0.7681     3.10362
parking       1   2675.62291    1626.23661        1.65      0.1173     1.31193
area          1     15.53982       1.50259       10.34      <.0001     4.61289
ALL VARIABLES IN MODEL

The REG Procedure
Model: MODEL1
Dependent Variable: price

Output Statistics

Obs    Dep Var    Predicted    Std Err        Residual    Std Err    Student    Cook's
                    Value      Mean Pred                   Resid      Resid        D
  1      90300       110470      12281         -20170      31837     -0.634     0.010
  2     384000       405080      23185         -21080      25037     -0.842     0.101
  3     157500       165962       9178          -8462      32866     -0.257     0.001
  4     676200       700437      25152         -24237      23061     -1.051     0.219
  5     165000       167009      10095          -2009      32596     -0.0616    0.000
  6     300000       316800      17858         -16800      29077     -0.578     0.021
  7     108750        93663      13018          15087      31543      0.478     0.006
  8     276538       246679      19376          29859      28088      1.063     0.090
  9     420000       421099      25938          -1099      22173     -0.0496    0.001
 10     950000       930242      31527          19758      13057      1.513     2.225
 11     560000       614511      17207         -54511      29467     -1.850     0.194
 12     268000       267139      18075       860.6816      28943      0.0297    0.000
 13     290000       246163      11851          43837      31999      1.370     0.043
 14     173200       190788      14200         -17588      31028     -0.567     0.011
 15     323650       290586      14788          33064      30752      1.075     0.045
 16     162500       175673      14612         -13173      30836     -0.427     0.007
 17     353500       351590      10164           1910      32575      0.0586    0.000
 18     134400       128242       9951           6158      32640      0.189     0.001
 19     187000       233552      13949         -46552      31142     -1.495     0.075
 20      93600       105832      12433         -12232      31778     -0.385     0.004
 21     110000       119509      12404          -9509      31789     -0.299     0.002
 22     573200       521561      22525          51639      25632      2.015     0.522
 23      79300       106890      12957         -27590      31567     -0.874     0.021
 24     272000       199161      13080          72839      31517      2.311     0.153

Sum of Residuals                          0
Sum of Squared Residuals        20959224743
Predicted Residual SS (PRESS)   56380131094
We have some evidence of multicollinearity, thus we must consider dropping some of the variables. Let's look at the individual tests of
H0: βi = 0 vs. H1: βi ≠ 0,  i = 1, 2, ···, 5
These tests are summarized in the SAS output of PROC REG. Size is very non-significant (p-value = 0.7681) and parking is also not significant (p-value = 0.1173). There is evidence from the correlations that size is related to both units and area, so removing this variable might remove much of the multicollinearity. Parking just doesn't seem to explain much variability in price.
Let's look at a 95% confidence interval for β4:
β̂4 ± t_{0.025;18} · SE(β̂4)
2675.6 ± (2.101)(1626.24)
(−741.1, 6092.4)
A Test for the Significance of a Group
of Regressors (Partial F-Test)
Suppose that the full model has k regressors, and we are interested in testing whether the last k − r of them can be deleted from the model. This smaller model is called the reduced model. That is, the full model is
Y = β0 + β1 x1 + β2 x2 + ··· + βr xr + β_{r+1} x_{r+1} + ··· + βk xk + ε
and the reduced model has β_{r+1} = β_{r+2} = ··· = βk = 0, so the reduced model is
Y = β0 + β1 x1 + β2 x2 + ··· + βr xr + ε
Then we test the hypotheses H0: β_{r+1} = ··· = βk = 0 vs. H1: at least one of these β's is nonzero.
F = [(SSE_R − SSE_F)/(k − r)] / MSE_F
where:
SSE_R = SSE for reduced model
SSE_F = SSE for full model
k − r = number of β's in H0
For given α, we reject H0 if:
partial F > tabled F
with degrees of freedom k − r (numerator) and n − p (denominator), where p = k + 1 is the number of parameters in the full model.
Testing H0: β3 = β4 = 0
The full model is
Y = β0 + β1 Units + β2 Age + β3 Size + β4 Parking + β5 Area + ε
The reduced model is
Y = β0 + β1 Units + β2 Age + β5 Area + ε
From the SAS output we have
SSE_F = 20,959,224,743, dfe_F = 18, MSE_F = 1,164,401,375
SSE_R = 24,111,264,632, dfe_R = 20
F = [(24,111,264,632 − 20,959,224,743)/2] / 1,164,401,375 = 1.35
f_{0.05;2,18} = 3.55
No evidence to reject the null hypothesis.
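The arithmetic for this partial F statistic is a one-liner, using the SSE values quoted from the SAS output:

```python
# Partial F = [(SSE_R - SSE_F)/(k - r)] / MSE_F, with values from the SAS output.
sse_full, df_full = 20_959_224_743, 18
sse_red = 24_111_264_632
k_minus_r = 2                          # number of betas set to zero under H0
mse_full = sse_full / df_full          # about 1,164,401,375
F = ((sse_red - sse_full) / k_minus_r) / mse_full
print(round(F, 2))   # → 1.35, well below f_{0.05;2,18} = 3.55
```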
Interpreting the β̂j's
For the apartment appraisal problem we have Ȳ = 296,193.3. The β̂'s (from the reduced model) are
β̂0 = 114,857.4
β̂1 = 5,012.6 (units)
β̂2 = −1,054.8 (age)
β̂5 = 14.96 (area)
If one extra unit is added (all other factors held constant) the value of the complex will increase by $5,012.6. If the complex ages one more year it will lose $1,054.8 in value (all other factors held constant). If the area is increased by one square foot the value of the complex will increase by $14.96 (all other factors held constant).
Notice the potential for multicollinearity. If one more unit is added, the number of square feet would also increase. Thus the interdependency of some of the variables makes the β̂'s harder to interpret.
Notes on the Reduced Model
The root MSE has increased in the reduced model (34,721) vs. the full model (34,123), but the standard errors of the individual β̂'s have all decreased. This is another indication that there was multicollinearity in the full model. We will be able to do more accurate inference in this reduced model.
The R² and adjusted R² have decreased by only a small amount. This also justifies dropping the two variables.
All the individual β̂'s are significantly different from zero (all p-values small). This indicates that we probably cannot remove further variables without losing some information about the Y's.
Examining the Final Model
Some final checks on the model are:
1) Residuals
2) Studentized (standardized) residuals
The studentized residuals should be between −2 and 2 around 95% of the time. If an excessive number are greater than 2 in absolute value, or if any one studentized residual is much greater than 2, you should investigate more closely.
3) Hat diagonals, the main diagonal elements of the matrix
X(X′X)⁻¹X′
We have already seen that (X′X)⁻¹ is important. The diagonal elements as well as the eigenvalues of this matrix contain much information. Each diagonal element corresponds to a particular observation. Look for diagonal values that are large relative to the rest (a common rule of thumb flags h_ii > 2(k + 1)/n).
One More Diagnostic
4) DFBETAS (and DFFITS)
This diagnostic investigates the influence of each observation on the value of the parameters. The parameters are first fit with all observations; call the estimates β̂_i. Next the parameters are estimated using all but the jth observation; call these estimates β̂_{i[j]}. The DFBETAS for the ith parameter and jth observation is calculated as
DFBETAS = (β̂_i − β̂_{i[j]}) / SE(β̂_i)
You look for values that are much larger than the other values. This indicates that the observation is too influential in determining the value of the parameters. The related DFFITS statistic combines this idea across all the parameters at once, measuring how much the fitted value for observation j changes when that observation is deleted.
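The leave-one-out idea behind these diagnostics can be sketched directly. This is a hypothetical, unscaled version (it reports raw coefficient shifts; SAS's DFBETAS additionally divides each shift by a deleted-fit standard error):

```python
import numpy as np

# For each observation j, refit without it and record how every
# coefficient moves relative to the full-data fit.
def delta_betas(X, y):
    full, *_ = np.linalg.lstsq(X, y, rcond=None)
    n = len(y)
    out = np.empty((n, X.shape[1]))
    for j in range(n):
        keep = np.arange(n) != j                 # drop observation j
        bj, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        out[j] = full - bj                       # coefficient shift from deleting obs j
    return out
```

Rows with unusually large entries flag the observations that dominate the parameter estimates, which is exactly how the DFBETAS columns in the SAS output are read.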
REDUCED MODEL (NO SIZE AND PARKING)
The REG Procedure
Model: MODEL1
Dependent Variable: price

Number of Observations Read    24
Number of Observations Used    24

Analysis of Variance

Source             DF    Sum of Squares     Mean Square    F Value    Pr > F
Model               3        1.03081E12     3.436033E11     285.01    <.0001
Error              20       24111264632      1205563232
Corrected Total    23       1.054921E12

Root MSE             34721    R-Square    0.9771
Dependent Mean      296193    Adj R-Sq    0.9737
Coeff Var         11.72249
Parameter Estimates

Variable     DF    Parameter      Standard     t Value    Pr > |t|
                    Estimate         Error
Intercept     1       114857         17919        6.41      <.0001
units         1   5012.58292    1183.19286        4.24      0.0004
age           1  -1054.84586     274.79652       -3.84      0.0010
area          1     14.96564       1.48218       10.10      <.0001
REDUCED MODEL

The REG Procedure
Model: MODEL1
Dependent Variable: price

Output Statistics

Obs    Dep Var    Predicted    Std Err        Residual    Std Err    Student    Cook's
                    Value      Mean Pred                   Resid      Resid        D
  1      90300       112254      12030         -21954      32570     -0.674     0.015
  2     384000       416767      13928         -32767      31805     -1.030     0.051
  3     157500       169298       9015         -11798      33530     -0.352     0.002
  4     676200       688661      21982         -12461      26877     -0.464     0.036
  5     165000       173494       8304          -8494      33714     -0.252     0.001
  6     300000       314198      10357         -14198      33141     -0.428     0.004
  7     108750        93906      12575          14844      32364      0.459     0.008
  8     276538       263679      11347          12859      32815      0.392     0.005
  9     420000       384689      13763          35311      31877      1.108     0.057
 10     950000       941108      31332           8892      14961      0.594     0.387
 11     560000       616095      17479         -56095      30001     -1.870     0.297
 12     268000       241992       9249          26008      33467      0.777     0.012
 13     290000       249139      10066          40861      33230      1.230     0.035
 14     173200       189543      12261         -16343      32484     -0.503     0.009
 15     323650       279370      10773          44280      33008      1.341     0.048
 16     162500       177167      12803         -14667      32275     -0.454     0.008
 17     353500       354439       9992      -938.6096      33252     -0.0282    0.000
 18     134400       131108       9953           3292      33264      0.0990    0.000
 19     187000       245542      11977         -58542      32590     -1.796     0.109
 20      93600       105878      12192         -12278      32510     -0.378     0.005
 21     110000       128439       9416         -18439      33420     -0.552     0.006
 22     573200       529231      22052          43969      26819      1.639     0.454
 23      79300       106417      12177         -27117      32516     -0.834     0.024
 24     272000       196225      12163          75775      32521      2.330     0.190
REDUCED MODEL

The REG Procedure

Output Statistics

Obs   RStudent   Hat Diag     Cov      DFFITS    -------------DFBETAS-------------
                    H         Ratio              Intercept    units      age       area
  1    -0.6646    0.1201     1.2727   -0.2455      0.0255   -0.0031   -0.1666    0.0567
  2    -1.0319    0.1609     1.1764   -0.4519     -0.3370   -0.1451    0.3432    0.0915
  3    -0.3440    0.0674     1.2842   -0.0925     -0.0163    0.0211   -0.0367   -0.0002
  4    -0.4544    0.4008     1.9623   -0.3716      0.1032    0.2203   -0.0267   -0.3226
  5    -0.2459    0.0572     1.2858   -0.0606     -0.0300    0.0120   -0.0045    0.0034
  6    -0.4195    0.0890     1.2989   -0.1311      0.0062    0.0820   -0.0353   -0.0838
  7     0.4494    0.1312     1.3546    0.1746     -0.0130    0.0243    0.1154   -0.0638
  8     0.3834    0.1068     1.3328    0.1326      0.1207    0.0292   -0.0918   -0.0392
  9     1.1144    0.1571     1.1307    0.4812      0.3411    0.2414   -0.3141   -0.1954
 10     0.5845    0.8143     6.1574    1.2240     -0.4218    0.8913    0.2392   -0.4129
 11    -2.0062    0.2534     0.7625   -1.1689      0.4828    0.4972   -0.3232   -0.8502
 12     0.7692    0.0710     1.1690    0.2126      0.0686    0.1215    0.0315   -0.1349
 13     1.2466    0.0840     0.9787    0.3776     -0.0751   -0.1207    0.2293    0.0981
 14    -0.4935    0.1247     1.3330   -0.1863     -0.1807   -0.0149    0.1282    0.0485
 15     1.3706    0.0963     0.9317    0.4473      0.4080    0.0441   -0.3203   -0.0715
 16    -0.4452    0.1360     1.3632   -0.1766     -0.1731   -0.0080    0.1242    0.0421
 17    -0.0275    0.0828     1.3384   -0.0083      0.0001   -0.0053   -0.0025    0.0041
 18     0.0965    0.0822     1.3350    0.0289      0.0037   -0.0018    0.0140   -0.0055
 19    -1.9118    0.1190     0.6894   -0.7026     -0.6733    0.0295    0.5377    0.0515
 20    -0.3694    0.1233     1.3609   -0.1385      0.0130   -0.0080   -0.0934    0.0388
 21    -0.5419    0.0735     1.2463   -0.1527     -0.0977   -0.0163    0.0079    0.0616
 22     1.7175    0.4034     1.1553    1.4122      0.5731   -0.9475   -0.8374    1.1063
 23    -0.8274    0.1230     1.2151   -0.3098      0.0292   -0.0168   -0.2089    0.0855
 24     2.6607    0.1227     0.3943    0.9951     -0.2172   -0.4517    0.6229    0.3274

Sum of Residuals                          0
Sum of Squared Residuals        24111264632
Predicted Residual SS (PRESS)   37937505741