Transcript Document
6-3 Multiple Regression
6-3.1 Estimation of Parameters in Multiple Regression

The multiple linear regression model is

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \epsilon_i = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij} + \epsilon_i, \qquad i = 1, 2, \ldots, n$$

• The least squares function is given by
$$L = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{ij} \Bigr)^2$$
• The least squares estimates must satisfy
$$\frac{\partial L}{\partial \beta_0} = 0 \quad \text{and} \quad \frac{\partial L}{\partial \beta_j} = 0, \quad j = 1, 2, \ldots, k,$$
with the derivatives evaluated at $\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_k$.
• The least squares normal equations are, in matrix form, $\mathbf{X}'\mathbf{X}\hat{\boldsymbol\beta} = \mathbf{X}'\mathbf{y}$, where $\mathbf{y} = \mathbf{X}\boldsymbol\beta + \boldsymbol\epsilon$ is the model written in matrix notation.
• The solution to the normal equations is the least squares estimator of the regression coefficients, $\hat{\boldsymbol\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$.

X'X in Multiple Regression

$$\mathbf{X}'\mathbf{X} =
\begin{bmatrix}
n & \sum_{i=1}^{n} x_{i1} & \cdots & \sum_{i=1}^{n} x_{ik} \\
\sum_{i=1}^{n} x_{i1} & \sum_{i=1}^{n} x_{i1}^2 & \cdots & \sum_{i=1}^{n} x_{i1} x_{ik} \\
\vdots & \vdots & \ddots & \vdots \\
\sum_{i=1}^{n} x_{ik} & \sum_{i=1}^{n} x_{ik} x_{i1} & \cdots & \sum_{i=1}^{n} x_{ik}^2
\end{bmatrix},
\qquad
\mathbf{X}'\mathbf{y} =
\begin{bmatrix}
\sum_{i=1}^{n} y_i \\ \sum_{i=1}^{n} x_{i1} y_i \\ \vdots \\ \sum_{i=1}^{n} x_{ik} y_i
\end{bmatrix}$$

The matrix $\sigma^2(\mathbf{X}'\mathbf{X})^{-1}$ is the variance-covariance matrix of the least squares estimators:

$$\sigma^2(\mathbf{X}'\mathbf{X})^{-1} =
\begin{bmatrix}
V(\hat\beta_0) & Cov(\hat\beta_0,\hat\beta_1) & \cdots & Cov(\hat\beta_0,\hat\beta_k) \\
Cov(\hat\beta_0,\hat\beta_1) & V(\hat\beta_1) & \cdots & Cov(\hat\beta_1,\hat\beta_k) \\
\vdots & \vdots & \ddots & \vdots \\
Cov(\hat\beta_0,\hat\beta_k) & Cov(\hat\beta_1,\hat\beta_k) & \cdots & V(\hat\beta_k)
\end{bmatrix}$$

Adjusted R2

We can adjust R2 to take into account the number of regressors in the model:

$$R^2_{ADJ} = 1 - (1 - R^2)\,\frac{n-1}{n-(k+1)}$$

(i) The adjusted R2 does not always increase, as R2 does, when k increases. The adjusted R2 is especially preferred to R2 when k/n is a large fraction (greater than 10%); if k/n is small, the two measures are almost identical.
(ii) Always: $R^2_{ADJ} \le R^2 \le 1$.
(iii) $R^2 = 1 - SSE/SS(Total)$ and $R^2_{ADJ} = 1 - MSE/MS(Total)$, where $MS(Total) = SS(Total)/(n-1)$ is the sample variance of y.

6-3.2 Inferences in Multiple Regression

Test for Significance of Regression

$$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0 \quad \text{vs.} \quad H_1: \beta_j \ne 0 \text{ for at least one } j$$
The test statistic is $F_0 = MSR/MSE = \dfrac{SSR/k}{SSE/(n-k-1)}$, and we reject $H_0$ if $F_0 > f_{\alpha,\,k,\,n-k-1}$; this is the ANOVA F test reported by SAS.

Inference on Individual Regression Coefficients

$$H_0: \beta_j = 0 \quad \text{vs.} \quad H_1: \beta_j \ne 0$$
The test statistic is $T_0 = \hat\beta_j / se(\hat\beta_j)$, referred to the t distribution with $n-k-1$ degrees of freedom. This is called a partial or marginal test because it measures the contribution of $x_j$ given that the other regressors are in the model.

Confidence Intervals on the Mean Response and Prediction Intervals

The mean response at the point $x_{10}, x_{20}, \ldots, x_{k0}$ is
$$\mu_{Y|x_{10},x_{20},\ldots,x_{k0}} = \beta_0 + \beta_1 x_{10} + \beta_2 x_{20} + \cdots + \beta_k x_{k0},$$
and a $100(1-\alpha)\%$ confidence interval on it is $\hat\mu_{Y|x_{10},\ldots,x_{k0}} \pm t_{\alpha/2,\,n-k-1}\, se(\hat\mu_{Y|x_{10},\ldots,x_{k0}})$.

The response at the point of interest is
$$Y_0 = \beta_0 + \beta_1 x_{10} + \beta_2 x_{20} + \cdots + \beta_k x_{k0} + \epsilon$$
and the corresponding predicted value is
$$\hat Y_0 = \hat\mu_{Y|x_{10},x_{20},\ldots,x_{k0}} = \hat\beta_0 + \hat\beta_1 x_{10} + \hat\beta_2 x_{20} + \cdots + \hat\beta_k x_{k0}.$$
The prediction error is $Y_0 - \hat Y_0$, and the standard deviation of this prediction error is
$$\sqrt{\hat\sigma^2 + \bigl[ se(\hat\mu_{Y|x_{10},x_{20},\ldots,x_{k0}}) \bigr]^2},$$
which gives the $100(1-\alpha)\%$ prediction interval $\hat Y_0 \pm t_{\alpha/2,\,n-k-1}\sqrt{\hat\sigma^2 + [se(\hat\mu_{Y|x_{10},\ldots,x_{k0}})]^2}$.

6-3.3 Checking Model Adequacy

Residual Analysis

The standardized residual is $d_i = e_i/\sqrt{\hat\sigma^2}$ and the studentized residual is $r_i = e_i/\sqrt{\hat\sigma^2(1-h_{ii})}$, where $h_{ii}$ is the ith diagonal element of the hat matrix $\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$ and
$$0 < h_{ii} \le 1.$$
Because the $h_{ii}$'s are always between zero and unity, a studentized residual is always larger (in absolute value) than the corresponding standardized residual. Consequently, studentized residuals are a more sensitive diagnostic when looking for outliers.
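The matrix quantities in this subsection can be verified numerically. Below is a minimal SAS/IML sketch (not part of the textbook example; it assumes SAS/IML is available and that the wire bond data set ex67 with variables strength, length, and height from Example 6-7 in the next subsection has already been created). It builds X'X, solves the normal equations, and computes MSE, the estimated variance-covariance matrix of the estimates, and the ordinary and adjusted R2; the results should agree with the PROC REG output shown later.

proc iml;                                     /* sketch: least squares via the matrix formulas     */
   use ex67;
   read all var {length height} into X0;      /* regressors                                        */
   read all var {strength}      into y;       /* response                                          */
   close ex67;
   n = nrow(X0);  k = ncol(X0);
   X = j(n, 1, 1) || X0;                      /* prepend a column of ones for the intercept        */
   xpx  = X` * X;                             /* X'X                                               */
   bhat = solve(xpx, X` * y);                 /* solves the normal equations: (X'X)^{-1} X'y       */
   e    = y - X * bhat;                       /* residuals                                         */
   mse  = ssq(e) / (n - k - 1);               /* sigma-hat squared                                 */
   covb = mse * inv(xpx);                     /* estimated variance-covariance matrix of bhat      */
   sst  = ssq(y - y[:]);                      /* corrected total sum of squares                    */
   rsq    = 1 - ssq(e) / sst;                 /* R-square                                          */
   adjrsq = 1 - (1 - rsq) * (n - 1) / (n - k - 1);   /* adjusted R-square                          */
   print bhat, mse, covb, rsq adjrsq;
quit;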
6-3 Multiple Regression
6-3.3 Checking Model Adequacy

Influential Observations

• The disposition of the points in the x-space is important in determining the properties of the model: R2, the regression coefficients, and the magnitude of the error mean square all depend on it.
• Cook's distance measures the influence of the ith observation,
$$D_i = \frac{r_i^2}{p}\cdot\frac{h_{ii}}{1-h_{ii}}, \qquad p = k+1,$$
where $r_i$ is the studentized residual and $h_{ii}$ is the hat diagonal.
• A large value of $D_i$ implies that the ith point is influential; a value of $D_i > 1$ is usually taken to indicate that the point is influential.

Example 6-7

OPTIONS NOOVP NODATE NONUMBER LS=140;
DATA ex67;
   INPUT strength length height @@;
   LABEL strength='Pull Strength' length='Wire length' height='Die Height';
   CARDS;
 9.95  2  50   24.45  8 110   31.75 11 120   35    10 550   25.02  8 295
16.86  4 200   14.38  2 375    9.6   2  52   24.35  9 100   27.5   8 300
17.08  4 412   37    11 400   41.95 12 500   11.66  2 360   21.65  4 205
17.89  4 400   69    20 600   10.3   1 585   34.93 10 540   46.59 15 250
44.88 15 290   54.12 16 510   56.63 17 590   22.13  6 100   21.15  5 400
;
PROC SGSCATTER data=ex67;
   MATRIX strength length height;
   TITLE 'Scatter Plot Matrix for Wire Bond Data';
PROC REG data=ex67;
   MODEL strength=length height / XPX R CLB CLM CLI;
   PLOT npp.*Residual.;       /* Normal Probability Plot */
   PLOT Residual.*Pred.;      /* Residual Plot */
   PLOT Residual.*length;
   PLOT Residual.*height;
   TITLE 'Multiple Regression';
DATA EX67N;
   INPUT length height @@;
   DATALINES;
11 35   5 20
;
DATA EX67N1;
   SET ex67 EX67N;
PROC REG DATA=EX67N1;
   MODEL strength=length height / CLM CLI;
   TITLE 'CIs FOR MEAN RESPONSE AND FUTURE OBSERVATION';
RUN; QUIT;

6-3 Multiple Regression

The REG Procedure
Model: MODEL1

Model Crossproducts X'X X'Y Y'Y

Variable    Label           Intercept      length       height       strength
Intercept   Intercept           25            206          8294         725.82
length      Wire length        206           2396         77177        8008.47
height      Die Height        8294          77177       3531848      274816.71
strength    Pull Strength   725.82        8008.47     274816.71     27178.5316

The REG Procedure
Model: MODEL1
Dependent Variable: strength

Number of Observations Read   25
Number of Observations Used   25

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               2        5990.77122     2995.38561     572.17    <.0001
Error              22         115.17348        5.23516
Corrected Total    24        6105.94470

Root MSE          2.28805    R-Square    0.9811
Dependent Mean   29.03280    Adj R-Sq    0.9794
Coeff Var         7.88090

Parameter Estimates

Variable    Label         DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   95% Confidence Limits
Intercept   Intercept      1        2.26379             1.06007        2.14      0.0441     0.06535   4.46223
length      Wire length    1        2.74427             0.09352       29.34      <.0001     2.55031   2.93823
height      Die Height     1        0.01253             0.00280        4.48      0.0002     0.00672   0.01833

6-3 Multiple Regression

Multiple Regression

The REG Procedure
Model: MODEL1
Dependent Variable: strength

Output Statistics

      Dependent    Predicted    Std Error
Obs    Variable        Value    Mean Predict
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
9.9500 24.4500 31.7500 35.0000 25.0200 16.8600 14.3800 9.6000 24.3500 27.5000 17.0800 37.0000 41.9500 11.6600 21.6500 17.8900 69.0000 10.3000 34.9300 46.5900 44.8800 54.1200 56.6300 22.1300 21.1500
8.3787 25.5960 33.9541 36.5968 27.9137 15.7464 12.4503 8.4038 28.2150 27.9763 18.4023 37.4619 41.4589 12.2623 15.8091 18.2520 64.6659 12.3368 36.4715 46.5598 47.0609 52.5613 56.3078 19.9822 20.9963
0.9074 0.7645 0.8620 0.7303 0.4677 0.6261 0.7862 0.9039 0.8185 0.4651 0.6960 0.5246 0.6553 0.7689 0.6213 0.6785 1.1652 1.2383 0.7096 0.8780 0.8238 0.8432 0.9771 0.7557 0.6176
95% CL Mean
6.4968 24.0105 32.1665 35.0821 26.9437 14.4481
10.8198 6.5291 26.5175 27.0118 16.9588 36.3739 40.0999 10.6678 14.5206 16.8448 62.2494 9.7689 34.9999 44.7389 45.3524 50.8127 54.2814 18.4149 19.7153 10.2606 27.1815 35.7417 38.1114 28.8836 17.0448 14.0807 10.2784 29.9125 28.9408 19.8458 38.5498 42.8179 13.8568 17.0976 19.6592 67.0824 14.9048 37.9431 48.3807 48.7694 54.3099 58.3342 21.5494 22.2772 95% CL Predict 3.2740 20.5930 28.8834 31.6158 23.0704 10.8269 7.4328 3.3018 23.1754 23.1341 13.4425 32.5936 36.5230 7.2565 10.8921 13.3026 59.3409 6.9414 31.5034 41.4773 42.0176 47.5042 51.1481 14.9850 16.0813 13.4834 30.5990 39.0248 41.5778 32.7569 20.6660 17.4677 13.5058 33.2546 32.8184 23.3621 42.3301 46.3948 17.2682 20.7260 23.2014 69.9909 17.7323 41.4396 51.6423 52.1042 57.6183 61.4675 24.9794 25.9112 Sum of Residuals Sum of Squared Residuals Predicted Residual SS (PRESS) Residual 1.5713 -1.1460 -2.2041 -1.5968 -2.8937 1.1136 1.9297 1.1962 -3.8650 -0.4763 -1.3223 -0.4619 0.4911 -0.6023 5.8409 -0.3620 4.3341 -2.0368 -1.5415 0.0302 -2.1809 1.5587 0.3222 2.1478 0.1537 0 115.17348 156.16295 Std Error Student Residual Residual 2.100 2.157 2.119 2.168 2.240 2.201 2.149 2.102 2.137 2.240 2.180 2.227 2.192 2.155 2.202 2.185 1.969 1.924 2.175 2.113 2.135 2.127 2.069 2.160 2.203 0.748 -0.531 -1.040 -0.736 -1.292 0.506 0.898 0.569 -1.809 -0.213 -0.607 -0.207 0.224 -0.280 2.652 -0.166 2.201 -1.059 -0.709 0.0143 -1.022 0.733 0.156 0.995 0.0698 Cook's D -2-1 0 1 2 | | | | | | | | | | | | | | | | | | | | | | | | | |* *| **| *| **| |* |* |* ***| | *| | | | |***** | |**** **| *| | **| |* | |* | | | | | | | | | | | | | | | | | | | | | | | | | | 0.035 0.012 0.060 0.021 0.024 0.007 0.036 0.020 0.160 0.001 0.013 0.001 0.001 0.003 0.187 0.001 0.565 0.155 0.018 0.000 0.052 0.028 0.002 0.040 0.000 6-3 Multiple Regression 6-3 Multiple Regression 6-3 Multiple Regression CIs FOR MEAN RESPONSE AND FUTURE OBSERVATION The REG Procedure Model: MODEL1 Dependent Variable: strength Pull Strength Number of Observations Read Number of Observations Used Number of Observations with Missing Values 27 25 2 Analysis of Variance Source DF Sum of Squares Mean Square Model Error Corrected Total 2 22 24 5990.77122 115.17348 6105.94470 2995.38561 5.23516 Root MSE Dependent Mean Coeff Var 2.28805 29.03280 7.88090 R-Square Adj R-Sq F Value Pr > F 572.17 <.0001 0.9811 0.9794 Parameter Estimates Variable Label Intercept length height Intercept Wire length Die Height DF Parameter Estimate Standard Error t Value Pr > |t| 1 1 1 2.26379 2.74427 0.01253 1.06007 0.09352 0.00280 2.14 29.34 4.48 0.0441 <.0001 0.0002 6-3 Multiple Regression CIs FOR MEAN RESPONSE AND FUTURE OBSERVATION The REG Procedure Model: MODEL1 Dependent Variable: strength Pull Strength Output Statistics Obs Dependent Variable Predicted Value Std Error Mean Predict 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 9.9500 24.4500 31.7500 35.0000 25.0200 16.8600 14.3800 9.6000 24.3500 27.5000 17.0800 37.0000 41.9500 11.6600 21.6500 17.8900 69.0000 10.3000 34.9300 46.5900 44.8800 54.1200 56.6300 22.1300 21.1500 . . 
8.3787 25.5960 33.9541 36.5968 27.9137 15.7464 12.4503 8.4038 28.2150 27.9763 18.4023 37.4619 41.4589 12.2623 15.8091 18.2520 64.6659 12.3368 36.4715 46.5598 47.0609 52.5613 56.3078 19.9822 20.9963 32.8892 16.2357
0.9074 0.7645 0.8620 0.7303 0.4677 0.6261 0.7862 0.9039 0.8185 0.4651 0.6960 0.5246 0.6553 0.7689 0.6213 0.6785 1.1652 1.2383 0.7096 0.8780 0.8238 0.8432 0.9771 0.7557 0.6176 1.0620 0.9286
95% CL Mean
6.4968 24.0105 32.1665 35.0821 26.9437 14.4481 10.8198 6.5291 26.5175 27.0118 16.9588 36.3739 40.0999 10.6678 14.5206 16.8448 62.2494 9.7689 34.9999 44.7389 45.3524 50.8127 54.2814 18.4149 19.7153 30.6867 14.3099
Sum of Residuals   Sum of Squared Residuals   Predicted Residual SS (PRESS)
10.2606 27.1815 35.7417 38.1114 28.8836 17.0448 14.0807 10.2784 29.9125 28.9408 19.8458 38.5498 42.8179 13.8568 17.0976 19.6592 67.0824 14.9048 37.9431 48.3807 48.7694 54.3099 58.3342 21.5494 22.2772 35.0918 18.1615
0   115.17348   156.16295
95% CL Predict
3.2740 20.5930 28.8834 31.6158 23.0704 10.8269 7.4328 3.3018 23.1754 23.1341 13.4425 32.5936 36.5230 7.2565 10.8921 13.3026 59.3409 6.9414 31.5034 41.4773 42.0176 47.5042 51.1481 14.9850 16.0813 27.6579 11.1147
13.4834 30.5990 39.0248 41.5778 32.7569 20.6660 17.4677 13.5058 33.2546 32.8184 23.3621 42.3301 46.3948 17.2682 20.7260 23.2014 69.9909 17.7323 41.4396 51.6423 52.1042 57.6183 61.4675 24.9794 25.9112 38.1206 21.3567
Residual
1.5713 -1.1460 -2.2041 -1.5968 -2.8937 1.1136 1.9297 1.1962 -3.8650 -0.4763 -1.3223 -0.4619 0.4911 -0.6023 5.8409 -0.3620 4.3341 -2.0368 -1.5415 0.0302 -2.1809 1.5587 0.3222 2.1478 0.1537 . .

6-3 Multiple Regression
6-3.3 Checking Model Adequacy

Multicollinearity

Multicollinearity is a catch-all phrase referring to problems caused by the independent variables being correlated with each other. This can cause a number of problems:
1) Individual tests on the coefficients can be non-significant for important variables, and the sign of a $\hat\beta_j$ can be flipped. Recall that the partial slopes measure the change in Y for a unit change in $X_j$, holding the other X's constant. If two X's are highly correlated, this interpretation does not do much good.
2) The MSE can be inflated, and the standard errors of the partial slopes are inflated.
3) $R^2 < r_{YX_1}^2 + r_{YX_2}^2 + \cdots + r_{YX_k}^2$.
4) Removing one X from the model may make another more significant or less significant.

Variance Inflation Factor

The quantity
$$VIF(X_j) = \frac{1}{1 - R^2_{X_j \mid X_1 \cdots X_{j-1} X_{j+1} \cdots X_k}},$$
where the $R^2$ in the denominator comes from regressing $X_j$ on the other regressors, is called the variance inflation factor. The larger the value of VIF(Xj), the more the multicollinearity and the larger the standard error of $\hat\beta_j$ due to having Xj in the model. A common rule of thumb is that if VIF(Xj) > 5 then multicollinearity is high; 10 has also been proposed as a cutoff value (see the Kutner book referenced below).

Mallows' CP

Another measure that is useful when deciding which regressors to keep (and hence for reducing multicollinearity) is Mallows' CP. Assume we have a total of r candidate variables and we fit a model with only p of them. Let $SSE_p$ be the error sum of squares from the p-variable model and MSE the mean square error from the model with all r variables. Then
$$C_p = \frac{SSE_p}{MSE} - \bigl(n - 2(p+1)\bigr).$$
We want $C_p$ to be near p + 1 for a good model. (Both VIF and $C_p$ are illustrated in the SAS sketch following the full-model discussion below.)

6-3 Multiple Regression

Consider the Full Model

$$H_0: \beta_1 = \cdots = \beta_5 = 0 \quad \text{vs.} \quad H_1: \text{not all are zero}$$
98.01% of the variability in the Y's is explained by the relation to the X's. The adjusted R2 is 0.9746, which is very close to the R2 value. This indicates no serious problems with the number of independent variables.
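As a rough check of the quantities just defined, the sketch below (not part of the original example) computes the VIF for size directly from its definition 1/(1 - R2), where R2 comes from regressing size on the other regressors, and requests Cp and adjusted R2 for every subset of regressors through PROC REG's RSQUARE selection method. It assumes the appraise data set created in the example that follows; the data set names sizefit and vif_size are arbitrary.

proc reg data=appraise outest=sizefit rsquare noprint;
   model size = units age parking area;       /* R2 of size regressed on the remaining X's */
run;
data vif_size;
   set sizefit;
   vif_size = 1 / (1 - _RSQ_);                /* VIF(size) = 1/(1 - R2); compare with the VIF column of the full model */
run;
proc print data=vif_size; var _RSQ_ vif_size; run;

proc reg data=appraise;
   model price = units age size parking area / selection=rsquare cp adjrsq;   /* Cp and adjusted R2 for all subsets */
run; quit;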
Possible multicollinearity between units, area, and size, since they have large correlations with one another. Age and parking have low correlations with price, so they may not be needed.

6-3 Multiple Regression

Example

OPTIONS NOOVP NODATE NONUMBER LS=100;
DATA appraise;
   INPUT price units age size parking area cond$ @@;
   CARDS;
 90300  4 82  4635  0  4266 F   384000 20 13 17798  0 14391 G
157500  5 66  5913  0  6615 G   676200 26 64  7750  6 34144 E
165000  5 55  5150  0  6120 G   300000 10 65 12506  0 14552 G
108750  4 82  7160  0  3040 G   276538 11 23  5120  0  7881 G
420000 20 18 11745 20 12600 G   950000 62 71 21000  3 39448 G
560000 26 74 11221  0 30000 G   268000 13 56  7818 13  8088 F
290000  9 76  4900  0 11315 E   173200  6 21  5424  6  4461 G
323650 11 24 11834  8  9000 G   162500  5 19  5246  5  3828 G
353500 20 62 11223  2 13680 F   134400  4 70  5834  0  4680 E
187000  8 19  9075  0  7392 G    93600  4 82  6864  0  3840 F
110000  4 50  4510  0  3092 G   573200 14 10 11192  0 23704 E
 79300  4 82  7425  0  3876 F   272000  5 82  7500  0  9542 E
;
PROC CORR DATA=APPRAISE;
   VAR PRICE UNITS AGE SIZE PARKING AREA;
   TITLE 'CORRELATIONS OF VARIABLES IN MODEL';
PROC REG DATA=APPRAISE;
   MODEL PRICE=UNITS AGE SIZE PARKING AREA / R VIF;
   TITLE 'ALL VARIABLES IN MODEL';
PROC REG DATA=APPRAISE;
   MODEL PRICE=UNITS AGE AREA / R INFLUENCE;
   TITLE 'REDUCED MODEL';
RUN; QUIT;

6-3 Multiple Regression

CORRELATIONS OF VARIABLES IN MODEL

The CORR Procedure

6 Variables: price units age size parking area

Simple Statistics

Variable     N        Mean       Std Dev          Sum     Minimum     Maximum
price       24      296193        214164      7108638       79300      950000
units       24    12.50000      12.73475    300.00000     4.00000    62.00000
age         24    52.75000      26.43655         1266    10.00000    82.00000
size        24        8702          4221       208843        4510       21000
parking     24     2.62500       5.01140     63.00000           0    20.00000
area        24       11648         10170       279555        3040       39448

Pearson Correlation Coefficients, N = 24
Prob > |r| under H0: Rho=0

            price      units        age       size    parking       area
price     1.00000    0.92207   -0.11118    0.73582    0.21385    0.96784
                      <.0001     0.6050     <.0001     0.3157     <.0001
units     0.92207    1.00000   -0.00982    0.79583    0.21290    0.87622
           <.0001                0.9637     <.0001     0.3179     <.0001
age      -0.11118   -0.00982    1.00000   -0.18563   -0.36141    0.03090
           0.6050     0.9637               0.3852     0.0827     0.8860
size      0.73582    0.79583   -0.18563    1.00000    0.15151    0.66741
           <.0001     <.0001     0.3852               0.4797     0.0004
parking   0.21385    0.21290   -0.36141    0.15151    1.00000    0.07830
           0.3157     0.3179     0.0827     0.4797                0.7161
area      0.96784    0.87622    0.03090    0.66741    0.07830    1.00000
           <.0001     <.0001     0.8860     0.0004     0.7161

6-3 Multiple Regression

ALL VARIABLES IN MODEL

The REG Procedure
Model: MODEL1
Dependent Variable: price

Number of Observations Read   24
Number of Observations Used   24

Analysis of Variance

Source             DF    Sum of Squares     Mean Square    F Value    Pr > F
Model               5       1.033962E12     2.067924E11     177.60    <.0001
Error              18       20959224743      1164401375
Corrected Total    23       1.054921E12

Root MSE            34123    R-Square    0.9801
Dependent Mean     296193    Adj R-Sq    0.9746
Coeff Var        11.52063

Parameter Estimates

Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|    Variance Inflation
Intercept     1            93629               29874          3.13      0.0057            0
units         1       4156.17223          1532.28739          2.71      0.0143      7.52119
age           1       -856.06670           306.65871         -2.79      0.0121      1.29821
size          1          0.88901             2.96966          0.30      0.7681      3.10362
parking       1       2675.62291          1626.23661          1.65      0.1173      1.31193
area          1         15.53982             1.50259         10.34      <.0001      4.61289

6-3 Multiple Regression

ALL VARIABLES IN MODEL

The REG Procedure
Model: MODEL1
Dependent Variable: price

Output Statistics

      Dependent    Predicted    Std Error                  Std Error    Student
Obs    Variable        Value    Mean Predict    Residual    Residual    Residual
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
90300 384000 157500 676200 165000 300000 108750
276538 420000 950000 560000 268000 290000 173200 323650 162500 353500 134400 187000 93600 110000 573200 79300 272000
110470 405080 165962 700437 167009 316800 93663 246679 421099 930242 614511 267139 246163 190788 290586 175673 351590 128242 233552 105832 119509 521561 106890 199161
12281 -20170 23185 -21080 9178 -8462 25152 -24237 10095 -2009 17858 -16800 13018 15087 19376 29859 25938 -1099 31527 19758 17207 -54511 18075 860.6816 11851 43837 14200 -17588 14788 33064 14612 -13173 10164 1910 9951 6158 13949 -46552 12433 -12232 12404 -9509 22525 51639 12957 -27590 13080 72839
Sum of Residuals   Sum of Squared Residuals   Predicted Residual SS (PRESS)
31837 -0.634 | 25037 -0.842 | 32866 -0.257 | 23061 -1.051 | 32596 -0.0616 | 29077 -0.578 | 31543 0.478 | 28088 1.063 | 22173 -0.0496 | 13057 1.513 | 29467 -1.850 | 28943 0.0297 | 31999 1.370 | 31028 -0.567 | 30752 1.075 | 30836 -0.427 | 32575 0.0586 | 32640 0.189 | 31142 -1.495 | 31778 -0.385 | 31789 -0.299 | 25632 2.015 | 31567 -0.874 | 31517 2.311 |
0   20959224743   56380131094
Cook's D
-2-1 0 1 2 *| *| | **| | *| | |** | |*** ***| | |** *| |** | | | **| | | |**** *| |**** | | | | | | | | | | | | | | | | | | | | | | | |
0.010 0.101 0.001 0.219 0.000 0.021 0.006 0.090 0.001 2.225 0.194 0.000 0.043 0.011 0.045 0.007 0.000 0.001 0.075 0.004 0.002 0.522 0.021 0.153

6-3 Multiple Regression

We have some evidence of multicollinearity, so we must consider dropping some of the variables. Let's look at the individual tests of
$$H_0: \beta_i = 0 \quad \text{vs.} \quad H_1: \beta_i \ne 0, \qquad i = 1, 2, \ldots, 5.$$
These tests are summarized in the SAS output of PROC REG. Size is very non-significant (p-value = 0.7681) and parking is also not significant (p-value = 0.1173). There is evidence from the correlations that size is related to both units and area, so removing this variable might remove much of the multicollinearity. Parking just doesn't seem to explain much variability in price.

Let's look at a 95% confidence interval for $\beta_4$:
$$\hat\beta_4 \pm t_{0.025;18}\,SE(\hat\beta_4) = 2675.6 \pm (2.101)(1626.24) = (-741.1,\ 6092.4)$$

6-3 Multiple Regression

A Test for the Significance of a Group of Regressors (Partial F-Test)

Suppose that the full model has k regressors, and we are interested in testing whether the last k - r of them can be deleted from the model. This smaller model is called the reduced model. That is, the full model is
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_r x_r + \beta_{r+1} x_{r+1} + \cdots + \beta_k x_k + \epsilon$$
and the reduced model has $\beta_{r+1} = \beta_{r+2} = \cdots = \beta_k = 0$, so the reduced model is
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_r x_r + \epsilon.$$
Then, to test the hypotheses
$$H_0: \beta_{r+1} = \beta_{r+2} = \cdots = \beta_k = 0 \quad \text{vs.} \quad H_1: \text{at least one of these coefficients is nonzero},$$
we use the partial F statistic
$$F = \frac{(SSE_R - SSE_F)/(k - r)}{MSE_F},$$
where $SSE_R$ = SSE for the reduced model, $SSE_F$ = SSE for the full model, and k - r = the number of β's in $H_0$. For a given α, we reject $H_0$ if the partial F exceeds the tabled F with k - r numerator and n - (k + 1) denominator degrees of freedom.

Testing $H_0: \beta_3 = \beta_4 = 0$

The full model is
$$Y = \beta_0 + \beta_1\,Units + \beta_2\,Age + \beta_3\,Size + \beta_4\,Parking + \beta_5\,Area + \epsilon$$
and the reduced model is
$$Y = \beta_0 + \beta_1\,Units + \beta_2\,Age + \beta_5\,Area + \epsilon.$$
From the SAS output we have $SSE_F$ = 20,959,224,743 with $df_F$ = 18 and $MSE_F$ = 1,164,401,375, and $SSE_R$ = 24,111,264,632 with $df_R$ = 20, so
$$F = \frac{(24{,}111{,}264{,}632 - 20{,}959{,}224{,}743)/2}{1{,}164{,}401{,}375} = \frac{3{,}152{,}039{,}889/2}{1{,}164{,}401{,}375} = 1.35.$$
Since $F_{0.05;2,18} = 3.55$, there is no evidence to reject the null hypothesis.
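The same partial F test can be requested directly in PROC REG with a TEST statement, which avoids fitting the reduced model by hand. A minimal sketch, assuming the appraise data set from the example above (the label SizeParking is just an arbitrary name for the test):

proc reg data=appraise;
   model price = units age size parking area;
   SizeParking: test size = 0, parking = 0;   /* partial F test of H0: beta3 = beta4 = 0 */
run; quit;

The F value printed for this test (2 numerator and 18 denominator degrees of freedom) should match the hand computation above.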
6-3 Multiple Regression

Interpreting the β̂'s

For the apartment appraisal problem we have $\bar y$ = 296,193.3. The estimates from the reduced model are
$\hat\beta_0$ = 114,857.4 (intercept), $\hat\beta_1$ = 5,012.6 (units), $\hat\beta_2$ = -1,054.8 (age), $\hat\beta_3$ = 14.97 (area).
If one extra unit is added (all other factors held constant), the value of the complex will increase by $5,012.6. If the complex ages one more year, it will lose $1,054.8 in value (all other factors held constant). If the area is increased by one square foot, the value of the complex will increase by $14.97 (all other factors held constant). Notice the potential for multicollinearity: if one more unit is added, the number of square feet would also increase. Thus the interdependency of some of the variables makes the β̂'s harder to interpret.

6-3 Multiple Regression

Notes on the Reduced Model

The root MSE has increased in the reduced model (34,721) versus the full model (34,123), but the standard errors of the individual β̂'s have all decreased. This is another indication that there was multicollinearity in the full model. We will be able to do more accurate inference with this reduced model. The R2 and adjusted R2 have decreased by only a small amount, which also justifies dropping the two variables. All the individual β̂'s are significantly different from zero (all p-values are small). This indicates that we probably cannot remove further variables without losing some information about the Y's.

6-3 Multiple Regression

Examining the Final Model

Some final checks on the model are:
1) Residuals.
2) Studentized (standardized) residuals. The studentized residuals should be between -2 and 2 around 95% of the time. If an excessive number of them are greater than 2 in absolute value, or if any one studentized residual is much greater than 2, you should investigate more closely.
3) Hat diagonals, the main diagonal elements of the matrix $\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$. We have already seen that $(\mathbf{X}'\mathbf{X})^{-1}$ is important; the diagonal elements, as well as the eigenvalues of this matrix, contain much information. Each diagonal element corresponds to a particular observation. Look for diagonal values that are large relative to the rest (a common rule of thumb flags $h_{ii} > 2(k+1)/n$).

6-3 Multiple Regression

One More Diagnostic

4) DFFITS and DFBETAS. These diagnostics investigate the influence of each observation on the estimates. The parameters are first fit with all observations, giving $\hat\beta_i$. Next the parameters are estimated using all but the jth observation; call these estimates $\hat\beta_{i[j]}$. The per-parameter statistic, reported by SAS in the DFBETAS columns, for the ith parameter and jth observation is
$$\frac{\hat\beta_i - \hat\beta_{i[j]}}{SE(\hat\beta_i)}.$$
Look for values that are much larger than the others; this indicates that the observation is too influential in determining the value of that parameter. A combined statistic, DFFITS, looks at the influence of the jth observation on its own fitted value, in effect on all the parameters at once.
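One convenient way to work with these diagnostics is to write them to a data set with an OUTPUT statement and flag the observations that exceed the usual cutoffs. The sketch below assumes the appraise data set and the reduced model PRICE = UNITS AGE AREA; the leverage cutoff 2(k+1)/n and the DFFITS cutoff 2*sqrt((k+1)/n) are common rules of thumb rather than values taken from the text, while the Cook's D > 1 rule is the one used above.

proc reg data=appraise noprint;
   model price = units age area;
   output out=diag rstudent=rstud h=leverage cookd=cooksd dffits=dffit;   /* save the diagnostics */
run; quit;

data flagged;
   set diag;
   k = 3;  n = 24;                                   /* number of regressors and sample size    */
   flag_resid    = (abs(rstud) > 2);                 /* unusually large studentized residual    */
   flag_leverage = (leverage   > 2*(k+1)/n);         /* high-leverage point                     */
   flag_cook     = (cooksd     > 1);                 /* Cook's D rule from the text             */
   flag_dffit    = (abs(dffit) > 2*sqrt((k+1)/n));   /* common DFFITS cutoff                    */
run;
proc print data=flagged;
   where flag_resid or flag_leverage or flag_cook or flag_dffit;
   var price rstud leverage cooksd dffit;
run;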
6-3 Multiple Regression REDUCED MODEL (NO SIZE AND PARKING) The REG Procedure Model: MODEL1 Dependent Variable: price Number of Observations Read Number of Observations Used 24 24 Analysis of Variance Source DF Sum of Squares Mean Square Model Error Corrected Total 3 20 23 1.03081E12 24111264632 1.054921E12 3.436033E11 1205563232 Root MSE Dependent Mean Coeff Var 34721 296193 11.72249 R-Square Adj R-Sq F Value Pr > F 285.01 <.0001 0.9771 0.9737 Parameter Estimates Variable DF Parameter Estimate Standard Error t Value Pr > |t| Intercept units age area 1 1 1 1 114857 5012.58292 -1054.84586 14.96564 17919 1183.19286 274.79652 1.48218 6.41 4.24 -3.84 10.10 <.0001 0.0004 0.0010 <.0001 6-3 Multiple Regression REDUCED MODEL The REG Procedure Model: MODEL1 Dependent Variable: price Output Statistics Dependent Predicted Std Error Std Error Student Obs Variable Value Mean Predict Residual Residual Residual 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 90300 384000 157500 676200 165000 300000 108750 276538 420000 950000 560000 268000 290000 173200 323650 162500 353500 134400 187000 93600 110000 573200 79300 272000 112254 416767 169298 688661 173494 314198 93906 263679 384689 941108 616095 241992 249139 189543 279370 177167 354439 131108 245542 105878 128439 529231 106417 196225 12030 -21954 13928 -32767 9015 -11798 21982 -12461 8304 -8494 10357 -14198 12575 14844 11347 12859 13763 35311 31332 8892 17479 -56095 9249 26008 10066 40861 12261 -16343 10773 44280 12803 -14667 9992 -938.6096 9953 3292 11977 -58542 12192 -12278 9416 -18439 22052 43969 12177 -27117 12163 75775 32570 -0.674 | 31805 -1.030 | 33530 -0.352 | 26877 -0.464 | 33714 -0.252 | 33141 -0.428 | 32364 0.459 | 32815 0.392 | 31877 1.108 | 14961 0.594 | 30001 -1.870 | 33467 0.777 | 33230 1.230 | 32484 -0.503 | 33008 1.341 | 32275 -0.454 | 33252 -0.0282 | 33264 0.0990 | 32590 -1.796 | 32510 -0.378 | 33420 -0.552 | 26819 1.639 | 32516 -0.834 | 32521 2.330 | Cook's D -2-1 0 1 2 *| **| | | | | | | |** |* ***| |* |** *| |** | | | ***| | *| |*** *| |**** | | | | | | | | | | | | | | | | | | | | | | | | 0.015 0.051 0.002 0.036 0.001 0.004 0.008 0.005 0.057 0.387 0.297 0.012 0.035 0.009 0.048 0.008 0.000 0.000 0.109 0.005 0.006 0.454 0.024 0.190 6-3 Multiple Regression REDUCED MODEL The REG Procedure Output Statistics Obs RStudent 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 -0.6646 -1.0319 -0.3440 -0.4544 -0.2459 -0.4195 0.4494 0.3834 1.1144 0.5845 -2.0062 0.7692 1.2466 -0.4935 1.3706 -0.4452 -0.0275 0.0965 -1.9118 -0.3694 -0.5419 1.7175 -0.8274 2.6607 Hat Diag H Cov Ratio DFFITS 0.1201 0.1609 0.0674 0.4008 0.0572 0.0890 0.1312 0.1068 0.1571 0.8143 0.2534 0.0710 0.0840 0.1247 0.0963 0.1360 0.0828 0.0822 0.1190 0.1233 0.0735 0.4034 0.1230 0.1227 1.2727 1.1764 1.2842 1.9623 1.2858 1.2989 1.3546 1.3328 1.1307 6.1574 0.7625 1.1690 0.9787 1.3330 0.9317 1.3632 1.3384 1.3350 0.6894 1.3609 1.2463 1.1553 1.2151 0.3943 -0.2455 -0.4519 -0.0925 -0.3716 -0.0606 -0.1311 0.1746 0.1326 0.4812 1.2240 -1.1689 0.2126 0.3776 -0.1863 0.4473 -0.1766 -0.0083 0.0289 -0.7026 -0.1385 -0.1527 1.4122 -0.3098 0.9951 ------------------DFBETAS----------------Intercept units age area 0.0255 -0.3370 -0.0163 0.1032 -0.0300 0.0062 -0.0130 0.1207 0.3411 -0.4218 0.4828 0.0686 -0.0751 -0.1807 0.4080 -0.1731 0.0001 0.0037 -0.6733 0.0130 -0.0977 0.5731 0.0292 -0.2172 Sum of Residuals Sum of Squared Residuals Predicted Residual SS (PRESS) -0.0031 -0.1451 0.0211 0.2203 0.0120 0.0820 0.0243 0.0292 0.2414 0.8913 0.4972 0.1215 -0.1207 -0.0149 0.0441 -0.0080 
-0.0053 -0.0018 0.0295 -0.0080 -0.0163 -0.9475 -0.0168 -0.4517 0 24111264632 37937505741 -0.1666 0.3432 -0.0367 -0.0267 -0.0045 -0.0353 0.1154 -0.0918 -0.3141 0.2392 -0.3232 0.0315 0.2293 0.1282 -0.3203 0.1242 -0.0025 0.0140 0.5377 -0.0934 0.0079 -0.8374 -0.2089 0.6229 0.0567 0.0915 -0.0002 -0.3226 0.0034 -0.0838 -0.0638 -0.0392 -0.1954 -0.4129 -0.8502 -0.1349 0.0981 0.0485 -0.0715 0.0421 0.0041 -0.0055 0.0515 0.0388 0.0616 1.1063 0.0855 0.3274 6-3 Multiple Regression 6-3 Multiple Regression