Stats 845 Transcript
Summary of the Statistics used in Multiple Regression

The Least Squares Estimates: β̂0, β̂1, β̂2, ..., β̂p
- The values that minimize
$$\mathrm{RSS} = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 = \sum_{i=1}^{n}\left(y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \hat\beta_2 x_{i2} - \cdots - \hat\beta_p x_{ip}\right)^2.$$

The Analysis of Variance Table Entries

a) Adjusted Total Sum of Squares (SSTotal)
$$\mathrm{SSTotal} = \sum_{i=1}^{n}(y_i - \bar{y})^2, \qquad \text{d.f.} = n - 1.$$

b) Residual Sum of Squares (SSError)
$$\mathrm{RSS} = \mathrm{SSError} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2, \qquad \text{d.f.} = n - p - 1.$$

c) Regression Sum of Squares (SSReg)
$$\mathrm{SSReg} = \mathrm{SS}(\beta_1, \beta_2, \ldots, \beta_p) = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2, \qquad \text{d.f.} = p.$$

Note:
$$\sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n}(y_i - \hat{y}_i)^2,$$
i.e. SSTotal = SSReg + SSError.

The Analysis of Variance Table

Source       Sum of Squares   d.f.     Mean Square                      F
Regression   SSReg            p        SSReg/p = MSReg                  MSReg/s²
Error        SSError          n-p-1    SSError/(n-p-1) = MSError = s²
Total        SSTotal          n-1

Uses:

1. To estimate σ² (the error variance).
- Use s² = MSError to estimate σ².

2. To test the hypothesis H0: β1 = β2 = ... = βp = 0.
- Use the test statistic F = MSReg/s² = [(1/p)SSReg]/[(1/(n-p-1))SSError].
- Reject H0 if F > Fα(p, n-p-1).

3. To compute other statistics that are useful in describing the relationship between Y (the dependent variable) and X1, X2, ..., Xp (the independent variables).

a) R² = the coefficient of determination
$$R^2 = \frac{\mathrm{SSReg}}{\mathrm{SSTotal}} = \frac{\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$
= the proportion of variance in Y explained by X1, X2, ..., Xp.
1 - R² = the proportion of variance in Y that is left unexplained by X1, X2, ..., Xp = SSError/SSTotal.

b) Ra² = "R² adjusted" for degrees of freedom
= 1 - [the proportion of variance in Y that is left unexplained by X1, X2, ..., Xp, adjusted for d.f.]
= 1 - [(1/(n-p-1))SSError]/[(1/(n-1))SSTotal]
= 1 - [(n-1)SSError]/[(n-p-1)SSTotal]
= 1 - [(n-1)/(n-p-1)][1 - R²].

c) R = √R² = the multiple correlation coefficient of Y with X1, X2, ..., Xp
= √(SSReg/SSTotal)
= the maximum correlation between Y and a linear combination of X1, X2, ..., Xp.

Comment: The statistics F, R², Ra² and R are equivalent statistics.

Properties of the Least Squares Estimators β̂0, β̂1, β̂2, ..., β̂p:

1. Normally distributed (if the error terms are Normally distributed).
2. Unbiased estimators of the linear parameters β0, β1, β2, ..., βp.
3. Minimum variance (minimum standard error) of all unbiased estimators of the linear parameters β0, β1, β2, ..., βp.

Comments: The standard error of β̂i, S.E.(β̂i) = s_β̂i, depends on
1. The error variance σ² (and σ).
2. s_Xi, the standard deviation of Xi (the ith independent variable).
3. The sample size n.
4. The correlations between all pairs of variables.

The standard error of β̂i, S.E.(β̂i) = s_β̂i:
• decreases as σ decreases.
• decreases as s_Xi increases.
• decreases as n increases.
• increases as the correlation between pairs of independent variables increases.
– In fact, the standard error of the least squares estimates can be extremely high if there is a high correlation between one of the independent variables and a linear combination of the remaining independent variables (the problem of Multicollinearity).

The Covariance Matrix, the Correlation Matrix and the XᵀX inverse matrix

The Covariance Matrix
$$\begin{bmatrix} [\mathrm{S.E.}(\hat\beta_0)]^2 & \mathrm{Cov}(\hat\beta_0,\hat\beta_1) & \cdots & \mathrm{Cov}(\hat\beta_0,\hat\beta_p) \\ & [\mathrm{S.E.}(\hat\beta_1)]^2 & \cdots & \mathrm{Cov}(\hat\beta_1,\hat\beta_p) \\ & & \ddots & \vdots \\ & & & [\mathrm{S.E.}(\hat\beta_p)]^2 \end{bmatrix}$$
where Cov(β̂i, β̂j) = rij s_β̂i s_β̂j = rij S.E.(β̂i) S.E.(β̂j), and where rij = the correlation between β̂i and β̂j.

The Correlation Matrix
$$\begin{bmatrix} 1 & r_{01} & \cdots & r_{0p} \\ & 1 & \cdots & r_{1p} \\ & & \ddots & \vdots \\ & & & 1 \end{bmatrix}$$

The XᵀX inverse matrix
$$\begin{bmatrix} a_{00} & a_{01} & \cdots & a_{0p} \\ & a_{11} & \cdots & a_{1p} \\ & & \ddots & \vdots \\ & & & a_{pp} \end{bmatrix}$$

If we multiply each entry in the XᵀX inverse matrix by s² = MSError, this matrix turns into the covariance matrix of β̂0, β̂1, β̂2, ..., β̂p. Thus
[S.E.(β̂i)]² = s² aii and Cov(β̂i, β̂j) = s² aij.
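As a bridge between these formulas and the computer output used later in the notes, here is a minimal numerical sketch (not part of the original notes) of how the quantities above are computed from data. It assumes Python with numpy; the function name and interface are illustrative only.

```python
import numpy as np

def multiple_regression_summary(X, y):
    """Hypothetical helper: least squares fit plus the ANOVA-table
    statistics summarized above.
    X : (n, p) array of independent variables (no intercept column)
    y : (n,) array of responses
    """
    n, p = X.shape
    Xd = np.column_stack([np.ones(n), X])   # design matrix with intercept
    XtX_inv = np.linalg.inv(Xd.T @ Xd)      # the X'X inverse matrix
    beta = XtX_inv @ Xd.T @ y               # least squares estimates
    yhat = Xd @ beta
    ss_error = np.sum((y - yhat) ** 2)      # RSS = SSError, d.f. n-p-1
    ss_total = np.sum((y - y.mean()) ** 2)  # SSTotal, d.f. n-1
    ss_reg = ss_total - ss_error            # SSReg, d.f. p
    s2 = ss_error / (n - p - 1)             # MSError = s^2, estimates sigma^2
    F = (ss_reg / p) / s2                   # overall F statistic
    R2 = ss_reg / ss_total                  # coefficient of determination
    R2_adj = 1 - (n - 1) / (n - p - 1) * (1 - R2)
    cov_beta = s2 * XtX_inv                 # covariance matrix of the estimates
    se_beta = np.sqrt(np.diag(cov_beta))    # standard errors S.E.(beta_i)
    return beta, se_beta, cov_beta, F, R2, R2_adj
```

The line `cov_beta = s2 * XtX_inv` is exactly the multiplication described above that turns the XᵀX inverse matrix into the covariance matrix of the estimates.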
These matrices can be used to compute standard errors for linear combinations of the regression coefficients. Namely, for
$$\hat{L} = c_0\hat\beta_0 + c_1\hat\beta_1 + \cdots + c_p\hat\beta_p,$$
$$\mathrm{S.E.}(\hat{L}) = s_{\hat{L}} = \sqrt{\sum_{i=0}^{p} c_i^2\,[\mathrm{S.E.}(\hat\beta_i)]^2 + 2\sum_{i<j} c_i c_j\,\mathrm{Cov}(\hat\beta_i,\hat\beta_j)}$$
$$= \sqrt{\sum_{i=0}^{p} c_i^2\, s_{\hat\beta_i}^2 + 2\sum_{i<j} c_i c_j\, r_{ij}\, s_{\hat\beta_i} s_{\hat\beta_j}} = s\sqrt{\sum_{i=0}^{p} c_i^2\, a_{ii} + 2\sum_{i<j} c_i c_j\, a_{ij}}.$$

For example, if L̂ = β̂i - β̂j, then
$$\mathrm{S.E.}(\hat\beta_i - \hat\beta_j) = s_{\hat\beta_i - \hat\beta_j} = \sqrt{[\mathrm{S.E.}(\hat\beta_i)]^2 + [\mathrm{S.E.}(\hat\beta_j)]^2 + 2(1)(-1)\,\mathrm{Cov}(\hat\beta_i,\hat\beta_j)}$$
$$= \sqrt{s_{\hat\beta_i}^2 + s_{\hat\beta_j}^2 - 2\, r_{ij}\, s_{\hat\beta_i} s_{\hat\beta_j}} = s\sqrt{a_{ii} + a_{jj} - 2a_{ij}}.$$

An Example

Suppose one is interested in how the cost per month (Y) of heating a plant is determined by the average atmospheric temperature in the month (X1) and the number of operating days in the month (X2). The data on these variables were collected for n = 25 months selected at random and are given below.

Y = cost per month of heating a plant
X1 = average atmospheric temperature in the month
X2 = the number of operating days for the plant in the month

Month    Y       X1      X2
1        1098    35.3    20
2        1113    29.7    20
3        1251    30.8    23
4        840     58.8    20
5        927     61.4    21
6        873     71.3    22
7        636     74.4    11
8        850     76.7    23
9        782     70.7    21
10       914     57.5    20
11       824     46.4    20
12       1219    28.9    21
13       1188    28.1    21
14       957     39.1    19
15       1094    46.8    23
16       958     48.5    20
17       1009    59.3    22
18       811     70.0    22
19       683     70.0    22
20       888     74.5    23
21       768     72.1    20
22       847     58.1    21
23       886     44.6    20
24       1036    33.4    20
25       1108    28.6

The Least Squares Estimates:

            Estimate    Standard Error
Constant    912.6       110.28
X1          -7.24       0.80
X2          20.29       4.577

The Covariance Matrix

            Constant    X1         X2
Constant    12162
X1          -49.203     .63390
X2          -464.36     .76796     20.947

The Correlation Matrix

            Constant    X1         X2
Constant    1.000       -.1764     -.0920
X1                      1.000      .0210
X2                                 1.000

The XᵀX Inverse Matrix

            Constant     X1               X2
Constant    2.778747     -0.011242        -0.106098
X1                       0.14207×10⁻³     0.175467×10⁻³
X2                                        0.478599×10⁻²

The Analysis of Variance Table

Source       df    SS        MS        F
Regression   2     541871    270936    61.899
Error        22    96287     4377
Total        24    638158

Summary Statistics (R², Radjusted² = Ra² and R)

R² = 541871/638158 = .8491 (explained variance in Y: 84.91%)
Ra² = 1 - [1 - R²][(n-1)/(n-p-1)] = 1 - [1 - .8491][24/22] = .8354 (83.54%)
R = √.8491 = .9215 = multiple correlation coefficient

[Figures: scatter plots of COST vs. TEMP and COST vs. DAYS, and a three-dimensional scatter plot of Cost, Temp and Days.]

Example

Motor Vehicle example

Variables
1. (Y) mpg – Mileage.
2. (X1) engine – Engine size.
3. (X2) horse – Horsepower.
4. (X3) weight – Weight.

Select Analysis->Regression->Linear.

To print the correlation matrix or the covariance matrix of the estimates, select Statistics and check the box for the covariance matrix of the estimates.

Here is the table giving the estimates and their standard errors.

Coefficients(a)

Model 1        B (Unstandardized)    Std. Error    Beta (Standardized)    t         Sig.
(Constant)     44.015                1.272                                34.597    .000
ENGINE         -5.53E-03             .007          -.074                  -.786     .432
HORSE          -5.56E-02             .013          -.273                  -4.153    .000
WEIGHT         -4.62E-03             .001          -.504                  -6.186    .000

a. Dependent Variable: MPG

Here is the table giving the correlation matrix and covariance matrix of the regression estimates:

Coefficient Correlations(a)

Model 1                     WEIGHT        HORSE         ENGINE
Correlations   WEIGHT       1.000         -.129         -.725
               HORSE        -.129         1.000         -.518
               ENGINE       -.725         -.518         1.000
Covariances    WEIGHT       5.571E-07     -1.29E-06     -3.81E-06
               HORSE        -1.29E-06     1.794E-04     -4.88E-05
               ENGINE       -3.81E-06     -4.88E-05     4.941E-05

a. Dependent Variable: MPG

What is missing in SPSS is the covariances and correlations with the intercept estimate (constant). These can be found by using the following trick:

1. Introduce a new variable (called constnt).
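The standard-error formula for a linear combination above is just the quadratic form cᵀ[s²(XᵀX)⁻¹]c. Here is a minimal sketch (not from the notes; the function name is illustrative), assuming Python with numpy:

```python
import numpy as np

def se_linear_combination(c, cov_beta):
    """Hypothetical helper: S.E. of L-hat = c0*b0 + c1*b1 + ... + cp*bp,
    given the covariance matrix of the estimates, cov_beta = s^2 (X'X)^{-1}.
    The quadratic form expands to
    sum(c_i^2 * SE_i^2) + 2 * sum_{i<j} c_i c_j Cov(b_i, b_j)."""
    c = np.asarray(c, dtype=float)
    return float(np.sqrt(c @ cov_beta @ c))

# Example: S.E.(b1 - b2) in a model with coefficients (b0, b1, b2).
# The coefficient vector (0, 1, -1) reproduces
# sqrt(SE_1^2 + SE_2^2 - 2 Cov(b1, b2)):
#   se = se_linear_combination([0.0, 1.0, -1.0], cov_beta)
```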
2. The new "variable" takes on the value 1 for all cases.

Select Transform->Compute. The following dialogue box appears. Type in the name of the target variable (constnt) and type '1' for the Numeric Expression. This variable is now added to the data file.

Add this new variable (constnt) to the list of independent variables. Under Options, make sure the box "Include constant in equation" is unchecked. The coefficient of the new variable will be the constant.

Here are the estimates of the parameters with their standard errors.

Coefficients(a,b)

Model 1        B (Unstandardized)    Std. Error    Beta (Standardized)    t         Sig.
ENGINE         -5.53E-03             .007          -.049                  -.786     .432
HORSE          -5.56E-02             .013          -.250                  -4.153    .000
WEIGHT         -4.62E-03             .001          -.577                  -6.186    .000
CONSTNT        44.015                1.272         1.781                  34.597    .000

a. Dependent Variable: MPG
b. Linear Regression through the Origin

Note the agreement of the parameter estimates and their standard errors with those previously calculated.

Here is the correlation matrix and the covariance matrix of the estimates.

Coefficient Correlations(a,b)

Model 1                      CONSTNT       ENGINE        HORSE         WEIGHT
Correlations   CONSTNT       1.000         .761          -.318         -.824
               ENGINE        .761          1.000         -.518         -.725
               HORSE         -.318         -.518         1.000         -.129
               WEIGHT        -.824         -.725         -.129         1.000
Covariances    CONSTNT       1.619         6.808E-03     -5.43E-03     -7.82E-04
               ENGINE        6.808E-03     4.941E-05     -4.88E-05     -3.81E-06
               HORSE         -5.43E-03     -4.88E-05     1.794E-04     -1.29E-06
               WEIGHT        -7.82E-04     -3.81E-06     -1.29E-06     5.571E-07

a. Dependent Variable: MPG
b. Linear Regression through the Origin

Testing Hypotheses related to Multiple Regression

The General Linear Hypothesis

H0: h11β1 + h12β2 + h13β3 + ... + h1pβp = h1
    h21β1 + h22β2 + h23β3 + ... + h2pβp = h2
    ...
    hq1β1 + hq2β2 + hq3β3 + ... + hqpβp = hq

where h11, h12, h13, ..., hqp and h1, h2, h3, ..., hq are known coefficients.

Examples
1. H0: β1 = 0
2. H0: β1 = 0, β2 = 0, β3 = 0
3. H0: β1 = β2
4. H0: β1 = β2, β3 = β4
5. H0: β1 = 1/2(β2 + β3)
6. H0: β1 = 1/2(β2 + β3), β3 = 1/3(β4 + β5 + β6)

The Complete Model
Y = β0 + β1X1 + β2X2 + β3X3 + ... + βpXp + ε

The Reduced Model
The model implied by H0.

You are interested in knowing whether the complete model can be simplified to the reduced model.

Testing the General Linear Hypothesis

The F-test for H0 is performed by carrying out two runs of a multiple regression package (a code sketch of the same two-run procedure follows below).

Run 1: Fit the complete model, resulting in the following ANOVA table:

Source              df        Sum of Squares
Regression          p         SSReg
Residual (Error)    n-p-1     SSError
Total               n-1       SSTotal

Run 2: Fit the reduced model (q parameters eliminated), resulting in the following ANOVA table:

Source              df          Sum of Squares
Regression          p-q         SS1Reg
Residual (Error)    n-p+q-1     SS1Error
Total               n-1         SSTotal

The Test: The test is carried out using the test statistic
$$F = \frac{\tfrac{1}{q}\,[\text{Reduction in the Residual Sum of Squares}]}{\text{Residual Mean Square for the Complete model}} = \frac{\tfrac{1}{q}\,\mathrm{SS}_{H_0}}{s^2}$$
where SSH0 = SS1Error - SSError = SSReg - SS1Reg and s² = SSError/(n-p-1).

The test statistic F has an F-distribution with ν1 = q d.f. in the numerator and ν2 = n - p - 1 d.f. in the denominator if H0 is true.
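Here is a minimal sketch of the two-run procedure just described (not part of the notes; it assumes Python with numpy and scipy, and the function name is illustrative):

```python
import numpy as np
from scipy import stats

def general_linear_hypothesis_F(X_full, X_reduced, y):
    """Hypothetical helper: F-test of H0 via two regression runs.
    X_full, X_reduced : predictor arrays for the complete and reduced
    models (the reduced model imposes the q constraints of H0)."""
    def rss_and_df(X):
        n = len(y)
        Xd = np.column_stack([np.ones(n), X])    # add the intercept column
        beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
        resid = y - Xd @ beta
        return resid @ resid, n - Xd.shape[1]    # SSError and its d.f.
    rss_full, df_full = rss_and_df(X_full)       # Run 1: complete model
    rss_red, df_red = rss_and_df(X_reduced)      # Run 2: reduced model
    q = df_red - df_full                         # number of constraints
    s2 = rss_full / df_full                      # MSError of complete model
    F = ((rss_red - rss_full) / q) / s2          # [(1/q) SS_H0] / s^2
    p_value = stats.f.sf(F, q, df_full)          # tail area of F(q, n-p-1)
    return F, p_value
```

This handles hypotheses that set coefficients to zero directly; for constraints that fix coefficients at nonzero values (e.g. β3 = 4.5), fit the reduced model with the adjusted dependent variable (e.g. Y - 4.5X3), as in the examples that follow.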
[Figure: density of the F-distribution when H0 is true, with the critical region in the right tail.]

The Critical Region
Reject H0 if F > Fα(q, n - p - 1).

The Anova Table for the Test:

Source                            df       Sum of Squares    Mean Square          F
Regression (for reduced model)    p-q      SS1Reg            [1/(p-q)]SS1Reg      MS1Reg/s²
Departure from H0                 q        SSH0              (1/q)SSH0            MSH0/s²
Residual (Error)                  n-p-1    SSError           s²
Total                             n-1      SSTotal

Some Examples: Four independent variables X1, X2, X3, X4.

The Complete Model
Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + ε

1) a) H0: β3 = 0, β4 = 0 (q = 2)
   b) The Reduced Model: Y = β0 + β1X1 + β2X2 + ε
      Dependent Variable: Y
      Independent Variables: X1, X2

2) a) H0: β3 = 4.5, β4 = 8.0 (q = 2)
   b) The Reduced Model: Y - 4.5X3 - 8.0X4 = β0 + β1X1 + β2X2 + ε
      Dependent Variable: Y - 4.5X3 - 8.0X4
      Independent Variables: X1, X2

Example

Motor Vehicle example

Variables
1. (Y) mpg – Mileage.
2. (X1) engine – Engine size.
3. (X2) horse – Horsepower.
4. (X3) weight – Weight.

Suppose we want to test H0: β1 = 0 against HA: β1 ≠ 0, i.e. engine size (engine) has no effect on mileage (mpg).

The Full model:    Y = β0 + β1X1 + β2X2 + β3X3 + ε
                 (mpg)    (engine)  (horse)  (weight)

The Reduced model: Y = β0 + β2X2 + β3X3 + ε

The ANOVA Table for the Full model:

ANOVA(b)
Model 1       Sum of Squares    df     Mean Square    F          Sig.
Regression    16098.158         3      5366.053       269.664    .000(a)
Residual      7720.836          388    19.899
Total         23818.993         391

a. Predictors: (Constant), WEIGHT, HORSE, ENGINE
b. Dependent Variable: MPG

The ANOVA Table for the Reduced model:

ANOVA(b)
Model 1       Sum of Squares    df     Mean Square    F          Sig.
Regression    16085.855         2      8042.928       404.583    .000(a)
Residual      7733.138          389    19.880
Total         23818.993         391

a. Predictors: (Constant), WEIGHT, HORSE
b. Dependent Variable: MPG

The reduction in the residual sum of squares
= 7733.138452 - 7720.835649 = 12.30280251.

The ANOVA Table for testing H0: β1 = 0 against HA: β1 ≠ 0:

Source                        Sum of Squares    df     Mean Square     F            Sig.
Regression (β1 = 0)           16085.85502       2      8042.927509     404.18628    0.0000
Departure from H0 (β1 ≠ 0)    12.30280251       1      12.30280251     0.6182605    0.4322
Residual                      7720.835649       388    19.89906095
Total                         23818.99347       391

Now suppose we want to test H0: β1 = 0, β2 = 0 against HA: β1 ≠ 0 or β2 ≠ 0, i.e. engine size (engine) and horsepower (horse) have no effect on mileage (mpg).

The Full model:    Y = β0 + β1X1 + β2X2 + β3X3 + ε
                 (mpg)    (engine)  (horse)  (weight)

The Reduced model: Y = β0 + β3X3 + ε

The ANOVA Table for the Full model is as given above.

The ANOVA Table for the Reduced model:

ANOVA(b)
Model 1       Sum of Squares    df     Mean Square    F          Sig.
Regression    15519.970         1      15519.970      729.337    .000(a)
Residual      8299.023          390    21.280
Total         23818.993         391

a. Predictors: (Constant), WEIGHT
b. Dependent Variable: MPG

The reduction in the residual sum of squares
= 8299.023 - 7720.835649 = 578.1875392.

The ANOVA Table for testing H0: β1 = 0, β2 = 0 against HA: β1 ≠ 0 or β2 ≠ 0:

Source                                 Sum of Squares    df     Mean Square     F            Sig.
Regression (β1 = 0, β2 = 0)            15519.97028       1      15519.97028     779.93481    0.0000
Departure from H0 (β1 ≠ 0 or β2 ≠ 0)   578.1875392       2      289.0937696     14.528011    0.0000
Residual                               7720.835649       388    19.89906095
Total                                  23818.99347       391
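As a check on the arithmetic, the two F statistics above can be reproduced directly from the reported sums of squares. A short sketch, assuming Python with scipy (the numbers are transcribed from the SPSS output above; only the arithmetic is new):

```python
from scipy import stats

rss_full = 7720.835649            # full-model SSError, df = 388
s2 = rss_full / 388               # MSError = s^2 = 19.899...

# Test 1: H0: beta1 = 0 (q = 1)
ss_h0 = 7733.138452 - rss_full    # reduction in residual SS = 12.3028
F1 = (ss_h0 / 1) / s2             # = 0.618..., matching 0.6182605
p1 = stats.f.sf(F1, 1, 388)       # = 0.432, matching the table

# Test 2: H0: beta1 = 0, beta2 = 0 (q = 2)
ss_h0 = 8299.023 - rss_full       # reduction in residual SS = 578.188
F2 = (ss_h0 / 2) / s2             # = 14.53, matching 14.528011
p2 = stats.f.sf(F2, 2, 388)       # ~ 0.0000
```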