Transcript Slide 1
Inference for regression - more details about simple linear regression
IPS chapter 10.2
© 2006 W.H. Freeman and Company

Objectives (IPS chapter 10.2): Inference for regression - more details
- Analysis of variance for regression
- Calculations for regression inference
- Inference for correlation

Analysis of variance for regression

The regression model is: Data = fit + residual, or

yᵢ = (β₀ + β₁xᵢ) + εᵢ

where the εᵢ are independent and normally distributed N(0, σ), and σ is the same for all values of x.

It resembles an ANOVA, which also assumes equal variance, where

SST = SSmodel + SSerror and DFT = DFmodel + DFerror

For a simple linear relationship, the ANOVA tests the hypotheses H₀: β₁ = 0 versus Hₐ: β₁ ≠ 0 by comparing MSM (model) to MSE (error): F = MSM/MSE. When H₀ is true, F follows the F(1, n − 2) distribution, and the p-value is the tail area above the observed F. The ANOVA F test and the two-sided t test for H₀: β₁ = 0 yield the same p-value. Software output for regression may provide t, F, or both, along with the p-value.

ANOVA table

Source   Sum of squares SS    DF      Mean square MS    F         P-value
Model    SSM = Σ(ŷᵢ − ȳ)²     1       MSM = SSM/DFM     MSM/MSE   Tail area above F
Error    SSE = Σ(yᵢ − ŷᵢ)²    n − 2   MSE = SSE/DFE
Total    SST = Σ(yᵢ − ȳ)²     n − 1

SST = SSM + SSE and DFT = DFM + DFE

The regression standard error s for n sample data points is calculated from the residuals eᵢ = yᵢ − ŷᵢ:

s² = Σeᵢ² / (n − 2) = Σ(yᵢ − ŷᵢ)² / (n − 2) = SSE/DFE = MSE

s² = MSE is an unbiased estimate of the regression variance σ², and s estimates the regression standard deviation σ.

Coefficient of determination, r²

The coefficient of determination r², the square of the correlation coefficient, is the percentage of the variance in y (vertical scatter from the regression line) that can be explained by changes in x:

r² = (variation in y explained by x, i.e., by the regression line) / (total variation in observed y values around the mean)

r² = Σ(ŷᵢ − ȳ)² / Σ(yᵢ − ȳ)² = SSM/SST

What is the relationship between the average speed a car is driven and its fuel efficiency? We plot fuel efficiency (in miles per gallon, MPG) against average speed (in miles per hour, MPH) for a random sample of 60 cars. The relationship is curved. When speed is log-transformed (log of miles per hour, LOGMPH), the new scatterplot shows a positive, linear relationship.

Calculations for regression inference

To estimate the parameters of the regression, we calculate the standard errors of the estimated regression coefficients.

The standard error of the least-squares slope b₁ is:

SE(b₁) = s / √Σ(xᵢ − x̄)²

The standard error of the intercept b₀ is:

SE(b₀) = s √(1/n + x̄² / Σ(xᵢ − x̄)²)

To estimate or predict future responses, we calculate the following standard errors. The standard error of the estimated mean response µ_y at a given value x* is:

SE(µ̂) = s √(1/n + (x* − x̄)² / Σ(xᵢ − x̄)²)

The standard error for predicting an individual response ŷ at a given value x* is:

SE(ŷ) = s √(1 + 1/n + (x* − x̄)² / Σ(xᵢ − x̄)²)
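To make these formulas concrete, here is a minimal Python sketch, assuming NumPy and SciPy are available; the small data set and the value x* = 4.5 are invented purely for illustration. It carries out the ANOVA decomposition, the equivalent F and t tests of H₀: β₁ = 0, and the confidence and prediction intervals built from the standard errors just listed.

```python
# Minimal sketch of the regression-inference formulas above (NumPy/SciPy).
# The data and the value x_star are made up for illustration only.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1, 7.8, 9.2])
n = len(x)

# Least-squares fit: b1 = slope, b0 = intercept
x_bar, y_bar = x.mean(), y.mean()
Sxx = np.sum((x - x_bar) ** 2)
b1 = np.sum((x - x_bar) * (y - y_bar)) / Sxx
b0 = y_bar - b1 * x_bar
y_hat = b0 + b1 * x

# ANOVA decomposition: SST = SSM + SSE
SSM = np.sum((y_hat - y_bar) ** 2)
SSE = np.sum((y - y_hat) ** 2)
SST = np.sum((y - y_bar) ** 2)

# Regression standard error: s^2 = SSE / (n - 2) = MSE
MSE = SSE / (n - 2)
s = np.sqrt(MSE)

# F test of H0: beta1 = 0 against the F(1, n - 2) distribution
F = (SSM / 1) / MSE
p_F = stats.f.sf(F, 1, n - 2)

# Standard errors of the estimated coefficients, and the t test for the slope
SE_b1 = s / np.sqrt(Sxx)
SE_b0 = s * np.sqrt(1 / n + x_bar ** 2 / Sxx)
t_slope = b1 / SE_b1
p_t = 2 * stats.t.sf(abs(t_slope), n - 2)

# Coefficient of determination
r2 = SSM / SST

# 95% confidence interval for the mean response and 95% prediction
# interval for an individual response at a new value x_star
x_star = 4.5
t_crit = stats.t.ppf(0.975, n - 2)
mu_hat = b0 + b1 * x_star
SE_mu = s * np.sqrt(1 / n + (x_star - x_bar) ** 2 / Sxx)
SE_pred = s * np.sqrt(1 + 1 / n + (x_star - x_bar) ** 2 / Sxx)
ci = (mu_hat - t_crit * SE_mu, mu_hat + t_crit * SE_mu)
pi = (mu_hat - t_crit * SE_pred, mu_hat + t_crit * SE_pred)

print(f"F = {F:.2f}, t = {t_slope:.2f}, t^2 = {t_slope**2:.2f}")  # t^2 equals F
print(f"p-value (F) = {p_F:.4f}, p-value (t) = {p_t:.4f}")        # identical
print(f"s = {s:.3f}, SE(b0) = {SE_b0:.3f}, SE(b1) = {SE_b1:.3f}, r^2 = {r2:.3f}")
print(f"95% CI for mean response at x* = {x_star}: {ci}")
print(f"95% PI for an individual response at x* = {x_star}: {pi}")
```

Because F = t² in simple linear regression, the two printed p-values agree, as claimed above.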
1918 flu epidemic

Week     # cases diagnosed    # deaths reported
1        36                   0
2        531                  0
3        4233                 130
4        8682                 552
5        7164                 738
6        2229                 414
7        600                  198
8        164                  90
9        57                   56
10       722                  50
11       1517                 71
12       1828                 137
13       1539                 178
14       2416                 194
15       3148                 290
16       3465                 310
17       1440                 149

[Figure: 1918 influenza epidemic. Line graph of the weekly # cases diagnosed (left axis, 0 to 10,000) and # deaths reported (right axis, 0 to 800) over weeks 1 through 17.]

The line graph suggests that about 7 to 8% of those diagnosed with the flu died within about a week of the diagnosis. We look at the relationship between the number of deaths in a given week and the number of new diagnosed cases one week earlier.

[Figure: 1918 flu epidemic. Scatterplot of the number of deaths in a given week against the number of new diagnosed cases one week earlier; r = 0.91.]

MINITAB - Regression Analysis: FluDeaths1 versus FluCases0

The regression equation is FluDeaths1 = 49.3 + 0.0722 FluCases0

Predictor    Coef       SE Coef    T       P
Constant     49.29      29.85      1.65    0.121
FluCases     0.072222   0.008741   8.26    0.000

S = 85.07    R-Sq = 83.0%    R-Sq(adj) = 81.8%

Here S is s = √MSE, the SE Coef column gives SE(b₀) and SE(b₁), R-Sq is r² = SSM/SST, and the P for FluCases is the p-value for H₀: β₁ = 0 versus Hₐ: β₁ ≠ 0.

Analysis of Variance

Source            DF    SS        MS        F        P
Regression         1    494041    494041    68.27    0.000
Residual Error    14    101308    7236
Total             15    595349

In this table the Regression SS is SSM, the Total SS is SST, and the Residual Error MS is MSE = s².

Inference for correlation

To test the null hypothesis of no linear association, we also have the choice of using the correlation parameter ρ. When x is clearly the explanatory variable, this test is equivalent to testing the hypothesis H₀: β₁ = 0, because b₁ = r (s_y / s_x).

When there is no clear explanatory variable (e.g., arm length vs. leg length), a regression of x on y is no more legitimate than one of y on x. In that case, the correlation test of significance should be used. Technically, in that case, the test is a test of independence, much like the one we saw in an earlier chapter on contingency tables.

The test of significance for ρ uses the one-sample t test of H₀: ρ = 0. We compute the t statistic for sample size n and correlation coefficient r. This calculation turns out to be identical to the t statistic based on the slope:

t = r √(n − 2) / √(1 − r²)
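As a numerical check of this equivalence, here is a minimal Python sketch, assuming NumPy and SciPy are available. It re-enters the weekly counts from the table above, forms the 16 pairs of deaths in a given week with new cases diagnosed one week earlier (the pairing used in the MINITAB run), and computes both t statistics; the fitted slope, intercept, r, and t should match the output above up to rounding.

```python
# Minimal sketch (NumPy/SciPy): the correlation t statistic
# t = r*sqrt(n - 2)/sqrt(1 - r^2) is identical to the t statistic for the slope.
# Data are the weekly 1918 flu counts from the table above.
import numpy as np
from scipy import stats

cases = np.array([36, 531, 4233, 8682, 7164, 2229, 600, 164, 57,
                  722, 1517, 1828, 1539, 2416, 3148, 3465, 1440])
deaths = np.array([0, 0, 130, 552, 738, 414, 198, 90, 56,
                   50, 71, 137, 178, 194, 290, 310, 149])

flu_cases0 = cases[:-1]    # cases in weeks 1..16
flu_deaths1 = deaths[1:]   # deaths in weeks 2..17 (one week later)
n = len(flu_cases0)

# Least-squares regression of deaths on the previous week's cases
fit = stats.linregress(flu_cases0, flu_deaths1)
t_slope = fit.slope / fit.stderr          # t statistic for H0: beta1 = 0

# Correlation t statistic for H0: rho = 0
r, p_r = stats.pearsonr(flu_cases0, flu_deaths1)
t_corr = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)

print(f"fitted line: deaths = {fit.intercept:.1f} + {fit.slope:.4f} * cases")
print(f"r = {r:.2f}")
print(f"t (slope) = {t_slope:.2f}, t (correlation) = {t_corr:.2f}")
print(f"two-sided p-values: {fit.pvalue:.4f} (slope), {p_r:.4f} (correlation)")
```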