Lecture 9: Simple Linear Regression — ANOVA for Regression (10.2)

Analysis of variance for regression

The regression model is: Data = fit + residual

y_i = (β0 + β1 x_i) + ε_i

where the ε_i are independent and normally distributed N(0, σ), and σ is the same for all values of x.

Sums of squares measure the variation present in the responses. The total variation can be partitioned as:

SST = SSM + SSE
DFT = DFM + DFE

For a simple linear relationship, the ANOVA tests the hypotheses H0: β1 = 0 versus Ha: β1 ≠ 0 by comparing MSM (model) to MSE (error):

F = MSM / MSE

When H0 is true, F follows the F(1, n − 2) distribution, and the p-value is the tail area P(F > f). The ANOVA F test and the two-sided t test for H0: β1 = 0 yield the same p-value. Software output for regression may provide t, F, or both, along with the p-value.

ANOVA table

Source   Sum of squares SS     DF       Mean square MS   F          P-value
Model    Σ(ŷ_i − ȳ)²           1        SSM/DFM          MSM/MSE    Tail area above F
Error    Σ(y_i − ŷ_i)²         n − 2    SSE/DFE
Total    Σ(y_i − ȳ)²           n − 1

SST = SSM + SSE and DFT = DFM + DFE.

The estimate s of the regression standard deviation, for n sample data points, is calculated from the residuals e_i = y_i − ŷ_i:

s² = Σ e_i² / (n − 2) = Σ(y_i − ŷ_i)² / (n − 2) = SSE / DFE = MSE

s² = MSE is an unbiased estimate of the regression variance σ².

Coefficient of determination, r²

The coefficient of determination r², the square of the correlation coefficient, is the proportion of the variation in y (vertical scatter from the regression line) that can be explained by changes in x:

r² = (variation in y explained by x, i.e., the regression line) / (total variation in observed y values around the mean)

r² = Σ(ŷ_i − ȳ)² / Σ(y_i − ȳ)² = SSM / SST

Example: What is the relationship between the average speed a car is driven and its fuel efficiency? We plot fuel efficiency (in miles per gallon, MPG) against average speed (in miles per hour, MPH) for a random sample of 60 cars. The relationship is curved. When speed is log-transformed (log of miles per hour, LOGMPH), the new scatterplot shows a positive, linear relationship.
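The ANOVA decomposition and F statistic can be computed directly from the least-squares fit. A minimal sketch in Python with NumPy, using small made-up (x, y) data rather than the car-speed data from the lecture:

```python
import numpy as np

# Hypothetical sample data (not from the lecture example).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1])
n = len(x)

# Least-squares estimates: slope b1 and intercept b0.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

# Partition the total variation: SST = SSM + SSE.
SST = np.sum((y - y.mean()) ** 2)
SSM = np.sum((y_hat - y.mean()) ** 2)
SSE = np.sum((y - y_hat) ** 2)

# Mean squares and the ANOVA F statistic with (1, n - 2) degrees of freedom.
MSM = SSM / 1
MSE = SSE / (n - 2)
F = MSM / MSE

# Coefficient of determination.
r_squared = SSM / SST

print(np.isclose(SSM + SSE, SST))  # the partition holds exactly
```

The p-value would then be the upper tail area of F(1, n − 2) beyond the observed F, which statistical software reports alongside the table.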
SST (the total sum of squares) is the sum of the two components, SSM and SSE. R-squared is the ratio SSM/SST = 494/552 ≈ 0.895. In this case both tests check the same thing, which is why the p-value is identical.

Calculations for regression inference

To estimate the parameters of the regression, we calculate the standard errors for the estimated regression coefficients.

The standard error of the least-squares slope b1 is:

SE_b1 = s / √( Σ(x_i − x̄)² )

The standard error of the intercept b0 is:

SE_b0 = s √( 1/n + x̄² / Σ(x_i − x̄)² )

To estimate or predict future responses at a value x*, we calculate the following standard errors.

The standard error of the mean response μ̂_y is:

SE_μ̂ = s √( 1/n + (x* − x̄)² / Σ(x_i − x̄)² )

The standard error for predicting an individual response ŷ is:

SE_ŷ = s √( 1 + 1/n + (x* − x̄)² / Σ(x_i − x̄)² )

Example: 1918 influenza epidemic

Weekly counts of new diagnosed flu cases and flu deaths during the 1918 epidemic:

Week   # Cases   # Deaths
1      36        0
2      531       0
3      4233      130
4      8682      552
5      7164      738
6      2229      414
7      600       198
8      164       90
9      57        56
10     722       50
11     1517      71
12     1828      137
13     1539      178
14     2416      194
15     3148      290
16     3465      310
17     1440      149

[Figure: line graphs of # cases diagnosed and # deaths reported over weeks 1–17.] The line graph suggests that about 7 to 8% of those diagnosed with the flu died within about a week of the diagnosis. We look at the relationship between the number of deaths in a given week and the number of new diagnosed cases one week earlier.

Software output for this regression reports SE_b0 and SE_b1, s = √MSE, R² = SSM/SST, and the p-value for the test of H0: β1 = 0 versus Ha: β1 ≠ 0.

Inference for correlation

To test the null hypothesis of no linear association, we also have the choice of using the correlation parameter ρ.
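The standard-error formulas for the slope, intercept, mean response, and individual prediction can be sketched in Python with NumPy. The (x, y) data and the new value x_star below are hypothetical, chosen only to illustrate the calculations:

```python
import numpy as np

# Hypothetical sample data (not the flu data from the lecture).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1])
n = len(x)

# Least-squares fit and residuals.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

# s estimates sigma: s^2 = SSE / (n - 2).
s = np.sqrt(np.sum(resid ** 2) / (n - 2))
Sxx = np.sum((x - x.mean()) ** 2)

# Standard errors of the coefficients.
SE_b1 = s / np.sqrt(Sxx)
SE_b0 = s * np.sqrt(1.0 / n + x.mean() ** 2 / Sxx)

# Standard errors at a new x value (hypothetical x_star).
x_star = 3.5
SE_mean = s * np.sqrt(1.0 / n + (x_star - x.mean()) ** 2 / Sxx)        # mean response
SE_pred = s * np.sqrt(1.0 + 1.0 / n + (x_star - x.mean()) ** 2 / Sxx)  # individual response
```

Note that SE_pred is always larger than SE_mean: predicting a single future observation carries the extra variability of one individual response on top of the uncertainty in the estimated regression line.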
When x is clearly the explanatory variable, this test is equivalent to testing the hypothesis H0: β1 = 0, because the slope and the correlation are related by:

b1 = r (s_y / s_x)

When there is no clear explanatory variable (e.g., arm length vs. leg length), a regression of x on y is not any more legitimate than one of y on x. In that case, the correlation test of significance should be used. When both x and y are normally distributed, H0: ρ = 0 tests for no association of any kind between x and y, not just linear associations.

The test of significance for ρ uses a one-sample t test of H0: ρ = 0. We compute the t statistic from the sample size n and the sample correlation coefficient r:

t = r √(n − 2) / √(1 − r²)

The p-value is the area under the t(n − 2) distribution for values of T as extreme as t, or more so, in the direction of Ha.

Example: relationship between average car speed and fuel efficiency. Software reports r, the p-value, and n. There is a significant correlation (r is not 0) between fuel efficiency (MPG) and the logarithm of average speed (LOGMPH).
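The t statistic for the correlation test is quick to compute by hand. A minimal sketch, using hypothetical values of r and n rather than the lecture's car-speed output:

```python
import math

# Hypothetical sample values (not from the lecture example).
r = 0.62   # sample correlation coefficient
n = 30     # sample size

# t statistic for H0: rho = 0, with n - 2 degrees of freedom.
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
print(t)

# The two-sided p-value is the area in both tails of t(n - 2) beyond |t|;
# with SciPy it could be found as 2 * scipy.stats.t.sf(abs(t), df=n - 2).
```

The same t value (up to sign conventions) comes out of testing H0: β1 = 0 in the regression of y on x, which is why the two approaches give identical p-values when x is the explanatory variable.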