Transcript Slide 1
Statistics for Business and Economics, Module 2: Regression and time series analysis. Spring 2010. Lecture 4: Inference about regression. Priyantha Wijayatunga, Department of Statistics, Umeå University, [email protected]. These materials are adapted from copyrighted lecture slides (© 2009 W.H. Freeman and Company) from the homepage of the book The Practice of Business Statistics: Using Data for Decisions, Second Edition, by Moore, McCabe, Duckworth and Alwan.

Inference about the regression model and using the model (reference to the book: Chapters 10.1, 10.2 and 10.3):
Statistical model for simple linear regression
Estimating the regression parameters and standard errors
Conditions for regression inference
Confidence intervals and significance tests
Inference about correlation
Confidence and prediction intervals
Analysis of variance for regression and coefficient of determination

Error variable: required conditions
Linear regression model: Y = β0 + β1X + e
The error e is a critical part of the regression model. Four requirements involving the distribution of e must be satisfied:
The probability distribution of e is normal.
The mean of e is zero: E(e) = 0.
The standard deviation of e is σ for all values of x.
The errors associated with different values of y are all independent.
Our estimated model: Ŷ = b0 + b1X, for example ŷ = 0.125x + 41.4.

The data in a scatterplot are a random sample from a population that may exhibit a linear relationship between x and y. Different samples give different plots. We now want to describe the population mean response μy as a function of the explanatory variable x, μy = β0 + β1x, and to assess whether the observed relationship is statistically significant (not entirely explained by chance events due to random sampling).

Statistical model for simple linear regression
In the population, the linear regression equation is μy = β0 + β1x.
Sample data then fit the model: Data = fit + residual, that is, yi = (β0 + β1xi) + ei, where the ei are independent and normally distributed N(0, σ). Linear regression assumes equal variance of y (σ is the same for all values of x).

Estimating the regression parameters
μy = β0 + β1x. The intercept β0, the slope β1, and the standard deviation σ of y are the unknown parameters of the regression model. We rely on the random sample data to provide unbiased estimates of these parameters. The value of ŷ from the least-squares regression line is really a prediction of the mean value of y (μy) for a given value of x. The least-squares regression line ŷ = b0 + b1x obtained from sample data is the best estimate of the true population regression line μy = β0 + β1x:
ŷ is an unbiased estimate of the mean response μy,
b0 is an unbiased estimate of the intercept β0,
b1 is an unbiased estimate of the slope β1.

Regression standard error s (standard error of estimate)
The standard deviation of the error variable shows the dispersion around the true line for a given x. If σ is big, we have big dispersion around the true line. If σ is small, the observations tend to be close to the line, and the model fits the data well. We can therefore use σ as a measure of the suitability of a linear model. However, we usually do not know σ, so we estimate it by s. The population standard deviation σ for y at any given value of x represents the spread of the normal distribution of the ei around the mean μy. The regression standard error s for n sample data points is calculated from the residuals (yi − ŷi):

  s = √( Σ residual² / (n − 2) ) = √( Σ (yi − ŷi)² / (n − 2) )

s is an unbiased estimate of the regression standard deviation σ.

Conditions for regression inference
The observations are independent (the error terms of different observations should be independent of each other).
The relationship is indeed linear.
The standard deviation of y, σ, is the same for all values of x.
The response y varies normally around its mean.
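The least-squares fit and the regression standard error s can be computed directly from the defining formulas. A minimal plain-Python sketch, using the ADVER/SALES data from the food company example that appears later in this lecture:

```python
import math

x = [276, 552, 720, 648, 336, 396, 1056, 1188, 372]                  # ADVER
y = [115.0, 135.6, 153.6, 117.6, 106.8, 150.0, 164.4, 190.8, 136.8]  # SALES
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = Sxy / Sxx                       # least-squares slope
b0 = ybar - b1 * xbar                # least-squares intercept
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
SSE = sum(e ** 2 for e in residuals)
s = math.sqrt(SSE / (n - 2))         # regression standard error
print(round(s, 2))                   # ≈ 16.09, as computed later in the slides
```

Note the n − 2 in the denominator: two degrees of freedom are used up estimating b0 and b1.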
That is, the error term should be normally distributed with mean zero.

Using residual plots to check regression validity
The residuals (y − ŷ) give useful information about the contribution of individual data points to the overall pattern of scatter. We view the residuals in a residual plot. If the residuals are scattered randomly around 0 with uniform variation, it indicates that the data fit a linear model, with normally distributed residuals for each value of x and constant standard deviation σ.
Residuals randomly scattered: good!
Curved pattern: the relationship is not linear.
Change in variability across the plot: σ is not equal for all values of x.

What is the relationship between the average speed a car is driven and its fuel efficiency? We plot fuel efficiency (in miles per gallon, MPG) against average speed (in miles per hour, MPH) for a random sample of 60 cars. The relationship is curved. When speed is log-transformed (log of miles per hour, LOGMPH), the new scatterplot shows a positive, linear relationship.
Residual plot: the spread of the residuals is reasonably random, with no clear pattern. The relationship is indeed linear. But we see one low residual (3.8, −4) and one potentially influential point (2.5, 0.5).
Normal quantile plot for residuals: the plot is fairly straight, supporting the assumption of normally distributed residuals. The data are okay for inference.

Checking normality of residuals
Most departures from the required conditions can be diagnosed by residual analysis:

  standardized residual = (residual − mean of the residuals) / standard deviation of the residuals

In our case the residuals have mean zero, so the standardized ith residual is ei / s.

Food company example:
For the 1st data point: when ADVER = 276 and SALES = 115.0, predicted SALES = 118.0087, so the residual = 115.0 − 118.0087 = −3.008726, and the standardized residual = −3.008726 / 16.09 = −0.1869935.

Checking normality
Non-normality of the residuals can be checked by making a histogram of the residuals. [Histogram of the residuals for the Food Company example, residual values from −30 to 30: roughly bell-shaped around 0.]

Heteroscedasticity: the variance of the errors is not constant (violation of the requirement). Homoscedasticity: the variance of the errors is constant (no violation of the requirement). To check this, and non-independence of the error variable, plot the residuals against the values of Y predicted by the model. [Plot of residuals against predicted SALES values, 120 to 180.]

Confidence interval for regression parameters
Estimating the regression parameters β0, β1 is a case of one-sample inference with unknown population standard deviation σ. We rely on the t distribution, with n − 2 degrees of freedom. A level C confidence interval for the slope β1 is: b1 ± t* SEb1. A level C confidence interval for the intercept β0 is: b0 ± t* SEb0. Here t* is the critical value for the t(n − 2) distribution with area C between −t* and +t*.

Significance test for the slope
We can test the hypothesis H0: β1 = 0 versus a 1- or 2-sided alternative. We calculate t = b1 / SEb1, which has the t(n − 2) distribution, to find the p-value of the test. Note: software typically provides two-sided p-values.

Standard errors of b1 and b0
To draw inferences about the parameters of the regression, we calculate the standard errors of the estimated regression coefficients. The standard error of the least-squares slope b1 is:

  SEb1 = s / √( Σ(xi − x̄)² )

The standard error of the intercept b0 is:

  SEb0 = s √( 1/n + x̄² / Σ(xi − x̄)² )

Testing the hypothesis of no relationship
We may look for evidence of a significant relationship between variables x and y in the population from which our data were drawn. For that, we can test the hypothesis that the regression slope parameter β1 is equal to zero: H0: β1 = 0 vs.
Ha: β1 ≠ 0. Since b1 = r (sy / sx), testing H0: β1 = 0 also allows us to test the hypothesis of no correlation between x and y in the population. Note: a test of hypothesis for β0 is usually irrelevant (x = 0 is often not even achievable).

Example: ADVER and SALES
ŜALES = 99.201 + 0.06806 × ADVER

i     | xi   | yi     | (xi − x̄)² | ŷi       | (yi − ŷi)²
1     | 276  | 115.0  | 115600    | 118.0087 | 9.052434
2     | 552  | 135.6  | 4096      | 136.8165 | 1.479981
3     | 720  | 153.6  | 10816     | 148.2648 | 28.464554
4     | 648  | 117.6  | 1024      | 143.3584 | 663.494881
5     | 336  | 106.8  | 78400     | 122.0974 | 234.009911
6     | 396  | 150.0  | 48400     | 126.1860 | 567.104757
7     | 1056 | 164.4  | 193600    | 171.1613 | 45.714584
8     | 1188 | 190.8  | 327184    | 180.1563 | 113.288358
9     | 372  | 136.8  | 59536     | 124.5506 | 150.048384
Total | 5544 | 1270.6 | 838656    |          | 1812.658
Mean  | 616  | 141.178|           |          |

  s_x² = Σ(xi − x̄)² / (n − 1) = 838656 / 8 = 104832

Regression standard error (standard error of the estimate):

  s = √( SSE / (n − 2) ) = √( 1812.658 / (9 − 2) ) = 16.09196

Standard error of b1:

  SEb1 = s / √( (n − 1) s_x² ) = 16.09196 / √( (9 − 1) × 104832 ) = 0.01757183

H0: β1 = 0, H1: β1 ≠ 0. T statistic: t = b1 / SEb1 = 0.06806 / 0.01757183 = 3.87. If H0 is true, the test statistic has a t-distribution with df = 9 − 2 = 7. P-value < 2 × 0.005 = 0.01: reject H0 at the 0.01 level of significance.

Example: ADVER and SALES, SPSS output
Coefficients (dependent variable: SALES):
Model      | B      | Std. Error | Beta | t     | Sig.
(Constant) | 99.201 | 12.080     |      | 8.212 | .000
ADVER      | .068   | .018       | .826 | 3.878 | .006

95.0% confidence intervals for B (dependent variable: SALES):
(Constant): 70.635 to 127.767; ADVER: .027 to .110

Using technology
Computer software runs all the computations for regression analysis. Here is some software output for the car speed/gas efficiency example. The SPSS output shows the slope, the intercept, p-values for the tests of significance, and confidence intervals. The t-test for the regression slope is highly significant (p < 0.001). There is a significant relationship between average car speed and gas efficiency.
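The slope test for the ADVER/SALES example can be reproduced from the raw data. A plain-Python sketch (working from unrounded intermediate values, so the t statistic comes out as 3.878 rather than the slides' rounded 3.87):

```python
import math

x = [276, 552, 720, 648, 336, 396, 1056, 1188, 372]                  # ADVER
y = [115.0, 135.6, 153.6, 117.6, 106.8, 150.0, 164.4, 190.8, 136.8]  # SALES
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)                              # 838656
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / Sxx
b0 = ybar - b1 * xbar
SSE = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(SSE / (n - 2))     # regression standard error
SE_b1 = s / math.sqrt(Sxx)       # standard error of the slope
t = b1 / SE_b1                   # t statistic, df = n - 2 = 7
print(round(b1, 5), round(SE_b1, 5), round(t, 2))  # ≈ 0.06814 0.01757 3.88
```

The two-sided p-value from t(7) is about 0.006, matching the SPSS column Sig. = .006.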
Excel output: the row labelled “intercept” gives the intercept and the row labelled “logmph” gives the slope. SAS output likewise shows the p-values for the tests of significance and the confidence intervals.

Inference for correlation
To test the null hypothesis of no linear association, we also have the choice of using the correlation parameter ρ. When x is clearly the explanatory variable, this test is equivalent to testing the hypothesis H0: β1 = 0, since b1 = r (sy / sx). When there is no clear explanatory variable (e.g., arm length vs. leg length), a regression of x on y is no more legitimate than one of y on x. In that case, the correlation test of significance should be used. When both x and y are normally distributed, H0: ρ = 0 tests for no association of any kind between x and y, not just linear associations.

The test of significance for ρ uses the one-sample t-test for H0: ρ = 0. We compute the t statistic for sample size n and correlation coefficient r. The p-value is the area under t(n − 2) for values of T as extreme as t or more in the direction of Ha:

  t = r √(n − 2) / √(1 − r²)

Relationship between average car speed and fuel efficiency: given the observed r, p-value and n, there is a significant correlation (r is not 0) between fuel efficiency (MPG) and the logarithm of average speed (LOGMPH).

Example: ADVER and SALES
Correlation coefficient r = 0.8258. To test H0: ρ = 0 against Ha: ρ ≠ 0, the value of the test statistic is

  t = r √(n − 2) / √(1 − r²) = 0.8258 √(9 − 2) / √(1 − 0.8258²) = 3.87

P-value from the t-distribution with df 7: P-value < 2 × 0.005 = 0.01, so reject H0.

SPSS Correlations output: Pearson correlation between ADVER and SALES = .826 (Sig. 2-tailed = .006, N = 9). The correlation is significant at the 0.01 level (2-tailed).

Confidence interval for regression response
Using inference, we can also calculate a confidence interval for the population mean μy of all responses y when x takes the value x* (within the range of data tested). This interval is centered on ŷ, the unbiased estimate of μy.
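The correlation t statistic is a one-line formula. A small sketch that reproduces the ADVER/SALES value:

```python
import math

def cor_t(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r^2), compared against t(n - 2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

t = cor_t(0.8258, 9)   # ADVER/SALES: r = 0.8258, n = 9
print(round(t, 2))     # ≈ 3.87
```

The same function applied to the 1918 flu example later in the lecture (r = 0.911, n = 16) gives t ≈ 8.27, matching the software output there.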
The true value of the population mean μy at a given value of x will indeed be within our confidence interval in C% of all intervals calculated from many different random samples. The level C confidence interval for the mean response μy at a given value x* of x is:

  ŷ ± t* SEμ̂

where t* is the critical value for the t(n − 2) distribution with area C between −t* and +t*. A separate confidence interval is calculated for μy at each of the values that x takes. Graphically, the series of confidence intervals is shown as a continuous band on either side of ŷ. [Figure: 95% confidence interval for μy.]

Prediction interval for regression response
One use of regression is for predicting the value of y, ŷ, for any value of x within the range of data tested: ŷ = b0 + b1x. But the regression equation depends on the particular sample drawn. More reliable predictions require statistical inference. To estimate an individual response y for a given value of x, we use a prediction interval. If we randomly sampled many times, there would be many different values of y obtained for a particular x, following N(0, σ) around the mean response μy. The level C prediction interval for a single observation on y when x takes the value x* is:

  ŷ ± t* SEŷ

where t* is the critical value for the t(n − 2) distribution with area C between −t* and +t*. The prediction interval represents mainly the error from the normal distribution of the residuals ei. [Figure: 95% prediction interval for ŷ.] Graphically, the series of prediction intervals is shown as a continuous band on either side of ŷ.

The confidence interval for μy contains, with C% confidence, the population mean μy of all responses at a particular value of x. The prediction interval contains C% of all the individual values taken by y at a particular value of x.
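Both intervals use the standard error formulas given on the next slide. A sketch for the ADVER/SALES data, at an illustrative value x* = 600 (my choice; any x within the range of the data would do), with t* = 2.365 taken from a t(7) table for 95% confidence:

```python
import math

x = [276, 552, 720, 648, 336, 396, 1056, 1188, 372]                  # ADVER
y = [115.0, 135.6, 153.6, 117.6, 106.8, 150.0, 164.4, 190.8, 136.8]  # SALES
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / Sxx
b0 = ybar - b1 * xbar
s = math.sqrt(sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))

xstar, tstar = 600, 2.365                                   # 95%, df = 7
yhat = b0 + b1 * xstar                                      # point estimate
se_mu = s * math.sqrt(1 / n + (xstar - xbar) ** 2 / Sxx)    # SE of mean response
se_pred = s * math.sqrt(1 + 1 / n + (xstar - xbar) ** 2 / Sxx)  # SE for prediction
ci = (yhat - tstar * se_mu, yhat + tstar * se_mu)           # CI for mu_y at x*
pi = (yhat - tstar * se_pred, yhat + tstar * se_pred)       # PI for a single y at x*
# The prediction interval is always wider: its SE carries the extra "1 +" term
# for the scatter of an individual observation around the mean response.
```

The "1 +" inside SEŷ is exactly the difference between estimating a mean and predicting an individual.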
[Figure: 95% prediction interval for ŷ and 95% confidence interval for μy around the least-squares line.] Estimating μy uses a smaller confidence interval than estimating an individual in the population (a sampling distribution is narrower than the population distribution). To estimate or predict future responses, we calculate the following standard errors. The standard error of the mean response μy is:

  SEμ̂ = s √( 1/n + (x* − x̄)² / Σ(xi − x̄)² )

The standard error for predicting an individual response ŷ is:

  SEŷ = s √( 1 + 1/n + (x* − x̄)² / Σ(xi − x̄)² )

1918 flu epidemics
[Line graphs of the 1918 influenza epidemic: number of cases diagnosed and number of deaths reported per week, weeks 1 to 17.]

Incidence (week: # cases diagnosed, # deaths reported):
week 1: 36, 0
week 2: 531, 0
week 3: 4233, 130
week 4: 8682, 552
week 5: 7164, 738
week 6: 2229, 414
week 7: 600, 198
week 8: 164, 90
week 9: 57, 56
week 10: 722, 50
week 11: 1517, 71
week 12: 1828, 137
week 13: 1539, 178
week 14: 2416, 194
week 15: 3148, 290
week 16: 3465, 310
week 17: 1440, 149

The line graphs suggest that 7 to 9% of those diagnosed with the flu died within about a week of their diagnosis. We look at the relationship between the number of deaths in a given week and the number of new diagnosed cases one week earlier.

1918 flu epidemic: relationship between the number of deaths in a given week and the number of new diagnosed cases one week earlier: r = 0.91.

EXCEL output
Regression Statistics: Multiple R 0.911; R Square 0.830; Adjusted R Square 0.82; Standard Error 85.07 (= s); Observations 16.

Predictor  | Coefficients | St. Error      | t Stat | P-value | Lower 95% | Upper 95%
Intercept  | 49.292       | 29.845         | 1.652  | 0.1209  | −14.720   | 113.304
FluCases0  | 0.072 (= b1) | 0.009 (= SEb1) | 8.263  | 0.0000  | 0.053     | 0.091

The P-value for H0: β1 = 0 is very small: reject H0; β1 is significantly different from 0. There is a significant relationship between the number of flu cases and the number of deaths from flu a week later.

SPSS: the CI for the mean weekly death count one week after 4000 flu cases are diagnosed puts μy within about 300–380. The prediction interval for a weekly death count one week after 4000 flu cases are diagnosed puts ŷ within about 180–500 deaths. [Figure: least-squares regression line with the 95% prediction interval for ŷ and the 95% confidence interval for μy.]

What is this? A 90% prediction interval for the height (above) and a 90% prediction interval for the weight (below) of male children, ages 3 to 18.

Analysis of variance and coefficient of determination
In the regression model, the variation in the dependent variable Y = variation explained by the independent variable + unexplained variation (random error):

  SST = SSR + SSE

The greater the variation explained by the independent variable, the better the model. The coefficient of determination is a measure of the explanatory power of the model.

Analysis of variance for regression
The regression model is: Data = fit + residual, yi = (β0 + β1xi) + ei, where the ei are independent and normally distributed N(0, σ), and σ is the same for all values of x. For the data:

  Σ(yi − ȳ)² = Σ(ŷi − ȳ)² + Σ(yi − ŷi)²
     SST     =     SSR     +     SSE

with SST = Σ(yi − ȳ)², SSR = Σ(ŷi − ȳ)² (sometimes SSR is called SSM), and SSE = Σ(yi − ŷi)².

Analysis of variance for regression
The regression model is: Data = fit + residual, yi = (β0 + β1xi) + ei, where the ei are independent and normally distributed N(0, σ), and σ is the same for all values of x.
It resembles an ANOVA, which also assumes equal variance, where SST = SS model + SS error and DFT = DF model + DF error. For a simple linear relationship, the ANOVA tests the hypotheses H0: β1 = 0 versus Ha: β1 ≠ 0 by comparing MSM (model) to MSE (error): F = MSM/MSE. When H0 is true, F follows the F(1, n − 2) distribution. The p-value is P(F > f). The ANOVA test and the two-sided t-test for H0: β1 = 0 yield the same p-value. Software output for regression may provide t, F, or both, along with the p-value.

ANOVA table:
Source             | Sum of squares SS | DF    | Mean square MS | F       | P-value
Model (Regression) | Σ(ŷi − ȳ)²        | 1     | SSM/DFM        | MSM/MSE | tail area above F
Error (Residual)   | Σ(yi − ŷi)²       | n − 2 | SSE/DFE        |         |
Total              | Σ(yi − ȳ)²        | n − 1 |                |         |

SST = SSM + SSE; DFT = DFM + DFE.

The standard deviation of the sampling distribution, s, for n sample data points is calculated from the residuals ei = yi − ŷi:

  s² = Σ ei² / (n − 2) = Σ(yi − ŷi)² / (n − 2) = SSE/DFE = MSE

s is an unbiased estimate of the regression standard deviation σ.

[Table: critical values of the F distribution (partial).]

Coefficient of determination, r²
The coefficient of determination r², the square of the correlation coefficient, is the percentage of the variance in y (vertical scatter from the regression line) that can be explained by changes in x.
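The ANOVA F test above can be checked numerically with the food company sums of squares computed elsewhere in this lecture (SST = 5707.076, SSE = 1812.658). A short sketch:

```python
# ANOVA F test for the food company regression (n = 9 observations).
SST, SSE = 5707.076, 1812.658
SSM = SST - SSE        # = SSR = 3894.418, since SST = SSM + SSE
n = 9
MSM = SSM / 1          # DFM = 1 for simple linear regression
MSE = SSE / (n - 2)    # DFE = n - 2 = 7; MSE = s^2
F = MSM / MSE
print(round(F, 3))     # ≈ 15.039, matching the SPSS ANOVA table
# With one slope, F equals the square of the slope t statistic: 3.878^2 ≈ 15.04,
# which is why the two tests give the same p-value.
```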
  r² = (variation in y caused by x, i.e., by the regression line) / (total variation in the observed y values around their mean)

  r² = Σ(ŷi − ȳ)² / Σ(yi − ȳ)² = SSM/SST

Food company example: call X = ADVER and Y = SALES.

i     | xi   | yi     | yi − ȳ     | (yi − ȳ)²  | ŷi       | (yi − ŷi)²
1     | 276  | 115.0  | −26.177778 | 685.27605  | 118.0087 | 9.052434
2     | 552  | 135.6  | −5.577778  | 31.11160   | 136.8165 | 1.479981
3     | 720  | 153.6  | 12.422222  | 154.31160  | 148.2648 | 28.464554
4     | 648  | 117.6  | −23.577778 | 555.91160  | 143.3584 | 663.494881
5     | 336  | 106.8  | −34.377778 | 1181.83160 | 122.0974 | 234.009911
6     | 396  | 150.0  | 8.822222   | 77.83160   | 126.1860 | 567.104757
7     | 1056 | 164.4  | 23.222222  | 539.27160  | 171.1613 | 45.714584
8     | 1188 | 190.8  | 49.622222  | 2462.36494 | 180.1563 | 113.288358
9     | 372  | 136.8  | −4.377778  | 19.16494   | 124.5506 | 150.048384
Total | 5544 | 1270.6 |            | 5707.076   |          | 1812.658
Mean  | 616  | 141.178|            |            |          |

  SST = Σ(yi − ȳ)² = 5707.076
  SSE = Σ(yi − ŷi)² = 1812.658
  SSR = Σ(ŷi − ȳ)² = SST − SSE  (since SST = SSR + SSE)

  R² = 1 − SSE/SST = 1 − 1812.658/5707.076 = 0.6823841

Food company example, SPSS ANOVA output (predictors: (Constant), ADVER; dependent variable: SALES):
Source     | Sum of Squares | df | Mean Square | F      | Sig.
Regression | 3894.418       | 1  | 3894.418    | 15.039 | .006
Residual   | 1812.658       | 7  | 258.951     |        |
Total      | 5707.076       | 8  |             |        |

What is the relationship between the average speed a car is driven and its fuel efficiency? We plot fuel efficiency (in miles per gallon, MPG) against average speed (in miles per hour, MPH) for a random sample of 60 cars. The relationship is curved. When speed is log-transformed (log of miles per hour, LOGMPH), the new scatterplot shows a positive, linear relationship. Using software (SPSS): r² = SSM/SST = 494/552, and the ANOVA F-test and the t-test give the same p-value.

1918 flu epidemics
[Line graphs of the 1918 influenza epidemic, weeks 1 to 17, as shown earlier: number of cases diagnosed and number of deaths reported per week.] The line graphs suggest that about 7 to 8% of those diagnosed with the flu died within about a week of their diagnosis. We look at the relationship between the number of deaths in a given week and the number of new diagnosed cases one week earlier. 1918 flu epidemic: relationship between the number of deaths in a given week and the number of new diagnosed cases one week earlier: r = 0.91.

MINITAB Regression Analysis: FluDeaths1 versus FluCases0
The regression equation is FluDeaths1 = 49.3 + 0.0722 FluCases0

Predictor | Coef     | SE Coef           | T    | P
Constant  | 49.29    | 29.85 (= SEb0)    | 1.65 | 0.121
FluCases  | 0.072222 | 0.008741 (= SEb1) | 8.26 | 0.000

S = 85.07 (= s = √MSE), R-Sq = 83.0% (r² = SSM/SST), R-Sq(adj) = 81.8%.

Analysis of Variance (P-value for H0: β1 = 0 vs. Ha: β1 ≠ 0):
Source         | DF | SS           | MS              | F     | P
Regression     | 1  | 494041 (SSM) | 494041          | 68.27 | 0.000
Residual Error | 14 | 101308       | 7236 (MSE = s²) |       |
Total          | 15 | 595349 (SST) |                 |       |
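The MINITAB fit can be reproduced from the weekly incidence table: each week's death count is paired with the previous week's new cases, giving 16 lagged pairs. A plain-Python sketch:

```python
# Least-squares fit of weekly flu deaths on the previous week's new cases,
# using the 1918 epidemic data from the slides (16 lagged pairs).
cases = [36, 531, 4233, 8682, 7164, 2229, 600, 164,
         57, 722, 1517, 1828, 1539, 2416, 3148, 3465]    # weeks 1-16
deaths = [0, 130, 552, 738, 414, 198, 90, 56,
          50, 71, 137, 178, 194, 290, 310, 149]          # weeks 2-17
n = len(cases)
xbar, ybar = sum(cases) / n, sum(deaths) / n
Sxx = sum((x - xbar) ** 2 for x in cases)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(cases, deaths)) / Sxx
b0 = ybar - b1 * xbar
SSE = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(cases, deaths))
SST = sum((y - ybar) ** 2 for y in deaths)
r2 = 1 - SSE / SST
print(round(b0, 1), round(b1, 4), round(r2, 3))  # ≈ 49.3 0.0722 0.830
```

This matches the MINITAB output: FluDeaths1 = 49.3 + 0.0722 FluCases0 with R-Sq = 83.0%.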