Transcript Slide 1
SIMPLE LINEAR REGRESSION AND CORRELATION
Prepared by: Jackie Zerrle, David Fried, Chun-Hui Chung, Weilai Zhou, Shiyhan Zhang, Alex Fields, Yu-Hsun Cheng, Roosevelt Moreno
AMS 572.1 DATA ANALYSIS, FALL 2007.

What is Regression Analysis?
A statistical methodology to estimate the relationship of a response variable to a set of predictor variables. It is a tool for the investigation of relationships between variables. Often used in economics (supply and demand): how does one aspect of the economy affect other parts? It was proposed by the German mathematician Gauss.

Linear Regression
The simplest relationship between x (the predictor variable) and Y (the response variable) is linear:
Y_i = \beta_0 + \beta_1 x_i + \epsilon_i, (i = 1, 2, ..., n),
where \epsilon_i is a random error with E(\epsilon_i) = 0 and Var(\epsilon_i) = \sigma^2. E(Y_i) = \mu_i = \beta_0 + \beta_1 x_i represents the true but unknown mean of Y. This relationship is the true regression line.

Simple Linear Regression Model
4 Basic Assumptions:
1. The mean of Y_i is a linear function of x_i.
2. The Y_i have a common variance \sigma^2, which is the same for all values of x.
3. The errors \epsilon_i are normally distributed.
4. The errors \epsilon_i are independent.

Example -- Sales vs. Advertising
Information was given such as the cost of advertising and the sales that occurred as a result. Make a scatter plot. To get a good fit, however, we will use the Least Squares (LS) method.

Example -- Sales vs. Advertising: Data
Sales ($000,000s) (y_i)    Advertising ($000s) (x_i)
28                         71
14                         31
19                         50
21                         60
16                         35

Example -- Sales vs. Advertising
Try to fit a straight line y = \beta_0 + \beta_1 x, where \beta_0 = 2.5 and \beta_1 = (28 - 14)/(71 - 31) = 0.35. Look at the deviations between the observed values and the points from the line:
y_i - (\beta_0 + \beta_1 x_i), (i = 1, 2, ..., n).

Example -- Sales vs. Advertising: Scatter Plot with a Trial Straight Line Fit
(Figure: scatter plot of the data with the trial line, http://learning.mazoo.net/archives/000899.html)

Least Squares (Cont.)
The deviations should be as small as possible. Sum of the squared deviations:
Q = \sum_{i=1}^{n} [y_i - (\beta_0 + \beta_1 x_i)]^2.
In our example the minimized Q is 7.87 (this is the error sum of squares computed later). The values of \beta_0 and \beta_1 that minimize Q are the Least Squares estimates, denoted \hat{\beta}_0 and \hat{\beta}_1.

Least Squares Estimates
To find \hat{\beta}_0 and \hat{\beta}_1, take the first partial derivatives of Q:
\partial Q / \partial \beta_0 = -2 \sum_{i=1}^{n} [y_i - (\beta_0 + \beta_1 x_i)],
\partial Q / \partial \beta_1 = -2 \sum_{i=1}^{n} x_i [y_i - (\beta_0 + \beta_1 x_i)].

Normal Equations
We then set these partial derivatives equal to zero and simplify. These are our normal equations:
\beta_0 n + \beta_1 \sum x_i = \sum y_i,
\beta_0 \sum x_i + \beta_1 \sum x_i^2 = \sum x_i y_i.

Normal Equations
Solve for \beta_0 and \beta_1:
\hat{\beta}_0 = [(\sum x_i^2)(\sum y_i) - (\sum x_i)(\sum x_i y_i)] / [n \sum x_i^2 - (\sum x_i)^2],
\hat{\beta}_1 = [n \sum x_i y_i - (\sum x_i)(\sum y_i)] / [n \sum x_i^2 - (\sum x_i)^2].
These formulas can be simplified using:
S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - (1/n)(\sum x_i)(\sum y_i),
S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - (1/n)(\sum x_i)^2,
S_{yy} = \sum (y_i - \bar{y})^2 = \sum y_i^2 - (1/n)(\sum y_i)^2.
S_{xy} gives the sum of cross-products of the x's and y's around their respective means. S_{xx} and S_{yy} give the sums of squares of the differences between the x_i and \bar{x}, and between the y_i and \bar{y}, respectively. With these, the estimates simplify to:
\hat{\beta}_1 = S_{xy} / S_{xx},  \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}.
The least squares (LS) line, which is an estimate of the true regression line, is:
\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x.

Example -- Sales vs. Advertising
Find the equation of the line for the number of sales due to increased advertising. We have
\sum x_i = 247, \sum y_i = 98, \sum x_i^2 = 13327, \sum y_i^2 = 2038, \sum x_i y_i = 5192, and n = 5,
which gives \bar{x} = 49.4 and \bar{y} = 19.6. Then
S_{xy} = \sum x_i y_i - (1/n)(\sum x_i)(\sum y_i) = 5192 - (1/5)(247)(98) = 350.80,
S_{xx} = \sum x_i^2 - (1/n)(\sum x_i)^2 = 13327 - (1/5)(247)^2 = 1125.20.
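As a cross-check of the hand calculation, here is a minimal SAS sketch of the least-squares fit for the sales vs. advertising data above; the data set name ads and the variable names adv and sales are our own choices, and PROC REG's default output gives the intercept and slope estimates with their standard errors and t tests.

* Least squares fit for the sales vs. advertising data (a sketch);
data ads;
   input adv sales;      /* adv = advertising ($000s), sales = sales ($000,000s) */
   datalines;
71 28
31 14
50 19
60 21
35 16
;
run;

proc reg data=ads;
   model sales = adv;    /* parameter estimates should match the hand-computed LS line below */
run;
quit;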
Example -- Sales vs. Advertising
The slope and intercept estimates are:
\hat{\beta}_1 = 350.80 / 1125.20 = 0.31  and  \hat{\beta}_0 = 19.6 - 0.31(49.4) = 4.29.
The equation of the LS line is: \hat{y} = 4.29 + 0.31x.

Coefficient of Determination and Coefficient of Correlation
The fitted values are \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i, i = 1, 2, ..., n. Residuals are used to evaluate the goodness of fit of the LS line:
e_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i), i = 1, 2, ..., n.
Error sum of squares (SSE): Q_{min} = \sum e_i^2. The total variation of the y_i around their mean, \sum (y_i - \bar{y})^2 = S_{yy}, is the total sum of squares (SST).

Total Sum of Squares
SST = \sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2 + 2 \sum (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) = SSR + SSE,
since the cross-product term equals zero. Regression sum of squares: SSR = SST - SSE. The coefficient of determination is the ratio
r^2 = SSR / SST = 1 - SSE / SST.

Sales vs. Advertising: Coefficient of Determination and Correlation
Calculate r^2 and r using our data.
SST = S_{yy} = \sum y_i^2 - (1/n)(\sum y_i)^2 = 2038 - (1/5)(98)^2 = 117.2.
Next calculate SSR: SSR = SST - SSE = 117.2 - 7.87 = 109.33.
Then r^2 = 109.33 / 117.2 = 0.933 and r = \sqrt{0.933} = 0.966.
Since 93.3% of the variation in sales is accounted for by linear regression on advertising, the relationship between the two is strongly linear with a positive slope.

Estimation of \sigma^2
The variance \sigma^2 measures the scatter of the Y_i around their means \mu_i. The unbiased estimate of the variance is given by:
s^2 = SSE / (n - 2) = \sum e_i^2 / (n - 2).

Sales vs. Advertising: Estimation of \sigma^2
Find the estimate of \sigma^2 using our past results. SSE = 7.87 and n - 2 = 3, so
s^2 = 7.87 / 3 = 2.62.
The estimate of \sigma is s = \sqrt{2.62} = 1.62, i.e. about $1.62 million (sales are in $000,000s).

Statistical Inference on \beta_0 and \beta_1
i. Point estimator  ii. Confidence interval  iii. Hypothesis test
Distributions of \hat{\beta}_0 and \hat{\beta}_1:
\hat{\beta}_0 ~ N(\beta_0, \sigma^2 \sum x_i^2 / (n S_{xx})),  \hat{\beta}_1 ~ N(\beta_1, \sigma^2 / S_{xx}).
Estimated standard errors:
SE(\hat{\beta}_0) = s \sqrt{\sum x_i^2 / (n S_{xx})},  SE(\hat{\beta}_1) = s / \sqrt{S_{xx}}.
100(1 - \alpha)% CIs:
\hat{\beta}_0 \pm t_{n-2, \alpha/2} SE(\hat{\beta}_0),  \hat{\beta}_1 \pm t_{n-2, \alpha/2} SE(\hat{\beta}_1).
Hypothesis test: the pivotal quantities are
(\hat{\beta}_0 - \beta_0) / SE(\hat{\beta}_0) ~ t_{n-2}  and  (\hat{\beta}_1 - \beta_1) / SE(\hat{\beta}_1) ~ t_{n-2}.
Hypothesis: H_0: \beta_1 = \beta_1^0 vs. H_a: \beta_1 \ne \beta_1^0.
Test statistic: t_0 = (\hat{\beta}_1 - \beta_1^0) / SE(\hat{\beta}_1); for H_0: \beta_1 = 0, t_0 = \hat{\beta}_1 / SE(\hat{\beta}_1).
Rejection region: reject H_0 at level \alpha if |t_0| > t_{n-2, \alpha/2}.

Analysis of Variance for Simple Linear Regression
The analysis of variance (ANOVA) is a statistical technique to decompose the total variability in the y_i's into separate variance components associated with specific sources. A mean square is a sum of squares divided by its d.f.:
Mean Square Regression: MSR = SSR / 1.  Mean Square Error: MSE = SSE / (n - 2).
The ratio of MSR to MSE provides an equivalent test of the significance of the linear relationship between x and y:
F = MSR / MSE = SSR / s^2 = \hat{\beta}_1^2 S_{xx} / s^2 = [\hat{\beta}_1 / (s / \sqrt{S_{xx}})]^2 = [\hat{\beta}_1 / SE(\hat{\beta}_1)]^2 = t^2 ~ F(1, n - 2) under H_0.

ANOVA table
Source of Variation   Sum of Squares (SS)   Degrees of Freedom (d.f.)   Mean Square (MS)      F
Regression            SSR                   1                           MSR = SSR / 1         F = MSR / MSE
Error                 SSE                   n - 2                       MSE = SSE / (n - 2)
Total                 SST                   n - 1

Prediction of Future Observations
Suppose we fix x at a specified value x*. How do we predict the value of the r.v. Y*?
Point estimator: \hat{Y}* = \hat{\beta}_0 + \hat{\beta}_1 x*.
Prediction Intervals (PI): the confidence intervals for Y* and E(Y*) are called prediction intervals. Formulas for a 100(1 - \alpha)% PI, with s = \sqrt{MSE}:
For \mu* = E(Y*):  \hat{Y}* \pm t_{n-2, \alpha/2} s \sqrt{1/n + (x* - \bar{x})^2 / S_{xx}}.
For Y*:  \hat{Y}* \pm t_{n-2, \alpha/2} s \sqrt{1 + 1/n + (x* - \bar{x})^2 / S_{xx}}.

Cautions about Making Predictions
Note that the PI will be shortest when x* is equal to the sample mean \bar{x}. The farther away x* is from the sample mean, the longer the PI will be. Extrapolation beyond the range of the data is highly imprecise and should be avoided.
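Before working an example by hand, here is a hedged SAS sketch of how such intervals can be obtained in software. It reuses the small sales vs. advertising data set from above (the data set name ads_pred and the new advertising value 55, chosen inside the observed range, are our own); the extra record with a missing response is a common device for getting CLM/CLI limits at a new x value.

* Interval estimates at a new x value (a sketch);
data ads_pred;
   input adv sales;      /* last record: adv = 55 with a missing response, to be scored */
   datalines;
71 28
31 14
50 19
60 21
35 16
55 .
;
run;

proc reg data=ads_pred;
   model sales = adv / clm cli;   /* CLM: limits for the mean E(Y*), CLI: limits for an individual Y* */
run;
quit;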
Example 10.8
Calculate a 95% PI for the mean groove depth of the population of all tires and for the groove depth of a single tire with a mileage of 25,000 (based on the data from earlier sections).
In previous examples, we already measured the following quantities:
x* = 25, \hat{Y}* = \hat{\beta}_0 + \hat{\beta}_1 x* = 178.62, s = 19.02, \bar{x} = 16, n = 9, S_{xx} = 960, t_{7, 0.025} = 2.365.

Example 10.8 (continued)
Now we simply plug these numbers into our formulas.
95% PI for E(Y*): 178.62 \pm 2.365(19.02) \sqrt{1/9 + (25 - 16)^2 / 960} = [158.73, 198.51].
95% PI for Y*: 178.62 \pm 2.365(19.02) \sqrt{1 + 1/9 + (25 - 16)^2 / 960} = [129.44, 227.80].

Calibration (Inverse Regression)
Suppose we are given \mu* = E(Y*), and we want an estimate of x*. We simply solve the linear regression formula for x* to obtain our point estimator:
\hat{x}* = (\mu* - \hat{\beta}_0) / \hat{\beta}_1.
Calculating the CI is more complicated and is not covered in this course.

Example 10.9
Estimate the mean life of a tire at wearout (62.5 mils remaining). We want to estimate x* when \mu* = 62.5. From previous examples, we have calculated \hat{\beta}_0 = 360.64 and \hat{\beta}_1 = -7.281. Plugging this data into our equation (with x in thousands of miles) we get:
\hat{x}* = (62.5 - 360.64) / (-7.281) \times 1000 = 40,947.67 miles.

REGRESSION DIAGNOSTICS
The four basic assumptions of linear regression need to be verified from data to assure that the analysis is valid:
1. The mean of Y_i is a linear function of x_i.
2. The Y_i have a common variance \sigma^2, which is the same for all values of x_i.
3. The errors \epsilon_i are normally distributed.
4. The errors \epsilon_i are independent.

Checking the Model Assumptions
Checking for linearity, checking for constant variance, checking for normality, checking for independence. How to do this? If the model is correct, then the residuals e_i = y_i - \hat{y}_i can be viewed as "estimates" of the random errors \epsilon_i. Residual plots are the primary tool; a SAS sketch of the basic checks follows the outlier discussion below.

Checking for Linearity
If the regression of y on x is linear, then the plot of e_i vs. x_i should exhibit random scatter around zero.

Example 10.10
i    x_i   y_i      \hat{y}_i   e_i
1    0     394.33   360.64      33.69
2    4     329.50   331.51      -2.01
3    8     291.00   302.39      -11.39
4    12    255.17   273.27      -18.10
5    16    229.33   244.15      -14.82
6    20    204.83   215.02      -10.19
7    24    179.00   185.90      -6.90
8    28    163.83   156.78      7.05
9    32    150.33   127.66      22.67
The plot is clearly parabolic. The linear regression does not fit the data adequately. Maybe we can try a second-degree model: y = \beta_0 + \beta_1 x + \beta_2 x^2.

Checking for Constant Variance
Plot e_i vs. \hat{y}_i. Since the \hat{y}_i are linear functions of the x_i, we can also plot e_i vs. x_i. If the constant variance assumption is correct, Var(e_i) is approximately \sigma^2 for all i, and the plot of e_i vs. \hat{y}_i shows a band of roughly constant width around zero.

Checking for Normality
Making a normal plot:
1. The normal plot requires that the observations form a random sample with a common mean and variance.
2. The y_i do not form such a random sample, because E(Y_i) = \mu_i depends on x_i and hence the means are not equal.
3. Use the residuals to make the normal plot (they have a zero mean and an approximately constant variance).
(Figure: normal plot of the residuals for Example 10.10.)

Checking for Independence
A well-known statistical test is the Durbin-Watson test:
d = \sum_{u=2}^{n} (e_u - e_{u-1})^2 / \sum_{u=1}^{n} e_u^2.
1. When d is near 2, the residuals are approximately independent.
2. When d is near 0, the residuals are positively correlated.
3. When d is near 4, the residuals are negatively correlated.

CHECKING FOR OUTLIERS AND INFLUENTIAL OBSERVATIONS
Checking for Outliers
Standardized residuals:
e_i* = e_i / SE(e_i), where SE(e_i) = s \sqrt{1 - 1/n - (x_i - \bar{x})^2 / S_{xx}}, i = 1, 2, ..., n.
If |e_i*| > 2, then the corresponding observation may be regarded as an outlier.
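A minimal SAS sketch of these diagnostic checks, using the Example 10.10 tread wear data retyped from the table above; the data set name tread, the output data set diag, and the variable names yhat, resid, and std_res are our own, while DW, R, and the OUTPUT keywords are standard PROC REG options.

* Residual diagnostics for the tread wear data of Example 10.10 (a sketch);
data tread;
   input x y;
   datalines;
0 394.33
4 329.50
8 291.00
12 255.17
16 229.33
20 204.83
24 179.00
28 163.83
32 150.33
;
run;

proc reg data=tread;
   model y = x / dw r;                               /* DW: Durbin-Watson d, R: residual analysis */
   output out=diag p=yhat r=resid student=std_res;   /* fitted values, residuals, studentized residuals */
run;
quit;

proc gplot data=diag;
   plot resid*x / vref=0;        /* linearity check: should be random scatter around zero */
   plot resid*yhat / vref=0;     /* constant-variance check: band of roughly constant width */
run;
quit;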
Checking for Influential Observations
An influential observation is not necessarily an outlier. An observation can be influential because it has an extreme x-value, an extreme y-value, or both. How can we identify influential observations?

Leverage
\hat{y}_i can be expressed as a linear combination of all the y_j as follows:
\hat{y}_i = \sum_{j=1}^{n} h_{ij} y_j,
where the h_{ij} are some functions of the x's; h_{ii} is called the leverage of observation i, and \sum_{i=1}^{n} h_{ii} = k + 1, where k is the number of predictors. A rule of thumb is to regard any h_{ii} > 2(k + 1)/n as high leverage. An observation with high leverage is an influential observation. In this chapter k = 1, and so h_{ii} > 4/n is regarded as high leverage. The formula for h_{ii} for k = 1 is given by:
h_{ii} = 1/n + (x_i - \bar{x})^2 / S_{xx}.

How to Deal with Outliers and Influential Observations?
Detect the outliers and influential observations. If they are erroneous observations, discard them; if not, include them in the analysis. Then do the analysis. Two separate analyses may be done, one with and one without the outliers and influential observations.

Example 10.12
No.   X     Y       e_i*     h_ii
1     8     6.28    -0.341   0.1
2     8     5.76    -1.067   0.1
3     8     7.71     0.582   0.1
4     8     8.84     1.735   0.1
5     8     8.47     1.300   0.1
6     8     7.04     0.031   0.1
7     8     5.25    -1.624   0.1
8     19    12.50    0       1
9     8     5.56    -1.271   0.1
10    8     7.91     0.757   0.1
11    8     6.89    -0.089   0.1
Observation 8, with X = 19, has leverage h_ii = 1 > 4/n, so it is a highly influential observation even though its standardized residual is 0.

DATA TRANSFORMATIONS: Linearizing Transformations
Simple functional relationship, e.g. the power form: y = \alpha x^{\beta}. Taking logs, \ln y = \ln \alpha + \beta \ln x; then y' = \ln y and x' = \ln x produce a linear model with \beta_0 = \ln \alpha and \beta_1 = \beta.

DATA TRANSFORMATIONS: Linearizing Transformations
Simple functional relationship, e.g. the exponential form: y = \alpha e^{\beta x}. Taking logs, \ln y = \ln \alpha + \beta x; then y' = \ln y and x' = x produce a linear model with \beta_0 = \ln \alpha and \beta_1 = \beta.

DATA TRANSFORMATIONS: Linearizing Transformations
(Figure: chart pairing common curve shapes with linearizing transformations of x and y, such as log x, x^2, x^3, 1/x for x and log y, y^2, y^3, 1/y for y.)

DATA TRANSFORMATIONS: Linearizing Transformations
Ex. 10.13 (Tire tread wear vs. mileage: exponential model)
(Figures: plots of y vs. x for the exponential-model fit to the tire tread wear data.)

DATA TRANSFORMATIONS: Variance Stabilizing Transformations
Based on two-term Taylor-series approximations. Given a relationship between the mean and the variance, Var(Y) = g^2(\mu), the following transformation makes the variances approximately equal, even if the means differ: h(y) = \int dy / g(y).

DATA TRANSFORMATIONS: Variance Stabilizing Transformations (Delta Method)
Var[h(Y)] \approx [h'(\mu)]^2 g^2(\mu), where E(Y) = \mu and Var(Y) = g^2(\mu).
Let [h'(\mu)]^2 g^2(\mu) = 1; then h'(\mu) = 1 / g(\mu), and consequently h(y) = \int dy / g(y).

DATA TRANSFORMATIONS: Variance Stabilizing Transformations
Example 1: Var(Y) = c^2 \mu^2, c > 0. Here g(\mu) = c\mu, then h(y) = \int dy / (cy) = (1/c) \int dy / y = (1/c) \ln y.
Example 2: Var(Y) = c^2 \mu, c > 0. Here g(\mu) = c\sqrt{\mu}, then h(y) = \int dy / (c\sqrt{y}) = (1/c) \int dy / \sqrt{y} = (2/c) \sqrt{y}.
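A short SAS sketch of the linearizing idea, assuming an exponential model for the tread wear data as in the Ex. 10.13 slides above; the data set tread comes from the earlier diagnostics sketch, the names tread_log and lny are ours, and the resulting log-scale estimates are not claimed to match the textbook's fitted values.

* Linearizing transformation for an exponential model y = alpha*exp(beta*x) (a sketch);
data tread_log;
   set tread;            /* tread data set created in the earlier diagnostics sketch */
   lny = log(y);         /* y' = ln y, so lny = ln(alpha) + beta*x is linear in x */
run;

proc reg data=tread_log;
   model lny = x;        /* intercept estimates ln(alpha), slope estimates beta */
run;
quit;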
CORRELATION ANALYSIS
Background on correlation: A number of different correlation coefficients are used for different situations. The best known is the Pearson product-moment correlation coefficient, which is obtained by dividing the covariance of the two variables by the product of their standard deviations. Despite its name, it was first introduced by Francis Galton.

CORRELATION ANALYSIS
When is it used? When it is not clear which is the predictor variable and which is the response variable, and when both variables are random.

Bivariate Normal Distribution
Correlation: a measurement of how closely two variables share a linear relationship, i.e. a measure of (linear) dependence.
\rho = corr(X, Y) = Cov(X, Y) / \sqrt{Var(X) Var(Y)}.
If \rho = 0, X and Y are uncorrelated; independence implies \rho = 0, but \rho = 0 does not by itself guarantee independence. If \rho = -1 or +1, it represents perfect linear association. Correlation is useful when it is not possible to determine which variable is the predictor and which is the response. Health vs. wealth: which is the predictor? Which is the response?

Bivariate Normal Distribution
The p.d.f. of (X, Y) is
f(x, y) = \frac{1}{2\pi \sigma_X \sigma_Y \sqrt{1 - \rho^2}} \exp\left\{ -\frac{1}{2(1 - \rho^2)} \left[ \left(\frac{x - \mu_X}{\sigma_X}\right)^2 - 2\rho \left(\frac{x - \mu_X}{\sigma_X}\right)\left(\frac{y - \mu_Y}{\sigma_Y}\right) + \left(\frac{y - \mu_Y}{\sigma_Y}\right)^2 \right] \right\}.
Properties: the p.d.f. is defined for -1 < \rho < 1; it is undefined if \rho = \pm 1, and that case is called degenerate. The marginal p.d.f. of X is N(\mu_X, \sigma_X^2). The marginal p.d.f. of Y is N(\mu_Y, \sigma_Y^2).

Bivariate Normal Distribution: How to Calculate
Let (X, Y) have covariance matrix
A = \begin{pmatrix} \sigma_X^2 & \rho \sigma_X \sigma_Y \\ \rho \sigma_X \sigma_Y & \sigma_Y^2 \end{pmatrix},  \det A = \sigma_X^2 \sigma_Y^2 - \rho^2 \sigma_X^2 \sigma_Y^2 = \sigma_X^2 \sigma_Y^2 (1 - \rho^2),
where f(x) = N(\mu_X, \sigma_X^2) and f(y) = N(\mu_Y, \sigma_Y^2).

Calculation
f(x, y) = \frac{1}{(2\pi)^{N/2} \sqrt{\det A}} \exp\left\{ -\frac{1}{2} (x - \mu_X, y - \mu_Y) A^{-1} (x - \mu_X, y - \mu_Y)^T \right\},
where N = 2 since it is bivariate; thus
f(x, y) = \frac{1}{2\pi \sigma_X \sigma_Y \sqrt{1 - \rho^2}} \exp\left\{ -\frac{1}{2} (x - \mu_X, y - \mu_Y) A^{-1} (x - \mu_X, y - \mu_Y)^T \right\},
where
A^{-1} = \frac{1}{\sigma_X^2 \sigma_Y^2 (1 - \rho^2)} \begin{pmatrix} \sigma_Y^2 & -\rho \sigma_X \sigma_Y \\ -\rho \sigma_X \sigma_Y & \sigma_X^2 \end{pmatrix}.

Calculation (cont.)
Expanding the quadratic form gives the bivariate normal p.d.f. stated above.

Statistical Inference on the Correlation Coefficient \rho
We can derive a test on the correlation coefficient in the same way that we have been doing in class. Assumptions: (X, Y) are from the bivariate normal distribution; R is the sample estimate of the population correlation coefficient \rho. Start with the point estimator:
R = \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) / \sqrt{ \sum_{i=1}^{n} (X_i - \bar{X})^2 \sum_{i=1}^{n} (Y_i - \bar{Y})^2 }.
Get the pivotal quantity: the distribution of R is quite complicated, so transform the point estimator into a p.q.:
T = R \sqrt{n - 2} / \sqrt{1 - R^2}.
Do we know everything about the p.q.? Yes: T ~ t_{n-2} under H_0: \rho = 0.

Derivation of T
Are these equivalent?
t = \hat{\beta}_1 / SE(\hat{\beta}_1)  and  r \sqrt{n - 2} / \sqrt{1 - r^2}?
If so, we can use t as a statistic for testing against the null hypothesis H_0: \beta_1 = 0 and, equivalently, against H_0: \rho = 0. Substitute
r = \hat{\beta}_1 (s_x / s_y) = \hat{\beta}_1 \sqrt{S_{xx} / S_{yy}} = \hat{\beta}_1 \sqrt{S_{xx} / SST}  and  1 - r^2 = SSE / SST = (n - 2) s^2 / SST;
then
r \sqrt{n - 2} / \sqrt{1 - r^2} = \hat{\beta}_1 \sqrt{S_{xx} / SST} \sqrt{n - 2} / \sqrt{(n - 2) s^2 / SST} = \hat{\beta}_1 \sqrt{S_{xx}} / s = \hat{\beta}_1 / SE(\hat{\beta}_1) = t.
Yes, they are equivalent.

Exact Statistical Inference on \rho
Test H_0: \rho = 0 vs. H_a: \rho \ne 0, which (since E(Y | x) = \mu_Y + \rho (\sigma_Y / \sigma_X)(x - \mu_X) under the bivariate normal model) is equivalent to H_0: \beta_1 = 0 vs. H_a: \beta_1 \ne 0. Test statistic:
t = r \sqrt{n - 2} / \sqrt{1 - r^2} = \hat{\beta}_1 \sqrt{S_{xx}} / s ~ t_{n-2} under H_0.
Rejection region: reject H_0 if |t_0| > t_{n-2, \alpha/2}.
Example: The times for 25 soft drink deliveries (y) monitored as a function of delivery volume (x) are shown in the table below. Test the null hypothesis that the correlation coefficient is equal to 0.

Exact Statistical Inference on \rho: Data
Y    X        Y    X        Y    X        Y    X        Y    X
7    16.68    7    18.11    16   40.33    10   29.00    10   17.90
3    11.50    2    8.00     10   21.00    6    15.35    26   52.32
3    12.03    7    17.83    4    13.50    7    19.00    9    18.75
4    14.88    30   79.24    6    19.75    3    9.50     8    19.83
6    13.75    5    21.50    9    24.00    17   35.10    4    10.75

Exact Statistical Inference on \rho: Solution
The sample correlation coefficient is
r = \sum (X_i - \bar{X})(Y_i - \bar{Y}) / \sqrt{ \sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2 } = 2473.34 / \sqrt{1136.57 \times 5784.54} = 0.96.
Then
t_0 = r \sqrt{n - 2} / \sqrt{1 - r^2} = 0.96 \sqrt{25 - 2} / \sqrt{1 - 0.96^2} = 17.56.
For \alpha = 0.01, t_{23, 0.005} = 2.807. Since 17.56 > 2.807, reject H_0.

Approximate Statistical Inference on \rho
There is no exact method of testing \rho against an arbitrary \rho_0: the distribution of R is very complicated, and T ~ t_{n-2} only when \rho = 0. To test \rho against an arbitrary \rho_0, use Fisher's normal approximation:
\hat{\psi} = \tanh^{-1} R = \frac{1}{2} \ln \frac{1 + R}{1 - R} \approx N\left( \frac{1}{2} \ln \frac{1 + \rho}{1 - \rho}, \frac{1}{n - 3} \right).
Transform the sample estimate: \hat{\psi} = \frac{1}{2} \ln \frac{1 + r}{1 - r}; under H_0: \rho = \rho_0, \hat{\psi} \approx N(\psi_0, 1/(n - 3)), where \psi_0 = \frac{1}{2} \ln \frac{1 + \rho_0}{1 - \rho_0}.

Approximate Statistical Inference on \rho
Test H_0: \rho = \rho_0 vs. H_1: \rho \ne \rho_0, or equivalently H_0: \psi = \psi_0 vs. H_1: \psi \ne \psi_0.
Z statistic: z_0 = \sqrt{n - 3} (\hat{\psi} - \psi_0); reject H_0 if |z_0| > z_{\alpha/2}.
CI: first form [l, u] = [\hat{\psi} - z_{\alpha/2} / \sqrt{n - 3}, \hat{\psi} + z_{\alpha/2} / \sqrt{n - 3}], then transform back:
(e^{2l} - 1) / (e^{2l} + 1) \le \rho \le (e^{2u} - 1) / (e^{2u} + 1).
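As a quick numeric illustration of this back-transformation (our own arithmetic, not a figure quoted in the slides, using the rounded r = 0.96 and n = 25 from the delivery example above with \alpha = 0.05):
\hat{\psi} = \frac{1}{2} \ln \frac{1 + 0.96}{1 - 0.96} \approx 1.946,  [l, u] = 1.946 \pm 1.96 / \sqrt{22} \approx [1.53, 2.36],
so the approximate 95% CI for \rho is [(e^{2(1.53)} - 1)/(e^{2(1.53)} + 1), (e^{2(2.36)} - 1)/(e^{2(2.36)} + 1)] \approx [0.91, 0.98].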
Approximate Statistical Inference on \rho: Code
(SAS code slide; see the SAS program below.)

Approximate Statistical Inference on \rho: Output
(SAS output slide.)

Approximate Statistical Inference on \rho
Retaking the previous example: the times for 25 soft drink deliveries (y) monitored as a function of delivery volume (x) are shown in the table above. Test the null hypothesis that the correlation coefficient is equal to 0.

SAS coding for the last example
data corr_bev;
   input y x;
   datalines;
7 16.68
3 11.5
3 12.03
4 14.88
6 13.75
7 18.11
2 8.00
7 17.83
30 79.24
5 21.5
16 40.33
10 21.00
4 13.5
6 19.75
9 24.00
10 29.00
6 15.35
7 19.00
3 9.50
17 35.1
10 17.90
26 52.32
9 18.75
8 19.83
4 10.75
;
run;
proc gplot data=corr_bev;
   plot y*x;
run;
proc corr data=corr_bev outp=corr;
   var x y;
run;

SAS analysis for the last example
(SAS output slides for PROC GPLOT and PROC CORR.)

Pitfalls of Regression and Correlation Analysis
Correlation and causation:
Coincidental data: baldness and lawyers.
Lurking variables (a third, unobserved variable): "good mood causes good health"; the relationship between eating and weight, with the unobserved variable of heredity (metabolism and illness).
Restricted range: IQ and school performance (elementary school to college); in college, lower IQs are less common, so there would clearly be a decrease in the range.

Pitfalls of Regression and Correlation Analysis
Correlation and linearity: the correlation value alone may not be enough to evaluate a relationship, especially in the case where the assumption of normality is incorrect. This image, created by Francis Anscombe, shows four data sets with a common mean (7.5), variance (4.12), correlation (0.81), and regression line y = 3 + 0.5x.