Transcript
1/69 Simple Linear Regression
AMS 572, 11/29/2010

2/69 Outline
1. Brief History and Motivation – Zhen Gong
2. Simple Linear Regression Model – Wenxiang Liu
3. Ordinary Least Squares Method – Ziyan Lou
4. Goodness of Fit of LS Line – Yixing Feng
5. OLS Example – Lingbin Jin
6. Statistical Inference on Parameters – Letan Lin
7. Statistical Inference Example – Emily Vo
8. Regression Diagnostics – Yang Liu
9. Correlation Analysis – Andrew Candela
10. Implementation in SAS – Joseph Chisari

3/69 Brief History and Introduction
Legendre published the earliest form of the method of least squares in 1805. In 1809, Gauss published the same method. Francis Galton extended it to regression in the 19th century to describe a biological phenomenon, and Karl Pearson and Udny Yule extended it to a more general statistical context around the 20th century.

4/69 Motivation for Regression Analysis
• Regression analysis is a statistical methodology to estimate the relationship of a response variable to a set of predictor variables.
• When there is just one predictor variable, we use simple linear regression. When there are two or more predictor variables, we use multiple linear regression.
• Prediction for the response variable: given a newly observed predictor value x, predict Y based on X.

5/69 Motivation for Regression Analysis
2010 Camry: horsepower at 6000 rpm: 169; highway gasoline consumption: 0.03125 gallon per mile.
2010 Milan: horsepower at 6000 rpm: 175; highway gasoline consumption: 0.0326 gallon per mile.
2010 Fusion: horsepower at 6000 rpm: 263; highway gasoline consumption: ?
Response variable (Y): highway gasoline consumption. Predictor variable (X): horsepower at 6000 rpm.

6/69 Simple Linear Regression Model
• A summary of the relationship between a dependent variable (or response variable) Y and an independent variable (or covariate) X.
• Y is assumed to be a random variable while, even if X is a random variable, we condition on it (assume it is fixed). Essentially, we are interested in the behavior of Y given that we know X = x.

7/69 Good Model
• Regression models attempt to minimize the distance, measured vertically, between the observation points and the model line (or curve).
• The length of such a line segment is called a residual, modeling error, or simply error.
• Requiring only that the negative and positive errors cancel out (zero overall error) is not enough: many lines will satisfy this criterion.

8/69 Good Model
[Figure]

9/69 Probabilistic Model
• In simple linear regression, the population regression line is given by $E(Y) = \beta_0 + \beta_1 x$.
• The actual values of Y are assumed to be the sum of the mean value E(Y) and a random error term $\epsilon$:
$Y = E(Y) + \epsilon = \beta_0 + \beta_1 x + \epsilon$
• At any given value of x, the dependent variable $Y \sim N(\beta_0 + \beta_1 x,\ \sigma^2)$.

10/69 Least Squares (LS) Fit
Boiling Point of Water in the Alps
Pressure  Boiling Pt | Pressure  Boiling Pt
20.79     194.5      | 24.01     201.3
20.79     194.3      | 25.14     203.6
22.40     197.9      | 26.57     204.6
22.67     198.4      | 28.49     209.5
23.15     199.4      | 27.76     208.6
23.35     199.9      | 29.04     210.7
23.89     200.9      | 29.88     211.9
23.99     201.1      | 30.06     212.2
24.02     201.4      |

11/69 Least Squares (LS) Fit
Find a line that represents the "best" linear relationship:
[Figure: scatter plot of boiling point vs. pressure]
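Before deriving the estimates, it may help to see the fitted line that software produces for these data. The following is a minimal sketch, assuming hypothetical names (dataset Alps, variables Pressure and BoilPt); PROC REG, which the SAS section at the end of this deck uses the same way, prints the LS intercept and slope:

Data Alps;
Input Pressure BoilPt @@; * pressure and boiling point, read in pairs;
Datalines;
20.79 194.5  24.01 201.3  20.79 194.3  25.14 203.6
22.40 197.9  26.57 204.6  22.67 198.4  28.49 209.5
23.15 199.4  27.76 208.6  23.35 199.9  29.04 210.7
23.89 200.9  29.88 211.9  23.99 201.1  30.06 212.2
24.02 201.4
;
Proc Reg Data=Alps;
Model BoilPt = Pressure; * prints LS estimates of the intercept and slope;
Run;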
12/69 Least Squares (LS) Fit
• Problem: the data do not fall exactly on a line:
$y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad i = 1, 2, \ldots, n$
• Find the line that minimizes the sum of squared vertical deviations:
$Q = \sum_{i=1}^{n} [y_i - (\beta_0 + \beta_1 x_i)]^2$

13/69 Least Squares (LS) Fit
• To find the parameters that minimize the sum of squares, take the partial derivative with respect to each parameter and set it equal to zero:
$\partial Q / \partial \beta_0 = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0$
$\partial Q / \partial \beta_1 = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)\, x_i = 0$
These simplify to the normal equations:
$n \beta_0 + \beta_1 \sum x_i = \sum y_i$
$\beta_0 \sum x_i + \beta_1 \sum x_i^2 = \sum x_i y_i$

14/69 Least Squares (LS) Fit
• Solving the normal equations gives
$\hat\beta_0 = \dfrac{(\sum x_i^2)(\sum y_i) - (\sum x_i)(\sum x_i y_i)}{n \sum x_i^2 - (\sum x_i)^2}, \qquad \hat\beta_1 = \dfrac{n \sum x_i y_i - (\sum x_i)(\sum y_i)}{n \sum x_i^2 - (\sum x_i)^2}$

15/69 Least Squares (LS) Fit
• To simplify, we introduce
$S_{xy} = \sum (x_i - \bar x)(y_i - \bar y) = \sum x_i y_i - \tfrac{1}{n} (\sum x_i)(\sum y_i)$
$S_{xx} = \sum (x_i - \bar x)^2 = \sum x_i^2 - \tfrac{1}{n} (\sum x_i)^2$
$S_{yy} = \sum (y_i - \bar y)^2 = \sum y_i^2 - \tfrac{1}{n} (\sum y_i)^2$
so that $\hat\beta_1 = S_{xy}/S_{xx}$ and $\hat\beta_0 = \bar y - \hat\beta_1 \bar x$.
• The resulting equation $\hat y = \hat\beta_0 + \hat\beta_1 x$ is known as the least squares line, which is an estimate of the true regression line.

16/69 Goodness of Fit of the LS Line
The fitted values are $\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$. The residuals $e_i = y_i - (\hat\beta_0 + \hat\beta_1 x_i)$ are used to evaluate the goodness of fit of the LS line.

17/69 Goodness of Fit of the LS Line
The error sum of squares: $SSE = \sum_{i=1}^{n} (y_i - \hat y_i)^2$
The total sum of squares: $SST = \sum_{i=1}^{n} (y_i - \bar y)^2$
The regression sum of squares: $SSR = \sum_{i=1}^{n} (\hat y_i - \bar y)^2$
$SST = \sum (y_i - \bar y)^2 = \sum (\hat y_i - \bar y)^2 + \sum (y_i - \hat y_i)^2 + 2 \sum (y_i - \hat y_i)(\hat y_i - \bar y) = SSR + SSE + 0$
since the cross-product term vanishes for the LS fit; hence SST = SSR + SSE.

18/69 Goodness of Fit of the LS Line
• The coefficient of determination $r^2 = SSR/SST = 1 - SSE/SST$ is always between 0 and 1.
• The sample correlation coefficient between X and Y is $r = S_{xy} / \sqrt{S_{xx} S_{yy}}$. For simple linear regression, $r^2$ equals the coefficient of determination.

19/69 Estimation of the Variance
The variance $\sigma^2$ measures the scatter of the $Y_i$ around their means. An unbiased estimate of $\sigma^2$ is given by
$s^2 = \dfrac{SSE}{n-2} = \dfrac{\sum_{i=1}^{n} e_i^2}{n-2}$
This estimate of $\sigma^2$ has n - 2 degrees of freedom.

20/69 Implementing the OLS Method on Problem 10.4
OLS method: minimize $Q = \sum_{i=1}^{n} [y_i - (\beta_0 + \beta_1 x_i)]^2$.
The time between eruptions of the Old Faithful geyser in Yellowstone National Park is random but is related to the duration of the last eruption. The table below shows these times for 21 consecutive eruptions.
Obs No.  LAST  NEXT | Obs No.  LAST  NEXT | Obs No.  LAST  NEXT
1        2.0   50   | 8        2.8   57   | 15       4.0   77
2        1.8   57   | 9        3.3   72   | 16       4.0   70
3        3.7   55   | 10       3.5   62   | 17       1.7   43
4        2.2   47   | 11       3.7   63   | 18       1.8   48
5        2.1   53   | 12       3.8   70   | 19       4.9   70
6        2.4   50   | 13       4.5   85   | 20       4.2   79
7        2.6   62   | 14       4.7   75   | 21       4.3   72

21/69 Implementing the OLS Method on Problem 10.4
A scatter plot of NEXT vs. LAST:
[Figure: scatter plot of NEXT vs. LAST]

22/69 Implementing the OLS Method on Problem 10.4
$\bar x = 3.238, \quad \bar y = 62.714$
$S_{xx} = \sum_{i=1}^{21} (x_i - \bar x)^2 = 22.230$
$S_{yy} = \sum_{i=1}^{21} (y_i - \bar y)^2 = 2844.286$
$S_{xy} = \sum_{i=1}^{21} (x_i - \bar x)(y_i - \bar y) = 217.629$
$SSE = \sum_{i=1}^{21} (y_i - \hat y_i)^2 = 713.687$
$SSR = \sum_{i=1}^{21} (\hat y_i - \bar y)^2 = 2130.599$
$SST = S_{yy} = 2844.286$
$\hat\beta_1 = S_{xy}/S_{xx} = 9.790, \qquad \hat\beta_0 = \bar y - \hat\beta_1 \bar x = 31.013$

23/69 Implementing the OLS Method on Problem 10.4
$\hat y = \hat\beta_0 + \hat\beta_1 x$. When x = 3, $\hat y \approx 60$.
$r = \sqrt{SSR/SST} = 0.865$
We could say that LAST is a good predictor of NEXT.
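The sums of squares above can be checked in SAS before running any regression. A minimal sketch, assuming a hypothetical dataset name Geyser for the 21 observations in the table; the CSSCP option of PROC CORR prints the corrected sums of squares and cross-products, i.e., Sxx and Syy on the diagonal and Sxy off the diagonal:

Data Geyser;
Input Last Next @@; * LAST duration and NEXT interval, read in pairs;
Datalines;
2.0 50  1.8 57  3.7 55  2.2 47  2.1 53  2.4 50  2.6 62
2.8 57  3.3 72  3.5 62  3.7 63  3.8 70  4.5 85  4.7 75
4.0 77  4.0 70  1.7 43  1.8 48  4.9 70  4.2 79  4.3 72
;
Proc Corr Data=Geyser CSSCP;
Var Last Next;
Run;

Dividing the printed cross-product of Last and Next by the Last sum of squares should reproduce the slope: 217.629/22.230 ≈ 9.790.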
24/69 Statistical Inference on β0 and β1
Final result: $\hat\beta_0$ and $\hat\beta_1$ are normally distributed, with
$E(\hat\beta_0) = \beta_0, \quad SD(\hat\beta_0) = \sigma \sqrt{\dfrac{\sum x_i^2}{n S_{xx}}}, \quad \dfrac{\hat\beta_0 - \beta_0}{SD(\hat\beta_0)} \sim N(0,1)$
$E(\hat\beta_1) = \beta_1, \quad SD(\hat\beta_1) = \dfrac{\sigma}{\sqrt{S_{xx}}}, \quad \dfrac{\hat\beta_1 - \beta_1}{SD(\hat\beta_1)} \sim N(0,1)$

25/69 Statistical Inference on β0 and β1
Derivation. Treat the $x_i$'s as fixed and use $\sum (x_i - \bar x) = \sum x_i - n \bar x = 0$:
$\hat\beta_1 = \dfrac{\sum_{i=1}^{n} (x_i - \bar x)(Y_i - \bar Y)}{S_{xx}} = \dfrac{\sum (x_i - \bar x) Y_i}{S_{xx}} - \dfrac{\bar Y \sum (x_i - \bar x)}{S_{xx}} = \dfrac{\sum_{i=1}^{n} (x_i - \bar x) Y_i}{S_{xx}}$
$\hat\beta_0 = \bar Y - \hat\beta_1 \bar x$

26/69 Statistical Inference on β0 and β1
Derivation (continued).
$E(\hat\beta_1) = \sum_{i=1}^{n} \dfrac{(x_i - \bar x) E(Y_i)}{S_{xx}} = \sum_{i=1}^{n} \dfrac{(x_i - \bar x)(\beta_0 + \beta_1 x_i)}{S_{xx}} = \beta_0 \sum \dfrac{x_i - \bar x}{S_{xx}} + \beta_1 \sum \dfrac{(x_i - \bar x) x_i}{S_{xx}}$
$= \beta_1 \dfrac{1}{S_{xx}} \left[ \sum (x_i - \bar x) x_i - \sum (x_i - \bar x) \bar x \right] = \beta_1 \dfrac{\sum (x_i - \bar x)^2}{S_{xx}} = \beta_1$
$Var(\hat\beta_1) = \sum_{i=1}^{n} \left( \dfrac{x_i - \bar x}{S_{xx}} \right)^2 Var(Y_i) = \dfrac{\sigma^2}{S_{xx}^2} \sum (x_i - \bar x)^2 = \dfrac{\sigma^2}{S_{xx}}$

27/69 Statistical Inference on β0 and β1
Derivation (continued).
$E(\hat\beta_0) = E(\bar Y - \hat\beta_1 \bar x) = E(\bar Y) - \bar x\, E(\hat\beta_1) = \dfrac{\sum E(\beta_0 + \beta_1 x_i)}{n} - \beta_1 \bar x = \beta_0 + \beta_1 \bar x - \beta_1 \bar x = \beta_0$
$Var(\hat\beta_0) = Var(\bar Y - \hat\beta_1 \bar x) = Var(\bar Y) + \bar x^2 Var(\hat\beta_1) = \dfrac{\sigma^2}{n} + \dfrac{\bar x^2 \sigma^2}{S_{xx}} = \sigma^2 \dfrac{S_{xx} + n \bar x^2}{n S_{xx}} = \dfrac{\sigma^2 \sum x_i^2}{n S_{xx}}$
(using the fact that $\bar Y$ and $\hat\beta_1$ are uncorrelated, and $S_{xx} + n \bar x^2 = \sum x_i^2$).

28/69 Statistical Inference on β0 and β1
Since $(n-2) \dfrac{s^2}{\sigma^2} = \dfrac{SSE}{\sigma^2} \sim \chi^2_{n-2}$, replacing σ by s gives the standard errors
$SE(\hat\beta_0) = s \sqrt{\dfrac{\sum x_i^2}{n S_{xx}}}, \qquad SE(\hat\beta_1) = \dfrac{s}{\sqrt{S_{xx}}}$
Pivotal quantities (P.Q.):
$\dfrac{\hat\beta_0 - \beta_0}{SE(\hat\beta_0)} \sim t_{n-2}, \qquad \dfrac{\hat\beta_1 - \beta_1}{SE(\hat\beta_1)} \sim t_{n-2}$
Confidence intervals (CI's):
$\hat\beta_0 \pm t_{n-2,\alpha/2}\, SE(\hat\beta_0), \qquad \hat\beta_1 \pm t_{n-2,\alpha/2}\, SE(\hat\beta_1)$

29/69 Statistical Inference on β0 and β1
Hypothesis tests: $H_0: \beta_1 = \beta_1^0$ vs. $H_1: \beta_1 \ne \beta_1^0$.
Reject $H_0$ at level α if $|t_0| = \left| \dfrac{\hat\beta_1 - \beta_1^0}{SE(\hat\beta_1)} \right| > t_{n-2,\alpha/2}$.
A useful application is to show whether there is a linear relationship between x and y:
$H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \ne 0$. Reject $H_0$ at level α if $|t_0| = \left| \dfrac{\hat\beta_1}{SE(\hat\beta_1)} \right| > t_{n-2,\alpha/2}$.
One-sided alternative hypotheses can be tested using one-sided t-tests.

30/69 Analysis of Variance (ANOVA)
Mean square: a sum of squares divided by its degrees of freedom.
$MSR = \dfrac{SSR}{1}$ and $MSE = \dfrac{SSE}{n-2}$
$F_0 = \dfrac{MSR}{MSE} = \dfrac{SSR}{s^2} = \dfrac{\hat\beta_1^2 S_{xx}}{s^2} = \left( \dfrac{\hat\beta_1}{s/\sqrt{S_{xx}}} \right)^2 = t_0^2$
and $t_{n-2,\alpha/2}^2 = f_{1,n-2,\alpha}$, so the F-test is equivalent to the two-sided t-test.

31/69 Analysis of Variance (ANOVA)
ANOVA table:
Source of Variation   SS    d.f.    MS                  F
Regression            SSR   1       MSR = SSR/1         F = MSR/MSE
Error                 SSE   n - 2   MSE = SSE/(n - 2)
Total                 SST   n - 1

32/69 Statistical Inference Example – Testing for Linear Relationship
• Problem 10.4: At α = 0.05, is there a linear trend between the time to the NEXT eruption and the duration of the LAST eruption?
$H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \ne 0$. Reject $H_0$ if $|t| > t_{n-2,\alpha/2}$, where $t = \hat\beta_1 / SE(\hat\beta_1)$.

33/69 Statistical Inference – Hypothesis Testing
Solution:
$\hat\beta_1 = \dfrac{S_{xy}}{S_{xx}} = \dfrac{217.629}{22.230} = 9.790$
$s = \sqrt{\dfrac{SSE}{n-2}} = \sqrt{\dfrac{713.687}{19}} = 6.129$
$SE(\hat\beta_1) = \dfrac{s}{\sqrt{S_{xx}}} = \dfrac{6.129}{\sqrt{22.230}} = 1.2999$
$t = \dfrac{\hat\beta_1}{SE(\hat\beta_1)} = \dfrac{9.790}{1.2999} = 7.531, \qquad t_{n-2,\alpha/2} = t_{19,0.025} = 2.093$
Since 7.531 > 2.093, we reject $H_0$ and conclude that there is a linear relationship between NEXT and LAST.

34/69 Statistical Inference Example – Confidence and Prediction Intervals
• Problem 10.11 from Tamhane & Dunlop, Statistics and Data Analysis.
10.11(a) Calculate a 95% PI for the time to the next eruption if the last eruption lasted 3 minutes.

35/69 Problem 10.11 – Prediction Interval
Solution: The formula for a 100(1 - α)% PI for a future observation $Y^*$ is
$\hat Y^* \pm t_{n-2,\alpha/2}\, s \sqrt{1 + \dfrac{1}{n} + \dfrac{(x^* - \bar x)^2}{S_{xx}}}$

36/69 Problem 10.11 – Prediction Interval
$\hat\beta_1 = S_{xy}/S_{xx} = 9.790, \qquad \hat\beta_0 = \bar y - \hat\beta_1 \bar x = 31.013$
$\hat Y^* = \hat\beta_0 + \hat\beta_1 x^* = 31.013 + 9.790(3) = 60.385$
$s = \sqrt{SSE/(n-2)} = 6.129, \qquad t_{n-2,\alpha/2} = t_{19,0.025} = 2.093$
$\hat Y^* \pm t_{n-2,\alpha/2}\, s \sqrt{1 + \dfrac{1}{n} + \dfrac{(x^* - \bar x)^2}{S_{xx}}} = 60.385 \pm (2.093)(6.129) \sqrt{1 + \dfrac{1}{21} + \dfrac{(3 - 3.238)^2}{22.230}} = [47.238,\ 73.529]$

37/69 Problem 10.11 – Confidence Interval
10.11(b) Calculate a 95% CI for the mean time to the next eruption for a last eruption lasting 3 minutes. Compare this confidence interval with the PI obtained in (a).

38/69 Problem 10.11 – Confidence Interval
Solution: The formula for a 100(1 - α)% CI for $\mu^*$ is
$\hat\mu^* \pm t_{n-2,\alpha/2}\, s \sqrt{\dfrac{1}{n} + \dfrac{(x^* - \bar x)^2}{S_{xx}}}, \qquad \text{where } \hat\mu^* = \hat\beta_0 + \hat\beta_1 x^*$
The 95% CI is [57.510, 63.257]. The CI is shorter than the PI.
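Both intervals can be reproduced in SAS by appending a scoring observation with a missing response: PROC REG excludes it from the fit but still prints its predicted value, and the CLI and CLM options of the MODEL statement print 95% prediction limits and mean confidence limits, respectively. A minimal sketch, reusing the hypothetical Geyser dataset from the earlier sketch:

Data Geyser_Pred;
Set Geyser End=Eof;
Output;
If Eof Then Do; * append the scoring point Last = 3 with Next missing;
Last = 3; Next = .;
Output;
End;
Run;
Proc Reg Data=Geyser_Pred;
Model Next = Last / CLI CLM;
Run;

Up to rounding, the limits printed for the appended row should match the hand-computed [47.238, 73.529] and [57.510, 63.257].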
39/69 Regression Diagnostics
Checking the model assumptions:
1. $E(Y_i)$ is a linear function of $x_i$.
2. $Var(Y_i) = \sigma^2$ is the same for all $x_i$.
3. The errors $\epsilon_i$ are normally distributed.
4. The errors $\epsilon_i$ are independent (for time series data).
Also: checking for outliers and influential observations.

40/69 Checking the Model Assumptions
• Residuals: $e_i = y_i - \hat y_i$
• The $e_i$ can be viewed as "estimates" of the random errors $\epsilon_i$:
$e_i \sim N\!\left( 0,\ \sigma^2 \left[ 1 - \dfrac{1}{n} - \dfrac{(x_i - \bar x)^2}{S_{xx}} \right] \right)$

41/69 Checking for Linearity
• If the regression of y on x is linear, then the plot of $e_i$ vs. $x_i$ should exhibit random scatter around zero.

42/69 Checking for Linearity
Tire Wear Data:
i   x_i   y_i      ŷ_i      e_i
1    0    394.33   360.64    33.69
2    4    329.50   331.51    -2.01
3    8    291.00   302.39   -11.39
4   12    255.17   273.27   -18.10
5   16    229.33   244.15   -14.82
6   20    204.83   215.02   -10.19
7   24    179.00   185.90    -6.90
8   28    163.83   156.78     7.05
9   32    150.33   127.66    22.67
[Figure: tire wear y vs. x with the fitted LS line]

43/69 Checking for Linearity
[Figure: residuals $e_i$ vs. $x_i$ for the tire wear data; the residuals are positive at both ends and negative in the middle, indicating curvature]

44/69 Checking for Linearity
• Data transformation: if the scatter plot is not linear, transform x, y, or both before fitting a line. The candidate transformations include $x^2, x^3, \sqrt{x}, \log x, 1/x$ for x and $y^2, y^3, \sqrt{y}, \log y, 1/y$ for y, chosen according to the shape of the scatter plot.

45/69 Checking for Constant Variance
• If the constant variance assumption is correct, the dispersion of the $e_i$'s is approximately constant with respect to the $\hat y_i$'s.

46/69 Checking for Constant Variance
Example from textbook 10.21:
[Figure: residuals vs. fitted values $\hat y$]

47/69 Checking for Normality
• We can use the residuals to make a normal plot.
Example from textbook 10.21:
[Figure: normal probability plot of the residuals]

48/69 Checking for Outliers
Definition: an outlier is an observation that does not follow the general pattern of the relationship between y and x.
• A large residual indicates an outlier!!
The standardized residual is
$e_i^* = \dfrac{e_i}{SE(e_i)} = \dfrac{e_i}{s \sqrt{1 - \dfrac{1}{n} - \dfrac{(x_i - \bar x)^2}{S_{xx}}}}$
and an observation with $|e_i^*| > 2$, i.e., $|e_i|$ greater than about $2s$, is flagged as a potential outlier.

49/69 Checking for Influential Observations
An observation can be influential because it has an extreme x-value, an extreme y-value, or both.
• A large $h_{ii}$ indicates an influential observation!!
Since $\hat y_i = \sum_{j=1}^{n} h_{ij} y_j$, the leverage
$h_{ii} = \dfrac{1}{n} + \dfrac{(x_i - \bar x)^2}{S_{xx}}$
measures the influence of $y_i$ on its own fitted value; flag observations with $h_{ii} > 2(k+1)/n$, where k = number of predictors.

50/69 Checking for Influential Observations
[Figure: scatter plot illustrating an influential observation]

51/69 Why Use Correlation Analysis?
• If the nature of the relationship between X and Y is not known, we can investigate the correlation between them without making any assumptions of causality.
• In order to do this, assume (X, Y) follows the bivariate normal distribution.

52/69 The Bivariate Normal Distribution
• (X, Y) has the following joint density:
$f(x,y) = \dfrac{1}{2\pi \sigma_X \sigma_Y \sqrt{1-\rho^2}} \exp\!\left\{ -\dfrac{1}{2(1-\rho^2)} \left[ \left( \dfrac{x-\mu_X}{\sigma_X} \right)^2 - 2\rho \left( \dfrac{x-\mu_X}{\sigma_X} \right)\!\left( \dfrac{y-\mu_Y}{\sigma_Y} \right) + \left( \dfrac{y-\mu_Y}{\sigma_Y} \right)^2 \right] \right\}$

53/69 Why Can We Do This?
• This assumption reduces to the probabilistic model for linear regression, since the conditional distribution of Y given X = x is normal with parameters
$E(Y \mid X = x) = \mu_Y + \rho \dfrac{\sigma_Y}{\sigma_X} (x - \mu_X), \qquad Var(Y \mid X = x) = \sigma_Y^2 (1 - \rho^2)$
• So when X = x, the mean of Y is a linear function of x and the variance is constant w.r.t. x.

54/69 So What?
• Under these assumptions we can use the data available to make inferences about ρ.
• First we have to estimate ρ from the data. Define the sample correlation coefficient R:
$R = \dfrac{S_{XY}}{\sqrt{S_{XX} S_{YY}}}$

55/69 How Can We Use This?
• The exact distribution of R is very complicated, but we do have some options.
• Under the null hypothesis $H_0: \rho = 0$ the distribution of R is simplified; an exact test exists in this case.
• For arbitrary values of $\rho_0$ we can approximate a function of R with a normal distribution, thanks to R. A. Fisher.

56/69 Testing H0: ρ = 0
• Under $H_0$ the statistic
$T = \dfrac{R \sqrt{n-2}}{\sqrt{1 - R^2}}$
is distributed as $t_{n-2}$. This is kind of surprising, but think about it: the test statistic we used to test $\beta_1 = 0$ is distributed as $t_{n-2}$, and ρ = 0 if and only if $\beta_1 = 0$. That the two test statistics are equivalent is shown on pages 382-383 of the text.

57/69 Approximation of R
• Fisher showed that, even for n as small as 10,
$\tfrac{1}{2} \ln \dfrac{1+R}{1-R} \approx N\!\left( \tfrac{1}{2} \ln \dfrac{1+\rho}{1-\rho},\ \dfrac{1}{n-3} \right)$
• Now we can test $H_0: \rho = \rho_0$ vs. $H_1: \rho \ne \rho_0$ for arbitrary $\rho_0$. We just compute
$z = \sqrt{n-3} \left[ \tfrac{1}{2} \ln \dfrac{1+r}{1-r} - \tfrac{1}{2} \ln \dfrac{1+\rho_0}{1-\rho_0} \right]$

58/69 Almost Finished!
• We now have the tools necessary for inference on ρ. For a confidence interval for ρ, compute
$[l,\ u] = \left[ \tfrac{1}{2} \ln \dfrac{1+r}{1-r} - \dfrac{z_{\alpha/2}}{\sqrt{n-3}},\ \ \tfrac{1}{2} \ln \dfrac{1+r}{1-r} + \dfrac{z_{\alpha/2}}{\sqrt{n-3}} \right]$
and solve for ρ:
$\dfrac{e^{2l} - 1}{e^{2l} + 1} \le \rho \le \dfrac{e^{2u} - 1}{e^{2u} + 1}$

59/69 Correlation - Conclusion
• When we are not sure of the relationship between X and Y, assume $(X_i, Y_i)$ is an observation from a bivariate normal distribution. To test $H_0: \rho = \rho_0$ vs. $H_1: \rho \ne \rho_0$ at significance level α, just compare
$|z| = \sqrt{n-3} \left| \tfrac{1}{2} \ln \dfrac{1+r}{1-r} - \tfrac{1}{2} \ln \dfrac{1+\rho_0}{1-\rho_0} \right|$ to $z_{\alpha/2}$.
But if $\rho_0 = 0$, compare $t = \dfrac{r \sqrt{n-2}}{\sqrt{1-r^2}}$ to $t_{n-2,\alpha/2}$.
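In SAS, both the exact t test of ρ = 0 and the Fisher z interval and test can be obtained from PROC CORR. A minimal sketch, again assuming the hypothetical Geyser dataset from the earlier sketch; the RHO0= suboption's null value of 0.5 is purely illustrative (not from these slides), and BIASADJ=NO requests the unadjusted z transform used above:

* Pearson table: r and the p-value of the exact t test of H0: rho = 0;
* Fisher table: the z transform, a CI for rho, and the test of H0: rho = 0.5;
Proc Corr Data=Geyser Fisher(Biasadj=No Rho0=0.5);
Var Last Next;
Run;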
60/69 SAS - Reg Procedure
Proc Reg Data=Regression_Example;
Title "Regression Example";
Model Next = Last;
Plot Next*Last;
Plot Residual.*Predicted.;
Output Out=Data_From_Regression Residual=R Predicted=PV;
Run;

61/69 Proc Reg Output
[Figure: PROC REG output listing]

62/69 Plot Next*Last
[Figure: scatter plot of Next vs. Last with the fitted line]

63/69 SAS - Plotting Regression Line
Symbol1 Value=Dot C=blue I=R;
Symbol2 Value=None C=red I=RLCLM95;
Proc Gplot Data=Regression_Example;
Title "Regression Line and CIs";
Plot Next*Last=1 Next*Last=2/Overlay;
Run;

64/69 Plotting Regression Line
[Figure: regression line with 95% confidence limits overlaid on the data]

65/69 SAS - Checking Homoscedasticity
Proc Reg Data=Regression_Example;
Title "Regression Example";
Model Next = Last;
Plot Next*Last;
Plot Residual.*Predicted.;
Output Out=Data_From_Regression Residual=R Predicted=PV;
Run;

66/69 Plot Residual.*Predicted.
[Figure: residuals vs. predicted values]

67/69 SAS - Checking Normality of Residuals
Proc Reg Data=Regression_Example;
Model Next = Last;
Output Out=Data_From_Regression Residual=R Predicted=PV;
Proc Univariate Data=Data_From_Regression Normal;
Var R;
QQPlot R / Normal(Mu=est Sigma=est);
Run;

68/69 Checking for Normality
[Figure: normal Q-Q plot of the residuals]

69/69 Questions?