Transcript Document
Regression
• For the purposes of this class:
– Does Y depend on X?
– Does a change in X cause a change in Y?
– Can Y be predicted from X?
• Y = mX + b

[Figure: scatter plot of actual values, predicted values, and the overall mean; Dependent Value vs. Independent Value]

When analyzing a regression-type data set, the first step is to plot the data:

X    Y
35   114
45   120
55   150
65   140
75   166

(X_avg, Y_avg) = (55, 138)

[Figure: scatter plot of the data; Dependent Value (Y) vs. Independent Value (X)]

The next step is to determine the line that 'best fits' these points. It appears this line would be sloped upward and linear (straight).

The line of best fit is the sample regression of Y on X, and its position is fixed by two results:
1) The regression line passes through the point (X_avg, Y_avg).
2) Its slope is "m" units of Y per unit of X, where m is the regression coefficient (the slope in y = mx + b).

Y = 1.24(X) + 69.8

[Figure: regression line with the slope (rise/run) and Y-intercept labeled, passing through (55, 138)]

Testing the Regression Line for Significance
• An F-test is used, based on the Model, Error, and Total SOS (sums of squares).
– Very similar to ANOVA.
• Basically, we are testing whether the regression line has a significantly different slope than a horizontal line drawn at Y_avg.
– If there is no difference, then Y does not change as X changes (it stays around the average value).
• To begin, we must first find the regression line that has the smallest Error SOS.

Error SOS
The regression line should pass through the overall average with the slope that gives the smallest Error SOS (Error SOS = the squared distance between each point and the predicted line; it gives an index of the variability of the data points around the predicted line).
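As a minimal sketch of how the line of best fit is obtained from the class data set (variable names are my own, not from the lecture), the least-squares slope is the sum of cross-deviations divided by the sum of squared X-deviations, and the intercept follows from the fact that the line passes through (X_avg, Y_avg):

```python
# Least-squares fit of Y on X for the example data set in the text.
# m = sum((x - x_avg)(y - y_avg)) / sum((x - x_avg)^2);  b = y_avg - m * x_avg
xs = [35, 45, 55, 65, 75]
ys = [114, 120, 150, 140, 166]

x_avg = sum(xs) / len(xs)   # 55
y_avg = sum(ys) / len(ys)   # 138

sxy = sum((x - x_avg) * (y - y_avg) for x, y in zip(xs, ys))
sxx = sum((x - x_avg) ** 2 for x in xs)

m = sxy / sxx               # slope
b = y_avg - m * x_avg       # Y-intercept
print(f"Y = {m:.2f}(X) + {b:.1f}")   # Y = 1.24(X) + 69.8
```

This reproduces the line Y = 1.24(X) + 69.8 given on the slide.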
[Figure: regression line pivoting through the overall average point (55, 138)]

For each X, we can predict Y: Y = 1.24(X) + 69.8

Error SOS is calculated as the sum of (Y_Actual – Y_Predicted)^2. This gives us an index of how scattered the actual observations are around the predicted line. The more scattered the points, the larger the Error SOS will be. This is like analysis of variance, except we are using the predicted line instead of the mean value.

X    Y_Actual  Y_Pred  SOS_Error
35   114       113.2   0.64
45   120       125.6   31.36
55   150       138     144
65   140       150.4   108.16
75   166       162.8   10.24
                  Sum: 294.4

Total SOS
• Calculated as the sum of (Y – Y_avg)^2.
• Gives us an index of how scattered our data set is around the overall Y average.

[Figure: data points scattered around the overall Y average; regression line not shown]

Total SOS gives us an index of how scattered the data points are around the overall average. This is calculated the same way as for a single treatment in ANOVA.

X    Y_Actual  Y_Avg  SOS_Total
35   114       138    576
45   120       138    324
55   150       138    144
65   140       138    4
75   166       138    784
                 Sum: 1832

What happens to Total SOS when all of the points are close to the overall average? What happens when the points form a non-horizontal linear trend?

Model SOS
• Calculated as the sum of (Y_Predicted – Y_avg)^2.
• Gives us an index of how far the predicted values are from the overall average value.

[Figure: vertical distances between each predicted Y and the overall mean]

X    Y_Pred  Y_Avg  SOS_Model
35   113.2   138    615.04
45   125.6   138    153.76
55   138     138    0
65   150.4   138    153.76
75   162.8   138    615.04
               Sum: 1537.6

• What happens to Model SOS when all of the predicted values are close to the average value?

All Together Now!!
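The three sums of squares above can be checked with a short sketch (again using my own variable names); note how Model SOS and Error SOS add up to Total SOS:

```python
# Sums of squares for the example regression Y = 1.24(X) + 69.8.
xs = [35, 45, 55, 65, 75]
ys = [114, 120, 150, 140, 166]
y_avg = sum(ys) / len(ys)                 # 138
y_pred = [1.24 * x + 69.8 for x in xs]    # 113.2, 125.6, 138.0, 150.4, 162.8

sos_error = sum((ya - yp) ** 2 for ya, yp in zip(ys, y_pred))  # 294.4
sos_total = sum((ya - y_avg) ** 2 for ya in ys)                # 1832
sos_model = sum((yp - y_avg) ** 2 for yp in y_pred)            # 1537.6

# Partition identity: Model SOS + Error SOS = Total SOS
print(round(sos_error, 2), round(sos_total, 2), round(sos_model, 2))
```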
X    Y_Actual  Y_Pred  SOS_Error  Y_Avg  SOS_Total  SOS_Model
35   114       113.2   0.64       138    576        615.04
45   120       125.6   31.36      138    324        153.76
55   150       138     144        138    144        0
65   140       150.4   108.16     138    4          153.76
75   166       162.8   10.24      138    784        615.04
                 Sum:  294.4             1832       1537.6

SOS_Error = sum of (Y_Actual – Y_Pred)^2
SOS_Total = sum of (Y_Actual – Y_Avg)^2
SOS_Model = sum of (Y_Pred – Y_Avg)^2

Using SOS to Assess the Regression Line
• Model SOS gives us an index of how 'different' the predicted values are from the average values.
– Bigger Model SOS = more different.
– Tells us how different a sloped line is from a line made up only of Y_avg.
– Remember, the regression line will pass through the overall average point.
• Error SOS gives us an index of how different the predicted values are from the actual values.
– More variability = larger Error SOS = large distance between predicted and actual values.

Magic of the F-test
• The ratio of Model SOS to Error SOS (Model SOS divided by Error SOS) gives us an overall index (the F statistic) used to indicate the relative 'difference' between the regression line and a line with a slope of zero (all values = Y_avg).
– A large Model SOS and a small Error SOS = a large F statistic. Why does this indicate a significant difference?
– A small Model SOS and a large Error SOS = a small F statistic. Why does this indicate no significant difference?
• Based on the sample size, each F statistic has an associated P-value, which we compare to the alpha level.
– P < 0.05 (large F statistic): there IS a significant difference between the regression line and the Y_avg line.
– P ≥ 0.05 (small F statistic): there is NO significant difference between the regression line and the Y_avg line.

F = Mean Model SOS / Mean Error SOS

Basically, this is an index that tells us how different the regression line is from Y_avg, relative to the scatter of the data around the predicted values.
[Figure: regression line Y = 1.24(X) + 69.8 with the slope (rise/run) and Y-intercept labeled]

Use the regression line to predict a specific number or a specific change.

Correlation (r): another measure of the mutual linear relationship between two variables.
• 'r' is a pure number without units or dimensions.
• 'r' is always between –1 and 1.
• Positive values indicate that y increases when x does; negative values indicate that y decreases when x increases.
– What does r = 0 mean?
• 'r' is a measure of the intensity of the association observed between x and y.
– 'r' does not predict – it only describes associations between variables.
• r is also called Pearson's correlation coefficient.

[Figure: three scatter plots of Dependent Variable vs. Independent Variable illustrating r > 0, r < 0, and r = 0]

R-square
• If we square r, we get rid of the negative sign (if it is negative) and we get an index of how close the data points are to the regression line.
• Allows us to decide how much confidence we have in making a prediction based on our model.
• Is calculated as Model SOS / Total SOS.

r^2 = Model SOS / Total SOS

[Figure: two scatter plots showing r^2 as numerator (Model SOS) over denominator (Total SOS); one with R^2 = 0.8393, and one with R^2 = 0.0144, where a small numerator and big denominator give a near-zero r^2]

R-square and Prediction Confidence

[Figure: four scatter plots with R^2 = 0.0144, 0.5537, 0.7605, and 0.9683, showing progressively tighter fits as R^2 increases]

Finally……..
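A sketch of both quantities for the class data set (my own variable names), showing that squaring Pearson's r gives the same value as Model SOS / Total SOS:

```python
from math import sqrt

# Pearson's r and r^2 for the example data.
xs = [35, 45, 55, 65, 75]
ys = [114, 120, 150, 140, 166]
x_avg, y_avg = sum(xs) / len(xs), sum(ys) / len(ys)

sxy = sum((x - x_avg) * (y - y_avg) for x, y in zip(xs, ys))
sxx = sum((x - x_avg) ** 2 for x in xs)
syy = sum((y - y_avg) ** 2 for y in ys)   # this is also the Total SOS

r = sxy / sqrt(sxx * syy)                 # Pearson's correlation coefficient
r_squared = r ** 2                        # equals Model SOS / Total SOS
print(round(r, 3), round(r_squared, 4))   # 0.916 0.8393
```

The r^2 of 0.8393 matches the R^2 = 0.8393 shown on the slide (1537.6 / 1832).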
• If we have a significant relationship (based on the P-value), we can use the r-square value to judge how confident we can be in making a prediction.