Action Research: Correlation and Regression
INFO 515, Lecture #7
Glenn Booker

Slide 2: Measures of Association
Measures of association are used to determine how strong the relationship is between two variables or measures, and how we can predict such a relationship. They apply only to interval or ratio scale variables. (Everything this week applies only to interval or ratio scale variables!)

Slide 3: Measures of Association
For example, suppose I have GRE and GPA scores for a random sample of graduate students:
- How strong is the relationship between GRE scores and GPA? Do these variables relate to each other in some way?
- If there is a strong relationship, how well can we predict the values of one variable when the values of the other are known?

Slide 4: Strength of Prediction
Two techniques are used to describe the strength of a relationship, and to predict values of one variable when another variable's value is known:
- Correlation describes the degree (strength) to which the two variables are related.
- Regression is used to predict the values of one variable when values of the other are known.

Slide 5: Strength of Prediction
Correlation and regression are linked: the ability to predict one variable when another is known depends on the degree and direction of the variables' relationship in the first place. We find the correlation before we calculate the regression, so generating a regression without checking for a correlation first is pointless (though we'll do both at once).

Slide 6: Correlation
There are different types of statistical measures of correlation; they give us a measure known as the correlation coefficient. The most common procedure used is known as the Pearson's Product Moment Correlation, or Pearson's 'r'.

Slide 7: Pearson's 'r'
- Can only be calculated for interval or ratio scale data.
- Its value is a real number from -1 to +1.
- Strength: as the value of 'r' approaches -1 or +1, the relationship is stronger; as the magnitude of 'r' approaches zero, we see little or no relationship.

Slide 8: Pearson's 'r'
For example, 'r' might equal 0.89, -0.9, 0.613, or -0.3. Which would be the strongest correlation? Direction: positive versus negative correlation cannot be distinguished from the reported 'r' alone; the direction of the correlation depends on the type of equation used, and the resulting constants obtained for it.

Slide 9: Example of Relationships
Positive direction: as the independent variable increases, the dependent variable tends to increase.

Student   GRE (X)   GPA1 (Y)
1         1500      4.0
2         1400      3.8
3         1250      3.5
4         1050      3.1
5          950      2.9

Slide 10: Example of Relationships
Negative direction: as the independent variable increases, the dependent variable tends to decrease.

Student   GRE (X)   GPA2 (Y)
1         1500      2.9
2         1400      3.1
3         1250      3.4
4         1050      3.7
5          950      4.0

Slide 11: Positive and Negative Correlation
(Figure: two scatter plots of observed data with fitted lines, GPA1 vs. GRE for the slide 9 data and GPA2 vs. GRE for the slide 10 data, each over the GRE range 900 to 1500. Left: positive correlation, r = 1.0. Right: negative correlation, r = 1.0.)
Notice that a high 'r' doesn't tell whether the correlation is positive or negative!

Slide 12: *Important Note*
An association value provided by a correlation analysis, such as Pearson's 'r', tells us nothing about causation. In this case, high GRE scores don't necessarily cause high or low GPA scores, and vice versa.

Slide 13: Significance of r
We can test for the significance of r (to see whether our relationship is statistically significant) by consulting a table of critical values for r (Action Research p. 41/42), the table "VALUES OF THE CORRELATION COEFFICIENT FOR DIFFERENT LEVELS OF SIGNIFICANCE", where df = (number of data pairs) - 2.

Slide 14: Significance of r
We test the null hypothesis that the correlation between the two variables is equal to zero (there is no relationship between them). Reject the null hypothesis (H0) if the absolute value of r is greater than the critical r value: reject H0 if |r| > rcrit. This is similar to evaluating actual versus critical 't' values.

Slide 15: Significance of r Example
So if we had 20 pairs of data, then for two-tail 95% confidence (P = .05), the critical 'r' value at df = 20 - 2 = 18 is 0.444. So reject the null hypothesis (hence the correlation is statistically significant) if r > 0.444 or r < -0.444.

Slide 16: Strength of |r|
The absolute value of Pearson's 'r' indicates the strength of a correlation:
- 1.0 to 0.9: very strong correlation
- 0.9 to 0.7: strong
- 0.7 to 0.4: moderate to substantial
- 0.4 to 0.2: moderate to low
- 0.2 to 0.0: low to negligible correlation
Notice that a correlation can be strong, but still not be statistically significant (especially for small data sets)!

Slide 17: *Important Notes*
The stronger the r, the smaller the standard error of the estimate, and the better the prediction! A significant r does not necessarily mean that you have a strong correlation; a significant r means that whatever correlation you do have is not due to random chance.

Slide 18: Coefficient of Determination
By squaring r, we can determine the amount of variance the two variables share (called "explained variance"). R Square is the coefficient of determination. So an R Square of 0.94 means that 94% of the variance in the Y variable is explained by the variance of the X variable.

Slide 19: What is R Squared?
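As a concrete illustration, here is a minimal Python sketch (the `pearson_r` helper name is mine, not from the lecture) that computes Pearson's 'r' and R Square for the GRE/GPA1 data from the positive-correlation example above, then checks significance against a critical value. The 0.878 figure is the two-tail P = .05 critical r at df = 3 from a standard table of the kind cited above.

```python
import math

# GRE/GPA1 data from the positive-correlation example above
gre  = [1500, 1400, 1250, 1050, 950]
gpa1 = [4.0, 3.8, 3.5, 3.1, 2.9]

def pearson_r(x, y):
    """Pearson's product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(gre, gpa1)
r_squared = r ** 2   # coefficient of determination: share of "explained variance"

# Significance check: df = (number of pairs) - 2 = 3; 0.878 is the two-tail
# P = .05 critical value for df = 3 from a standard table of critical r values
r_crit = 0.878
significant = abs(r) > r_crit

print(round(r, 3), round(r_squared, 3), significant)   # 1.0 1.0 True
```

Note that with only five pairs (df = 3) the critical value is 0.878, so even fairly strong correlations would fail the test: small data sets need very strong correlations to reach significance.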
- The coefficient of determination, R², is a measure of the goodness of fit.
- R² ranges from 0 to 1.
- R² = 1 is a perfect fit (all data points fall on the estimated line or curve).
- R² = 0 means that the variable(s) have no explanatory power.

Slide 20: What is R Squared?
- Having R² closer to 1 helps choose which regression model is best suited to a problem.
- Having R² actually equal zero is very difficult: a sample of ten random numbers from Excel still obtained an R² of 0.006.

Slide 21: Scatter Plots
It's nice to use R² to determine the strength of a relationship, but visual feedback helps verify whether the model fits the data well. It also helps in looking for data fliers (outliers). A scatter plot (or scattergram) allows us to compare any two interval or ratio scale variables, and see how the data points are related to each other.

Slide 22: Scatter Plots
Scatter plots are two-dimensional graphs with an axis for each variable (independent variable X and dependent variable Y). To construct one, place a point on the graph for each X and Y value from the data. Seeing data this way can help choose the correct mathematical model for the data.

Slide 23: Scatter Plots
(Diagram: a scatter plot with X on the horizontal, independent axis and Y on the vertical, dependent axis, origin at (0, 0); the data point (2, 3) is plotted at X = 2, Y = 3.)

Slide 24: Models
Models allow us to focus on select elements of the problem at hand, and ignore irrelevant ones. They may show how parts of the problem relate to each other; may be expressed as equations, mappings, or diagrams; and may be chosen or derived before or after measurement (theory vs. empirical).

Slide 25: Modeling
Often we look for a linear relationship, one described by fitting a straight line to the data as well as possible. More generally, any equation could be used as the basis for regression modeling, or describing the relationship between two variables. You could have Y = a*X**2 + b*ln(X) + c*sin(d*X - e).

Slide 26: Linear Model
(Diagram: a straight line plotted with the dependent variable Y on the vertical axis.)
The line is Y = m*X + b, or equivalently Y = b0 + b1*X, where m is the slope (the change in Y per one unit of X) and b is the Y-axis intercept; X, the independent variable, is on the horizontal axis.

Slide 27: Linear Model
Pearson's 'r' for linear regression is calculated per (Action Research p. 29/30). Define:
- N = number of data pairs
- SX = sum of all X values
- SX2 = sum of all (X values squared)
- SY = sum of all Y values
- SY2 = sum of all (Y values squared)
- SXY = sum of all (X values times Y values)
Then Pearson's r = [N*(SXY) - (SX)*(SY)] / sqrt[(N*(SX2) - (SX)^2) * (N*(SY2) - (SY)^2)]

Slide 28: Linear Model
For the linear model, you could find the slope 'm' and Y-intercept 'b' from
m = (r) * (standard deviation of Y) / (standard deviation of X)
b = (mean of Y) - (m)*(mean of X)
But it's a lot easier to use SPSS: slope = b1 and Y intercept = b0.

Slide 29: Regression Analysis
Regression allows us to predict the likely value of one variable from knowledge of another variable. The two variables should be fairly highly correlated (close to a straight line). The regression equation is a mathematical expression of the relationship between two variables on, for example, a straight line.

Slide 30: Regression Equation
Y = mX + b. In this linear equation, you predict Y values (the dependent variable) from known values of X (the independent variable); this is called the regression of Y on X. The regression equation is fundamentally an equation for plotting a straight line, so the stronger our correlation, the closer our variables will fall to a straight line, and the better our prediction will be.

Slide 31: Linear Regression
(Diagram: data points y scattered around a fitted line y-hat = a + b*x, so each observed value is y = y-hat + e, where e is the error.)
Choose the "best" line by minimizing the sum of the squares of the vertical distances between the data points and the regression line.

Slide 32: Standard Error of the Estimate
The standard error of the estimate is the standard deviation of the data around the regression line. It tells how much the actual values of Y deviate from the predicted values of Y.

Slide 33: Standard Error of the Estimate
After you calculate the standard error of the estimate, you add and subtract the value from your predicted values of Y to get a % area around the regression line within which you would expect repeated actual values to occur or cluster if you took many samples (sort of like a sampling distribution for the mean).

Slide 34: Standard Error of Estimate
The standard error of estimate for Y predicted by X is
sy/x = sqrt[sum of (Y - predicted Y)^2 / (N - 2)]
where 'Y' is each actual Y value, 'predicted Y' is the Y value predicted by the linear regression, and 'N' is the number of data pairs. For the example on (Action Research p. 33/34), sy/x = sqrt(2.641/(10 - 2)) = 0.574.

Slide 35: Standard Error of the Estimate
So, if the standard error of the estimate is 0.574, and you have a predicted Y value of 4.560, then 68% of your actual values, with repeated sampling, would fall between 3.986 and 5.134 (predicted Y +/- 1 standard error). The smaller the standard error, the closer your actual values are to the regression line, and the more confident you can be in your prediction.

Slide 36: SPSS Regression Equations
Instead of constants called 'm' and 'b', 'b0' and 'b1' are used for most equations. The meaning of 'b0' and 'b1' varies, depending on the type of equation being modeled. You can suppress the use of 'b0' by unchecking "Include constant in equation".

Slide 37: SPSS Regression Models
- Linear model: Y = b0 + b1*X
- Logarithmic model: Y = b0 + b1*ln(X), where 'ln' = natural log
- Inverse model: Y = b0 + b1/X; similar to the form X*Y = constant, which is a hyperbola

Slide 38: SPSS Regression Models
Here "**" indicates "to the power of".
- Power model: Y = b0*(X**b1)
- Compound model: Y = b0*(b1**X)
- A variant of the compound model is the Logistic model, which requires a constant input 'u' which is larger than Y for any actual data point: Y = 1/[1/u + b0*(b1**X)]

Slide 39: SPSS Regression Models
Here "exp" means "e to the power of"; e = 2.7182818...
- Exponential model: Y = b0*exp(b1*X)
- S model: Y = exp(b0 + b1/X)
- Growth model (almost identical to the exponential model): Y = exp(b0 + b1*X)

Slide 40: SPSS Regression Models
Polynomials beyond the Linear model (linear is a first-order polynomial):
- Quadratic (second order): Y = b0 + b1*X + b2*X**2
- Cubic (third order): Y = b0 + b1*X + b2*X**2 + b3*X**3
These are the only equations which use constants b2 and b3. Higher-order polynomials require the Regression module of SPSS, which can do regression using any equation you enter.

Slide 41: Y = whattheflock?
To help picture these equations:
- Make an X variable over some typical range (0 to 10 in a small increment, maybe 0.01).
- Define a Y variable.
- Calculate the Y variable using Transform > Compute... and whatever equation you want to see; pick values for b0 and b1 that aren't 0, 1, or 2.
- Have SPSS plot the results of a regression of Y vs. X for that type of equation.

Slide 42: How Apply This?
Given a set of data containing two variables of interest, generate a scatter plot to get some idea of what the data looks like, then choose which types of models are most likely to be useful. For only linear models, use Analyze / Regression / Linear...

Slide 43: How Apply This?
Select the Independent (X) and Dependent (Y) variables. Rules may be applied to limit the scope of the analysis, e.g. gender=1. Dozens of other characteristics may also be obtained, which are beyond our scope here.

Slide 44: How Apply This?
Then check the R Square value in the Model Summary, and check the Coefficients to make sure they are all significant (e.g. Sig. < 0.050). If so, use the 'b0' and 'b1' coefficients from under the 'B' column (see the Statistics for Software Process Improvement handout), plus or minus the standard errors "SE B".

Slide 45: Regression Example
For example, go back to the "GSS91 political.sav" data set. Generate a linear regression (Analyze > Regression > Linear) for 'age' as the Independent variable, and 'partyid' as the Dependent variable. Notice that R² and the ANOVA summary are given, with F and its significance.

Slide 46: Regression Example

Model Summary
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .075   .006       .005                2.082
a. Predictors: (Constant), AGE OF RESPONDENT

ANOVA
Model 1      Sum of Squares   df     Mean Square   F       Sig.
Regression   36.235           1      36.235        8.361   .004
Residual     6457.063         1490   4.334
Total        6493.298         1491
a. Predictors: (Constant), AGE OF RESPONDENT
b. Dependent Variable: POLITICAL PARTY AFFILIATION

Slide 47: Regression Example
The R Square of 0.006 means there is a very slight correlation (little strength). But the ANOVA significance well under 0.050 confirms there is a statistically significant relationship here; it's just a really weak one.

Slide 48: Regression Example
Output from Analyze > Regression > Linear:

Coefficients (Dependent Variable: POLITICAL PARTY AFFILIATION)
Model 1             B       Std. Error   Beta    t        Sig.
(Constant)          3.333   .148                 22.462   .000
AGE OF RESPONDENT   -.009   .003         -.075   -2.892   .004

Output from Analyze > Regression > Curve Estimation:

Coefficients
                    B       Std. Error   Beta    t        Sig.
AGE OF RESPONDENT   -.009   .003         -.075   -2.892   .004
(Constant)          3.333   .148                 22.462   .000

Slide 49: Regression Example
The heart of the regression analysis is in the Coefficients section. We could look up 't' in a critical values table, but it's easier to see if all values of Sig. are < 0.050; if they are, reject the null hypothesis, meaning there is a significant relationship. If so, use the values under B for b0 and b1. If any coefficient has Sig. > 0.050, don't use that regression (the coefficient might be zero).

Slide 50: Regression Example
The answer to "what is the effect of age on political view?" is that there is a very weak but statistically significant linear relationship, with a reduction of 0.009 (b1) political view categories per year. From the Variable View of the data, since low values are liberal and large values conservative, this means that people tend to get slightly more liberal as they get older.

Slide 51: Curve Estimation Example
For the other regression options, choose Analyze / Regression / Curve Estimation... Define the Dependents (variable) and the Independent variable; note that multiple Dependents may be selected. Check which math models you want used, and display the ANOVA table for reference.

Slide 52: Curve Estimation Example
SPSS tip: up to three regression models can be plotted at once, so don't select more than that if you want a scatter plot to go with the data and the regressions. For the same example just used, get a summary for the linear and quadratic models (Analyze > Regression > Curve Estimation), and find "R Square" for each model. Generally pick the model with the largest R Square. We already saw the Linear output; now see the Quadratic.

Slide 53: Curve Estimation Example
For the quadratic regression, R Square is slightly higher, and the ANOVA is still significant.

Model Summary
R      R Square   Adjusted R Square   Std. Error of the Estimate
.094   .009       .008                2.079
The independent variable is AGE OF RESPONDENT.
ANOVA
             Sum of Squares   df     Mean Square   F       Sig.
Regression   57.801           2      28.901        6.687   .001
Residual     6435.496         1489   4.322
Total        6493.298         1491
The independent variable is AGE OF RESPONDENT.

Slide 54: Curve Estimation Example
The Quadratic coefficients are all significant at the 0.050 level.

Coefficients
                         B       Std. Error   Beta    t        Sig.
AGE OF RESPONDENT        -.048   .018         -.410   -2.691   .007
AGE OF RESPONDENT ** 2   .000    .000         .341    2.234    .026
(Constant)               4.191   .412                 10.175   .000

Interpret this as partyid = (4.191 +/- 0.412) + (-0.048 +/- 0.018)*age + (0.0003918 +/- 0.0001754)*age**2. (Edit the output table, then double-click on the cells, to get the full-precision values of b2 and its standard error, which display as .000 above.)

Slide 55: Curve Estimation Example
The data set will be plotted as the Observed points, with the regression models shown for comparison. Look to see which model most closely matches the data, and look for regions of data which do or don't match the model well (if any).

Slide 56: Curve Estimation Example
(Figure: scatter plot of the observed data with the linear and quadratic regression curves overlaid for comparison.)

Slide 57: Curve Estimation Procedure
- See which models are significant (throw out the rest!).
- Compare the R Square values to see which provides the best fit.
- Use the graph to verify visually that the correct model was chosen.
- Use the model equation's 'B' values and their standard errors to describe and predict the data's behavior.
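The curve-estimation procedure above can also be sketched outside SPSS. Here is a minimal, standard-library Python illustration on made-up data (not the GSS91 data set; the `polyfit` and `r_square` helper names are mine): fit linear and quadratic models by least squares via the normal equations, then compare their R Square values and keep the better fit.

```python
def polyfit(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations,
    solved by Gaussian elimination. Returns [b0, b1, ...]."""
    n = degree + 1
    # Normal-equation matrix A and right-hand side v
    A = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    v = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Forward elimination with partial pivoting
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        v[col], v[piv] = v[piv], v[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            v[r] -= f * v[col]
    # Back substitution
    b = [0.0] * n
    for i in reversed(range(n)):
        b[i] = (v[i] - sum(A[i][j] * b[j] for j in range(i + 1, n))) / A[i][i]
    return b

def r_square(xs, ys, b):
    """Coefficient of determination for polynomial coefficients b."""
    pred = [sum(bi * x ** i for i, bi in enumerate(b)) for x in xs]
    ym = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, pred))
    ss_tot = sum((y - ym) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

xs = [1, 2, 3, 4, 5, 6]
ys = [1.1, 3.9, 9.2, 15.8, 25.1, 35.9]   # roughly y = x**2

lin  = r_square(xs, ys, polyfit(xs, ys, 1))
quad = r_square(xs, ys, polyfit(xs, ys, 2))
print(lin < quad)   # True: the quadratic model fits this curved data better
```

As on slide 57, in real use you would also check that each model's coefficients are significant and inspect the plot before trusting the higher R Square.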