Association between variables

Two variables are associated if knowing the values of one variable helps a lot in predicting the values of the other variable. If there's a weak association, then knowing the values of one variable doesn't help much in predicting the values of the other. Associations are tendencies, not ironclad rules, and association does not necessarily mean causation. What might be the association between IQ test scores & family income? What would such an association signify concerning causation?

Caution: be careful if the variables are measured on different sets of observations. E.g., the association of human body-weight or blood pressure with age: what if the data are cross-sectional, not longitudinal? That is, what are the pitfalls of reading ostensibly longitudinal trends from cross-sectional data? See Freedman et al., Statistics, pages 58-61.

Key questions about variables:
How are the variables defined & measured (i.e. operationalized)? Are these theoretically & empirically adequate?
Are the variables categorical (nominal, ordinal) or quantitative (interval, ratio)?
Are there response (i.e. outcome, dependent) variables & explanatory (i.e. independent, predictor) variables? Do these make sense? What if they were reversed?
Who do the data represent? How were the data collected? Was this adequate?

Scatterplot: shows the relationship between the values of two quantitative variables. Look for the overall pattern & striking deviations, including outliers. Describe the overall pattern by its form, direction & strength.

"Outlier" defined. W.N. Venables and B.D. Ripley, Modern Applied Statistics with S (p. 119): "Outliers are sample values that cause surprise in relation to the majority of the sample." As Austin Nichols wrote on the Statalist listserv (February 21, 2008), this definition implies that "such surprise is a function of the model contemplated and the subject-matter knowledge of the researcher, and not an inbuilt characteristic of the data."

[Figure: . scatter read math || lfit read math — reading score vs. math score with linear fitted values]

Commonly (though not always) the following two kinds of variables are inspected in a scatterplot:
Explanatory (or independent or predictor) variable: predicts, explains, or perhaps causes changes in the response variable.
Outcome (or response or dependent) variable.

Scatterplot's overall pattern & striking deviations:
Form: degree of linear or curvilinear association.
Direction: degree of positive or negative association.
Strength: degree of adherence to a clear form.
Outliers: number & distance from the overall pattern.

When the observations on a scatterplot are tightly clustered around a diagonal line, there is a strong linear association between the two variables: positive association or negative association. Neither necessarily signifies causation.

[Figure: . scatter read math || qfit read math — reading score vs. math score with quadratic fitted values]

Positive or negative linear association? Here's a curvilinear relationship. Meaning?

When interpreting a scatterplot, beware of lurking (i.e. unmeasured confounding) variables. Examples? Utts & Heckard (Statistical Ideas and Methods) cite the example of a surprisingly negative relationship: the more pages in a book, the cheaper the book on average. The relationship changed direction when the lurking variable was taken into account (i.e. controlled). The lurking variable? To anticipate later discussion, how would you graph it? (One way to graph such a 'control' is sketched below.)
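Here is a minimal Stata sketch of that graphing idea. The slides' book-price data aren't at hand, so the hsb2 data used throughout the class stand in, with ses chosen purely for illustration as the third, possibly lurking, categorical variable:

. use hsb2, clear
. scatter read math || lfit read math              // the pooled bivariate relationship
. scatter read math || lfit read math, by(ses)     // the same relationship within each level of a third variable

If the pooled pattern differs from the within-level patterns, the third variable is a candidate lurking variable.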
The lesson learned? A lurking (i.e. unmeasured confounding) variable (1) affects the response variable and (2) is associated with the explanatory variable. That is, the effect of a lurking variable on a response variable is mixed up with the effect of the explanatory variable. Always consider the potential effects of lurking variables!

[Figures: . scatter read math || qfit read math and . scatter read math, ml(id) || qfit read math — reading score vs. math score with fitted values; the second plot labels each point with its id]

How to do a scatterplot in Stata:
. use hsb2, clear
. kdensity read, norm
. gr box read
. kdensity math, norm
. gr box math
. summarize read math, detail
. scatter read math
. scatter read math || lfit read math
or:
. scatter read math || qfit read math
. scatter read math, ml(id) || qfit read math

lfit: 'linear fit.' qfit: 'quadratic fit,' which permits graphing of possible curvilinear relationships. To explore more bivariate complexity, use lowess (the locally weighted scatterplot smoother):
. lowess read math

[Figure: Lowess smoother of reading score on math score, bandwidth = .8]

How to eliminate specified observations in a scatterplot? If there's no id-variable, create one:
. generate id = _n
. list id

Then display the scatterplot, eliminating the specified observations:
. scatter read math if id~=19 & id~=167, ml(id) || qfit read math

[Figures: . kdensity read, norm; . gr box read; . kdensity math, norm (kernel = epanechnikov, bandwidth = 2.92); . gr box math; . scatter read math || qfit read math]
[Figure: . scatter read math, ml(id) || qfit read math — reading score vs. math score with each point labeled by its id]

[Figure: . scatter read write if id~=32 & id~=92, ml(id) || qfit read write — the same kind of labeled plot with two specified observations excluded]

[Figure: . lowess read math if id~=32 & id~=92 — Lowess smoother, bandwidth = .8, with the two observations excluded]

Here's how to examine a quantitative bivariate scatterplot in terms of a categorical variable:
. scatter read math, mlabel(id)
. scatter read math || qfit read math, ml(id)
. scatter read math, ml(female)
. scatter read math, ml(race)
. scatter read math, ml(prog)
. scatter read math, by(female)
. scatter read math, by(race)

[Figure: . scatter read math, ml(id) — reading score vs. math score with each point labeled by its id]
[Figure: . scatter read math, ml(female) — reading score vs. math score with each point labeled female or male]

[Figure: . scatter read math, by(female) — separate panels for female and male (female=1, male=0)]

Categorical explanatory variable: how to examine its relationship with a quantitative variable? Use a box plot or a stem plot to graph the quantitative variable by the categorical variable:
. graph box science, over(female, total)
. bys female: stem science

[Figure: . gr box math, over(female, total) — box plots of math score for male, female & total]

Here, again, consider the potential effects of lurking (i.e. unmeasured confounding) variables.

We'll next examine associations from the standpoints of correlation & regression. Both correlation & regression are computed via means & standard deviations. Consequently, both of these statistics are highly sensitive to pronounced skewness & extreme observations.

Correlation measures the direction & strength of the linear (i.e. straight-line) relationship between two quantitative variables. That is, it measures the degree to which the bivariate observations cluster along a straight line, and whether the direction of the relationship is positive or negative. A correlation is stronger to the degree that the bivariate data cluster along the straight line, and weaker to the degree that they do not. Correlation does not describe causal relationships, and, as always, beware of lurking variables.

Later in the course we'll review measures of correlation & other forms of association involving categorical variables; on that topic, see the class's slides for chapter 10.

If the bivariate scatterplot of two quantitative variables displays a tight, pronounced curvilinear cluster, will the correlation coefficient be relatively strong or weak? It will be relatively weak, because a correlation coefficient measures a linear relationship between two quantitative variables. Even pronounced curvilinear bivariate relationships yield weak correlation coefficients, as the simulated sketch below illustrates.
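Here is a minimal simulated sketch of that point (the simulation and variable names are illustrative, not from the slides): y is perfectly determined by x through a curve, yet the correlation coefficient is essentially zero.

. clear
. set obs 101
. generate x = _n - 51         // x runs from -50 to 50
. generate y = x^2             // a perfect, but curvilinear, relationship
. corr y x                     // r is essentially zero
. scatter y x || qfit y x      // yet the scatterplot shows a strong curved pattern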
The relationship between two quantitative variables, as charted on a scatterplot, can be summarized by:
The mean & standard deviation of the x-values
The mean & standard deviation of the y-values
The correlation coefficient (r)

In a scatterplot, the mean of x establishes the center-point of the x-values, & the standard deviation of x establishes their spread. The mean of y establishes the center-point of the y-values, & the standard deviation of y establishes their spread. The correlation coefficient (r) measures the degree to which the x & y observations cluster around a straight line. Correlation near 1 or –1 means tight clustering around a straight line in a positive or negative direction: a strong positive or negative linear relationship. Correlation near 0 means loose clustering around a straight line: a weak linear relationship.

True or false, & explain: if the correlation coefficient is 0.90, then 90% of the data points are highly correlated. (See Freedman et al., Statistics.) Answer: false. The correlation coefficient indicates the direction & degree of clustering of two quantitative variables around a straight line.

How to compute a correlation coefficient (r): convert each x-value & each y-value to a standard value (i.e. z-score), e.g. in Stata:
. egen zx = std(x)
. egen zy = std(y)
Then multiply the standard values (i.e. z-scores) of each x & y pair, sum the products of all the multiplied pairs, & divide the sum by n – 1. That is:
Standardize each x-observation & each y-observation, i.e. compute z-scores for each value of x & each value of y.
Multiply each z(x) by its paired z(y).
Sum the products of the multiplied pairs of z-scores.
Divide the sum by n – 1.

Correlation coefficient:

r = (1/(n – 1)) Σ [(xi – x̄)/sx] · [(yi – ȳ)/sy]

Here's how to compute it (here x̄ = 4, ȳ = 7, and the sample standard deviations are sx ≈ 2.24 & sy ≈ 4.47):

 x    y    z(x)    z(y)   z(x)*z(y)
 1    5   -1.34   -0.45      0.60
 3    9   -0.45    0.45     -0.20
 4    7    0.00    0.00      0.00
 5    1    0.45   -1.34     -0.60
 7   13    1.34    1.34      1.80

r = (0.60 – 0.20 + 0.00 – 0.60 + 1.80)/(5 – 1) = 0.40

In the preceding problem:
Would changing the order of the observations change the correlation?
Would flip-flopping the x & y variables change the correlation?
Would adding 3 to each observation change the correlation?
Would multiplying each observation by 4 change the correlation?
Answers: no to all. See Freedman et al., Statistics.

What if the standard deviation of x or y or both is 0? Answer: then, by virtue of the formula, the correlation coefficient can't be computed.

Correlation coefficient values range from –1.0 to 1.0. Changing the order of the x/y observations does not change the correlation. Adding the same number to each observation, or multiplying each observation by the same positive number, does not change the correlation.

Features of the correlation coefficient:
It measures a linear relationship between two quantitative variables.
It describes association, not causal order: interchanging the two variables does not change the result.
It is expressed in standardized units of measurement.

Cautions about correlation: the correlation coefficient, as we've seen, is essentially the average of the products of the standardized values of the two quantitative variables. Therefore it is highly sensitive to pronounced skewness & extreme values. Always do the following graphs/plots before computing a correlation coefficient: graph each variable (e.g., boxplot or stemplot) to check for possible extreme values; the univariate analysis will alert you to possible problems in the bivariate scatterplot. Then do a scatterplot to check the bivariate relationship for possible non-linearity & pronounced outliers. It's the scatterplot that really matters.
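Here is a minimal Stata sketch of the same computation, entering the five pairs above by hand and checking the z-score method against Stata's built-in correlate command:

. clear
. input x y
  1 5
  3 9
  4 7
  5 1
  7 13
  end
. egen zx = std(x)              // standardize x
. egen zy = std(y)              // standardize y
. generate zxzy = zx*zy         // product of the paired z-scores
. quietly summarize zxzy
. display r(sum)/(r(N) - 1)     // sum of products divided by n - 1: r = 0.40
. corr x y                      // Stata's correlation, for comparison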
If the scatterplot detects substantial non-linearity, then it is not appropriate to compute a correlation coefficient. If the scatterplot detects pronounced outliers, then don't compute a correlation coefficient (unless you delete or, via transformation, temper the outliers), or possibly use a 'non-parametric' (distribution-free) alternative such as 'spearman y x.' Another possibility: controlling for a lurking variable (which would display a separate scatterplot for each level, such as a subcategory, of the lurking variable) might result in linear relationships within each separate scatterplot & reduce or eliminate the prevalence/magnitude of outliers. What does 'control for' mean?

Question: if graphs reveal skewness &/or outliers in the distribution of one or both variables, does this necessarily mean that such problems will also occur in the bivariate scatterplot? Answer: not necessarily. Why not? Because the independent characteristics of each variable by themselves do not fully determine the form, direction, & strength of the bivariate scatterplot relationship. Think of it this way: one nice person plus one nice person does not necessarily equal a nice relationship; one bad person plus one bad person does not necessarily equal a bad relationship; one nice person plus one bad person does not necessarily equal a nice or bad relationship. Moral of the story? Putting two (or more) variables together often yields relationships that are surprising in view of the independent characteristics of each individual variable: issues of aggregation. In the case of correlation: use a univariate graph (e.g., normal quantile plot or histogram) to alert you to possible problems of nonlinearity &/or extreme values, but it is the bivariate scatterplot that provides the definitive evidence.

Here are some more things to worry about when computing & interpreting correlation coefficients:
Always ask how the variables are defined & measured (i.e. operationalized), who the data represent, & how the data were collected (see next chapter). Are these adequate?
Check the sample size: a small sample may make it hard to detect an association because there may not be enough observations to reveal a pattern. So there may be a correlation within a population, but the sample size may be too small to reveal it.
Check the scatterplot for outliers.
Beware of curvilinear clustering (i.e. a curvilinear x/y relationship). In such cases there may be a strong relationship between the two variables, but because it's curvilinear the correlation coefficient will be relatively weak.
When interpreting a correlation coefficient, beware of lurking (i.e. unmeasured confounding) variables.
Beware of correlations based on restricted-range data, i.e. using just part of the range of values. This usually causes attenuation: a reduced correlation coefficient. E.g., the correlation between SAT scores & grades will be lower at an elite academic university with only a narrow range of high-end SATs than at a less selective university with a wide range of SATs: the elite university's narrow range of SATs is associated with a wide range of grades, whereas the less selective university's wider range of SATs is also associated with a wide range of grades.

Here's an example:

[Figure: . scatter read math || qfit read math, with . corr read math = .66]

[Figure: . scatter read math if math<50 || qfit read math]
. corr read math if read<50 = .38
Note the decreased correlation.

Beware of ecological correlations: correlations based on averaged data, which are common in the social sciences (e.g., the correlation between the GPA of individual students nationwide & the average standardized educational assessment test score per state). Using averaged data typically inflates correlation coefficients by reducing scatter among the values.

Finally, a correlation coefficient only partially describes a relationship between two quantitative variables. Thus, always accompany a correlation coefficient with the means & standard deviations of the two variables (as well as, perhaps, a measure of skewness). And beforehand, always graph the bivariate relationship.

How to do it in Stata, univariate & bivariate analysis:
. kdensity read, norm
. gr box read
. kdensity write, norm
. gr box write
. summarize read write, detail
. scatter read write || qfit read write

Compute the correlation coefficient:
. corr read math
(obs=200)
             |     read     math
-------------+------------------
        read |   1.0000
        math |   0.6623   1.0000

By a categorical variable (in order to control for its influence):
. scatter read math || qfit read math, by(female)
. bys female: corr read math
Note: make sure there are enough observations in each category to detect a possible association (see chapter 3).

Let's now consider a measure that's related to correlation: simple linear regression. Examples: What does knowing a person's years of schooling (x) enable us to say about the person's earnings (y)? What does knowing the amount of dietary fat consumed (x) enable us to say about the rate of heart disease (y)?

Unlike correlation, regression involves a response variable (y) & an explanatory variable (x). But the y/x relationship does not necessarily imply a causal relationship. Always ask: What is the conceptual reason for the y/x order? What if the y/x order were reversed?

Simple linear regression describes how the values of a response variable depend on the values of an explanatory variable. On average, how does earnings level (y) change for every unit of increase in years of education (x)? On average, how does the rate of heart disease (y) change for every unit of increase in dietary fat consumed (x)?

Why is correlation of limited use in shedding light on such questions? Why is regression more useful in this regard? Unlike correlation, regression enables us to gauge how much the values of a response variable (y) change, on average, with increases in an explanatory variable (x). And unlike correlation, regression references the relationship to the units of the model's variables: e.g., for every added year of education (x), earnings (y) increase by $1,023, on average.

The most basic differences between correlation & regression:
Correlation measures the degree of bivariate clustering along a straight line: the strength of a linear relationship. It implies nothing about causal order. It is measured in standardized units.
Regression measures the degree of slope in the linear relationship between an outcome variable (y) and an explanatory variable (x): the average rate of change in y for every unit change in x. It is measured in the units of the model's variables: with every unit increase in x, the value of y changes by … units, on average.
Be careful about implied causal relationships! To repeat, always ask: What is the conceptual reason for the y/x order? What if the y/x order were reversed?
. reg wage educ

      Source |       SS       df       MS              Number of obs =     526
-------------+------------------------------           F(  1,   524) =  103.36
       Model |  1179.73204     1  1179.73204           Prob > F      =  0.0000
    Residual |  5980.68225   524  11.4135158           R-squared     =  0.1648
-------------+------------------------------           Adj R-squared =  0.1632
       Total |  7160.41429   525  13.6388844           Root MSE      =  3.3784

------------------------------------------------------------------------------
        wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .5413593    .053248    10.17   0.000     .4367534    .6459651
       _cons |  -.9048516   .6849678    -1.32   0.187    -2.250472    .4407687
------------------------------------------------------------------------------

Always first check the number of observations: is it correct? For each unit (i.e. year) increase in education, hourly wage increases by 0.54 dollars, on average.

Regression line: a straight line that describes how changes in a response variable (y) are associated with changes in an explanatory variable (x), on average. It describes the average rate of change in y for every unit change in x. Regression measures a linear association: non-linearity creates misleading results. Regression involves fitting a line to data: drawing a straight line that comes as close as possible to the data points.

Regression line: y = a + bx
y: response (or outcome or dependent) variable
a: intercept (the value of y when x = 0)
b: slope (the rate of change in y associated with every unit increase in x)
x: explanatory (or independent or predictor or right-hand-side) variable

The equation fits a straight line to the data. It's called the least-squares line because it minimizes the (squared) distances, i.e. the residuals, between the equation's y-predictions & the data's y-observations. The better the model fits the data, the smaller the distances between the y-predictions & the y-observations (that is, the smaller the residuals). The y-predictions are typically called 'yhat.'

The y-intercept (a) is usually meaningless in substantive terms: it is the value of the dependent variable when the independent variable equals 0. E.g., your GRE score if your IQ = 0! The y-intercept is included because it is mathematically necessary for the regression equation; whether it's substantively meaningful or not depends on the sample.

Simple linear regression: y = a + bx. How do we interpret a regression equation? The regression line for y on x estimates the average value of y associated with each value of x: for every unit increase in x, y increases/decreases by ….., on average. Keep in mind that regression measures a linear association; non-linearity creates misleading results in regression, just as it does in correlation.

The least-squares line of y on x makes the sum of the squares of the vertical distances of the data points from the line (i.e. the squared residuals) as small as possible. It does so via the following formulas:

y = a + bx
b = r·(sy/sx)
a = ȳ – b·x̄

What, then, is the formula for the slope (b), the rate of change in y for every unit increase in x, on average? It is the correlation of x & y, times the sd of y, divided by the sd of x: b = r·(sy/sx). So, for every sd increase in x, there is a change in y of r(x,y) times the sd of y. Unlike correlation, in regression the slope coefficient (b) is expressed in the units of the relationship of y to x.
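As a check on those formulas, here is a minimal Stata sketch (assuming the hsb2 data used elsewhere in these slides) that builds the slope & intercept for read on math from the correlation, means, & standard deviations, then compares them with the output of regress:

. use hsb2, clear
. quietly corr read math
. scalar rxy = r(rho)               // correlation of read & math
. quietly summarize math
. scalar xbar = r(mean)
. scalar sx = r(sd)
. quietly summarize read
. scalar ybar = r(mean)
. scalar sy = r(sd)
. scalar bslope = rxy*sy/sx         // slope: b = r*(sy/sx)
. scalar acons = ybar - bslope*xbar // intercept: a = ybar - b*xbar
. display "slope b = " bslope "   intercept a = " acons
. regress read math                 // the same slope & intercept, from least squares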
This makes it easier to interpret the substantive meaning of a slope coefficient than of a correlation coefficient. E.g., for every hour increase in study time, SAT score increases by 23 points, on average.

Simple linear regression: y = a + bx, with b = r·(sy/sx) and a = ȳ – b·x̄.

Regression computation example. Let's compute a regression equation to predict reading scores (y) from math scores (x), based on a sample of 200 California students (hsb2.dta):
read (y): mean = 52.23, sd = 10.25
math (x): mean = 52.65, sd = 9.37
r = 0.617
Compute the regression equation:
slope (b) = (0.617 × 10.25)/9.37 = 0.675
y-intercept (a) = 52.23 – (0.675 × 52.65) = 16.69
So: predicted read = 16.69 + 0.675·math
Let's now predict reading scores for two x-values, math = 35 & then math = 65:
predicted y = 16.69 + 0.675 × 35 = 40.3
predicted y = 16.69 + 0.675 × 65 = 60.6

Beware: software will accept any y/x ordering of variables, even if it makes no substantive sense. Always question the hypothesized y/x order: e.g., IQ & family earnings: should family earnings be x or y? Should IQ be x or y? The y/x order of variables depends on the conceptualization of the particular research question: e.g., you may want to use GPA to predict standardized test score, or you may want to use standardized test score to predict GPA.

When we seek to explore causal relations, ask: can we really establish that x precedes y in time? A temporal sequence is not always clear. Typically we're using cross-sectional, not longitudinal, data (& even longitudinal data don't always clarify matters). As McClendon says (Multiple Regression and Causal Analysis, p. 5): "… it is often impossible to know whether Y achieved its observed level before or after X reached its observed level." McClendon (p. 7) goes on to say that "good theoretical arguments are often accepted in this regard, although the inference will certainly be more uncertain than if the temporal sequence could be empirically established." He adds that, even where the temporal sequence is not clear-cut, X/Y regression analyses "may cast doubt on existing theoretical formulations by failing to find any relationship between X and Y" (p. 7).

How to grasp that the slope (b) implies that y responds to changes in x? Compute the regression of y on x, record the slope coefficient, & plot the results. Do the same, but this time regressing x on y; again record the slope coefficient & plot the results. Comparing the first & second equations, how do the slope coefficients & plots differ? How does this feature of the slope coefficient differ from the correlation coefficient?

Do the y/x flip-flop for the read/math regression equation (see the sketch below). What are the results? What do they tell us about the regression coefficient versus the correlation coefficient? The results indicate that there are two regression lines: one for y's dependence on x & the other for x's dependence on y. In contrast, correlation gives just one result: it's the same whether we take y with x or x with y. Keep in mind that, in order to make sense, the y/x order in regression analysis must be based on substantive & theoretical logic. So be careful: the regression equation will accept the variables in any order, even if the order (or the variables themselves) makes no sense.

We'll talk some more about issues of causality, in view of the introductory discussion in Moore/McCabe/Craig. Another matter: beware of trying to make predictions or interpretations beyond the range of the sample's values. That is, beware of extrapolation.
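Here is a minimal Stata sketch of that flip-flop exercise, assuming the hsb2 data: the two regressions give two different slopes (and two different lines), while the correlation is unchanged by the ordering.

. use hsb2, clear
. regress read math                       // slope of read on math
. regress math read                       // slope of math on read: a different line
. corr read math                          // the correlation is the same either way
. scatter read math || lfit read math     // the line for read on math
. scatter math read || lfit math read     // the line for math on read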
There are two main ways of assessing a regression model:
Slope: the (b) coefficient, the rate of change in y for every unit change in x, on average. The slope coefficient, i.e. the regression line, is typically what we care most about.
Fit: r-squared, the degree of clustering around the regression line. While the slope coefficient (i.e. the regression line) may explain part of the relationship of y to x, there may be other sources of variation in y's values, & the slope coefficient does not say how large that additional variation is.

. use hsb2, clear
[Figure: . scatter read math || qfit read math — there's a clear linear relationship (i.e. slope), but there's scatter (i.e. variation) around it]
[Figure: . lowess read math — Lowess smoother, bandwidth = .8]

This means that: (1) the slope coefficient indeed describes a linear relationship of y on x; (2) but if we wanted to explain the entirety of the relationship of y on x, we'd have to examine additional explanatory variables (which, so far, are lurking variables). What might the additional explanatory variables be?

To repeat, there are two main ways of assessing a regression model:
Slope: the (b) coefficient, the rate of change in y for every unit change in x, on average.
Fit: r-squared, the degree of clustering around the regression line.

Let's discuss r-squared.
r² = the square of the correlation between y & x.
r² = the degree of clustering around the least-squares line.
r² = the fraction of the variation in the values of y that is explained by the least-squares regression of y on x.
r² = the variance of the predicted values of y divided by the variance of the observed values of y.

Apply the r² 'fit' procedure to the read/math data (r = 0.617): r × r = r², so 0.617 × 0.617 ≈ 0.38. What is the result concerning the degree of scatter around the least-squares line? Your conclusion?

Slope (b) vs. fit (r²):
Slope (b): the degree of change in y for every unit change in x, on average.
Fit (r²): the fraction of the variation in the values of y that is explained by the least-squares regression of y on x (i.e. the degree of clustering around the straight line).
There can be a high r-squared with a relatively flat slope, or a relatively steep slope with a low r-squared. Why?

In short: the regression (i.e. slope) coefficient measures the steepness of the least-squares line: the degree of change in y for every unit increase in x, on average. r² measures the fraction of the variation in the values of y explained by x: the degree of scatter around the least-squares line. Especially when we advance to multiple regression, we'll see that what generally matters most is the regression coefficient: the steepness (i.e. slope) of the linear relationship of y on x. It is the regression line (i.e. the slope coefficient) that measures the linear trend in how y changes in response to changes in x. With multiple regression, we'll see that merely adding more explanatory variables, whether or not they make conceptual sense, increases r-squared. Put differently, the slope coefficient (i.e. the regression line) is about theoretically oriented, generalizing analysis; r-squared is about historicist case-study analysis (i.e. accounting for as much of the variation as possible in a case study).

Watch out! The regression equation will yield results for a nonsensical or ambiguous y/x order. The slope coefficient measures the linear relationship of y on x.
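Here is a minimal Stata sketch (assuming the hsb2 data) showing the equivalences just listed: r-squared as the squared correlation, as the R-squared reported by regress, & as the variance of the predicted values of y divided by the variance of the observed values of y.

. use hsb2, clear
. quietly corr read math
. display "r^2 from the correlation = " r(rho)^2
. regress read math
. display "R-squared from regress   = " e(r2)
. predict yhat, xb                      // predicted values of read
. quietly summarize yhat
. scalar var_yhat = r(sd)^2
. quietly summarize read
. display "var(yhat)/var(read)      = " var_yhat/r(sd)^2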
Regression trouble-shooting: the slope coefficient (b) is highly susceptible to outliers. An outlier is 'influential' if removing it notably changes the regression coefficient. Before computing a regression equation, always graphically check the y variable & the x variable for pronounced skewness & outliers, to alert you to possible problems. Then do a bivariate scatterplot of y on x to check for non-linearity & possible outliers. It is the bivariate scatterplot that provides the definitive evidence for simple regression. If there are outliers in the bivariate scatterplot, compute the regression equation with & then without the outliers; compare the difference, & report it if it's notable.

. reg wage educ
(the same output as shown earlier: educ coefficient .5413593, _cons -.9048516, N = 526)

First check N (the number of observations): is it correct? For every year of education, hourly wage increases by 0.54 dollars, on average. But is this relationship linear?

[Figure: scatterplot of average hourly earnings vs. years of education, with fitted values]

OLS regression permits the use of categorical explanatory variables.

. tab female

     female |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        274       52.09       52.09
          1 |        252       47.91      100.00
------------+-----------------------------------
      Total |        526      100.00

. tab female, su(wage)

            | Summary of average hourly earnings
     female |        Mean   Std. Dev.       Freq.
------------+------------------------------------
          0 |         7.1         4.2         274
          1 |         4.6         2.5         252
------------+------------------------------------
      Total |         5.9         3.7         526

[Figure: . gr box wage, over(female, total) — box plots of wage for 0=male, 1=female & total]

. reg wage female

------------------------------------------------------------------------------
        wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |   -2.51183   .3034092    -8.28   0.000    -3.107878   -1.915782
       _cons |   7.099489   .2100082    33.81   0.000     6.686928     7.51205
------------------------------------------------------------------------------

Because 'wage' refers to hourly wage, this indicates that being female reduces hourly wage by $2.51, on average. Of course, the validity of wage's relationships to education & to female needs to be assessed by regression diagnostics.

Regression diagnostics: is the fit linear?
Residual: the difference between an observed value of the response variable & the value predicted by the regression line: residual = observed y – predicted y.
Residual plot: a scatterplot of the regression residuals against the explanatory variable. It helps assess the fit of a regression line to the data: is the fit linear, or not? If the regression model (i.e. equation) fits the data, the residual plot shows no pattern in the residuals. If the regression model doesn't fit the data, the residual plot shows a pattern, typically curvilinear or fan-shaped.
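Here is a minimal Stata sketch of such a residual plot, assuming the wage data used just above are in memory: the residuals are plotted against the explanatory variable, with a horizontal reference line at zero.

. reg wage educ
. predict res, residuals        // residual = observed wage - predicted wage
. scatter res educ, yline(0)    // look for curvature or a fan shape around the zero line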
[Figures: a scatterplot & residual-vs.-explanatory-variable diagnostic plot of a linear fit, & residual-vs.-explanatory-variable diagnostic plots of a nonlinear fit]

What if the fit is nonlinear?
Check for possible data errors (in measurement or in reporting/coding).
Consider transforming either y or x, or both: more on this near the end of the semester (a brief sketch follows below).
Consider reformulating the regression model, possibly incorporating other x-variables: multiple regression. We'll say more on this near the end of the semester.

Other questions: What happens to the regression coefficient if the standard deviation of either variable, or of both variables, is zero? What kinds of social research are doable, or not doable, with regression analysis?

How to do it in Stata:
. kdensity wage, norm
. gr box wage
. kdensity educ, norm
. gr box educ
. su wage educ, detail
. scatter wage educ || qfit wage educ

[Figures: kernel density estimates of average hourly earnings & of years of education with normal overlays, & the scatter of average hourly earnings vs. years of education with fitted values: serious problems of nonlinearity]

Given the pronounced nonlinearity, we should explore transforming the variables before estimating the equation, but for our didactic purposes we won't do so.

. reg wage educ
(the same output as shown earlier: educ coefficient .5413593, _cons -.9048516, N = 526)

. predict yhat
. hist yhat, norm
. predict resid, resid
. hist resid, norm
. list yhat wage resid

     |     yhat   wage       resid |
     |-----------------------------|
  1. |   5.0501    3.1     -1.9501 |
  2. | 5.591459    3.2    -2.35146 |
  3. |   5.0501      3     -2.0501 |
  4. | 3.426023      6    2.573977 |
  5. | 5.591459    5.3   -.2914593 |
     |-----------------------------|
  6. | 7.756896    8.8    .9931036 |
  7. | 8.839615     11    2.410385 |
  8. | 5.591459      5   -.5914595 |

[Figure: . rvfplot, yline(0) — residuals vs. fitted values: a very nonlinear fit, not surprising given the preliminary graphic evidence]
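One possible remedy, offered here only as a minimal sketch and not as the slides' prescribed fix, is to transform the response variable, e.g. with a natural log, and then re-check the residual plot:

. generate lwage = ln(wage)     // log-transform the response (an illustrative choice)
. reg lwage educ
. rvfplot, yline(0)             // re-examine the residual pattern after transforming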
[Figure: . rvfplot, yline(0) ml(id) — residuals vs. fitted values, with each point labeled by its id]

Summary: What are the main differences between correlation & regression? What assumptions, measures, problems, & procedures are common to computing & interpreting a correlation coefficient (r) & a regression slope coefficient (b)?

Cautions about correlation & regression:
Nonlinearity causes misleading results.
A lurking (i.e. unmeasured confounding) variable is one that isn't among the explanatory or response variables in a study yet may influence the interpretation of the relationships among the study's variables.
An outlier is an observation that lies outside the overall pattern of the other observations. An outlier is influential if removing it would markedly change the result of the calculation. Points that are outliers in the x-direction of a scatterplot are often influential for the least-squares regression line.
Be careful about correlations (or regression equations) based on different sets of individuals (i.e. different sets of observations or subjects).
Beware of correlations based on averaged data (ecological correlations).
Beware of correlations based on restricted-range data (the common resulting problem being attenuation).
Beware: association does not imply causation. For any implied causal relationship, always ask: What is the conceptual basis for the relationship? What if y & x were reversed?
Beware: a regression equation will accept & report results for questionable y/x relationships.
Correlation & regression: what are the computational building blocks? The individual observation-values (the xi's & yi's); the means (x̄ & ȳ); the standard deviations (sx & sy); & the standard values (the z(xi)'s & z(yi)'s). All of the above go into computing a correlation coefficient. All of the above, plus the correlation coefficient, go into computing the regression coefficient.

Data analysis for two-way tables: like correlation & regression, two-way tables ('contingency tables' or 'crosstabulations') require that both variables be measured on the same individuals or cases. But two-way tables use categorical variables, which summarize counts of observations. How does the question of causal order enter into all of this?

Causation. Association does not necessarily signify causation. Three basic forms of causation:
(1) Direct causation (x causes y): more knowledge (x) causes higher test scores (y).
(2) Common response of x & y to a lurking variable z: more knowledge (x) is associated with higher test scores (y); higher test scores (y) are associated with higher SES (z).
(3) Confounding (x causes y & z causes y, while x & z are associated with each other): the effects of two explanatory or lurking variables on a response variable are mixed together, so that we can't (easily) disentangle their effects on the response variable. Attending church (x) causes longer life (y); but good health habits (z) are associated with attending church (x) & with a longer life (y). So good health habits are confounded with attending church. By the way, what if y were re-conceptualized as the explanatory variable?

Note: there's no hard & fast distinction between common response & confounding; the distinction is not always clear. What matters is that "even a strong association between two variables is not by itself good evidence that there's a cause & effect link between the variables" (Moore/McCabe/Craig). See King et al., Designing Social Inquiry, pages 75-114.

How to (more or less) establish a causal relation between x & y?
(1) The association between x & y is strong.
(2) The association between x & y is consistent across different settings.
(3) Changes in one variable are consistently associated with changes in the other variable.
(4) x precedes y in time.
(5) The causal relationship is plausible.
(6) Lurking variables have been controlled for (see #2).
Beware: conclusions are always uncertain.

Review
What are the most basic issues of theory, methods & statistics? What is statistics? What are data? What is exploratory data analysis? How do we analyze a data set from the perspective of statistics?
What is a variable? What are the most basic kinds of variables? How do we analyze them graphically & numerically? What are the basic numerical measures? How are they computed? What problems are associated with them? How should we address these problems?
What are linear transformations? What are the basic kinds? How do they affect variables?
What are density curves? What kinds are there, & why are they important? How do the median & the mean pertain to the various kinds of density curves?
What are normal distributions, & what are their basic features? Why are normal distributions important? How do the median, mean & standard deviation describe a normal distribution? What is the 68-95-99.7 rule, & why is it important?
What is a standard normal distribution? What is standardization? Why is it important? How does standardization pertain to the normal distribution? How are standard values computed? How do we recapture an original x-value from its standardized value?
What's a correlation? What kind of variables does it assess? What are a correlation coefficient's characteristics? What does a correlation have to do with causation? How is a correlation computed? What does the computation have to do with the mean, standard deviation & standardization? What problems are associated with a correlation coefficient? How do we examine such problems? What kind of remedial action can be taken?
What is an association between variables? What's a response variable? What's an explanatory variable?
What's a scatterplot? How do we assess its pattern? How do we use graphs & a scatterplot as a combined strategy to examine univariate & bivariate distributions? What's a positive association? What's a negative association? How do we examine the relationship between a quantitative variable & a categorical variable?
What's regression? What's a regression line? What's the form of a regression equation? What does it measure? What does each component of the equation measure? How is each component computed?
What's the difference between correlation & regression? How can the difference be demonstrated? What is the connection between a correlation coefficient (r) & a regression slope coefficient (b)? How are both of them connected, in turn, to the mean & standard deviation?
What problems are associated with regression? What remedial actions can be taken?
What is association? What is a negative association? What is a positive association? What is causation? What's the difference between association & causation? What are the basic kinds of causation? How do we (more or less) establish causation? What are lurking variables?
What are the ramifications of what we've considered so far for the social construction of reality & the study of social relations/public policies? How can we summarize all of this in terms of the 'six fundamental issues of statistics in theoretical perspective'? What kinds of social research are doable or not doable with correlation & regression?