Association between variables
 Two variables are associated if knowing the
values of one variable helps a lot in predicting the
values of the other variable.
 If there’s a weak association, then knowing the
values of one variable doesn’t help much in
predicting the values of the other.
 Associations are tendencies,
not ironclad rules.
 Association does not
necessarily mean causation.
 What might be the association
between IQ test scores & family
income?
 What would such an association
signify concerning causation?
 Caution: be careful if the variables are
measured on different sets of observations.
 E.g., association of human body-weight or
blood pressure with age—but what if the data
are cross-sectional, not longitudinal?
 That is, what are the pitfalls of cross-sectional
data on ostensibly longitudinal trends?
See Freedman et al., Statistics, pages 58-61.
Key questions about variables:
 How are the variables defined & measured (i.e.
operationalized)? Are these theoretically & empirically
adequate?
 Are the variables categorical (nominal, ordinal) or
quantitative (interval, ratio)?
 Are there response (i.e. outcome, dependent) variables &
explanatory (i.e. independent, predictor) variables? Do
these make sense? What if they were reversed?
 Who do the data represent?
 How were the data collected? Was this adequate?
Scatterplot: shows the relationship
between the values of two quantitative
variables.
 Look for the overall pattern & striking
deviations, including outliers.
 Describe the overall pattern by its
form, direction & strength.
 Look for outliers.
“Outlier” Defined
W.N. Venables and B.D. Ripley, Modern
Applied Statistics with S (119).
"Outliers are sample values that cause surprise in
relation to the majority of the sample."
As commentator Austin Nichols wrote on the Stata listserv (February 21, 2008), this definition
implies that “such surprise is a function of the
model contemplated and the subject-matter
knowledge of the researcher, and not an inbuilt
characteristic of the data.”
. scatter read math || lfit read math
[Scatterplot of reading score vs. math score with a linear fit line; both axes range from 30 to 80.]
Commonly—not always, however—the
following two kinds of variables are
inspected in a scatterplot:
 Explanatory (or independent or
predictor) variable: predicts, explains, or
perhaps causes changes in the response
variable.
 Outcome (response or dependent)
variable
Scatterplot’s overall pattern &
striking deviations:
 Form: degree of linear or curvilinear
association.
 Direction: degree of positive or negative
association.
 Strength: degree of adherence to a clear
form.
 Outliers: numbers & distance from
overall pattern.
When the observations on a scatterplot
are tightly clustered around a diagonal
line, there is a strong linear association
between the two variables:
 Positive association
 Negative association
 Neither association necessarily signifies
causation.
. scatter read math || qfit read math
[Scatterplot of reading score vs. math score with a quadratic fit line.]
 Positive or negative linear association?
 Here’s a curvilinear relationship. Meaning?
 When interpreting a
scatterplot, beware of lurking
(i.e. unmeasured
confounding) variables.
 Examples?
 E.g., Utts & Heckard (Statistical Ideas and Methods) cite
the example of a surprisingly negative relationship: the
more pages in a book, the cheaper the cost of the book on
average.
 The relationship changed directions when the lurking
variable was taken into account (i.e. controlled).
 The lurking variable?
 To anticipate later discussion, how would you graph it?
 The lesson learned?
 A lurking (i.e. unmeasured confounding)
variable (1) affects the response variable
and (2) is associated with the explanatory
variable.
 That is, the effect of a lurking variable on
a response variable is mixed up with the
effect of the explanatory variable.
 Always consider the potential effects
of lurking variables!
Scatterplot’s overall pattern & striking
deviations:
 Form: degree of linear or curvilinear
association.
 Direction: degree of positive or negative
association.
 Strength: degree of adherence to a clear
form.
 Outliers: numbers & distance from overall
pattern.
. scatter read math || qfit read math
[Scatterplot of reading score vs. math score with a quadratic fit line.]
. scatter read math, ml(id) || qfit read math
[The same scatterplot with each observation labeled by its id number, plus the quadratic fit line.]
How to do a scatterplot in Stata
. use hsb2, clear
. kdensity read, norm
. gr box read
. kdensity math, norm
. gr box math
. summarize read math, detail
. scatter read math
. scatter read math || lfit read math
  (or: scatter read math || qfit read math)
. scatter read math, ml(id) || qfit read math
 lfit: ‘linear fit.’ qfit: ‘quadratic fit,’ which permits graphing of possible curvilinear relationships.
 To explore more bivariate complexity, use lowess (a locally weighted scatterplot smoother):
. lowess read math
[Lowess smoother of reading score on math score; bandwidth = .8; both axes range from 30 to 80.]
How to eliminate specified observations in a
scatterplot?
 If there’s no id-variable, create one:
. generate id = _n
. list id
 Display the scatter plot eliminating
specified observations:
. scatter read math if id~=19 & id~=167, ml(id) || qfit read math
. kdensity read, norm
[Kernel density plot of reading score with a normal density overlay.]
. gr box read
[Box plot of reading score.]
. kdensity math, norm
[Kernel density plot of math score with a normal density overlay; kernel = epanechnikov, bandwidth = 2.92.]
. gr box math
[Box plot of math score.]
. scatter read math || qfit read math
[Scatterplot of reading score vs. math score with a quadratic fit line.]
. scatter read math, ml(id) || qfit read math
[The same scatterplot with each observation labeled by its id number, plus the quadratic fit line.]
. scatter read write if id~=32 & id~=92, ml(id) || qfit read write
[Labeled scatterplot with a quadratic fit line, excluding observations 32 & 92.]
. lowess read math if id~=32 & id~=92
[Lowess smoother excluding observations 32 & 92; bandwidth = .8.]
 Here’s how to examine a quantitative
bivariate scatter plot in terms of a
categorical variable:
. scatter read math, mlabel(id)
. scatter read math||qfit read math,
ml(id)
. scatter read math, ml(female)
. scatter read math, ml(race)
. scatter read math, ml(prog)
. scatter read math, by(female)
. scatter read math, by(race)
. scatter read math, ml(id)
[Scatterplot of reading score vs. math score with each observation labeled by its id number.]
. scatter read math, ml(female)
[Scatterplot of reading score vs. math score with each observation labeled as female or male.]
. scatter read math, by(female)
[Side-by-side scatterplots of reading score vs. math score for females & males; graphs by female=1, male=0.]
 Categorical explanatory variable:
How to examine its relationship with
a quantitative variable?
 Use a box plot or a stem plot to
graph the quantitative variable by a
categorical variable.
. graph box science, over(female, total)
. bys female: stem science
. gr box math, over(female, total)
[Box plots of math score for males, females, & the total sample.]
Here, again, consider the potential
effects of lurking (i.e. unmeasured
confounding) variables.

 We’ll next examine associations from the
standpoints of correlation & regression.
 Both correlation & regression are
computed via means & standard
deviations.
 Consequently, both of these statistics are
highly sensitive to pronounced skewness &
extreme observations.
Correlation: measures the direction &
strength of the linear relationship
between two quantitative variables.
 Linear relationship.
 Two quantitative variables.
 Does not describe causal
relationships.
 Beware of lurking variables.
 Correlation measures the strength and
direction of a linear (i.e. straight-line)
relationship between two quantitative variables.
 That is, correlation measures the degree to
which the bivariate observations cluster
along a straight line, and the positive or
negative direction of the relationship.
 This is demonstrated in the next slide’s
scatterplots.
 A correlation is stronger to the degree that the
bivariate data cluster along the straight line, and
weaker to the degree that they do not.
 The direction of the relationships may be
positive or negative.
 Later in the course we’ll review measures
of correlation & other forms of association
involving categorical variables.
 On that topic, see the class’s slides for
chapter 10.
 If the bivariate scatterplot of two quantitative
variables displays a tight, pronounced curvilinear
cluster, will the correlation coefficient be relatively
strong or weak?
 It will be relatively weak, because a
correlation coefficient measures a linear
relationship between two quantitative variables.
 Even pronounced curvilinear bivariate
relationships yield weak correlation coefficients.
The relationship between two
quantitative variables—as
charted on a scatter plot—can
be summarized by:
 The mean & standard deviation of the x-values
 The mean & standard deviation of the y-values
 The correlation coefficient (r)
In a scatter plot:
 The mean of x establishes the center-point of the x-values, & the standard deviation of x establishes their spread.
 The mean of y establishes the center-point of the y-values, & the standard deviation of y establishes their spread.
 The correlation coefficient (r) measures the
degree to which the x & y observations cluster
around a straight line.
Correlation near 1 or –1 means
tight clustering around a
straight line in a positive or
negative direction: a strong
positive or negative linear
relationship.

 Correlation near 0 means loose
clustering around a straight line:
a weak linear relationship.
True or false, & explain: if the
correlation coefficient is 0.90, then
90% of the data points are highly
correlated.

See Freedman et al., Statistics.
Answer
 False. The correlation coefficient
indicates the direction & degree of
cluster between two quantitative
variables around a straight line.
How to compute a correlation coefficient (r):
 Convert each x-value & each y-value to a standard value (i.e. z-score):
. egen zx = std(x)
. egen zy = std(y)
 Multiply the standard values (i.e. z-scores) of each x & y pair, sum the products of all the multiplied pairs, then divide the sum by n - 1.
That is:
 Standardize each x-observation & each
y-observation, i.e. compute z-scores for
each value of x & each value of y.
 Multiply each z(x) by each z(y).
 Sum the products of the multiplied pairs
of z-scores.
 Divide the sum by n – 1.
 Correlation coefficient:
r = [1/(n − 1)] Σ [(xi − x̄)/sx][(yi − ȳ)/sy]
 Here’s how to compute it:
  x    y    z(x)   z(y)   z(x)*z(y)
  1    5   -1.5   -0.5      0.75
  3    9   -0.5    0.5     -0.25
  4    7    0.0    0.0      0.00
  5    1    0.5   -1.5     -0.75
  7   13    1.5    1.5      2.25
r = (0.75 – 0.25 + 0.00 – 0.75 + 2.25)/(5-1)
= 0.50
In the preceding problem:
 Would changing the order of the observations
change the correlation?
 Would flip-flopping the x & y variables change
the correlation?
 Would adding 3 to each observation change the
correlation?
 Would multiplying each observation by 4
change the correlation?
Answers
 No to all.
See Freedman et al., Statistics.
What if the standard
deviation of x or y or both
is 0?

Answer
 Then, by virtue of the
formula, the correlation
coefficient can’t be
computed.

Correlation coefficient:
r = [1/(n − 1)] Σ [(xi − x̄)/sx][(yi − ȳ)/sy]
 Correlation coefficient values range from –1.0 to 1.0.
 Changing the order of the x/y observations does not change the correlation.
 Adding the same number to each observation, or multiplying each observation by the same positive number, does not change the correlation (see the sketch below).
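 A minimal Stata sketch of that invariance (x & y are placeholder variable names):
. corr x y
. generate x2 = x + 3
. generate y2 = 4*y
. corr x2 y2    // same correlation coefficient as -corr x y-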
Features of the correlation
coefficient:
 Linear relationship between two
quantitative variables
 Describes association, not causal
order: interchanging the two variables
does not change the relationship
 Standardized units of measurement
Cautions about Correlation
 The correlation coefficient, as we’ve
seen, is the average of the product of the
standardized values of the two
quantitative variables.
 Therefore it is highly sensitive to
pronounced skewness & extreme
values.
Always do the following graphs/plots before
computing a correlation coefficient:
 Graph each variable (e.g., boxplot or stemplot)
to check for possible extreme values. The
univariate analysis will alert you to possible
problems in the bivariate scatterplot.
 Then do a scatterplot to check the bivariate
relationship for possible non-linearity &
pronounced outliers.
 It’s the scatterplot that really matters.
 If the scatterplot detects
substantial non-linearity, then it is
not appropriate to compute a
correlation coefficient.
 If the scatterplot detects
pronounced outliers, then don’t
compute a correlation coefficient
(unless you delete or, via
transformation, temper the outliers).
 Or possibly use ‘non-parametric’
(distribution-free) alternatives such as
‘spearman y x.’
 Another possibility: controlling for a lurking
variable—which would display a separate
scatterplot for each level (such as subcategory) of
the lurking variable—might result in linear
relationships within each separate scatterplot &
reduce or eliminate the prevalence/magnitude of outliers (see the Stata sketch after this list).
 What does ‘control for’ mean?
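 A minimal Stata sketch of such stratified inspection (y, x & z are placeholder names for the response, explanatory & lurking variables):
. scatter y x, by(z)        // a separate scatterplot for each level of z
. bysort z: corr y x        // the correlation within each level of z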
Question
If graphs reveal skewness &/or
outliers in the distribution of one or
both variables, does this necessarily
mean that such problems will also
occur in the bivariate scatterplot?

Answer

Not necessarily.
 Why not? Because the
independent characteristics of each
variable by themselves do not fully
determine the form, direction, &
strength of the bivariate scatterplot
relationship.
Think of it this way:
 One nice person plus one nice
person does not necessarily equal a
nice relationship.
 One bad person plus one bad person
does not necessarily equal a bad
relationship.
 One nice person plus one bad
person does not necessarily equal a
nice or bad relationship.
Moral of the story?
Putting two (or more) variables
together often yields relationships
that are surprising in view of the
independent characteristics of each
individual variable: issues of
aggregation.

In the case of correlation: use a
univariate graph (e.g., normal
quantile plot or histogram) to alert
you to possible problems of nonlinearity &/or extreme values, but
it is the bivariate scatterplot
that provides the definitive
evidence.

Here are some more things to
worry about with regard to
computing & interpreting
correlation coefficients.

 Always ask how the variables are
defined & measured (i.e.
operationalized), who the data
represent, & how the data were
collected (see next chapter).
 Are these adequate?
 In addition, check the sample size:
small sample size may make it hard to
detect an association because there may
not be enough observations to reveal a
pattern.
 So, there may be a correlation within a
population, but the sample size may be
too small to reveal it.
Check the scatterplot for
outliers.

 Beware of curvilinear clustering (i.e. a
curvilinear x/y relationship).
 In such cases there may be a strong
relationship between the two variables, but
because it’s a curvilinear relationship
the correlation coefficient will be
relatively weak.
 When interpreting a correlation
coefficient, beware of lurking
(i.e. unmeasured
confounding) variables.
 Beware of correlations based on restricted
range data—using just part of the range of values.
 This usually causes attenuation: reduced
correlation coefficient.
 E.g., the correlation between SAT scores and
grades will be lower in an elite academic university
with only a narrow range of high-end SATs than at a
less selective university with a wide range of SATs.
 The elite university’s narrow range of SATs is associated
with a wide range of grades; the less selective university’s
wider range of SATs is associated with a wide range of
grades.
 Here’s an example:
[Scatterplot of reading score vs. math score with a fitted line, over the full range of scores.]
. corr read math = .66
. scatter read math if math<50 || qfit read math
[The same scatterplot restricted to the lower end of the score range.]
. corr read math if read<50 = .38
 Note the decreased correlation.
 Beware of ecological correlations: correlations
based on averaged data (which are common in the
social sciences: e.g., correlation between GPA of
individual students nationwide and average
standardized educational assessment test
score per state).
 Using averaged data typically inflates correlation
coefficients by reducing scatter among the values.
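 A hypothetical Stata sketch of how averaging produces an ecological correlation (gpa, testscore & state are placeholder variable names, not course data):
. corr gpa testscore                          // individual-level correlation
. preserve
. collapse (mean) gpa testscore, by(state)    // reduce the data to state averages
. corr gpa testscore                          // group-level correlation, typically larger in magnitude
. restore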
 Finally, a correlation coefficient just
partially describes a relationship
between two quantitative variables.
 Thus, always accompany a
correlation coefficient with the
means & standard deviations of the
two variables (as well as perhaps a
measure of skewness).
 And beforehand always graph the
bivariate relationship.
How to do it in Stata
 univariate & bivariate analysis
. kdensity read, norm
. gr box read
. kdensity write, norm
. gr box write
. summarize read write, detail
. scatter read write || qfit read write
 Compute correlation coefficient
. corr read math
(obs=200)

             |     read     math
-------------+------------------
        read |   1.0000
        math |   0.6623   1.0000
 By a categorical variable (in order to
control its influence):
. scatter read math, by(female) || qfit
read math
. bys female: corr read math
 Note: Make sure there are enough
observations in each category to
detect a possible association (see
chapter 3).
Let’s now consider a
measure that’s related to
correlation: simple linear
regression.

Examples
 What does knowing a person’s years of
schooling (x) enable us to say about the
person’s earnings (y)?
 What does knowing amount of dietary fat
consumed (x) enable us to say about the
rate of heart disease (y)?
 Unlike correlation,
regression involves a
response variable (y) & an
explanatory variable (x).
 But the y/x relationship
does not necessarily imply
a causal relationship.
Always ask: What is the conceptual
reason for the y/x order? What if the
y/x order were reversed?

 Simple linear regression describes how
the values of a response variable depend
on the values of an explanatory variable.
 On average, how does earnings level (y)
change for every unit of increase in years
of education (x)?
 On average, how does the rate of heart disease (y) change for every unit of increase in dietary fat consumed (x)?
 Why is correlation of limited use in
shedding light on such questions?
 Why is regression more useful in
this regard?
 Unlike correlation, regression enables
us to gauge how much values of a
response variable (y) change, on
average, with increases in an
explanatory variable (x).
 And unlike correlation, regression
references the relationship to the units
of the model’s variables: e.g., for every
added year of education (x), earnings
(y) increases by $1,023, on average.
The Most Basic Differences between
Correlation & Regression:
 Correlation measures the degree of bivariate
cluster along a straight line: the strength of a
linear relationship.
 It implies nothing about causal order.
 It is measured in standardized units.
 Regression measures the degree of slope in
the linear relationship between an outcome
variable (y) and an explanatory variable (x): the
average rate of change in y for every unit
change in x.
 It is measured in the units of the model’s
variables: with every unit increase in x the value
of y changes by… units, on average.
 Be careful about implied causal
relationships!
To repeat, always ask: What is the
conceptual reason for the y/x order?
What if the y/x order were reversed?

. reg wage educ

      Source |       SS       df       MS              Number of obs =     526
-------------+------------------------------           F(  1,   524) =  103.36
       Model |  1179.73204     1  1179.73204           Prob > F      =  0.0000
    Residual |  5980.68225   524  11.4135158           R-squared     =  0.1648
-------------+------------------------------           Adj R-squared =  0.1632
       Total |  7160.41429   525  13.6388844           Root MSE      =  3.3784

------------------------------------------------------------------------------
        wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .5413593    .053248    10.17   0.000     .4367534    .6459651
       _cons |  -.9048516   .6849678    -1.32   0.187    -2.250472    .4407687
------------------------------------------------------------------------------
 Always first check number of observations: Is it correct?
 For each unit (i.e. year) increase in education, hourly wage
increases by 0.54 dollars, on average.
 Regression line: a straight line that
describes how changes in a response
variable (y) are associated with changes
in an explanatory variable (x), on
average.
 It describes the average rate of change
in y for every unit change in x.
 Regression measures a linear
association: non-linearity creates
misleading results.
 Regression involves
fitting a line to data:
drawing a straight line that
comes as close as possible
to the data points.
 Fitting a line to data means drawing a
straight line that comes as close as possible
to the data points.
Regression line:
ŷ = a + bx
y: response (or outcome or dependent) variable
a: intercept (the value of y when x = 0)
b: slope (the rate of change in y associated with every unit increase in x)
x: explanatory (or independent or predictor or right-hand-side) variable
 The equation fits a straight line to the
data.
 It’s called the least-squares line because it minimizes the distance (i.e. the residuals) between the equation’s y-predictions & the data’s y-observations.
 The better the model fits the data, the smaller the distance between the y-predictions & the y-observations (that is, the smaller the residuals).
 y-predictions are typically called
‘yhat.’
 The y-intercept (a) is usually meaningless in
substantive terms: it is the value of the dependent
variable when the independent variable=0.
 E.g., your GRE score if your IQ=0!
 The y-intercept (a) is included because it is
mathematically necessary for the regression
equation.
 Whether it’s substantively meaningful or not
depends on the sample.
 Simple linear regression:
ŷ = a + bx
How do we interpret a regression
equation?

 The regression line for y on x
estimates the average value for y
associated with each value of x.
 For every unit increase in x, y
increases/decreases by ….., on average.
Keep in mind that regression measures a linear association.

 Non-linearity creates
misleading results in
regression, just as it does in
correlation.
 The least-squares line of y on x
makes the sum of the squares of the
vertical distances of the data points
from the line (i.e. the squared
residuals) as small as possible.
 It does so via the following
formulas:
y  a  bx
b 
r * y
x
a  y  bx
What, then, is the formula for the
slope (b): the rate of change in y
for every unit increase in x, on
average?

It is:
 the correlation of x & y, times the
sd of y
 divided by the sd of x.
b = r*(sy/sx)
 So, for every standard-deviation increase in x, y changes by r times the standard deviation of y, on average.
 Unlike correlation, in regression the
slope coefficient (b) is expressed in
terms of the units of the relationship of
y to x.
 This makes it easier to interpret the
substantive meaning of a slope
coefficient than of a correlation
coefficient.
 E.g., for every hour increase in study
time, SAT score increases by 23 points,
on average.
 Simple linear regression:
ŷ = a + bx
b = r*(sy/sx)
a = ȳ − b*x̄
Regression Computation Example
 Let’s compute a regression equation to predict reading scores (y) from math scores (x), based on a sample of 200 students (hsb2.dta).
read (y): mean=52.23, sd=10.25
math (x): mean=52.65, sd=9.37
r=0.617
 Compute the regression equation:
slope (b) = (0.617*10.25)/9.37 = 0.675
y-intercept (a) = 52.23 – (0.675*52.65) = 16.69
So: predicted read = 16.69 + 0.675*math
 Let’s now predict reading scores for two x-values, math=35 & then math=65:
predicted y = 16.69 + 0.675*35 = 40.3
predicted y = 16.69 + 0.675*65 = 60.6
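 A minimal Stata sketch of the same computation from stored sample statistics (using hsb2; results will differ slightly from the rounded hand computation above):
. use hsb2, clear
. quietly summarize math
. local xbar = r(mean)
. local sx = r(sd)
. quietly summarize read
. local ybar = r(mean)
. local sy = r(sd)
. quietly corr read math
. display "b = " r(rho)*`sy'/`sx'
. display "a = " `ybar' - r(rho)*`sy'/`sx'*`xbar'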
Beware
 Software will accept any y/x ordering of
variables, even if it makes no substantive
sense.
 Always question the hypothesized y/x
order: e.g., IQ & family earnings—should
family earnings be x or y, should IQ be x or y?
 The y/x order of variables depends on the
conceptualization of the particular research
question: e.g., you may want to use GPA to
predict standardized test score, or you may
want to use standardized test score to predict
GPA.
 When we seek to explore causal relations, ask:
can we really establish that x precedes y in time?
 A temporal sequence is not always clear.
 Typically we’re using cross-sectional, not
longitudinal, data (& even longitudinal data
don’t always clarify matters).
 As McClendon says (Multiple Regression and
Causal Analysis, p. 5): “… it is often impossible to
know whether Y achieved its observed level
before or after X reached its observed level.”
 McClendon (p. 7) goes on to say that: “Good
theoretical arguments are often accepted in this
regard, although the inference will certainly be
more uncertain than if the temporal sequence
could be empirically established.”
 He says, moreover, that, even where temporal
sequence is not clear cut, X/Y regression
analyses “may cast doubt on existing theoretical
formulations by failing to find any relationship
between X and Y” (p. 7).
How to grasp that the slope (b) implies
that y responds to changes in x?
 Compute b*y/x, then record the slope coefficient
& plot the results.
 Do the same, but this time as b*x/y, then record
the slope coefficient & plot the results.
 Comparing the first & second equations, how do
the slope coefficients & plots differ?
 How does this feature of the slope coefficient differ
from the correlation coefficient?
 Do the y/x flip-flop for the read/math regression equation (see the Stata sketch below).
 What are the results? What do they tell us about the regression coefficient versus the correlation coefficient?
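 A minimal Stata sketch of the flip-flop (using hsb2):
. use hsb2, clear
. regress read math     // the slope of read on math
. regress math read     // the slope of math on read: a different line
. corr read math        // the correlation is the same either way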
 The results indicate that there are
two regression lines: one for y’s
dependence on x & the other for x’s
dependence on y.
 In contrast, in correlation there is just
one line: it’s the same for yx or xy.
 Keep in mind that, in order to make
sense, the yx order in regression
analysis must be based on
substantive & theoretical logic.
 So be careful: the regression equation
will accept the variables in any order,
even if the order (or the variables
themselves) makes no sense.
 We’ll talk some more about issues of
causality.
 We’ll do so in view of the introductory
discussion in Moore/McCabe/Craig.
 Another matter: beware of trying
to make predictions or
interpretations beyond the range of
the sample’s values.
 That is, beware of extrapolations.
There are two main ways of assessing a
regression model:
 Slope - (b) coefficient: the rate of change
in y for every unit change in x, on average.
 The slope coefficient—i.e. the regression
line—is typically what we care most about.
 Fit - r-squared: degree of cluster around a
regression line.
 While the slope coefficient (i.e. the regression
line) may explain part of the relationship of y to x,
there may be other sources of variation in y’s
values.
 The slope coefficient (i.e. the regression line)
does not say how large the additional variation is.
. use hsb2, clear
. scatter read math || qfit read math
[Scatterplot of reading score vs. math score with a quadratic fit line.]
 There’s a clear linear relationship (i.e. slope), but there’s scatter (i.e. variation) around it.
. lowess read math
[Lowess smoother of reading score on math score; bandwidth = .8.]

This means that:
(1) The slope coefficient indeed describes a linear
relationship of y on x.
(2) But if we wanted to explain the entirety of
the relationship of y on x, then we’d have to
examine additional explanatory variables
(which, so far, are lurking variables).
What might the additional
explanatory variables be?

To repeat, there are two main ways of assessing a regression model:
 Slope - (b) coefficient: the rate of change in
y for every unit change in x, on average.
 Fit - r-squared: degree of cluster
around a regression line.
 Let’s discuss r-squared.
r-squared
r2 = the square of the correlation
between y & x. That is:
r2 = degree of cluster around the least-squares line.
r2 = the fraction of the variation in the
values of y that is explained by the
least-squares regression of y on x.
r2 = variance of predicted values of y
divided by variance of the observed
values of y.
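 A minimal Stata sketch of that last definition (using hsb2):
. use hsb2, clear
. regress read math
. predict yhat
. quietly summarize yhat
. local varhat = r(Var)
. quietly summarize read
. display `varhat'/r(Var)    // equals the R-squared reported by -regress-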
Apply the r2 ‘fit’ procedure to the
read/math data (r=0.617): r x r =
r2

 What is the result concerning the
degree of scatter around the least
squares line? Your conclusion?
Slope (b) vs. Fit (r2)
 Slope (b): the degree of change in y for
every unit change in x, on average.
 Fit (r2): the fraction of the variation in the values of y that is explained by the least-squares regression of y on x (i.e. degree of cluster around the straight line).
 There can be a high r-squared with a
relatively flat slope, or a relatively
steep slope with a low r-squared.
 Why?
In short:
 Regression (i.e. slope) coefficient measures the
steepness in the least squares line: the degree
of change in y for every unit increase in x, on
average.
 r2 measures the fraction of the variation
in the values of y explained by x—the
degree of scatter around the least
squares line.
 Especially when we advance to multiple regression, we’ll
see that what generally matters most is the
regression coefficient: the steepness (i.e. slope) of the
linear relationship of y on x.
 It is the regression line (i.e. slope coefficient)
that measures the linear trend in how y
changes in response to changes in x.
 With multiple regression, we’ll see that merely adding
more explanatory variables—whether or not they make
conceptual sense—increases r-squared.
 Put differently, the slope coefficient (i.e. the
regression line) is about theoretically oriented,
generalizing analysis.
 r-squared is about historicist case-study
analysis (i.e. accounting for as much of the
variation as possible in a case study).
Watch out!
 The regression equation will yield
results for nonsensical or ambiguous
y/x order.
 The slope coefficient measures the
linear relationship of y on x.
Regression Trouble-Shooting
 The slope coefficient (b) is highly
susceptible to outliers.
 An outlier is ‘influential’ if
removing it notably changes the
regression coefficient.
 Before computing a regression
equation, always graphically check the
y variable & the x variable for
pronounced skewness & outliers. Do
so to alert you to possible problems.
 Then do a bivariate scatterplot of y on
x to check for non-linearity &
possible outliers.
 It is the bivariate scatterplot that
provides the definitive evidence for
simple regression.
 If there are outliers in the bivariate
scatterplot, compute the regression
equation with & then without the outliers.
 Compare the difference, & report it
if it’s notable.
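 A minimal Stata sketch of the with-&-without comparison (using hsb2; the excluded id values are just illustrative):
. regress read math
. estimates store full
. regress read math if id~=19 & id~=167
. estimates store trimmed
. estimates table full trimmed, b se    // compare the coefficients with & without the flagged observations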
. reg wage educ

      Source |       SS       df       MS              Number of obs =     526
-------------+------------------------------           F(  1,   524) =  103.36
       Model |  1179.73204     1  1179.73204           Prob > F      =  0.0000
    Residual |  5980.68225   524  11.4135158           R-squared     =  0.1648
-------------+------------------------------           Adj R-squared =  0.1632
       Total |  7160.41429   525  13.6388844           Root MSE      =  3.3784

------------------------------------------------------------------------------
        wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .5413593    .053248    10.17   0.000     .4367534    .6459651
       _cons |  -.9048516   .6849678    -1.32   0.187    -2.250472    .4407687
------------------------------------------------------------------------------
 First check N (# observations): is it correct?
 For every year of education, hourly wage increases by 0.54 dollars, on
average. But is this relationship linear?
[Scatterplot of average hourly earnings vs. years of education with a fitted regression line.]
 OLS regression permits the use of
categorical explanatory variables.
. tab female

     female |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        274       52.09       52.09
          1 |        252       47.91      100.00
------------+-----------------------------------
      Total |        526      100.00
. tab female, su(wage)

            | Summary of average hourly earnings
     female |        Mean   Std. Dev.       Freq.
------------+------------------------------------
          0 |         7.1         4.2         274
          1 |         4.6         2.5         252
------------+------------------------------------
      Total |         5.9         3.7         526
. gr box wage, over(female, total)
[Box plots of average hourly earnings for males (0), females (1), & the total sample.]
 0=male 1=female
. reg wage female

------------------------------------------------------------------------------
        wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |   -2.51183   .3034092    -8.28   0.000    -3.107878   -1.915782
       _cons |   7.099489   .2100082    33.81   0.000     6.686928     7.51205
------------------------------------------------------------------------------
 Because ‘wage’ refers to hourly wage, this
indicates that being female reduces hourly wage
by $2.51, on average.
 Of course, the validity of wage’s relationships to
education as well as to female needs to be
assessed by regression diagnostics.
Regression Diagnostics:
Is the fit linear?
 Residual: the difference between an observed
value of the response variable & the value
predicted by the regression line.
residual = observed y – predicted y
 Residual plot: a scatterplot of the regression
residuals against the explanatory variable. It
helps assess the fit of a regression line to the
data: is the fit linear, or not?
 If the regression model (i.e. equation)
fits the data, then the residual plot
indicates no pattern in the residuals.
 If the regression model doesn’t fit the
data, then the residual plot indicates a
pattern—typically curvilinear or fan-shaped.
 Scatterplot & residual-vs.-explanatory-variable diagnostic
plot of a linear fit.
 Residual-vs.-explanatory-variable diagnostic plots of nonlinear fit.
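 A minimal Stata sketch of such a residual-vs.-explanatory-variable plot (y & x are placeholder names):
. regress y x
. predict e, resid
. scatter e x, yline(0)    // look for curvature or a fan shape in the residuals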
What if the fit is nonlinear?
Check for possible data errors (in
measurement or reporting/coding).

 Consider transforming either y or x, or
both: more on this near the end of the
semester.
 Consider reformulating the regression
model, including possibly incorporating
other x-variables: multiple regression.
We’ll say more on this near the end of the
semester.
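 For the wage/educ example, one commonly tried option (a sketch of a possibility, not the course’s prescribed fix) is to log the outcome:
. generate lwage = ln(wage)
. regress lwage educ
. rvfplot, yline(0)    // re-check the residual pattern after the transformation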
Other Questions
 What happens to the regression
coefficient if the standard deviation of
either variable or both variables is
zero?
 What kinds of social research are doable, or not doable, with regression analysis?
How to do it in Stata
. kdensity wage, norm
. gr box wage
. kdensity educ, norm
. gr box educ
. su wage educ, detail
. scatter wage educ || qfit wage educ
[Kernel density plots of average hourly earnings & years of education, each with a normal density overlay.]
 Serious problems of nonlinearity.
[Scatterplot of average hourly earnings vs. years of education with a quadratic fit line.]
 Given the pronounced nonlinearity, we should explore transforming the variables before estimating the equation, but for our didactic purposes we won’t do so.
. reg wage educ

      Source |       SS       df       MS              Number of obs =     526
-------------+------------------------------           F(  1,   524) =  103.36
       Model |  1179.73204     1  1179.73204           Prob > F      =  0.0000
    Residual |  5980.68225   524  11.4135158           R-squared     =  0.1648
-------------+------------------------------           Adj R-squared =  0.1632
       Total |  7160.41429   525  13.6388844           Root MSE      =  3.3784

------------------------------------------------------------------------------
        wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .5413593    .053248    10.17   0.000     .4367534    .6459651
       _cons |  -.9048516   .6849678    -1.32   0.187    -2.250472    .4407687
------------------------------------------------------------------------------
. predict yhat
. hist yhat, norm
. predict resid, resid
. hist resid, norm
. list yhat wage resid
     +--------------------------------+
     |     yhat   wage        resid  |
     |--------------------------------|
  1. |   5.0501    3.1      -1.9501  |
  2. | 5.591459    3.2     -2.35146  |
  3. |   5.0501      3      -2.0501  |
  4. | 3.426023      6     2.573977  |
  5. | 5.591459    5.3    -.2914593  |
     |--------------------------------|
  6. | 7.756896    8.8     .9931036  |
  7. | 8.839615     11     2.410385  |
  8. | 5.591459      5    -.5914595  |
     +--------------------------------+
. rvfplot, yline(0)
[Residual-versus-fitted plot with a reference line at 0.]
 Very nonlinear fit—not surprising given the preliminary graphic evidence.
. rvfplot, yline(0) ml(id)
[The same residual-versus-fitted plot with each observation labeled by its id number.]
Summary
 What are the main differences between
correlation & regression?
 What assumption, measures, problems, &
procedures are common to computing &
interpreting a correlation coefficient (r) & a
regression slope coefficient (b)?
Cautions about Correlation &
Regression
 Nonlinearity causes misleading results.
 A lurking (i.e. unmeasured confounding)
variable is one that isn’t among the explanatory
or response variables in a study yet may influence
the interpretation of the relationships among the
study’s variables.
 An outlier is an observation that lies outside the
overall pattern of the other observations. An
outlier is influential if removing it would markedly
change the result of the calculation.
 Points that are outliers in the x-variable direction of a scatterplot are often influential in the least-squares regression line.

Be careful concerning correlations (or regression
equations) based on different sets of
individuals (i.e. different sets of observations or
subjects).
 Beware of correlations based on averaged data
(called ecological correlations).
 Beware of correlations based on restricted-range data (the common resulting problem being attenuation).
Beware: association does not imply
causation.

 For any implied causal relationship,
always ask: What is the conceptual basis
for the relationship? What if y & x were
reversed?
 Beware: a regression equation will
accept & report results for questionable
y/x relationships.
Correlation & Regression: What Are the
Computational Building Blocks?
 Individual observation-values (xi’s & yi’s);
 the means (x̄ & ȳ); the standard deviations (sx & sy); & the standard values (z(xi)’s & z(yi)’s).
 All of the above go into computing a
correlation coefficient.
 All of the above, plus the correlation
coefficient, go into computing the regression
coefficient.
Data Analysis for Two-Way Tables
 Like correlation & regression, in two-way tables (‘contingency tables’ or ‘cross-tabulations’) both variables must be measured on the same individuals or cases.
 But two-way tables use categorical
variables, which summarize counts of
observations.
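 A minimal Stata sketch of a two-way table (assuming hsb2’s categorical variables prog & female):
. use hsb2, clear
. tabulate prog female    // counts of observations in each combination of categories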
How does the question of
causal order enter into all
of this?
Causation
 Association does not necessarily signify
causation.
 Three basic forms of causation:
(1) Direct causation (x causes y): more
knowledge (x) causes higher test scores
(y).
(2) Common response of x & y to
lurking variable-z: more knowledge
(x) is associated with higher test scores
(y); higher test scores (y) are associated
with higher SES (z).
(3) Confounding (x causes y & z causes y, while x
& z are associated with each other): the effects of
two explanatory or lurking variables on a response
variable are mixed together, so that we can’t
(easily) disentangle their effects on the response
variable.
Attending church (x) causes longer life (y); but
good health habits (z) are associated with attending
church (x) & a longer life (y). So good health
habits are confounded with attending church.
 By the way, what if y were re-conceptualized as
the explanatory variable?
Note: There’s no hard & fast distinction between
common response & confounding.
 The distinction between common
response & confounding is not always
clear.
 What matters is that “even a
strong association between two
variables is not by itself good
evidence that there’s a cause &
effect link between the variables”
(Moore/McCabe/Craig).
 See King et al., Designing Social
Inquiry, pages 75-114.
How to (more or less) establish a causal
relation between x & y?
(1) The association between x & y is strong.
(2) The association between x & y is consistent
across different settings.
(3) Changes in one variable are consistently
associated with changes in the other variable.
(4) x precedes y in time.
(5) The causal relationship is plausible.
(6) Lurking variables have been controlled for (see
#2).
Beware: conclusions are always uncertain.
Review

What are the most basic issues of theory,
methods & statistics?
 What is statistics? What are data?
 What is exploratory data analysis?
 How do we analyze a data set from the
perspective of statistics?
 What is a variable? What are the most basic
kinds of variables? How do we analyze them
graphically & numerically?
 What are the basic numerical measures? How are
they computed? What problems are associated with
them? How should we address these problems?

What are linear transformations? What are the
basic kinds? How do they affect variables?
 What are density curves? What kinds are there,
& why are they important?
 How do the median & the mean pertain to the
various kinds of density curves?
 What are normal distributions, and what are their
basic features? Why are normal distributions
important?
 How do the median, mean & standard
deviation describe a normal distribution?
What is the 68-95-99.7 rule, & why is it
important?
 What is a standard normal distribution?
 What is standardization? Why is it important?
 How does standardization pertain to the
normal distribution?
 How are standard values computed? How do
we recapture an original x-value from its
standardized value?
 What’s a correlation? What kind of variable does
it assess? What are a correlation coefficient’s
characteristics?
 What does a correlation have to do with
causation?
 How is a correlation computed? What does the
computation have to do with mean, standard
deviation & standardization?

What problems are associated with a correlation
coefficient? How do we examine such problems?
What kind of remedial action can be taken?
 What is an association between variables?
 What’s a response variable? What’s an
explanatory variable?
 What’s a scatterplot? How do we assess its
pattern?
 How do we use graphs & a scatterplot as a
combined strategy to examine univariate &
bivariate distributions?
 What’s a positive association? What’s a negative
association?
 How do we examine the relationship between a
quantitative variable & a categorical variable?
 What’s regression?
 What’s a regression line? What’s the form of a
regression equation? What does it measure?
 What does each component of the equation
measure? How is each component computed?
 What’s the difference between correlation &
regression? How can the difference be
demonstrated?
 What is the connection between a correlation
coefficient (r) & a regression slope coefficient (b)?
How are both of them connected, in turn, to mean
& standard deviation?
 What problems are associated with regression?
What remedial actions can be taken?
 What is association? What is a negative
association? What is a positive association?
 What is causation? What’s the difference between
association & causation?
 What are the basic kinds of causation?
 How do we (more or less) establish causation?
 What are lurking variables?
 What are the ramifications of what we’ve
considered so far for the social construction of
reality & the study of social relations/public
policies?
How can we summarize all
of this in terms of the ‘six
fundamental issues of
statistics in theoretical
perspective’?

What kinds of social research
are doable or not doable with
correlation & regression?
