IV. Selecting Variables

• How do we go about selecting variables for regression models?

• In fact, we've already spent considerable time on this topic (questions of causality within a multivariate framework).

• Most fundamentally, we should include variables only when, within a sound conceptual framework:
(1) We want to find out how they affect the dependent variable. Or
(2) We want to control for their effects on the dependent variable.
• So, we include independent variables only within sound conceptual frameworks that lead us to hypothesize that the variables:
(1) Have causal effects on the dependent variable.
(2) Are correlated with each other.
(see Allison, pages 49-52)
• Let's keep in mind that a properly conducted, randomized experimental design automatically imposes controls.

• That is, it automatically ensures that there's no correlation between the treatment variable and the characteristics of the subjects. (see Allison, page 50)
• Today we'll introduce some variable-selection procedures that most of us do not recommend using (see, e.g., Allison, pages 92-93; Mendenhall & Sincich, chapter 6).

• More importantly, we'll then examine a non-automated, conceptually guided & systematic approach to selecting variables: the way we should do things.
Automated Procedures
• What if we have lots of potential explanatory variables but no clear reasons to guide us in selecting them for a model?

• An automated approach to the problem is stepwise regression:

sw regress y x1 x2 x3 … xk, options

. use stepwise, clear
. corr x*
. collin x*
[collin, a user-written command, displays a set of collinearity statistics]
• Forward stepwise selection:

. sw regress y x1 x2 x3 x4 x5 x6, pe(.99)
Set pe, the significance level for entry ('probability of entry'), to .99 so that all the variables will enter & their p-value order can be observed.

. sw regress y x1 x2 x3 x4 x5 x6, pe(.25)
Set pe to .25 so that only the variables with p-values <= .25 will enter & be retained.

Logic of forward stepwise selection:
(1) Stepwise first fits a model of y on the constant alone.
(2) Stepwise then considers adding x1, then x2, then … x6.
(3) At each step, stepwise adds the x-variable of the series that is most significant statistically. In our example, if a variable's significance level is <= .25, stepwise keeps it in the model.

• pe: 'eligible for addition'
• Backward stepwise selection:

. sw regress y x1 x2 x3 x4 x5 x6, pr(.99)
Set pr, the significance level for removal ('probability of removal'), to .99 so that virtually all the variables are retained & their p-value order can be observed.

. sw regress y x1 x2 x3 x4 x5 x6, pr(.25)
Set pr to .25 so that only the variables with p-values < .25 will be retained.

• pr: 'eligible for removal'

Logic of backward stepwise selection:
(1) Stepwise first fits a model of y on x1 … x6.
(2) Stepwise then considers dropping x1, then x2, then … x6.
(3) At each step, stepwise finds the x-variable that's least significant statistically. In our example, if a variable's significance level is > .25, stepwise removes it from the model.
• Stepwise selection may prove helpful in exploratory data analysis, but it is fraught with serious problems:
(1) Most basically, it capitalizes on sample-specific chance: with a large enough pool of variables, the procedure will find statistically significant results by chance, based on the particular sample's quirks (see the sketch below).
(2) In another sample, the procedure is likely to select different variables.
(3) And in the case of any sample, stepwise cannot take theoretical or practical significance into account: there's nothing to keep it from selecting nonsensical variables.
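To see problem (1) concretely, here's a minimal Stata sketch with an entirely made-up dataset: the outcome and all twenty candidate predictors are pure noise, yet forward stepwise with pe(.05) will often retain a variable or two anyway.

* Illustration only: stepwise capitalizing on sample-specific chance.
clear
set seed 12345
set obs 100
generate y = rnormal()            // outcome: pure noise
forvalues i = 1/20 {
    generate x`i' = rnormal()     // 20 candidate predictors: also pure noise
}
* Forward stepwise with a .05 entry threshold; any variable it
* retains is a chance result, since no x is truly related to y.
sw regress y x1-x20, pe(.05)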
• What's a much more constructive & defensible approach?

• Use conceptual criteria to select the pool of possible variables, then use some combination of conceptual & model-fitting criteria to narrow the pool.
A Conceptually Based Approach
• How is outcome variable y conceptualized? What are its topic-specific, as well as broader political, social, & cultural premises (e.g., IQ, race, gender)? In what ways are these valid or not? How do they contribute to the social construction of reality?

• What is it about y that needs to be explained, & why? What are the topic-specific, as well as broader political, social, & cultural premises of the question (e.g., IQ, race, gender)? How do they pertain to the social construction of reality?
• Within a solid conceptual framework, what independent variables are likely to have causal effects on the dependent variable?

• And what independent variables, having causal effects on the dependent variable as well as correlations with other independent variables, need to be controlled?
• Concerning the data for y, are the sample, the broader study design & procedures, & the variable's measurement valid or not:
• Temporally speaking, for the implied X/Y relationship?
• Technically speaking, for the measurement premises (i.e., is y an ordinal or interval quantitative variable, or some kind of categorical variable)?
• Allison (pages 52-57) lays out other basic questions to ask:

• Based on our knowledge of the topic, does the dependent variable affect any of the independent variables?
• Reverse causation may bias the estimated effects of the independent variables.
• There's not much that can be done about it.
• If the problem seems serious enough, re-conceptualize the model.
• Are there omitted variables?
• Omitting relevant variables biases the estimated coefficients to some degree or another.
• Are the variables measured well?
• There's bias to the degree that they are not measured well.
• The greater an explanatory variable's measurement error, the more strongly its estimated coefficient tends to be biased toward zero (i.e., attenuated, or underestimated).
• The less the measurement error, the closer the estimate tends to be to the true value.
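Here's a minimal Stata sketch of that attenuation, with made-up data and names: the true slope is 1, but regressing y on the error-laden xobs shrinks the estimate toward zero (to roughly .5 here, given equal signal & noise variances).

* Illustration only: attenuation bias from measurement error.
clear
set seed 2024
set obs 1000
generate xtrue = rnormal()            // true explanatory variable
generate y     = 1*xtrue + rnormal()  // true slope = 1
generate xobs  = xtrue + rnormal()    // x observed with error
regress y xtrue   // slope estimate near 1
regress y xobs    // slope estimate biased toward zero (about .5)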
• Do some independent variables mediate the effects of others on the dependent variable?
• This is a key question for understanding the causal processes within a model.
• An independent variable's total effect = direct effect + indirect effects.
• Test nested models to find out (see the sketch below).
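One common way to probe mediation in Stata is to compare nested models and watch how a coefficient changes once the hypothesized mediator enters. A sketch, where y, x1, and the mediator m are placeholder names:

* Does m mediate x1's effect on y? (placeholder names)
regress y x1        // total effect of x1
regress y x1 m      // direct effect of x1, controlling for m
* If x1's coefficient shrinks noticeably once m is added, part of
* x1's total effect appears to be indirect, operating through m.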
• Is there multicollinearity?
• Common warning thresholds: VIF > 10; tolerance < .1; condition index > 15.
• If there is multicollinearity, the standard errors become too large.
• This makes it harder to detect statistical significance.
• Note: multicollinearity is a problem for hypothesis-testing models but not for predictive models.
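In Stata, one common check is estat vif after fitting the regression (tolerance is simply 1/VIF); the user-written collin command shown earlier also reports condition indices. Variable names here are placeholders:

* Collinearity diagnostics after a regression (placeholder names).
regress y x1 x2 x3
estat vif          // variance inflation factors; 1/VIF = tolerance
collin x1 x2 x3    // user-written: VIFs, tolerances, & condition indices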
• Taking all of this into account, here's a series of questions to ask:

Part I
• Is the sample, including its size, adequate for the study's purpose?

• What is the dependent variable? How is it defined & measured? Is it well measured? Does it possibly have effects on the explanatory variables? If so, to what degree?

• What explanatory variables should be included in the model, & why? To what extent are data on these variables available or collectible?

• How is each potential explanatory variable defined & measured? Is it well measured?
Part II

• Document, in terms of the literature and your knowledge of the topic, the hypothesized relationship of each explanatory variable to y (see McClendon, chap. 3; Agresti/Finlay, chap. 10; King et al.). Is the relationship:
(1) Linear, independent (untransformed quantitative variable)?
(2) Same slope but unequal y-intercepts (dummy variables, including multinomial categorical)?
(3) Critical thresholds (categorical binary or ordinal)?
(4) Increasing or decreasing effect (quadratic or log)?
(5) Dependent on the level of another explanatory variable (interactional, meaning unequal slopes)?
(6) Or some combination of these? (The sketch below shows how each form is typically specified.)
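For reference, here's how such hypothesized forms are commonly written in Stata's factor-variable notation; all variable names (x1, x2, group, threshold) are placeholders.

* Common functional forms, keyed to (1)-(5) above (placeholder names).
regress y x1                 // (1) linear, untransformed effect
regress y x1 i.group         // (2) dummies: same slope, unequal intercepts
regress y i.threshold        // (3) critical thresholds via categories
regress y c.x1##c.x1         // (4) quadratic: increasing/decreasing effect
generate lnx2 = ln(x2)       //     ...or a log form
regress y lnx2
regress y c.x1##c.x2         // (5) interaction: unequal slopes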
Part III

• Univariate analysis: graphically & numerically describe y's distribution: its overall pattern & striking deviations, i.e., shape, center, & spread, including notable outliers.

• Should y be transformed or not, & why?

• Univariate analysis: graphically & numerically describe the distribution of each explanatory variable: its overall pattern & striking deviations, i.e., shape, center, & spread, including notable outliers.

• Should any of the explanatory variables be transformed or not, and why?
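A typical Stata pass at these descriptions might look like this (y is a placeholder); ladder & gladder help judge whether a power transformation brings the variable closer to normality.

* Univariate description & transformation checks (placeholder name).
summarize y, detail        // center, spread, percentiles, extremes
histogram y, normal        // shape, with a normal-curve overlay
graph box y                // outliers
ladder y                   // numerically search the ladder of powers
gladder y                  // the same search, graphically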
Part IV

• List the explanatory variables (including transformations & interactions) in order of their conceptual importance to y, & explain this order as well as the hypothesized form of each relationship (see McClendon, chap. 3).
Part V

• Bivariate analysis: graphically & numerically describe the bivariate relationship of each explanatory variable to y (see McClendon, pages 107-116; Agresti/Finlay, chap. 10).

• Bivariate analysis: graphically & numerically describe the bivariate relationships of the explanatory variables to each other (see McClendon, pages 107-116; Agresti/Finlay, chap. 10).

• Bivariate analysis, controlling another explanatory variable: graphically & numerically describe the bivariate relationship of each explanatory variable to y, sequentially holding another explanatory variable constant (see McClendon, pages 107-116; Agresti/Finlay, chap. 10).

• Estimate a bivariate regression model of each explanatory variable's relation to y, noting the value and p-value of each coefficient. (A sketch of these steps follows.)
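In Stata these bivariate passes might run as follows; the variable names are placeholders, with catvar standing in for a categorical control.

* Bivariate descriptions & regressions (placeholder names).
scatter y x1 || lfit y x1          // each x against y, with a fitted line
pwcorr y x1 x2 x3, sig             // pairwise correlations, with p-values
bysort catvar: pwcorr y x1, sig    // y vs. x1, holding catvar constant
regress y x1                       // bivariate regression: coefficient & p-value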
Part VI

• After completing the preparatory data analysis, estimate & assess multiple regression models (a sketch of steps (1)-(2) follows the list):
(1) Estimate a preliminary main-effects model (i.e., without curvilinear terms).
(a) Use mark/markout to ensure an equal # of observations across models.
(b) How have any bivariate relationships changed?
(c) For now, eliminate the explanatory variables that test insignificant.
(2) Conduct a nested-model F-test, also comparing # of observations (which must be equal for the nested tests), Adj R2, & the explanatory variables' signs & coefficients, standard errors, p-values, & confidence intervals.
(3) One by one, drop each explanatory variable & compare the models regarding Adj R2, slope coefficients (direction & size), standard errors, p-values, & confidence intervals.
(4) For the time being, drop any insignificant variables: this is the preliminary main-effects model.
(5) Re-check for any other possible y/x curvilinearities by combining qladder/ladder with any of the following commands: qladder x1; ladder x1; sparl y x1 (& options); locpoly y x1, de(#); twoway mband y x1, ba(#); lowess y x1, bw(.#); scatter y x1 || qfit (& fpfit) y x1.
(6) Re-estimate the model as necessary to explore curvilinear y/x relationships, comparing Adj R2, coefficients (direction & size), standard errors, p-values, & confidence intervals.
(7) Consider & possibly explore whether or not it makes sense to collapse or otherwise revise the categories of categorical variables.
(8) Re-estimate the model as necessary to explore revised y/x relationships in terms of collapsed &/or otherwise revised categorical variables.
(9) Consider all possible substantively meaningful interactions.
(a) One by one, add each such interaction to the model, comparing # of observations, Adj R2, & the explanatory variables' coefficients, standard errors, p-values, & confidence intervals.
(b) Add all of the interactions that tested significant to the model; conduct all possible nested-model tests, also comparing # of observations (which must be equal for the nested tests), Adj R2, & the explanatory variables' coefficients, standard errors, p-values, & confidence intervals.
(c) Re-add all the variables that had previously been dropped; conduct all possible nested-model tests, also comparing # of observations (which must be equal for the nested tests), Adj R2, & the explanatory variables' coefficients (direction & size), standard errors, p-values, & confidence intervals.
(10) Drop those explanatory variables that are characterized by some combination of a weak conceptual relationship with y & statistical insignificance, or which somehow detract from the clarity of the model.
(a) Estimate this: the preliminary final model.
(b) Conduct the battery of graphic & numerical diagnostic tests.
(11) Re-add any other variables that are conceptually/theoretically relevant.
(a) Estimate this: the final, complete model.
(b) Conduct the battery of graphic & numerical diagnostic tests.
(c) Test nested models (see Mendenhall & Sincich), conducting the diagnostic tests for each model.
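To make steps (1) & (2) concrete, here's a minimal Stata sketch with placeholder variable names. mark/markout build an estimation-sample flag so nested models are fit on the same observations, & the nestreg prefix runs the nested-model F-tests:

* Steps (1)-(2): equal samples & nested F-tests (placeholder names).
mark touse
markout touse y x1 x2 x3 x4    // touse = 1 only where all variables are nonmissing
regress y x1 x2 x3 x4 if touse // preliminary main-effects model
* Nested-model F-tests on the common sample: does adding x3 & x4
* significantly improve on the model with x1 & x2 alone?
nestreg: regress y (x1 x2) (x3 x4) if touse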