Multiple Regression 6
Sociology 5811 Lecture 27
Copyright © 2005 by Evan Schofer
Do not copy or distribute without
permission
Announcements
• Final paper due now!!!
• Course Evaluations
• Wrap up multiple regression
• Discuss issues of causality, if time remains.
Review: Outliers
• Strategy for identifying outliers:
• 1. Look at scatterplots or regression partial plots
for extreme values
• Easiest. A minimum for final projects
• 2. Ask SPSS to compute outlier diagnostic
statistics
– Examples: “Leverage”, Cook’s D, DFBETA,
residuals, standardized residuals.
Review: Outliers
• Example: Study time and student achievement.
– X variable: Average # hours spent studying per day
– Y variable: Score on reading test
Case   X      Y
1      2.60   28
2      1.40   13
3       .65   17
4      4.10   31
5       .25    8
6      1.90   16
7      3.50    6
[Scatterplot: reading test score (Y axis, 0–30) vs. hours studying (X axis, 0–4)]
Review: Outliers
• Cook’s D: Identifies cases that are strongly
influencing the regression line
– SPSS calculates a value for each case
• Go to “Save” menu, click on Cook’s D
• How large of a Cook’s D is a problem?
– Rule of thumb: Values greater than: 4 / (n – k – 1)
– Example: N=7, K = 1: Cut-off = 4/5 = .80
– Cases with higher values should be examined
• Big residuals also can indicate outliers.
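As an illustration only (the lecture computes these in SPSS), here is a minimal Python sketch of Cook's D and the rule-of-thumb cut-off for the study-time example above, assuming numpy and statsmodels are available:

```python
# Illustrative sketch only (the lecture uses SPSS): Cook's D for the
# study-time example, flagged with the 4 / (n - k - 1) rule of thumb.
import numpy as np
import statsmodels.api as sm

hours = np.array([2.60, 1.40, 0.65, 4.10, 0.25, 1.90, 3.50])
score = np.array([28, 13, 17, 31, 8, 16, 6])

fit = sm.OLS(score, sm.add_constant(hours)).fit()
cooks_d = fit.get_influence().cooks_distance[0]   # one value per case

n, k = len(score), 1
cutoff = 4 / (n - k - 1)                          # 4 / 5 = 0.80 here
for case, d in enumerate(cooks_d, start=1):
    flag = "  <-- examine this case" if d > cutoff else ""
    print(f"Case {case}: Cook's D = {d:.3f}{flag}")
```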
Review: Outliers
• Example: Outlier/Influential Case Statistics
Hours   Score   Resid    Std Resid   Cook's D
2.60    28        9.32     1.01       .124
1.40    13       -1.97     -.215      .006
 .65    17        4.33      .473      .070
4.10    31        7.70      .841      .640
 .25     8       -3.43     -.374      .082
1.90    16        -.515     -.056     .0003
3.50     6      -15.4     -1.68       .941
Review: Outliers
• Question: Should you drop “outlier” cases?
– Obviously, you should drop cases that are incorrectly
coded or erroneous
– But, you should be cautious about throwing out cases
• If you throw out enough cases, you can produce any result
that you want! So, be judicious when destroying data
• When in doubt: Present results both with and
without outliers
– Or present one set of results, but mention how results
differ depending on how outliers were handled.
Multicollinearity
• Another common regression problem:
Multicollinearity
• Definition: collinear = highly correlated
– Multicollinearity = inclusion of highly correlated
independent variables in a single regression model
• Recall: High correlation of X variables causes
problems for estimation of slopes (b’s)
– Recall: the denominators in the slope formulas approach zero,
so coefficients may be wrong or implausibly large.
Multicollinearity
• Multicollinearity symptoms:
• Unusually large standard errors and betas
• Compared to models that do not include both collinear variables
• Betas often exceed 1.0
• Two variables have the same large effect when
included separately… but…
– When put together the effects of both variables shrink
– Or, one remains positive and the other flips sign
• Note: Not all “sign flips” are due to multicollinearity!
Multicollinearity
• What does multicollinearity do to models?
– Note: It does not violate regression assumptions
• But, it can mess things up anyway
• 1. Multicollinearity can inflate standard error
estimates
– Large standard errors = small t-values = no rejected
null hypotheses
– Note: Only the collinear variables are affected. The rest
of the model results are OK.
Multicollinearity
• What does multicollinearity do?
• 2. It leads to instability of coefficient estimates
– Variable coefficients may fluctuate wildly when a
collinear variable is added
– These fluctuations may not be “real”, but may just
reflect amplification of “noise” and “error”
• One variable may only be slightly better at predicting Y…
but SPSS will give it a MUCH higher coefficient
– Note: These only affect variables that are highly
correlated. The rest of the model is OK.
Multicollinearity
• Diagnosing multicollinearity:
• 1. Look at correlations of all independent vars
– Correlation of .7 is a concern; above .8 is often a problem
– But, problems aren't always bivariate…
and may not show up in bivariate correlations
• Ex: If you forget to omit a dummy variable
• 2. Watch out for the “symptoms”
• 3. Compute diagnostic statistics
• Tolerances, VIF (Variance Inflation Factor).
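For step 1, a quick way to scan the pairwise correlations (an illustrative Python sketch; the data file and variable names are hypothetical, and the lecture itself uses SPSS):

```python
# Illustrative sketch: inspect pairwise correlations among independent
# variables. The data file and column names here are hypothetical.
import pandas as pd

df = pd.read_csv("students.csv")                         # hypothetical data
X = df[["hours_studied", "parent_income", "absences"]]   # hypothetical IVs
print(X.corr().round(2))   # watch for pairs above roughly .7-.8
```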
Multicollinearity
• Multicollinearity diagnostic statistics:
• “Tolerance”: Easily computed in SPSS
– Low values indicate possible multicollinearity
• Start to pay attention at .4; Below .2 is very likely to be a
problem
– Tolerance is computed for each independent variable
by regressing it on other independent variables.
Multicollinearity
• If you have 3 independent variables: X1, X2, X3…
– Tolerance is based on doing a regression: X1 is
dependent; X2 and X3 are independent.
• Tolerance for X1 is simply 1 minus regression R-square.
• If a variable (X1) is highly correlated with all the
others (X2, X3) then they will do a good job of
predicting it in a regression
• Result: Regression R-square will be high… 1 minus R-square will be low… indicating a problem.
Multicollinearity
• Variance Inflation Factor (VIF) is the reciprocal
of tolerance: 1/tolerance
• High VIF indicates multicollinearity
– Gives an indication of how much the variance of a coefficient is
inflated by the presence of the other variables (the standard error
grows by roughly the square root of VIF).
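As a hedged illustration of how these statistics are defined (the lecture obtains them from SPSS), the sketch below computes tolerance and VIF for hypothetical independent variables X1, X2, X3 by regressing each on the others, exactly as described above:

```python
# Illustrative sketch: tolerance and VIF computed from their definitions.
# Column names are hypothetical; the lecture obtains these from SPSS.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("students.csv")          # hypothetical data
ivs = ["X1", "X2", "X3"]                  # hypothetical independent variables

for var in ivs:
    others = [v for v in ivs if v != var]
    aux = sm.OLS(df[var], sm.add_constant(df[others])).fit()
    tolerance = 1 - aux.rsquared          # low tolerance = possible problem
    vif = 1 / tolerance                   # high VIF = possible problem
    print(f"{var}: tolerance = {tolerance:.3f}, VIF = {vif:.2f}")
```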
Multicollinearity
• Solutions to multicollinearity
– It can be difficult if a fully specified model requires
several collinear variables
• 1. Drop unnecessary variables
• 2. If two collinear variables are really measuring
the same thing, drop one or make an index
– Example: Attitudes toward recycling; attitude toward
pollution. Perhaps they reflect “environmental views”
• 3. Advanced techniques: e.g., Ridge regression
• Uses a lower-variance but biased estimator (no longer BLUE).
Nested Models
• It is common to conduct a series of multiple
regressions
– Adding new variables or sets of variables to a model
• Example: Student achievement in school
• Suppose you are interested in effects of neighborhood
– You might first look at all demographic effects…
– Then add neighborhood variables as a group
• Hopefully to show that they improve the model.
Nested Models
• Question: Do the new variables substantially
improve the model?
• Idea #1: See if your variables are significant
• Idea #2: See if there is an increase in the adjusted
R-Square
• Idea #3: Conduct an F-test
– A formal test to see if the group of variables improves
the model as a whole (increases the R-square)
– Recall that F-tests allow comparisons of variance
(e.g., SSbetween to SSwithin).
Nested Models
• F-tests require “nested models”
– Models are the same, except for addition of new
variables
– You can’t compare totally different models this way
F_{(K_2 - K_1),\,(N - K_2 - 1)} = \frac{(R_2^2 - R_1^2) / (K_2 - K_1)}{(1 - R_2^2) / (N - K_2 - 1)}
– where R²₂ and K₂ come from the second (larger) model, R²₁ and K₁ from the first
• Tests the following hypotheses:
• H0: Two models have the same R-square
• H1: Two models have different R-square
Nested Models
• SPSS can conduct an F-test between two
regression models
• A significant F-test indicates:
• The second model (with additional variables) is a
significant improvement (in R-square) compared
to the first.
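As an illustration (not the SPSS procedure itself), the sketch below applies the F formula above to two nested models built from hypothetical demographic and neighborhood variables; statsmodels' compare_f_test reports the same test directly:

```python
# Illustrative sketch: F-test comparing nested regression models.
# Variable names are hypothetical; SPSS can produce the same test.
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

df = pd.read_csv("students.csv")   # hypothetical data

m1 = smf.ols("achievement ~ age + female + parent_income", data=df).fit()
m2 = smf.ols("achievement ~ age + female + parent_income + "
             "nbhd_poverty + nbhd_crime", data=df).fit()

# Apply the slide's formula: K1, K2 = number of predictors in each model
n, k1, k2 = len(df), 3, 5
F = (((m2.rsquared - m1.rsquared) / (k2 - k1))
     / ((1 - m2.rsquared) / (n - k2 - 1)))
p = stats.f.sf(F, k2 - k1, n - k2 - 1)
print(f"F({k2 - k1}, {n - k2 - 1}) = {F:.2f}, p = {p:.4f}")

# statsmodels does the same comparison directly:
print(m2.compare_f_test(m1))       # (F statistic, p-value, df difference)
```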
Extensions of Regression
• The multivariate regression model has been
altered and extended in many ways
– Many techniques are “regression analogues”
– Often, they address shortcomings of regression
• Problem: Regression requires that the dependent
variable is interval
• Solution: Logistic Regression (also Probit, others)
– Allows analysis of a dichotomous dependent variable
Extensions of Regression
• Problem: Many variables are “counts”
– Example: Number of crimes committed
– Counts = non-negative integers; often highly skewed
• Solution: Poisson Regression and Negative
Binomial Regression
– These models use a non-linear approach to model
counts.
Extensions of Regression
• Problem: Sometimes we want to measure cases
at multiple points in time
– Example: economic data for companies
– Cases are not independent, errors may be correlated
• Solution: Time-series procedures:
– ARIMA, Prais-Winsten, Newey-West, and others
– All different ways to address serial correlation of
errors.
Extensions of Regression
• Problem: Sometimes cases are not independent
because they are part of larger groups
– Example: Research on students in several schools
– Cases within each school share certain similarities
(e.g., neighborhood). They are not independent.
• Solution: Hierarchical Linear Models (HLM).
Extensions of Regression
• Problem: Severe measurement error
• Solution: Structural equation models with latent
variables
• Uses multiple indicators to estimate a better model
• Problem: Sample selection issues
• Solution: Heckman sample selection model
• AND: there are many more…
• Event history analysis
• Fixed and random effects models for pooled time series
• Etc. etc., etc…
Models and “Causality”
• Issue: People often use statistics to support
theories or claims regarding causality
– They hope to “explain” some phenomena
• What factors make kids drop out of school
• Whether or not discrimination leads to wage differences
• What factors make corporations earn higher profits
• Statistics provide information about association
• Always remember: Association (e.g., correlation)
is not causation!
• The old aphorism is absolutely right
• Association can always be spurious
Models and “Causality”
• How do we determine causality?
• The randomized experiment is held up as the
ideal way to determine causality
• Example: Does drug X cure cancer?
• We could look for association between receiving
drug X and cancer survival in a sample of people
• But: Association does not demonstrate causation; Effect
could be spurious
• Example: Perhaps rich people have better access to drug X;
and rich people have more skilled doctors!
• Can you think of other possible spurious processes?
Models and “Causality”
• In a randomized experiment, people are assigned
randomly to take drug X (or not)
• Thus, taking drug X is totally random and totally
uncorrelated with any other factor (such as wealth, gender,
access to high quality doctors, etc)
• As a result, the association between drug X and
cancer survival cannot be affected by any
spurious factor
• Nor can “reverse causality” be a problem
• SO: We can make strong inferences about causality!
Models and “Causality”
• Unfortunately, randomized experiments are
impractical (or unethical) in many cases
• Example: Consequences of high-school dropout, national
democracy, or impact of homelessness
• Plan B: Try to “control” for spurious effects:
• Option 1: Create homogeneous sub-groups
– Effects of Drug X: If there is a spurious relationship
with wealth, compare people with comparable wealth
• Ex: Look at the effect of drug X on cancer survival among
people of similar wealth… eliminating the spurious effect.
Models and “Causality”
• Option 2: Use multivariate model to “control”
for spurious effects
• Examine effect of key variable “net” of other relationships
– Ex: Look at effect of Drug X, while also including a
variable for wealth
• Result: Coefficients for Drug X represent effect net of
wealth, avoiding spuriousness.
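As a toy illustration of Option 2 (simulated data, not a real drug study), the sketch below builds a spurious drug X–survival association through wealth and shows how adding wealth as a control shrinks the estimated drug effect:

```python
# Toy simulation (not real data): wealth drives both drug access and
# survival, so drug X looks effective until wealth is controlled for.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
wealth = rng.normal(size=n)
drug_x = (wealth + rng.normal(size=n) > 0).astype(float)  # richer people get the drug
survival = 2.0 * wealth + rng.normal(size=n)              # drug has NO true effect

# Bivariate model: drug X appears to "work"
print(sm.OLS(survival, sm.add_constant(drug_x)).fit().params)

# Adding the wealth control: the drug X coefficient shrinks toward zero
X = sm.add_constant(np.column_stack([drug_x, wealth]))
print(sm.OLS(survival, X).fit().params)
```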
Models and “Causality”
• Limitations of “controls” to address spuriousness
• 1. The “homogeneous sub-groups” approach reduces N
• To control for many possible spurious effects, you’ll throw
away lots of data
• 2. You have to control for all possible spurious
effects
• If you overlook any important variable, your results could
be biased… leading to incorrect conclusions about causality
• First: It is hard to measure and control for everything
• Second: Someone can always think up another thing you
should have controlled for, undermining your causal claims.
Models and “Causality”
• Under what conditions can a multivariate model
support statements about causality?
• In theory: A multivariate model can support claims
about causality… IF:
• The sample is unbiased
• The measurement is accurate
• The model includes controls for every major possible
spurious effect
• The possibility of reverse causality can be ruled out
• And, the model is executed well: assumptions, outliers,
multicollinearity, etc. are all OK.
Models and “Causality”
• In Practice: Scholars commonly make tentative
assertions about causality… IF:
• The data set is of high quality; sample is either random or
arguably not seriously biased
• Measures are high quality by the standards of the literature
• The model includes controls for major possible spurious
effects discussed in the prior literature
• The possibility of reverse causality is arguably unlikely
• And, the model is executed well: assumptions, outliers,
multicollinearity, etc. are all acceptable… (OR, the author
uses variants of regression necessary to address problems).
Models and “Causality”
• In sum: Multivariate analysis is not the ideal tool
to determine causality
• If you can run an experiment, do it
• But: Multivariate models are usually the best tool that we
have!
• Advice: Multivariate models are a terrific way to
explore your data
• Don’t forget: “correlation is not causation”
• The models aren’t magic; they simply sort out correlation
• But, if used thoughtfully, they can provide hints into likely
causal processes!