Linear Regression 1 - University of California, Irvine

Multiple Regression 5
Sociology 5811 Lecture 26
Copyright © 2005 by Evan Schofer
Do not copy or distribute without
permission
Announcements
• Schedule:
– Today: Multiple regression hypothesis tests,
assumptions, and problems
• Reminder: Paper due on Thursday
• Questions about the paper?
Multiple Regression Assumptions
• As discussed in Knoke, p. 256
• Note: Allison refers to error (e) as disturbance (U), and
uses slightly different language… but the ideas are the same!
• 1. a. Linearity: The relationship between
dependent and independent variables is linear
• Just like bivariate regression
• Points don’t all have to fall exactly on the line; but error
(disturbance) must not have a pattern
– Check scatterplots of X’s and error (residual)
• Watch out for non-linear trends: error is systematically
negative (or positive) for certain ranges of X
• There are strategies to cope with non-linearity, such as
including X and X-squared to model curved relationship.
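A minimal sketch of what a non-linear trend looks like in the residuals. The data are made up for illustration (Y is exactly X squared); they are not from the lecture. A straight line is fitted, and the residuals come out positive at both ends of X and negative in the middle, which is the kind of pattern to watch for.

```python
# Illustrative data (not from the lecture): Y is an exact curve, Y = X squared.
x = [float(v) for v in range(7)]            # 0, 1, ..., 6
y = [v ** 2 for v in x]

# Ordinary least squares for a straight line, Y = a + bX.
n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual = observed Y minus the Y predicted by the line.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
for xi, e in zip(x, residuals):
    print(f"X = {xi:.0f}   residual = {e:+.1f}")
```

A systematic sign pattern like this suggests trying a model that includes both X and X-squared.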
Multiple Regression Assumptions
• 1. b. And, the model is properly specified:
– No extra variables are included in the model, and no
important variables are omitted. This is HARD!
• Correct model specification is critical
• If an important variable is left out of the model, results are
biased (“omitted variable bias”)
– Example: If we model job prestige as a function of
family wealth, but do not include education
• Coefficient estimate for wealth would be biased
– Use theory and previous research to decide what
critical variables must be included in your model.
• For final paper, it is OK if model isn’t perfect.
Multiple Regression Assumptions
• 2. All variables are measured without error
• Unfortunately, error is common in measures
– Survey questions can be biased
– People give erroneous responses (or lie)
– Aggregate statistics (e.g., GDP) can be inaccurate
• This assumption is often violated to some extent
– We do the best we can:
– Design surveys well, use best available data
– And, there are advanced methods for dealing with
measurement error.
Multiple Regression Assumptions
• 3. The error term (ei) has certain properties
• Recall: error is a case’s deviation from the regression line
• Not the same as measurement error!
• After you run a regression, SPSS can tell you the error
value for any or all cases (called the “residual”)
• 3. a. Error is conditionally normal
– For bivariate, we looked to see if Y was conditionally
normal… Here, we look to see if error is normal
– Examine “residuals” (ei) for normality at different
values of X variables.
Multiple Regression Assumptions
• 3. b. The error term (ei) has a mean of 0
– This affects the estimate of the constant. (Not a huge
problem)
• 3. c. The error term (ei) is homoskedastic (has
constant variance)
– Note: This affects standard error estimates,
hypothesis tests
– Look at residuals, to see if they spread out with
changing values of X
• Or plot standardized residuals vs. standardized predicted
values.
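A rough sketch of the "do residuals spread out with X?" check, using simulated data. The data-generating process (error standard deviation proportional to X), the seed, and the sample size are all assumptions chosen for illustration. The residual spread is compared for cases below and above the median of X.

```python
import random
import statistics

rng = random.Random(0)                      # fixed seed so the sketch is repeatable
n = 2000
x = [rng.uniform(1, 10) for _ in range(n)]
# Assumed data-generating process: the error SD grows in proportion to X.
y = [2 + 3 * xi + rng.gauss(0, xi) for xi in x]

# Fit the bivariate regression line by least squares.
mean_x, mean_y = sum(x) / n, sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = mean_y - slope * mean_x
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Compare the spread of residuals for low versus high values of X.
cut = statistics.median(x)
sd_low = statistics.pstdev([e for xi, e in zip(x, residuals) if xi <= cut])
sd_high = statistics.pstdev([e for xi, e in zip(x, residuals) if xi > cut])
print(f"residual SD, low X:  {sd_low:.2f}")
print(f"residual SD, high X: {sd_high:.2f}")
```

With homoskedastic errors the two standard deviations would be roughly equal; here the high-X group is clearly more spread out.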
Multiple Regression Assumptions
• 3. d. Predictors (Xis) are uncorrelated with error
– Violations most often happen when we leave out an
important variable that is correlated with another Xi
– Example: Predicting job prestige with family wealth,
but not including education
– Omission of education will affect error term. Those
with lots of education will have large positive errors.
• Since wealth is correlated with education, it will be
correlated with that error!
– Result: coefficient for family wealth will be biased.
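The wealth and education example can be illustrated with a small simulation. Everything here is an assumption made for the sketch: the variables are standardized random numbers, and the "true" model is prestige = 1 × wealth + 2 × education + noise. The short model (wealth only) overstates the wealth coefficient because wealth is correlated with the part of prestige that is left in the error term.

```python
import random

rng = random.Random(1)                      # fixed seed so the sketch is repeatable
n = 5000
education = [rng.gauss(0, 1) for _ in range(n)]
wealth = [e + rng.gauss(0, 1) for e in education]        # correlated with education
# Assumed true model: prestige = 1*wealth + 2*education + noise
prestige = [1.0 * w + 2.0 * e + rng.gauss(0, 1) for w, e in zip(wealth, education)]

def cross_products(a, b):
    """Sum of cross-products of deviations from the means."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))

# Short model: prestige on wealth only (education omitted).
b_wealth_only = cross_products(wealth, prestige) / cross_products(wealth, wealth)

# What ends up in the short model's error term: education's effect plus noise.
omitted_error = [p - 1.0 * w for p, w in zip(prestige, wealth)]
r_wealth_error = cross_products(wealth, omitted_error) / (
    cross_products(wealth, wealth) * cross_products(omitted_error, omitted_error)) ** 0.5

# Full model: closed-form OLS slopes for exactly two predictors.
s11, s22 = cross_products(wealth, wealth), cross_products(education, education)
s12 = cross_products(wealth, education)
s1y, s2y = cross_products(wealth, prestige), cross_products(education, prestige)
det = s11 * s22 - s12 ** 2
b_wealth_full = (s22 * s1y - s12 * s2y) / det

print(f"wealth slope, education omitted:  {b_wealth_only:.2f}")
print(f"wealth slope, education included: {b_wealth_full:.2f}")
print(f"correlation of wealth with omitted-model error: {r_wealth_error:.2f}")
```

Under these assumptions the wealth-only slope lands near 2 instead of the true value of 1, and including education brings it back close to 1.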
Multiple Regression Assumptions
• 4. In systems of equations, error terms of
equations are uncorrelated
• Knoke, p. 256
– This is not a concern for us in this class
• Worry about that later!
Multiple Regression Assumptions
• 5. Sample is independent, errors are random
• Technically, part of 3.c.
– Not only should errors not increase with X
(heteroskedasticity), there should be no pattern at all!
• Things that cause patterns in error
(autocorrelation):
– Measuring data over long periods of time (e.g., every
year). Error from nearby years may be correlated.
• Called: “Serial correlation”.
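A hedged sketch of serial correlation in residuals, again with simulated data. The error process (each period's error keeps 80% of the previous period's error), the coefficients, and the seed are assumptions for illustration only. After fitting the regression, each residual is correlated with the residual from the period before it.

```python
import random

rng = random.Random(2)                      # fixed seed so the sketch is repeatable
n = 2000
x = [rng.gauss(0, 1) for _ in range(n)]

# Assumed error process: each period's error carries over 80% of the previous one.
errors, previous = [], 0.0
for _ in range(n):
    previous = 0.8 * previous + rng.gauss(0, 1)
    errors.append(previous)
y = [2 + 0.5 * xi + e for xi, e in zip(x, errors)]

def cross_products(a, b):
    """Sum of cross-products of deviations from the means."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))

# Fit the regression and compute residuals in time order.
slope = cross_products(x, y) / cross_products(x, x)
intercept = sum(y) / n - slope * sum(x) / n
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Correlate each residual with the residual from the period before it.
earlier, later = residuals[:-1], residuals[1:]
lag1 = cross_products(earlier, later) / (
    cross_products(earlier, earlier) * cross_products(later, later)) ** 0.5
print(f"lag-1 correlation of residuals: {lag1:.2f}")
```

If errors were independent, this lag-1 correlation would be near zero; a value far from zero is the pattern the slide calls serial correlation.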
Multiple Regression Assumptions
• More things that cause patterns in error
(autocorrelation):
– Measuring data in families. All members are similar,
will have correlated error
– Measuring data in geographic space.
• Example: data on 50 US states. States in a similar region
have correlated error
• Called “spatial autocorrelation”
• There are variations of regression models to
address each kind of correlated error.
Multiple Regression Assumptions
• Regression assumptions and final projects:
• At a minimum, check all bivariate regression
assumptions
– Also, you should check for outliers
• To be discussed soon!
– If you are capable of doing multiple regression
assumptions, go ahead and do them
• It will show mastery… which can’t hurt your grade!
Regression: Outliers
• Note: Even if regression assumptions are met,
slope estimates can have problems
• Example: Outliers -- cases with extreme values
that differ greatly from the rest of your sample
• More formally: “influential cases”
• Outliers can result from:
• Errors in coding or data entry
• Highly unusual cases
• Or, sometimes they reflect important “real” variation
• Even a few outliers can dramatically change
estimates of the slope, especially if N is small.
Regression: Outliers
• Outlier Example:
[Figure: scatterplot (both axes running from -4 to 4) with two regression lines. One line is pulled up by an extreme case; the other is the regression line with the extreme case removed from the sample.]
Regression: Outliers
• Strategy for identifying outliers:
• 1. Look at scatterplots or regression partial plots
for extreme values
• Easiest. A minimum for final projects
• 2. Ask SPSS to compute outlier diagnostic
statistics
– Examples: “Leverage”, Cook’s D, DFBETA,
residuals, standardized residuals.
Regression: Outliers
• SPSS Outlier strategy: Go to Regression – Save
– Choose “influence” and “distance” statistics such as
Cook’s Distance, DFFIT, standardized residual
– Result: SPSS will create new variables with values of
Cook’s D, DFFIT for each case
– High values signal potential outliers
– Note: This is less useful if you have a VERY large
dataset, because you have to look at each case value.
Scatterplots
• Example: Study time and student achievement.
– X variable: Average # hours spent studying per day
– Y variable: Score on reading test
Case   X     Y
1      2.6   28
2      1.4   13
3      .65   17
4      4.1   31
5      .25   8
6      1.9   16
7      3.5   6
[Figure: scatterplot of Y (reading test score; Y axis from 0 to 30) against X (hours spent studying; X axis from 0 to 4)]
Outliers
• Results with outlier:
Model Summary (b)
Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
1       .466 (a)   .217       .060                9.1618
a. Predictors: (Constant), HRSTUDY
b. Dependent Variable: TESTSCOR

Coefficients (a)
                Unstandardized Coefficients   Standardized Coefficients
Model 1         B         Std. Error          Beta                        t       Sig.
(Constant)      10.662    6.402                                           1.665   .157
HRSTUDY         3.081     2.617               .466                        1.177   .292
a. Dependent Variable: TESTSCOR
Outlier Diagnostics
• Residuals: The numerical value of the error
• Error = distance that a point falls from the line
• Cases with unusually large error may be outliers
• Standardized residuals
• Z-score of residuals… converts to a neutral unit
• Often, standardized residuals larger than 3 are considered
worthy of scrutiny
• But, it isn’t the best outlier diagnostic.
Outlier Diagnostics
• Cook’s D: Identifies cases that are strongly
influencing the regression line
– SPSS calculates a value for each case
• Go to “Save” menu, click on Cook’s D
• How large of a Cook’s D is a problem?
– Rule of thumb: Values greater than: 4 / (n – k – 1)
– Example: N=7, K = 1: Cut-off = 4/5 = .80
– Cases with higher values should be examined.
Outlier Diagnostics
• Example: Outlier/Influential Case Statistics
Hours   Score   Resid    Std Resid   Cook’s D
2.60    28       9.32     1.01       .124
1.40    13      -1.97     -.215      .006
.65     17       4.33      .473      .070
4.10    31       7.70      .841      .640
.25     8       -3.43     -.374      .082
1.90    16       -.515    -.056      .0003
3.50    6      -15.4     -1.68       .941
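The diagnostics above can be reproduced by hand from the seven cases. This is a sketch, not SPSS's own code: it assumes the standardized residual is the residual divided by the Std. Error of the Estimate, and it uses the usual textbook formulas for leverage and Cook's D in a bivariate regression. Variable names are illustrative.

```python
hours = [2.60, 1.40, 0.65, 4.10, 0.25, 1.90, 3.50]
score = [28, 13, 17, 31, 8, 16, 6]
n, k = len(hours), 1                        # 7 cases, 1 independent variable

# Least-squares line: score = intercept + slope * hours.
mean_x, mean_y = sum(hours) / n, sum(score) / n
sxx = sum((x - mean_x) ** 2 for x in hours)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, score))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residuals and the mean squared error (square of "Std. Error of the Estimate").
residuals = [y - (intercept + slope * x) for x, y in zip(hours, score)]
mse = sum(e ** 2 for e in residuals) / (n - k - 1)
std_resid = [e / mse ** 0.5 for e in residuals]

# Leverage: how far each case's X value sits from the mean of X.
leverage = [1 / n + (x - mean_x) ** 2 / sxx for x in hours]

# Cook's D combines the size of the residual with the leverage of the case.
cooks_d = [e ** 2 / ((k + 1) * mse) * h / (1 - h) ** 2
           for e, h in zip(residuals, leverage)]

cutoff = 4 / (n - k - 1)                    # rule of thumb from the lecture: 4/5 = .80
flagged = [case for case, d in enumerate(cooks_d, start=1) if d > cutoff]

for case, (e, z, d) in enumerate(zip(residuals, std_resid, cooks_d), start=1):
    print(f"case {case}: resid = {e:7.2f}  std resid = {z:6.3f}  Cook's D = {d:.4f}")
print("cases above the cutoff:", flagged)
```

Only case 7 (3.5 hours, score of 6) exceeds the .80 cut-off; case 4 comes close but stays below it.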
Outliers
• Results with outlier removed:
Model Summary (b)
Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
1       .903 (a)   .816       .770                4.2587
a. Predictors: (Constant), HRSTUDY
b. Dependent Variable: TESTSCOR

Coefficients (a)
                Unstandardized Coefficients   Standardized Coefficients
Model 1         B         Std. Error          Beta                        t       Sig.
(Constant)      8.428     3.019                                           2.791   .049
HRSTUDY         5.728     1.359               .903                        4.215   .014
a. Dependent Variable: TESTSCOR
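A short sketch comparing the fit with and without case 7, using the study-time data from the earlier slide. The helper name fit_line is hypothetical; it returns the intercept, slope, and R-square of a bivariate least-squares fit.

```python
hours = [2.60, 1.40, 0.65, 4.10, 0.25, 1.90, 3.50]
score = [28, 13, 17, 31, 8, 16, 6]

def fit_line(x, y):
    """Return (intercept, slope, r_square) for a bivariate least-squares fit."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope, sxy ** 2 / (sxx * syy)

with_outlier = fit_line(hours, score)
without_outlier = fit_line(hours[:-1], score[:-1])   # drop case 7, the last case

for label, (a, b, r2) in [("all 7 cases", with_outlier),
                          ("case 7 removed", without_outlier)]:
    print(f"{label:>15}: constant = {a:.3f}  slope = {b:.3f}  R-square = {r2:.3f}")
```

Removing a single case nearly doubles the slope and raises R-square from about .22 to about .82, which is why one influential case deserves a close look when N is this small.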
Regression: Outliers
• Question: What should you do if you find
outliers? Drop outlier cases from the analysis?
Or leave them in?
– Obviously, you should drop cases that are incorrectly
coded or erroneous
– But, generally speaking, you should be cautious about
throwing out cases
• If you throw out enough cases, you can produce any result
that you want! So, be judicious when destroying data.
Regression: Outliers
• Circumstances where it can be good to drop
outlier cases:
• 1. Coding errors
• 2. Single extreme outliers that radically change
results
• Your results should reflect the dataset, not one case!
• 3. If there is a theoretical reason to drop cases
• Example: In analysis of economic activity, communist
countries may be outliers
• If the study is about “capitalism”, they should be dropped.
Regression: Outliers
• Circumstances when it is good to keep outliers
• 1. If they form a meaningful cluster
– Often suggests an important subgroup in your data
• Example: Asian-Americans in a dataset on education
• In such a case, consider adding a dummy variable for them
• Unless, of course, research design is not interested in that
sub-group… then drop them!
• 2. If there are many
• Maybe they reflect a “real” pattern in your data.
Regression: Outliers
• When in doubt: Present results both with and
without outliers
• Or present one set of results, but mention how results differ
depending on how outliers were handled
• For final projects: Check for outliers!
• At least with scatterplots
• But, a better strategy is to use partial plots and Cook’s D (or
similar statistics)
– In the text: Mention if there were outliers, how you
handled them, and the effect it had on results.
Multiple Regression Problems
• Another common regression problem:
Multicollinearity
• Definition: collinear = highly correlated
• Multicollinearity = inclusion of highly correlated
independent variables in a single regression model
• Recall: High correlation of X variables causes
problems for estimation of slopes (b’s)
• Recall: denominators in the slope formulas approach zero, so
coefficients may be wrong or too large.
Multiple Regression Problems
• Multicollinearity symptoms:
– Addition of a new variable to the model causes coefficients of
other variables to change wildly
• Note: occasionally a major change is expected (e.g., if a
key variable is added, or for interaction terms)
– If a variable typically has a small effect
• BUT, when paired with another highly correlated variable,
BOTH have big effects in opposite directions.
Multicollinearity
• Diagnosing multicollinearity:
• 1. Look at correlations of all independent vars
– Correlation of .7 is a concern, above .8 is often a problem
– But, problems aren’t always bivariate…
and don’t show up in bivariate correlations
• Ex: If you forget to omit one dummy variable (the reference category) from a set
• 2. Watch out for the “symptoms”
• 3. Compute diagnostic statistics
• Tolerances, VIF (Variance Inflation Factor).
Multicollinearity
• Multicollinearity diagnostic statistics:
• “Tolerance”: Easily computed in SPSS
– Low values indicate possible multicollinearity
• Start to pay attention at .4; Below .2 is very likely to be a
problem
– Tolerance is computed for each independent variable
by regressing it on other independent variables.
Multicollinearity
• If you have 3 independent variables: X1, X2, X3…
– Tolerance is based on doing a regression: X1 is
dependent; X2 and X3 are independent.
• Tolerance for X1 is simply 1 minus regression R-square.
• If a variable (X1) is highly correlated with all the
others (X2, X3) then they will do a good job of
predicting it in a regression
• Result: Regression r-square will be high… 1 minus
r-square will be low… indicating a problem.
Multicollinearity
• Variance Inflation Factor (VIF) is the reciprocal
of tolerance: 1/tolerance
• High VIF indicates multicollinearity
– Gives an indication of how much the Standard Error
of a variable grows due to presence of other variables.
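A minimal sketch of tolerance and VIF for three independent variables, using simulated data. The setup is an assumption for illustration: X1 is built as roughly X2 + X3, so it is highly collinear with the other two. Each variable is regressed on the other two (closed-form least squares for two predictors), tolerance is 1 minus that R-square, and VIF is 1/tolerance. The helper names are hypothetical.

```python
import random

rng = random.Random(3)                      # fixed seed so the sketch is repeatable
n = 1000
x2 = [rng.gauss(0, 1) for _ in range(n)]
x3 = [rng.gauss(0, 1) for _ in range(n)]
# Assumed setup: X1 is nearly the sum of X2 and X3, so it is highly collinear.
x1 = [a + b + rng.gauss(0, 0.5) for a, b in zip(x2, x3)]

def cross_products(a, b):
    """Sum of cross-products of deviations from the means."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))

def r_square(dep, ind_a, ind_b):
    """R-square from regressing dep on two predictors (closed-form OLS)."""
    saa, sbb = cross_products(ind_a, ind_a), cross_products(ind_b, ind_b)
    sab = cross_products(ind_a, ind_b)
    say, sby = cross_products(ind_a, dep), cross_products(ind_b, dep)
    det = saa * sbb - sab ** 2
    slope_a = (sbb * say - sab * sby) / det
    slope_b = (saa * sby - sab * say) / det
    explained = slope_a * say + slope_b * sby   # explained sum of squares
    return explained / cross_products(dep, dep)

predictors = {"X1": x1, "X2": x2, "X3": x3}
tolerance, vif = {}, {}
for name, values in predictors.items():
    others = [v for other, v in predictors.items() if other != name]
    tolerance[name] = 1 - r_square(values, others[0], others[1])
    vif[name] = 1 / tolerance[name]
    print(f"{name}: tolerance = {tolerance[name]:.3f}  VIF = {vif[name]:.1f}")
```

Under these assumptions X1 should show a tolerance well below .2, which is the range the lecture flags as very likely to be a problem.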
Multicollinearity
• Solutions to multicollinearity
– It can be difficult if a fully specified model requires
several collinear variables
• 1. Drop unnecessary variables
• 2. If two collinear variables are really measuring
the same thing, drop one or make an index
– Example: Attitudes toward recycling; attitude toward
pollution. Perhaps they reflect “environmental views”
• 3. Advanced techniques: e.g., Ridge regression
• Uses a more efficient estimator (but not BLUE – may
introduce bias).