Linear Regression 1 - University of California, Irvine

Multiple Regression Review
Sociology 229A
Copyright © 2008 by Evan Schofer
Do not copy or distribute without
permission
Multiple Regression
• Question: What if a dependent variable is
affected by more than one independent variable?
• Strategy #1: Do separate bivariate regressions
– One regression for each independent variable
• This yields separate slope estimates for each
independent variable
– Bivariate slope estimates implicitly assume that
neither independent variable mediates the other
– In reality, there might be no effect of family wealth
over and above education
Multiple Regression
• Job Prestige: Two separate regression models
Coefficients (Dependent Variable: RS OCCUPATIONAL PRESTIGE SCORE (1970))
                                    Unstandardized        Standardized
Variable                            B        Std. Error   Beta      t        Sig.
(Constant)                          9.417    1.421                  6.625    .000
HIGHEST YEAR OF SCHOOL COMPLETED    2.488    .108         .520      23.056   .000
Coefficients (Dependent Variable: RS OCCUPATIONAL PRESTIGE SCORE (1970))
                                    Unstandardized        Standardized
Variable                            B        Std. Error   Beta      t        Sig.
(Constant)                          35.608   1.290                  27.611   .000
RS FAMILY INCOME WHEN 16 YRS OLD    2.075    .446         .122      4.652    .000
Both variables have positive, significant slopes
Multiple Regression
• Idea #2: Use Multiple Regression
• Multiple regression can examine “partial”
relationships
– Partial = Relationships after the effects of other
variables have been “controlled” (taken into account)
• This lets you determine the effects of variables
“over and above” other variables
– And shows the relative impact of different factors on
a dependent variable
• And, you can use several independent variables to
improve your predictions of the dependent var
Multiple Regression
• Job Prestige: 2 variable multiple regression
Coefficients (Dependent Variable: RS OCCUPATIONAL PRESTIGE SCORE (1970))
                                    Unstandardized        Standardized
Variable                            B        Std. Error   Beta      t        Sig.
(Constant)                          8.977    1.629                  5.512    .000
HIGHEST YEAR OF SCHOOL COMPLETED    2.487    .111         .520      22.403   .000
RS FAMILY INCOME WHEN 16 YRS OLD    .178     .394         .011      .453     .651
Education slope is basically
unchanged
Family Income slope decreases
compared to bivariate analysis
(bivariate: b = 2.07)
And, outcome of hypothesis
test changes – t < 1.96
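The t values in the table can be checked by hand: each one is the unstandardized coefficient divided by its standard error. The short sketch below uses the B and Std. Error values from the two-variable table; the function name is only illustrative, and 1.96 is the usual large-sample, two-tailed .05 cutoff that the slide refers to.

```python
def t_statistic(b, std_error):
    """t = coefficient / standard error."""
    return b / std_error


# Values taken from the two-variable job prestige table.
t_educ = t_statistic(2.487, 0.111)
t_income = t_statistic(0.178, 0.394)

# Compare |t| with the 1.96 cutoff.
for name, t in (("education", t_educ), ("family income", t_income)):
    print(name, round(t, 3), "significant" if abs(t) > 1.96 else "not significant")

income_is_significant = abs(t_income) > 1.96
```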
Multiple Regression
• Ex: Job Prestige: 2 variable multiple regression
• 1. Education has a large slope effect controlling
for (i.e. “over and above”) family income
• 2. Family income does not have much effect
controlling for education
• Despite a strong bivariate relationship
• Possible interpretations:
• Family income may lead to education, but education is the
critical predictor of job prestige
• Or, family income is wholly unrelated to job prestige… but
is coincidentally correlated with a variable that is
(education), which generated a spurious “effect”.
The Multiple Regression Model
• A two-independent variable regression model:
$$ Y_i = a + b_1 X_{1i} + b_2 X_{2i} + e_i $$
• Note: There are now two X variables
• And a slope (b) is estimated for each one
• The full multiple regression model is:
$$ Y_i = a + b_1 X_{1i} + b_2 X_{2i} + \cdots + b_k X_{ki} + e_i $$
• For k independent variables
Multiple Regression: Slopes
• Regression slope for the two variable case:
$$ b_1 = \left(\frac{s_Y}{s_{X_1}}\right) \left(\frac{r_{YX_1} - r_{YX_2}\, r_{X_1 X_2}}{1 - r_{X_1 X_2}^2}\right) $$
• b1 = slope for X1 – controlling for the other
independent variable X2
• b2 is computed symmetrically. Swap X1s, X2s
• Compare to bivariate slope:
$$ b = \frac{s_Y}{s_X}\, r_{YX} $$
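The two slope formulas can be compared directly in code. This is a minimal sketch: the function names and the summary statistics are hypothetical (they are not taken from the job prestige example), and the functions simply transcribe the formulas on this slide.

```python
def bivariate_slope(s_y, s_x, r_yx):
    """Bivariate slope: b = (s_Y / s_X) * r_YX."""
    return (s_y / s_x) * r_yx


def partial_slope(s_y, s_x1, r_yx1, r_yx2, r_x1x2):
    """Two-predictor slope for X1, controlling for X2."""
    return (s_y / s_x1) * (r_yx1 - r_yx2 * r_x1x2) / (1 - r_x1x2 ** 2)


# Hypothetical standard deviations and correlations.
s_y, s_x1 = 14.0, 3.0
r_yx1, r_yx2, r_x1x2 = 0.50, 0.30, 0.40

b1_bivariate = bivariate_slope(s_y, s_x1, r_yx1)
b1_partial = partial_slope(s_y, s_x1, r_yx1, r_yx2, r_x1x2)
print(round(b1_bivariate, 3), round(b1_partial, 3))
```

With these made-up inputs the slope for X1 is smaller once X2 is controlled; setting `r_x1x2` to zero makes the two functions return the same value.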
Multiple Regression Slopes
• Let’s look more closely at the formulas:
$$ b_1 = \left(\frac{s_Y}{s_{X_1}}\right) \left(\frac{r_{YX_1} - r_{YX_2}\, r_{X_1 X_2}}{1 - r_{X_1 X_2}^2}\right) \quad \text{versus} \quad b = \frac{s_Y}{s_X}\, r_{YX} $$
• What happens to b1 if X1 and X2 are totally
uncorrelated?
• Answer: The formula reduces to the bivariate formula
• What if X1 and X2 are correlated with each other
AND X2 is more correlated with Y than X1?
• Answer: b1 gets smaller (compared to bivariate)
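A worked example may make the shrinkage concrete. The correlations below are hypothetical, chosen only for illustration: r(Y,X1) = .40, r(Y,X2) = .60, r(X1,X2) = .50.

```latex
\begin{aligned}
\text{Bivariate: } b &= \frac{s_Y}{s_{X_1}}\,(0.40) \\
\text{Multiple: } b_1 &= \frac{s_Y}{s_{X_1}} \cdot \frac{0.40 - (0.60)(0.50)}{1 - (0.50)^2}
  = \frac{s_Y}{s_{X_1}} \cdot \frac{0.10}{0.75}
  \approx \frac{s_Y}{s_{X_1}}\,(0.133)
\end{aligned}
```

Under these assumed values the slope for X1 falls to about a third of its bivariate size once X2 is controlled.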
Regression Slopes
• So, if two variables (X1, X2) are correlated and
both predict Y:
• The X variable that is more correlated with Y will
have a higher slope in multivariate regression
– The slope of the less-correlated variable will shrink
• Thus, slopes for each variable are adjusted to how
well the other variable predicts Y
– It is the slope “controlling” for other variables.
Multiple Regression Slopes
• One last thing to keep in mind…
$$ b_1 = \left(\frac{s_Y}{s_{X_1}}\right) \left(\frac{r_{YX_1} - r_{YX_2}\, r_{X_1 X_2}}{1 - r_{X_1 X_2}^2}\right) \quad \text{versus} \quad b = \frac{s_Y}{s_X}\, r_{YX} $$
• What happens to b1 if X1 and X2 are almost
perfectly correlated?
• Answer: The denominator approaches zero
• Small differences in the numerator get magnified, so the
slope estimate becomes unstable and can “blow up”
• Highly correlated independent variables can
cause trouble for regression models… watch out
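The instability can be seen by evaluating the correlation part of the slope formula as the correlation between X1 and X2 approaches 1. This is a rough sketch with hypothetical correlations (r(Y,X1) = .50, r(Y,X2) = .48); the function name is illustrative only.

```python
def standardized_slope(r_yx1, r_yx2, r_x1x2):
    """Correlation part of the two-predictor slope formula for X1."""
    return (r_yx1 - r_yx2 * r_x1x2) / (1 - r_x1x2 ** 2)


# Hold the correlations with Y fixed and push r(X1, X2) toward 1.
for r12 in (0.0, 0.9, 0.99, 0.999):
    print(r12, round(standardized_slope(0.50, 0.48, r12), 3))
```

The printed values stay moderate for small and medium correlations and then grow sharply as the denominator shrinks.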
Interpreting Results
• (Over)Simplified rules for interpretation
– Assumes good sample, measures, models, etc.
• Multivariate regression with two variables: A, B
• If slopes of A, B are the same as bivariate, then
each has an independent effect
• If A remains large and B shrinks to zero, we typically
conclude that the effect of B was spurious, or
operates through A
• If both A and B shrink a little, each has an effect,
but some overlap or mediation is occurring
Interpreting Multivariate Results
• Things to watch out for:
• 1. Remember: Correlation is not causation
– Ability to “control” for many variables can help detect
spurious relationships… but it isn’t perfect.
– Be aware that other (omitted) variables may be
affecting your model. Don’t over-interpret results.
• 2. Reverse causality
– Many sociological processes involve bi-directional
causality. Regression slopes (and correlations) do not
identify which variable “causes” the other.
• Ex: self-esteem and test scores.
Standardized Regression Coefficients
• Regression slopes reflect the units of the
independent variables
• Question: How do you compare how “strong”
the effects of two variables if they have totally
different units?
• Example: Education, family wealth, job prestige
– Education measured in years, b = 2.5
– Family wealth measured on 1-5 scale, b = .18
– Which is a “bigger” effect? Units aren’t comparable!
• Answer: Create “standardized” coefficients
Standardized Regression Coefficients
• Standardized Coefficients
– Also called “Betas” or “Beta Weights”
– Symbol: Greek b with asterisk: b*
– Equivalent to Z-scoring (standardizing) all variables,
dependent and independent, before doing the regression
• Formula of coefficient for Xj:
$$ b_j^* = \left(\frac{s_{X_j}}{s_Y}\right) b_j $$
• Result: The unit is standard deviations
• Betas: Indicate the effect of a 1 standard deviation
change in Xj on Y, measured in standard deviations of Y
Standardized Regression Coefficients
• Ex: Education, family income, and job prestige:
Coefficients (Dependent Variable: RS OCCUPATIONAL PRESTIGE SCORE (1970))
                                    Unstandardized        Standardized
Variable                            B        Std. Error   Beta      t        Sig.
(Constant)                          8.977    1.629                  5.512    .000
HIGHEST YEAR OF SCHOOL COMPLETED    2.487    .111         .520      22.403   .000
RS FAMILY INCOME WHEN 16 YRS OLD    .178     .394         .011      .453     .651
An increase of 1 standard
deviation in Education results
in a .52 standard deviation
increase in job prestige
What is the interpretation of
the “family income” beta?
Betas give you a sense of
which variables “matter most”
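The link between the beta formula and Z-scoring can be checked on a small example. The sketch below uses made-up data and, to stay short, a single predictor; in multiple regression the same rescaling applies to each partial slope. All names and numbers here are hypothetical.

```python
from statistics import mean, stdev


def slope(x, y):
    """Least-squares slope of y on a single predictor x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


def zscores(values):
    """Standardize values to mean 0 and standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


# Hypothetical data: years of education and a prestige score.
educ = [8, 10, 12, 12, 14, 16, 16, 18]
prestige = [30, 28, 40, 45, 42, 55, 60, 62]

b = slope(educ, prestige)
beta_from_formula = b * stdev(educ) / stdev(prestige)
beta_from_zscores = slope(zscores(educ), zscores(prestige))
print(round(beta_from_formula, 4), round(beta_from_zscores, 4))
```

Both routes give the same number, which is why the beta can be read in standard deviation units.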
R-Square in Multiple Regression
• Multivariate R-square is much like bivariate:
$$ R^2 = \frac{SS_{REGRESSION}}{SS_{TOTAL}} $$
• But, SSregression is based on the multivariate
regression
• The addition of new variables results in better
prediction of Y, less error (e), higher R-square.
R-Square in Multiple Regression
• Example:
Model Summary (Predictors: (Constant), INCOM16, EDUC)
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .522   .272       .271                12.41
• R-square of .272 indicates that education and parents’
wealth together explain 27% of the variance in job prestige
• “Adjusted R-square” is a more conservative,
more accurate measure in multiple regression
– Generally, you should report Adjusted R-square.
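The slides do not give the adjustment formula; the sketch below assumes the standard one, 1 - (1 - R²)(n - 1)/(n - k - 1), where n is the number of cases and k the number of independent variables. The sample sizes are hypothetical, since the n behind the table above is not reported here.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R-square for n cases and k independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# R-square of .272 with k = 2 predictors, at several hypothetical sample sizes.
for n in (20, 100, 1000):
    print(n, round(adjusted_r_squared(0.272, n, 2), 3))
```

The penalty is noticeable in small samples and nearly vanishes in large ones, which is consistent with the small gap between .272 and .271 in the table.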
Dummy Variables
• Question: How can we incorporate nominal
variables (e.g., race, gender) into regression?
• Option 1: Analyze each sub-group separately
– Generates different slope, constant for each group
• Option 2: Dummy variables
– “Dummy” = a dichotomous variable coded to
indicate the presence or absence of something
– Absence coded as zero, presence coded as 1.
Dummy Variables
• Strategy: Create a separate dummy variable for
all nominal categories
• Ex: Gender – make female & male variables
– DFEMALE: coded as 1 for all women, zero for men
– DMALE: coded as 1 for all men
• Next: Include all but one of the dummy variables in
a multiple regression model
• If two dummies, include 1; If 5 dummies, include 4.
Dummy Variables
• Question: Why can’t you include DFEMALE
and DMALE in the same regression model?
• Answer: They are perfectly correlated
(negatively): r = -1
– Result: Regression model “blows up”
• For any set of nominal categories, a full set of
dummies contains redundant information
– DMALE and DFEMALE contain same information
– Dropping one removes redundant information.
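The redundancy is easy to verify on a toy example. The sketch below builds DFEMALE and DMALE from a hypothetical list of gender codes and computes their correlation with a small hand-written Pearson function.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical cases.
gender = ["F", "M", "F", "F", "M", "M", "F"]
dfemale = [1 if g == "F" else 0 for g in gender]
dmale = [1 if g == "M" else 0 for g in gender]

r = pearson_r(dfemale, dmale)
print(round(r, 6))
```

Because DMALE is always 1 minus DFEMALE, knowing one column fully determines the other.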
Dummy Variables: Interpretation
• Consider the following regression equation:
$$ Y_i = a + b_1\,\mathrm{INCOME}_i + b_2\,\mathrm{DFEMALE}_i + e_i $$
• Question: What if the case is a male?
• Answer: DFEMALE is 0, so the entire term
becomes zero.
– Result: Males are modeled using the familiar
regression model: a + b1X + e.
Dummy Variables: Interpretation
• Consider the following regression equation:
$$ Y_i = a + b_1\,\mathrm{INCOME}_i + b_2\,\mathrm{DFEMALE}_i + e_i $$
• Question: What if the case is a female?
• Answer: DFEMALE is 1, so b2(1) stays in the
equation (and is added to the constant)
– Result: Females are modeled using a different
regression line: (a+b2) + b1X + e
– Thus, the coefficient of b2 reflects difference in
the constant for women.
Dummy Variables: Interpretation
• Remember, a different constant generates a
different line, either higher or lower
– Variable: DFEMALE (women = 1, men = 0)
– A positive coefficient (b) indicates that women are
consistently higher compared to men (on dep. var.)
– A negative coefficient indicates women are lower
• Example: If DFEMALE coeff = 1.2:
– “Women are on average 1.2 points higher than men”.
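A quick sketch of what the dummy coefficient does to predictions. Only the DFEMALE coefficient of 1.2 comes from the example above; the constant and income slope are hypothetical placeholders.

```python
def predict_y(income, dfemale, a=2.0, b1=0.00005, b2=1.2):
    """Predicted Y from a + b1*INCOME + b2*DFEMALE (a and b1 are made up)."""
    return a + b1 * income + b2 * dfemale


# Compare a woman and a man at the same income, at two income levels.
gap_low = predict_y(20000, 1) - predict_y(20000, 0)
gap_high = predict_y(80000, 1) - predict_y(80000, 0)
print(round(gap_low, 3), round(gap_high, 3))
```

The gap is the same at every income level, which is what it means for the two lines to differ only in the constant.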
Dummy Variables: Interpretation
• Visually: Women = blue, Men = red
[Figure: scatterplot of HAPPY (0 to 10) against INCOME (0 to 100000), showing the overall slope for all data points and separate lines for women and men]
• Note: Lines for men and women have the same slope… but one is
higher and the other is lower. The constant differs!
• If women=1, men=0: The constant (a) reflects men only. Dummy
coefficient (b) reflects increase for women (relative to men)
Dummy Variables
• What if you want to compare more than 2 groups?
• Example: Race
– Coded 1=white, 2=black, 3=other (like GSS)
• Make 3 dummy variables:
– “DWHITE” is 1 for whites, 0 for everyone else
– “DBLACK” is 1 for Af. Am., 0 for everyone else
– “DOTHER” is 1 for “others”, 0 for everyone else
• Then, include two of the three variables in the
multiple regression model.
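Building the three dummies from a single coded variable might look like the sketch below. The race codes follow the 1/2/3 scheme above; the cases and the helper function are hypothetical.

```python
def make_dummies(codes, categories):
    """Return one 0/1 list per category, keyed by dummy variable name."""
    return {
        name: [1 if c == code else 0 for c in codes]
        for code, name in categories.items()
    }


# Hypothetical cases coded 1=white, 2=black, 3=other.
race = [1, 2, 3, 1, 1, 2]
dummies = make_dummies(race, {1: "DWHITE", 2: "DBLACK", 3: "DOTHER"})

# Leave one category out of the model; here whites are the reference group.
model_vars = {name: col for name, col in dummies.items() if name != "DWHITE"}
print(sorted(model_vars))
```

Which category to omit is a design choice; it determines the group that all dummy coefficients are compared against.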
Dummy Variables: Interpretation
• Ex: Job Prestige
Coefficients (Dependent Variable: PRESTIGE)
              Unstandardized          Standardized
Variable      B           Std. Error  Beta      t        Sig.
(Constant)    9.666       1.672                 5.780    .000
EDUC          2.476       .111        .517      22.271   .000
INCOM16       6.282E-02   .397        .004      .158     .874
DBLACK        -2.666      1.117       -.055     -2.388   .017
DOTHER        1.114       1.777       .014      .627     .531
• Negative coefficient for DBLACK indicates a
lower level of job prestige compared to whites
– T- and P-values indicate if difference is significant.
Dummy Variables: Interpretation
• Comments:
• 1. Dummy coefficients shouldn’t be called slopes
– Referring to the “slope” of gender doesn’t make sense
– Rather, it is the difference in the constant (or “level”)
• 2. The contrast is always with the nominal
category that was left out of the equation
– If DFEMALE is included, the contrast is with males
– If DBLACK, DOTHER are included, coefficients
reflect difference in constant compared to whites.
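As a worked illustration of the contrast with the omitted category, take the coefficients from the job prestige table with DBLACK and DOTHER and a hypothetical person with EDUC = 12 and INCOM16 = 3 (these two values are assumptions chosen for the example):

```latex
\begin{aligned}
\hat{Y}_{\text{white}} &= 9.666 + 2.476(12) + 0.06282(3) \approx 39.57 \\
\hat{Y}_{\text{black}} &= 39.57 - 2.666 \approx 36.90 \\
\hat{Y}_{\text{other}} &= 39.57 + 1.114 \approx 40.68
\end{aligned}
```

Each dummy coefficient shifts the prediction relative to whites, the omitted group. Note that only the DBLACK difference is statistically significant in that table.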
Interaction Terms
• Question: What if you suspect that a variable has
a totally different slope for two different subgroups in your data?
• Example: Income and Happiness
– Perhaps men are more materialistic -- an extra dollar
increases their happiness a lot
– If women are less materialistic, each dollar has a
smaller effect on happiness (compared to men)
• Issue isn’t men = “more” or “less” than women
– Rather, the slope of a variable (income) differs across
groups
Interaction Terms
• Issue isn’t men = “more” or “less” than women
– Rather, the coefficient (slope) for a variable (income)
differs across groups
• Again, we want to specify a different regression
line for each group
– We want lines with different slopes, not parallel lines
that are higher or lower.
Interaction Terms
• Visually: Women = blue, Men = red
[Figure: scatterplot of HAPPY (0 to 10) against INCOME (0 to 100000), showing the overall slope for all data points and separate lines for women and men]
• Note: Here, the slope for men and women differs.
• The effect of income on happiness (X1 on Y) varies with
gender (X2). This is called an “interaction effect”
Interaction Terms
• Examples of interaction:
– Effect of education on income may interact with type
of school attended (public vs. private)
• Private schooling has bigger effect on income
– Effect of aspirations on educational attainment
interacts with poverty
• Aspirations matter less if you don’t have money to pay for
college
• Question: Can you think of examples of two
variables that might interact?
• Either from your final project? Or anything else?
Interaction Terms
• Interaction effects: Differences in the
relationship (slope) between two variables for
each category of a third variable
• Option #1: Analyze each group separately
• Look for different sized slope in each group
• Option #2: Multiply the two variables of interest:
(DFEMALE, INCOME) to create a new variable
– Called: DFEMALE*INCOME
– Add that variable to the multiple regression model.
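Constructing the interaction variable is just a case-by-case multiplication. A minimal sketch with hypothetical cases:

```python
# Hypothetical cases: income in dollars and a 0/1 female dummy.
income = [20000, 35000, 50000, 65000]
dfemale = [1, 0, 1, 0]

# DFEMALE*INCOME equals income for women and zero for men.
dfemale_x_income = [d * inc for d, inc in zip(dfemale, income)]
print(dfemale_x_income)
```

The new column would then be entered into the regression alongside the other independent variables.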
Interaction Terms
• Consider the following regression equation:
$$ Y_i = a + b_1\,\mathrm{INCOME}_i + b_2\,(\mathrm{DFEM} \times \mathrm{INC})_i + e_i $$
• Question: What if the case is male?
• Answer: DFEMALE is 0, so b2(DFEM*INC)
drops out of the equation
– Result: Males are modeled using the ordinary
regression equation: a + b1X + e.
Interaction Terms
• Consider the following regression equation:
$$ Y_i = a + b_1\,\mathrm{INCOME}_i + b_2\,(\mathrm{DFEM} \times \mathrm{INC})_i + e_i $$
• Question: What if the case is female?
• Answer: DFEMALE is 1, so b2(DFEM*INC)
becomes b2*INCOME, which is added to b1
– Result: Females are modeled using a different
regression line: a + (b1+b2) X + e
– Thus, the coefficient of b2 reflects difference in
the slope of INCOME for women.
Interpreting Interaction Terms
• Interpreting interaction terms:
• A positive b for DFEMALE*INCOME indicates
the slope for income is higher for women vs. men
– A negative effect indicates the slope is lower
– Size of coefficient indicates actual difference in slope
• Example: DFEMALE*INCOME. Observed b’s:
– Income: b = .5
– DFEMALE * INCOME: b = -.2
• Interpretation: Slope is .5 for men, .3 for women.
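The arithmetic behind that interpretation, as a small sketch. The two coefficients are the ones in the example; the function name is illustrative.

```python
def income_slope(dfemale, b_income=0.5, b_interaction=-0.2):
    """Slope of income for a group: b_income + b_interaction * DFEMALE."""
    return b_income + b_interaction * dfemale


slope_men = income_slope(0)
slope_women = income_slope(1)
print(slope_men, round(slope_women, 2))
```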
Interpreting Interaction Terms
• Example: Interaction of Race and Education
affecting Job Prestige:
Coefficients (Dependent Variable: PRESTIGE)
              Unstandardized          Standardized
Variable      B           Std. Error  Beta      t        Sig.
(Constant)    8.855       1.744                 5.076    .000
EDUC          2.541       .118        .531      21.563   .000
INCOM16       6.636E-02   .396        .004      .167     .867
DBLACK        4.293       4.193       .088      1.024    .306
BL_EDUC       -.576       .332        -.149     -1.735   .083
DBLACK*EDUC has a negative effect (nearly significant).
Coefficient of -.576 indicates that the slope of education and job
prestige is .576 points lower for Blacks than for non-blacks.
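The group-specific education slopes implied by the table can be computed directly. The coefficients below are taken from the table; the education level of 12 years is a hypothetical value used only to illustrate the predicted gap, and the interaction is not significant at the .05 level, so these figures should be read cautiously.

```python
# Coefficients from the interaction model table.
B_EDUC = 2.541
B_DBLACK = 4.293
B_BL_EDUC = -0.576


def educ_slope(dblack):
    """Slope of education on prestige for DBLACK = 0 or 1."""
    return B_EDUC + B_BL_EDUC * dblack


def black_gap(educ):
    """Predicted prestige difference (Black minus non-Black) at a given education."""
    return B_DBLACK + B_BL_EDUC * educ


print(educ_slope(0), round(educ_slope(1), 3), round(black_gap(12), 3))
```

Because the slopes differ, the predicted gap between groups is not constant: it depends on the education level at which it is evaluated.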
Continuous Interaction Terms
• Two continuous variables can also interact
• Example: Effect of education and income on
happiness
– Perhaps highly educated people are less materialistic
– As education increases, the slope between
income and happiness would decrease
• Simply multiply Education and Income to create
the interaction term “EDUCATION*INCOME”
• And add it to the model.
Interpreting Interaction Terms
• How do you interpret continuous variable
interactions?
• Example: EDUCATION*INCOME: Coefficient = 2.0
• Answer: For each unit change in education, the
slope of income vs. happiness increases by 2
– Note: coefficient is symmetrical: For each unit
change in income, education slope increases by 2
• Dummy interactions effectively estimate 2 slopes:
one for each group
• Continuous interactions result in many slopes: Each value
of education yields a different slope for income (and vice versa).
Interpreting Interaction Terms
• Interaction terms alter the interpretation of
“main effect” coefficients
• Including “EDUC*INCOME” changes the interpretation of
EDUC and of INCOME
• See Allison p. 166-9
– Specifically, coefficient for EDUC represents slope of
EDUC when INCOME = 0
• Likewise, INCOME shows slope when EDUC=0
– Thus, main effects are like “baseline” slopes
• And, the interaction effect coefficient shows how the slope
grows (or shrinks) for a given unit change.
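The "baseline slope" idea can be sketched as a function of the other variable. The interaction coefficient of 2.0 is the one from the earlier example; the main-effect coefficient of 1.0 is a hypothetical placeholder.

```python
def income_slope(educ, b_income=1.0, b_educ_x_income=2.0):
    """Slope of income at a given education level (b_income is made up)."""
    return b_income + b_educ_x_income * educ


# At EDUC = 0 the slope is just the main-effect coefficient.
baseline = income_slope(0)

# Each additional unit of education changes the income slope by the interaction coefficient.
change_per_unit = income_slope(13) - income_slope(12)
print(baseline, change_per_unit)
```

The same logic applies in the other direction: the EDUC coefficient is the education slope when INCOME = 0.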
Dummy Interactions
• It is also possible to construct interaction terms
based on two dummy variables
– Instead of a “slope” interaction, dummy interactions
show difference in constants
• Constant (not slope) differs across values of a third variable
– Example: Effect of race on school success varies
by gender
• African Americans do less well in school; but the difference
is much larger for black males.
Dummy Interactions
• Strategy for dummy interaction is the same:
Multiply both variables
– Example: Multiply DBLACK, DMALE to create
DBLACK*DMALE
• Then, include all 3 variables in the model
– Effect of DBLACK*DMALE reflects difference in
constant (level) for black males, compared to white
males and black females
• You would observe a negative coefficient, indicating that
black males fare worse in schools than black females or
white males.
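With two dummies and their product, the model yields one predicted level per group. The sketch below uses entirely hypothetical coefficients to show how the interaction coefficient appears as a difference of differences between the four groups.

```python
def predicted_level(dblack, dmale, a=50.0, b_black=-2.0, b_male=-1.0, b_black_male=-4.0):
    """Predicted Y from a + b_black*DBLACK + b_male*DMALE + b_black_male*DBLACK*DMALE.

    All coefficient values are made up for illustration.
    """
    return a + b_black * dblack + b_male * dmale + b_black_male * dblack * dmale


white_female = predicted_level(0, 0)
white_male = predicted_level(0, 1)
black_female = predicted_level(1, 0)
black_male = predicted_level(1, 1)

# The gender gap among Black respondents minus the gender gap among white respondents.
diff_in_diff = (black_male - black_female) - (white_male - white_female)
print(white_female, white_male, black_female, black_male, diff_in_diff)
```

In this setup the interaction coefficient is the extra difference for black males beyond what the two separate dummy coefficients would predict.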
Interaction Terms: Remarks
• 1. If you make an interaction you should also
include the component variables in the model:
– A model with “DFEMALE * INCOME” should also
include DFEMALE and INCOME
• There are rare exceptions. But when in doubt, include them
• 2. Sometimes interaction terms are highly
correlated with their components
• That can cause problems (multicollinearity – which we’ll
discuss more soon)
Interaction Terms: Remarks
• 3. Make sure you have enough cases in each
group for your interaction terms
– Interaction terms involve estimating slopes for subgroups (e.g., black females vs black males).
• If there are hardly any black females in the dataset, you
can have problems
• 4. “Three-way” interactions are also possible!
• An interaction effect that varies across categories of yet
another variable
– Ex: DMale*DBlack interaction may vary across class
• They are mainly used in experimental research settings with
large sample sizes… but they are possible.