Multiple Regression


Multiple regression
 Typically, we want to use more than a single predictor
(independent variable) to make predictions
 Regression with more than one predictor is called “multiple
regression”
y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_p x_{pi} + \varepsilon_i
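A minimal sketch of fitting such a model in code, using synthetic data and the statsmodels library (neither is part of these slides; the variable names and coefficient values below are purely illustrative):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 3))                         # three predictors x1, x2, x3
beta = np.array([2.0, -1.0, 0.5])                   # "true" slopes for the simulation
y = 10 + X @ beta + rng.normal(scale=1.0, size=n)   # alpha = 10, normal errors

X_design = sm.add_constant(X)                       # add the intercept column
fit = sm.OLS(y, X_design).fit()
print(fit.params)                                   # estimates of alpha, beta1, beta2, beta3
```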
Motivating example: Sex discrimination in
wages
 In 1970’s, Harris Trust and Savings Bank was sued for
discrimination on the basis of sex.
 Analysis of salaries of employees of one type (skilled, entrylevel clerical) presented as evidence by the defense.
 Did female employees tend to receive lower starting salaries
than similarly qualified and experienced male employees?
Variables collected
 93 employees on data file (61 female, 32 male).
 bsal: Annual salary at time of hire.
 sal77: Annual salary in 1977.
 educ: years of education.
 exper: months of work experience prior to hire at the bank.
 fsex: 1 if female, 0 if male.
 senior: months worked at the bank since hire.
 age: age in months.
 So we have six x's and one y (bsal). However, in what follows
we won't use sal77.
Comparison of males and females
 This shows men started at higher salaries than women (t=6.3, p<.0001).
 But, it doesn't control for other characteristics.
[Figure: Oneway Analysis of bsal By fsex — bsal (roughly 4000 to 8000) plotted separately for female and male employees.]
Relationships of bsal with other variables
 Seniority and education predict bsal well. We want to
control for them when judging the gender effect.
[Figure: Bivariate Fit scatterplots of bsal versus senior, age, educ, and exper, each with a linear fit.]
Multiple regression model
 For any combination of values of the predictor variables, the
average value of the response (bsal) lies on a straight line:
\text{bsal}_i = \alpha + \beta_1\,\text{fsex}_i + \beta_2\,\text{senior}_i + \beta_3\,\text{age}_i + \beta_4\,\text{educ}_i + \beta_5\,\text{exper}_i + \varepsilon_i
 Just like in simple regression, assume that ε follows a normal
curve within any combination of predictors.
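A sketch of how this model could be fit with statsmodels' formula interface, assuming the bank data have been loaded into a pandas DataFrame with the column names above (the file name below is hypothetical, not part of these slides):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Assumes the bank data are in a CSV with the column names used above;
# the file name "harris_bank.csv" is hypothetical.
df = pd.read_csv("harris_bank.csv")

fit = smf.ols("bsal ~ fsex + senior + age + educ + exper", data=df).fit()
print(fit.summary())   # coefficient table, R-squared, F test, etc.
```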
Output from regression
(fsex = 1 for females, = 0 for males)
Response: bsal

Summary of Fit
  RSquare                     0.515156
  RSquare Adj                 0.487291
  Root Mean Square Error      508.0906
  Mean of Response            5420.323
  Observations (or Sum Wgts)  93

Analysis of Variance
  Source     DF  Sum of Squares  Mean Square  F Ratio  Prob > F
  Model       5        23863715      4772743  18.4878    <.0001
  Error      87        22459575       258156
  C. Total   92        46323290

Parameter Estimates
  Term       Estimate     Std Error  t Ratio  Prob>|t|
  Intercept  6277.8934    652.2713      9.62    <.0001
  fsex       -767.9127    128.97       -5.95    <.0001
  senior      -22.5823      5.295732   -4.26    <.0001
  age           0.6309603   0.720654    0.88    0.3837
  educ         92.306023   24.86354     3.71    0.0004
  exper         0.5006397   1.055262    0.47    0.6364

Effect Tests
  Source  Nparm  DF  Sum of Squares  F Ratio  Prob > F
  fsex        1   1       9152264.3  35.4525    <.0001
  senior      1   1       4694256.3  18.1838    <.0001
  age         1   1        197894.0   0.7666    0.3837
  educ        1   1       3558085.8  13.7827    0.0004
  exper       1   1         58104.8   0.2251    0.6364

[Figure: Actual by Predicted plot of bsal (P<.0001, RSq=0.52, RMSE=508.09) and Residual by Predicted plot of bsal residuals versus predicted values.]
Predictions
 Example: predict the beginning salary for a woman with
10 months of seniority, who is 25 years old, with 12 years of
education, and two years of prior experience:
\text{bsal}_i = \alpha + \beta_1\,\text{fsex}_i + \beta_2\,\text{senior}_i + \beta_3\,\text{age}_i + \beta_4\,\text{educ}_i + \beta_5\,\text{exper}_i + \varepsilon_i
 Since age and experience enter the model in months (25 years = 300 months, 2 years = 24 months):
Pred. bsal = 6277.9 - 767.9*1 - 22.6*10 + .63*300 + 92.3*12 + .50*24
           = 6592.6
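The same prediction sketched in code, reusing the hypothetical `fit` object from the earlier statsmodels example (age and exper converted to months):

```python
import pandas as pd

# One new observation; age and exper are in months (25 years = 300, 2 years = 24).
new_employee = pd.DataFrame({
    "fsex": [1], "senior": [10], "age": [300], "educ": [12], "exper": [24],
})
print(fit.predict(new_employee))   # roughly 6593 with full-precision coefficients,
                                   # matching the hand calculation above
```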
Interpretation of coefficients in multiple
regression
 Each estimated coefficient is the amount Y is expected to change when the
value of its corresponding predictor is increased by one, holding
constant the values of the other predictors.
 Example: estimated coefficient of education equals 92.3.
For each additional year of education of employee, we expect salary to
increase by about 92 dollars, holding all other variables constant.
 Estimated coefficient of fsex equals -767.
For employees who started at the same time, had the same education
and experience, and were the same age, women earned $767 less on
average than men.
Which variable is the strongest predictor of
the outcome?
 The predictor with the strongest linear association with
the outcome is the one whose t-statistic (the coefficient divided
by its SE) has the largest absolute value.
 It is not the size of the coefficient, which is sensitive to the scales of the
predictors. The t-statistic is not, since it is a standardized
measure.
 Example: In the wages regression, seniority is a better predictor
than education because it has a larger |t| (4.26 vs. 3.71).
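A quick check of this ranking, computing t-ratios directly from the estimates and standard errors in the output table above:

```python
# t-ratio = estimate / standard error, values copied from the coefficient table
estimates = {"fsex": -767.9127, "senior": -22.5823, "age": 0.6309603,
             "educ": 92.306023, "exper": 0.5006397}
std_errors = {"fsex": 128.97, "senior": 5.295732, "age": 0.720654,
              "educ": 24.86354, "exper": 1.055262}

t_ratios = {name: estimates[name] / std_errors[name] for name in estimates}
for name, t in sorted(t_ratios.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:6s}  t = {t:6.2f}")
# fsex and senior have the largest |t|, matching the discussion above.
```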
Hypothesis tests for coefficients
 The reported t-stats (coef. / SE) and p-values are used to test whether a
particular coefficient equals 0, given that all other coefficients are in the
model.
 Examples:
1) The test of whether the coefficient of education equals zero has p-value = .0004.
Hence, reject the null hypothesis; education appears to be a useful
predictor of bsal when all the other predictors are in the model.
 2) The test of whether the coefficient of experience equals zero has p-value =
.6364. Hence, we cannot reject the null hypothesis; experience does not
appear to be a particularly useful predictor of bsal when all other
predictors are in the model.
Hypothesis tests for coefficients
 The test statistics have the usual form
 (observed – expected)/SE.
 For the p-value, use the area under a t-curve with
(n - k) degrees of freedom, where k is the number of terms in
the model, including the intercept.
 In this problem, the degrees of freedom equal 93 - 6 = 87.
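A sketch of this calculation for the educ coefficient using scipy (scipy is not mentioned in the slides; the t-ratio and degrees of freedom come from the output above):

```python
from scipy import stats

# Two-sided p-value for the educ coefficient: t = 3.71 with n - k = 93 - 6 = 87 df
t_stat, dof = 3.71, 93 - 6
p_value = 2 * stats.t.sf(abs(t_stat), dof)
print(round(p_value, 4))   # about 0.0004, matching the output table
```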
CIs for regression coefficients
 A 95% CI for the coefficients is obtained in the usual way:
coef. ± (multiplier) SE
 The multiplier is obtained from the t-curve with (n - k) degrees of
freedom. (If the degrees of freedom are greater than 26, the normal
table gives essentially the same multiplier.)
 Example: A 95% CI for the population regression coefficient of
age equals:
(0.63 - 1.96*0.72, 0.63 + 1.96*0.72) ≈ (-0.78, 2.04)
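A sketch of the same interval computed with the exact t multiplier (scipy assumed; the estimate and SE are taken from the output table):

```python
from scipy import stats

# 95% CI for the age coefficient: estimate 0.63, SE 0.72, 87 degrees of freedom
est, se, dof = 0.63, 0.72, 87
multiplier = stats.t.ppf(0.975, dof)          # about 1.99; close to the normal 1.96
ci = (est - multiplier * se, est + multiplier * se)
print(ci)                                     # roughly (-0.80, 2.06); includes zero
```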
Warning about tests and CIs
 Hypothesis tests and CIs are meaningful only when the data fits
the model well.
 Remember, when the sample size is large enough, you will
probably reject any null hypothesis of β=0.
 When the sample size is small, you may not have enough evidence
to reject a null hypothesis of β=0.
 When you fail to reject a null hypothesis, don’t be too hasty to say
that a predictor has no linear association with the outcome. It is
likely that there is some association, it just isn’t a very strong one.
Checking assumptions
 Plot the residuals versus the predicted values from the
regression line.
 Also plot the residuals versus each of the predictors.
 If there are non-random patterns in these plots, the assumptions might
be violated.
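A sketch of these diagnostic plots with matplotlib, reusing the hypothetical `fit` and `df` objects from the earlier statsmodels example:

```python
import matplotlib.pyplot as plt

predictors = ["fsex", "senior", "age", "educ", "exper"]
fig, axes = plt.subplots(2, 3, figsize=(12, 7))

# Residuals versus predicted (fitted) values
axes[0, 0].scatter(fit.fittedvalues, fit.resid)
axes[0, 0].axhline(0, color="gray")
axes[0, 0].set_xlabel("predicted bsal"); axes[0, 0].set_ylabel("residual")

# Residuals versus each predictor
for ax, name in zip(axes.ravel()[1:], predictors):
    ax.scatter(df[name], fit.resid)
    ax.axhline(0, color="gray")
    ax.set_xlabel(name); ax.set_ylabel("residual")

plt.tight_layout()
plt.show()
```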
Plot of residuals versus predicted values
 This plot has a fan shape.
 It suggests non-constant variance (heteroscedasticity).
 We need to transform variables.
[Figure: Whole Model — Residual by Predicted plot for the regression with sal77 as the response.]
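One common variance-stabilizing transformation for a fan-shaped residual plot is to model the log of the response; the slides do not say which transformation to use, so this is only a sketch, again assuming the hypothetical `df` from earlier (shown for sal77, the response in the plot above):

```python
import numpy as np
import statsmodels.formula.api as smf

# Regress log(salary) instead of salary; np.log is evaluated inside the formula.
log_fit = smf.ols("np.log(sal77) ~ fsex + senior + age + educ + exper", data=df).fit()
print(log_fit.summary())
```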
Plots of residuals vs. predictors
[Figure: residuals from the bsal regression plotted against each predictor, including fsex.]
Summary of residual plots
 There appears to be a non-random pattern in the plot of
residuals versus experience, and also versus age.
 This model can be improved.
Modeling categorical predictors
 When predictors are categorical and their categories are simply assigned numbers,
a regression that treats those numbers as measurements makes no sense.
 Instead, we make “dummy variables” to stand in for the
categorical variables.
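A sketch of building dummy variables with pandas; the `jobtype` column here is hypothetical and not part of the bank data:

```python
import pandas as pd

# Illustrative only: "jobtype" is a made-up categorical column.
d = pd.DataFrame({"jobtype": ["clerical", "teller", "manager", "teller"]})
dummies = pd.get_dummies(d["jobtype"], prefix="jobtype", drop_first=True, dtype=float)
print(dummies)   # one 0/1 column per category, minus a baseline category
```

With the formula interface used earlier, wrapping a column as C(jobtype) creates the dummy variables automatically.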
Collinearity
 When predictors are highly correlated, the standard errors of their
coefficients are inflated.
 Conceptual example:
 Suppose two variables Z and X are exactly the same.
 Suppose the population regression line of Y on X is
Y = 10 + 5X
 Fit a regression using sample data of Y on both X and Z. We
could plug in any values for the coefficients of X and Z, so long
as they add up to 5. Equivalently, this means that the standard
errors for the coefficients are huge.
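A small simulation illustrating the idea: adding a near-duplicate predictor Z alongside X blows up the standard errors (numpy and statsmodels assumed; the data are synthetic):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
z = x + rng.normal(scale=0.01, size=n)        # z is almost exactly x
y = 10 + 5 * x + rng.normal(size=n)           # population line: Y = 10 + 5X

fit_x = sm.OLS(y, sm.add_constant(x)).fit()                          # X alone
fit_xz = sm.OLS(y, sm.add_constant(np.column_stack([x, z]))).fit()   # X and Z together

print(fit_x.bse)    # small standard error for the X coefficient
print(fit_xz.bse)   # much larger standard errors once the near-duplicate Z is added
```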
General warnings for multiple
regression
 Be even more wary of extrapolation. Because there are
several predictors, you can extrapolate in many ways.
 Multiple regression shows association. It does not prove
causality. Only a carefully designed observational study or
randomized experiment can show causality.