Multicollinearity
Overview
• This set of slides
– What is multicollinearity?
– What are its effects on regression analyses?
• Next set of slides
– How to detect multicollinearity?
– How to reduce some types of multicollinearity?
Multicollinearity
• Multicollinearity exists when two or more
of the predictors in a regression model are
moderately or highly correlated.
• Regression analyses most often take place
on data obtained from observational studies.
• In observational studies, multicollinearity
happens more often than not.
Types of multicollinearity
• Structural multicollinearity
– a mathematical artifact caused by creating new
predictors from other predictors, such as
creating the predictor x² from the predictor x
(see the sketch after this slide).
• Sample-based multicollinearity
– a result of a poorly designed experiment,
reliance on purely observational data, or the
inability to manipulate the system on which you
collect the data.
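To make the structural case concrete, here is a minimal Python sketch (not part of the original slides; the values are simulated) showing that a predictor x and a predictor x² created from it are typically highly correlated when x takes only positive values:

```python
import numpy as np

# Hypothetical positive-valued predictor, e.g. ages of study participants
rng = np.random.default_rng(0)
x = rng.uniform(40, 70, size=20)

# Structural multicollinearity: the new predictor is built from x itself
x_squared = x ** 2

# The sample correlation between x and x**2 is close to 1 here
r = np.corrcoef(x, x_squared)[0, 1]
print(f"corr(x, x^2) = {r:.3f}")
```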
Example
[Scatterplot matrix of BP, Age, Weight, BSA, Duration, Pulse, and Stress]
n = 20 hypertensive individuals
p-1 = 6 predictor variables
Example

Pearson correlations among the response and the predictors:

             BP     Age  Weight    BSA  Duration  Pulse
Age       0.659
Weight    0.950   0.407
BSA       0.866   0.378   0.875
Duration  0.293   0.344   0.201  0.131
Pulse     0.721   0.619   0.659  0.465    0.402
Stress    0.164   0.368   0.034  0.018    0.312   0.506

Blood pressure (BP) is the response.
What is the effect on regression analyses if
the predictors are perfectly uncorrelated?
Example

 x1  x2   y
  2   5  52
  2   5  51
  2   7  49
  2   7  46
  4   5  50
  4   5  48
  4   7  44
  4   7  43

Pearson correlation of x1 and x2 = 0.000
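A quick check of that zero correlation (a minimal sketch, not part of the original slides), using the eight observations listed above:

```python
import numpy as np

# The toy data set from the slide above
x1 = np.array([2, 2, 2, 2, 4, 4, 4, 4], dtype=float)
x2 = np.array([5, 5, 7, 7, 5, 5, 7, 7], dtype=float)
y  = np.array([52, 51, 49, 46, 50, 48, 44, 43], dtype=float)

# x1 and x2 are perfectly balanced against each other, so their
# sample correlation is exactly zero
print(np.corrcoef(x1, x2)[0, 1])   # 0.0
```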
Example

[Scatterplot matrix of y, x1, and x2]
Regress y on x1

The regression equation is y = 48.8 - 0.63 x1

Predictor      Coef  SE Coef      T      P
Constant     48.750    4.025  12.11  0.000
x1           -0.625    1.273  -0.49  0.641

Analysis of Variance
Source      DF     SS     MS     F      P
Regression   1   3.13   3.13  0.24  0.641
Error        6  77.75  12.96
Total        7  80.88
Regress y on x2

The regression equation is y = 55.1 - 1.38 x2

Predictor      Coef  SE Coef      T      P
Constant     55.125    7.119   7.74  0.000
x2           -1.375    1.170  -1.17  0.285

Analysis of Variance
Source      DF     SS     MS     F      P
Regression   1  15.13  15.13  1.38  0.285
Error        6  65.75  10.96
Total        7  80.88
Regress y on x1 and x2

The regression equation is y = 57.0 - 0.63 x1 - 1.38 x2

Predictor      Coef  SE Coef      T      P
Constant     57.000    8.486   6.72  0.001
x1           -0.625    1.251  -0.50  0.639
x2           -1.375    1.251  -1.10  0.322

Analysis of Variance
Source      DF     SS     MS     F      P
Regression   2  18.25   9.13  0.73  0.528
Error        5  62.63  12.53
Total        7  80.88

Source  DF  Seq SS
x1       1    3.13
x2       1   15.13
Regress y on x2 and x1

The regression equation is y = 57.0 - 1.38 x2 - 0.63 x1

Predictor      Coef  SE Coef      T      P
Constant     57.000    8.486   6.72  0.001
x2           -1.375    1.251  -1.10  0.322
x1           -0.625    1.251  -0.50  0.639

Analysis of Variance
Source      DF     SS     MS     F      P
Regression   2  18.25   9.13  0.73  0.528
Error        5  62.63  12.53
Total        7  80.88

Source  DF  Seq SS
x2       1   15.13
x1       1    3.13
Summary of results

Model               b1      se(b1)  b2      se(b2)  Seq SS
x1 only             -0.625  1.273   ---     ---     SSR(X1)    =  3.13
x2 only             ---     ---     -1.375  1.170   SSR(X2)    = 15.13
x1, x2 (in order)   -0.625  1.251   -1.375  1.251   SSR(X2|X1) = 15.13
x2, x1 (in order)   -0.625  1.251   -1.375  1.251   SSR(X1|X2) =  3.13
If predictors are perfectly
uncorrelated, then…
• You get the same slope estimates regardless
of the first-order regression model used.
• That is, the effect on the response ascribed
to a predictor doesn’t depend on the other
predictors in the model.
If predictors are perfectly
uncorrelated, then…
• The sum of squares SSR(X1) is the same as
the sequential sum of squares SSR(X1|X2).
• The sum of squares SSR(X2) is the same as
the sequential sum of squares SSR(X2|X1).
• That is, the marginal contribution of one
predictor variable in reducing the error sum
of squares doesn’t depend on the other
predictors in the model.
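A minimal Python sketch (not part of the original slides) that reproduces the toy example above with statsmodels; it uses only the eight observations listed earlier and shows that the slope estimates and sequential sums of squares do not depend on which other predictor is in the model:

```python
import numpy as np
import statsmodels.api as sm

x1 = np.array([2, 2, 2, 2, 4, 4, 4, 4], dtype=float)
x2 = np.array([5, 5, 7, 7, 5, 5, 7, 7], dtype=float)
y  = np.array([52, 51, 49, 46, 50, 48, 44, 43], dtype=float)

def fit(*cols):
    """OLS fit of y on the given predictor columns (plus an intercept)."""
    X = sm.add_constant(np.column_stack(cols))
    return sm.OLS(y, X).fit()

m1, m2, m12 = fit(x1), fit(x2), fit(x1, x2)

# Slopes are identical whether or not the other predictor is in the model
print(m1.params[1], m12.params[1])   # b1 = -0.625 in both models
print(m2.params[1], m12.params[2])   # b2 = -1.375 in both models

# Sequential sums of squares: SSR(X2|X1) = SSE(X1) - SSE(X1, X2)
print(m1.ess)               # SSR(X1)    = 3.125
print(m1.ssr - m12.ssr)     # SSR(X2|X1) = 15.125, the same as SSR(X2) = m2.ess
```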
Do we see similar results for “real data”
with nearly uncorrelated predictors?
Example

Pearson correlations among the response and the predictors (repeated from above); note the very small correlation, 0.018, between Stress and BSA:

             BP     Age  Weight    BSA  Duration  Pulse
Age       0.659
Weight    0.950   0.407
BSA       0.866   0.378   0.875
Duration  0.293   0.344   0.201  0.131
Pulse     0.721   0.619   0.659  0.465    0.402
Stress    0.164   0.368   0.034  0.018    0.312   0.506
Example

[Scatterplot matrix of BP, Stress, and BSA]
Regress BP on x1 = Stress

The regression equation is
BP = 113 + 0.0240 Stress

Predictor       Coef  SE Coef      T      P
Constant     112.720    2.193  51.39  0.000
Stress       0.02399  0.03404   0.70  0.490

S = 5.502   R-Sq = 2.7%   R-Sq(adj) = 0.0%

Analysis of Variance
Source      DF      SS     MS     F      P
Regression   1   15.04  15.04  0.50  0.490
Error       18  544.96  30.28
Total       19  560.00
Regress BP on x2 = BSA

The regression equation is
BP = 45.2 + 34.4 BSA

Predictor      Coef  SE Coef     T      P
Constant     45.183    9.392  4.81  0.000
BSA          34.443    4.690  7.34  0.000

S = 2.790   R-Sq = 75.0%   R-Sq(adj) = 73.6%

Analysis of Variance
Source      DF      SS      MS      F      P
Regression   1  419.86  419.86  53.93  0.000
Error       18  140.14    7.79
Total       19  560.00
Regress BP on x1 = Stress and x2 = BSA

The regression equation is
BP = 44.2 + 0.0217 Stress + 34.3 BSA

Predictor       Coef  SE Coef     T      P
Constant      44.245    9.261  4.78  0.000
Stress       0.02166  0.01697  1.28  0.219
BSA           34.334    4.611  7.45  0.000

Analysis of Variance
Source      DF      SS      MS      F      P
Regression   2  432.12  216.06  28.72  0.000
Error       17  127.88    7.52
Total       19  560.00

Source   DF  Seq SS
Stress    1   15.04
BSA       1  417.07
Regress BP on x2 = BSA and x1 = Stress

The regression equation is
BP = 44.2 + 34.3 BSA + 0.0217 Stress

Predictor       Coef  SE Coef     T      P
Constant      44.245    9.261  4.78  0.000
BSA           34.334    4.611  7.45  0.000
Stress       0.02166  0.01697  1.28  0.219

Analysis of Variance
Source      DF      SS      MS      F      P
Regression   2  432.12  216.06  28.72  0.000
Error       17  127.88    7.52
Total       19  560.00

Source   DF  Seq SS
BSA       1  419.86
Stress    1   12.26
Summary of results

Model               b1      se(b1)  b2      se(b2)  Seq SS
x1 only             0.0240  0.0340  ---     ---     SSR(X1)    =  15.04
x2 only             ---     ---     34.443  4.690   SSR(X2)    = 419.86
x1, x2 (in order)   0.0217  0.0170  34.334  4.611   SSR(X2|X1) = 417.07
x2, x1 (in order)   0.0217  0.0170  34.334  4.611   SSR(X1|X2) =  12.26
If predictors are nearly
uncorrelated, then…
• You get similar slope estimates regardless
of the first-order regression model used.
• The sum of squares SSR(X1) is similar to
the sequential sum of squares SSR(X1|X2).
• The sum of squares SSR(X2) is similar to
the sequential sum of squares SSR(X2|X1).
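For readers who want to reproduce these fits outside Minitab, here is a hedged Python sketch. It assumes the 20 blood pressure observations sit in a text file, hypothetically named bloodpress.txt, with columns BP, Stress, and BSA among others; the file name and layout are assumptions, not part of the original slides.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file holding the n = 20 blood pressure observations
data = pd.read_table("bloodpress.txt")

# The same three fits shown in the Minitab output above
fit_stress = smf.ols("BP ~ Stress", data=data).fit()
fit_bsa    = smf.ols("BP ~ BSA", data=data).fit()
fit_both   = smf.ols("BP ~ Stress + BSA", data=data).fit()

# Coefficients and standard errors barely change because
# Stress and BSA are nearly uncorrelated (r = 0.018)
print(fit_stress.params["Stress"], fit_both.params["Stress"])
print(fit_bsa.params["BSA"], fit_both.params["BSA"])
print(fit_both.summary())
```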
What happens if the predictor variables
are highly correlated?
Example

Pearson correlations among the response and the predictors (repeated from above); note the high correlation, 0.875, between Weight and BSA:

             BP     Age  Weight    BSA  Duration  Pulse
Age       0.659
Weight    0.950   0.407
BSA       0.866   0.378   0.875
Duration  0.293   0.344   0.201  0.131
Pulse     0.721   0.619   0.659  0.465    0.402
Stress    0.164   0.368   0.034  0.018    0.312   0.506
Example

[Scatterplot matrix of BP, Weight, and BSA]
Regress BP on x1 = Weight

The regression equation is
BP = 2.21 + 1.20 Weight

Predictor       Coef  SE Coef      T      P
Constant       2.205    8.663   0.25  0.802
Weight       1.20093  0.09297  12.92  0.000

S = 1.740   R-Sq = 90.3%   R-Sq(adj) = 89.7%

Analysis of Variance
Source      DF      SS      MS       F      P
Regression   1  505.47  505.47  166.86  0.000
Error       18   54.53    3.03
Total       19  560.00
Regress BP on x2 = BSA

The regression equation is
BP = 45.2 + 34.4 BSA

Predictor      Coef  SE Coef     T      P
Constant     45.183    9.392  4.81  0.000
BSA          34.443    4.690  7.34  0.000

S = 2.790   R-Sq = 75.0%   R-Sq(adj) = 73.6%

Analysis of Variance
Source      DF      SS      MS      F      P
Regression   1  419.86  419.86  53.93  0.000
Error       18  140.14    7.79
Total       19  560.00
Regress BP on x1 = Weight and x2 = BSA

The regression equation is
BP = 5.65 + 1.04 Weight + 5.83 BSA

Predictor      Coef  SE Coef     T      P
Constant      5.653    9.392  0.60  0.555
Weight       1.0387   0.1927  5.39  0.000
BSA           5.831    6.063  0.96  0.350

Analysis of Variance
Source      DF      SS      MS      F      P
Regression   2  508.29  254.14  83.54  0.000
Error       17   51.71    3.04
Total       19  560.00

Source   DF  Seq SS
Weight    1  505.47
BSA       1    2.81
Regress BP on x2 = BSA and x1 = Weight

The regression equation is
BP = 5.65 + 5.83 BSA + 1.04 Weight

Predictor      Coef  SE Coef     T      P
Constant      5.653    9.392  0.60  0.555
BSA           5.831    6.063  0.96  0.350
Weight       1.0387   0.1927  5.39  0.000

Analysis of Variance
Source      DF      SS      MS      F      P
Regression   2  508.29  254.14  83.54  0.000
Error       17   51.71    3.04
Total       19  560.00

Source   DF  Seq SS
BSA       1  419.86
Weight    1   88.43
Effect #1 of multicollinearity
When predictor variables are correlated, the regression
coefficient of any one variable depends on which other
predictor variables are included in the model.
Variables in model    b1      b2
x1                    1.20    ----
x2                    ----    34.4
x1, x2                1.04    5.83
Even correlated predictors not in
the model can have an impact!
• Regression of territory sales on territory
population, per capita income, etc.
• Against expectation, the coefficient of territory
population turned out to be negative.
• The competitor's market penetration, which was
strongly positively correlated with territory
population, was not included in the model.
• But the competitor kept sales down in territories with
large populations.
Effect #2 of multicollinearity
When predictor variables are correlated, the precision of
the estimated regression coefficients decreases as more
predictor variables are added to the model.
Variables in model    se(b1)   se(b2)
x1                    0.093    ----
x2                    ----     4.69
x1, x2                0.193    6.06
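Effects #1 and #2 are easy to reproduce by simulation. The sketch below (not from the original slides; the data are simulated) generates two highly correlated predictors and shows that the slope of x1 and its standard error both change when x2 enters the model:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 20

# Two highly correlated predictors (x2 is mostly x1 plus a little noise)
x1 = rng.normal(size=n)
x2 = x1 + 0.2 * rng.normal(size=n)
y = 3.0 + 2.0 * x1 + 1.0 * x2 + rng.normal(size=n)

fit1  = sm.OLS(y, sm.add_constant(x1)).fit()
fit12 = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

# Effect #1: the coefficient of x1 changes once x2 is added
print(fit1.params[1], fit12.params[1])

# Effect #2: the standard error of that coefficient grows as well
print(fit1.bse[1], fit12.bse[1])
```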
Why not effects #1 and #2?
[3D scatterplot of y vs x1 vs x2]
Why not effects #1 and #2?
[3D scatterplot of BP vs BSA vs Stress]
Why effects #1 and #2?
[3D scatterplot of BP vs BSA vs Weight]
Effect #3 of multicollinearity
When predictor variables are correlated, the marginal
contribution of any one predictor variable in reducing
the error sum of squares varies, depending on which
other variables are already in the model.
SSR(X1) = 505.47
SSR(X2) = 419.86
SSR(X1|X2) = 88.43
SSR(X2|X1) = 2.81
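As a worked illustration (a sketch, not part of the original slides), a sequential sum of squares such as SSR(X2|X1) can be computed as the drop in the error sum of squares when x2 is added to a model that already contains x1; the check below uses the toy data from earlier:

```python
import numpy as np
import statsmodels.api as sm

def seq_ss(y, x_already_in, x_new):
    """SSR(x_new | x_already_in): reduction in SSE from adding x_new."""
    X_reduced = sm.add_constant(x_already_in)
    X_full = sm.add_constant(np.column_stack([x_already_in, x_new]))
    sse_reduced = sm.OLS(y, X_reduced).fit().ssr
    sse_full = sm.OLS(y, X_full).fit().ssr
    return sse_reduced - sse_full

# With the toy data from earlier, SSR(X2|X1) = 15.125 and SSR(X1|X2) = 3.125
x1 = np.array([2, 2, 2, 2, 4, 4, 4, 4], dtype=float)
x2 = np.array([5, 5, 7, 7, 5, 5, 7, 7], dtype=float)
y  = np.array([52, 51, 49, 46, 50, 48, 44, 43], dtype=float)
print(seq_ss(y, x1, x2), seq_ss(y, x2, x1))
```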
What is the effect on estimating the mean
or predicting a new response?

[Scatterplot of Weight versus BSA, with the new point (BSA = 2, Weight = 92) marked]
Effect #4 of multicollinearity on
estimating mean or predicting Y
Predict BP at Weight = 92 (model with Weight only):
  Fit     SE Fit   95.0% CI            95.0% PI
  112.7   0.402    (111.85, 113.54)    (108.94, 116.44)

Predict BP at BSA = 2 (model with BSA only):
  Fit     SE Fit   95.0% CI            95.0% PI
  114.1   0.624    (112.76, 115.38)    (108.06, 120.08)

Predict BP at BSA = 2 and Weight = 92 (model with both):
  Fit     SE Fit   95.0% CI            95.0% PI
  112.8   0.448    (111.93, 113.83)    (109.08, 116.68)
High multicollinearity among the predictor variables does
not prevent good, precise predictions of the response
(within the scope of the model).
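To obtain such fits and intervals in Python, a hedged sketch is below; it again assumes the hypothetical bloodpress.txt file described earlier and uses statsmodels' get_prediction:

```python
import pandas as pd
import statsmodels.formula.api as smf

data = pd.read_table("bloodpress.txt")   # hypothetical data file, as above

fit = smf.ols("BP ~ Weight + BSA", data=data).fit()

# Fitted value, confidence interval for the mean response, and
# prediction interval at BSA = 2, Weight = 92
new = pd.DataFrame({"Weight": [92.0], "BSA": [2.0]})
pred = fit.get_prediction(new)
print(pred.summary_frame(alpha=0.05))
# columns include: mean, mean_ci_lower/upper (95% CI), obs_ci_lower/upper (95% PI)
```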
What is the effect on tests
of individual slopes?

The regression equation is
BP = 45.2 + 34.4 BSA

Predictor      Coef  SE Coef     T      P
Constant     45.183    9.392  4.81  0.000
BSA          34.443    4.690  7.34  0.000

S = 2.790   R-Sq = 75.0%   R-Sq(adj) = 73.6%

Analysis of Variance
Source      DF      SS      MS      F      P
Regression   1  419.86  419.86  53.93  0.000
Error       18  140.14    7.79
Total       19  560.00
What is the effect on tests
of individual slopes?

The regression equation is
BP = 2.21 + 1.20 Weight

Predictor       Coef  SE Coef      T      P
Constant       2.205    8.663   0.25  0.802
Weight       1.20093  0.09297  12.92  0.000

S = 1.740   R-Sq = 90.3%   R-Sq(adj) = 89.7%

Analysis of Variance
Source      DF      SS      MS       F      P
Regression   1  505.47  505.47  166.86  0.000
Error       18   54.53    3.03
Total       19  560.00
What is the effect on tests
of individual slopes?

The regression equation is
BP = 5.65 + 1.04 Weight + 5.83 BSA

Predictor      Coef  SE Coef     T      P
Constant      5.653    9.392  0.60  0.555
Weight       1.0387   0.1927  5.39  0.000
BSA           5.831    6.063  0.96  0.350

Analysis of Variance
Source      DF      SS      MS      F      P
Regression   2  508.29  254.14  83.54  0.000
Error       17   51.71    3.04
Total       19  560.00

Source   DF  Seq SS
Weight    1  505.47
BSA       1    2.81
Effect #5 of multicollinearity on
slope tests
When predictor variables are correlated, hypothesis tests
for βk = 0 may yield different conclusions depending on
which predictor variables are in the model.
Variables in model    b2      se(b2)   t      P-value
x2                    34.4    4.7      7.34   0.000
x1, x2                5.83    6.1      0.96   0.350
The major impacts on model use
• In the presence of multicollinearity, it is still okay
to use an estimated regression model to
predict y or estimate μY within the scope of the model.
• In the presence of multicollinearity, we can
no longer interpret a slope coefficient as …
– the change in the mean response for each
additional unit increase in xk, when all the other
predictors are held constant.