Transcript Document
Multicollinearity
Rudra P. Pradhan
Nature of Multicollinearity Problem
Many finance/economic variables, especially time series variables, are closely related to each other.
Example: Population and GDP are closely related, i.e. highly correlated.
In multiple regression models, a regression coefficient measures
the partial effect of that individual variable on Y when all other
X variables in the model are fixed.
However, when two explanatory variables move closely together, we cannot assume that one is fixed while the other changes: when one changes, the other also changes because they are closely related. In such a case it is difficult to isolate the partial effect of a single X variable. This is the problem of multicollinearity.
Examples of Multicollinearity
Example 5.1 Housing Starts (Data4-3)
No. of housing units (in thousands) depends on the mortgage interest rate (R), population (in millions) and GDP (in billions of 1982 dollars). Three models were formulated as:
Model A: $Hous_t = \alpha_1 + \alpha_2 R_t + \alpha_3 Pop_t + u_{1t}$
Model B: $Hous_t = \beta_1 + \beta_2 R_t + \beta_3 Gdp_t + u_{2t}$
Model C: $Hous_t = \gamma_1 + \gamma_2 R_t + \gamma_3 Pop_t + \gamma_4 Gdp_t + u_{3t}$
Examples of Multicollinearity
Example 5.1 Housing Starts (Data4-3)
Estimates of Housing Starts (t-statistics in parentheses)
________________________________________________________
Variables    Model A            Model B            Model C
Const.       -3812.93 (-2.40)   687.90 (1.80)      -1315.75 (-0.27)
R            -198.4 (-3.87)     -169.66 (-3.87)    -184.75 (-3.18)
Pop          33.82 (3.61)       -                  14.90 (0.41)
Gdp          -                  0.91 (3.64)        0.52 (0.54)
________________________________________________________
Adj.R2       0.371              0.375              0.348
Examples of Multicollinearity
Example 5.1 Housing Starts (Data4-3)
Both Pop and Gdp (level of income) are
expected to influence the number of housing
units started for construction.
Result 1: In Model C, which has both Pop and GDP, the coefficients have very low t-values and are insignificant. However, when Pop and GDP enter the model alone, as in Models A and B, they are statistically significant.
Examples of Multicollinearity
Example 5.1 Housing Starts (Data4-3)
Result 2:
The coefficients for Pop and GDP in Model C are very different from those in Model A and Model B, i.e. when they are present together, drastic changes occur in the estimates. This is because these three variables are highly correlated:
r(GDP,Pop) = 0.99
r(GDP,R) = 0.88
r(Pop,R) = 0.91
Examples of Multicollinearity
Example 5.2 - Car Maintenance (Data3-7)
$E_t$ = cumulative expenditure at time t on maintenance for a given car,
$Miles_t$ = cumulative mileage in thousands of miles (+)
$Age_t$ = age of the car in weeks since the original purchase (+)
Model A: $E_t = \alpha_1 + \alpha_2 Age_t + u_{1t}$
Model B: $E_t = \beta_1 + \beta_2 Miles_t + u_{2t}$
Model C: $E_t = \gamma_1 + \gamma_2 Age_t + \gamma_3 Miles_t + u_{3t}$
Examples of Multicollinearity
Example 5.2 - Car Maintenance (Data3-7)
Estimates of Car Maintenance Model (t-statistics in parentheses)
________________________________________________________
Variables    Model A           Model B           Model C
Const.       -626.24 (-5.98)   -796.07 (-5.91)   7.29 (0.06)
Age          7.35 (22.16)      -                 27.58 (9.58)
Miles        -                 54.45 (18.27)     -151.15 (-7.08)
________________________________________________________
df           55                55                54
Adj.R2       0.897             0.856             0.946
Examples of Multicollinearity
Example 5.2 - Car Maintenance (Data3-7)
NOTE: Although the coefficient of Miles is positive in Model B, as we expected, it is negative and significant in Model C. So, there is a reversal of sign. The coefficient for Age has also changed drastically.
The t-statistics for Age and Miles are much lower in Model C. The reason for the drastic changes in the estimates in Model C is the high correlation between Age and Miles, which is 0.996.
So, high correlation between explanatory variables can make
regression coefficients insignificant or reverse their signs.
5.2 Exact Multicollinearity
If two or more independent variables have an exact linear relationship between them, we have exact (or perfect) multicollinearity.
In this case, we cannot obtain the least squares estimates.
$Y_t = \alpha_1 + \alpha_2 X_{t2} + \alpha_3 X_{t3} + u_t$
5.2 Exact Multicollinearity
Let $X_{t3} = 2 X_{t2}$. In this case $r_{23} = 1$ and
$Y_t = \alpha_1 + \alpha_2 X_{t2} + \alpha_3 (2 X_{t2}) + u_t$
The estimable function is
$Y_t = \alpha_1 + (\alpha_2 + 2\alpha_3) X_{t2} + u_t$
i.e. we cannot obtain separate estimates of $\alpha_2$ and $\alpha_3$; only the combination $\alpha_2 + 2\alpha_3$ can be estimated. The assumption of the Classical Regression Model that there is no exact linear relationship among the X variables is violated.
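A minimal numerical sketch of this point (hypothetical data and variable names, plain numpy; not from the text):

```python
import numpy as np

# Exact multicollinearity: X3 = 2*X2, so the design matrix is rank deficient
# and OLS cannot separate alpha2 from alpha3.
rng = np.random.default_rng(0)
n = 50
x2 = rng.normal(size=n)
x3 = 2.0 * x2                              # exact linear relationship, r23 = 1
y = 1.0 + 3.0 * x2 + 0.5 * x3 + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x2, x3])
print(np.linalg.matrix_rank(X))            # 2, not 3 -> X'X is singular

# Only the combination alpha2 + 2*alpha3 is estimable: regress y on x2 alone.
Xr = np.column_stack([np.ones(n), x2])
b = np.linalg.solve(Xr.T @ Xr, Xr.T @ y)
print(np.round(b, 2))                      # slope is about 4.0 = 3.0 + 2*0.5
```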
5.3 Near Multicollinearity
When explanatory variables are closely related (but not perfectly correlated), we can obtain estimates for the coefficients. In this case, the questions are:
1) What are the consequences of ignoring
multicollinearity?
2) How do we identify the presence of the
problem?
3) What are the remedies? (solutions to the
problem?)
Consequences of Ignoring Near
Multicollinearity
OLS estimators are still BLUE, so they are unbiased and efficient.
However, although BLUE, the OLS estimators have large variances (standard errors). So, the confidence intervals are wider, leading to acceptance of the "zero" null hypothesis, and t-values are low, indicating that coefficients are insignificant or less significant, i.e. the estimates are not precise.
Multicollinearity may not affect the forecasting
performance of a model.
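These two consequences can be seen in a small Monte Carlo sketch (the data-generating process below is assumed for illustration, not taken from the slides): the OLS estimator stays centred on the true value, but its sampling variability is much larger when the regressors are highly correlated.

```python
import numpy as np

# Monte Carlo sketch: OLS remains unbiased under near multicollinearity,
# but the spread (standard error) of the estimates blows up.
rng = np.random.default_rng(1)

def slope_estimates(rho, reps=2000, n=100):
    """Collect the OLS estimate of the coefficient on x1 over many samples."""
    draws = []
    for _ in range(reps):
        x1 = rng.normal(size=n)
        x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)  # corr = rho
        y = 1.0 + 0.5 * x1 + 2.0 * x2 + rng.normal(size=n)
        X = np.column_stack([np.ones(n), x1, x2])
        draws.append(np.linalg.solve(X.T @ X, X.T @ y)[1])
    return np.array(draws)

for rho in (0.0, 0.98):
    d = slope_estimates(rho)
    print(f"corr = {rho}: mean estimate = {d.mean():.3f}, std. dev. = {d.std():.3f}")
# Both means are close to the true 0.5 (unbiased); the spread is far larger
# when corr = 0.98, so t-values are low and confidence intervals are wide.
```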
Identifying Multicollinearity
In practice, multicollinearity often shows up in a number of ways:
1) High R2 with Low Values of t-statistics:
As in Example 5.1, it is possible that the F value for a
group of coefficients may be significant while
coefficients are insignificant individually according to
the t-statistics.
2) High Values for Correlation Coefficients:
Usually, the correlation between explanatory variables is high. However, multicollinearity may still be present even though the correlation between any two explanatory variables is not high. This can happen in multiple regressions with more than two explanatory variables, where a linear combination of several variables may be nearly collinear even if the pairwise correlations are moderate.
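A small sketch of this caveat (hypothetical variables, assumed data): pairwise correlations can look moderate even though one regressor is almost an exact linear combination of the others.

```python
import numpy as np

# Pairwise correlations understate the problem when x3 is roughly x1 + x2.
rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + rng.normal(scale=0.05, size=n)     # near-exact combination

print(np.round(np.corrcoef(np.column_stack([x1, x2, x3]), rowvar=False), 2))
# r(x1, x3) and r(x2, x3) are only about 0.7, and r(x1, x2) is about 0.

# Yet x3 is almost perfectly explained by x1 and x2 together:
Z = np.column_stack([np.ones(n), x1, x2])
b = np.linalg.solve(Z.T @ Z, Z.T @ x3)
resid = x3 - Z @ b
print(round(1 - resid.var() / x3.var(), 4))        # R2 close to 1 -> severe collinearity
```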
Identifying Multicollinearity
3) Regression Coefficients Sensitive to Model Specification:
Addition or deletion of variables may drastically change the coefficient estimates. This is also an indication of a multicollinearity problem.
4) Regression Coefficients Sensitive to Deletion or Addition of a Few Observations:
Addition or deletion of a few observations may also drastically change the coefficient estimates.
Identifying Multicollinearity
- Formal Tests of Multicollinearity
Formal Tests of Multicollinearity:
The proposed tests are quite controversial. This is because multicollinearity is a problem of the data rather than of the population, i.e. it is a problem of insufficient information (data), not a problem of the population.
Solutions to Multicollinearity Problem
No single solution exists.
Multiple Regression
Previously introduced how to do multiple
regression
Note difference in interpreting the coefficients
Then looked at omitted variable bias
Now look at problem of adding too many
variables
Multicollinearity
How many variables to include?
A response to omitted variable bias might be to
include every possible variable in the model
Undesirable because
Including irrelevant variables increases standard
errors of other variables,
distorting confidence intervals and hypothesis tests
Including “too many” variables that measure the same
concept can lead to multicollinearity
Problem of high correlation
Intuition: If explanatory variables are highly
correlated with one another, the regression model
has trouble telling which individual variable is
explaining Y
In the extreme of exact linear relationships
amongst the explanatory variables (e.g. X1+X2=1)
the model cannot be estimated
Why?
The matrix X’X cannot be inverted
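A minimal sketch of why estimation fails (made-up numbers): with X1 + X2 = 1 the columns of X, including the intercept, are linearly dependent, so X'X is singular and the coefficients are not identified.

```python
import numpy as np

# Exact collinearity with the intercept: X1 + X2 = 1 for every observation.
x1 = np.array([0.25, 0.5, 0.75, 0.125, 0.625])
x2 = 1.0 - x1
X = np.column_stack([np.ones_like(x1), x1, x2])

print(np.linalg.matrix_rank(X.T @ X))            # 2, not 3 -> X'X has no inverse

# Non-identification: two very different coefficient vectors give exactly the
# same fitted values, so the data cannot choose between them.
b1 = np.array([0.0, 1.0, 2.0])
b2 = b1 + 5.0 * np.array([1.0, -1.0, -1.0])      # shift along the null space
print(np.allclose(X @ b1, X @ b2))               # True
```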
Symptoms of Multicollinearity
Individual coefficients may look statistically insignificant (low t-values), but the regression as a whole is significant (high R2 and a significant F-stat)
High correlation amongst the explanatory
variables (may not always be apparent, with more
than 2 explanatory variables, correlated linear
combinations may occur)
Coefficient estimates are "fragile" in the sense
that small changes in the specification of the
model (e.g. including or excluding a seemingly
irrelevant variable) cause big changes in
estimated coefficient values
Multicollinearity Example
Y = exchange rate
Explanatory variable(s) = interest rate
X1 = bank prime rate
X2 = Treasury bill rate
Using both X1 and X2 will probably cause a multicollinearity problem because both interest rates move together
Solution: Include either X1 or X2 but not both.
In some cases this "solution" will be unsatisfactory if it causes you to drop explanatory variables which economic theory says should be there.
Illustration of the Effect of Multicollinearity
True Model: $Y = 0.5 X_1 + 2 X_2 + e$
Correlation between X1 and X2 = .98
             Coeff.    St. Error   t-Stat   P-val.   Lower 95%   Upper 95%
Inter.       .1662     .1025       1.579    .1211    -.0456      .3780
X1           2.084     .9529       2.187    .0338    .1667       4.001
X2           .1478     .9658       .1530    .8790    -1.795      2.091
R2=.76, P-value for R2=0 is 1.87E-15.
Coefficient estimates are far from their true values of .5 and 2, and the coefficient on X2 is not significant.
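A hedged re-creation of this kind of experiment (the data-generating process is assumed; it is not the exact artificial dataset behind the table above):

```python
import numpy as np

# Simulate Y = .5*X1 + 2*X2 + e with corr(X1, X2) = .98 and fit by OLS.
rng = np.random.default_rng(3)
n = 50
x1 = rng.normal(size=n)
x2 = 0.98 * x1 + np.sqrt(1 - 0.98**2) * rng.normal(size=n)
y = 0.5 * x1 + 2.0 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
b = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ b
sigma2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
print("coefficients:", np.round(b, 3))
print("std. errors: ", np.round(se, 3))
print("t-stats:     ", np.round(b / se, 3))
# The standard errors on x1 and x2 are large, so the individual estimates can
# land far from .5 and 2, and one of them typically looks insignificant.
```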
Solutions’ to multicollinearity
The available data has insufficient variation to identify
the separate effects of each X variable
common strategies are to either get more data or
to reduce the number of X variables
Drop some of the variables, or
Form some linear index that summarises the effect of
several of the variables
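A sketch of the two strategies on simulated data (same assumed data-generating process as in the earlier sketch, illustrative only):

```python
import numpy as np

# Compare R2 for: both regressors, dropping X2, and a linear index X1 + X2.
rng = np.random.default_rng(4)
n = 50
x1 = rng.normal(size=n)
x2 = 0.98 * x1 + np.sqrt(1 - 0.98**2) * rng.normal(size=n)
y = 0.5 * x1 + 2.0 * x2 + rng.normal(size=n)

def r_squared(X, y):
    b = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ b
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

ones = np.ones(n)
print(round(r_squared(np.column_stack([ones, x1, x2]), y), 3))   # both X1 and X2
print(round(r_squared(np.column_stack([ones, x1]), y), 3))       # drop X2
print(round(r_squared(np.column_stack([ones, x1 + x2]), y), 3))  # index X1 + X2
# All three R2 values are nearly identical: the data cannot separate the two
# effects, but either remedy preserves the overall explanatory power.
```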
Example: Dropping X2 From the
Regression on Artificial Data
             Coeff.    St. Error   t-Stat   P-val.   Lower 95%   Upper 95%
Inter.       .1667     .1041       1.601    .1160    -.0427      .3761
X1           2.227     .1788       12.454   1.E-16   1.867       2.586
Note that R2 = 0.76, so it is not reduced. Why?
The explanatory power of one variable is almost the same as that of the combination of the two.
The coefficient on X1 is not very close to its true value, but it is close to the sum of the coefficients on X1 and X2.
Why? The high correlation between X1 and X2 makes X1 behave almost like (X1 + X2), so its coefficient picks up both effects.
See further examples in Black Chap 15.4 and Koop
Chap 6