Transcript Document
Multicollinearity
Rudra P. Pradhan

Nature of the Multicollinearity Problem
Many finance and economic variables, especially time series variables, are closely related to each other. For example, population and GDP move together, i.e. they are highly correlated. In a multiple regression model, a regression coefficient measures the partial effect of that individual variable on Y when all other X variables in the model are held fixed. However, when two explanatory variables move closely together, we cannot assume that one is fixed while the other changes: when one changes, the other changes as well, because they are closely related. In such a case it is difficult to isolate the partial effect of a single X variable. This is the problem of multicollinearity.

Examples of Multicollinearity
Example 5.1 Housing Starts (Data4-3)
The number of housing units started (in thousands) depends on the mortgage interest rate (R), population (Pop, in millions) and GDP (in billions of 1982 dollars). Three models were formulated:

Model A: Hous_t = α1 + α2 R_t + α3 Pop_t + u_1t
Model B: Hous_t = β1 + β2 R_t + β3 Gdp_t + u_2t
Model C: Hous_t = γ1 + γ2 R_t + γ3 Pop_t + γ4 Gdp_t + u_3t

Estimates of Housing Starts (t-statistics in parentheses)

Variable   Model A            Model B            Model C
Const.     -3812.93 (-2.40)   687.90 (1.80)      -1315.75 (-0.27)
R          -198.4 (-3.87)     -169.66 (-3.87)    -184.75 (-3.18)
Pop        33.82 (3.61)       -                  14.90 (0.41)
Gdp        -                  0.91 (3.64)        0.52 (0.54)
Adj. R²    0.371              0.375              0.348

Both Pop and Gdp (the level of income) are expected to influence the number of housing units started for construction.
Result 1: In Model C, which has both Pop and Gdp, the coefficients have very low t-values and are insignificant. However, when Pop and Gdp enter the model alone, as in Models A and B, they are statistically significant.
Result 2: The coefficients for Pop and Gdp in Model C are very different from those in Model A and Model B, i.e. when they are present together, drastic changes occur in the estimates. This is because the three explanatory variables are highly correlated:
r(Gdp, Pop) = 0.99, r(Gdp, R) = 0.88, r(Pop, R) = 0.91.

Example 5.2 Car Maintenance (Data3-7)
E_t = cumulative expenditure at time t on maintenance for a given car
Miles_t = cumulative mileage in thousands of miles (expected sign: +)
Age_t = age of car in weeks since the original purchase (expected sign: +)

Model A: E_t = α1 + α2 Age_t + u_1t
Model B: E_t = β1 + β2 Miles_t + u_2t
Model C: E_t = γ1 + γ2 Age_t + γ3 Miles_t + u_3t

Estimates of the Car Maintenance Model (t-statistics in parentheses)

Variable   Model A           Model B           Model C
Const.     -626.24 (-5.98)   -796.07 (-5.91)   7.29 (0.06)
Age        7.35 (22.16)      -                 27.58 (9.58)
Miles      -                 54.45 (18.27)     -151.15 (-7.08)
df         55                55                54
Adj. R²    0.897             0.856             0.946

NOTE: Although the coefficient of Miles is positive in Model B, as expected, it is significant and negative in Model C, so there is a reversal of sign. The coefficient for Age has also changed drastically, and the t-statistics for Age and Miles are much lower in Model C. The reason for these large changes in the Model C estimates is the high correlation between Age and Miles, which is 0.996. So high correlation between explanatory variables can make regression coefficients insignificant or even reverse their signs.
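The comparisons in Examples 5.1 and 5.2 can be reproduced in a few lines of Python. The sketch below is illustrative only: the file name data4-3.csv and the column names hous, r, pop and gdp are assumptions standing in for however the Data4-3 set is actually stored, and statsmodels is used for the OLS fits.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical path and column names for the housing-starts data (Data4-3).
df = pd.read_csv("data4-3.csv")

# Pairwise correlations among the explanatory variables; values near 1
# (e.g. r(Gdp, Pop) = 0.99 in the text) signal multicollinearity.
print(df[["r", "pop", "gdp"]].corr())

# Fit the three competing specifications from Example 5.1.
model_a = smf.ols("hous ~ r + pop", data=df).fit()
model_b = smf.ols("hous ~ r + gdp", data=df).fit()
model_c = smf.ols("hous ~ r + pop + gdp", data=df).fit()

# Compare coefficients, t-statistics and adjusted R-squared across models.
# Under multicollinearity, pop and gdp are individually significant in
# Models A and B but weak and unstable when both appear in Model C.
for name, m in [("A", model_a), ("B", model_b), ("C", model_c)]:
    print(f"Model {name}: adj. R2 = {m.rsquared_adj:.3f}")
    print(pd.DataFrame({"coef": m.params, "t": m.tvalues}).round(2))
```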
5.2 Exact Multicollinearity
If two or more independent variables have an exact linear relationship between them, we have exact (or perfect) multicollinearity. In this case we cannot obtain the least squares estimates. Consider

Y_t = α1 + α2 X_t2 + α3 X_t3 + u_t

Let X_3 = 2 X_2, so that r_23 = 1. Then

Y_t = α1 + α2 X_t2 + α3 (2 X_t2) + u_t

and the estimable function is

Y_t = α1 + (α2 + 2α3) X_t2 + u_t

i.e. we cannot obtain estimates of α2 and α3 separately; only the combination α2 + 2α3 can be estimated. The assumption of the classical regression model that there is no exact linear relationship among the X variables is violated.

5.3 Near Multicollinearity
When the explanatory variables are closely related (but not exactly correlated) we can still obtain estimates of the coefficients. In this case the questions are:
1) What are the consequences of ignoring multicollinearity?
2) How do we identify the presence of the problem?
3) What are the remedies (solutions to the problem)?

Consequences of Ignoring Near Multicollinearity
The OLS estimators are still BLUE, so they are unbiased and efficient. However, although BLUE, the OLS estimators have large variances (standard errors). The confidence intervals are therefore wide, leading to acceptance of the "zero null hypothesis", and the t-values are low, suggesting that coefficients are insignificant or only weakly significant; i.e. the estimates are not precise. Multicollinearity may not affect the forecasting performance of a model. A small simulation sketch at the end of this section illustrates both the exact and the near case.

Identifying Multicollinearity
In practice, multicollinearity often shows up in a number of ways:
1) High R² with low values of t-statistics. As in Example 5.1, the F value for a group of coefficients may be significant while the coefficients are individually insignificant according to their t-statistics.
2) High values of the correlation coefficients. Usually the correlation between explanatory variables is high. However, multicollinearity may still be present even though the correlation between any two explanatory variables is not high; this can be the case in a multiple regression with more than three explanatory variables.
3) Regression coefficients sensitive to model specification. Addition or deletion of variables may drastically change the coefficient estimates. This is also an indication of a multicollinearity problem.
4) Regression coefficients sensitive to the deletion or addition of a few observations. Adding or deleting a few observations may also drastically change the coefficient estimates.

Identifying Multicollinearity - Formal Tests of Multicollinearity
The proposed formal tests are quite controversial. This is because multicollinearity is a problem of the data rather than of the population: it is a problem of insufficient information (data), not a problem of the population.

Solutions to the Multicollinearity Problem
No single solution exists.
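As a numerical illustration of Sections 5.2 and 5.3 above, the following Python sketch (with made-up data and parameter values chosen only for demonstration) shows that an exact linear relationship X3 = 2X2 makes X'X singular, while a near-exact relationship leaves OLS estimable but with inflated standard errors.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x2 = rng.normal(size=n)

# Exact multicollinearity: X3 = 2*X2, so X'X is singular and OLS cannot
# separate the coefficients of X2 and X3 (only a2 + 2*a3 is estimable).
x3_exact = 2 * x2
X = np.column_stack([np.ones(n), x2, x3_exact])
print("rank of X'X:", np.linalg.matrix_rank(X.T @ X), "columns:", X.shape[1])

# Near multicollinearity: X3 tracks 2*X2 closely but not perfectly.
x3_near = 2 * x2 + rng.normal(scale=0.05, size=n)
y = 1.0 + 0.5 * x2 + 0.5 * x3_near + rng.normal(size=n)
X_near = np.column_stack([np.ones(n), x2, x3_near])

beta, *_ = np.linalg.lstsq(X_near, y, rcond=None)
resid = y - X_near @ beta
sigma2 = resid @ resid / (n - X_near.shape[1])
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X_near.T @ X_near)))
print("estimates:", beta.round(2))
print("standard errors:", se.round(2))
# The estimates remain unbiased, but the standard errors on x2 and x3 are
# large relative to the coefficients, so the individual t-values are small.
```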
Multiple Regression
Previously we introduced how to do multiple regression and noted the difference in interpreting the coefficients, and then looked at omitted variable bias. Now we look at the problem of adding too many variables: multicollinearity.

How many variables to include?
A response to omitted variable bias might be to include every possible variable in the model. This is undesirable because:
- Including irrelevant variables increases the standard errors of the other variables, distorting confidence intervals and hypothesis tests.
- Including "too many" variables that measure the same concept can lead to multicollinearity.

Problem of high correlation
Intuition: if the explanatory variables are highly correlated with one another, the regression model has trouble telling which individual variable is explaining Y. In the extreme of exact linear relationships amongst the explanatory variables (e.g. X1 + X2 = 1), the model cannot be estimated. Why? The matrix X'X cannot be inverted.

Symptoms of Multicollinearity
- Individual coefficients may look statistically insignificant (low t-values), but the regression as a whole is significant (high R² and a significant F-statistic).
- High correlation amongst the explanatory variables (this may not always be apparent: with more than 2 explanatory variables, correlated linear combinations may occur).
- Coefficient estimates are "fragile" in the sense that small changes in the specification of the model (e.g. including or excluding a seemingly irrelevant variable) cause big changes in the estimated coefficient values.

Multicollinearity Example
Y = exchange rate. The explanatory variables are interest rates, with X1 = bank prime rate and X2 = Treasury bill rate. Using both X1 and X2 will probably cause a multicollinearity problem because the two interest rates move together.
Solution: include either X1 or X2, but not both. In some cases this "solution" will be unsatisfactory if it forces you to drop explanatory variables which economic theory says should be there.

Illustration of the Effect of Multicollinearity
True model: Y = 0.5 X1 + 2 X2 + e, with correlation between X1 and X2 equal to 0.98.

Variable   Coeff.   St. Error   t-Stat   P-val.   Lower 95%   Upper 95%
Inter.     .1662    .1025       1.579    .1211    -.0456      .3780
X1         2.084    .9529       2.187    .0338    .1667       4.001
X2         .1478    .9658       .1530    .8790    -1.795      2.091

R² = .76, and the P-value for the test of R² = 0 is 1.87E-15. The coefficient estimates are far from their true values of 0.5 and 2, and the coefficient on X2 is not significant.

"Solutions" to multicollinearity
The available data have insufficient variation to identify the separate effects of each X variable, so common strategies are either to get more data or to reduce the number of X variables:
- Drop some of the variables, or
- Form some linear index that summarises the effect of several of the variables.

Example: Dropping X2 from the Regression on Artificial Data

Variable   Coeff.   St. Error   t-Stat   P-val.   Lower 95%   Upper 95%
Inter.     .1667    .1041       1.601    .1160    -.0427      .3761
X1         2.227    .1788       12.454   1.E-16   1.867       2.586

Note that R² = 0.76, so it is not reduced. Why? The explanatory power of the single variable is almost the same as that of the combination of the two.
The coefficient on X1 is not very close to its own true value, but it is close to the sum of the coefficients on X1 and X2. Why? The high correlation between X1 and X2 made the model behave almost like a regression on (X1 + X2). See further examples in Black Chap. 15.4 and Koop Chap. 6.
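The artificial-data illustration above can be re-created along the following lines. This is a sketch, not the original experiment: the sample size, random seed and simulation details are assumptions, so the resulting numbers will differ from the tables shown, but the qualitative pattern (large standard errors in the full model, and a nearly unchanged R² after dropping X2) should reappear.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 60                                         # assumed sample size
x1 = rng.normal(size=n)
x2 = 0.98 * x1 + np.sqrt(1 - 0.98 ** 2) * rng.normal(size=n)   # corr(x1, x2) about 0.98
y = 0.5 * x1 + 2.0 * x2 + rng.normal(size=n)   # true model: Y = 0.5*X1 + 2*X2 + e

full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
drop_x2 = sm.OLS(y, sm.add_constant(x1)).fit()

# Full model: inflated standard errors on x1 and x2, so at least one of the
# individual t-statistics looks weak even though the overall fit is strong.
print(full.params.round(3), full.tvalues.round(2), round(full.rsquared, 2))

# After dropping x2: R-squared barely changes, and the slope on x1 moves
# toward the combined effect of the two correlated regressors.
print(drop_x2.params.round(3), drop_x2.tvalues.round(2), round(drop_x2.rsquared, 2))
```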