Multicollinearity is one of the more serious problems that can arise in a regression analysis, or even in a simple frequency and mean analysis with a causal interpretation. Regression analysis assumes a degree of independence between the explanatory factors. In practice, however, many of the explanatory variables are correlated with each other. Imagine you are trying to understand which factors are the leading causes of lung cancer. There could be a host of socioeconomic and demographic factors that are correlated with behaviors such as smoking. It may be that people who smoke are more likely to 1) work in factories with carcinogens, 2) have poorer diets, 3) live closer to toxic waste dumps or under power lines. How can one separate out the important causal risk factors for the development of lung cancer when all of these factors move together? This inseparability is the essence of what econometricians call the multicollinearity problem.
Fortunately, we don’t have to guess or speculate about the severity of this problem. One can use a regression analysis to quantify it. The statistic is called the Variance Inflation Factor (VIF), and it is a simple transformation of the R-squared from a regression of the suspected multicollinear variable on the other explanatory variables.
Here is the formal mathematical notation for the VIF in the context of a causal regression between smoking and lung cancer. This is the relationship we are interested in…
We suspect that smoking may be moving too closely with the other independent variables to get a good read on the causal relationship. To check, we can run a second regression in which smoking (the suspected collinear variable) is regressed on the other explanatory variables.
The equation for the VIF takes the R-squared from the second regression and applies the following algebraic transformation: VIF = 1 / (1 − R²).
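As a sketch of how the two regressions and the transformation fit together, the notation below may help. The symbols are illustrative assumptions rather than the original notation: cancer and smoke stand for the outcome and the suspect variable, and x_2 through x_k stand for the other explanatory variables.

```latex
\begin{align*}
% Regression of interest (symbol names are assumed for illustration)
\text{cancer}_i &= \beta_0 + \beta_1\,\text{smoke}_i + \beta_2 x_{2i} + \dots + \beta_k x_{ki} + \varepsilon_i \\
% Auxiliary regression: the suspect variable on the other explanatory variables
\text{smoke}_i &= \gamma_0 + \gamma_2 x_{2i} + \dots + \gamma_k x_{ki} + u_i \\
% VIF built from the R-squared of the auxiliary regression
\mathrm{VIF}_{\text{smoke}} &= \frac{1}{1 - R^2_{\text{smoke}}} \\
% Worked values
R^2_{\text{smoke}} = 0.80 &\;\Rightarrow\; \mathrm{VIF} = 5, \qquad
R^2_{\text{smoke}} = 0.90 \;\Rightarrow\; \mathrm{VIF} = 10
\end{align*}
```

In other words, a VIF of 5 means the other explanatory variables account for 80 percent of the variation in the suspect variable, and a VIF of 10 means they account for 90 percent.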
The rule of thumb is that if the VIF is greater than 5, there may be an issue with multicollinearity. If we would like to be generous, we can tolerate a VIF of up to 10. One must be careful, though, when including squared variables or interaction terms in a regression, because by construction there is a relationship between the linear and squared versions of the same variable. Interaction terms and their component variables also tend to be highly collinear. This is expected and generally should not be a cause for concern.
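The auxiliary-regression calculation and the squared-term caveat can be sketched in Python. This is a minimal illustration on synthetic data, not the dataset discussed in this article. The variable names (age, agesq, educ), the sample size, and the helper functions are all assumptions made for the example. It uses only the standard library and solves the normal equations directly, which is adequate for a small demonstration but not how a statistics package would do it.

```python
import random


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def r_squared(y, regressors):
    """R-squared from an OLS regression of y on the regressors plus an intercept."""
    n = len(y)
    cols = [[1.0] * n] + regressors
    k = len(cols)
    xtx = [[sum(ci[t] * cj[t] for t in range(n)) for cj in cols] for ci in cols]
    xty = [sum(ci[t] * y[t] for t in range(n)) for ci in cols]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(beta[j] * cols[j][t] for j in range(k)) for t in range(n)]
    mean_y = sum(y) / n
    ss_res = sum((y[t] - fitted[t]) ** 2 for t in range(n))
    ss_tot = sum((v - mean_y) ** 2 for v in y)
    return 1.0 - ss_res / ss_tot


def vif(columns):
    """VIF for each named column: regress it on all the others, then 1 / (1 - R^2)."""
    result = {}
    for name, col in columns.items():
        others = [c for other, c in columns.items() if other != name]
        result[name] = 1.0 / (1.0 - r_squared(col, others))
    return result


# Synthetic explanatory variables (assumed purely for illustration).
random.seed(0)
n = 500
age = [random.uniform(20, 70) for _ in range(n)]
educ = [random.gauss(13, 2.5) for _ in range(n)]  # drawn independently of age
data = {"age": age, "agesq": [a * a for a in age], "educ": educ}

vifs = vif(data)
for name, value in vifs.items():
    print(f"{name:>6}: VIF = {value:.1f}")
```

With these assumed inputs, age and agesq should come out well above the threshold of 10 simply because one is the square of the other, while educ, which is generated independently, should sit close to 1.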
Here is an example, with data on the number of cigarettes a person smokes and how that relates to demographics and income. The statistical program Stata has a built-in VIF command that can be run after a regression with suspected multicollinearity; that command is “vif” (“estat vif” in current versions).
As you can see, the VIF is high for the age and age squared variables, but that is expected. These variables are there to see whether the number of cigarettes a person smokes has a diminishing or otherwise non-linear relationship with age. What we would be concerned about are the education and lincome variables. Labor economists have made their careers studying the relationship between education and wages, so we would be more concerned with multicollinearity between these two. Using the VIF, we can see that education and income aren’t causing a troublesome amount of collinearity, so we can interpret their effects separately. This is the value that the VIF provides a researcher: confidence that the estimates are not being distorted by troublesome multicollinearity.
If troublesome multicollinearity does exist in your regression, please consult this excellent guide for methods and transformations to purge your analysis of this problem: