Simultaneous Equations: What is the relationship between income and smoking?

Data Source: Introductory Econometrics data sets

One can say that income helps determine how many cigarettes a person smokes, but it is also plausible that smoking, because of its negative health effects, reduces a person's wages. Whenever reverse causality is possible, coefficient estimates and their variances should be viewed with a healthy dose of skepticism. Imagine we looked at the correlation between the number of police officers and crime rates: we might see a high correlation and erroneously conclude that more police officers in a city cause more crime. Of course, the causal link between these two variables runs the other way, and a little common sense tells you that, for the most part, increases in crime lead to more police officers being hired. Police and crime rates is a fairly easy case in which to avoid confusing correlation with causality, but what about smoking and income? Do people with low incomes tend to smoke more, or do the negative health effects of smoking reduce earnings through lost productivity? One way to examine this problem is Two-Stage Least Squares (2SLS) for a system of simultaneous equations.

The first equation says that there is a relationship between how much income a person makes and the number of cigarettes they smoke, their level of education and their age. The second equation says that the number of cigarettes a person smokes is related to how much income a person makes, education, age, price of cigarettes and whether or not they live in a state which allows smoking at a restaurant.
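Written out, the system described above might look like the following. This is a sketch of the notation only: the symbols \beta, \gamma, u_1 and u_2 are assumptions introduced here, the variable names follow those used later in the text, and whether income enters the second equation in logs is also an assumption.

```latex
\begin{aligned}
\text{lincome} &= \beta_0 + \beta_1\,\text{cigs} + \beta_2\,\text{educ} + \beta_3\,\text{age} + u_1 \\
\text{cigs}    &= \gamma_0 + \gamma_1\,\text{lincome} + \gamma_2\,\text{educ} + \gamma_3\,\text{age}
                  + \gamma_4\,\text{lcigprice} + \gamma_5\,\text{restaurn} + u_2
\end{aligned}
```

The reverse-causality concern is that \gamma_1 is not zero, which makes cigs correlated with u_1 in the first equation.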

Two approaches to this problem:

1) Ignore the possibility of reverse causation and run an OLS regression.

2) Estimate the lincome equation as a simultaneous equation system via 2SLS. Ensure the order and rank conditions are met.
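To make the difference between the two approaches concrete, here is a minimal sketch on simulated data. It does not use the real data set: every coefficient, the `simulate` and `ols` helpers, and the "true" effect of -0.02 are assumptions chosen for illustration. 2SLS is done by hand in two steps with NumPy.

```python
import numpy as np

def simulate(n, rng, b1=-0.02, g1=-4.0):
    """Synthetic draw from a two-equation system (all numbers are made up)."""
    educ = rng.normal(12, 3, n)
    age = rng.uniform(18, 65, n)
    lcigprice = rng.normal(4.1, 0.1, n)
    restaurn = rng.integers(0, 2, n).astype(float)
    u = rng.normal(0, 0.5, n)   # error in the lincome equation
    v = rng.normal(0, 5.0, n)   # error in the cigs equation
    inc_part = 9.0 + 0.06 * educ + 0.01 * age + u
    cig_part = 190.0 - 0.3 * educ + 0.05 * age - 30.0 * lcigprice - 3.0 * restaurn + v
    # Solve lincome = inc_part + b1*cigs and cigs = cig_part + g1*lincome jointly.
    cigs = (cig_part + g1 * inc_part) / (1.0 - g1 * b1)
    lincome = inc_part + b1 * cigs
    return lincome, cigs, educ, age, lcigprice, restaurn

def ols(X, y):
    """Least-squares coefficients for y = X b + e."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(0)
lincome, cigs, educ, age, lcigprice, restaurn = simulate(5000, rng)
const = np.ones_like(cigs)

# Approach 1: OLS on the structural equation, ignoring reverse causation.
beta_ols = ols(np.column_stack([const, cigs, educ, age]), lincome)

# Approach 2: 2SLS. Stage 1 regresses cigs on all exogenous variables,
# stage 2 replaces cigs with its fitted values.
Z = np.column_stack([const, educ, age, lcigprice, restaurn])
cigs_hat = Z @ ols(Z, cigs)
beta_2sls = ols(np.column_stack([const, cigs_hat, educ, age]), lincome)

print("true effect of cigs:", -0.02)
print("OLS estimate:       ", beta_ols[1])
print("2SLS estimate:      ", beta_2sls[1])
```

In this simulation the OLS coefficient on cigs is pulled away from the assumed true value, while the 2SLS coefficient stays close to it. Note that the standard errors from a hand-run second stage are not valid 2SLS standard errors; a dedicated IV routine corrects them.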

Evaluate Two Approaches:

1) Use the Hausman test for endogeneity of the cigs variable.

2) Choose the model based on the hypothesis test on the reduced-form residuals when they are included as an explanatory variable in the structural equation.

1) Ignore the possibility of reverse causation and run an OLS regression.


cigs: For every additional cigarette a person smokes per day, their wages are predicted to be about 0.1% higher, so a person who smokes 10 cigarettes a day would have wages about 1% higher. This result is not statistically significant even at the 75% confidence level.

2) Estimate the lincome equation as a simultaneous equation system via 2SLS. Ensure the order and rank conditions are met.

Note that the order condition is met because the cigs equation contains two variables that are excluded from the lincome equation, lcigprice and restaurn, which represent the price of cigarettes in each state and a binary variable indicating whether or not a state allows smoking in restaurants. These are good instruments since it is pretty evident that these variables should not affect lincome directly. The rank condition can be verified by regressing the potentially endogenous variable, cigs, on all exogenous variables, including the instruments from the cigs equation, and testing whether the instruments are jointly significant.
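The rank-condition check just described can be sketched as a joint F-test in the first-stage regression. The data below are simulated, not the real data set, and the coefficients and the `ssr` helper are assumptions for illustration. The F statistic compares the first stage with and without the two instruments.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Synthetic stand-ins for the exogenous variables (made-up distributions).
educ = rng.normal(12, 3, n)
age = rng.uniform(18, 65, n)
lcigprice = rng.normal(4.1, 0.1, n)
restaurn = rng.integers(0, 2, n).astype(float)
cigs = (140 - 0.3 * educ + 0.05 * age
        - 30 * lcigprice - 3 * restaurn + rng.normal(0, 5, n))

def ssr(X, y):
    """Sum of squared residuals from an OLS fit of y on X."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    return float(resid @ resid)

const = np.ones(n)
X_restricted = np.column_stack([const, educ, age])                         # instruments left out
X_unrestricted = np.column_stack([const, educ, age, lcigprice, restaurn])  # full first stage

q = 2                                  # number of instruments being tested
df = n - X_unrestricted.shape[1]       # residual degrees of freedom
ssr_r = ssr(X_restricted, cigs)
ssr_u = ssr(X_unrestricted, cigs)
F = ((ssr_r - ssr_u) / q) / (ssr_u / df)

# With a large sample the 5% critical value of F(2, df) is about 3.00;
# a common rule of thumb for instrument strength is F > 10.
print("first-stage joint F statistic:", F)
print("instruments jointly significant at 5%:", F > 3.00)
```

A large F statistic means the null that both instruments have zero coefficients in the cigs equation is rejected, which is what the rank condition needs.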

The result of the joint test on the instrumental variables that will serve to tease out simultaneity is that we can reject the null hypothesis that they are uncorrelated with cigs. Hence, we can be confident that these are good IVs to use to estimate the real effect of smoking cigarettes on a person's income.


cigs: With the IV regression controlling for reverse causality in the simultaneous equation system, the coefficient on smoking is now negative. It can be interpreted as "after controlling for education and age, smoking one more cigarette a day reduces a person's wages by 4% over their lifetime". This doesn't make sense, because it would mean that someone who smokes a pack a day earns 80% less than someone who doesn't smoke. The statistical insignificance of cigs helps explain the puzzling result: despite controlling for education and age and accounting for reverse causality, the coefficient is not distinguishable from zero at the 95% confidence level.
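As a side note on the pack-a-day arithmetic: because the dependent variable is in logs, multiplying the coefficient by 20 is only a linear approximation, and it overstates large changes. The short calculation below compares the approximation with the exact log-level formula. The coefficient of -0.04 is an illustrative rounding of the 4% figure quoted above, not the estimated value, and 20 cigarettes per pack is an assumption.

```python
import math

beta_cigs = -0.04     # illustrative: roughly the 4% per-cigarette effect quoted above
cigs_per_pack = 20    # assumption: one pack holds 20 cigarettes

# Linear approximation: percent change ~ 100 * beta * change in cigs
approx_pct = 100 * beta_cigs * cigs_per_pack

# Exact change implied by a log-level model: 100 * (exp(beta * change) - 1)
exact_pct = 100 * (math.exp(beta_cigs * cigs_per_pack) - 1)

print(f"linear approximation:   {approx_pct:.1f}%")   # -80.0%
print(f"exact log-level change: {exact_pct:.1f}%")    # -55.1%
```

Even the exact figure of roughly 55% is implausibly large, so the point in the text stands: the estimate is too imprecise to be taken at face value.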

Which model is correct, OLS or 2SLS, to control for simultaneity?

Hausman Test

To answer this question we do a Hausman test for endogeneity in the form of simultaneity.

1) Regress the variable we suspect to be endogenous on all the exogenous variables, including the instruments from the cigs equation, then save the residuals.

2) Run an OLS regression of the structural equation with the residuals included and test the statistical significance of the coefficient on the residuals. If we can reject the null hypothesis that the coefficient is zero, we conclude that 2SLS is the better model; otherwise OLS is the better model.
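The two steps above can be sketched as follows. The data are simulated with simultaneity built in, so the test should reject here. The coefficients and the `simulate` and `ols_fit` helpers are assumptions for illustration and have nothing to do with the actual estimates.

```python
import numpy as np

def simulate(n, rng, b1=-0.02, g1=-4.0):
    """Synthetic two-equation system with reverse causality (made-up numbers)."""
    educ = rng.normal(12, 3, n)
    age = rng.uniform(18, 65, n)
    lcigprice = rng.normal(4.1, 0.1, n)
    restaurn = rng.integers(0, 2, n).astype(float)
    inc_part = 9.0 + 0.06 * educ + 0.01 * age + rng.normal(0, 0.5, n)
    cig_part = (190.0 - 0.3 * educ + 0.05 * age - 30.0 * lcigprice
                - 3.0 * restaurn + rng.normal(0, 5.0, n))
    cigs = (cig_part + g1 * inc_part) / (1.0 - g1 * b1)
    lincome = inc_part + b1 * cigs
    return lincome, cigs, educ, age, lcigprice, restaurn

def ols_fit(X, y):
    """Return OLS coefficients, conventional standard errors, and residuals."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, se, resid

rng = np.random.default_rng(2)
lincome, cigs, educ, age, lcigprice, restaurn = simulate(5000, rng)
const = np.ones_like(cigs)

# Step 1: reduced form for cigs on all exogenous variables; keep the residuals.
Z = np.column_stack([const, educ, age, lcigprice, restaurn])
_, _, v_hat = ols_fit(Z, cigs)

# Step 2: structural equation with the reduced-form residuals added.
X = np.column_stack([const, cigs, educ, age, v_hat])
beta, se, _ = ols_fit(X, lincome)
t_resid = beta[-1] / se[-1]

# Two-sided test at the 5% level using the large-sample critical value 1.96.
print("t statistic on the residual:", t_resid)
print("reject exogeneity of cigs (prefer 2SLS):", abs(t_resid) > 1.96)
```

A significant coefficient on the residual indicates that cigs is correlated with the error in the lincome equation, which is the case in which OLS is inconsistent and 2SLS is preferred.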


Since we can reject the null hypothesis that the coefficient on the residual is zero, we conclude that endogeneity exists in the form of simultaneity and that the 2SLS estimate is the more accurate one. Income is highly correlated with education and age. The estimated effect of the number of cigarettes a person smokes on income is negative, but it is not statistically different from zero. On this evidence, one can say that smoking has no detectable impact on a person's wages and that bad health as a result of smoking has no detectable effect on earnings.