Who Cheats on their Spouse and What Makes Marriages Happy? – Multinomial Logit and Probit Regression Analysis



In 1978, Ray C. Fair of Yale University extended the microeconomic framework to study why people have affairs. The paper builds an econometric model grounded in utility theory. In economics, utility theory describes how people consume goods and services based on their costs, the price and availability of substitute and complementary goods, and the well-being people derive from consuming multiple goods. Fair’s paper extends this analysis to the consumption of multiple relationships (i.e., affairs).

The goal of this post is to estimate a probability model, similar to the one Fair (1978) used in his paper, that quantifies which socioeconomic factors affect the probability of an affair among married couples. This post also extends Fair’s work by analyzing which factors increase happiness in a marriage, something he found to have a significant influence on whether or not a person had an extramarital affair.


Original Paper (Fair 1978)

Data Affairs (Fair 1978)



The data are survey data from Psychology Today (1969) and Redbook (1974) magazines. The variables in the data, with their names and labels, are as follows.



The probit regression model assumes that the regression errors are normally distributed. More formally, the error term is distributed standard normal, so the probability of an affair is P(Y = 1 | X) = Φ(Xβ), where Φ is the standard normal CDF.

The results from this regression model are in the table below, followed by an explanation of the results.
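To make the link function concrete, here is a minimal sketch of how a probit probability is computed from the linear index. The coefficients and covariate values are made up for illustration; they are not Fair’s estimates.

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Illustrative (made-up) coefficients and covariate values -- NOT Fair's estimates
xb = -0.7 + (-0.3) * 4.0 + 0.05 * 10.0   # linear index X*beta
p_affair = norm_cdf(xb)                  # P(affair = 1 | X)
print(round(p_affair, 3))
```

The key point is that the coefficients enter only through the linear index, and the normal CDF maps that index into a probability between 0 and 1.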

Probit Regressions

Probit Graph


You can also get these results using SPSS; however, SPSS does not calculate marginal effects. Here are the coefficient estimates and the syntax to replicate the STATA results above:

SPSS Probit Results

plum affair BY kids male WITH occup educ ratemarr age yrsmarr relig
/link = probit
/print = parameter summary.





The probability of having an affair is greatly reduced if a person rates their marriage as happier. A person who rates their marriage as ‘very happy’ rather than ‘average’ on the questionnaire is 54% less likely to have an affair.


People who self-identify as more religious are also less likely to have affairs. People who describe themselves as “very religious” rather than “slightly religious” are 38% less likely to have an affair.


Older people are less likely to have extramarital affairs. Every ten years a person’s likelihood of having an affair decreases by 20%. This means that a person who is 30 years old is 20% less likely to have an affair than a person who is 20 years old.

Years Married

The longer people are married the more likely they are to have an affair. A person who has been married 10 years is 25% more likely to have an affair than someone who has been married 5 years.

Statistically insignificant factors…

Education, occupation, sex, and the presence of children are all statistically insignificant when it comes to the likelihood of someone having an affair.

A word of caution about interpreting marginal effects after a probit regression: marginal effects are linearized estimates of a non-linear model evaluated at the sample means. Inferences about the probability of an affair become less accurate the farther a person’s characteristics are from the “X” values shown in the “mfx” table above.
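To sketch how those marginal effects are produced: for a probit, the marginal effect of a continuous regressor evaluated at the means is the standard normal density at the linear index times the coefficient. The values below are illustrative assumptions, not the estimates from the table.

```python
import math

def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

# Illustrative values (assumptions, not the estimates from the table above)
beta_age = -0.02        # hypothetical probit coefficient on age
xb_at_means = -0.8      # hypothetical linear index evaluated at the sample means
mfx_age = norm_pdf(xb_at_means) * beta_age   # marginal effect of age at the means
```

Because the density term changes with the linear index, the same coefficient implies different marginal effects for people far from the sample means, which is exactly the caution raised above.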


Given the importance of marital happiness in determining whether or not a person has an affair, further exploration beyond what Fair did in his paper seems warranted. I estimate a multinomial logit regression, although one could argue that an ordered logit would be more appropriate given the ordinal nature of the marriage rating variable (5 is best and 1 is worst).

Multinomial Logit

The results from the multinomial logit model above show the relative risk ratio (rrr) of marital happiness based on socioeconomic factors. The baseline category is people who are very dissatisfied with their marriage (1), and each of the other possible answers (2-5) is modeled against it (5 not shown). An rrr of 1 implies an equal likelihood of being in the base category. To interpret these results, look at the second table: the rrr for kids is 6.03, so people with kids are roughly six times more likely to say they are “somewhat unhappy” with their marriage (2) than to say they are very unhappy (1, the base).
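The rrr is simply the exponentiated multinomial-logit coefficient. For example, the reported kids rrr of 6.03 corresponds to a raw coefficient of roughly 1.80 (backed out here, approximately):

```python
import math

b_kids = 1.797               # implied raw multinomial-logit coefficient (approximate)
rrr_kids = math.exp(b_kids)  # relative risk ratio vs. the base category (~6.03)
```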

Overall, the statistically significant drivers of happiness are the length of the marriage and the presence of children. Having children increases a person’s chances of moving from very unhappy (1) to somewhat unhappy (2), but no effect is seen beyond that. In other words, kids increase marital happiness only if a person is miserable. Being in a relationship longer appears to decrease happiness regardless of the level of happiness that exists in the marriage. In other words, satisfaction can decrease over time in even the happiest of marriages, and even the most dissatisfying marriages don’t appear to increase in satisfaction over time.


It is important to note that Fair’s paper was a catalyst for much research. Although the paper is based on solid economic theory and tried-and-true econometrics, some conceptual issues were a source of criticism. These include:

1) Survivorship Bias – If affairs break up marriages, then the estimated relationship between years married and the likelihood of an affair suffers from survivorship bias.

2) Sampling Error – People who took the survey are more likely to be aware of their own psychological health. If these people are intrinsically more likely to cheat, and also more likely to exhibit other socioeconomic characteristics related to having an affair, bias is introduced through the sampling process.

3) Measurement Error – People may be hesitant to admit to affairs. If the people who underreport also tend to be more religious, then we may be overstating the impact of religion on the probability of an affair because people lied on the survey. The same can be true for other variables (e.g., occupation).

There are probably a few other criticisms out there, but despite the potential issues with Fair’s analysis, it started a very interesting conversation about extramarital affairs among economists. Affairs destroy relationships and break up families, so there are strong social costs to affairs beyond the damage they cause to individual relationships. In the end we all want to be happy, and increasing happiness is a worthy goal for economists in my opinion.

Quantifying Multicollinearity


Multicollinearity is one of the more serious problems that can arise in a regression analysis, or even in a simple frequency or mean analysis with a causal interpretation. Regression analysis assumes a degree of independence among the explanatory factors; in practice, however, many of the explanatory variables are correlated with each other. Imagine you are trying to understand which factors are leading causes of lung cancer. A host of socioeconomic and demographic factors can be correlated with behaviors such as smoking. It may be that people who smoke are also more likely to 1) work in factories with carcinogens, 2) have poorer diets, or 3) live closer to toxic waste dumps or under power lines. How can one separate out the important causal risk factors for lung cancer when all of these factors move together? This inseparability is the essence of what econometricians call the multicollinearity problem.

Fortunately, we don’t have to guess or speculate about the severity of this problem. One can use an auxiliary regression to quantify it precisely. The resulting statistic is called the Variance Inflation Factor (VIF), and it is computed from the R-squared of a regression of the suspected collinear variable on the other explanatory variables.

Here is the formal setup for the VIF in the context of a causal regression between smoking and lung cancer. The relationship we are interested in is:

cancer = β0 + β1·smoking + β2·(other factors) + ε
We suspect that smoking may be moving too closely with the other independent variables to get a good read on the causal relationship. We can simply estimate a second regression where we regress smoking (the suspected collinear variable) on the other explanatory variables:

smoking = α0 + α1·(other factors) + u
The equation for the VIF takes the R-squared from this second regression and applies the following algebraic transformation:

VIF = 1 / (1 − R²)
The rule of thumb is that if the VIF is greater than 5, there may be an issue with multicollinearity; if we want to be generous, we can tolerate a VIF of up to 10. One must be careful, though, when including squared variables or interaction terms in a regression, because by definition there is a relationship between the linear and squared versions of the same variable. Interaction terms and their component variables also tend to be highly collinear. This is expected and generally should not be a cause for concern.

Here is an example, with data on the number of cigarettes a person smokes and how that relates to demographics and income. STATA has a built-in command that one can use after running a regression with suspected multicollinearity; that command is “vif”.
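To make the calculation concrete, here is a small simulation sketch in Python/NumPy (the data are simulated, not the cigarette data used below) that reproduces the pattern described: a squared term produces a high VIF by construction, while mildly correlated regressors like education and income do not.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
age = rng.normal(40, 10, n)
educ = rng.normal(12, 3, n)
# income moves only mildly with education in this simulated data
lincome = 0.1 * educ + rng.normal(10.0, 1.0, n)

def vif(target, others):
    """Regress `target` on `others` (plus a constant) and return 1 / (1 - R^2)."""
    X = np.column_stack([np.ones(len(target))] + list(others))
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    r2 = 1 - resid.var() / target.var()
    return 1.0 / (1.0 - r2)

print(vif(age, [age ** 2]))       # high, as expected for a squared term
print(vif(educ, [lincome, age]))  # low: education and income are not troublesome here
```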



cigarettes data




As you can see, the VIF is high for the age and age-squared variables, but that is expected; these variables are there to test for a diminishing return/non-linearity in the number of cigarettes a person smokes with respect to age. What we would be concerned about are the education and lincome variables. Labor economists have made their careers studying the relationship between education and wages, so multicollinearity in these variables would be more worrying. Using the VIF, we can see that education and income aren’t causing a troublesome amount of collinearity, so we can interpret their impacts independently of each other. This is the value the VIF provides a researcher: the confidence to claim estimates free of multicollinearity.

If troublesome multicollinearity does exist in your regression, please reference this excellent guide for methods and transformations to purge your analysis of this problem:


Good luck.

Programming in R: Modelling Investment Portfolios with Matrix Algebra


Investment portfolios are collections of investments.  These investments can be anything, including real estate, merchandise inventory, or a collection of businesses in a multinational corporation; however, the term is most commonly used to describe investments in stocks and bonds in financial markets. Whatever the context, a portfolio is a collection of assets purchased at a certain price, held for a certain time, possibly providing income (or incurring costs) during the holding period, and then sold for a profit or loss.

Matrix algebra is a branch of mathematics that is often used to model investment portfolios.  The goal of this post is to introduce the use of matrix algebra, via the programming language R, to solve commonly asked questions about investment portfolios in stocks.  The expected return and the riskiness of the portfolio will be analyzed both analytically and computationally.

1. Vectors and Matrix Definition

The following example uses 3 assets but could easily be extended to a many-asset representation of the portfolio problem. The following notation is used to represent the asset returns, their joint normal distribution, expected returns, variances of returns, and covariances of returns.

R_i represents the asset return for investments A, B, and C. The returns are distributed as a multivariate normal: μ_i is the expected return for asset i, σ²_i is the variance of returns for asset i, and σ_ij is the covariance of returns between assets i and j. The share of wealth invested in assets A, B, and C is represented by x_i. The notation below uses these definitions to construct the vectors and the variance-covariance matrix used to further define the model for portfolio returns.

R = (R_A, R_B, R_C)′,  μ = (μ_A, μ_B, μ_C)′,  x = (x_A, x_B, x_C)′

Σ = [ σ²_A   σ_AB   σ_AC
      σ_AB   σ²_B   σ_BC
      σ_AC   σ_BC   σ²_C ]

2. Portfolio Returns and Expected Returns

A portfolio’s return is the weighted average of the individual returns.  The weights are the shares of wealth invested in each asset, each multiplied by the return on that asset. One can represent the returns and the expected returns with vectors in the following way:

R_{p,x} = x′R = x_A·R_A + x_B·R_B + x_C·R_C,  with expected return μ_{p,x} = x′μ

R_{p,x} represents the return to portfolio p given a certain allocation of wealth x among the assets in portfolio p; μ_{p,x} represents the expected portfolio return given the same assets and allocation of wealth.

3. Portfolio risk or variance

The variance of a portfolio can also be written in vector/matrix notation. Recall that when multiplying a vector by itself one must transpose one of the factors. The matrix calculation is shown below with vectors and matrices, followed by the familiar variance formula written out in non-matrix form.

σ²_{p,x} = x′Σx = x²_A·σ²_A + x²_B·σ²_B + x²_C·σ²_C + 2·x_A·x_B·σ_AB + 2·x_A·x_C·σ_AC + 2·x_B·x_C·σ_BC

4. Modelling in R

The next step is to represent the model programmatically in R using vectors and matrices.  The following code creates a vector containing the returns of the 3 assets: 1%, 4%, and 2%.

[R code screenshot: the vector of asset returns, c(0.01, 0.04, 0.02)]

Next the variance covariance matrix sigma and the share of wealth invested in assets needs to be programmed into the system.

[R code screenshot: the variance-covariance matrix sigma and the vector of wealth shares x]

The final step is to calculate the expected portfolio return and the variance of the portfolio using cross products, matrix transposes, and matrix multiplication.

[R code screenshot: expected portfolio return and variance via crossprod and matrix multiplication]

These calculations suggest that the expected return on this portfolio is 2.3% with a variance of .0048.  Using the code above, one can experiment with how different assets and asset allocations affect the risk and reward of the portfolio.  In a later post we will see how to obtain minimum-variance portfolios by allocating shares in a way that reduces risk while maximizing return, as well as other optimization techniques.
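For readers without R at hand, the workflow above can be sketched in Python/NumPy. The returns and the reported results (2.3% expected return, .0048 variance) come from the post; the equal one-third wealth shares and the covariance values are assumptions chosen to be consistent with those numbers.

```python
import numpy as np

mu = np.array([0.01, 0.04, 0.02])   # expected returns for assets A, B, C
x = np.array([1, 1, 1]) / 3         # assumed equal wealth shares (not shown in the post)
# Covariance matrix: illustrative values chosen to reproduce the reported numbers
sigma = np.array([[0.0100, 0.0026, 0.0010],
                  [0.0026, 0.0150, 0.0030],
                  [0.0010, 0.0030, 0.0050]])

exp_ret = x @ mu          # x' mu      -> expected portfolio return
var_p = x @ sigma @ x     # x' Sigma x -> portfolio variance
print(round(exp_ret, 3), round(var_p, 4))
```

Swapping in different values of x shows how reallocating wealth trades expected return against variance.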

Stated Intentions and Demographics on New Product Purchase Forecasting

Using survey data on intentions to purchase personal computers (PCs) and demographic variables, Hsiao, Sun and Morwitz (2002) find that:

  1. There is a remarkably stable relationship between intentions and purchases over time.
  2. Stated intentions do not directly represent true intentions.  A better representation of true intentions is a weighted average of stated intentions.
  3. Family, education, and demographic variables are complementary to intentions in predicting purchasing behavior. Demographic variables should be used to improve predictive power.
  4. Modelling intentions based on demographics, and then using that output to model purchases, is difficult to do accurately. There is value in asking potential customers about their intentions directly.
  5. There is significant evidence that exogenous shocks (being fired, a death in the family, etc.) lead to changes in intentions, and these are not captured by the socio-demographic factors in their study. Further research is needed on how to model them.

The purpose of this post is to introduce the logic and the models used to arrive at these conclusions.  Please note that these findings may not be externally valid for other products, because this study focused on the intent to purchase personal computers.  The findings may differ for other types of products; having said that, the models below can be leveraged to test these kinds of hypotheses about intentions and demographic influences on new product purchases.


The first model assumes that information shapes people’s true intentions and that these intentions translate into a true response (i.e., purchase).  It is assumed that people’s true intentions are influenced by socio-demographic variables.  True population variables are indicated with an asterisk, while observed variables lack asterisks.  These postulations can be described mathematically as:

In order to build a probability model, one must code crossing the purchase threshold as a binary response.  In this construct it is irrelevant whether a person purchased one product or several: any positive quantity is coded as a 1, and no purchase (or a return of an item purchased outside the observation period) is coded as a zero.  Furthermore, this first model assumes that stated intentions and demographic information can help one determine true intentions; if stated intentions equal true intentions, then the demographic information is irrelevant.  This can be stated more formally with a probability function (F) as follows:


The second model relaxes the assumption that people’s stated intentions equal their actual intentions.  Research has shown that asking people questions like “How likely are you to buy product X in the next 6 months?” on a 5-point scale (definitely will buy = 5; definitely will not buy = 1) can be a better measure of true intentions. Further research has shown that people tend to understate low intentions and overstate high intentions.  This second model assumes that true intentions are a weighted average over the stated-intention scale; formally, this model looks similar to model one, but with the following modelling of stated intentions:
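As a sketch of the weighting idea (the weights here are invented for illustration, not the paper’s estimates), true intentions can be modeled as an adjusted version of the stated 5-point answer, shrinking extreme answers toward the middle:

```python
# Hypothetical shrinkage weights mapping the 5-point stated-intention scale to
# an assumed true purchase probability: low answers are adjusted up, high ones down.
weights = {1: 0.15, 2: 0.30, 3: 0.50, 4: 0.70, 5: 0.85}

def true_intention(stated):
    """Return the assumed true intention behind a stated 5-point answer."""
    return weights[stated]

# A "definitely will buy" answer is treated as less than certain,
# and a "definitely will not buy" answer as more than zero:
print(true_intention(5), true_intention(1))
```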


The third model uses family, education, and demographic variables along with a binary response variable for intentions.  It has been postulated that people may give their best point prediction of a future event when answering questions such as “Do you intend to buy a certain product in the next so many months?”  In other words, they answer the question as a statistician would: they estimate the probability of the event happening, then translate that into a “yes” if their probability of actually buying the product is greater than 50%, based on their assessment of their life at the time of the survey.  Under these assumptions, the probability of purchasing given true intentions and demographic information is equivalent to the probability of purchasing given true intentions alone, and demographics would add nothing.  But since we observe only stated intentions, which give only a probabilistic assessment of true intentions, we need to state things differently.  Theoretically, models that omit demographic information are suboptimal; likewise, models that ignore stated intentions would lack explanatory power.  Hence both sets of variables are used in this model:


In addition to differences between stated intentions and purchases, there can be factors influencing purchases that may or may not be independent of demographics.  Intentions can change due to several factors outside the scope of the survey response (a raise, a promotion, being fired, price changes, etc.).  This model accounts for shocks to individuals’ intentions based on these factors:

The Least Squares Assumptions



This post presents the ordinary least squares (OLS) assumptions. These assumptions are critical for understanding when OLS will and will not give useful results. The objective of this post is to define the assumptions of ordinary least squares; another post will address methods to identify violations of these assumptions and potential solutions for dealing with them.

ASSUMPTION #1: The conditional distribution of the error term, given a level of the independent variable X, has a mean of zero.

This assumption states that the OLS regression errors will, on average, equal zero. The assumption still allows for over- and underestimates of Y, but the OLS predictions will fluctuate around Y’s actual value.

ASSUMPTION #2: (X_i, Y_i), i = 1, …, n, are independently and identically distributed.

The second assumption requires that the observations of X and Y are not systematically biased. Randomly selected samples of X and Y are typically considered independent and identically distributed. This assumption is essential when the regression analysis aims to examine the effect of a treatment X on an outcome Y: if the treatment is not randomly assigned, there is no guarantee that X is causing Y. Imagine evaluating the success of a program that provides job training to prisoners. If participation is voluntary, treatment X is likely not randomly assigned. If married first-offenders with children are more likely to participate in the program and are also more likely to succeed in the job market after prison, then (X, Y) are not independently and identically distributed, which violates this assumption.

ASSUMPTION #3: Large outliers are unlikely.

Outliers are values in the data that lie far outside the range of the rest of the data. The presence of large outliers can make the regression results misleading. OLS weighs each (X, Y) pair equally; thus, an outlier can significantly affect the slope and intercept of the regression line.

ASSUMPTION #4: No perfect multicollinearity.

Multicollinearity occurs in multiple regression analysis when one of the independent variables is a linear combination of the others. This correlation between inputs makes estimation of the individual regression parameters impossible. Fundamentally, one is asking the regression analysis to answer an unanswerable question: the effect of a variable X on another variable Y while holding constant a third variable Z that is a linear combination of X.
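A minimal numeric illustration of why perfect multicollinearity breaks estimation (the data here are made up): when one column of the design matrix is an exact linear combination of the others, X′X loses rank and cannot be inverted, so the OLS normal equations have no unique solution.

```python
import numpy as np

# Design matrix whose last column is an exact linear combination of the others:
# x3 = 2*x1 + 3*x2 -- perfect multicollinearity by construction.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
x3 = 2 * x1 + 3 * x2
X = np.column_stack([np.ones(5), x1, x2, x3])

rank = np.linalg.matrix_rank(X.T @ X)
print(rank)  # 3, not 4: X'X is singular, so the parameters are not identified
```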


Election Outcomes and Economic Performance

The following links contain the dataset and the STATA program used to generate the econometric estimates found in this post.

Data: Election Outcomes and Economic Performance (1996)
STATA Program: Election Outcomes and Economic Performance

This blog post replicates Fair’s (1996) analysis of the relationship between economic performance and election outcomes.  The regression model is used to predict the likelihood that President Obama is re-elected, based on several important political and economic variables identified in the analysis. Outlined below are the variables, data, and regression model predictions.  These estimates suggest that President Obama will lose the next election by a narrow margin, receiving 49.4% of the popular vote.


The model uses several variables to predict the percentage of votes going to the Democratic presidential candidate (demvote) based on important political and economic factors.  The variables are:

incum = takes on a value of 1 if a Democrat is the incumbent and -1 if the incumbent is a Republican
partyWH = takes on a value of 1 if a Democrat is in the White House and -1 otherwise
gnews = number of quarters, from the first 15 quarters of the incumbent presidency, where per capita output growth was above 2.9%
inf = average annual inflation rate over the first 15 quarters of the incumbent presidency

There are also several interaction terms between the variable partyWH and gnews and inf that are used in the regression model.


The summary statistics show that Democrats have received a little more than 49% of the vote, on average, in the last 21 presidential elections.  This has ranged between 36 and 62% over the period under observation.


The model is a simple linear regression using the variables presented in the summary statistics.


The model explains about 65% of the variability in the percentage of votes going to Democrats in a presidential election in the United States.  There is a small and statistically insignificant reduction in votes going to Democrats when they control the White House.  Democratic candidates have a 5-percentage-point advantage when they are the incumbents during the election.  The interaction between Democratic control of the White House and good economic news is favorable for Democrats: every quarter of good economic growth in the first 15 quarters of a Democratic presidency translates into nearly a 1-percentage-point advantage in the next election.  The effect of inflation during the first 15 quarters of a Democratic presidency is nearly the opposite of robust economic growth, decreasing the percentage of Democratic votes by 0.8 percentage points for every 1% increase in average inflation under a Democratically controlled White House.


Using this model, Fair (1996) was able to predict Bill Clinton’s presidential re-election within 4 percentage points.  Given current conditions (a Democratic president in the White House running for his second term, zero quarters with economic growth above 2.9%, and an average inflation rate of 1.7%) and relying on the model estimated above, one would predict that President Obama will receive 49.4% of the popular vote. According to these estimates, Obama would lose the popular vote, and most likely the election, if economic conditions don’t improve.


Is this prediction statistically different from a tie?  Could including unemployment change the overall model predictions, given the abnormally high levels we are currently experiencing?  I will be looking into these questions shortly.