The Least Squares Assumptions

This post presents the ordinary least squares (OLS) assumptions. These assumptions are critical for understanding when OLS will and will not give useful results. The goal here is to define the assumptions; a follow-up post will cover how to identify violations of them and potential remedies when violations occur.

ASSUMPTION #1: The conditional distribution of the error term, given a value of the independent variable X, has a mean of zero.

This assumption states that the OLS regression errors are, on average, equal to zero. It still allows for over- and underestimation of Y, but the OLS estimates will fluctuate around Y's actual value.
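Written out in standard regression notation (the post itself does not display a formula), the assumption is:

```latex
% Linear regression model and the zero-conditional-mean assumption,
% in standard textbook notation (not spelled out in the original post).
Y_i = \beta_0 + \beta_1 X_i + u_i, \qquad i = 1, \dots, n,
\qquad \text{with} \qquad E\left(u_i \mid X_i\right) = 0 .
```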

ASSUMPTION #2: (Xi, Yi) for i = 1, ..., n are independently and identically distributed.

The second assumption states that the observations of X and Y are not systematically biased. Randomly selected samples of X and Y are typically considered independent and identically distributed. This assumption is essential when the regression analysis aims to examine the effect of a treatment X on an outcome Y. If the treatment is not randomly assigned, there is no guarantee that X is causing Y. Imagine evaluating a program that provides job training to prisoners in order to assess its success. If enrollment is voluntary, treatment X is likely not randomly assigned. If married first-time offenders with children are more likely to participate in the program and are also more likely to succeed in the job market after prison, then X is not independently and identically distributed, which violates this assumption.
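In formal terms (again using standard notation rather than anything displayed in the post), the assumption reads:

```latex
% The i.i.d. sampling assumption in standard notation.
(X_i, Y_i), \; i = 1, \dots, n, \ \text{are i.i.d. draws from their joint distribution.}
```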

ASSUMPTION #3: Large outliers are unlikely.

Outliers are observations that lie far outside the range of the rest of the data. The presence of large outliers can make the regression results misleading. OLS weights each (X, Y) pair equally; thus, a single outlier can substantially change the slope and intercept of the regression line.
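As a rough illustration (this sketch is not from the original post; the simulated data and numbers are made up), the following Python snippet shows how a single extreme observation can pull the estimated slope well away from the true value:

```python
import numpy as np

# Hypothetical illustration: one large outlier can substantially move the OLS fit.
rng = np.random.default_rng(0)

x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)  # true slope is 0.5

# OLS fit (degree-1 polynomial) on the clean data.
slope_clean, intercept_clean = np.polyfit(x, y, 1)

# Add one extreme, high-leverage outlier and refit.
x_out = np.append(x, 10.0)
y_out = np.append(y, 100.0)  # far outside the range of the other y values
slope_out, intercept_out = np.polyfit(x_out, y_out, 1)

print(f"slope without the outlier: {slope_clean:.2f}")  # close to 0.5
print(f"slope with one outlier:    {slope_out:.2f}")    # pulled far from 0.5
```

Because every observation gets equal weight in the least squares criterion, the fitted line bends toward the outlier rather than ignoring it.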

ASSUMPTION #4: No perfect multicollinearity.

Perfect multicollinearity occurs in multiple regression analysis when one of the independent variables is an exact linear combination of the others. This relationship between the inputs makes estimating the individual regression parameters impossible. Fundamentally, one is asking the regression analysis to answer an unanswerable question: what is the effect of a variable X on an outcome Y, holding constant a third variable Z that is a linear combination of X?
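A minimal sketch of what goes wrong numerically (again, an assumed example rather than anything from the post): when one regressor is an exact multiple of another, the design matrix loses a rank, and the X'X matrix that OLS must invert becomes singular.

```python
import numpy as np

# Hypothetical illustration: perfect multicollinearity makes X'X singular,
# so the OLS coefficients for the collinear regressors are not identified.
rng = np.random.default_rng(1)

n = 100
x1 = rng.normal(size=n)
x2 = 3.0 * x1                       # x2 is an exact linear combination of x1
X = np.column_stack([np.ones(n), x1, x2])

print("columns in X:", X.shape[1])                  # 3
print("rank of X:   ", np.linalg.matrix_rank(X))    # 2 -> one column is redundant

# The condition number of X'X is effectively infinite, so (X'X)^{-1} is unusable.
print("condition number of X'X:", np.linalg.cond(X.T @ X))
```

Most statistical packages will either drop one of the collinear variables or refuse to estimate the model when they detect this.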

About the Author

Portrait of JJ Espinoza by Charles Ng | Time On Film Photography

JJ Espinoza is a Senior Full Stack Data Scientist, Macroeconomist, and Real Estate Investor. He has over ten years of experience working at some of the world’s most admired technology and entertainment companies. JJ is highly skilled in data science, computer programming, marketing, and leading teams of data scientists. He double-majored in math and economics at UCLA before earning his master’s in economics, focusing on macroeconometrics and international finance.

Connect with JJ on LinkedIn or Twitter.