One of the difficulties in assessing the quality of an econometric or regression model is determining whether any of the key regression assumptions have been violated. Regression analysis rests on several key assumptions that must hold for its results to actually be in accordance with reality. In regression analysis one is trying to measure the impact of certain variables on an outcome that we are interested in understanding or influencing, and in order to determine with a fair degree of accuracy how strong these relationships are, a few assumptions must be made. If any of these assumptions are violated, then the precision of the estimates comes into question. The goal of this post is to explain what these assumptions are and, most importantly, how to test for and potentially correct violated regression assumptions in order to obtain the most accurate measure of the phenomenon we are trying to study. In particular, this post will focus on outliers; subsequent posts will address other issues that can arise in regression analysis.
Outliers are observations that have a particularly large influence on the mean or average of a set of numbers. Regression, after all, is just an algorithm for estimating a conditional mean: the average impact of one variable on another. Typically, when people speak of outliers they are talking about a one-dimensional outlier, for example a really high-priced home. However, regression analysis is multidimensional in nature, so a home being really high priced might not be an issue given the number of bedrooms, bathrooms, location, neighborhood amenities, and so on. When an economist talks about an outlier, they are referring to a value that, even after accounting for the major driving factors, is still inexplicably large or small. It could be the result of a data error or simply an anomaly. Here is how one can test for outliers.
#Loads the library needed for the analysis along with data on public school expenditures, then drops observations with missing values. The PublicSchools data (per capita expenditure and income by US state) ships with the sandwich package.
library(sandwich)
data("PublicSchools")
ps <- na.omit(PublicSchools)
#One way of quickly examining outliers is to plot a scatter plot.
plot(Expenditure ~ Income, data = ps, ylim = c(230, 830))
ps_lm <- lm(Expenditure ~ Income, data = ps)
#This method may work fine with a simple regression, but if you have a multiple regression then plotting is less useful. The alternative is to plot the standardized residuals against a statistic called leverage, which measures the influence of a point on the slope of the regression line. One popular measure of influence is Cook's distance, which is defined as

D_i = Σ_j (Ŷ_j − Ŷ_j(i))² / (p · MSE)
This can be thought of as the sum of the squared differences in the predicted outcomes Ŷ when observation i is deleted, divided by the mean squared error of the regression multiplied by the number of parameters estimated in the model. In other words, the higher this number, the more influential an observation is, and this is measured by how much a model's estimates change relative to how variable the regression estimates are naturally. Here is Cook's distance graphed alongside the residuals of the regression. Researchers have suggested several cutoff levels, or upper limits, on how much influence an observation may have before it should be considered an outlier.
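R's built-in plot method for fitted lm objects can draw these diagnostics directly; a minimal sketch, using the ps_lm model fit above:

```r
# Cook's distance for each observation, one bar per point;
# the most influential observations are labeled automatically.
plot(ps_lm, which = 4)

# Standardized residuals against leverage, with contour lines
# marking constant levels of Cook's distance.
plot(ps_lm, which = 5)
```

Points that sit beyond the Cook's distance contours in the second plot are the ones whose removal would most change the fitted coefficients.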
#Graphing is nice, but what if there are millions of observations, or you'd like to measure outliers in a different way? There are multiple ways of defining outliers and quantifying their influence. Luckily for us, R has built-in functions that can help identify influential points using various statistics with one simple command.
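The base R function influence.measures() computes a whole battery of these statistics at once; a sketch, again using the ps_lm fit from above:

```r
# influence.measures() reports DFBETAs, DFFITS, covariance ratios,
# Cook's distance, and hat (leverage) values for every observation,
# flagging those that exceed conventional cutoffs.
im <- influence.measures(ps_lm)
summary(im)  # prints only the observations flagged as influential

# Each statistic is also available on its own:
cooks.distance(ps_lm)  # Cook's distance per observation
hatvalues(ps_lm)       # leverage per observation
rstandard(ps_lm)       # standardized residuals
```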
#Using several of these measures one can see that Alaska is an outlier. It may skew the interpretation of the relationship between a state's income and its expenditure on education. Typically, one reports these statistics along with both a regression on all the data and one with the outliers removed.
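Refitting without the flagged observation makes the comparison concrete. A sketch, assuming (as in the PublicSchools data) that rows are named by state, so the outlier can be dropped by name:

```r
# Refit the model with Alaska excluded.
ps_no_ak <- ps[rownames(ps) != "Alaska", ]
ps_lm_no_ak <- lm(Expenditure ~ Income, data = ps_no_ak)

# Compare the coefficient estimates from the two fits side by side.
cbind(full = coef(ps_lm), no_alaska = coef(ps_lm_no_ak))
```

Reporting both sets of estimates lets readers judge for themselves how much the outlier drives the result.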