SAS Programming:Multiple Regression Analysis and Racial Discrimination in Soda Prices

This objective of this post is to quantify if there is a difference in soda prices in areas with a high concentration of blacks.  This analysis uses the data from a paper titled, “Do Fast-Food Chains Price Discriminate on the Race and Income Characteristics of an Area?” which appeared in the Journal of Business and Economic Statistics.  It consists of 409 zip-code level observations in New Jersey and Pennsylvania.  There are several variables representing race, income, competition and marginal costs variables in the data set that will be used to control for socioeconomic conditions which may be driving price differences. The multiple variable regression will be similar to that  found in the paper.  The results show that blacks are charged higher prices for soda in New Jersey and Pennsylvania in fast food restaurants even after controlling for income, competition, and cost structure differences across zip – codes .  These differences may be attributable to factors such as price elasticity, competition, and heterogeneous cost structures that are correlated with the proportion of blacks in a zip – code. However, one cannot rule out the case that there may be  systematic racial discrimination in the price of sodas in fast food restaurants, according to the analysis presented in the paper.  The following statistics and regression analysis is representative of those found in the research paper, although simplified a bit for the sake of brevity.


The following SAS code uses the “proc means” command to summarize two of the important socioeconomic variables that are going to be used in this analysis:  the percentage of blacks and the median wage earned per zip code.

The mean percentage of black in the data set are 11.3% with a low of 0 to a high of 98.2%.  The median family income is $47K a year with a standard deviation of 13K.


The regression equation controls for median income, proportion under poverty, density, and the crime rate per zip code.The variables in the model above are at the zip code level and their descriptions are below:

  • lsoda = natural logarithm for the price of soda.
  • prpblck = proportion of the black population
  • income = median income
  • propov = proportion under the poverty line
  • ldensity = natural logarithm of density
  • crmrte = crime rate

Using the “proc reg” code in SAS this model can be easily solved:

The SAS code above produces the tables below that describe the model and estimate the coefficients of the regression:

The “Parameter Estimates” above show that the coefficient estimate for “prpblck” is .065 and is statistically significant after controlling for other socioeconomic variables.  This means that if the proportion of blacks increases in a zip-code by 50% (50 percentage points) then one can expect to see soda prices increase by approximately 3.25%.  This is consistent with the results found in “Do Fast-Food Chains Price Discriminate on the Race and Income Characteristics of an Area?”.  The conclusion of the paper ends with this statement, warning of possible short-comings of this analysis and direction for further research:

“These results need not be evidence of discrimination but may reflect unmeasured cost differences across areas that are correlated with the proportion of the population that is black.  Explanations based on price discrimination, either due to differences in elasticities, differences in competition, or a taste for discrimination, cannot be excluded however.  More research into the incidence, cause and effects of price differences that are correlated with race would appear to be warranted.”

SAS Programming: Combining Data Sets and Subsetting with the “IN=” Option

There are times when a merger or two data sets requires further filtering to get to the information that is required.  The “IN=” option creates a tag for a data set that can be used to further filter after merging two data sets together.  This filtering is done with an “IF” statement.  This can be very useful when drilling down to investigate or summarize a set of data.

This post will merge together two data sets: 1) artificial derivatives sales data and 2) customer information.  Then the “IN=” and “IF” command combination will be used to compile a list of clients that have no derivatives sales data for the period under observation.  This can be used by a broker to identify which clients may need a little nudging or special attention since they have not purchased any derivatives recently. The data set is artificial and used for instructional purchases only.


The code above imports a CSV file with customer records and formats the names and phone numbers as strings with a field range of 100.  Then the “sort” command is used to sort by customer number which is the common variable between this data set and one that is imported with the code below.

Now that both data sets have been imported into SAS and sorted by “CustomerNumber”.  The following commands will create a new data set by merging the customer and sales files by the common variable “CustomerNumber”.  However, when the sales file is identified in the “merge” command the “IN=” option is added.  The command “( IN = recent)” tags the sales data with an indicator variable.  Then the “if” command tells SAS to merge the data sets and select only observations where the value for “recent” equals zero.

Armed with this list the broker can begin calling her clients to inquired on their drop off in derivatives purchases.

SAS Programming: Data Set Options

The SAS system contains several data options that play an important part in fine tuning SAS programs.  This post will focus on several options that can be included in the “data” commands when reading in data. These options will tell SAS which variables to keep, drop, and rename certain variables.  Other options tell SAS what observation to start reading from when importing.  The data used will be the GDP by state figures for 2010 that were merged together with the state name and state capital database in the previous post.


The keep option tells SAS which variables to keep in the data set.  The following code creates a new data set and keeps the variables “abbreviation”, “gdp”, and “name”.  Notice that the variable capital is not included because it was left out of the “keep” option.

Another way of eliminating the “capital” variable is to use the “drop” statement.

The options above can help to eliminate unnecessary variables that tend to accumulate when merging data.  Finally, the last option of this post concerns itself with renaming variables within a data set.  This command can either be used to create a new data set with the updated names or replace an existing data set with the new variable names.  The code below creates a new data set where the variables “gdp” and “names” have been renamed to “income” and “state” respectively.

Finally, it may be useful to print a subset of the data.  This can be accomplished by using the “firstobs” and the “obs” options only with the “proc print” command.

Notice how the “first obs = 2” command ensured that the output skipped the first state which was Alaska.  The second command “obs = 5” ensured that only the 5 observations were printed.  Noticed that the “obs” command begins with the original observation no matter where SAS is told to start with the “firstobs” command.

SAS Programming: One – to – One Merge

Matching two data sets can be done with  a “matching” variable.  A matching variable is one that is contained in both of the data sets.  For example, sales data may contain a customer number which is also found in a customer file that contains the customer’s contact information. One can create a SAS program that merges this two data sets into a third data set using the customer number.  If the matching variable only appears once in the data sets then the matching is said to be “one-to-one”.  On the other hand if one data set is more of a table whose values will appear several times in the other data set then it is considered to be one-to-many.

The objective of this post is to merge a “one-to-one” data sets and a one-to-many data set.  This “one – to – one” merge will consist of merging state level GDP information with the respective states name. The matching variable will be the abbreviated state name which appears in both data sets.


The following set of code reads in a CSV file into the temporary workfile and names it state GDP. It is very important to sort the data by the common variable which in this case is “abbreviation”.

The data consists of per capital GDP for 2010 at the state level.

The next line of code reads in a CSV file into the temporary workfile and names it statenames.  This file contains the state abbreviation, complete state name, and name of the state capital. Notice that before the data is populated some formatting is done to accomodate the different lengths that the data presents itself in.

Again, the data need to be sorted by the matching variable before the merging can take place:

The following code is the objective of this post.  Using the “merge” and “by” statements the following code merges the two data sets with a one-to-one relationship.

SAS Programming: Stacking Data with the Set Command

SAS has a command called “set” that can help stack data sets on top of one another.  This command maybe called when there are two sets of data from different time frames or divisions of a business that have the the same variables but different observations. If there is a variable in one data set that is not present in an another that are being stacked together then a missing value will be created.

The objective of this post is to stack two data sets that contain monthly observations for U.S. personal disposable income.  This data comes from the St. Louis Federal Reserve Bank’s FRED database:  This data was partitioned into two for the pursposes of this exercise.  One set of data consist of all observations before January 1980 and the remaining data set consists of all observations after that date.  The purpose of this post is to stack this data on top of each other so that they become one data set.


This first set of code merely bring in the two data sets and store them in the temporary work folder:

The SAS log shows that both data sets were imported correctly and tallies the variables and observations.  The first file labeled income contains 632 records while the second file contains 380 variables.

This next set of code contains the actual stacking of the data.  This code creates a temporary work file called “both”. Next, the “set” command is used to stack the files “income” and “income2” and the “by” command sorts the data by date.  Finally the “proc print” statement is used to output the data with the correct date format.  Using the “set” and “by” combination is called interleaving.  Interleaving is especially useful if each individual data set is sorted and stacking may undo this sorting.

The log shows that the two data sets were merged correctly.  There are a total of 1012 observations from the merger of “income” and “income 2”.



This new data set is now ready for analysis.

SAS Programming: Obtaining Subsets of Data

There are many times when an analyst needs to observe only a portion of the data they have in a particular file.  When working with large data sets it may be useful to take a subset of the data, create a new file with only the reduced form data (to preserve the integrity of the original data set) and begin running analysis on this subset.   Fortunately, the SAS system has the ability to manipulate large data sets through its “set” command. The benefits of coding this command instead of merely deleting observations is that it preserves data integrity and improves the transparency of the data cleaning process.

This post will continue to use the St. Louis Federal Reserve data on unemployment and money supply that have been used in all previous SAS posts.  The goal is to generate a subset of this data which contains years of high unemployment.  High unemployment will be defined as years where the unemployment rate was higher than 9%.  In addition to creating this subset a new variable will be created which takes the ratio of the monthly percentage change in the money supply and the unemployment rate.  This will be used to illustrate how SAS can create new variables out of old ones while creating a new data subset.

CODEThe first data command imports a CSV file stored in the K drive by specifying that its delimiter is a comma.  The “missover” command ensures that two consecutive commas are treated as missing values among a couple of other things (see two posts ago on importing CSV for further detail).  Next the date is formatted and finally the names of the variables are defined for SAS.

The second data command is the subject of this post. The first line specifies a temporary data set called high will be created in the work folder.  The second command, “set” calls that imported data called fed from the first command.  The third line is a logical statement that tells SAS the criteria for the new data set “high” based of values in the old data set “fed”.  This is the command that isolates only the observations with an unemployment rate greater than 9%.  The last line in this data set is to created a variable called “mtou” that is the ratio of the monthly percentage change in the money supply divided by the unemployment rate. This is the basic framework for generating a subset of a larger data set and creating a new variable.

The last command is a “proc” command that prints out the new data subset of the “fed” file called “high”.  The output of these commands looks like the following.

The log provides further information about the the partitioning of the data.  The log below shows that there were a total of 620 observations read in from the “fed” file.  The subset labeled “high” contains only 41 monthly observations where unemployment was higher than 9%.  In addition to the reduced observations an additional variable was created increasing the total number of variables to 4.


The new data set, “high” will also appear in the work folder along side the original “fed” data preserving the original data set.  The program can be easily modified if the criteria for high unemployment needs to change.  This can be done by simply changing the “9.0” in the code above to whatever threshold is called for in the analysis.