There are many times when an analyst needs to observe only a portion of the data they have in a particular file. When working with large data sets it may be useful to take a subset of the data, create a new file with only the reduced form data (to preserve the integrity of the original data set) and begin running analysis on this subset. Fortunately, the SAS system has the ability to manipulate large data sets through its “set” command. The benefits of coding this command instead of merely deleting observations is that it preserves data integrity and improves the transparency of the data cleaning process.
This post will continue to use the St. Louis Federal Reserve data on unemployment and money supply that have been used in all previous SAS posts. The goal is to generate a subset of this data which contains years of high unemployment. High unemployment will be defined as years where the unemployment rate was higher than 9%. In addition to creating this subset a new variable will be created which takes the ratio of the monthly percentage change in the money supply and the unemployment rate. This will be used to illustrate how SAS can create new variables out of old ones while creating a new data subset.
CODEThe first data command imports a CSV file stored in the K drive by specifying that its delimiter is a comma. The “missover” command ensures that two consecutive commas are treated as missing values among a couple of other things (see two posts ago on importing CSV for further detail). Next the date is formatted and finally the names of the variables are defined for SAS.
The second data command is the subject of this post. The first line specifies a temporary data set called high will be created in the work folder. The second command, “set” calls that imported data called fed from the first command. The third line is a logical statement that tells SAS the criteria for the new data set “high” based of values in the old data set “fed”. This is the command that isolates only the observations with an unemployment rate greater than 9%. The last line in this data set is to created a variable called “mtou” that is the ratio of the monthly percentage change in the money supply divided by the unemployment rate. This is the basic framework for generating a subset of a larger data set and creating a new variable.
The last command is a “proc” command that prints out the new data subset of the “fed” file called “high”. The output of these commands looks like the following.
The log provides further information about the the partitioning of the data. The log below shows that there were a total of 620 observations read in from the “fed” file. The subset labeled “high” contains only 41 monthly observations where unemployment was higher than 9%. In addition to the reduced observations an additional variable was created increasing the total number of variables to 4.
The new data set, “high” will also appear in the work folder along side the original “fed” data preserving the original data set. The program can be easily modified if the criteria for high unemployment needs to change. This can be done by simply changing the “9.0” in the code above to whatever threshold is called for in the analysis.