Accessing the U.S. Census API and Creating Heat Maps in R


The United States Census is rich with information that economists can use to understand demographic and social trends. A popular way of presenting U.S. Census data is to plot it on a map. The goal of this post is to obtain data from the U.S. Census via an API connection and plot the data on a map using R. The reasons for accessing data via an API, as opposed to going to the U.S. Census website, are speed, ease of querying the database, and repeatability, which I will demonstrate through a few examples. Demonstrating R’s mapping features also provides a good start to learning how to map data in general.

ACCESSING DATA VIA THE CENSUS API

In order to access the Census API using R, you need to download and install the acs package and request a secret key from the U.S. Census. This document contains the steps that got me started and contains more detailed information than is found in this post if you are interested in getting an in-depth knowledge of the acs package. Once you’ve installed the package and requested your API key from the Census, you are ready to get started. Type the following commands to gain access to the U.S. Census database and to load the other libraries used in this post to manipulate data and create the heat maps:

library(acs)
library(sqldf)
library(ggplot2)
library(maps)

api.key.install(key = "your secret key here")

One of the great features of being able to access the Census API via the acs package in R is that one can search for keywords across all the tables using a function called acs.lookup(). The following code searches for tables that contain the keyword ‘bachelor’ in order to find tables with data about educational attainment.

acs.lookup(endyear = 2012, span = 5, dataset = "acs", keyword = "bachelor", case.sensitive = FALSE)

Here is a sample of the output from the acs.lookup() function:

table.name
PLACE OF BIRTH BY EDUCATIONAL ATTAINMENT IN THE UNITED STATES
POVERTY STATUS IN THE PAST 12 MONTHS OF INDIVIDUALS BY SEX BY EDUCATIONAL ATTAINMENT

The next step is to grab the data from one of the tables of interest. In order to grab the data from the table, we must specify the geography of interest. In other words, do we want to look at the poverty status table by state, by county, by city, by congressional district, by zip code, or by a combination of these geographies? Please refer to the ACS documentation or the link provided above for further details on the numerous ways of slicing Census data. In this post we want to look at poverty per state, so we use the geo.make() function to create this geography before accessing the data.

states = geo.make(state = "*")

Once the geography is created, we can pass that object to the acs.fetch() function to actually retrieve the data from the table for the latest year available, 2012. Setting col.names = "pretty" retrieves the table with textual names as opposed to coded variable names.

census_poverty = acs.fetch(endyear = 2012, span = 5, geo = states,
     table.name = "POVERTY STATUS IN THE PAST 12 MONTHS OF INDIVIDUALS BY SEX BY EDUCATIONAL ATTAINMENT",
     col.names = "pretty")

In order to create a plot, I export the Census data object created with the acs.fetch() function as a csv and then reimport it as a data frame. There is probably a more direct way to do this, but I haven’t found it.

census_poverty <- read.csv("census_poverty.csv", header = TRUE)
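
A more direct route may exist. As far as I can tell, the acs package’s estimate() function returns the estimates as a numeric matrix with the geography names as row names. Assuming that shape, the matrix can be turned into a data frame without a round trip through a csv file. The sketch below uses a small made-up matrix (est, with hypothetical values and column names) in place of estimate(census_poverty) so that it runs without an API key:

```r
# Stand-in for estimate(census_poverty): a numeric matrix whose row names
# are state names. The numbers and column names are made up for illustration.
est <- matrix(
  c(1000, 150,
    2000, 240),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Alabama", "Alaska"), c("Total", "Below poverty level"))
)

# Convert to a data frame and move the row names into a column named X,
# mirroring the column that read.csv() creates from a csv's row names.
poverty_df <- data.frame(X = rownames(est), est, row.names = NULL,
                         check.names = TRUE, stringsAsFactors = FALSE)

# check.names = TRUE turns spaces into dots, just as read.csv() does.
print(colnames(poverty_df))
```

With the real object, the first step would be est <- estimate(census_poverty), provided estimate() behaves as assumed here.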

#Removing '.' from data frame names and replacing with '_'.
#gsub() is vectorized, so no loop over the column names is needed.
colnames(census_poverty) <- gsub(".", "_", colnames(census_poverty), fixed = TRUE)

Finally, I calculate the poverty rate from the data frame census_poverty created from the csv import of the Census data:

census_poverty$poverty_rate <-
     census_poverty$POVERTY_STATUS_IN_THE_PAST_12_MONTHS_OF_INDIVIDUALS_BY_SEX_BY_EDUCATIONAL_ATTAINMENT___Income_in_the_past_12_months_below_poverty_level__ /
     census_poverty$POVERTY_STATUS_IN_THE_PAST_12_MONTHS_OF_INDIVIDUALS_BY_SEX_BY_EDUCATIONAL_ATTAINMENT___Total__
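
Typing such long column names by hand is error-prone. One possible alternative, sketched here on a toy data frame, is to look the columns up by a distinctive fragment of their names. The data frame df, its TABLE_NAME prefix, and its values are hypothetical stand-ins for census_poverty:

```r
# Toy data frame with made-up values and shortened, hypothetical column names
# in the same style as the ones produced by the csv import.
df <- data.frame(
  X = c("Alabama", "Alaska"),
  TABLE_NAME___Total__ = c(1000, 2000),
  TABLE_NAME___Income_in_the_past_12_months_below_poverty_level__ = c(150, 240)
)

# Locate the two columns by the tail end of their names.
total_col <- grep("___Total__$", names(df), value = TRUE)
below_col <- grep("below_poverty_level__$", names(df), value = TRUE)

# Guard against a pattern matching zero or several columns.
stopifnot(length(total_col) == 1, length(below_col) == 1)

# Poverty rate = people below the poverty level / total people.
df$poverty_rate <- df[[below_col]] / df[[total_col]]
print(df[, c("X", "poverty_rate")])
```

The stopifnot() guard matters on the real table, where several columns may share similar name fragments.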

CREATING HEAT MAPS

I like to clean up the look of ggplots with the following theme:

my_theme <- theme(
     panel.background = element_blank(),
     panel.grid.major = element_blank(),
     panel.grid.minor = element_blank(),
     panel.border = element_blank(),
     axis.title.x = element_text(colour = "black", size = 12),
     axis.text.x = element_text(colour = "black", size = 12),
     axis.title.y = element_text(colour = "black", size = 12),
     plot.title = element_text(colour = "black", size = 16, face = "bold"),
     axis.ticks = element_blank(),
     # theme() accepts each element only once, so axis.text.y appears a single time
     axis.text.y = element_blank()
)

Then I import the example map data in order to get the data needed to create a map in ggplot2:

example(map_data)

Then I merge the map data (called choro) with other data sets that I created from the Census, along with the data on poverty created in this post, using the sqldf package. One can use the merge function instead of the sqldf library, but it is worth learning the sqldf package if one wants to do more extensive data management using SQL statements in R, though I wouldn’t discount the data manipulation capabilities in R.

output1 <- sqldf("SELECT lower(census_edu_hisp.X) as region, census_edu_hisp.hispanic_ba, census_edu_white.white_ba, census_poverty.poverty_rate
FROM census_edu_hisp
LEFT JOIN census_edu_white
ON census_edu_hisp.X = census_edu_white.X
LEFT JOIN census_poverty
ON census_edu_hisp.X = census_poverty.X")

output2 <- sqldf("SELECT *
FROM output1
LEFT JOIN choro
ON output1.region = choro.region")
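
For readers who prefer base R, the same kind of left join can be sketched with merge(). The two toy data frames below (rates and polygons, with made-up values) stand in for output1 and choro. The re-sort at the end is my assumption about what map polygons need: merge() may reorder rows, and polygon vertices must be drawn in their original sequence.

```r
# Toy stand-ins: one row per state, and a few polygon vertices per state.
rates <- data.frame(region = c("alabama", "alaska"),
                    poverty_rate = c(0.15, 0.12))
polygons <- data.frame(
  region = c("alabama", "alabama", "alabama", "alaska", "alaska", "alaska"),
  long   = c(-87, -86, -85, -150, -149, -148),
  lat    = c(31, 32, 33, 61, 62, 63),
  group  = c(1, 1, 1, 2, 2, 2),
  order  = 1:6
)

# all.x = TRUE keeps every row of rates, like a LEFT JOIN in SQL.
merged <- merge(rates, polygons, by = "region", all.x = TRUE)

# Restore the vertex drawing order in case merge() shuffled the rows.
merged <- merged[order(merged$order), ]
print(merged)
```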

Finally, the data is ready for plotting on a heat map with the following code:

qplot(long, lat, data = output2, group = group, fill = poverty_rate, geom = "polygon") +
     # muted() lives in the scales package, so it is called with its namespace
     scale_fill_gradient(low = "white", high = scales::muted("red")) +
     ggtitle("% of Population Living in Poverty") +
     my_theme

The final product is this map that shows the % of people living in poverty by state:


Poverty Rates

Grammar of Graphics: ggplot2 in the R programming language.


The free and popular statistical programming language R contains a powerful graphing library called ggplot2. The ‘gg’ stands for the grammar of graphics, based on the data visualization work done by Leland Wilkinson. The grammar of graphics ensures that the aesthetics and geometric proportions that are generated are representative of the actual values that underlie the graph.

The objective of this post is to demonstrate some of the simple code one can write in R to create simple and elegant graphs. Best of all, the code can be reused multiple times to create high-quality representations of your data. The repeatability of creating graphs in R code automates many visualizations that are often painstakingly created in programs like PowerPoint and Excel. R also has many free libraries that can generate a vast array of graphs, such as maps, circle graphs, network graphs, heat maps, etc., that are not readily available with Microsoft products. This post will focus on the most common and simple graphs used for data analysis; a subsequent post will contain more detailed and elaborate graphs.

CUSTOMIZED THEMES

In R you can create customized templates for graphs using ggplot2. These themes control the appearance of text, axes, font type, font size, gridlines, etc. Basically, anything you can imagine controlling in the look and feel of the graph can be controlled via themes. I find the following theme visually appealing and simple, and it is used in all the graphs in this post:

my_theme <- theme(
     panel.background = element_blank(),
     panel.grid.major = element_blank(),
     panel.grid.minor = element_blank(),
     panel.border = element_blank(),
     axis.title.x = element_text(colour = "black", size = 12),
     axis.text.x = element_text(colour = "black", size = 12),
     axis.title.y = element_text(colour = "black", size = 12),
     plot.title = element_text(colour = "black", size = 16, face = "bold"),
     axis.ticks = element_blank(),
     # theme() accepts each element only once, so axis.text.y appears a single time
     axis.text.y = element_blank()
)

BAR CHARTS

Bar charts are one of the most fundamental charts for data analysis.

simple bar chart

This is the code that created the simple and elegant bar chart above:

# pg_mean is assumed to be the data set of that name from the gcookbook package
ggplot(pg_mean, aes(x = group, y = weight)) +
     geom_bar(stat = "identity", fill = "salmon1", colour = "black") +
     ggtitle("Title") +
     xlab("x label") + ylab("y label") +
     geom_text(aes(y = weight + 0.3, label = weight)) +
     my_theme

Notice the structure of the syntax in the ggplot code. The ggplot() call first creates an object that tells R what is to be plotted. Then geom_bar() tells R to create a bar chart. Once the chart is created, the title is added with ggtitle(), and the x and y labels are also added to the graph in case the variable names are cryptic and not end-user friendly. The geom_text() call adds the labels at the top of the bars, and finally my predefined theme, my_theme, cleans up the appearance (removing gridlines, backgrounds, tick marks on axes, etc.).
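
The bar chart code expects a data frame, pg_mean, with one row per group and the columns group and weight. If the gcookbook package is not installed, a comparable table can be sketched in base R. This assumes that pg_mean holds the mean plant weight per group from R’s built-in PlantGrowth data, which is my reading of the data set rather than something stated in this post:

```r
# PlantGrowth ships with R: 30 plant weights across three treatment groups.
# aggregate() computes the mean weight within each group.
pg_mean <- aggregate(weight ~ group, data = PlantGrowth, FUN = mean)
print(pg_mean)
```

The result has the same two column names the ggplot() call refers to, so it can be passed to the plotting code unchanged.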

 

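stackedchart

#stacked bar graph.
library(plyr)  # provides arrange() and ddply()
# cabbage_exp is assumed to be the data set of that name from the gcookbook package
ce <- arrange(cabbage_exp, Date, Cultivar)
# label_y: running total of Weight within each Date, nudged down by 10% of the segment
ce <- ddply(ce, "Date", transform, label_y = cumsum(Weight) - 0.1 * Weight)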

ggplot(ce, aes(x = Date, y = Weight, fill = Cultivar)) +
     geom_bar(stat = "identity", colour = "black") +
     ggtitle("Title") +
     xlab("x label") + ylab("y label") +
     geom_text(aes(y = label_y + 0.8, label = Weight), colour = "black") +
     my_theme +
     scale_fill_brewer(palette = "Pastel1")
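
The label positions in the stacked chart come from a running total of Weight within each Date. As a minimal sketch of that idea without plyr, base R’s ave() can compute the same grouped cumulative sum. The toy data frame ce_toy and its values are made up for illustration:

```r
# Toy data: two cultivars measured on two dates (made-up values).
ce_toy <- data.frame(
  Date     = c("d16", "d16", "d20", "d20"),
  Cultivar = c("c39", "c52", "c39", "c52"),
  Weight   = c(3.2, 2.3, 2.8, 1.5)
)

# ave() applies cumsum() within each Date and returns a vector
# aligned with the original rows.
ce_toy$label_y <- ave(ce_toy$Weight, ce_toy$Date, FUN = cumsum)
print(ce_toy)
```

Each label_y value marks the top of its segment only if the rows are sorted in the order the bars are stacked. That order may differ between ggplot2 versions, so it is worth checking the labels against the plot.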

grouped bar graph

#grouped bar graph.

ggplot(cabbage_exp, aes(x = Date, y = Weight, fill = Cultivar)) +
     geom_bar(position = "dodge", colour = "black", stat = "identity") +
     ggtitle("Title") +
     xlab("x label") + ylab("y label") +
     geom_text(aes(y = Weight + 0.1, label = Weight)) +
     my_theme +
     scale_fill_brewer(palette = "Pastel1")

LINE CHARTS

The next set of graphs are line graphs.  Line graphs are useful for depicting relationships across time.

twin line charts

#multiple line graph.


tg <- ddply(ToothGrowth, c("supp", "dose"), summarise, length = mean(len))

ggplot(tg, aes( x = dose, y = length, fill = supp)) +
     geom_line() +
     geom_point(size = 6, shape = 21) +
     geom_text(aes(y = length+4, label= length))+
     ggtitle("Title") +
     xlab("x label") + ylab("y label") +
     my_theme

filled line chart

#shaded area line graph.


sunspotyear <- data.frame(
     year = as.numeric(time(sunspot.year)),
     sunspots = as.numeric(sunspot.year)
)

ggplot(sunspotyear, aes (x = year, y = sunspots)) +
     geom_area(fill = "salmon1") +
     geom_line() +
     ggtitle("Title") +
     xlab("x label") + ylab("y label") +
     my_theme

SCATTER PLOTS

The next section covers one of the workhorses of statistical graphics, the scatter plot, which is a great way of showing the relationship between two continuous variables.

scatterplot

#basic scatter plot.


ggplot(heightweight, aes (x = ageYear, y = heightIn)) +
     geom_point(size = 4, shape = 21, colour = "black", fill = "salmon1") +
     ggtitle("Title") +
     xlab("x label") + ylab("y label") +
     my_theme

weighted graph

#scatter plot – mapping continuous variable to size of dot.
ggplot(heightweight, aes(x = ageYear, y = heightIn, size = weightLb)) +
     # shape 21 is a fillable circle; the default point shape ignores fill
     geom_point(shape = 21, colour = "black", fill = "salmon1") +
     ggtitle("Title") +
     xlab("x label") + ylab("y label") +
     my_theme

scatterplot matrix

#scatter plot matrix.
c2009 <- subset(countries , Year == 2009,
select = c(Name, GDP, laborrate, healthexp, infmortality))
pairs(c2009[, 2:5])