Grammar of Graphics: ggplot2 in the R programming language.


The free and popular statistical programming language R has a powerful graphing package called ggplot2. The ‘gg’ stands for the grammar of graphics, a framework based on the data visualization work of Leland Wilkinson. The grammar of graphics ensures that the aesthetics and geometric proportions of a graph faithfully represent the values that underlie it.

The objective of this post is to demonstrate some simple code one can write in R to create clean, elegant graphs. Best of all, the code can be reused to create high-quality representations of your data. Because graphs are generated from code, R automates many visualizations that are often painstakingly created by hand in programs like PowerPoint and Excel. R also has many free libraries that can generate a vast array of graphs, such as maps, circle graphs, network graphs, and heat maps, that are not readily available in Microsoft products. This post focuses on the most common and simple graphs used for data analysis; a subsequent post will cover more detailed and elaborate graphs.

CUSTOMIZED THEMES

In R you can create customized templates for graphs using ggplot2. These themes control the appearance of text, axes, font type, font size, gridlines, and so on; basically anything you can imagine controlling in the look and feel of a graph can be controlled via themes. I find the following theme simple and visually appealing, and it is used in all the graphs in this post:

library(ggplot2)

my_theme <- theme(
     panel.background = element_blank(),
     panel.grid.major = element_blank(),
     panel.grid.minor = element_blank(),
     panel.border = element_blank(),
     axis.title.x = element_text(colour = "black", size = 12),
     axis.text.x = element_text(colour = "black", size = 12),
     axis.title.y = element_text(colour = "black", size = 12),
     axis.text.y = element_text(colour = "black", size = 12),
     plot.title = element_text(colour = "black", size = 16, face = "bold"),
     axis.ticks = element_blank()
)
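Because a theme is an ordinary R object, it can be adjusted for a single plot by adding another theme() call after it; the later call overrides only the elements it names. The sketch below is illustrative only: base_theme is a hypothetical stand-in theme, and the built-in mtcars data set is used rather than any data from this post.

```r
library(ggplot2)

# Stand-in theme for illustration (hypothetical name, not the theme defined above).
base_theme <- theme(
  panel.background = element_blank(),
  panel.grid.major = element_blank(),
  axis.text = element_text(colour = "black", size = 12)
)

# The second theme() call overrides only panel.grid.major;
# every other setting from base_theme is kept.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  base_theme +
  theme(panel.grid.major = element_line(colour = "grey85"))

print(p)
```

This keeps one shared theme for the whole report while still allowing a one-off tweak, such as restoring gridlines on a chart where readers need to read values off the axes.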

BAR CHARTS

Bar charts are among the most fundamental charts for data analysis.

[Figure: simple bar chart]

This is the code that created the simple and elegant bar chart above:

# pg_mean is a sample data set from the gcookbook package
library(gcookbook)

ggplot(pg_mean, aes(x = group, y = weight)) +
     geom_bar(stat = "identity", fill = "salmon1", colour = "black") +
     ggtitle("Title") +
     xlab("x label") + ylab("y label") +
     geom_text(aes(y = weight + 0.3, label = weight)) +
     my_theme

Notice the structure of the syntax in ggplot code. The ggplot() call first creates an object that tells R what is to be plotted. Then geom_bar() tells R to draw a bar chart. Next, ggtitle() adds the title, and xlab() and ylab() add the x and y labels in case the variable names are cryptic and not end-user friendly. geom_text() adds the value labels at the top of each bar, and finally my predefined theme, my_theme, cleans up the appearance (removing gridlines, backgrounds, tick marks on axes, etc.).
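The same layering can be seen more explicitly by building the plot one step at a time and storing the intermediate object. This is a minimal sketch with a small made-up data frame (df and its values are assumptions for illustration, not the data behind the chart above).

```r
library(ggplot2)

# Small made-up data set, for illustration only.
df <- data.frame(group = c("a", "b", "c"), weight = c(5.0, 4.7, 5.5))

# Step 1: the plot object records the data and the aesthetic mappings.
p <- ggplot(df, aes(x = group, y = weight))

# Step 2: each `+` adds a layer or a setting to that object.
p <- p + geom_bar(stat = "identity", fill = "salmon1", colour = "black")
p <- p + geom_text(aes(y = weight + 0.3, label = weight))
p <- p + ggtitle("Title") + xlab("x label") + ylab("y label")

# Nothing is drawn until the object is printed.
print(p)
```

Storing the plot in a variable like this makes it easy to try different layers or themes on the same base plot without retyping the whole chain.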

 

[Figure: stacked bar chart]

# stacked bar graph
library(plyr)   # provides arrange(), desc() and ddply()

# ggplot2 2.2.0 and later stack the first factor level on top, so sort
# Cultivar in descending order to accumulate weights from the bottom up
ce <- arrange(cabbage_exp, Date, desc(Cultivar))
ce <- ddply(ce, "Date", transform, label_y = cumsum(Weight) - 0.1 * Weight)

ggplot(ce, aes(x = Date, y = Weight, fill = Cultivar)) +
     geom_bar(stat = "identity", colour = "black") +
     ggtitle("Title") +
     xlab("x label") + ylab("y label") +
     geom_text(aes(y = label_y + 0.8, label = Weight), colour = "black") +
     my_theme +
     scale_fill_brewer(palette = "Pastel1")

[Figure: grouped bar chart]

# grouped bar graph
ggplot(cabbage_exp, aes(x = Date, y = Weight, fill = Cultivar)) +
     geom_bar(position = "dodge", colour = "black", stat = "identity") +
     ggtitle("Title") +
     xlab("x label") + ylab("y label") +
     # dodge the labels by the same width as the bars so each sits over its own bar
     geom_text(aes(y = Weight + 0.1, label = Weight), position = position_dodge(width = 0.9)) +
     my_theme +
     scale_fill_brewer(palette = "Pastel1")
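Since each chart is just code, a recipe like the bar charts above can be wrapped in a function and reused on other data. The following is only a sketch: labelled_bar_chart, its arguments, and the sales data frame are hypothetical names invented for this example, and the .data pronoun used to look up columns by name assumes ggplot2 3.0.0 or later.

```r
library(ggplot2)

# Hypothetical helper: draws a labelled bar chart for any data frame.
# x_col and y_col are column names given as strings.
labelled_bar_chart <- function(data, x_col, y_col, title = "Title") {
  ggplot(data, aes(x = .data[[x_col]], y = .data[[y_col]])) +
    geom_bar(stat = "identity", fill = "salmon1", colour = "black") +
    geom_text(aes(label = .data[[y_col]]), vjust = -0.5) +
    ggtitle(title) +
    xlab(x_col) + ylab(y_col)
}

# Made-up example data; any data frame with one row per bar works.
sales <- data.frame(region = c("North", "South", "West"), units = c(12, 7, 9))
p <- labelled_bar_chart(sales, "region", "units", title = "Units by region")
print(p)
```

Because the function returns a regular plot object, a theme or colour scale can still be added to the result with `+` afterwards.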

LINE CHARTS

The next set of graphs are line graphs.  Line graphs are useful for depicting relationships across time.

[Figure: multiple line chart]

# multiple line graph
# ddply() and summarise() come from the plyr package
tg <- ddply(ToothGrowth, c("supp", "dose"), summarise, length = mean(len))

ggplot(tg, aes(x = dose, y = length, fill = supp)) +
     geom_line() +
     geom_point(size = 6, shape = 21) +
     geom_text(aes(y = length + 4, label = length)) +
     ggtitle("Title") +
     xlab("x label") + ylab("y label") +
     my_theme
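The summary step does not have to rely on plyr. As one possible alternative (not the approach used in this post), base R's aggregate() can produce the same table of group means from the built-in ToothGrowth data. The sketch below also rounds the labels to one decimal place, which is an added assumption about how the labels should look.

```r
library(ggplot2)

# Mean tooth length for every supp/dose combination, using base R only.
tg <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = mean)
names(tg)[names(tg) == "len"] <- "length"

p <- ggplot(tg, aes(x = dose, y = length, fill = supp)) +
  geom_line() +
  geom_point(size = 6, shape = 21) +
  # round() keeps the printed labels short
  geom_text(aes(y = length + 4, label = round(length, 1)))

print(p)
```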

[Figure: filled line chart]

# shaded area line graph
sunspotyear <- data.frame(
     year = as.numeric(time(sunspot.year)),
     sunspots = as.numeric(sunspot.year)
)

ggplot(sunspotyear, aes(x = year, y = sunspots)) +
     geom_area(fill = "salmon1") +
     geom_line() +
     ggtitle("Title") +
     xlab("x label") + ylab("y label") +
     my_theme

SCATTER PLOTS

The next section covers one of the workhorses of statistical graphics: the scatter plot, which is a great way of showing the relationship between two continuous variables.

[Figure: scatter plot]

# basic scatter plot
ggplot(heightweight, aes(x = ageYear, y = heightIn)) +
     geom_point(size = 4, shape = 21, colour = "black", fill = "salmon1") +
     ggtitle("Title") +
     xlab("x label") + ylab("y label") +
     my_theme

[Figure: weighted scatter plot]

# scatter plot: mapping a continuous variable to the size of the dot
ggplot(heightweight, aes(x = ageYear, y = heightIn, size = weightLb)) +
     # shape 21 is a filled circle; the default point shape ignores fill
     geom_point(shape = 21, colour = "black", fill = "salmon1") +
     ggtitle("Title") +
     xlab("x label") + ylab("y label") +
     my_theme

[Figure: scatter plot matrix]

# scatter plot matrix
c2009 <- subset(countries, Year == 2009,
     select = c(Name, GDP, laborrate, healthexp, infmortality))
pairs(c2009[, 2:5])

