ggplot2 is a very flexible plotting package for R, ably designed by Hadley Wickham around a concept called the “grammar of graphics”, and anyway pretty awesome. So let’s jump right in with some simple examples that should help you get going.
The basic concept
Basically, ggplot2 consists of a set of functions, each addressing one aspect of a plot. They are joined by ‘+’ and together form a unit describing your desired graphic. Thanks to well-chosen defaults you don’t have to describe every detail of your plot, but you can if necessary. You can think of a plot as a stack of layers piled on top of each other. The data has to be supplied as a single data frame that keeps the values in separate named columns (at times referred to as variables).
First of all we need “data” and also to load ggplot2:
    # two variables a and b measured at four days (d)
    # as.Date() casts the date strings into dates R can handle
    df <- data.frame(
      a = c(1,2,1,2),
      b = c(2,2,1,1),
      d = as.Date(c('2012-01-01','2012-01-02','2012-01-03','2012-01-04'))
    )

    # initial installation as usual with install.packages("ggplot2")
    library(ggplot2)
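Before plotting, it can help to confirm that the data frame has the shape ggplot2 expects: one row per observation and one named column per variable. The following is just a small base R sketch using the same df as above; the variable name col_classes is mine and only there for illustration.

```r
# Same example data as above: a and b measured on four days (d)
df <- data.frame(
  a = c(1, 2, 1, 2),
  b = c(2, 2, 1, 1),
  d = as.Date(c("2012-01-01", "2012-01-02", "2012-01-03", "2012-01-04"))
)

# str() prints a compact overview: dimensions, column names and types
str(df)

# Class of every column (col_classes is an illustrative name);
# a and b are numeric, d is a Date
col_classes <- sapply(df, class)
col_classes
```

If d showed up as character here, the date axis used later on would not work, so this is a cheap check to run first.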
Colored scatter plot with paths
    # use df as the central data frame
    # use columns a and b for plotting
    # plot the points on the first layer
    ggplot(data = df, aes(a, b)) +
      geom_point()
The function aes() (short for “aesthetics”) maps columns of the data frame to visual properties of the plot. Here it just takes the names of the columns you want to plot on x and y, and that’s it.
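To make the mapping a bit more explicit, here is a small sketch comparing the positional form aes(a, b) with the spelled-out form aes(x = a, y = b). The object names p_short and p_named are my own, and the data frame is a reduced copy of the example data.

```r
library(ggplot2)

# Reduced copy of the example data (only the two plotted columns)
df <- data.frame(a = c(1, 2, 1, 2), b = c(2, 2, 1, 1))

# Positional form: the first argument goes to x, the second to y
p_short <- ggplot(df, aes(a, b)) + geom_point()

# The same mapping with the aesthetic names written out
p_named <- ggplot(df, aes(x = a, y = b)) + geom_point()

# Both plot objects carry the same aesthetic names
names(p_short$mapping)
names(p_named$mapping)
```

The positional form is shorter, while the named form is easier to read once more aesthetics such as color or group come into play.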
Now let’s spice it up a bit and color the points depending on the day and increase the size of the points.
    # the size attribute in geom_point() sets the size of a point.
    ggplot(df,aes(a,b,color=format(d,"%d"))) +
      geom_point(size=5)
The parameter color within the aes() function does not define which colors to use for which point, but which sets of points are supposed to be colored the same. In this case we extract, for every pair of values from columns a and b, the day from the associated date. Every row has a different date, hence every point is colored differently. Let’s have a look at a different example where I use the remainder of a division by 2 to define the separate point sets.
    # df$a      = c(1,2,1,2)
    # df$a %% 2 = c(1,0,1,0)
    ggplot(df,aes(a,b,color=a%%2)) +
      geom_point(size=5)
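To see which point sets the two color expressions produce, you can evaluate them directly in base R. The sketch below also wraps the remainder in factor(), which is my addition and not part of the example above: a numeric value mapped to color gets a continuous gradient scale, while a factor gets a discrete one. The names day_labels, remainder and p_discrete are illustrative.

```r
library(ggplot2)

df <- data.frame(
  a = c(1, 2, 1, 2),
  b = c(2, 2, 1, 1),
  d = as.Date(c("2012-01-01", "2012-01-02", "2012-01-03", "2012-01-04"))
)

# One distinct label per row, hence four differently colored points
day_labels <- format(df$d, "%d")
day_labels

# Only two distinct values, hence two sets of points
remainder <- df$a %% 2
remainder

# Assumption for illustration: factor() turns the numeric remainder into
# a categorical variable, so ggplot2 picks a discrete color scale
p_discrete <- ggplot(df, aes(a, b, color = factor(a %% 2))) +
  geom_point(size = 5)
p_discrete
```

Whether you want the gradient or the discrete version depends on whether the grouping value is meant as a quantity or as a category.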
Now let’s add lines to the plot, connecting the points according to their order in the data frame.
    # ggplot2 handles the plot as an R object. In the last row
    # it is served to R and drawn.
    # The new group parameter just says that all observations (a,b) belong
    # to one group (doesn't have to be 0, any constant would do the trick).
    # Otherwise ggplot2 complains when trying to apply geom_path().
    p <- ggplot(df,aes(a,b,color=format(d,"%d"),group=0))
    p <- p + geom_point(size=5)
    p <- p + geom_path()
    p
Here you can observe the layer concept at work. First the data layer, then two graphic / geometric layers – first the scatter plot then the paths connecting the points.
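Since the plot is an ordinary R object, the layers can be counted before and after adding the geoms. This is only a sketch for poking around; it reads the layers element of the plot object, and the names n_before and n_after are mine.

```r
library(ggplot2)

df <- data.frame(
  a = c(1, 2, 1, 2),
  b = c(2, 2, 1, 1),
  d = as.Date(c("2012-01-01", "2012-01-02", "2012-01-03", "2012-01-04"))
)

# Data and aesthetics only: no geometric layer has been added yet
p <- ggplot(df, aes(a, b, color = format(d, "%d"), group = 0))
n_before <- length(p$layers)

# Add the scatter plot layer and then the path layer on top of it
p <- p + geom_point(size = 5)
p <- p + geom_path()
n_after <- length(p$layers)

n_before
n_after
```

The order in which the geoms are added is the order in which they are drawn, so the paths end up on top of the points here.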
Because the coloration is based on format(), ggplot2 uses this expression as the legend title. Of course this is not pleasant to the eye, so we change it to something more meaningful. The legend (layer) is addressed by a separate function.
    # given that the last plot is still loaded in R we can address
    # it using last_plot() and just "add" the new legend layer.
    last_plot() +
      scale_color_discrete(name="date")
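As an alternative sketch, the legend title can also be set with labs(), which takes the aesthetic name as argument. This is not what the example above uses; it is just a second route to the same title. I build the plot explicitly here instead of relying on last_plot(), and the name p_labs is illustrative.

```r
library(ggplot2)

df <- data.frame(
  a = c(1, 2, 1, 2),
  b = c(2, 2, 1, 1),
  d = as.Date(c("2012-01-01", "2012-01-02", "2012-01-03", "2012-01-04"))
)

p <- ggplot(df, aes(a, b, color = format(d, "%d"), group = 0)) +
  geom_point(size = 5) +
  geom_path()

# labs() sets the title of the guide that belongs to the color aesthetic
p_labs <- p + labs(color = "date")
p_labs
```

labs() is handy when only titles change, while the scale functions are the place to go once you also want to adjust colors or breaks.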
I think this looks pretty nice already.
Two charts on one plot
Let’s start with a line plot of variable / column a along the time line in d.
    ggplot(data=df, aes(d,a)) + geom_line()
Obviously the labeling of the x-axis is not as we want it because there are even hours:minutes displayed. So next we specify that we just want to see the days by addressing the x-axis.
    # date_format() lives in the scales package, so we load it as well.
    library(scales)

    # we address the last plot and "add" a new scaling to the x axis which tells
    # ggplot2 that we are dealing with dates on x, only want to see the day number
    # and only want one label per day.
    last_plot() +
      scale_x_date(labels = date_format("%d"), breaks="1 day")
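More recent ggplot2 releases also offer the arguments date_labels and date_breaks on scale_x_date(), which take the format string and the interval directly. The sketch below assumes such a release is installed and is meant as an alternative spelling of the same axis, not a replacement for the code above; p_dates is an illustrative name.

```r
library(ggplot2)

df <- data.frame(
  a = c(1, 2, 1, 2),
  d = as.Date(c("2012-01-01", "2012-01-02", "2012-01-03", "2012-01-04"))
)

# date_labels takes a strftime-style format, date_breaks an interval string
# (assumes a ggplot2 release that provides these two arguments)
p_dates <- ggplot(df, aes(d, a)) +
  geom_line() +
  scale_x_date(date_labels = "%d", date_breaks = "1 day")
p_dates
```

With this spelling there is no need to load the scales package just for the label format.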
Now we take it a step further and display two charts in one plot. Because we are now addressing two column pairs, d,a and d,b, we have to move the “aesthetics” from ggplot() to the respective geom_line().
    p <- ggplot(data=df)
    p <- p + geom_line(aes(d,a))
    p <- p + geom_line(aes(d,b))
    p <- p + scale_x_date(labels = date_format("%d"), breaks="1 day")
    p
Now we have the two charts in one plot but two things about it are problematic:
- The title of the y-axis is only referring to variable a.
- We don’t know which line is representing which variable.
    p <- ggplot(data=df)

    # color="a" says "this set of points is supposed to be colored the same
    # and regarding this feature is called 'a'."
    p <- p + geom_line(aes(d,a,color="a"))
    p <- p + geom_line(aes(d,b,color="b"))

    p <- p + scale_x_date(labels = date_format("%d"), breaks="1 day")
    p <- p + scale_y_continuous(name="a, b")
    p <- p + scale_color_discrete(name="col")

    p
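Another way to get the same picture is to reshape the data into a “long” format first, with one column holding the values and one holding the variable name, and then map color to that name. This is a different technique from the two separate geom_line() calls above, sketched here with base R only; the names long, variable, value and p_long are my own.

```r
library(ggplot2)

df <- data.frame(
  a = c(1, 2, 1, 2),
  b = c(2, 2, 1, 1),
  d = as.Date(c("2012-01-01", "2012-01-02", "2012-01-03", "2012-01-04"))
)

# Stack a and b on top of each other and remember where each value came from
long <- rbind(
  data.frame(d = df$d, variable = "a", value = df$a),
  data.frame(d = df$d, variable = "b", value = df$b)
)

# A single geom_line() is enough: color splits the rows into one line each
p_long <- ggplot(long, aes(d, value, color = variable)) +
  geom_line() +
  scale_x_date(date_labels = "%d", date_breaks = "1 day") +
  scale_y_continuous(name = "a, b") +
  scale_color_discrete(name = "col")
p_long
```

The advantage shows once there are more than two variables, because the plotting code stays the same and only the data frame grows.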
Exchanging the underlying data frame
Let’s say we want to reuse the plot specified above but with different data. For this purpose there is a special operator, ‘%+%’.
    df2 <- data.frame(
      a=c(2,2,2,1),
      b=c(1.5,1,1,2),
      d=as.Date(c('2012-02-01','2012-02-02','2012-02-03','2012-02-04'))
    )

    # this exchanges the data frame used in the data setting to df2
    p <- p %+% df2
    p
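To convince yourself that only the data was exchanged, you can compare the data stored in the plot object before and after. This is a small sketch with a simplified plot (one line only); p_old and p_new are illustrative names.

```r
library(ggplot2)

df <- data.frame(
  a = c(1, 2, 1, 2),
  b = c(2, 2, 1, 1),
  d = as.Date(c("2012-01-01", "2012-01-02", "2012-01-03", "2012-01-04"))
)
df2 <- data.frame(
  a = c(2, 2, 2, 1),
  b = c(1.5, 1, 1, 2),
  d = as.Date(c("2012-02-01", "2012-02-02", "2012-02-03", "2012-02-04"))
)

p_old <- ggplot(data = df) + geom_line(aes(d, a))

# Swap the data frame; layers and aesthetic mappings are kept as they are
p_new <- p_old %+% df2

# The stored dates move from January to February
range(p_old$data$d)
range(p_new$data$d)
```

Note that the new data frame needs the same column names as the old one, otherwise the existing mappings have nothing to refer to.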
How to progress?
The good news is that ggplot2 has a big community, and it seems like almost every question has already been addressed somewhere and can be tracked down with Google. If you can’t find the solution to a problem, I recommend posting a question on stackoverflow.com. But as ggplot2 is a very powerful tool, it is a good idea to learn it more thoroughly than just by trial and error. For this purpose I recommend the book “ggplot2 – Elegant Graphics for Data Analysis”, which is written by its creator. It refers to a prior version and is hence partly outdated, but this doesn’t matter much, because a lot is still the same and it communicates the concept behind ggplot2 very well, which is what matters most.
Also definitely check out these two official sources:
– Thanks, Raffael