
Using ggplot2

Using ggplot2 and qplot

========================================================

Prepared by Shane Mueller

Method Overview

The ggplot2 library is the successor to the ggplot library; the ‘gg’ stands for the ‘grammar of graphics’. It produces attractive, professional-looking graphics that work especially well in presentations. This comes at the cost of some of the flexibility that standard R graphics give, but the trade is often worthwhile. The ggplot2 library was developed by Hadley Wickham, who also developed reshape2 and dplyr. Because of its peculiarities (notably the data layout it expects), you may end up using these other libraries to make full use of its power.

There are many basic introductory tutorials for ggplot available. See the links in the resources below.

Finally, ggplot has pretty nice control over the graphics formats you save. It has functions that will create several different formats directly, with specified dimensions and dpi, so you are not at the whim of the RStudio window for creating your figure resolutions.

There is a more modern graphics library being developed by the developers of ggplot2 called ggvis, which intends to make interactive web graphics. If you understand ggplot2, ggvis should be easy to pick up and let you do some exciting things.

Overview of Functions

There are two main ways to use ggplot2. The ‘traditional’ way (using the ggplot() function) can be difficult but offers a lot of power. The simple way (using qplot) is more straightforward, but can be a bit more limiting.

The easy way: qplot

qplot stands for quickplot, and either function name can be used. It wraps up all major plotting methods into one function. If we look at the function definition, some of the arguments are similar to those of the normal plot() function: you specify x and y values, labels, limits, etc. The geom argument specifies what type of plot you want to create.

qplot(x, y = NULL, ..., data, facets = NULL, margins = FALSE,
  geom = "auto", stat = list(NULL), position = list(NULL), xlim = c(NA,
  NA), ylim = c(NA, NA), log = "", main = NULL,
  xlab = deparse(substitute(x)), ylab = deparse(substitute(y)), asp = NA)

Let’s look at the crabs data set in the MASS library. This data set has 200 observations of crabs (100 male and 100 female) on 5 different measures, from two species (blue or orange). FL = frontal lobe size; RW = rear width; CL = carapace length; CW = carapace width; BD = body depth.

  sp sex index   FL  RW   CL   CW  BD
1  B   M     1  8.1 6.7 16.1 19.0 7.0
2  B   M     2  8.8 7.7 18.1 20.8 7.4
3  B   M     3  9.2 7.8 19.0 22.4 7.7
4  B   M     4  9.6 7.9 20.1 23.1 8.2
5  B   M     5  9.8 8.0 20.3 23.0 8.2
6  B   M     6 10.8 9.0 23.0 26.5 9.8

Let’s look at the frontal lobe size (FL). Calling qplot with one continuous variable will just plot a histogram.
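The call for this step might look like the sketch below. It is a reconstruction under the assumption that the MASS and ggplot2 packages are installed, not necessarily the exact call used to make the original figure. Note that ggplot2 3.4.0 and later mark qplot() as deprecated, so a warning may appear.

```r
library(MASS)     # provides the crabs data set
library(ggplot2)

# With a single continuous variable and no geom, qplot draws a histogram
p <- qplot(FL, data = crabs)
print(p)
```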

I’m not enamored with how this looks. Let’s spruce it up a bit:
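One possible way to dress up the histogram is sketched here. The bin width, colours, and label text are illustrative choices, not values taken from the original figure.

```r
library(MASS)
library(ggplot2)

# I() marks a value as a fixed setting rather than a mapping to a variable
p <- qplot(FL, data = crabs, geom = "histogram", binwidth = 1,
           fill = I("steelblue"), colour = I("white"),
           main = "Frontal lobe size in the crabs data",
           xlab = "Frontal lobe size (FL)", ylab = "Number of crabs")
print(p)
```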

What if we wanted a separate histogram for each species? We can use the facets argument to make separate plots for each level of another variable.
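A minimal sketch of faceting by species follows. The formula sp ~ . stacks one panel per species in rows; the bin width is an arbitrary choice.

```r
library(MASS)
library(ggplot2)

# rows ~ columns: one row of panels per species, no column variable
p <- qplot(FL, data = crabs, binwidth = 1, facets = sp ~ .)
print(p)
```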

Or, we can make facets in a grid along two IVs:
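A grid over two variables can be sketched like this, assuming sex defines the rows and species the columns (the original figure may have used the opposite arrangement).

```r
library(MASS)
library(ggplot2)

# sex in rows, species in columns gives a 2 x 2 grid of histograms
p <- qplot(FL, data = crabs, binwidth = 1, facets = sex ~ sp)
print(p)
```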

The rows and columns aren’t entirely easy to read, but you could change the names of these with dplyr.

Boxplots

The information in these histograms can be summarized in a boxplot, but now you need to request it with the geom argument.

You could add the points on top of these by giving the geom argument a vector of geoms:
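As a sketch, the boxplot version needs a grouping variable on x and the measure on y. Species is used as the grouping variable here, which is an assumption about the original figure.

```r
library(MASS)
library(ggplot2)

# One box per species, summarizing the distribution of FL
p <- qplot(sp, FL, data = crabs, geom = "boxplot")
print(p)
```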
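A sketch of combining two geoms in one qplot call: the geom argument accepts a character vector, and the layers are drawn in the order given.

```r
library(MASS)
library(ggplot2)

# Boxes first, then the raw observations drawn on top
p <- qplot(sp, FL, data = crabs, geom = c("boxplot", "point"))
print(p)
```

Replacing "point" with "jitter" spreads the points horizontally, which can make overlapping observations easier to see.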

And of course, this could be faceted too:
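The faceted version might look like this sketch, assuming one panel per sex placed side by side.

```r
library(MASS)
library(ggplot2)

# Same boxplot-plus-points display, split into one column per sex
p <- qplot(sp, FL, data = crabs, geom = c("boxplot", "point"),
           facets = . ~ sex)
print(p)
```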

Making scatter plots with qplot

If you want to plot points against one another, you specify both an x and a y value. Here the plot is again faceted by sex, and each point is colored by its species.

Let’s see how frontal lobe size changes as carapace width changes. To do this, let’s start by creating quantiles of CW.
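A scatter plot along these lines could be written as below. The choice of CW for x and FL for y is an assumption; the text does not say which pair of measures the original figure used.

```r
library(MASS)
library(ggplot2)

# Two continuous variables give a scatter plot; colour maps to species
p <- qplot(CW, FL, data = crabs, colour = sp, facets = . ~ sex)
print(p)
```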
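One way to bin CW is sketched here using quartiles. The number of bins, the labels, and the names crabs2 and CWq are all illustrative assumptions.

```r
library(MASS)
library(ggplot2)

crabs2 <- crabs   # work on a copy so the original data set is untouched

# Cut carapace width at its quartiles, keeping the minimum in the first bin
breaks <- quantile(crabs2$CW, probs = seq(0, 1, by = 0.25))
crabs2$CWq <- cut(crabs2$CW, breaks = breaks, include.lowest = TRUE,
                  labels = c("Q1", "Q2", "Q3", "Q4"))

# Distribution of frontal lobe size within each carapace-width quartile
p <- qplot(CWq, FL, data = crabs2, geom = "boxplot", colour = sp)
print(p)
```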

Making barplots with qplot (deprecated)

Bar plots usually require a summary statistic computed from the data. In the past, qplot could compute this from the raw data, but it no longer can (and users are directed to the more complex ggplot()).

How to make a ‘matplot’ in ggplot2

In standard R, the matplot function is handy because it allows you to take a matrix of values and plot several series against each other.
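A sketch of the ggplot() route: compute the statistic first, then draw the precomputed values with geom_col(). Mean FL by species and sex is used as an example statistic; the original page may have plotted something else.

```r
library(MASS)
library(ggplot2)

# Step 1: compute the statistic (mean frontal lobe size per group)
fl_means <- aggregate(FL ~ sp + sex, data = crabs, FUN = mean)

# Step 2: plot the precomputed values, with bars for each sex side by side
p <- ggplot(fl_means, aes(x = sp, y = FL, fill = sex)) +
  geom_col(position = "dodge")
print(p)
```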
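For comparison, a base R matplot of the group means might look like this sketch. The layout (one series per species and sex combination, measures along the x axis) is an assumption about the intended figure.

```r
library(MASS)

# Mean of each measure for every species x sex combination
means <- aggregate(cbind(FL, RW, CL, CW, BD) ~ sp + sex, data = crabs,
                   FUN = mean)

# matplot draws one series per column, so put the groups in columns
m <- t(as.matrix(means[, c("FL", "RW", "CL", "CW", "BD")]))
colnames(m) <- paste(means$sex, means$sp)

matplot(m, type = "b", xaxt = "n", xlab = "Measure", ylab = "Mean value")
axis(1, at = seq_len(nrow(m)), labels = rownames(m))
```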

Let’s say we want to do something like this, but with qplot/ggplot. We quickly find out that it is not easy. It doesn’t work because ggplot expects a single variable holding the values, along with instructions for how that variable should be plotted. We can do one series, but not multiple:

To do multiple series, we can rely on melt in reshape2.
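A single series can be sketched as follows, using the means for blue females as the example series. The data frame name one and its columns are illustrative.

```r
library(MASS)
library(ggplot2)

cols <- c("FL", "RW", "CL", "CW", "BD")
bf   <- subset(crabs, sp == "B" & sex == "F")

# One row per measure; the factor levels keep the measures in table order
one <- data.frame(measure = factor(cols, levels = cols),
                  value   = colMeans(bf[, cols]))

# group = 1 tells ggplot to connect all five points with one line
p <- qplot(measure, value, data = one, geom = "line", group = 1)
print(p)
```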
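A minimal sketch of the melt step, assuming the reshape2 package is installed. The column names differ slightly from the printed output (sp rather than species), and the grouping and styling choices in the plot are assumptions.

```r
library(MASS)
library(reshape2)
library(ggplot2)

# Wide format: one row per group, one column per measure
means <- aggregate(cbind(FL, RW, CL, CW, BD) ~ sp + sex, data = crabs,
                   FUN = mean)

# Long ('melted') format: one row per group x measure combination
long <- melt(means, id.vars = c("sp", "sex"))
head(long)

# Each species x sex combination becomes its own line
p <- qplot(variable, value, data = long, geom = "line",
           group = interaction(sp, sex), colour = sp, linetype = sex)
print(p)
```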

  species sex aggname variable  value
1       B   F     F B       FL 13.270
2       O   F     F O       FL 17.594
3       B   M     M B       FL 14.842
4       O   M     M O       FL 16.626
5       B   F     F B       RW 12.138
6       O   F     F O       RW 14.836

  species sex variable  value
1       B   F       FL 13.270
2       O   F       FL 17.594
3       B   M       FL 14.842
4       O   M       FL 16.626
5       B   F       RW 12.138
6       O   F       RW 14.836

Adding adornments

ggplot is called ‘grammar of graphics’ because it is a true grammar for composing more complex graphics. In the previous example, suppose we wanted to add points to the plot. We can do so with syntax like this:
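The layering syntax can be sketched like this: a plot object is built first, and further layers are added with the + operator. The data preparation repeats the melt step so that the example runs on its own.

```r
library(MASS)
library(reshape2)
library(ggplot2)

means <- aggregate(cbind(FL, RW, CL, CW, BD) ~ sp + sex, data = crabs,
                   FUN = mean)
long  <- melt(means, id.vars = c("sp", "sex"))

# Start with the line plot, then add a layer of points on top
p <- qplot(variable, value, data = long, geom = "line",
           group = interaction(sp, sex), colour = sp) +
  geom_point()
print(p)
```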

For simple graphs with minimal adornment, this method is fine. But qplot is a shortcut for a more complex set of operations using the ggplot function. If you want to do more complex things, you will need to use ggplot().

The powerful way: ggplot

The ggplot function provides much more precise control over how things are formatted, aggregated, and displayed. ggplot requires you to use a data frame in a ‘melted’ format (qplot seems to be a bit more forgiving). All of the additional adornments work fine, but now you need to think harder about how to organize your data in the ggplot function. If you do this right, it makes organizing things fairly easy.

Now, instead of qplot, we need to use ggplot. It works almost like qplot, but instead of specifying x and y directly, we need to add logic that tells how the ‘aesthetics’ map onto the data columns. This is done via the aes() function. We also need to add a geom function to the ggplot() call to display anything.
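The same line plot written with ggplot() might look like the sketch below. The aesthetic choices (colour by species, line type by sex) are assumptions.

```r
library(MASS)
library(reshape2)
library(ggplot2)

means <- aggregate(cbind(FL, RW, CL, CW, BD) ~ sp + sex, data = crabs,
                   FUN = mean)
long  <- melt(means, id.vars = c("sp", "sex"))

# aes() maps data columns to aesthetics; on its own this draws nothing
base <- ggplot(long, aes(x = variable, y = value,
                         group = interaction(sp, sex),
                         colour = sp, linetype = sex))

# Adding a geom is what actually puts marks on the plot
p <- base + geom_line()
print(p)
```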

Error bars

Suppose we want to add error bars. We need to first compute standard errors for our data set, which shouldn’t be too hard.
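One way to compute the standard errors is sketched here. The helper se() and the merge-based approach are assumptions chosen to reproduce the value.x (mean) and value.y (standard error) columns; the original code may have differed.

```r
library(MASS)
library(reshape2)

# Standard error of the mean
se <- function(x) sd(x) / sqrt(length(x))

measures <- cbind(FL, RW, CL, CW, BD) ~ sp + sex
means <- melt(aggregate(measures, data = crabs, FUN = mean),
              id.vars = c("sp", "sex"))
ses   <- melt(aggregate(measures, data = crabs, FUN = se),
              id.vars = c("sp", "sex"))

# merge() renames the two 'value' columns to value.x (mean) and value.y (se)
long <- merge(means, ses, by = c("sp", "sex", "variable"))
long$ymin <- long$value.x - long$value.y
long$ymax <- long$value.x + long$value.y
head(long)
```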

  species sex variable value.x   value.y     ymin     ymax
1       B   F       FL  13.270 0.3716291 12.89837 13.64163
2       O   F       FL  17.594 0.4205867 17.17341 18.01459
3       B   M       FL  14.842 0.4529008 14.38910 15.29490
4       O   M       FL  16.626 0.4970920 16.12891 17.12309
5       B   F       RW  12.138 0.3448855 11.79311 12.48289
6       O   F       RW  14.836 0.3321146 14.50389 15.16811

Now we have a single data frame in which value.x holds the mean values and ymin/ymax hold the bottom and top of the error bars (+/- one s.e.).
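A basic error-bar plot from such a data frame could be sketched as follows. The data preparation is repeated so that the example runs on its own, and the helper se() is illustrative.

```r
library(MASS)
library(reshape2)
library(ggplot2)

se <- function(x) sd(x) / sqrt(length(x))
measures <- cbind(FL, RW, CL, CW, BD) ~ sp + sex
long <- merge(melt(aggregate(measures, data = crabs, FUN = mean),
                   id.vars = c("sp", "sex")),
              melt(aggregate(measures, data = crabs, FUN = se),
                   id.vars = c("sp", "sex")),
              by = c("sp", "sex", "variable"))
long$ymin <- long$value.x - long$value.y
long$ymax <- long$value.x + long$value.y

# geom_errorbar needs ymin and ymax aesthetics in addition to x
p <- ggplot(long, aes(x = variable, y = value.x,
                      group = interaction(sp, sex))) +
  geom_line() +
  geom_errorbar(aes(ymin = ymin, ymax = ymax))
print(p)
```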

This is effective but ugly. We can try a few other tweaks:
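Some possible tweaks are sketched below: narrower error bars, a small horizontal dodge so the bars do not overlap, axis labels, and a plainer theme. All of the specific values are illustrative.

```r
library(MASS)
library(reshape2)
library(ggplot2)

se <- function(x) sd(x) / sqrt(length(x))
measures <- cbind(FL, RW, CL, CW, BD) ~ sp + sex
long <- merge(melt(aggregate(measures, data = crabs, FUN = mean),
                   id.vars = c("sp", "sex")),
              melt(aggregate(measures, data = crabs, FUN = se),
                   id.vars = c("sp", "sex")),
              by = c("sp", "sex", "variable"))
long$ymin <- long$value.x - long$value.y
long$ymax <- long$value.x + long$value.y

# Shift each series slightly sideways so the error bars stay readable
dodge <- position_dodge(width = 0.3)

p <- ggplot(long, aes(x = variable, y = value.x,
                      group = interaction(sp, sex))) +
  geom_line(position = dodge) +
  geom_point(position = dodge) +
  geom_errorbar(aes(ymin = ymin, ymax = ymax), width = 0.2,
                position = dodge) +
  labs(x = "Measure", y = "Mean value (+/- 1 s.e.)") +
  theme_bw()
print(p)
```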

Adding a legend

The current figure is not much help, because we don’t know what each line means. ggplot adds a legend when an aesthetic such as colour or shape is mapped to a variable, so it sometimes omits a legend where you might expect one. In the above figure, each series was drawn as a separate line, but nothing distinguished the series from one another, so no legend appeared. Here, we set both colour and shape to depend on a variable, and the legends will show up.
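A sketch of the legend-producing version: colour is mapped to species and shape to sex, which is an assumption about which variable went with which aesthetic.

```r
library(MASS)
library(reshape2)
library(ggplot2)

se <- function(x) sd(x) / sqrt(length(x))
measures <- cbind(FL, RW, CL, CW, BD) ~ sp + sex
long <- merge(melt(aggregate(measures, data = crabs, FUN = mean),
                   id.vars = c("sp", "sex")),
              melt(aggregate(measures, data = crabs, FUN = se),
                   id.vars = c("sp", "sex")),
              by = c("sp", "sex", "variable"))
long$ymin <- long$value.x - long$value.y
long$ymax <- long$value.x + long$value.y

# Mapping colour and shape to variables is what triggers the legends
p <- ggplot(long, aes(x = variable, y = value.x,
                      group = interaction(sp, sex),
                      colour = sp, shape = sex)) +
  geom_line() +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = ymin, ymax = ymax), width = 0.2)
print(p)
```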

Saving

You can save graphics just like you always do, but the ggsave() function offers a method for saving the latest graphic you created.

ggsave chooses the format from the file extension, and currently can save as eps, ps, tex (pictex), pdf, jpeg, tiff, png, bmp, svg and wmf (windows only).
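A minimal ggsave() sketch follows. The file name, dimensions, and dpi are illustrative; the file is written to a temporary directory here so the example leaves nothing behind in the working directory.

```r
library(MASS)
library(ggplot2)

p <- ggplot(crabs, aes(x = CW, y = FL, colour = sp)) +
  geom_point()

# The .png extension selects the format; size and resolution are explicit
out <- file.path(tempdir(), "crabs-scatter.png")
ggsave(out, plot = p, width = 6, height = 4, units = "in", dpi = 300)
```

If the plot argument is left out, ggsave() saves the last plot that was displayed.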

Other features

The breadth of things you can do with ggplot is truly amazing. The default themes are nice, but you can change these fairly easily if you dig into examples provided in many places.

Exercises

Reporting

The default settings of ggplot2 are best for on-screen graphics. Some journals prefer black-and-white. Alternate themes are available that will still produce attractive black-and-white images.
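As a sketch of a black-and-white version, theme_bw() replaces the grey background and the groups are distinguished by point shape instead of colour. The choice of variables is illustrative.

```r
library(MASS)
library(ggplot2)

# Shape, not colour, separates the species, so nothing is lost in print
p <- ggplot(crabs, aes(x = CW, y = FL, shape = sp)) +
  geom_point() +
  theme_bw()
print(p)
```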

Most journals will be a bit picky about the graphic format of images. Many will want .tiff format, which is a compressed graphic format that is lossless. .png works this way as well, but you are best off sticking with something they know and understand.

During pdf creation, a journal sometimes compresses image-based graphics, and the compression can be lossy. With standard R graphics, you can save as a .ps format (rename to .eps), and most journals will be able to handle this as well. However, when a ggplot graphic is saved to an image format such as .tiff or .png it is rasterized, so overall your images will probably be larger and there could be quality issues if you don’t use a high enough dpi or image size.

Minimal example with ggvis

Assumptions and Limitations

ggplot is fairly flexible, and can produce beautiful graphics, but it can be a little frustrating to use when you are used to the layering and drawing of traditional R graphics. Many additional libraries aim to make ggplot more flexible, produce custom plots, or give alternative (easier) syntax to access the plots.

Resources

Shane T. Mueller shanem@mtu.edu

2019-01-16