46  Introduction to ggplot2

ggplot2, created by Hadley Wickham (Wickham 2011), follows the Grammar of Graphics approach of Leland Wilkinson (Wilkinson 2012) and has a very different syntax than base R graphics functions. It is based on the grid graphics package and is for the most part not compatible with base graphics.

The general idea is to start by defining the data and then add and/or modify graphical elements in a stepwise manner, which allows one to build complex, layered visualizations. ggplot2 also provides a simplified interface in the qplot() function, but qplot() has been deprecated since ggplot2 3.4.0. This chapter focuses on the basics of the ggplot() function, which is more flexible and the one worth learning.
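The stepwise approach can be sketched as follows. This is a minimal illustration that assumes a small made-up data frame, df, which is not part of this chapter's data; each + adds a layer to the plot object or modifies one of its elements.

```r
library(ggplot2)

# Small illustrative data frame (hypothetical values, not from this chapter)
df <- data.frame(
    x = c(1, 2, 3, 4, 5),
    y = c(2.1, 3.9, 6.2, 7.8, 10.1)
)

# Step 1: define the data and map variables to the x and y axes
p <- ggplot(df, aes(x, y))
# Step 2: add a layer of points
p <- p + geom_point()
# Step 3: add a second layer, a linear fit, drawn on top of the points
p <- p + geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
# Step 4: modify the axis labels
p <- p + xlab("Predictor") + ylab("Response")
# Printing the object draws the plot
p
```

Because the plot is an ordinary R object, intermediate steps can be stored, reused, and extended later.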

Warning

Do not try to combine base R graphics and ggplot2 graphics in the same plot. They are based on different graphics engines and are not compatible.

46.1 Setup

46.1.1 Packages

Load ggplot2, along with the tidyverse packages used below to prepare the data:

library(ggplot2)
library(dplyr)
library(tidyr)
library(tibble)
# You can alternatively load all tidyverse packages with:
# library(tidyverse)

46.1.2 Synthetic Data

library(data.table)
set.seed(2022)
dt <- data.table(
    PID = sample(8001:9000, size = 100),
    Age = rnorm(100, mean = 33, sd = 8),
    Weight = rnorm(100, mean = 70, sd = 9),
    SysBP = rnorm(100, mean = 110, sd = 6),
    DiaBP = rnorm(100, mean = 80, sd = 6),
    Sex = factor(sample(c("Female", "Male"), size = 100, replace = TRUE))
)
dt[, SysBP := SysBP + 0.5 * Age]
dt[Sex == "Male", Weight := Weight + rnorm(.N, mean = 16, sd = 1.5)]
dt[Sex == "Male", Age := Age + rnorm(.N, mean = 6, sd = 1.8)]
dt <- as_tibble(dt)

Define a color palette, palette_, and a version of the same palette at 2/3 opacity (alpha = 0.666), palette_a, for use in plots:

palette_ <- c("#43A4AC", "#FA9860")
palette_a <- adjustcolor(palette_, 0.666)

Note

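Somewhat confusingly, the aes() function, short for aesthetics, does not set the appearance of the plot directly. It defines the mapping from variables in the data to visual properties of the plot, such as x and y position, color, and fill.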
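The difference between mapping a variable inside aes() and setting a fixed value outside it can be sketched as follows. The data frame df and its columns are made up for illustration.

```r
library(ggplot2)

# Hypothetical data: six points in two groups
df <- data.frame(
    x = 1:6,
    y = c(3, 5, 4, 8, 7, 9),
    group = factor(rep(c("A", "B"), each = 3))
)

# Mapping: color depends on a variable, so it goes inside aes()
# and ggplot2 adds a legend for it
p_mapped <- ggplot(df, aes(x, y, color = group)) +
    geom_point()

# Setting: a single fixed color for all points, so it goes outside aes()
p_fixed <- ggplot(df, aes(x, y)) +
    geom_point(color = "#43A4AC")

# Inspect the mappings stored in each plot object
names(p_mapped$mapping)
names(p_fixed$mapping)
```

Note that ggplot2 accepts both the color and colour spellings and stores the aesthetic as colour internally.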

46.2 Box plot

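ggplot2 draws one box per level of a categorical x-axis variable. To plot SysBP and DiaBP side by side, we therefore need to convert our dataset from wide to long format, so that the name of the measure becomes a column of its own.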

dt_long <- dt |> pivot_longer(
    cols = c("SysBP", "DiaBP"),
    names_to = "Measure",
    values_to = "Value"
)
dt_long
# A tibble: 200 × 6
     PID   Age Weight Sex    Measure Value
   <int> <dbl>  <dbl> <fct>  <chr>   <dbl>
 1  8228  28.5   82.2 Male   SysBP   122. 
 2  8228  28.5   82.2 Male   DiaBP    79.1
 3  8435  50.1   72.6 Female SysBP   136. 
 4  8435  50.1   72.6 Female DiaBP    84.3
 5  8718  31.0   73.0 Female SysBP   124. 
 6  8718  31.0   73.0 Female DiaBP    78.1
 7  8823  30.0   77.5 Male   SysBP   126. 
 8  8823  30.0   77.5 Male   DiaBP    80.9
 9  8843  40.5   86.7 Male   SysBP   133. 
10  8843  40.5   86.7 Male   DiaBP    68.6
# ℹ 190 more rows
p <- ggplot(dt_long, aes(Measure, Value)) + 
    geom_boxplot()
p

We can specify color and fill to change the color of the boxplot border and fill, respectively.

p <- ggplot(dt_long, aes(Measure, Value)) + 
    geom_boxplot(color = palette_[1:2], fill = palette_a[1:2])
p
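Passing a vector of colors directly to geom_boxplot() relies on the vector having exactly one entry per box. An alternative, sketched here under the assumption of a small simulated long-format data frame (bp_long, made up for this example), is to map color and fill to the grouping variable and define the colors with scale_color_manual() and scale_fill_manual().

```r
library(ggplot2)

set.seed(2022)
# Hypothetical long-format data standing in for dt_long
bp_long <- data.frame(
    Measure = rep(c("DiaBP", "SysBP"), each = 50),
    Value = c(rnorm(50, mean = 80, sd = 6), rnorm(50, mean = 125, sd = 7))
)

palette_ <- c("#43A4AC", "#FA9860")
palette_a <- adjustcolor(palette_, 0.666)

# Map border and fill color to Measure, then choose the colors per level
p <- ggplot(bp_long, aes(Measure, Value, color = Measure, fill = Measure)) +
    geom_boxplot() +
    scale_color_manual(values = palette_) +
    scale_fill_manual(values = palette_a) +
    theme(legend.position = "none") # x-axis already labels the groups
p
```

With this approach the colors stay attached to the factor levels even if the number or order of boxes changes.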

46.2.1 Grouped boxplot

p <- ggplot(dt, aes(x = Sex, y = Age)) + 
    geom_boxplot(colour = palette_[2:1], fill = palette_a[2:1])
p
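Boxplots can also be grouped by a second categorical variable by mapping it to fill, in which case ggplot2 places the boxes for each group side by side. The following is a minimal sketch using simulated data; the data frame bp and its values are assumptions made for illustration.

```r
library(ggplot2)

set.seed(2022)
# Hypothetical long-format data: two measures, each recorded for both sexes
bp <- data.frame(
    Sex = factor(rep(c("Female", "Male"), times = 40)),
    Measure = rep(c("DiaBP", "SysBP"), each = 40),
    Value = c(rnorm(40, mean = 80, sd = 6), rnorm(40, mean = 125, sd = 7))
)

palette_a <- adjustcolor(c("#43A4AC", "#FA9860"), 0.666)

# One box per combination of Measure (x-axis) and Sex (fill)
p <- ggplot(bp, aes(x = Measure, y = Value, fill = Sex)) +
    geom_boxplot() +
    scale_fill_manual(values = palette_a[2:1])
p
```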

46.3 Histogram

p <- ggplot(dt, aes(Age)) + 
    geom_histogram()
p

We can specify the number of bins to use with the bins argument and the border and fill colors with color and fill, respectively, as above. xlab() can be used to define the x-axis label.

p <- ggplot(dt, aes(Age)) + 
    geom_histogram(bins = 24, color = palette_[1], fill = palette_a[1]) +
    xlab("Age (years)")
p
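Instead of the number of bins, the width of each bin can be specified with the binwidth argument. The sketch below assumes a simulated Age variable (the data frame ages is made up for this example) and uses layer_data() to inspect the bins that ggplot2 computed.

```r
library(ggplot2)

set.seed(2022)
# Hypothetical ages, similar in spirit to the synthetic data above
ages <- data.frame(Age = rnorm(100, mean = 33, sd = 8))

# binwidth sets the width of each bin in data units (here, 2 years);
# boundary = 0 aligns bin edges to multiples of the bin width
p_bw <- ggplot(ages, aes(Age)) +
    geom_histogram(
        binwidth = 2, boundary = 0,
        color = "#43A4AC", fill = adjustcolor("#43A4AC", 0.666)
    ) +
    xlab("Age (years)")
p_bw

# layer_data() returns the values computed for a layer: one row per bin
bins <- layer_data(p_bw)
head(bins[, c("xmin", "xmax", "count")])
```

Specifying binwidth keeps bins comparable across plots of the same variable, whereas bins adapts the width to the range of the data.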

46.3.1 Grouped Histogram

p <- ggplot(dt, aes(x = Age, fill = Sex)) + 
    geom_histogram(bins = 24, position = "identity")
p

scale_fill_manual() can be used to define the colors of the bars:

p <- ggplot(dt, aes(x = Age, fill = Sex)) + 
    geom_histogram(bins = 24, position = "identity") +
    scale_fill_manual(values = palette_a[2:1])
p

position = "identity" displays overlapping bars. Alternatively, position = "dodge" displays the bars of each group side by side. Finally, position = "stack" is the (unfortunate) default and results in vertically stacked bars, which can be confusing.

p <- ggplot(dt, aes(x = Age, fill = Sex)) + 
    geom_histogram(bins = 24, position = "dodge") +
    scale_fill_manual(values = palette_a[2:1])
p

p <- ggplot(dt, aes(x = Age, fill = Sex)) + 
    geom_histogram(bins = 24, position = "stack") +
    scale_fill_manual(values = palette_a[2:1])
p

46.4 Density plot

p <- ggplot(dt, aes(x = Age)) + 
    geom_density(color = palette_[1], fill = palette_a[1])
p

46.4.1 Grouped density plot

p <- ggplot(dt, aes(x = Age, fill = Sex)) + 
    geom_density(color = NA) +
    scale_fill_manual(values = palette_a[2:1])
p

46.5 Barplot

schools <- data.frame(UCSF = 4, Stanford = 7, Penn = 12)

ggplot2 requires an explicit column in the data that defines the categorical x-axis:

schools_df <- data.frame(
    University = factor(colnames(schools),
        levels = c("UCSF", "Stanford", "Penn")),
    N_schools = as.numeric(schools[1, ])
)
ggplot(schools_df, aes(University, N_schools)) +
  geom_bar(
    stat = "identity",
    color = palette_a[1],
    fill = palette_a[1]
  )
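geom_col() is a shorthand for geom_bar(stat = "identity"): it uses the y values in the data as bar heights, whereas geom_bar() by default counts the rows in each category. A minimal sketch, rebuilding the same schools_df so that the block runs on its own:

```r
library(ggplot2)

schools_df <- data.frame(
    University = factor(c("UCSF", "Stanford", "Penn"),
        levels = c("UCSF", "Stanford", "Penn")),
    N_schools = c(4, 7, 12)
)

fill_col <- adjustcolor("#43A4AC", 0.666)

# geom_col() draws bars whose heights are the values of N_schools
p_col <- ggplot(schools_df, aes(University, N_schools)) +
    geom_col(color = fill_col, fill = fill_col)
p_col

# Bar heights as computed by ggplot2, in the order of the factor levels
layer_data(p_col)$y
```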

46.6 Scatterplot

p <- ggplot(dt, aes(Age, SysBP)) +
    geom_point(color = palette_a[1])
p

46.6.1 Grouped Scatterplot

p <- ggplot(dt, aes(Age, SysBP, col = Sex)) +
    geom_point() +
    scale_color_manual(values = palette_[2:1])
p

46.7 Save plot to file

We’ll reuse the grouped boxplot example from above to show how to save a plot to file with ggsave(), using PDF output as an example. The same call works for any of the plot types in this chapter.

p <- ggplot(dt, aes(x = Sex, y = Age)) + 
    geom_boxplot(colour = palette_[2:1], fill = palette_a[2:1])
ggsave("Age_by_Sex_ggplot.pdf", p,
    width = 5.5, height = 5.5, scale = 1, units = "in")
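ggsave() chooses the graphics device from the file extension, so the same call can write other formats. The sketch below assumes a small made-up plot and writes to temporary files; the dpi argument sets the resolution of raster output such as PNG and does not matter for vector output such as PDF.

```r
library(ggplot2)

# Hypothetical plot used only to demonstrate saving
df <- data.frame(x = 1:10, y = (1:10)^2)
p <- ggplot(df, aes(x, y)) +
    geom_point()

# Temporary files, so the example leaves nothing in the working directory
pdf_file <- tempfile(fileext = ".pdf")
png_file <- tempfile(fileext = ".png")

# Vector output: size is given in inches
ggsave(pdf_file, p, width = 5.5, height = 5.5, units = "in")
# Raster output: dpi controls the pixel dimensions (5.5 in x 300 dpi = 1650 px)
ggsave(png_file, p, width = 5.5, height = 5.5, units = "in", dpi = 300)

# Confirm that both files were written
file.exists(c(pdf_file, png_file))
```

Passing the plot object explicitly, as here, is safer than relying on the default of saving the last plot displayed.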

46.8 Resources