20 Aggregate

R includes a number of commands to apply functions on splits of your data. aggregate() is a powerful tool for performing such “group-by” operations.

The function accepts either:

a formula as the first argument and a data.frame passed to the data argument
an R object (vector, data.frame, list) as the first argument and one or more factors passed as a list to the by argument

We shall see how to perform each operation below with each approach.

The formula interface might be easier to work with interactively on the console. Note that while you can programmatically create a formula, it is easier to use vector inputs when calling aggregate() programmatically.
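As a minimal sketch of the programmatic case, the example below builds the same call from column names stored as strings, once with a formula and once with plain indexing. The toy data.frame df and the variables value_col and group_col are hypothetical and only serve the illustration; reformulate() is the base R helper for building a formula from character strings.

```r
# Toy data standing in for a real dataset (hypothetical values)
df <- data.frame(
  group = c("a", "a", "b", "b"),
  value = c(1, 3, 10, 20)
)

# Column names held in variables, as they would be inside a function
value_col <- "value"
group_col <- "group"

# Option 1: build the formula value ~ group from the strings
f <- reformulate(group_col, response = value_col)
res_formula <- aggregate(f, data = df, FUN = mean)

# Option 2: index the data.frame with the same strings, no formula needed
res_objects <- aggregate(df[value_col], by = df[group_col], FUN = mean)

res_formula
res_objects
```

Both calls return one row per group with the group means (2 for a, 15 for b). The second form avoids constructing a formula altogether, which is why vector or data.frame inputs tend to be simpler inside functions.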

For this example, we shall use the penguin data from the palmerpenguins package:

library(palmerpenguins)
str(penguins)

tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
 $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
 $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
 $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
 $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
 $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
 $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...

See the examples below for one or multiple variables by one or more groups, using either the formula interface or working directly on objects with $-indexing or with():

20.1 Single variable by single grouping

Note that the formula method defaults to na.action = na.omit, so rows with missing values are dropped before the function is applied.
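To make the effect of this default concrete, here is a small sketch on a hypothetical toy data.frame (df, g and v are made-up names). It compares the formula interface with the object interface, with and without na.rm = TRUE.

```r
# Toy data with one missing value in group "a" (hypothetical values)
df <- data.frame(
  g = c("a", "a", "b", "b"),
  v = c(1, NA, 10, 20)
)

# Formula interface: the row with NA is removed before mean() is called
res_formula <- aggregate(v ~ g, data = df, FUN = mean)

# Object interface: the NA reaches mean(), so group "a" becomes NA
res_objects <- aggregate(df["v"], by = df["g"], FUN = mean)

# Object interface with na.rm = TRUE passed on to mean()
res_objects_narm <- aggregate(df["v"], by = df["g"], FUN = mean, na.rm = TRUE)

res_formula       # a = 1,  b = 15
res_objects       # a = NA, b = 15
res_objects_narm  # a = 1,  b = 15
```

For a single variable, the formula default and na.rm = TRUE give the same means; the object interface needs na.rm = TRUE to match.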

Using the formula interface:

aggregate(bill_length_mm ~ species,
          data = penguins,
          mean, na.rm = TRUE)

    species bill_length_mm
1    Adelie       38.79139
2 Chinstrap       48.83382
3    Gentoo       47.50488

Using R objects directly:

aggregate(penguins$bill_length_mm,
          by = list(penguins$species),
          mean, na.rm = TRUE)

    Group.1        x
1    Adelie 38.79139
2 Chinstrap 48.83382
3    Gentoo 47.50488

Note that, unlike the formula notation, if your input is an unnamed vector, the output columns receive the default names Group.1 and x.

If instead of passing a vector, you pass a data.frame or list with one or more named elements, the output includes the names:

aggregate(penguins["bill_length_mm"],
          by = penguins["species"],
          mean, na.rm = TRUE)

    species bill_length_mm
1    Adelie       38.79139
2 Chinstrap       48.83382
3    Gentoo       47.50488

Creating a list instead of indexing the given data.frame also allows you to set custom names (note that a non-syntactic name such as `Bill length` appears as Bill.length in the output):

aggregate(list(`Bill length` = penguins$bill_length_mm),
          by = list(Species = penguins$species),
          mean, na.rm = TRUE)

    Species Bill.length
1    Adelie    38.79139
2 Chinstrap    48.83382
3    Gentoo    47.50488

20.2 Multiple variables by single grouping

Formula notation:

aggregate(cbind(bill_length_mm, flipper_length_mm) ~ species,
          data = penguins,
          mean)

    species bill_length_mm flipper_length_mm
1    Adelie       38.79139          189.9536
2 Chinstrap       48.83382          195.8235
3    Gentoo       47.50488          217.1870

Objects:

aggregate(penguins[, c("bill_length_mm", "flipper_length_mm")],
          by = list(Species = penguins$species),
          mean, na.rm = TRUE)

    Species bill_length_mm flipper_length_mm
1    Adelie       38.79139          189.9536
2 Chinstrap       48.83382          195.8235
3    Gentoo       47.50488          217.1870
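The two interfaces agree above, but they need not in general. With cbind() in a formula, na.omit removes a row if any of the variables is missing, whereas na.rm = TRUE in the object interface drops missing values separately for each variable. The sketch below shows the difference on a hypothetical toy data.frame (df, g, x and y are made-up names).

```r
# Toy data where x and y are missing in different rows (hypothetical values)
df <- data.frame(
  g = c("a", "a", "b", "b"),
  x = c(1, NA, 3, 5),
  y = c(10, 20, 30, NA)
)

# Formula interface: rows 2 and 4 are dropped entirely,
# because each has an NA in one of the two variables
res_formula <- aggregate(cbind(x, y) ~ g, data = df, FUN = mean)

# Object interface: NAs are removed per variable inside mean()
res_objects <- aggregate(df[c("x", "y")], by = df["g"],
                         FUN = mean, na.rm = TRUE)

res_formula  # a: x = 1, y = 10; b: x = 3, y = 30
res_objects  # a: x = 1, y = 15; b: x = 4, y = 30
```

Which behavior is appropriate depends on whether the summaries should be based on complete cases only or on all available values of each variable.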

20.3 Single variable by multiple groups

Formula notation:

aggregate(bill_length_mm ~ species + island, data = penguins, mean)

    species    island bill_length_mm
1    Adelie    Biscoe       38.97500
2    Gentoo    Biscoe       47.50488
3    Adelie     Dream       38.50179
4 Chinstrap     Dream       48.83382
5    Adelie Torgersen       38.95098

Objects:

aggregate(penguins["bill_length_mm"],
          by = list(Species = penguins$species, 
                    Island = penguins$island),
          mean, na.rm = TRUE)

    Species    Island bill_length_mm
1    Adelie    Biscoe       38.97500
2    Gentoo    Biscoe       47.50488
3    Adelie     Dream       38.50179
4 Chinstrap     Dream       48.83382
5    Adelie Torgersen       38.95098

20.4 Multiple variables by multiple groupings

Formula notation:

aggregate(cbind(bill_length_mm, flipper_length_mm) ~ species + island,
          data = penguins, mean)

    species    island bill_length_mm flipper_length_mm
1    Adelie    Biscoe       38.97500          188.7955
2    Gentoo    Biscoe       47.50488          217.1870
3    Adelie     Dream       38.50179          189.7321
4 Chinstrap     Dream       48.83382          195.8235
5    Adelie Torgersen       38.95098          191.1961

Objects:

aggregate(penguins[, c("bill_length_mm", "flipper_length_mm")],
          by = list(Species = penguins$species, 
                    Island = penguins$island),
          mean, na.rm = TRUE)

    Species    Island bill_length_mm flipper_length_mm
1    Adelie    Biscoe       38.97500          188.7955
2    Gentoo    Biscoe       47.50488          217.1870
3    Adelie     Dream       38.50179          189.7321
4 Chinstrap     Dream       48.83382          195.8235
5    Adelie Torgersen       38.95098          191.1961

20.5 Using with()

R’s with() lets you evaluate an expression of the form with(data, expression). data can be a data.frame, list, or environment, and within the expression you can refer to any element of data directly by its name.

For example, with(df, expression) means you can use the data.frame’s column names directly within the expression without the need to use df[["column_name"]] or df$column_name:

with(penguins,
     aggregate(list(`Bill length` = bill_length_mm),
               by = list(Species = species),
               mean, na.rm = TRUE))

    Species Bill.length
1    Adelie    38.79139
2 Chinstrap    48.83382
3    Gentoo    47.50488
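As a quick check that with() only changes how columns are looked up, the following sketch runs the same aggregation with with() and with $-indexing on a hypothetical toy data.frame (df, group and value are made-up names) and compares the results.

```r
# Toy data (hypothetical values)
df <- data.frame(
  group = c("a", "a", "b", "b"),
  value = c(1, 3, 10, 20)
)

# Columns referenced by name inside with()
res_with <- with(df,
                 aggregate(list(Value = value),
                           by = list(Group = group),
                           FUN = mean))

# The same call written with $-indexing
res_dollar <- aggregate(list(Value = df$value),
                        by = list(Group = df$group),
                        FUN = mean)

# Both approaches produce the same result
identical(res_with, res_dollar)
```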

20.6 See also

tapply() is an alternative method of applying a function to subsets of a single variable (and is probably faster).
For large datasets, data.table is recommended for fast group-by data summarization.
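For comparison, here is a minimal tapply() sketch on a hypothetical toy data.frame (df, group and value are made-up names). Unlike aggregate(), which returns a data.frame, tapply() with a single grouping variable returns a named array.

```r
# Toy data (hypothetical values)
df <- data.frame(
  group = c("a", "a", "b", "b"),
  value = c(1, 3, 10, 20)
)

# Mean of value within each level of group
res_tapply <- tapply(df$value, df$group, mean)

res_tapply           # named array: a = 2, b = 15
names(res_tapply)    # group labels
res_tapply[["b"]]    # extract the summary of one group by name
```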
