R includes a number of commands to apply functions on splits of your data. aggregate() is a powerful tool to perform such “group-by” operations.
The function accepts either:
a formula as the first argument and a data.frame passed to the data argument, or
an R object (vector, data.frame, list) as the first argument and a list of one or more grouping factors passed to the by argument
We shall see how to perform each operation below with each approach.
The formula interface might be easier to work with interactively on the console. Note that while you can programmatically create a formula, it is easier to use vector inputs when calling aggregate() programmatically.
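As a minimal sketch of the two calling styles, the block below uses a small made-up data frame (toy, a hypothetical stand-in for the penguins columns, not the real data). It also shows one way a formula could be assembled from character strings with reformulate(), for comparison with simply passing the columns directly. The variable names value_var and group_var are illustrative assumptions.

```r
# Hypothetical toy data standing in for two penguins columns
toy <- data.frame(
  species = factor(c("Adelie", "Adelie", "Gentoo", "Gentoo")),
  bill_length_mm = c(39, 41, 46, 50)
)

# 1. Formula interface: response ~ grouping variable, plus a data argument
by_formula <- aggregate(bill_length_mm ~ species, data = toy, FUN = mean)

# 2. Object interface: the values first, the grouping passed as a list to `by`
#    (a one-column data.frame is a list, so toy["species"] is accepted)
by_vector <- aggregate(toy["bill_length_mm"], by = toy["species"], FUN = mean)

# Programmatic use: column names held in character variables (assumed names)
value_var <- "bill_length_mm"
group_var <- "species"

# Building the formula from strings needs an extra step ...
f <- reformulate(group_var, response = value_var)
by_built <- aggregate(f, data = toy, FUN = mean)

# ... whereas the object interface can index with the strings directly
by_indexed <- aggregate(toy[value_var], by = toy[group_var], FUN = mean)

print(by_formula)
```

All four calls should return the same group means here; the difference is only in how the inputs are specified.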
For this example, we shall use the penguins data from the palmerpenguins package. Its structure, as reported by str(penguins), is:
tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
$ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
$ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
$ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
$ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
$ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
$ body_mass_g : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
$ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
$ year : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
The examples below aggregate one or more variables by one or more groups, using either the formula interface or working directly on objects with $-indexing or with():
20.1 Single variable by single grouping
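Note that the formula method defaults to na.action = na.omit, so rows with a missing value in any variable named in the formula are dropped before the function is applied.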
Using the formula interface:
aggregate(bill_length_mm ~ species, data = penguins, mean, na.rm = TRUE)
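To illustrate what the na.action = na.omit default means in practice, here is a small sketch on made-up data (toy is a hypothetical data frame with one missing bill length, not the real penguins data). It compares the formula method with the object interface, with and without na.rm = TRUE.

```r
# Hypothetical toy data with one missing value in the Adelie group
toy <- data.frame(
  species = factor(c("Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo")),
  bill_length_mm = c(39, 41, NA, 46, 50)
)

# Formula method: the row containing NA is dropped before mean() is called,
# so no na.rm argument is needed to get a non-missing result
formula_result <- aggregate(bill_length_mm ~ species, data = toy, FUN = mean)

# Object interface: the NA is passed on to mean(), so the Adelie mean is NA
object_result <- aggregate(toy["bill_length_mm"], by = toy["species"],
                           FUN = mean)

# Object interface with na.rm = TRUE forwarded to mean() through `...`
object_narm <- aggregate(toy["bill_length_mm"], by = toy["species"],
                         FUN = mean, na.rm = TRUE)

print(formula_result)
print(object_result)
print(object_narm)
```

With a single variable and mean(), dropping incomplete rows and passing na.rm = TRUE give the same answer. The two can differ when several variables are aggregated at once, because na.omit removes the whole row if any variable in the formula is missing.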
R’s with() lets you evaluate an expression of the form with(data, expression). data can be a data.frame, list, or environment, and within the expression you can refer to any element of data directly by name.
For example, with(df, expression) means you can use the data.frame’s column names directly within the expression without the need to use df[["column_name"]] or df$column_name:
with(penguins, aggregate(list(`Bill length` = bill_length_mm), by = list(Species = species), mean, na.rm = TRUE))
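For comparison, the same kind of call can be written with $-indexing instead of with(). The sketch below uses a hypothetical toy data frame rather than the real penguins data, and checks that both spellings give the same result.

```r
# Hypothetical toy data standing in for two penguins columns
toy <- data.frame(
  species = factor(c("Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo")),
  bill_length_mm = c(39, 41, NA, 46, 50)
)

# with(): column names are looked up inside `toy`
via_with <- with(
  toy,
  aggregate(list(`Bill length` = bill_length_mm),
            by = list(Species = species),
            FUN = mean, na.rm = TRUE)
)

# $-indexing: each column is extracted explicitly from `toy`
via_dollar <- aggregate(list(`Bill length` = toy$bill_length_mm),
                        by = list(Species = toy$species),
                        FUN = mean, na.rm = TRUE)

print(via_with)
print(all.equal(via_with, via_dollar))
```

The names given inside list() become the column names of the result. Because the list is converted to a data.frame internally, a name containing a space, such as `Bill length`, is expected to be printed in a syntactically valid form (Bill.length).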