24  The Apply Family

Loop functions are some of the most widely used R functions. They replace longer expressions that would otherwise be written with, for example, a for loop.
They can result in more compact and readable code.

Function   Description
apply()    Apply function over array margins (i.e. over one or more dimensions)
lapply()   Return a list where each element is the result of applying a function to each element of the input
sapply()   Same as lapply(), but returns the simplest possible R object (instead of always returning a list)
vapply()   Same as sapply(), but with a pre-specified return type: this is safer and may also be faster
tapply()   Apply a function to elements of groups defined by a factor
mapply()   Multivariate sapply(): Apply a function using the 1st elements of the input vectors, then using the 2nd, 3rd, etc.
Figure 24.1: *apply() function family summary (Best to read through this chapter first and then refer back to this figure)

24.1 apply()

Tip

apply() applies a function over one or more dimensions of an array of 2 dimensions or more (this includes matrices) or a data frame:

apply(array, MARGIN, FUN)

MARGIN can be an integer vector, or a character vector of dimension names, indicating the dimensions over which ‘FUN’ will be applied.

By convention, rows come first (just like in indexing), therefore:

  • MARGIN = 1: apply function on each row
  • MARGIN = 2: apply function on each column
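
As a minimal sketch of the MARGIN argument, the following uses a small made-up matrix and array (the objects m and arr and the dimension names "row" and "col" are illustrative assumptions, not part of the datasets used in this chapter). It shows an integer MARGIN, a character MARGIN, and an integer vector MARGIN on a 3-dimensional array:

```r
# Small matrix with named dimnames (the names "row" and "col" are illustrative)
m <- matrix(1:6, nrow = 2,
            dimnames = list(row = c("a", "b"), col = c("x", "y", "z")))

row_sums <- apply(m, 1, sum)      # MARGIN = 1: one result per row
col_sums <- apply(m, "col", sum)  # character MARGIN: the dimension named "col"

# 3-dimensional array: keep dimensions 1 and 2, collapse dimension 3
arr <- array(1:24, dim = c(2, 3, 4))
slab_sums <- apply(arr, c(1, 2), sum)
dim(slab_sums)                    # 2 3
```

Note that a character MARGIN only works when the input has named dimnames, as m does here.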

Let’s create an example dataset:

dat <- data.frame(Age = rnorm(50, mean = 42, sd = 8),
                  Weight = rnorm(50, mean = 80, sd = 10),
                  Height = rnorm(50, mean = 1.72, sd = 0.14),
                  SBP = rnorm(50, mean = 134, sd = 4))
head(dat)
       Age   Weight   Height      SBP
1 48.16460 88.55235 1.582764 133.7574
2 48.44521 76.57719 1.572241 134.1560
3 33.93637 93.27066 1.671029 137.0709
4 40.44267 93.60998 1.843725 133.3902
5 45.33717 94.70261 1.748060 131.2815
6 39.51068 91.54904 1.623145 131.7906

Let’s calculate the mean value of each column:

dat_column_mean <- apply(dat, MARGIN = 2, FUN = mean) 
dat_column_mean
       Age     Weight     Height        SBP 
 41.889752  83.543067   1.704655 134.091223 
Tip

Hint: It is possibly easiest to think of the “MARGIN” as the dimension you want to keep.
In the above case, we want the mean for each variable, i.e. we want to keep columns and collapse rows.

Purely as an example to understand what apply() does, here is the equivalent procedure using a for-loop. Notice how much more code is needed, and why apply() and similar functions might be very convenient for many different tasks.

dat_column_mean <- numeric(ncol(dat))
names(dat_column_mean) <- names(dat)

for (i in seq_along(dat)) {
  dat_column_mean[i] <- mean(dat[, i])
}
dat_column_mean
       Age     Weight     Height        SBP 
 41.889752  83.543067   1.704655 134.091223 

Let’s create a different example dataset, where we record weight at multiple timepoints:

dat2 <- data.frame(ID = seq(8001, 8020),
                   Weight_week_1 = rnorm(20, mean = 110, sd = 10))
dat2$Weight_week_3 <- dat2$Weight_week_1 + rnorm(20, mean = -2, sd = 1)
dat2$Weight_week_5 <- dat2$Weight_week_3 + rnorm(20, mean = -3, sd = 1.1)
dat2$Weight_week_7 <- dat2$Weight_week_5 + rnorm(20, mean = -1.8, sd = 1.3)
dat2
     ID Weight_week_1 Weight_week_3 Weight_week_5 Weight_week_7
1  8001     105.82478     103.37164     100.71798      99.47703
2  8002     105.48415     100.93332      97.23739      94.26977
3  8003     120.95794     116.98895     114.23984     114.08036
4  8004     114.05294     111.62552     108.93863     107.20867
5  8005      85.67305      82.95501      81.02380      81.03016
6  8006     112.02879     108.79129     103.79578     102.33458
7  8007      99.73759      97.92762      95.96941      92.69766
8  8008     107.44467     104.79624     100.90447     102.27286
9  8009     100.00119      98.00724      95.00727      92.86830
10 8010      97.04331      96.26564      94.54846      93.35433
11 8011     105.20776     102.98488      98.31741      93.60206
12 8012     105.52480     104.96172     103.10312     102.21607
13 8013     116.46133     115.41325     113.64038     110.12251
14 8014     108.69661     108.03059     106.19866     105.26812
15 8015     117.66636     114.88695     111.69464     110.95197
16 8016     109.84529     109.30767     106.00337     102.95822
17 8017     113.64041     111.78063     107.56287     105.95331
18 8018     103.61945     101.57644      96.61125      93.42433
19 8019     126.22185     124.74886     121.80256     119.24130
20 8020      91.46144      91.25641      86.22691      83.97778

Let’s get the mean weight per week:

apply(dat2[, -1], 2, mean)
Weight_week_1 Weight_week_3 Weight_week_5 Weight_week_7 
     107.3297      105.3305      102.1772      100.3655 

Let’s get the mean weight per individual across all weeks:

apply(dat2[, -1], 1, mean)
 [1] 102.34786  99.48116 116.56677 110.45644  82.67050 106.73761  96.58307
 [8] 103.85456  96.47100  95.30294 100.02803 103.95143 113.90937 107.04850
[15] 113.79998 107.02864 109.73431  98.80787 123.00364  88.23063
Caution

apply() converts 2-dimensional objects to matrices before applying the function. Therefore, if applied on a data.frame with mixed data types, it will be coerced to a character matrix.

This is explained in the apply() documentation under “Details”:

“If X is not an array but an object of a class with a non-null dim value (such as a data frame), apply attempts to coerce it to an array via as.matrix if it is two-dimensional (e.g., a data frame) or via as.array.”

Because of the above, see what happens when you use apply on the iris data.frame which contains 4 numeric variables and one factor:

str(iris)
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
apply(iris, 2, class)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
 "character"  "character"  "character"  "character"  "character" 

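Following on from the caution above, one possible way to avoid the coercion to character is to pass only the numeric columns to apply(). This is a sketch that assumes the built-in iris dataset, whose first four columns are numeric (the names iris_num, col_classes and col_means are illustrative):

```r
# Keep only the four numeric columns, so that as.matrix() yields a numeric matrix
iris_num <- iris[, 1:4]

col_classes <- apply(iris_num, 2, class)  # each column is now seen as "numeric"
col_means <- apply(iris_num, 2, mean)     # column means are computed as expected
col_classes
col_means
```
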
24.2 lapply()

Tip

lapply() applies a function on each element of its input and returns a list of the outputs.

Note: The ‘elements’ of a data frame are its columns (remember, a data frame is a list with equal-length elements). The ‘elements’ of a matrix are each cell one by one, by column. Therefore, unlike apply(), lapply() has a very different effect on a data frame and a matrix. lapply() is commonly used to iterate over the columns of a data frame.

Tip

lapply() is the only function of the *apply() family that always returns a list.

dat_median <- lapply(dat, median)
dat_median
$Age
[1] 40.45594

$Weight
[1] 83.04049

$Height
[1] 1.67579

$SBP
[1] 133.771

To understand what lapply() does, here is the equivalent for-loop:

dat_median <- vector("list", length = 4)
names(dat_median) <- colnames(dat)
for (i in 1:4) {
  dat_median[[i]] <- median(dat[, i])
}
dat_median
$Age
[1] 40.45594

$Weight
[1] 83.04049

$Height
[1] 1.67579

$SBP
[1] 133.771
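
Because lapply() always returns a list, it can hold results of different lengths or types. The following sketch uses a small made-up data frame (toy and its columns are illustrative assumptions) where the function returns a different number of values for each column:

```r
# Hypothetical data frame with one numeric and one character column
toy <- data.frame(x = c(1, 2, 3, 4), g = c("a", "b", "a", "c"))

# unique() returns 4 values for x but only 3 for g: a list can hold both
uniq <- lapply(toy, unique)
is.list(uniq)
lengths(uniq)
```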

24.3 sapply()

sapply() is a wrapper around lapply() which, by default, passes the result to simplify2array().
(Check the source code for sapply() by typing sapply at the console).

Note

Unlike lapply(), the type of the output of sapply() varies when the argument simplify is set to TRUE, which is the default:
It is the simplest R object that can hold the data type/s resulting from the operations, i.e. a vector, matrix, or list.

dat_median <- sapply(dat, median)
dat_median
      Age    Weight    Height       SBP 
 40.45594  83.04049   1.67579 133.77104 
dat_summary <- data.frame(Mean = sapply(dat, mean),
                           SD = sapply(dat, sd))
dat_summary
             Mean        SD
Age     41.889752 8.0172308
Weight  83.543067 9.1753281
Height   1.704655 0.1356778
SBP    134.091223 4.3224136
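
To illustrate how the shape of sapply()'s output depends on what each iteration returns, here is a minimal sketch using made-up inputs (x and n are illustrative assumptions):

```r
x <- list(a = 1:3, b = 4:6)

v <- sapply(x, mean)    # each result has length 1 -> named numeric vector
m <- sapply(x, range)   # each result has length 2 -> matrix with 2 rows

# Results of unequal length cannot be simplified, so a list is returned
n <- c(a = 1, b = 3)
l <- sapply(n, seq_len)

class(v)
dim(m)
class(l)
```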

24.3.1 Example: Get index of numeric variables

Let’s use sapply() to get an index of numeric columns in dat2:

head(dat2)
    ID Weight_week_1 Weight_week_3 Weight_week_5 Weight_week_7
1 8001     105.82478     103.37164     100.71798      99.47703
2 8002     105.48415     100.93332      97.23739      94.26977
3 8003     120.95794     116.98895     114.23984     114.08036
4 8004     114.05294     111.62552     108.93863     107.20867
5 8005      85.67305      82.95501      81.02380      81.03016
6 8006     112.02879     108.79129     103.79578     102.33458

Logical index of numeric columns:

numidl <- sapply(dat2, is.numeric)
numidl
           ID Weight_week_1 Weight_week_3 Weight_week_5 Weight_week_7 
         TRUE          TRUE          TRUE          TRUE          TRUE 

Integer index of numeric columns:

numidi <- which(sapply(dat2, is.numeric))
numidi
           ID Weight_week_1 Weight_week_3 Weight_week_5 Weight_week_7 
            1             2             3             4             5 
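
Since all columns of dat2 are numeric, the index above selects every column. The following sketch uses a hypothetical mixed-type data frame (mixed and its columns are made up for illustration) to show how such an index can be used to subset the numeric columns only:

```r
# Hypothetical data frame with character, numeric, and factor columns
mixed <- data.frame(ID = c("a", "b", "c"),
                    Age = c(30, 41, 52),
                    Sex = factor(c("F", "M", "F")),
                    Weight = c(60.5, 82.1, 71.3))

num_idl <- sapply(mixed, is.numeric)  # logical index
num_idi <- which(num_idl)             # integer index
mixed_num <- mixed[, num_idl]         # keep numeric columns only
names(mixed_num)
```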

24.4 vapply()

vapply() is much less commonly used than lapply() or sapply() (and is possibly underused). It allows you to specify what the expected output of each iteration looks like, for example a numeric vector of length 2 or a character vector of length 1.

This can have two advantages:

  • It is safer against errors
  • It will sometimes be a little faster

You add the argument FUN.VALUE which must be of the correct type and length of the expected result of each iteration.

vapply(dat, median, FUN.VALUE = 0.0)
      Age    Weight    Height       SBP 
 40.45594  83.04049   1.67579 133.77104 

Here, each iteration returns the median of each column, i.e. a numeric vector of length 1.

Therefore FUN.VALUE can be any numeric scalar.

For example, if we instead returned the range of each column, FUN.VALUE should be a numeric vector of length 2:

vapply(dat, range, FUN.VALUE = rep(0.0, 2))
          Age    Weight   Height      SBP
[1,] 24.25842  66.66278 1.399964 124.8212
[2,] 63.62827 105.93721 2.081870 146.2187

If FUN.VALUE does not match the returned value, we get an informative error:

vapply(dat, range, FUN.VALUE = 0.0)
Error in vapply(dat, range, FUN.VALUE = 0): values must be length 1,
 but FUN(X[[1]]) result is length 2
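
One concrete sense in which vapply() is safer shows up with empty input. In the sketch below (the object names are illustrative), sapply() falls back to an empty list, whereas vapply() still returns a vector of the declared type:

```r
empty <- list()

s_out <- sapply(empty, length)                          # returns list()
v_out <- vapply(empty, length, FUN.VALUE = integer(1))  # returns integer(0)

class(s_out)
class(v_out)
```

Code that later assumes a vector, e.g. in arithmetic or comparisons, therefore behaves more predictably with vapply().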

24.5 tapply()

tapply() is one way (of many) to apply a function on subgroups of data as defined by one or more factors.
In the following example, we add a grouping factor to dat and calculate the mean Age by Group:

dat$Group <- factor(sample(c("A", "B", "C"), size = 50, replace = TRUE))
head(dat)
       Age   Weight   Height      SBP Group
1 48.16460 88.55235 1.582764 133.7574     A
2 48.44521 76.57719 1.572241 134.1560     C
3 33.93637 93.27066 1.671029 137.0709     B
4 40.44267 93.60998 1.843725 133.3902     B
5 45.33717 94.70261 1.748060 131.2815     B
6 39.51068 91.54904 1.623145 131.7906     A
mean_Age_by_Group <- tapply(dat[["Age"]], dat["Group"], mean)
mean_Age_by_Group
Group
       A        B        C 
41.52139 45.20608 38.57328 

The for-loop equivalent of the above is:

groups <- levels(dat$Group)
mean_Age_by_Group <- vector("numeric", length = length(groups))
names(mean_Age_by_Group) <- groups

for (i in seq_along(groups)) {
  mean_Age_by_Group[i] <- 
    mean(dat$Age[dat$Group == groups[i]])
}
mean_Age_by_Group
       A        B        C 
41.52139 45.20608 38.57328 
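
tapply() also accepts more than one factor, passed as a list. As a minimal sketch, using a small made-up data frame (d and its columns are illustrative assumptions), the result is then a matrix with one cell per combination of factor levels:

```r
# Hypothetical data with two grouping factors
d <- data.frame(Age = c(30, 40, 50, 60, 35, 45),
                Group = factor(c("A", "A", "B", "B", "A", "B")),
                Sex = factor(c("F", "M", "F", "M", "F", "M")))

# Rows correspond to levels of Group, columns to levels of Sex
mean_age <- tapply(d$Age, list(d$Group, d$Sex), mean)
mean_age
```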

24.6 mapply()

The functions we have looked at so far work well when you are iterating over elements of a single object.

mapply() allows you to execute a function that accepts two or more inputs, say fn(x, z) using the i-th element of each input, and will return:
fn(x[1], z[1]), fn(x[2], z[2]), …, fn(x[n], z[n])

Let’s create a simple function that accepts two numeric arguments, and two vectors of length 5 each:

raise <- function(x, power) x^power
x <- 2:6
p <- 6:2

Use mapply to raise each x to the corresponding p:

out <- mapply(raise, x, p)
out
[1]  64 243 256 125  36

The above is equivalent to:

out <- vector("numeric", length = 5)
for (i in seq(5)) {
  out[i] <- raise(x[i], p[i])
}
out
[1]  64 243 256 125  36
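
Inputs to mapply() can also be passed by name, and arguments that should stay the same in every call can be supplied through MoreArgs. The following is a sketch with a made-up function (scale_shift is a hypothetical helper, not part of R):

```r
# Hypothetical function of three arguments
scale_shift <- function(x, mult, offset) x * mult + offset

# x and mult are iterated over in parallel; offset is fixed for every call
out <- mapply(scale_shift,
              x = 1:3,
              mult = c(10, 20, 30),
              MoreArgs = list(offset = 1))
out   # 11 41 91
```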

24.7 *apply()ing on matrices vs. data frames

To consolidate some of what was learned above, let’s focus on the difference between working on a matrix vs. a data frame.
First, let’s create a matrix and a data frame with the same data:

amat <- matrix(21:70, nrow = 10)
colnames(amat) <- paste0("Feature_", 1:ncol(amat))
amat
      Feature_1 Feature_2 Feature_3 Feature_4 Feature_5
 [1,]        21        31        41        51        61
 [2,]        22        32        42        52        62
 [3,]        23        33        43        53        63
 [4,]        24        34        44        54        64
 [5,]        25        35        45        55        65
 [6,]        26        36        46        56        66
 [7,]        27        37        47        57        67
 [8,]        28        38        48        58        68
 [9,]        29        39        49        59        69
[10,]        30        40        50        60        70
adf <- as.data.frame(amat)
adf
   Feature_1 Feature_2 Feature_3 Feature_4 Feature_5
1         21        31        41        51        61
2         22        32        42        52        62
3         23        33        43        53        63
4         24        34        44        54        64
5         25        35        45        55        65
6         26        36        46        56        66
7         27        37        47        57        67
8         28        38        48        58        68
9         29        39        49        59        69
10        30        40        50        60        70

We’ve seen that with apply() we specify the dimension to operate on and it works the same way on both matrices and data frames:

apply(amat, 2, mean)
Feature_1 Feature_2 Feature_3 Feature_4 Feature_5 
     25.5      35.5      45.5      55.5      65.5 
apply(adf, 2, mean)
Feature_1 Feature_2 Feature_3 Feature_4 Feature_5 
     25.5      35.5      45.5      55.5      65.5 

However, sapply() (and lapply(), vapply()) acts on each element of the object, therefore it is not meaningful to pass a matrix to it:

sapply(amat, mean)
 [1] 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
[26] 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70

The above returns the mean of each element, i.e. the element itself, which is meaningless.

Since a data frame is a list, and its columns are its elements, it works great for column operations on data frames:

sapply(adf, mean)
Feature_1 Feature_2 Feature_3 Feature_4 Feature_5 
     25.5      35.5      45.5      55.5      65.5 

If you want to use sapply() on a matrix, you could iterate over an integer sequence (see Section 24.9):

sapply(1:ncol(amat), function(i) mean(amat[, i]))
[1] 25.5 35.5 45.5 55.5 65.5

This is shown to help emphasize the differences between the functions and the data structures. In practice, you would use apply() on a matrix.

24.8 Anonymous functions

Anonymous functions are just like regular functions but they are not assigned to an object - i.e. they are not “named”.
They are usually passed as arguments to other functions to be used once, hence no need to assign them.

Anonymous functions are often used with the apply family of functions.

Example of a simple regular function:

squared <- function(x) {
  x^2
}

Since this is a short function definition, it can also be written in a single line:

squared <- function(x) x^2

An anonymous function definition is just like a regular function definition, except that it is not assigned:

function(x) x^2

Since R version 4.1 (May 2021), a compact anonymous function syntax is available, where a single backslash replaces function:

\(x) x^2

Let’s use the squared() function within sapply() to square the first four columns of dat. In these examples, we often wrap the output in head(), which prints the first few rows of an object, to avoid printing long outputs:

head(dat[, 1:4])
       Age   Weight   Height      SBP
1 48.16460 88.55235 1.582764 133.7574
2 48.44521 76.57719 1.572241 134.1560
3 33.93637 93.27066 1.671029 137.0709
4 40.44267 93.60998 1.843725 133.3902
5 45.33717 94.70261 1.748060 131.2815
6 39.51068 91.54904 1.623145 131.7906
dat_sq <- sapply(dat[, 1:4], squared)
head(dat_sq)
          Age   Weight   Height      SBP
[1,] 2319.828 7841.518 2.505141 17891.04
[2,] 2346.938 5864.066 2.471943 17997.84
[3,] 1151.677 8699.417 2.792338 18788.42
[4,] 1635.610 8762.828 3.399320 17792.94
[5,] 2055.459 8968.584 3.055714 17234.84
[6,] 1561.094 8381.226 2.634601 17368.77

Let’s do the same as above, but this time using an anonymous function:

dat_sqtoo <- sapply(dat[, 1:4], function(x) x^2)
head(dat_sqtoo)
          Age   Weight   Height      SBP
[1,] 2319.828 7841.518 2.505141 17891.04
[2,] 2346.938 5864.066 2.471943 17997.84
[3,] 1151.677 8699.417 2.792338 18788.42
[4,] 1635.610 8762.828 3.399320 17792.94
[5,] 2055.459 8968.584 3.055714 17234.84
[6,] 1561.094 8381.226 2.634601 17368.77

The entire anonymous function definition is passed to the FUN argument.
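
The compact backslash syntax can be used in the same way. This minimal sketch assumes R version 4.1 or later (it will not parse on earlier versions) and uses a simple integer sequence rather than the dat data frame:

```r
# Same operation written with both anonymous function syntaxes
sq_long <- sapply(1:5, function(x) x^2)
sq_short <- sapply(1:5, \(x) x^2)

identical(sq_long, sq_short)   # TRUE
sq_short                       # 1 4 9 16 25
```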

24.9 Iterating over a sequence instead of an object

With lapply(), sapply() and vapply() there is a very simple trick that may often come in handy:

Instead of iterating over elements of an object, you can iterate over an integer index of whichever elements you want to access and use it accordingly within the anonymous function.

This alternative approach is much closer to how we would use an integer sequence in a for loop.

It will be clearer through an example, where we get the mean of the first four columns of dat:

# original way: iterate through elements i.e. columns:
sapply(dat, function(i) mean(i))
Warning in mean.default(i): argument is not numeric or logical: returning NA
       Age     Weight     Height        SBP      Group 
 41.889752  83.543067   1.704655 134.091223         NA 
# alternative way: iterate over integer index of elements:
sapply(1:4, function(i) mean(dat[, i]))
[1]  41.889752  83.543067   1.704655 134.091223
# equivalent to the following for loop, which stores each mean in `out`:
out <- numeric(4)
for (i in 1:4) {
  out[i] <- mean(dat[, i])
}

Notice that in this approach, since you are not passing the object (dat, in the above example) as the input to sapply() (or lapply(), vapply()), it needs to be accessed within the anonymous function.
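
One situation where iterating over an index may be convenient is when both the name and the value of each element are needed. The following sketch uses a made-up list (scores is an illustrative assumption) and seq_along() to generate the index:

```r
# Hypothetical named list with elements of different lengths
scores <- list(math = c(90, 80), art = c(70, 75, 80))

# The index i gives access to both the element's name and its value
labels <- sapply(seq_along(scores), function(i) {
  paste0(names(scores)[i], ": ", mean(scores[[i]]))
})
labels   # "math: 85" "art: 75"
```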