24 The Apply Family
Loop functions are some of the most widely used R functions. They replace longer expressions created with a for
loop, for example.
They can result in more compact and readable code.
Function | Description |
---|---|
apply() |
Apply function over array margins (i.e. over one or more dimensions) |
lapply() |
Return a list where each element is the result of applying a function to each element of the input |
sapply() |
Same as lapply() , but returns the simplest possible R object (instead of always returning a list) |
vapply() |
Same as sapply() , but with a pre-specified return type: this is safer and may also be faster |
tapply() |
Apply a function to elements of groups defined by a factor |
mapply() |
Multivariate sapply() : Apply a function using the 1st elements of the inputs vectors, then using the 2nd, 3rd, etc. |
24.1 apply()
apply()
applies a function over one or more dimensions of an array of 2 dimensions or more (this includes matrices) or a data frame:
apply(array, MARGIN, FUN)
MARGIN
can be an integer vector or character indicating the dimensions over which ‘FUN’ will be applied.
By convention, rows come first (just like in indexing), therefore:
-
MARGIN = 1
: apply function on each row -
MARGIN = 2
: apply function on each column
Let’s create an example dataset:
dat <- data.frame(Age = rnorm(50, mean = 42, sd = 8),
Weight = rnorm(50, mean = 80, sd = 10),
Height = rnorm(50, mean = 1.72, sd = 0.14),
SBP = rnorm(50, mean = 134, sd = 4))
head(dat)
Age Weight Height SBP
1 33.53517 79.99227 1.437829 138.6748
2 35.23837 76.06266 1.782321 142.0525
3 46.03977 86.39338 1.516113 140.9807
4 51.51741 87.18997 1.583751 137.9623
5 40.47961 68.31458 1.790967 135.1164
6 35.91377 78.44711 1.734494 137.1234
Let’s calculate the mean value of each column:
dat_column_mean <- apply(dat, MARGIN = 2, FUN = mean)
dat_column_mean
Age Weight Height SBP
41.355183 80.976511 1.697381 134.461052
Hint: It is possibly easiest to think of the “MARGIN” as the dimension you want to keep.
In the above case, we want the mean for each variable, i.e. we want to keep columns and collapse rows.
Purely as an example to understand what apply()
does, here is the equivalent procedure using a for-loop. You notice how much more code is needed, and why apply()
and similar functions might be very convenient for many different tasks.
dat_column_mean <- numeric(ncol(dat))
names(dat_column_mean) <- names(dat)
for (i in seq(dat)) {
dat_column_mean[i] <- mean(dat[, i])
}
dat_column_mean
Age Weight Height SBP
41.355183 80.976511 1.697381 134.461052
Let’s create a different example dataset, where we record weight at multiple timepoints:
dat2 <- data.frame(ID = seq(8001, 8020),
Weight_week_1 = rnorm(20, mean = 110, sd = 10))
dat2$Weight_week_3 <- dat2$Weight_week_1 + rnorm(20, mean = -2, sd = 1)
dat2$Weight_week_5 <- dat2$Weight_week_3 + rnorm(20, mean = -3, sd = 1.1)
dat2$Weight_week_7 <- dat2$Weight_week_5 + rnorm(20, mean = -1.8, sd = 1.3)
dat2
ID Weight_week_1 Weight_week_3 Weight_week_5 Weight_week_7
1 8001 111.04289 106.95935 102.94787 102.26928
2 8002 103.47004 99.86460 97.28305 94.69024
3 8003 107.84784 105.56062 103.45063 99.78788
4 8004 110.51762 106.99228 103.04543 100.48099
5 8005 99.40784 97.95236 94.94075 92.40207
6 8006 84.47913 81.95258 79.39812 76.72131
7 8007 122.33160 118.92493 115.52585 112.26189
8 8008 110.39227 109.09013 107.37802 105.08889
9 8009 120.47485 117.36276 114.55699 112.60647
10 8010 126.07963 123.21005 119.54050 120.28907
11 8011 102.07681 98.88423 97.50637 96.38840
12 8012 105.66671 102.57275 101.96515 99.71209
13 8013 103.17450 101.36019 96.82128 95.65377
14 8014 110.04679 108.47441 105.37404 103.83360
15 8015 107.93642 106.05161 104.30609 99.92998
16 8016 94.20825 91.72154 88.78518 87.30292
17 8017 111.65724 109.86404 107.08026 103.46465
18 8018 125.34182 122.31146 119.47672 116.62459
19 8019 98.89971 97.72247 93.65328 89.70032
20 8020 113.11428 110.18300 107.58005 108.57944
Let’s get the mean weight per week:
apply(dat2[, -1], 2, mean)
Weight_week_1 Weight_week_3 Weight_week_5 Weight_week_7
108.4083 105.8508 103.0308 100.8894
Let’s get the mean weight per individual across all weeks:
apply(dat2[, -1], 1, mean)
[1] 105.80485 98.82698 104.16174 105.25908 96.17575 80.63779 117.26107
[8] 107.98732 116.25027 122.27981 98.71396 102.47918 99.25244 106.93221
[15] 104.55603 90.50447 108.01655 120.93865 94.99395 109.86419
apply()
converts 2-dimensional objects to matrices before applying the function. Therefore, if applied on a data.frame with mixed data types, it will be coerced to a character matrix.
This is explained in the apply()
documentation under “Details”:
“If X is not an array but an object of a class with a non-null dim value (such as a data frame), apply attempts to coerce it to an array via as.matrix if it is two-dimensional (e.g., a data frame) or via as.array.”
Because of the above, see what happens when you use apply on the iris
data.frame which contains 4 numeric variables and one factor:
str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
apply(iris, 2, class)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
"character" "character" "character" "character" "character"
24.2 lapply()
lapply()
applies a function on each element of its input and returns a list of the outputs.
Note: The ‘elements’ of a data frame are its columns (remember, a data frame is a list with equal-length elements). The ‘elements’ of a matrix are each cell one by one, by column. Therefore, unlike apply()
, lapply()
has a very different effect on a data frame and a matrix. lapply()
is commonly used to iterate over the columns of a data frame.
lapply()
is the only function of the *apply()
family that always returns a list.
dat_median <- lapply(dat, median)
dat_median
$Age
[1] 40.8132
$Weight
[1] 81.48206
$Height
[1] 1.697626
$SBP
[1] 135.3969
To understand what lapply()
does, here is the equivalent for-loop:
24.3 sapply()
sapply()
is an alias for lapply()
, followed by a call to simplify2array()
.
(Check the source code for sapply()
by typing sapply
at the console).
dat_median <- sapply(dat, median)
dat_median
Age Weight Height SBP
40.813197 81.482061 1.697626 135.396861
dat_summary <- data.frame(Mean = sapply(dat, mean),
SD = sapply(dat, sd))
dat_summary
Mean SD
Age 41.355183 9.214191
Weight 80.976511 9.219459
Height 1.697381 0.132754
SBP 134.461052 4.460144
24.3.1 Example: Get index of numeric variables
Let’s use sapply()
to get an index of numeric columns in dat2:
head(dat2)
ID Weight_week_1 Weight_week_3 Weight_week_5 Weight_week_7
1 8001 111.04289 106.95935 102.94787 102.26928
2 8002 103.47004 99.86460 97.28305 94.69024
3 8003 107.84784 105.56062 103.45063 99.78788
4 8004 110.51762 106.99228 103.04543 100.48099
5 8005 99.40784 97.95236 94.94075 92.40207
6 8006 84.47913 81.95258 79.39812 76.72131
logical index of numeric columns:
numidl <- sapply(dat2, is.numeric)
numidl
ID Weight_week_1 Weight_week_3 Weight_week_5 Weight_week_7
TRUE TRUE TRUE TRUE TRUE
integer index of numeric columns:
24.4 vapply()
Much less commonly used (possibly underused) than lapply()
or sapply()
, vapply()
allows you to specify what the expected output looks like - for example a numeric vector of length 2, a character vector of length 1.
This can have two advantages:
- It is safer against errors
- It will sometimes be a little faster
You add the argument FUN.VALUE
which must be of the correct type and length of the expected result of each iteration.
vapply(dat, median, FUN.VALUE = 0.0)
Age Weight Height SBP
40.813197 81.482061 1.697626 135.396861
Here, each iteration returns the median of each column, i.e. a numeric vector of length 1.
Therefore FUN.VALUE
can be any numeric scalar.
For example, if we instead returned the range of each column, FUN.VALUE
should be a numeric vector of length 2:
Age Weight Height SBP
[1,] 21.36434 59.10752 1.430183 121.0585
[2,] 66.19577 98.14315 1.982392 142.0525
If FUN.VALUE
does not match the returned value, we get an informative error:
vapply(dat, range, FUN.VALUE = 0.0)
Error in vapply(dat, range, FUN.VALUE = 0): values must be length 1,
but FUN(X[[1]]) result is length 2
24.5 tapply()
tapply()
is one way (of many) to apply a function on subgroups of data as defined by one or more factors.
Age Weight Height SBP Group
1 33.53517 79.99227 1.437829 138.6748 C
2 35.23837 76.06266 1.782321 142.0525 B
3 46.03977 86.39338 1.516113 140.9807 A
4 51.51741 87.18997 1.583751 137.9623 B
5 40.47961 68.31458 1.790967 135.1164 C
6 35.91377 78.44711 1.734494 137.1234 A
mean_Age_by_Group <- tapply(dat[["Age"]], dat["Group"], mean)
mean_Age_by_Group
Group
A B C
39.95888 43.84275 39.67471
The for-loop equivalent of the above is:
24.6 mapply()
The functions we have looked at so far work well when you iterating over elements of a single object.
mapply()
allows you to execute a function that accepts two or more inputs, say fn(x, z)
using the i-th element of each input, and will return:fn(x[1], z[1])
, fn(x[2], z[2])
, …, fn(x[n], z[n])
Let’s create a simple function that accepts two numeric arguments, and two vectors length 5 each:
raise <- function(x, power) x^power
x <- 2:6
p <- 6:2
Use mapply to raise each x
to the corresponding p
:
out <- mapply(raise, x, p)
out
[1] 64 243 256 125 36
The above is equivalent to:
24.7 *apply()
ing on matrices vs. data frames
To consolidate some of what was learned above, let’s focus on the difference between working on a matrix vs. a data frame.
First, let’s create a matrix and a data frame with the same data:
Feature_1 Feature_2 Feature_3 Feature_4 Feature_5
[1,] 21 31 41 51 61
[2,] 22 32 42 52 62
[3,] 23 33 43 53 63
[4,] 24 34 44 54 64
[5,] 25 35 45 55 65
[6,] 26 36 46 56 66
[7,] 27 37 47 57 67
[8,] 28 38 48 58 68
[9,] 29 39 49 59 69
[10,] 30 40 50 60 70
adf <- as.data.frame(amat)
adf
Feature_1 Feature_2 Feature_3 Feature_4 Feature_5
1 21 31 41 51 61
2 22 32 42 52 62
3 23 33 43 53 63
4 24 34 44 54 64
5 25 35 45 55 65
6 26 36 46 56 66
7 27 37 47 57 67
8 28 38 48 58 68
9 29 39 49 59 69
10 30 40 50 60 70
We’ve seen that with apply()
we specify the dimension to operate on and it works the same way on both matrices and data frames:
apply(amat, 2, mean)
Feature_1 Feature_2 Feature_3 Feature_4 Feature_5
25.5 35.5 45.5 55.5 65.5
apply(adf, 2, mean)
Feature_1 Feature_2 Feature_3 Feature_4 Feature_5
25.5 35.5 45.5 55.5 65.5
However, sapply()
(and lapply()
, vapply()
) acts on each element of the object, therefore it is not meaningful to pass a matrix to it:
sapply(amat, mean)
[1] 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
[26] 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70
The above returns the mean of each element, i.e. the element itself, which is meaningless.
Since a data frame is a list, and its columns are its elements, it works great for column operations on data frames:
sapply(adf, mean)
Feature_1 Feature_2 Feature_3 Feature_4 Feature_5
25.5 35.5 45.5 55.5 65.5
If you want to use sapply()
on a matrix, you could iterate over an integer sequence as shown in the previous section:
This is shown to help emphasize the differences between the function and the data structures. In practice, you would use apply()
on a matrix.
24.8 Anonymous functions
Anonymous functions are just like regular functions but they are not assigned to an object - i.e. they are not “named”.
They are usually passed as arguments to other functions to be used once, hence no need to assign them.
Anonymous functions are often used with the apply family of functions.
Example of a simple regular function:
squared <- function(x) {
x^2
}
Since this is a short function definition, it can also be written in a single line:
squared <- function(x) x^2
An anonymous function definition is just like a regular function - minus it is not assigned:
function(x) x^2
Since R version 4.1 (May 2021), a compact anonymous function syntax is available, where a single back slash replaces function
:
\(x) x^2
Let’s use the squared()
function within sapply()
to square the first four columns of the iris dataset. In these examples, we often wrap functions around head()
which prints the first few lines of an object to avoid:
head(dat[, 1:4])
Age Weight Height SBP
1 33.53517 79.99227 1.437829 138.6748
2 35.23837 76.06266 1.782321 142.0525
3 46.03977 86.39338 1.516113 140.9807
4 51.51741 87.18997 1.583751 137.9623
5 40.47961 68.31458 1.790967 135.1164
6 35.91377 78.44711 1.734494 137.1234
Age Weight Height SBP
[1,] 1124.608 6398.763 2.067352 19230.71
[2,] 1241.743 5785.528 3.176667 20178.91
[3,] 2119.660 7463.815 2.298598 19875.57
[4,] 2654.043 7602.091 2.508268 19033.58
[5,] 1638.599 4666.882 3.207564 18256.44
[6,] 1289.799 6153.950 3.008468 18802.83
Let’s do the same as above, but this time using an anonymous function:
Age Weight Height SBP
[1,] 1124.608 6398.763 2.067352 19230.71
[2,] 1241.743 5785.528 3.176667 20178.91
[3,] 2119.660 7463.815 2.298598 19875.57
[4,] 2654.043 7602.091 2.508268 19033.58
[5,] 1638.599 4666.882 3.207564 18256.44
[6,] 1289.799 6153.950 3.008468 18802.83
The entire anonymous function definition is passed to the FUN
argument.
24.9 Iterating over a sequence instead of an object
With lapply()
, sapply()
and vapply()
there is a very simple trick that may often come in handy:
Instead of iterating over elements of an object, you can iterate over an integer index of whichever elements you want to access and use it accordingly within the anonymous function.
This alternative approach is much closer to how we would use an integer sequence in a for
loop.
It will be clearer through an example, where we get the mean of the first four columns of iris:
Warning in mean.default(i): argument is not numeric or logical: returning NA
Age Weight Height SBP Group
41.355183 80.976511 1.697381 134.461052 NA
[1] 41.355183 80.976511 1.697381 134.461052
# equivalent to:
for (i in 1:4) {
mean(dat[, i])
}
Notice that in this approach, since you are not passing the object (dat, in the above example) as the input to lapply()
, it needs to be accessed within the anonymous function.