Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
The dplyr package offers extensive functionality for data manipulation.
dplyr’s functions are named after verbs (like filter
, select
, etc.) and are commonly used with the pipe operator to build pipelines. The package offers a large number of functions in total, often with multiple versions of the same “verb”. It has undergone many major changes since its introduction, so always make sure to consult the latest documentation.
Core operations include:
filter()
: identify casesselect()
: identify columnsmutate()
: create new variables as a function of existing onesarrange()
: reorder the rows/cases of a datasetsummarize()
: apply functions on columnsgroup_by()
: identify columns to group rows/cases by - all of the above can be applied on a grouped data framedplyr operates on data.frames as well as the tidyverse’s data.frame replacement, known as tibble.
Note that dplyr masks some builtin functions when loaded.
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
As an example, we read in the Heart Failure Clinical Records dataset from the UCI Machine Learning Repository using readr.
dat <- read_csv(
"https://archive.ics.uci.edu/ml/machine-learning-databases/00519/heart_failure_clinical_records_dataset.csv",
col_types = cols(
age = col_double(),
anaemia = col_factor(c("0", "1")),
creatinine_phosphokinase = col_double(),
diabetes = col_factor(c("0", "1")),
ejection_fraction = col_double(),
high_blood_pressure = col_factor(c("0", "1")),
platelets = col_double(),
serum_creatinine = col_double(),
serum_sodium = col_double(),
sex = col_factor(c("0", "1")),
smoking = col_factor(c("0", "1")),
time = col_double(),
DEATH_EVENT = col_factor(c("0", "1")),
)
)
Single condition
dat |>
filter(age > 60)
# A tibble: 137 × 13
age anaemia creatinine_phosphokinase diabetes ejection_fraction
<dbl> <fct> <dbl> <fct> <dbl>
1 75 0 582 0 20
2 65 0 146 0 20
3 65 1 160 1 20
4 90 1 47 0 40
5 75 1 246 0 15
6 65 0 157 0 65
7 80 1 123 0 35
8 75 1 81 0 38
9 62 0 231 0 25
10 82 1 379 0 50
# ℹ 127 more rows
# ℹ 8 more variables: high_blood_pressure <fct>, platelets <dbl>,
# serum_creatinine <dbl>, serum_sodium <dbl>, sex <fct>, smoking <fct>,
# time <dbl>, DEATH_EVENT <fct>
Multiple conditions
dat |>
filter(age > 60, anaemia == "1")
# A tibble: 61 × 13
age anaemia creatinine_phosphokinase diabetes ejection_fraction
<dbl> <fct> <dbl> <fct> <dbl>
1 65 1 160 1 20
2 90 1 47 0 40
3 75 1 246 0 15
4 80 1 123 0 35
5 75 1 81 0 38
6 82 1 379 0 50
7 87 1 149 0 38
8 70 1 125 0 25
9 65 1 52 0 25
10 65 1 128 1 30
# ℹ 51 more rows
# ℹ 8 more variables: high_blood_pressure <fct>, platelets <dbl>,
# serum_creatinine <dbl>, serum_sodium <dbl>, sex <fct>, smoking <fct>,
# time <dbl>, DEATH_EVENT <fct>
Columns can be specified by name, index, or pattern-matching.
dat |>
select("age", "diabetes", "ejection_fraction")
# A tibble: 299 × 3
age diabetes ejection_fraction
<dbl> <fct> <dbl>
1 75 0 20
2 55 0 38
3 65 0 20
4 50 0 20
5 65 1 20
6 90 0 40
7 75 0 15
8 60 1 60
9 65 0 65
10 80 0 35
# ℹ 289 more rows
# A tibble: 299 × 2
platelets serum_sodium
<dbl> <dbl>
1 265000 130
2 263358. 136
3 162000 129
4 210000 137
5 327000 116
6 204000 132
7 127000 137
8 454000 131
9 263358. 138
10 388000 133
# ℹ 289 more rows
dat |>
select(7:9)
# A tibble: 299 × 3
platelets serum_creatinine serum_sodium
<dbl> <dbl> <dbl>
1 265000 1.9 130
2 263358. 1.1 136
3 162000 1.3 129
4 210000 1.9 137
5 327000 2.7 116
6 204000 2.1 132
7 127000 1.2 137
8 454000 1.1 131
9 263358. 1.5 138
10 388000 9.4 133
# ℹ 289 more rows
For example, select all columns between platelets
and serum_sodium
in the order they appear in the data.frame/tibble:
dat |>
select(platelets:serum_sodium)
# A tibble: 299 × 3
platelets serum_creatinine serum_sodium
<dbl> <dbl> <dbl>
1 265000 1.9 130
2 263358. 1.1 136
3 162000 1.3 129
4 210000 1.9 137
5 327000 2.7 116
6 204000 2.1 132
7 127000 1.2 137
8 454000 1.1 131
9 263358. 1.5 138
10 388000 9.4 133
# ℹ 289 more rows
dat |>
select(starts_with("serum_"))
# A tibble: 299 × 2
serum_creatinine serum_sodium
<dbl> <dbl>
1 1.9 130
2 1.1 136
3 1.3 129
4 1.9 137
5 2.7 116
6 2.1 132
7 1.2 137
8 1.1 131
9 1.5 138
10 9.4 133
# ℹ 289 more rows
Operate on a single column:
Operate on multiple columns using across()
:
# A tibble: 1 × 2
age ejection_fraction
<dbl> <dbl>
1 60.8 38.1
with single groupin variable:
# A tibble: 2 × 2
sex `mean(age)`
<fct> <dbl>
1 0 59.8
2 1 61.4
with multiple grouping variables:
# A tibble: 2 × 3
sex age serum_sodium
<fct> <dbl> <dbl>
1 0 59.8 137.
2 1 61.4 137.
`summarise()` has grouped output by 'smoking'. You can override using the
`.groups` argument.
# A tibble: 4 × 4
# Groups: smoking [2]
smoking anaemia age serum_sodium
<fct> <fct> <dbl> <dbl>
1 0 0 59.7 136.
2 0 1 61.8 137.
3 1 0 60.4 136.
4 1 1 62.6 137
arrange()
sorts a data.frame/tibble, i.e. reorders the rows/cases of a dataset.
# A tibble: 299 × 3
age sex serum_sodium
<dbl> <fct> <dbl>
1 40 1 136
2 40 0 140
3 40 0 141
4 40 1 137
5 40 1 136
6 40 1 142
7 40 1 132
8 41 1 140
9 42 0 136
10 42 1 140
# ℹ 289 more rows
Add a new column named Age_days
:
dat <- dat |>
mutate(Age_days = age * 365)
dat$Age_days
[1] 27375.00 20075.00 23725.00 18250.00 23725.00 32850.00 27375.00 21900.00
[9] 23725.00 29200.00 27375.00 22630.00 16425.00 18250.00 17885.00 29930.00
[17] 31755.00 16425.00 25550.00 17520.00 23725.00 23725.00 24820.00 19345.00
[25] 27375.00 29200.00 34675.00 25550.00 21170.00 29930.00 34310.00 31025.00
[33] 18250.00 18250.00 23725.00 25185.00 32850.00 29930.00 21900.00 21900.00
[41] 25550.00 18250.00 25550.00 26280.00 21900.00 18250.00 18615.00 21900.00
[49] 29200.00 20805.00 24820.00 19345.00 21900.00 25550.00 21900.00 34675.00
[57] 25550.00 21900.00 17885.00 26280.00 16425.00 18250.00 20075.00 16425.00
[65] 16425.00 21900.00 15330.00 26280.00 25550.00 23725.00 14965.00 21170.00
[73] 31025.00 23725.00 25185.00 21900.00 25550.00 15330.00 27375.00 20075.00
[81] 25550.00 24455.00 21900.00 28835.00 21535.00 18615.00 20075.00 23725.00
[89] 16060.00 20805.00 25550.00 21900.00 15330.00 21900.00 21170.00 21170.00
[97] 22995.00 25550.00 21900.00 22995.00 23725.00 27375.00 29200.00 15330.00
[105] 21900.00 26280.00 20075.00 16425.00 22995.00 16425.00 31025.00 20075.00
[113] 18250.00 25550.00 21900.00 21170.00 21900.00 31025.00 23725.00 31390.00
[121] 21900.00 24090.00 21900.00 21900.00 21900.00 15695.00 16790.00 21170.00
[129] 22265.00 19345.00 19345.00 21900.00 16790.00 22995.00 29565.00 27375.00
[137] 23725.00 24820.00 22630.00 18250.00 29200.00 16790.00 18250.00 22265.00
[145] 26280.00 18250.00 18980.00 23360.00 27375.00 21900.00 26280.00 22630.00
[153] 18250.00 18250.00 23725.00 21900.00 18980.00 18250.00 31025.00 21535.00
[161] 24090.00 16425.00 22995.00 18250.00 16425.00 29200.00 19345.00 21535.00
[169] 23725.00 25550.00 18615.00 18980.00 25550.00 18250.00 23725.00 21900.00
[177] 25185.00 17885.00 22995.00 20075.00 14600.00 21535.00 23725.00 27375.00
[185] 21170.00 22143.46 18250.00 21900.00 22143.46 14600.00 29200.00 23360.00
[193] 18250.00 26645.00 16425.00 28105.00 16425.00 23725.00 18250.00 21900.00
[201] 22995.00 16425.00 25550.00 21900.00 28470.00 18250.00 14600.00 31025.00
[209] 21900.00 17885.00 25550.00 18250.00 28470.00 17520.00 23725.00 26645.00
[217] 25550.00 19710.00 24820.00 20075.00 26645.00 23725.00 15330.00 17155.00
[225] 21170.00 27375.00 21170.00 20075.00 23725.00 26280.00 21900.00 25550.00
[233] 14600.00 19345.00 19345.00 28105.00 27375.00 25550.00 23725.00 20075.00
[241] 25550.00 23725.00 14600.00 26645.00 19710.00 22265.00 20075.00 23360.00
[249] 14600.00 19345.00 18250.00 20075.00 18250.00 25550.00 19345.00 18980.00
[257] 23725.00 21170.00 16425.00 19345.00 20075.00 22630.00 23725.00 24820.00
[265] 22265.00 18250.00 20075.00 20440.00 16425.00 14600.00 16060.00 18615.00
[273] 24455.00 15330.00 21900.00 16425.00 25550.00 25550.00 18250.00 20075.00
[281] 25550.00 25550.00 15330.00 23725.00 18250.00 20075.00 21900.00 16425.00
[289] 23725.00 32850.00 16425.00 21900.00 18980.00 22995.00 22630.00 20075.00
[297] 16425.00 16425.00 18250.00
Note that ungroup()
is required to return the data.frame/tibble to its original ungrouped state. If you don’t do this, all subsequent operations will be performed on the same grouping.
Inspect the new column:
dat |>
select(sex, serum_sodium, demeaned_sodium_bysex)
# A tibble: 299 × 3
sex serum_sodium demeaned_sodium_bysex
<fct> <dbl> <dbl>
1 1 130 -6.54
2 1 136 -0.536
3 1 129 -7.54
4 1 137 0.464
5 0 116 -20.8
6 1 132 -4.54
7 1 137 0.464
8 1 131 -5.54
9 0 138 1.21
10 1 133 -3.54
# ℹ 289 more rows
dplyr includes a number of ways to identify multiple variables. The latest version of dplyr suggests using across()
within dplyr functions that allow specifying columns, as we’ve used above.
This replaces separate functions previously used for each of filter/select/mutate/summarize/arrange
that had independent functions ending with *_all()
, *_at()
, *_each()
, *_if()
.
Other than specifying multiple columns by name, they can also be specified by index, by range, or by pattern-matching.
# A tibble: 1 × 2
platelets serum_sodium
<dbl> <dbl>
1 263358. 137.
For example, select all columns between platelets
and serum_sodium
in the order they appear in the data.frame/tibble:
dat |>
summarize(across(starts_with("serum_"), mean))
# A tibble: 1 × 2
serum_creatinine serum_sodium
<dbl> <dbl>
1 1.39 137.
where()
A predicate function is a function that returns a logical value, i.e. either TRUE
or FALSE
.
For example, you can use is.*
functions to identify columns based on their type, e.g. numeric or factor.
# A tibble: 1 × 9
age creatinine_phosphokinase ejection_fraction platelets serum_creatinine
<dbl> <dbl> <dbl> <dbl> <dbl>
1 60.8 582. 38.1 263358. 1.39
# ℹ 4 more variables: serum_sodium <dbl>, time <dbl>, Age_days <dbl>,
# demeaned_sodium_bysex <dbl>
To use a summary function that returns more than one row per group, use reframe()
instead of summarize()
. Unfortunately, using table()
in this way does not output level names, which limits its usefulness.
where()
It is often useful to be able to get the number of unique values per column, e.g. in order to identify which columns may need to be converted to factors. dplyr
includes the function n_distinct()
which counts the number of unique values of a vector.
The following example shows the syntax dplyr’s formula shorthand to build an anonymous function using n_distinct()
in order to select columns with fewer than 10 unique values:
dat |>
select(where(~ n_distinct(.x) < 10))
# A tibble: 299 × 6
anaemia diabetes high_blood_pressure sex smoking DEATH_EVENT
<fct> <fct> <fct> <fct> <fct> <fct>
1 0 0 1 1 0 1
2 0 0 0 1 0 1
3 0 0 0 1 1 1
4 1 0 0 1 0 1
5 1 1 0 0 0 1
6 1 0 1 1 1 1
7 1 0 0 1 0 1
8 1 1 0 1 1 1
9 0 0 0 0 0 1
10 1 0 1 1 1 1
# ℹ 289 more rows
which is equivalent to using the standard R anonymous function syntax:
dat |>
select(where(\(x) n_distinct(x) < 10))
# A tibble: 299 × 6
anaemia diabetes high_blood_pressure sex smoking DEATH_EVENT
<fct> <fct> <fct> <fct> <fct> <fct>
1 0 0 1 1 0 1
2 0 0 0 1 0 1
3 0 0 0 1 1 1
4 1 0 0 1 0 1
5 1 1 0 0 0 1
6 1 0 1 1 1 1
7 1 0 0 1 0 1
8 1 1 0 1 1 1
9 0 0 0 0 0 1
10 1 0 1 1 1 1
# ℹ 289 more rows
n_distinct(x)
is a more concise way of writing length(unique(x))
:
# A tibble: 299 × 6
anaemia diabetes high_blood_pressure sex smoking DEATH_EVENT
<fct> <fct> <fct> <fct> <fct> <fct>
1 0 0 1 1 0 1
2 0 0 0 1 0 1
3 0 0 0 1 1 1
4 1 0 0 1 0 1
5 1 1 0 0 0 1
6 1 0 1 1 1 1
7 1 0 0 1 0 1
8 1 1 0 1 1 1
9 0 0 0 0 0 1
10 1 0 1 1 1 1
# ℹ 289 more rows
To mutate those columns to factors, we combine across()
and where()
with the formula notation seen above within mutate()
:
dat <- dat |>
mutate(across(where(~ n_distinct(.) < 10), as.factor))
str(dat)
tibble [299 × 15] (S3: tbl_df/tbl/data.frame)
$ age : num [1:299] 75 55 65 50 65 90 75 60 65 80 ...
$ anaemia : Factor w/ 2 levels "0","1": 1 1 1 2 2 2 2 2 1 2 ...
$ creatinine_phosphokinase: num [1:299] 582 7861 146 111 160 ...
$ diabetes : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 2 1 1 ...
$ ejection_fraction : num [1:299] 20 38 20 20 20 40 15 60 65 35 ...
$ high_blood_pressure : Factor w/ 2 levels "0","1": 2 1 1 1 1 2 1 1 1 2 ...
$ platelets : num [1:299] 265000 263358 162000 210000 327000 ...
$ serum_creatinine : num [1:299] 1.9 1.1 1.3 1.9 2.7 2.1 1.2 1.1 1.5 9.4 ...
$ serum_sodium : num [1:299] 130 136 129 137 116 132 137 131 138 133 ...
$ sex : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 2 2 1 2 ...
$ smoking : Factor w/ 2 levels "0","1": 1 1 2 1 1 2 1 2 1 2 ...
$ time : num [1:299] 4 6 7 7 8 8 10 10 10 10 ...
$ DEATH_EVENT : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ Age_days : num [1:299] 27375 20075 23725 18250 23725 ...
$ demeaned_sodium_bysex : num [1:299] -6.536 -0.536 -7.536 0.464 -20.79 ...
Using dplyr on data.frames or tibbles does not offer any performance benefits over performing the equivalent operations in base R, and may in fact be slower.
A number of alternative backends, at different stages of maturity, are available that offer performance benefits for large datasets.