40  Introduction to dplyr

The dplyr package offers extensive functionality for data manipulation.

dplyr’s functions are named after verbs (like filter, select, etc.) and are commonly used with the pipe operator to build pipelines. The package offers a large number of functions in total, often with multiple versions of the same “verb”. It has undergone many major changes since its introduction, so always make sure to consult the latest documentation.

Core operations include:

dplyr operates on data.frames as well as the tidyverse’s data.frame replacement, known as tibble.

Note that dplyr masks some builtin functions when loaded.


Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

As an example, we read in the Heart Failure Clinical Records dataset from the UCI Machine Learning Repository using readr.

dat <- read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/00519/heart_failure_clinical_records_dataset.csv",
    col_types = cols(
        age = col_double(),
        anaemia = col_factor(c("0", "1")),
        creatinine_phosphokinase = col_double(),
        diabetes = col_factor(c("0", "1")),
        ejection_fraction = col_double(),
        high_blood_pressure = col_factor(c("0", "1")),
        platelets = col_double(),
        serum_creatinine = col_double(),
        serum_sodium = col_double(),
        sex = col_factor(c("0", "1")),
        smoking = col_factor(c("0", "1")),
        time = col_double(),
        DEATH_EVENT = col_factor(c("0", "1")),
        )
)
Note

dplyr and the tidyverse in general operate on data.frames and tibbles, which can be used as a drop-in replacement for data.frames, and heavily promote the use of the pipe operator |>.

40.1 Filter

Single condition

dat |> 
    filter(age > 60)
# A tibble: 137 × 13
     age anaemia creatinine_phosphokinase diabetes ejection_fraction
   <dbl> <fct>                      <dbl> <fct>                <dbl>
 1    75 0                            582 0                       20
 2    65 0                            146 0                       20
 3    65 1                            160 1                       20
 4    90 1                             47 0                       40
 5    75 1                            246 0                       15
 6    65 0                            157 0                       65
 7    80 1                            123 0                       35
 8    75 1                             81 0                       38
 9    62 0                            231 0                       25
10    82 1                            379 0                       50
# ℹ 127 more rows
# ℹ 8 more variables: high_blood_pressure <fct>, platelets <dbl>,
#   serum_creatinine <dbl>, serum_sodium <dbl>, sex <fct>, smoking <fct>,
#   time <dbl>, DEATH_EVENT <fct>

Multiple conditions

dat |> 
    filter(age > 60, anaemia == "1")
# A tibble: 61 × 13
     age anaemia creatinine_phosphokinase diabetes ejection_fraction
   <dbl> <fct>                      <dbl> <fct>                <dbl>
 1    65 1                            160 1                       20
 2    90 1                             47 0                       40
 3    75 1                            246 0                       15
 4    80 1                            123 0                       35
 5    75 1                             81 0                       38
 6    82 1                            379 0                       50
 7    87 1                            149 0                       38
 8    70 1                            125 0                       25
 9    65 1                             52 0                       25
10    65 1                            128 1                       30
# ℹ 51 more rows
# ℹ 8 more variables: high_blood_pressure <fct>, platelets <dbl>,
#   serum_creatinine <dbl>, serum_sodium <dbl>, sex <fct>, smoking <fct>,
#   time <dbl>, DEATH_EVENT <fct>

40.2 Select

Columns can be specified by name, index, or pattern-matching.

40.2.1 By column name

dat |> 
    select("age", "diabetes", "ejection_fraction")
# A tibble: 299 × 3
     age diabetes ejection_fraction
   <dbl> <fct>                <dbl>
 1    75 0                       20
 2    55 0                       38
 3    65 0                       20
 4    50 0                       20
 5    65 1                       20
 6    90 0                       40
 7    75 0                       15
 8    60 1                       60
 9    65 0                       65
10    80 0                       35
# ℹ 289 more rows

40.2.2 By integer column index

dat |> 
    select(c(7, 9))
# A tibble: 299 × 2
   platelets serum_sodium
       <dbl>        <dbl>
 1   265000           130
 2   263358.          136
 3   162000           129
 4   210000           137
 5   327000           116
 6   204000           132
 7   127000           137
 8   454000           131
 9   263358.          138
10   388000           133
# ℹ 289 more rows
dat |> 
    select(7:9)
# A tibble: 299 × 3
   platelets serum_creatinine serum_sodium
       <dbl>            <dbl>        <dbl>
 1   265000               1.9          130
 2   263358.              1.1          136
 3   162000               1.3          129
 4   210000               1.9          137
 5   327000               2.7          116
 6   204000               2.1          132
 7   127000               1.2          137
 8   454000               1.1          131
 9   263358.              1.5          138
10   388000               9.4          133
# ℹ 289 more rows

40.2.3 By character name range

For example, select all columns between platelets and serum_sodium in the order they appear in the data.frame/tibble:

dat |> 
    select(platelets:serum_sodium)
# A tibble: 299 × 3
   platelets serum_creatinine serum_sodium
       <dbl>            <dbl>        <dbl>
 1   265000               1.9          130
 2   263358.              1.1          136
 3   162000               1.3          129
 4   210000               1.9          137
 5   327000               2.7          116
 6   204000               2.1          132
 7   127000               1.2          137
 8   454000               1.1          131
 9   263358.              1.5          138
10   388000               9.4          133
# ℹ 289 more rows

40.2.4 Pattern-matching

dat |> 
    select(starts_with("serum_"))
# A tibble: 299 × 2
   serum_creatinine serum_sodium
              <dbl>        <dbl>
 1              1.9          130
 2              1.1          136
 3              1.3          129
 4              1.9          137
 5              2.7          116
 6              2.1          132
 7              1.2          137
 8              1.1          131
 9              1.5          138
10              9.4          133
# ℹ 289 more rows
dat |> 
    select(ends_with("fraction"))
# A tibble: 299 × 1
   ejection_fraction
               <dbl>
 1                20
 2                38
 3                20
 4                20
 5                20
 6                40
 7                15
 8                60
 9                65
10                35
# ℹ 289 more rows

40.3 Summarize

Operate on a single column:

dat |> 
    summarize(mean(age))
# A tibble: 1 × 1
  `mean(age)`
        <dbl>
1        60.8

Operate on multiple columns using across():

dat |> 
    summarize(across(c(age, ejection_fraction), mean))
# A tibble: 1 × 2
    age ejection_fraction
  <dbl>             <dbl>
1  60.8              38.1

40.3.1 Grouped operation on single column

with single groupin variable:

dat |> 
    group_by(sex) |> 
    summarize(mean(age))
# A tibble: 2 × 2
  sex   `mean(age)`
  <fct>       <dbl>
1 0            59.8
2 1            61.4

with multiple grouping variables:

dat |> 
    group_by(sex, diabetes) |>
    summarize(mean(age))
`summarise()` has grouped output by 'sex'. You can override using the `.groups`
argument.
# A tibble: 4 × 3
# Groups:   sex [2]
  sex   diabetes `mean(age)`
  <fct> <fct>          <dbl>
1 0     0               61.6
2 0     1               58.2
3 1     0               62.0
4 1     1               60.4

40.3.2 Grouped operation on multiple columns

dat |> 
    group_by(sex) |> 
    summarize(across(c(age, serum_sodium), mean))
# A tibble: 2 × 3
  sex     age serum_sodium
  <fct> <dbl>        <dbl>
1 0      59.8         137.
2 1      61.4         137.
dat |> 
    group_by(smoking, anaemia) |> 
    summarize(across(c(age, serum_sodium), mean))
`summarise()` has grouped output by 'smoking'. You can override using the
`.groups` argument.
# A tibble: 4 × 4
# Groups:   smoking [2]
  smoking anaemia   age serum_sodium
  <fct>   <fct>   <dbl>        <dbl>
1 0       0        59.7         136.
2 0       1        61.8         137.
3 1       0        60.4         136.
4 1       1        62.6         137 

40.4 Arrange

arrange() sorts a data.frame/tibble, i.e. reorders the rows/cases of a dataset.

dat |> 
    select(age, sex, serum_sodium) |> 
    arrange(age)
# A tibble: 299 × 3
     age sex   serum_sodium
   <dbl> <fct>        <dbl>
 1    40 1              136
 2    40 0              140
 3    40 0              141
 4    40 1              137
 5    40 1              136
 6    40 1              142
 7    40 1              132
 8    41 1              140
 9    42 0              136
10    42 1              140
# ℹ 289 more rows

40.4.1 Grouped

dat |> 
    select(age, sex, serum_sodium) |> 
    group_by(sex) |> 
    arrange(age)
# A tibble: 299 × 3
# Groups:   sex [2]
     age sex   serum_sodium
   <dbl> <fct>        <dbl>
 1    40 1              136
 2    40 0              140
 3    40 0              141
 4    40 1              137
 5    40 1              136
 6    40 1              142
 7    40 1              132
 8    41 1              140
 9    42 0              136
10    42 1              140
# ℹ 289 more rows

40.5 Mutate

Add a new column named Age_days:

dat <- dat |> 
    mutate(Age_days = age * 365)
dat$Age_days
  [1] 27375.00 20075.00 23725.00 18250.00 23725.00 32850.00 27375.00 21900.00
  [9] 23725.00 29200.00 27375.00 22630.00 16425.00 18250.00 17885.00 29930.00
 [17] 31755.00 16425.00 25550.00 17520.00 23725.00 23725.00 24820.00 19345.00
 [25] 27375.00 29200.00 34675.00 25550.00 21170.00 29930.00 34310.00 31025.00
 [33] 18250.00 18250.00 23725.00 25185.00 32850.00 29930.00 21900.00 21900.00
 [41] 25550.00 18250.00 25550.00 26280.00 21900.00 18250.00 18615.00 21900.00
 [49] 29200.00 20805.00 24820.00 19345.00 21900.00 25550.00 21900.00 34675.00
 [57] 25550.00 21900.00 17885.00 26280.00 16425.00 18250.00 20075.00 16425.00
 [65] 16425.00 21900.00 15330.00 26280.00 25550.00 23725.00 14965.00 21170.00
 [73] 31025.00 23725.00 25185.00 21900.00 25550.00 15330.00 27375.00 20075.00
 [81] 25550.00 24455.00 21900.00 28835.00 21535.00 18615.00 20075.00 23725.00
 [89] 16060.00 20805.00 25550.00 21900.00 15330.00 21900.00 21170.00 21170.00
 [97] 22995.00 25550.00 21900.00 22995.00 23725.00 27375.00 29200.00 15330.00
[105] 21900.00 26280.00 20075.00 16425.00 22995.00 16425.00 31025.00 20075.00
[113] 18250.00 25550.00 21900.00 21170.00 21900.00 31025.00 23725.00 31390.00
[121] 21900.00 24090.00 21900.00 21900.00 21900.00 15695.00 16790.00 21170.00
[129] 22265.00 19345.00 19345.00 21900.00 16790.00 22995.00 29565.00 27375.00
[137] 23725.00 24820.00 22630.00 18250.00 29200.00 16790.00 18250.00 22265.00
[145] 26280.00 18250.00 18980.00 23360.00 27375.00 21900.00 26280.00 22630.00
[153] 18250.00 18250.00 23725.00 21900.00 18980.00 18250.00 31025.00 21535.00
[161] 24090.00 16425.00 22995.00 18250.00 16425.00 29200.00 19345.00 21535.00
[169] 23725.00 25550.00 18615.00 18980.00 25550.00 18250.00 23725.00 21900.00
[177] 25185.00 17885.00 22995.00 20075.00 14600.00 21535.00 23725.00 27375.00
[185] 21170.00 22143.46 18250.00 21900.00 22143.46 14600.00 29200.00 23360.00
[193] 18250.00 26645.00 16425.00 28105.00 16425.00 23725.00 18250.00 21900.00
[201] 22995.00 16425.00 25550.00 21900.00 28470.00 18250.00 14600.00 31025.00
[209] 21900.00 17885.00 25550.00 18250.00 28470.00 17520.00 23725.00 26645.00
[217] 25550.00 19710.00 24820.00 20075.00 26645.00 23725.00 15330.00 17155.00
[225] 21170.00 27375.00 21170.00 20075.00 23725.00 26280.00 21900.00 25550.00
[233] 14600.00 19345.00 19345.00 28105.00 27375.00 25550.00 23725.00 20075.00
[241] 25550.00 23725.00 14600.00 26645.00 19710.00 22265.00 20075.00 23360.00
[249] 14600.00 19345.00 18250.00 20075.00 18250.00 25550.00 19345.00 18980.00
[257] 23725.00 21170.00 16425.00 19345.00 20075.00 22630.00 23725.00 24820.00
[265] 22265.00 18250.00 20075.00 20440.00 16425.00 14600.00 16060.00 18615.00
[273] 24455.00 15330.00 21900.00 16425.00 25550.00 25550.00 18250.00 20075.00
[281] 25550.00 25550.00 15330.00 23725.00 18250.00 20075.00 21900.00 16425.00
[289] 23725.00 32850.00 16425.00 21900.00 18980.00 22995.00 22630.00 20075.00
[297] 16425.00 16425.00 18250.00

40.5.1 Mutate by group

dat <- dat |> 
    group_by(sex) |> 
    mutate(demeaned_sodium_bysex = serum_sodium - mean(serum_sodium)) |> 
    ungroup()
Important

Note that ungroup() is required to return the data.frame/tibble to its original ungrouped state. If you don’t do this, all subsequent operations will be performed on the same grouping.

Inspect the new column:

dat |> 
    select(sex, serum_sodium, demeaned_sodium_bysex)
# A tibble: 299 × 3
   sex   serum_sodium demeaned_sodium_bysex
   <fct>        <dbl>                 <dbl>
 1 1              130                -6.54 
 2 1              136                -0.536
 3 1              129                -7.54 
 4 1              137                 0.464
 5 0              116               -20.8  
 6 1              132                -4.54 
 7 1              137                 0.464
 8 1              131                -5.54 
 9 0              138                 1.21 
10 1              133                -3.54 
# ℹ 289 more rows

40.6 Specifying multiple columns

dplyr includes a number of ways to identify multiple variables. The latest version of dplyr suggests using across() within dplyr functions that allow specifying columns, as we’ve used above.

This replaces separate functions previously used for each of filter/select/mutate/summarize/arrange that had independent functions ending with *_all(), *_at(), *_each(), *_if().

Other than specifying multiple columns by name, they can also be specified by index, by range, or by pattern-matching.

40.6.1 By integer column index

dat |> 
    summarize(across(c(7, 9), mean))
# A tibble: 1 × 2
  platelets serum_sodium
      <dbl>        <dbl>
1   263358.         137.
dat |> 
    summarize(across(7:9, mean))
# A tibble: 1 × 3
  platelets serum_creatinine serum_sodium
      <dbl>            <dbl>        <dbl>
1   263358.             1.39         137.

40.6.2 By character name range

For example, select all columns between platelets and serum_sodium in the order they appear in the data.frame/tibble:

dat |> 
    summarize(across(platelets:serum_sodium, mean))
# A tibble: 1 × 3
  platelets serum_creatinine serum_sodium
      <dbl>            <dbl>        <dbl>
1   263358.             1.39         137.

40.6.3 Pattern-matching

dat |> 
    summarize(across(starts_with("serum_"), mean))
# A tibble: 1 × 2
  serum_creatinine serum_sodium
             <dbl>        <dbl>
1             1.39         137.
dat |> 
    summarize(across(ends_with("fraction"), mean))
# A tibble: 1 × 1
  ejection_fraction
              <dbl>
1              38.1

40.6.4 Using predicate function wrapped in where()

Tip

A predicate function is a function that returns a logical value, i.e. either TRUE or FALSE.

For example, you can use is.* functions to identify columns based on their type, e.g. numeric or factor.

dat |> 
    summarize(across(where(is.numeric), mean))
# A tibble: 1 × 9
    age creatinine_phosphokinase ejection_fraction platelets serum_creatinine
  <dbl>                    <dbl>             <dbl>     <dbl>            <dbl>
1  60.8                     582.              38.1   263358.             1.39
# ℹ 4 more variables: serum_sodium <dbl>, time <dbl>, Age_days <dbl>,
#   demeaned_sodium_bysex <dbl>

To use a summary function that returns more than one row per group, use reframe() instead of summarize(). Unfortunately, using table() in this way does not output level names, which limits its usefulness.

dat |> 
    reframe(across(where(is.factor), table))
# A tibble: 2 × 6
  anaemia     diabetes    high_blood_pressure sex         smoking    DEATH_EVENT
  <table[1d]> <table[1d]> <table[1d]>         <table[1d]> <table[1d> <table[1d]>
1 170         174         194                 105         203        203        
2 129         125         105                 194          96         96        

40.6.5 Using anonymous predicate function with where()

It is often useful to be able to get the number of unique values per column, e.g. in order to identify which columns may need to be converted to factors. dplyr includes the function n_distinct() which counts the number of unique values of a vector.

The following example shows the syntax dplyr’s formula shorthand to build an anonymous function using n_distinct() in order to select columns with fewer than 10 unique values:

dat |> 
    select(where(~ n_distinct(.x) < 10))
# A tibble: 299 × 6
   anaemia diabetes high_blood_pressure sex   smoking DEATH_EVENT
   <fct>   <fct>    <fct>               <fct> <fct>   <fct>      
 1 0       0        1                   1     0       1          
 2 0       0        0                   1     0       1          
 3 0       0        0                   1     1       1          
 4 1       0        0                   1     0       1          
 5 1       1        0                   0     0       1          
 6 1       0        1                   1     1       1          
 7 1       0        0                   1     0       1          
 8 1       1        0                   1     1       1          
 9 0       0        0                   0     0       1          
10 1       0        1                   1     1       1          
# ℹ 289 more rows

which is equivalent to using the standard R anonymous function syntax:

dat |> 
    select(where(\(x) n_distinct(x) < 10))
# A tibble: 299 × 6
   anaemia diabetes high_blood_pressure sex   smoking DEATH_EVENT
   <fct>   <fct>    <fct>               <fct> <fct>   <fct>      
 1 0       0        1                   1     0       1          
 2 0       0        0                   1     0       1          
 3 0       0        0                   1     1       1          
 4 1       0        0                   1     0       1          
 5 1       1        0                   0     0       1          
 6 1       0        1                   1     1       1          
 7 1       0        0                   1     0       1          
 8 1       1        0                   1     1       1          
 9 0       0        0                   0     0       1          
10 1       0        1                   1     1       1          
# ℹ 289 more rows

n_distinct(x) is a more concise way of writing length(unique(x)):

dat |> 
    select(where(\(x) length(unique(x)) < 10))
# A tibble: 299 × 6
   anaemia diabetes high_blood_pressure sex   smoking DEATH_EVENT
   <fct>   <fct>    <fct>               <fct> <fct>   <fct>      
 1 0       0        1                   1     0       1          
 2 0       0        0                   1     0       1          
 3 0       0        0                   1     1       1          
 4 1       0        0                   1     0       1          
 5 1       1        0                   0     0       1          
 6 1       0        1                   1     1       1          
 7 1       0        0                   1     0       1          
 8 1       1        0                   1     1       1          
 9 0       0        0                   0     0       1          
10 1       0        1                   1     1       1          
# ℹ 289 more rows

To mutate those columns to factors, we combine across() and where() with the formula notation seen above within mutate():

dat <- dat |> 
    mutate(across(where(~ n_distinct(.) < 10), as.factor))
str(dat)
tibble [299 × 15] (S3: tbl_df/tbl/data.frame)
 $ age                     : num [1:299] 75 55 65 50 65 90 75 60 65 80 ...
 $ anaemia                 : Factor w/ 2 levels "0","1": 1 1 1 2 2 2 2 2 1 2 ...
 $ creatinine_phosphokinase: num [1:299] 582 7861 146 111 160 ...
 $ diabetes                : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 2 1 1 ...
 $ ejection_fraction       : num [1:299] 20 38 20 20 20 40 15 60 65 35 ...
 $ high_blood_pressure     : Factor w/ 2 levels "0","1": 2 1 1 1 1 2 1 1 1 2 ...
 $ platelets               : num [1:299] 265000 263358 162000 210000 327000 ...
 $ serum_creatinine        : num [1:299] 1.9 1.1 1.3 1.9 2.7 2.1 1.2 1.1 1.5 9.4 ...
 $ serum_sodium            : num [1:299] 130 136 129 137 116 132 137 131 138 133 ...
 $ sex                     : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 2 2 1 2 ...
 $ smoking                 : Factor w/ 2 levels "0","1": 1 1 2 1 1 2 1 2 1 2 ...
 $ time                    : num [1:299] 4 6 7 7 8 8 10 10 10 10 ...
 $ DEATH_EVENT             : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
 $ Age_days                : num [1:299] 27375 20075 23725 18250 23725 ...
 $ demeaned_sodium_bysex   : num [1:299] -6.536 -0.536 -7.536 0.464 -20.79 ...

40.7 dplyr backends

Using dplyr on data.frames or tibbles does not offer any performance benefits over performing the equivalent operations in base R, and may in fact be slower.

A number of alternative backends, at different stages of maturity, are available that offer performance benefits for large datasets.

40.8 Resources

40.9 See also