55  Data Queries across implementations

This chapter shows how to perform filter, select, and group-by operations using different data frame implementations:

Base R and dplyr are both slow for large datasets. data.table is very fast and features a very concise and efficient syntax, but is limited to in-memory data. polars is very fast and can handle both in-memory and out-of-memory data, but features a possibly cumbersome syntax. tidypolars allows using dplyr syntax with polars data frames. DuckDB is also very fast and can handle both in-memory and out-of-memory data, but uses SQL syntax, not great for data frame operations. duckplyr uses dplyr syntax with DuckDB data frames.

Note

Throughout this chapter, we will add head() as needed only in examples that would result in printing the entire data frame.

55.1 Setup

55.1.1 Load packages

Caution

Note the masking of functions that happens when all these packages are loaded together. In practice, you may want/have to stick to using a single approach at a time.

Following isntructions on the duckplyr docs, we will not load duckplyr directly, but rather use the duckplyr::as_duckplyr_df to convert the iris data frame to to a duckplyr data frame.


Attaching package: 'dplyr'
The following objects are masked from 'package:data.table':

    between, first, last
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

55.1.2 DuckDB

The following code enables reporting of fallbacks to dplyr when duckplyr fails.

Sys.setenv(DUCKPLYR_FALLBACK_VERBOSE = TRUE)

55.1.3 Create data frames

We will use the iris built-in dataset for this chapter and create a data.table, a tibble, a RPolarsDataFrame, and a duckplyr version of it.

head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
iris_dt <- as.data.table(iris)
iris_dt
     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
            <num>       <num>        <num>       <num>    <fctr>
  1:          5.1         3.5          1.4         0.2    setosa
  2:          4.9         3.0          1.4         0.2    setosa
  3:          4.7         3.2          1.3         0.2    setosa
  4:          4.6         3.1          1.5         0.2    setosa
  5:          5.0         3.6          1.4         0.2    setosa
 ---                                                            
146:          6.7         3.0          5.2         2.3 virginica
147:          6.3         2.5          5.0         1.9 virginica
148:          6.5         3.0          5.2         2.0 virginica
149:          6.2         3.4          5.4         2.3 virginica
150:          5.9         3.0          5.1         1.8 virginica
iris_tbl <- as_tibble(iris)
iris_tbl
# A tibble: 150 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
 1          5.1         3.5          1.4         0.2 setosa 
 2          4.9         3            1.4         0.2 setosa 
 3          4.7         3.2          1.3         0.2 setosa 
 4          4.6         3.1          1.5         0.2 setosa 
 5          5           3.6          1.4         0.2 setosa 
 6          5.4         3.9          1.7         0.4 setosa 
 7          4.6         3.4          1.4         0.3 setosa 
 8          5           3.4          1.5         0.2 setosa 
 9          4.4         2.9          1.4         0.2 setosa 
10          4.9         3.1          1.5         0.1 setosa 
# ℹ 140 more rows
iris_pl <- as_polars_df(iris)
iris_pl
shape: (150, 5)
┌──────────────┬─────────────┬──────────────┬─────────────┬───────────┐
│ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species   │
│ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---       │
│ f64          ┆ f64         ┆ f64          ┆ f64         ┆ cat       │
╞══════════════╪═════════════╪══════════════╪═════════════╪═══════════╡
│ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         ┆ setosa    │
│ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         ┆ setosa    │
│ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         ┆ setosa    │
│ 4.6          ┆ 3.1         ┆ 1.5          ┆ 0.2         ┆ setosa    │
│ 5.0          ┆ 3.6         ┆ 1.4          ┆ 0.2         ┆ setosa    │
│ …            ┆ …           ┆ …            ┆ …           ┆ …         │
│ 6.7          ┆ 3.0         ┆ 5.2          ┆ 2.3         ┆ virginica │
│ 6.3          ┆ 2.5         ┆ 5.0          ┆ 1.9         ┆ virginica │
│ 6.5          ┆ 3.0         ┆ 5.2          ┆ 2.0         ┆ virginica │
│ 6.2          ┆ 3.4         ┆ 5.4          ┆ 2.3         ┆ virginica │
│ 5.9          ┆ 3.0         ┆ 5.1          ┆ 1.8         ┆ virginica │
└──────────────┴─────────────┴──────────────┴─────────────┴───────────┘
iris_db <- duckplyr::as_duckplyr_df(iris)
The duckplyr package is configured to fall back to dplyr when it encounters an
incompatibility. Fallback events can be collected and uploaded for analysis to
guide future development. By default, no data will be collected or uploaded.
→ Run `duckplyr::fallback_sitrep()` to review the current settings.
head(iris_db)
The duckplyr package is configured to fall back to dplyr when it encounters an
incompatibility. Fallback events can be collected and uploaded for analysis to
guide future development. By default, no data will be collected or uploaded.
ℹ A fallback situation just occurred. The following information would have been
  recorded:
  {"version":"0.4.1","message":"Can't convert columns of class <factor> to
  relational. Affected
  column:\n`...5`.","name":"head","x":{"...1":"numeric","...2":"numeric","...3":"numeric","...4":"numeric","...5":"factor"},"args":{"n":6}}
→ Run `duckplyr::fallback_sitrep()` to review the current settings.
→ Run `Sys.setenv(DUCKPLYR_FALLBACK_COLLECT = 1)` to enable fallback logging,
  and `Sys.setenv(DUCKPLYR_FALLBACK_VERBOSE = TRUE)` in addition to enable
  printing of fallback situations to the console.
→ Run `duckplyr::fallback_review()` to review the available reports, and
  `duckplyr::fallback_upload()` to upload them.
ℹ See `?duckplyr::fallback()` for details.
ℹ This message will be displayed once every eight hours.
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

55.2 Select

Select columns Sepal.Length and Species from the iris dataset.

iris[, c("Sepal.Length", "Species")] |> head()
  Sepal.Length Species
1          5.1  setosa
2          4.9  setosa
3          4.7  setosa
4          4.6  setosa
5          5.0  setosa
6          5.4  setosa
iris_dt[, .(Sepal.Length, Species)]
     Sepal.Length   Species
            <num>    <fctr>
  1:          5.1    setosa
  2:          4.9    setosa
  3:          4.7    setosa
  4:          4.6    setosa
  5:          5.0    setosa
 ---                       
146:          6.7 virginica
147:          6.3 virginica
148:          6.5 virginica
149:          6.2 virginica
150:          5.9 virginica
iris_tbl |> 
  select(Sepal.Length, Species)
# A tibble: 150 × 2
   Sepal.Length Species
          <dbl> <fct>  
 1          5.1 setosa 
 2          4.9 setosa 
 3          4.7 setosa 
 4          4.6 setosa 
 5          5   setosa 
 6          5.4 setosa 
 7          4.6 setosa 
 8          5   setosa 
 9          4.4 setosa 
10          4.9 setosa 
# ℹ 140 more rows
iris_pl$select(
  pl$col(c("Sepal.Length", "Species"))
)
shape: (150, 2)
┌──────────────┬───────────┐
│ Sepal.Length ┆ Species   │
│ ---          ┆ ---       │
│ f64          ┆ cat       │
╞══════════════╪═══════════╡
│ 5.1          ┆ setosa    │
│ 4.9          ┆ setosa    │
│ 4.7          ┆ setosa    │
│ 4.6          ┆ setosa    │
│ 5.0          ┆ setosa    │
│ …            ┆ …         │
│ 6.7          ┆ virginica │
│ 6.3          ┆ virginica │
│ 6.5          ┆ virginica │
│ 6.2          ┆ virginica │
│ 5.9          ┆ virginica │
└──────────────┴───────────┘
iris_pl |>
  select(Sepal.Length, Species)
shape: (150, 2)
┌──────────────┬───────────┐
│ Sepal.Length ┆ Species   │
│ ---          ┆ ---       │
│ f64          ┆ cat       │
╞══════════════╪═══════════╡
│ 5.1          ┆ setosa    │
│ 4.9          ┆ setosa    │
│ 4.7          ┆ setosa    │
│ 4.6          ┆ setosa    │
│ 5.0          ┆ setosa    │
│ …            ┆ …         │
│ 6.7          ┆ virginica │
│ 6.3          ┆ virginica │
│ 6.5          ┆ virginica │
│ 6.2          ┆ virginica │
│ 5.9          ┆ virginica │
└──────────────┴───────────┘
iris_db |> 
  select(Sepal.Length, Species) |> head()
  Sepal.Length Species
1          5.1  setosa
2          4.9  setosa
3          4.7  setosa
4          4.6  setosa
5          5.0  setosa
6          5.4  setosa

55.3 Filter

We look at an example combining 2 conditions, one on a numeric and one on a factor column:

Filter rows where Sepal.Length is less than 5.6 and Species is versicolor or virginica.

iris[iris$Sepal.Length < 5.6 & iris$Species %in% c("versicolor", "virginica"), ]
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
54           5.5         2.3          4.0         1.3 versicolor
58           4.9         2.4          3.3         1.0 versicolor
60           5.2         2.7          3.9         1.4 versicolor
61           5.0         2.0          3.5         1.0 versicolor
81           5.5         2.4          3.8         1.1 versicolor
82           5.5         2.4          3.7         1.0 versicolor
85           5.4         3.0          4.5         1.5 versicolor
90           5.5         2.5          4.0         1.3 versicolor
91           5.5         2.6          4.4         1.2 versicolor
94           5.0         2.3          3.3         1.0 versicolor
99           5.1         2.5          3.0         1.1 versicolor
107          4.9         2.5          4.5         1.7  virginica

or

subset(iris, Sepal.Length < 5.6 & iris$Species %in% c("versicolor", "virginica"))
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
54           5.5         2.3          4.0         1.3 versicolor
58           4.9         2.4          3.3         1.0 versicolor
60           5.2         2.7          3.9         1.4 versicolor
61           5.0         2.0          3.5         1.0 versicolor
81           5.5         2.4          3.8         1.1 versicolor
82           5.5         2.4          3.7         1.0 versicolor
85           5.4         3.0          4.5         1.5 versicolor
90           5.5         2.5          4.0         1.3 versicolor
91           5.5         2.6          4.4         1.2 versicolor
94           5.0         2.3          3.3         1.0 versicolor
99           5.1         2.5          3.0         1.1 versicolor
107          4.9         2.5          4.5         1.7  virginica
iris_dt[Sepal.Length < 5.6 & iris$Species %in% c("versicolor", "virginica")]
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
           <num>       <num>        <num>       <num>     <fctr>
 1:          5.5         2.3          4.0         1.3 versicolor
 2:          4.9         2.4          3.3         1.0 versicolor
 3:          5.2         2.7          3.9         1.4 versicolor
 4:          5.0         2.0          3.5         1.0 versicolor
 5:          5.5         2.4          3.8         1.1 versicolor
 6:          5.5         2.4          3.7         1.0 versicolor
 7:          5.4         3.0          4.5         1.5 versicolor
 8:          5.5         2.5          4.0         1.3 versicolor
 9:          5.5         2.6          4.4         1.2 versicolor
10:          5.0         2.3          3.3         1.0 versicolor
11:          5.1         2.5          3.0         1.1 versicolor
12:          4.9         2.5          4.5         1.7  virginica
iris_tbl |>
  filter(
    Sepal.Length < 5.6,
    Species %in% c("versicolor", "virginica")
)
# A tibble: 12 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
          <dbl>       <dbl>        <dbl>       <dbl> <fct>     
 1          5.5         2.3          4           1.3 versicolor
 2          4.9         2.4          3.3         1   versicolor
 3          5.2         2.7          3.9         1.4 versicolor
 4          5           2            3.5         1   versicolor
 5          5.5         2.4          3.8         1.1 versicolor
 6          5.5         2.4          3.7         1   versicolor
 7          5.4         3            4.5         1.5 versicolor
 8          5.5         2.5          4           1.3 versicolor
 9          5.5         2.6          4.4         1.2 versicolor
10          5           2.3          3.3         1   versicolor
11          5.1         2.5          3           1.1 versicolor
12          4.9         2.5          4.5         1.7 virginica 
iris_pl$filter(
  pl$col("Sepal.Length") < 5.6,
  pl$col("Species")$is_in(c("versicolor", "virginica"))
)
shape: (12, 5)
┌──────────────┬─────────────┬──────────────┬─────────────┬────────────┐
│ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species    │
│ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---        │
│ f64          ┆ f64         ┆ f64          ┆ f64         ┆ cat        │
╞══════════════╪═════════════╪══════════════╪═════════════╪════════════╡
│ 5.5          ┆ 2.3         ┆ 4.0          ┆ 1.3         ┆ versicolor │
│ 4.9          ┆ 2.4         ┆ 3.3          ┆ 1.0         ┆ versicolor │
│ 5.2          ┆ 2.7         ┆ 3.9          ┆ 1.4         ┆ versicolor │
│ 5.0          ┆ 2.0         ┆ 3.5          ┆ 1.0         ┆ versicolor │
│ 5.5          ┆ 2.4         ┆ 3.8          ┆ 1.1         ┆ versicolor │
│ …            ┆ …           ┆ …            ┆ …           ┆ …          │
│ 5.5          ┆ 2.5         ┆ 4.0          ┆ 1.3         ┆ versicolor │
│ 5.5          ┆ 2.6         ┆ 4.4          ┆ 1.2         ┆ versicolor │
│ 5.0          ┆ 2.3         ┆ 3.3          ┆ 1.0         ┆ versicolor │
│ 5.1          ┆ 2.5         ┆ 3.0          ┆ 1.1         ┆ versicolor │
│ 4.9          ┆ 2.5         ┆ 4.5          ┆ 1.7         ┆ virginica  │
└──────────────┴─────────────┴──────────────┴─────────────┴────────────┘
iris_pl |>
  filter(
    Sepal.Length < 5.6,
    Species %in% c("versicolor", "virginica")
  )
shape: (12, 5)
┌──────────────┬─────────────┬──────────────┬─────────────┬────────────┐
│ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species    │
│ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---        │
│ f64          ┆ f64         ┆ f64          ┆ f64         ┆ cat        │
╞══════════════╪═════════════╪══════════════╪═════════════╪════════════╡
│ 5.5          ┆ 2.3         ┆ 4.0          ┆ 1.3         ┆ versicolor │
│ 4.9          ┆ 2.4         ┆ 3.3          ┆ 1.0         ┆ versicolor │
│ 5.2          ┆ 2.7         ┆ 3.9          ┆ 1.4         ┆ versicolor │
│ 5.0          ┆ 2.0         ┆ 3.5          ┆ 1.0         ┆ versicolor │
│ 5.5          ┆ 2.4         ┆ 3.8          ┆ 1.1         ┆ versicolor │
│ …            ┆ …           ┆ …            ┆ …           ┆ …          │
│ 5.5          ┆ 2.5         ┆ 4.0          ┆ 1.3         ┆ versicolor │
│ 5.5          ┆ 2.6         ┆ 4.4          ┆ 1.2         ┆ versicolor │
│ 5.0          ┆ 2.3         ┆ 3.3          ┆ 1.0         ┆ versicolor │
│ 5.1          ┆ 2.5         ┆ 3.0          ┆ 1.1         ┆ versicolor │
│ 4.9          ┆ 2.5         ┆ 4.5          ┆ 1.7         ┆ virginica  │
└──────────────┴─────────────┴──────────────┴─────────────┴────────────┘
iris_db |>
  filter(
    Sepal.Length < 5.6,
    Species %in% c("versicolor", "virginica")
  )
   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1           5.5         2.3          4.0         1.3 versicolor
2           4.9         2.4          3.3         1.0 versicolor
3           5.2         2.7          3.9         1.4 versicolor
4           5.0         2.0          3.5         1.0 versicolor
5           5.5         2.4          3.8         1.1 versicolor
6           5.5         2.4          3.7         1.0 versicolor
7           5.4         3.0          4.5         1.5 versicolor
8           5.5         2.5          4.0         1.3 versicolor
9           5.5         2.6          4.4         1.2 versicolor
10          5.0         2.3          3.3         1.0 versicolor
11          5.1         2.5          3.0         1.1 versicolor
12          4.9         2.5          4.5         1.7  virginica

55.4 Group-by

Get the mean of Sepal.Length by Species.

aggregate(Sepal.Length ~ Species, iris, mean)
     Species Sepal.Length
1     setosa        5.006
2 versicolor        5.936
3  virginica        6.588
iris_dt[, .(mean(Sepal.Length)), by = Species]
      Species    V1
       <fctr> <num>
1:     setosa 5.006
2: versicolor 5.936
3:  virginica 6.588
iris_tbl |> 
  group_by(Species) |> 
  summarize(mean(Sepal.Length))
# A tibble: 3 × 2
  Species    `mean(Sepal.Length)`
  <fct>                     <dbl>
1 setosa                     5.01
2 versicolor                 5.94
3 virginica                  6.59
iris_pl$group_by("Species")$agg(pl$mean("Sepal.Length"))
shape: (3, 2)
┌────────────┬──────────────┐
│ Species    ┆ Sepal.Length │
│ ---        ┆ ---          │
│ cat        ┆ f64          │
╞════════════╪══════════════╡
│ setosa     ┆ 5.006        │
│ virginica  ┆ 6.588        │
│ versicolor ┆ 5.936        │
└────────────┴──────────────┘

Note that, unlike dplyr, tidypolars requires summarize inputs to be named (at least at the time of writing).

iris_pl |>
  group_by(Species) |>
  summarize(mean_Sepal.Length = mean(Sepal.Length))
shape: (3, 2)
┌────────────┬───────────────────┐
│ Species    ┆ mean_Sepal.Length │
│ ---        ┆ ---               │
│ cat        ┆ f64               │
╞════════════╪═══════════════════╡
│ virginica  ┆ 6.588             │
│ versicolor ┆ 5.936             │
│ setosa     ┆ 5.006             │
└────────────┴───────────────────┘
Groups [3]: Species
Maintain order: FALSE
iris_db |> 
  group_by(Species) |> 
  summarize(mean(Sepal.Length))
# A tibble: 3 × 2
  Species    `mean(Sepal.Length)`
  <fct>                     <dbl>
1 setosa                     5.01
2 versicolor                 5.94
3 virginica                  6.59

55.5 References