This chapter shows how to perform filter, select, and group-by operations using different data frame implementations:
base R’s data.frame
data.table
’s data.table
dplyr
’s tbl_df
polars
’ RPolarsDataFrame
tidypolars
duckplyr
Base R and dplyr are both slow for large datasets. data.table
is very fast and features a very concise and efficient syntax, but is limited to in-memory data. polars
is very fast and can handle both in-memory and out-of-memory data, but features a possibly cumbersome syntax. tidypolars
allows using dplyr
syntax with polars
data frames. DuckDB
is also very fast and can handle both in-memory and out-of-memory data, but uses SQL syntax, not great for data frame operations. duckplyr
uses dplyr
syntax with DuckDB
data frames.
Throughout this chapter, we will add head()
as needed only in examples that would result in printing the entire data frame.
Setup
Load packages
Note the masking of functions that happens when all these packages are loaded together. In practice, you may want/have to stick to using a single approach at a time.
Following isntructions on the duckplyr docs , we will not load duckplyr directly, but rather use the duckplyr::as_duckplyr_df
to convert the iris
data frame to to a duckplyr data frame.
Attaching package: 'dplyr'
The following objects are masked from 'package:data.table':
between, first, last
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
DuckDB
The following code enables reporting of fallbacks to dplyr
when duckplyr
fails.
Create data frames
We will use the iris
built-in dataset for this chapter and create a data.table
, a tibble
, a RPolarsDataFrame
, and a duckplyr
version of it.
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<num> <num> <num> <num> <fctr>
1: 5.1 3.5 1.4 0.2 setosa
2: 4.9 3.0 1.4 0.2 setosa
3: 4.7 3.2 1.3 0.2 setosa
4: 4.6 3.1 1.5 0.2 setosa
5: 5.0 3.6 1.4 0.2 setosa
---
146: 6.7 3.0 5.2 2.3 virginica
147: 6.3 2.5 5.0 1.9 virginica
148: 6.5 3.0 5.2 2.0 virginica
149: 6.2 3.4 5.4 2.3 virginica
150: 5.9 3.0 5.1 1.8 virginica
# A tibble: 150 × 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
# ℹ 140 more rows
shape: (150, 5)
┌──────────────┬─────────────┬──────────────┬─────────────┬───────────┐
│ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ f64 ┆ cat │
╞══════════════╪═════════════╪══════════════╪═════════════╪═══════════╡
│ 5.1 ┆ 3.5 ┆ 1.4 ┆ 0.2 ┆ setosa │
│ 4.9 ┆ 3.0 ┆ 1.4 ┆ 0.2 ┆ setosa │
│ 4.7 ┆ 3.2 ┆ 1.3 ┆ 0.2 ┆ setosa │
│ 4.6 ┆ 3.1 ┆ 1.5 ┆ 0.2 ┆ setosa │
│ 5.0 ┆ 3.6 ┆ 1.4 ┆ 0.2 ┆ setosa │
│ … ┆ … ┆ … ┆ … ┆ … │
│ 6.7 ┆ 3.0 ┆ 5.2 ┆ 2.3 ┆ virginica │
│ 6.3 ┆ 2.5 ┆ 5.0 ┆ 1.9 ┆ virginica │
│ 6.5 ┆ 3.0 ┆ 5.2 ┆ 2.0 ┆ virginica │
│ 6.2 ┆ 3.4 ┆ 5.4 ┆ 2.3 ┆ virginica │
│ 5.9 ┆ 3.0 ┆ 5.1 ┆ 1.8 ┆ virginica │
└──────────────┴─────────────┴──────────────┴─────────────┴───────────┘
The duckplyr package is configured to fall back to dplyr when it encounters an
incompatibility. Fallback events can be collected and uploaded for analysis to
guide future development. By default, no data will be collected or uploaded.
→ Run `duckplyr::fallback_sitrep()` to review the current settings.
The duckplyr package is configured to fall back to dplyr when it encounters an
incompatibility. Fallback events can be collected and uploaded for analysis to
guide future development. By default, no data will be collected or uploaded.
ℹ A fallback situation just occurred. The following information would have been
recorded:
{"version":"0.4.1","message":"Can't convert columns of class <factor> to
relational. Affected
column:\n`...5`.","name":"head","x":{"...1":"numeric","...2":"numeric","...3":"numeric","...4":"numeric","...5":"factor"},"args":{"n":6}}
→ Run `duckplyr::fallback_sitrep()` to review the current settings.
→ Run `Sys.setenv(DUCKPLYR_FALLBACK_COLLECT = 1)` to enable fallback logging,
and `Sys.setenv(DUCKPLYR_FALLBACK_VERBOSE = TRUE)` in addition to enable
printing of fallback situations to the console.
→ Run `duckplyr::fallback_review()` to review the available reports, and
`duckplyr::fallback_upload()` to upload them.
ℹ See `?duckplyr::fallback()` for details.
ℹ This message will be displayed once every eight hours.
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
Select
Select columns Sepal.Length
and Species
from the iris
dataset.
iris [ , c ( "Sepal.Length" , "Species" ) ] |> head ( )
Sepal.Length Species
1 5.1 setosa
2 4.9 setosa
3 4.7 setosa
4 4.6 setosa
5 5.0 setosa
6 5.4 setosa
iris_dt [ , . ( Sepal.Length , Species ) ]
Sepal.Length Species
<num> <fctr>
1: 5.1 setosa
2: 4.9 setosa
3: 4.7 setosa
4: 4.6 setosa
5: 5.0 setosa
---
146: 6.7 virginica
147: 6.3 virginica
148: 6.5 virginica
149: 6.2 virginica
150: 5.9 virginica
iris_tbl |>
select ( Sepal.Length , Species )
# A tibble: 150 × 2
Sepal.Length Species
<dbl> <fct>
1 5.1 setosa
2 4.9 setosa
3 4.7 setosa
4 4.6 setosa
5 5 setosa
6 5.4 setosa
7 4.6 setosa
8 5 setosa
9 4.4 setosa
10 4.9 setosa
# ℹ 140 more rows
iris_pl $ select (
pl $ col ( c ( "Sepal.Length" , "Species" ) )
)
shape: (150, 2)
┌──────────────┬───────────┐
│ Sepal.Length ┆ Species │
│ --- ┆ --- │
│ f64 ┆ cat │
╞══════════════╪═══════════╡
│ 5.1 ┆ setosa │
│ 4.9 ┆ setosa │
│ 4.7 ┆ setosa │
│ 4.6 ┆ setosa │
│ 5.0 ┆ setosa │
│ … ┆ … │
│ 6.7 ┆ virginica │
│ 6.3 ┆ virginica │
│ 6.5 ┆ virginica │
│ 6.2 ┆ virginica │
│ 5.9 ┆ virginica │
└──────────────┴───────────┘
iris_pl |>
select ( Sepal.Length , Species )
shape: (150, 2)
┌──────────────┬───────────┐
│ Sepal.Length ┆ Species │
│ --- ┆ --- │
│ f64 ┆ cat │
╞══════════════╪═══════════╡
│ 5.1 ┆ setosa │
│ 4.9 ┆ setosa │
│ 4.7 ┆ setosa │
│ 4.6 ┆ setosa │
│ 5.0 ┆ setosa │
│ … ┆ … │
│ 6.7 ┆ virginica │
│ 6.3 ┆ virginica │
│ 6.5 ┆ virginica │
│ 6.2 ┆ virginica │
│ 5.9 ┆ virginica │
└──────────────┴───────────┘
iris_db |>
select ( Sepal.Length , Species ) |> head ( )
Sepal.Length Species
1 5.1 setosa
2 4.9 setosa
3 4.7 setosa
4 4.6 setosa
5 5.0 setosa
6 5.4 setosa
Filter
We look at an example combining 2 conditions, one on a numeric and one on a factor column:
Filter rows where Sepal.Length
is less than 5.6 and Species
is versicolor or virginica.
iris [ iris $ Sepal.Length < 5.6 & iris $ Species %in% c ( "versicolor" , "virginica" ) , ]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
54 5.5 2.3 4.0 1.3 versicolor
58 4.9 2.4 3.3 1.0 versicolor
60 5.2 2.7 3.9 1.4 versicolor
61 5.0 2.0 3.5 1.0 versicolor
81 5.5 2.4 3.8 1.1 versicolor
82 5.5 2.4 3.7 1.0 versicolor
85 5.4 3.0 4.5 1.5 versicolor
90 5.5 2.5 4.0 1.3 versicolor
91 5.5 2.6 4.4 1.2 versicolor
94 5.0 2.3 3.3 1.0 versicolor
99 5.1 2.5 3.0 1.1 versicolor
107 4.9 2.5 4.5 1.7 virginica
or
subset ( iris , Sepal.Length < 5.6 & iris $ Species %in% c ( "versicolor" , "virginica" ) )
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
54 5.5 2.3 4.0 1.3 versicolor
58 4.9 2.4 3.3 1.0 versicolor
60 5.2 2.7 3.9 1.4 versicolor
61 5.0 2.0 3.5 1.0 versicolor
81 5.5 2.4 3.8 1.1 versicolor
82 5.5 2.4 3.7 1.0 versicolor
85 5.4 3.0 4.5 1.5 versicolor
90 5.5 2.5 4.0 1.3 versicolor
91 5.5 2.6 4.4 1.2 versicolor
94 5.0 2.3 3.3 1.0 versicolor
99 5.1 2.5 3.0 1.1 versicolor
107 4.9 2.5 4.5 1.7 virginica
iris_dt [ Sepal.Length < 5.6 & iris $ Species %in% c ( "versicolor" , "virginica" ) ]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<num> <num> <num> <num> <fctr>
1: 5.5 2.3 4.0 1.3 versicolor
2: 4.9 2.4 3.3 1.0 versicolor
3: 5.2 2.7 3.9 1.4 versicolor
4: 5.0 2.0 3.5 1.0 versicolor
5: 5.5 2.4 3.8 1.1 versicolor
6: 5.5 2.4 3.7 1.0 versicolor
7: 5.4 3.0 4.5 1.5 versicolor
8: 5.5 2.5 4.0 1.3 versicolor
9: 5.5 2.6 4.4 1.2 versicolor
10: 5.0 2.3 3.3 1.0 versicolor
11: 5.1 2.5 3.0 1.1 versicolor
12: 4.9 2.5 4.5 1.7 virginica
iris_tbl |>
filter (
Sepal.Length < 5.6 ,
Species %in% c ( "versicolor" , "virginica" )
)
# A tibble: 12 × 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.5 2.3 4 1.3 versicolor
2 4.9 2.4 3.3 1 versicolor
3 5.2 2.7 3.9 1.4 versicolor
4 5 2 3.5 1 versicolor
5 5.5 2.4 3.8 1.1 versicolor
6 5.5 2.4 3.7 1 versicolor
7 5.4 3 4.5 1.5 versicolor
8 5.5 2.5 4 1.3 versicolor
9 5.5 2.6 4.4 1.2 versicolor
10 5 2.3 3.3 1 versicolor
11 5.1 2.5 3 1.1 versicolor
12 4.9 2.5 4.5 1.7 virginica
iris_pl $ filter (
pl $ col ( "Sepal.Length" ) < 5.6 ,
pl $ col ( "Species" ) $ is_in ( c ( "versicolor" , "virginica" ) )
)
shape: (12, 5)
┌──────────────┬─────────────┬──────────────┬─────────────┬────────────┐
│ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ f64 ┆ cat │
╞══════════════╪═════════════╪══════════════╪═════════════╪════════════╡
│ 5.5 ┆ 2.3 ┆ 4.0 ┆ 1.3 ┆ versicolor │
│ 4.9 ┆ 2.4 ┆ 3.3 ┆ 1.0 ┆ versicolor │
│ 5.2 ┆ 2.7 ┆ 3.9 ┆ 1.4 ┆ versicolor │
│ 5.0 ┆ 2.0 ┆ 3.5 ┆ 1.0 ┆ versicolor │
│ 5.5 ┆ 2.4 ┆ 3.8 ┆ 1.1 ┆ versicolor │
│ … ┆ … ┆ … ┆ … ┆ … │
│ 5.5 ┆ 2.5 ┆ 4.0 ┆ 1.3 ┆ versicolor │
│ 5.5 ┆ 2.6 ┆ 4.4 ┆ 1.2 ┆ versicolor │
│ 5.0 ┆ 2.3 ┆ 3.3 ┆ 1.0 ┆ versicolor │
│ 5.1 ┆ 2.5 ┆ 3.0 ┆ 1.1 ┆ versicolor │
│ 4.9 ┆ 2.5 ┆ 4.5 ┆ 1.7 ┆ virginica │
└──────────────┴─────────────┴──────────────┴─────────────┴────────────┘
iris_pl |>
filter (
Sepal.Length < 5.6 ,
Species %in% c ( "versicolor" , "virginica" )
)
shape: (12, 5)
┌──────────────┬─────────────┬──────────────┬─────────────┬────────────┐
│ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ f64 ┆ cat │
╞══════════════╪═════════════╪══════════════╪═════════════╪════════════╡
│ 5.5 ┆ 2.3 ┆ 4.0 ┆ 1.3 ┆ versicolor │
│ 4.9 ┆ 2.4 ┆ 3.3 ┆ 1.0 ┆ versicolor │
│ 5.2 ┆ 2.7 ┆ 3.9 ┆ 1.4 ┆ versicolor │
│ 5.0 ┆ 2.0 ┆ 3.5 ┆ 1.0 ┆ versicolor │
│ 5.5 ┆ 2.4 ┆ 3.8 ┆ 1.1 ┆ versicolor │
│ … ┆ … ┆ … ┆ … ┆ … │
│ 5.5 ┆ 2.5 ┆ 4.0 ┆ 1.3 ┆ versicolor │
│ 5.5 ┆ 2.6 ┆ 4.4 ┆ 1.2 ┆ versicolor │
│ 5.0 ┆ 2.3 ┆ 3.3 ┆ 1.0 ┆ versicolor │
│ 5.1 ┆ 2.5 ┆ 3.0 ┆ 1.1 ┆ versicolor │
│ 4.9 ┆ 2.5 ┆ 4.5 ┆ 1.7 ┆ virginica │
└──────────────┴─────────────┴──────────────┴─────────────┴────────────┘
iris_db |>
filter (
Sepal.Length < 5.6 ,
Species %in% c ( "versicolor" , "virginica" )
)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.5 2.3 4.0 1.3 versicolor
2 4.9 2.4 3.3 1.0 versicolor
3 5.2 2.7 3.9 1.4 versicolor
4 5.0 2.0 3.5 1.0 versicolor
5 5.5 2.4 3.8 1.1 versicolor
6 5.5 2.4 3.7 1.0 versicolor
7 5.4 3.0 4.5 1.5 versicolor
8 5.5 2.5 4.0 1.3 versicolor
9 5.5 2.6 4.4 1.2 versicolor
10 5.0 2.3 3.3 1.0 versicolor
11 5.1 2.5 3.0 1.1 versicolor
12 4.9 2.5 4.5 1.7 virginica
Group-by
Get the mean of Sepal.Length
by Species
.
aggregate ( Sepal.Length ~ Species , iris , mean )
Species Sepal.Length
1 setosa 5.006
2 versicolor 5.936
3 virginica 6.588
iris_dt [ , . ( mean ( Sepal.Length ) ) , by = Species ]
Species V1
<fctr> <num>
1: setosa 5.006
2: versicolor 5.936
3: virginica 6.588
# A tibble: 3 × 2
Species `mean(Sepal.Length)`
<fct> <dbl>
1 setosa 5.01
2 versicolor 5.94
3 virginica 6.59
iris_pl $ group_by ( "Species" ) $ agg ( pl $ mean ( "Sepal.Length" ) )
shape: (3, 2)
┌────────────┬──────────────┐
│ Species ┆ Sepal.Length │
│ --- ┆ --- │
│ cat ┆ f64 │
╞════════════╪══════════════╡
│ setosa ┆ 5.006 │
│ virginica ┆ 6.588 │
│ versicolor ┆ 5.936 │
└────────────┴──────────────┘
Note that, unlike dplyr, tidypolars requires summarize
inputs to be named (at least at the time of writing).
shape: (3, 2)
┌────────────┬───────────────────┐
│ Species ┆ mean_Sepal.Length │
│ --- ┆ --- │
│ cat ┆ f64 │
╞════════════╪═══════════════════╡
│ virginica ┆ 6.588 │
│ versicolor ┆ 5.936 │
│ setosa ┆ 5.006 │
└────────────┴───────────────────┘
Groups [3]: Species
Maintain order: FALSE
# A tibble: 3 × 2
Species `mean(Sepal.Length)`
<fct> <dbl>
1 setosa 5.01
2 versicolor 5.94
3 virginica 6.59