56  DuckDB

DuckDB is a popular and powerful “in-process SQL OLAP database management system

It provides APIs in multiple languages, including R, Python, and Julia.

In practical terms, it provides a very efficient way to read, query, and write big data in R, which includes datasets that are bigger than memory.

For example, it allows lazy-reading of large datasets with optional filtering, selecting, and grouping operations before reading into memory.

Chapter 55 includes examples using duckplyr

Note

This section is a work in progress.

Loading required package: DBI

56.1 Installation

56.2 DuckDB API

DuckDB uses the R Database Interface (DBI). Learn more about using DBI here.

56.3 Setup Connection

By default, duckdb()’s dbdir argument defaults to DBDIR_MEMORY, which creates a non-persistent, in-memory object.

con <- dbConnect(duckdb::duckdb())

56.4 Settings

dbExecute(con, "PRAGMA enable_progress_bar;")
[1] 0

56.5 Read CSV

You can directly read into data.frame, which may be best for smaller datasets:

ir <- dbGetQuery(con, 
    "SELECT * FROM read_csv_auto('./iris.csv');"
)
head(ir)
  Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

For larger datasets, you can create a DuckDB table that can be queried without loading the dataset into memory:

dbExecute(con,
    "CREATE TABLE iris AS SELECT * FROM './iris.csv';"
)
[1] 150

56.5.1 List tables

dbGetQuery(con, "PRAGMA show_tables;")
  name
1 iris

Using the R API:

[1] "iris"

56.6 Filter

dbGetQuery(con,
    "SELECT * FROM iris WHERE Species in ('setosa', 'versicolor')") |> head()
  Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
dbExecute(con,
    "
    CREATE TABLE ir AS SELECT * FROM 
    read_csv_auto('/Users/egenn/icloud/Data/iris.csv', 
    delim=',', header=True);
    "
)
[1] 150

56.6.1 Fetch data to data.frame

ir = dbGetQuery(con, "SELECT * from iris")
head(ir)
  Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
class(ir)
[1] "data.frame"

56.6.2 Fetch data to data.table:

ir = dbGetQuery(con, "SELECT * from iris")
class(ir)
[1] "data.frame"
head(ir)
  Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

56.7 Close connection

When you are done working with the database, you should close the connection:

56.8 Resources