Loading required package: DBI
56 DuckDB
DuckDB is a popular and powerful “in-process SQL OLAP database management system”
It provides APIs in multiple languages, including R, Python, and Julia.
In practical terms, it provides a very efficient way to read, query, and write big data in R, which includes datasets that are bigger than memory.
For example, it allows lazy-reading of large datasets with optional filtering, selecting, and grouping operations before reading into memory.
Chapter 55 includes examples using duckplyr
This section is a work in progress.
56.1 Installation
install.packages("duckdb")
56.2 DuckDB API
DuckDB uses the R Database Interface (DBI). Learn more about using DBI here.
56.3 Setup Connection
By default, duckdb()
’s dbdir
argument defaults to DBDIR_MEMORY
, which creates a non-persistent, in-memory object.
56.4 Settings
dbExecute(con, "PRAGMA enable_progress_bar;")
[1] 0
56.5 Read CSV
You can directly read into data.frame, which may be best for smaller datasets:
ir <- dbGetQuery(con,
"SELECT * FROM read_csv_auto('./iris.csv');"
)
head(ir)
Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
For larger datasets, you can create a DuckDB table that can be queried without loading the dataset into memory:
dbExecute(con,
"CREATE TABLE iris AS SELECT * FROM './iris.csv';"
)
[1] 150
56.5.1 List tables
dbGetQuery(con, "PRAGMA show_tables;")
name
1 iris
Using the R API:
dbListTables(con)
[1] "iris"
56.6 Filter
dbGetQuery(con,
"SELECT * FROM iris WHERE Species in ('setosa', 'versicolor')") |> head()
Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
dbExecute(con,
"
CREATE TABLE ir AS SELECT * FROM
read_csv_auto('/Users/egenn/icloud/Data/iris.csv',
delim=',', header=True);
"
)
[1] 150
56.6.1 Fetch data to data.frame
ir = dbGetQuery(con, "SELECT * from iris")
head(ir)
Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
class(ir)
[1] "data.frame"
56.6.2 Fetch data to data.table:
ir = dbGetQuery(con, "SELECT * from iris")
56.7 Close connection
When you are done working with the database, you should close the connection:
dbDisconnect(con)