install.packages("data.table")
34 Introduction
The data.table package provides a modern and highly optimized version of R’s data.frame structure. It is highly memory efficient and automatically parallelizes many internal operations, which can yield substantial speed improvements over data.frames. The data.table package weighs in at just a few megabytes, has no dependencies other than base R, and maintains compatibility with R going back many versions.
Advantages of data.table include:
- Fast and efficient reading, writing, and handling of big datasets
  - fast read & write of delimited files with fread() and fwrite()
  - in-place operations without creating unnecessary copies of data
- Compact and flexible syntax for data manipulation, well suited to both small and big data
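As a minimal sketch of the points in the list above, the following writes a small table to a temporary CSV file with fwrite(), reads it back with fread(), and adds a column by reference with the := operator. The column names and values are made up for illustration.

```r
library(data.table)

# Toy data; names and values are illustrative only
dt <- data.table(id = 1:5, weight_kg = c(61.2, 75.0, 82.4, 58.9, 90.1))

# Write to a temporary CSV file and read it back
tmp <- tempfile(fileext = ".csv")
fwrite(dt, tmp)
dt2 <- fread(tmp)

# Add a column by reference with `:=`;
# dt2 is modified directly, so no assignment with `<-` is needed
dt2[, weight_lb := weight_kg * 2.20462]

dt2
```

Because := updates the existing table rather than building a modified copy, the same pattern stays cheap when the table has millions of rows.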
In health data science, it is common to handle very large datasets, especially when working with electronic health record (EHR) data. In such cases, we often have to read, clean, reshape, transform, and merge multiple tables of different dimensions, frequently featuring many millions of rows and thousands of columns. The benefits of data.table become immediately apparent in such scenarios.
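To give a flavor of this kind of workflow, here is a small sketch that summarizes one table by group and merges the result onto another. The tables, column names, and values are hypothetical stand-ins for EHR extracts, not real data.

```r
library(data.table)

# Hypothetical toy tables standing in for EHR extracts
patients <- data.table(
  patient_id = c(1L, 2L, 3L),
  sex = c("F", "M", "F")
)
visits <- data.table(
  patient_id = c(1L, 1L, 2L, 3L, 3L, 3L),
  sbp = c(120, 132, 141, 118, 125, 122)  # systolic blood pressure, mmHg
)

# Summarize visits per patient: number of visits and mean SBP
visit_summary <- visits[, .(n_visits = .N, mean_sbp = mean(sbp)), by = patient_id]

# Left join the summary onto the patient table by the shared ID column
merged <- merge(patients, visit_summary, by = "patient_id", all.x = TRUE)
merged
```

The same two steps, a grouped summary followed by a merge, apply unchanged to tables with millions of rows.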
34.1 Installation
To install from CRAN:
install.packages("data.table")
data.table includes a built-in command to update to the latest development version:
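data.table::update.dev.pkg()
In recent data.table releases this function is named update_dev_pkg(); use whichever name your installed version provides.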
34.2 Note on OpenMP support
data.table automatically parallelizes operations behind the scenes when possible. It uses the OpenMP library to support parallelization. The default compiler that ships with macOS comes with OpenMP support disabled.
Currently, if you install data.table and OpenMP support is not detected, a message is printed to the console when you load the library with library(data.table), informing you that it is running on a single thread. You can still use data.table without OpenMP support.
The data.table installation wiki describes how to enable OpenMP support in the macOS compiler. The recommended option is to download the libraries from the mac.r-project site and copy them to the /usr/local/lib and /usr/local/include directories as appropriate.
After adding OpenMP support, you can compile the latest version of data.table from source:
remotes::install_github("Rdatatable/data.table")
If everything works correctly, when you now load the library, it will inform you how many threads are being used.
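Beyond the startup message, the thread count can also be checked and adjusted from within a session. The following is a minimal sketch using getDTthreads() and setDTthreads(); the choice of a single thread is only an example.

```r
library(data.table)

# Number of threads data.table is currently using
n_threads <- getDTthreads()
n_threads

# Print details, including information on OpenMP detection
getDTthreads(verbose = TRUE)

# Optionally cap the thread count, e.g. on a shared machine
old <- setDTthreads(1L)  # the previous setting is returned invisibly
getDTthreads()
setDTthreads(old)        # restore the previous setting
```

On a build without OpenMP support, getDTthreads() is expected to report a single thread.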