34  Introduction

The data.table package provides a modern and highly optimized version of R’s data.frame structure. It is highly memory efficient and automatically parallelizes many internal operations, which can yield substantial speed improvements over data.frames. The package is lightweight, has no dependencies beyond base R, and maintains compatibility with R going back many versions.

Advantages of data.table include:

In health data science, it is common to handle very large datasets, especially when working with electronic health record (EHR) data. In such cases, we often have to read, clean, reshape, transform, and merge multiple tables of different dimensions, often featuring many millions of rows and thousands of columns. The benefits of data.table become immediately apparent in such scenarios.
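As a minimal sketch of this kind of workflow, the following merges two small tables and summarizes the result by group. The table names, column names, and values are all invented for illustration and merely stand in for much larger EHR extracts.

```r
library(data.table)

# Hypothetical toy tables standing in for EHR extracts (all values invented)
patients <- data.table(
  patient_id = 1:4,
  sex = c("F", "M", "F", "M")
)
visits <- data.table(
  patient_id = c(1L, 1L, 2L, 3L, 3L, 3L),
  sbp = c(120, 130, 140, 110, 115, 125)
)

# A data.table is still a data.frame, so existing data.frame code keeps working
is.data.frame(patients)

# Merge the two tables on their shared ID column (inner join by default)
merged <- merge(visits, patients, by = "patient_id")

# Summarize by group: mean systolic blood pressure and number of visits per sex
summary_by_sex <- merged[, .(mean_sbp = mean(sbp), n_visits = .N), by = sex]
summary_by_sex
```

The same syntax applies unchanged to tables with millions of rows, which is where the speed and memory benefits become noticeable.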

34.1 Installation

To install from CRAN:

install.packages("data.table")
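One way to confirm that the installation succeeded, sketched here using only base R functions, is to check that the package can be found and to look up its installed version:

```r
# Check whether data.table is installed without attaching it
dt_installed <- requireNamespace("data.table", quietly = TRUE)
dt_installed

# If it is installed, report the installed version
if (dt_installed) {
  dt_version <- packageVersion("data.table")
  print(dt_version)
}
```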

data.table includes a built-in command to update to the latest development version (in recent releases of data.table, this function is named update_dev_pkg()):

data.table::update.dev.pkg()

34.2 Note on OpenMP support

data.table automatically parallelizes operations behind the scenes when possible, using the OpenMP library. At the time of writing, the compiler that ships with macOS has OpenMP support disabled.

Currently, if you install data.table and OpenMP support is not detected, a message is printed to the console when you load the library with library(data.table) informing you that it is running on a single thread. You can still use data.table without OpenMP support.
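Besides the startup message, you can query the thread count directly with getDTthreads(). The sketch below assumes only that data.table is installed. A result of 1 is consistent with a build without OpenMP support, although it can also mean that the thread count has been limited explicitly.

```r
library(data.table)

# Number of threads data.table is currently set to use
n_threads <- getDTthreads()
n_threads

# Print a more detailed report of the threading configuration
getDTthreads(verbose = TRUE)
```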

The data.table installation wiki describes how to enable OpenMP support in the macOS compiler. The recommended option is to download the libraries from the mac.r-project.org site and copy them to the /usr/local/lib and /usr/local/include directories as appropriate.

After adding OpenMP support, you can compile the latest version of data.table:

remotes::install_github("Rdatatable/data.table")

If everything works correctly, when you now load the library, it will inform you how many threads are being used.
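Once multithreading is available, the number of threads can also be adjusted from within R using setDTthreads(). The following is a minimal sketch. The value 2 is an arbitrary choice for illustration, and on a build without OpenMP the thread count stays at 1.

```r
library(data.table)

# Request at most 2 threads; the previous setting is returned invisibly
old_threads <- setDTthreads(2)

# Threads now in use (never more than requested, and 1 without OpenMP)
limited_threads <- getDTthreads()
limited_threads

# Restore the previous setting
setDTthreads(old_threads)
```

Limiting threads in this way can be useful on shared machines, where using every available core may not be appropriate.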

34.3 Resources