58 Data Pipeline Overview

58.1 Get access to Data

Health-related data comes from many sources, including:

Electronic Health Records (e.g. Epic)
Lab/Clinical research data
Public datasets, e.g. NIH, UK Biobank, etc.

58.2 Handle and inspect data in the command line

Particularly useful for datasets of unknown structure (e.g. to find which delimiter is used) and for very large data (will it fit into memory?)

Intro to the system shell
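
The same kind of first look can also be sketched from within R, without parsing the file. This is a minimal sketch and not a substitute for the shell tools; the file contents and the list of candidate delimiters are made up for illustration.

```r
# Write a small example file so the sketch is self-contained (hypothetical data)
path <- tempfile(fileext = ".txt")
writeLines(c("id;age;sex", "1;34;Female", "2;51;Male", "3;47;Female"), path)

# Peek at the first lines without parsing the whole file
first_lines <- readLines(path, n = 2)
print(first_lines)

# Count candidate delimiters in the header line to guess which one is used
candidates <- c(comma = ",", semicolon = ";", tab = "\t")
delim_counts <- sapply(candidates, function(d) {
  lengths(regmatches(first_lines[1], gregexpr(d, first_lines[1], fixed = TRUE)))
})
print(delim_counts)

# File size in bytes: a rough first check of whether the data will fit in memory
size_bytes <- file.size(path)
print(size_bytes)
```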

58.3 Read Data into R

Using R’s read.csv(), read.table()
Using data.table’s fread()
Using readr’s read_csv()
Using specialized packages for third-party data formats
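
As a minimal sketch of the first option, the example below writes a small CSV to a temporary file and reads it back with base read.csv(). The file contents and column names are hypothetical; fread() and read_csv() follow the same pattern but require the data.table and readr packages.

```r
# Create a small CSV file so the example is self-contained (hypothetical data)
path <- tempfile(fileext = ".csv")
writeLines(
  c("patient_id,age,sex",
    "P01,34,Female",
    "P02,51,Male",
    "P03,,Female"),
  path
)

# Read the file; treat empty fields and the string "NA" as missing values
dat <- read.csv(path, na.strings = c("", "NA"))

# Inspect the structure: column names, types, and first values
str(dat)
```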

58.4 Clean data names & values

Using string operations
Using factor() to define factor levels
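
The two steps above might look like the following sketch, which uses only base string functions. The messy column names and values are invented for illustration, and the naming rule (lower case, underscores) is one possible convention, not a requirement.

```r
# Hypothetical data with messy column names and inconsistent values
dat <- data.frame(
  "Patient ID" = c("P01", "P02", "P03"),
  "Sex " = c(" female", "Male", "FEMALE "),
  check.names = FALSE
)

# Clean names: trim whitespace, convert to lower case,
# and replace runs of non-alphanumeric characters with an underscore
names(dat) <- gsub("[^a-z0-9]+", "_", tolower(trimws(names(dat))))

# Clean values: trim whitespace and harmonize case
dat$sex <- tolower(trimws(dat$sex))

# Define the factor levels explicitly so their order is under your control
dat$sex <- factor(dat$sex, levels = c("female", "male"))

print(names(dat))
print(table(dat$sex))
```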

58.5 Define Data Types

Using the ‘colClasses’ argument in read.csv() or fread()

or

Coercing data types using as.numeric(), as.character(), factor(), as.Date(), as.POSIXct(), etc.
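
Both approaches can be sketched with base R as follows. The file and its columns (id, weight, visit_date) are hypothetical; the example assumes dates are stored as YYYY-MM-DD.

```r
# Hypothetical CSV where the ID has leading zeros that must be preserved
path <- tempfile(fileext = ".csv")
writeLines(
  c("id,weight,visit_date",
    "001,70.5,2023-01-15",
    "002,82.1,2023-02-20"),
  path
)

# Option 1: declare the column types when reading
dat <- read.csv(
  path,
  colClasses = c(id = "character", weight = "numeric", visit_date = "Date")
)
str(dat)

# Option 2: read everything as character, then coerce column by column
raw <- read.csv(path, colClasses = "character")
raw$weight <- as.numeric(raw$weight)
raw$visit_date <- as.Date(raw$visit_date, format = "%Y-%m-%d")
str(raw)
```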

58.6 Reshape

Convert long to wide or vice versa, as needed.

Using base reshape()
Using data.table’s dcast() and melt()
Using tidyr’s pivot_wider() and pivot_longer()
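
A minimal sketch of the base reshape() route is shown below, using made-up repeated blood pressure measurements. dcast()/melt() and pivot_wider()/pivot_longer() do the same job with a different interface, but need the data.table and tidyr packages.

```r
# Hypothetical long data: one row per patient per visit
long <- data.frame(
  id = c(1, 1, 2, 2),
  visit = c("v1", "v2", "v1", "v2"),
  sbp = c(120, 118, 135, 130)
)

# Long to wide: one row per patient, one sbp column per visit
wide <- reshape(long, idvar = "id", timevar = "visit", direction = "wide")
print(wide)

# Wide to long: reshape() stores what it did as an attribute of `wide`,
# so the conversion can be reversed without repeating the arguments
long_again <- reshape(wide, direction = "long")
print(long_again)
```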

58.7 Join data sets

If you have data in multiple files that need to be merged, you can join them:

Using merge() for data.frames or data.tables
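
The sketch below shows the most common join types with merge(). The two tables and the shared key column id are hypothetical.

```r
# Two hypothetical tables that share the key column `id`
demographics <- data.frame(id = c(1, 2, 3), age = c(34, 51, 47))
labs <- data.frame(id = c(2, 3, 4), hba1c = c(6.1, 7.4, 5.6))

# Inner join: keep only ids present in both tables (ids 2 and 3)
inner <- merge(demographics, labs, by = "id")

# Left join: keep all rows of the first table; unmatched lab values become NA
left <- merge(demographics, labs, by = "id", all.x = TRUE)

# Full outer join: keep all ids found in either table
full <- merge(demographics, labs, by = "id", all = TRUE)

print(inner)
print(left)
print(full)
```

Checking the number of rows before and after a join is a quick way to catch duplicated or missing keys.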

58.8 Transform data

Data transformations will depend on the analysis or analyses you wish to perform. Note that we often need to perform different data transformations for different statistical tests or machine learning models (supervised or unsupervised learning).
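
A few common transformations are sketched below: a log transform for a right-skewed variable, standardization, and binning a continuous variable into groups. The variables and cut points are invented for illustration; which transformations are appropriate depends on your own analysis.

```r
# Hypothetical data: a right-skewed lab value and age in years
dat <- data.frame(
  crp = c(0.5, 1.2, 8.0, 45.0),
  age = c(34, 51, 47, 62)
)

# Log-transform a right-skewed, strictly positive variable
dat$crp_log <- log(dat$crp)

# Standardize to mean 0 and standard deviation 1
dat$age_z <- as.numeric(scale(dat$age))

# Bin a continuous variable; intervals are (0, 40], (40, 60], (60, Inf)
dat$age_group <- cut(
  dat$age,
  breaks = c(0, 40, 60, Inf),
  labels = c("<=40", "41-60", ">60")
)

print(dat)
```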

58.9 Visualize

Visualization is essential before, during, and after data preparation, hypothesis testing, and supervised and unsupervised learning.

Using base graphics: boxplot(), hist(), plot(), barplot(), etc.
Using ggplot2
Using plotly interactive plots
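
As a minimal sketch of the base graphics option, the example below draws a histogram and a grouped boxplot of simulated data. It writes to a temporary PDF so that it also runs non-interactively; in an interactive session you can drop the pdf() and dev.off() calls to plot directly to the screen.

```r
# Simulated data for two hypothetical groups
set.seed(1)
dat <- data.frame(
  group = rep(c("control", "treated"), each = 50),
  value = c(rnorm(50, mean = 10, sd = 2), rnorm(50, mean = 12, sd = 2))
)

# Open a PDF graphics device (omit when working interactively)
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file, width = 8, height = 4)

# Two panels side by side
par(mfrow = c(1, 2))

# Distribution of all values
hist(dat$value, main = "Distribution of value", xlab = "value")

# Values by group
boxplot(value ~ group, data = dat, main = "Value by group")

# Close the device to finish writing the file
invisible(dev.off())
```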

58.10 Summarize & Aggregate
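
A minimal sketch using base R: summary() for an overall summary, aggregate() for group-wise statistics, and table() for counts. The data are made up for illustration.

```r
# Hypothetical data: a numeric value measured in two groups
dat <- data.frame(
  group = c("a", "a", "b", "b", "b"),
  value = c(1, 3, 10, 20, 30)
)

# Overall summary of a numeric variable
print(summary(dat$value))

# Mean of value within each group
group_means <- aggregate(value ~ group, data = dat, FUN = mean)
print(group_means)

# Number of cases per group
group_n <- table(dat$group)
print(group_n)
```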

58.11 Statistical Hypothesis Testing

t.test(), wilcox.test(), aov(), kruskal.test()
Generalized Linear Models: glm()
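
The sketch below applies some of these functions to simulated data from two hypothetical groups. The group means, standard deviation, and sample size are arbitrary choices for illustration.

```r
# Simulate two groups with different means
set.seed(42)
control <- rnorm(30, mean = 10, sd = 2)
treated <- rnorm(30, mean = 12, sd = 2)

# Two-sample t-test (Welch's test, which is the default in R)
tt <- t.test(treated, control)
print(tt)

# Nonparametric alternative: Wilcoxon rank-sum test
wt <- wilcox.test(treated, control)
print(wt$p.value)

# The same comparison expressed as a GLM with a Gaussian family
dat <- data.frame(
  y = c(control, treated),
  group = factor(rep(c("control", "treated"), each = 30))
)
fit <- glm(y ~ group, data = dat, family = gaussian())

# The coefficient for `grouptreated` is the difference in group means
print(coef(summary(fit)))
```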

58.12 Predictive Modeling

Perform classification, regression, survival analysis

GLMNET, Classification and Regression Trees (CART), Random Forest, Gradient Boosting, etc.
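
The general workflow (split the data, train, predict on held-out cases, evaluate) can be sketched as below. To stay within base R, the sketch uses plain logistic regression via glm() as a stand-in for the learners listed above; the simulated data, the 75/25 split, and the 0.5 threshold are all assumptions made for illustration.

```r
# Simulate a binary outcome that depends on two features
set.seed(2024)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * dat$x1 - 0.8 * dat$x2))

# Split into training (75%) and test (25%) sets
train_idx <- sample(n, size = 0.75 * n)
train <- dat[train_idx, ]
test <- dat[-train_idx, ]

# Train a logistic regression classifier on the training set only
fit <- glm(y ~ x1 + x2, data = train, family = binomial())

# Predict probabilities for the held-out test set
prob <- predict(fit, newdata = test, type = "response")

# Convert probabilities to class labels using a 0.5 threshold
pred <- as.integer(prob > 0.5)

# Evaluate: proportion of test cases classified correctly
accuracy <- mean(pred == test$y)
print(accuracy)
```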

58.13 Decomposition

Do dimensionality reduction / matrix factorization:

PCA, ICA, NMF, UMAP, t-SNE, etc.
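
As a minimal sketch of the first method, the example below runs PCA with base prcomp() on the built-in iris measurements. Scaling the features first is a common choice, not a requirement.

```r
# Use the four numeric features of the built-in iris dataset
x <- iris[, 1:4]

# PCA on centered and scaled features
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Projections of the cases onto the first two components
scores <- pca$x[, 1:2]
print(head(scores))
```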

58.14 Clustering

Group cases based on similarity across multiple features:

K-means, Fuzzy C-means, HOPACH, Spectral Clustering, etc.
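
A minimal K-means sketch with base kmeans() on the built-in iris measurements follows. The choice of three clusters and 25 random starts is an assumption for illustration; the known species labels are used only to inspect the result, not to fit the clusters.

```r
# Scale features so that each contributes equally to the distance
x <- scale(iris[, 1:4])

# K-means depends on random starting centers, so set a seed
# and use several random starts to get a more stable solution
set.seed(123)
km <- kmeans(x, centers = 3, nstart = 25)

# Cluster sizes
print(km$size)

# Cross-tabulate cluster assignment against the known species
print(table(cluster = km$cluster, species = iris$Species))
```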

58.15 Saving data to disk

Save your cleaned dataset to disk:

base write.csv()
data.table’s fwrite()
base saveRDS()
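
The sketch below saves a small, hypothetical data frame both ways with base R and reads it back. It illustrates one practical difference: an RDS file preserves R data types, while a CSV file stores only text, so types are guessed again when the file is read.

```r
# Hypothetical cleaned dataset with a character ID and a Date column
dat <- data.frame(
  id = c("001", "002"),
  visit_date = as.Date(c("2023-01-15", "2023-02-20"))
)

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

# CSV: portable plain text, readable by other software
write.csv(dat, csv_path, row.names = FALSE)

# RDS: R-specific binary format that keeps column types
saveRDS(dat, rds_path)

# Read both back and compare the column types
from_csv <- read.csv(csv_path)
from_rds <- readRDS(rds_path)

str(from_csv)  # types are re-guessed from text
str(from_rds)  # types are as saved
```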

58.16 Program your own functions!

For all the above operations, you will often be better off writing your own customized functions using the above base and third-party packages for your specific data needs and analysis goals.

Functions
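
As a minimal sketch, the function below wraps a few base functions into one reusable summary. The function name, its arguments, and the choice of statistics are hypothetical.

```r
# Summarize a numeric vector: number of values, number missing, mean, and SD
# x: numeric vector
# digits: number of decimal places to round the mean and SD to
summarize_numeric <- function(x, digits = 2) {
  # Fail early with a clear error if the input is not numeric
  stopifnot(is.numeric(x))
  c(
    n = length(x),
    n_missing = sum(is.na(x)),
    mean = round(mean(x, na.rm = TRUE), digits),
    sd = round(sd(x, na.rm = TRUE), digits)
  )
}

# Apply to a single vector
res <- summarize_numeric(c(120, 135, NA, 128))
print(res)

# Apply to every numeric column of a data frame
print(sapply(iris[, 1:4], summarize_numeric))
```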

58.17 Always document your code!

Always remember to add in-line comments (#) to your functions, scripts, and Quarto documents for your future self, your collaborators, and the world.

58.18 Share your code on GitHub

Consider sharing your code on GitHub to allow review by others. This may be done at any time during your work; you should especially consider publishing your code alongside published manuscripts.