58  Data Pipeline Overview

[Figure: Data pipeline]

58.1 Get access to Data

Health-related data comes from many sources, including:

  • Electronic Health Records (EHR), e.g. Epic
  • Lab/Clinical research data
  • Public datasets, e.g. NIH, UK Biobank, etc.

58.2 Handle and inspect data in the command line

Inspecting files from the command line is particularly useful for datasets of unknown structure (e.g. to find out which delimiter is used) and for very large files (e.g. to check whether the data will fit into memory).

58.3 Read Data into R
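
As a minimal sketch of this step, the base R code below first writes a small made-up CSV file to a temporary location (so the example is self-contained) and then reads it back with read.csv(). The file contents and column names are hypothetical; for your own data, replace `path` with the path to your file.

```r
# Write a tiny example file so the sketch is self-contained (made-up data)
path <- tempfile(fileext = ".csv")
writeLines(
  c("patient_id,age,sex",
    "1,34,Female",
    "2,51,Male",
    "3,NA,Female"),
  path
)

# Read the comma-delimited file into a data.frame;
# the strings "NA" and "" are treated as missing values
dat <- read.csv(path, na.strings = c("NA", ""))

# Inspect the structure: column names, types, and first values
str(dat)
```

For other delimiters, read.delim() (tab-separated) or read.table() with an explicit `sep` argument follow the same pattern.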

58.4 Clean data names & values
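
The sketch below illustrates one possible way to clean names and values with base R only. The example data frame, its messy column names, and the chosen cleaning rules (lower-case names with underscores, upper-case trimmed values) are assumptions made for illustration.

```r
# Made-up data with messy column names and inconsistent values
dat <- data.frame(
  "Patient ID" = 1:3,
  "Sex " = c(" F", "m", "M "),
  check.names = FALSE
)

# Clean names: trim whitespace, lower-case, and replace any run of
# non-alphanumeric characters with a single underscore
names(dat) <- gsub("[^a-z0-9]+", "_", tolower(trimws(names(dat))))

# Clean values: trim whitespace and harmonize case
dat$sex <- toupper(trimws(dat$sex))

names(dat)
dat$sex
```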

58.5 Define Data Types

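Assign each column the type that matches its meaning, for example integer or numeric for measurements, factor for categorical variables, and Date for dates.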
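
A minimal base R sketch of type conversion follows. The data frame and its columns are hypothetical; the point is only to show as.integer(), factor(), and as.Date() applied to columns that were read in as character.

```r
# Made-up data in which every column was read as character
dat <- data.frame(
  id = c("1", "2", "3"),
  sex = c("F", "M", "F"),
  admit_date = c("2023-01-15", "2023-02-01", "2023-03-20")
)

# Identifiers stored as whole numbers
dat$id <- as.integer(dat$id)

# Categorical variable with explicitly ordered levels
dat$sex <- factor(dat$sex, levels = c("F", "M"))

# Dates in ISO 8601 format (YYYY-MM-DD) parse without a format string
dat$admit_date <- as.Date(dat$admit_date)

# Check the resulting class of each column
sapply(dat, class)
```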

58.6 Reshape

Convert long to wide or vice versa, as needed.
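
As an illustration, the sketch below uses base R's reshape() on a small made-up dataset of repeated blood pressure measurements. The column names (`id`, `visit`, `sbp`) are assumptions for this example; third-party packages offer alternative interfaces for the same operation.

```r
# Made-up long data: one row per patient per visit
long <- data.frame(
  id = c(1, 1, 2, 2),
  visit = c("v1", "v2", "v1", "v2"),
  sbp = c(120, 118, 135, 130)
)

# Long to wide: one row per patient, one column per visit
wide <- reshape(long, idvar = "id", timevar = "visit", direction = "wide")
wide  # columns: id, sbp.v1, sbp.v2

# Wide to long: stack the visit columns back into a single column
long2 <- reshape(
  wide,
  direction = "long",
  idvar = "id",
  varying = c("sbp.v1", "sbp.v2"),
  v.names = "sbp",
  timevar = "visit",
  times = c("v1", "v2")
)
long2
```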

58.7 Join data sets

If your data are spread across multiple files that need to be merged, you can join them on one or more shared key columns.
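
A minimal sketch of joins with base R's merge() is shown below, assuming two made-up tables that share a key column named `id`. The `all`, `all.x`, and `all.y` arguments control which unmatched rows are kept.

```r
# Two made-up tables sharing the key column "id"
demographics <- data.frame(id = c(1, 2, 3), age = c(34, 51, 47))
labs <- data.frame(id = c(2, 3, 4), hba1c = c(6.1, 7.4, 5.6))

# Inner join: only ids present in both tables
inner <- merge(demographics, labs, by = "id")

# Left join: all rows of demographics; missing lab values become NA
left <- merge(demographics, labs, by = "id", all.x = TRUE)

# Full outer join: all ids from either table
full <- merge(demographics, labs, by = "id", all = TRUE)

inner
left
full
```

After any join, it is worth checking the number of rows in the result against what you expect, since duplicated keys silently multiply rows.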

58.8 Transform data

Data transformations will depend on the analysis or analyses you wish to perform. Note that different statistical tests and machine learning models (for supervised or unsupervised learning) often require different data transformations.
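
Two common transformations, a log transform and standardization, can be sketched in base R as follows. The input values are made up, and whether either transformation is appropriate depends on your data and the method you plan to use.

```r
# Made-up right-skewed values
x <- c(1, 10, 100, 1000)

# Log transform to reduce right skew
x_log <- log10(x)

# Standardize to mean 0 and standard deviation 1;
# scale() returns a matrix, so convert back to a plain vector
x_std <- as.numeric(scale(x_log))

x_log
mean(x_std)
sd(x_std)
```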

58.9 Visualize

Visualization is essential before, during, and after data preparation, hypothesis testing, and supervised and unsupervised learning.
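
As a small illustration using base R graphics, the sketch below draws a histogram and a grouped boxplot of simulated data. The variable names and the simulated group difference are assumptions for this example.

```r
# Simulated data: two groups of 50 cases each
set.seed(1)
dat <- data.frame(
  group = rep(c("control", "treated"), each = 50),
  value = c(rnorm(50, mean = 0), rnorm(50, mean = 1))
)

# Distribution of a single continuous variable
h <- hist(dat$value, main = "Distribution of value", xlab = "value")

# Compare the distribution across groups
boxplot(value ~ group, data = dat, main = "value by group")
```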

58.10 Summarize & Aggregate
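
A minimal base R sketch of summarizing and aggregating follows, using a small made-up data frame. summary() gives an overall description of one variable, and aggregate() computes a statistic within each group.

```r
# Made-up data: a grouping variable and a measurement
dat <- data.frame(
  group = c("a", "a", "b", "b", "b"),
  value = c(1, 3, 2, 4, 6)
)

# Overall summary of one variable
summary(dat$value)

# Mean of value within each group
agg <- aggregate(value ~ group, data = dat, FUN = mean)
agg
```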

58.11 Statistical Hypothesis Testing
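
As one example of a hypothesis test, the sketch below compares the means of two simulated groups with t.test(), which performs Welch's unequal-variance t-test by default. The data are simulated, and the choice of test is an assumption for illustration; the appropriate test depends on your design and data.

```r
# Simulated measurements for two independent groups
set.seed(2024)
control <- rnorm(30, mean = 0)
treated <- rnorm(30, mean = 1)

# Welch two-sample t-test (the default: var.equal = FALSE)
tt <- t.test(treated, control)

tt$estimate  # group means
tt$conf.int  # confidence interval for the difference in means
tt$p.value
```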

58.12 Predictive Modeling

Perform classification, regression, or survival analysis.
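
The sketch below fits a linear regression with lm() and a logistic regression (binary classification) with glm() on simulated data. The data-generating coefficients are made up for this example, and these two models stand in for the much wider range of methods available; survival analysis is not shown here.

```r
# Simulate one predictor, a continuous outcome, and a binary outcome
set.seed(1)
n <- 200
x <- rnorm(n)
y_num <- 2 + 3 * x + rnorm(n)
y_bin <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.5 * x))

# Regression: continuous outcome
fit_reg <- lm(y_num ~ x)
coef(fit_reg)

# Classification: binary outcome via logistic regression
fit_cls <- glm(y_bin ~ x, family = binomial)

# Predicted probabilities for two new cases
p <- predict(fit_cls, newdata = data.frame(x = c(-1, 1)), type = "response")
p
```

In practice, predictive performance should be estimated on data not used for training, e.g. with a held-out test set or resampling.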

58.13 Decomposition

Perform dimensionality reduction or matrix factorization.
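
As one example of decomposition, the sketch below runs principal component analysis with prcomp() on the numeric columns of the built-in `iris` dataset. The choice of PCA and of this dataset are assumptions for illustration.

```r
# Numeric features of the built-in iris dataset
x <- as.matrix(iris[, 1:4])

# PCA on centered and scaled features
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Projections of each case onto the first two components
scores <- pca$x[, 1:2]
head(scores)

# Proportion of variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
var_explained
```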

58.14 Clustering

Group cases based on their similarity across multiple features.
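
A minimal sketch of two common clustering approaches, k-means and hierarchical clustering, is given below using the built-in `iris` dataset. The number of clusters (3) and the linkage method are assumptions for this example.

```r
# Scale the features so each contributes equally to distances
x <- scale(iris[, 1:4])

# K-means with 3 clusters; nstart > 1 reruns from multiple random starts
set.seed(1)
km <- kmeans(x, centers = 3, nstart = 20)
table(km$cluster)

# Hierarchical clustering on Euclidean distances, cut into 3 groups
hc <- hclust(dist(x), method = "average")
hc_groups <- cutree(hc, k = 3)
table(hc_groups)
```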

58.15 Saving data to disk

Save your cleaned dataset to disk.
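
The sketch below shows two base R options: write.csv() for a portable text file and saveRDS() for an R-specific binary file that preserves column types such as factors. The data frame and the temporary file paths are placeholders for this example.

```r
# Made-up cleaned dataset
dat <- data.frame(id = 1:3, sex = factor(c("F", "M", "F")))

# Temporary paths so the sketch does not write into your project
csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

# Portable text format; column types must be redefined on reading
write.csv(dat, csv_path, row.names = FALSE)

# R binary format; preserves types and attributes exactly
saveRDS(dat, rds_path)

# Read the RDS file back and confirm it matches the original
dat2 <- readRDS(rds_path)
identical(dat, dat2)
```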

58.16 Program your own functions!

For all of the above operations, you will often be better off writing your own functions, built on base R and third-party packages, tailored to your specific data needs and analysis goals.
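
As an illustration, here is a small custom function that reports missingness per column. The function name `count_missing` and its output format are hypothetical, chosen only for this example.

```r
# Hypothetical helper: count and percentage of missing values per column
count_missing <- function(x) {
  stopifnot(is.data.frame(x))
  # Number of NA values in each column
  n_missing <- unname(vapply(x, function(col) sum(is.na(col)), integer(1)))
  data.frame(
    variable = names(x),
    n_missing = n_missing,
    pct_missing = 100 * n_missing / nrow(x)
  )
}

# Usage on made-up data
res <- count_missing(data.frame(a = c(1, NA, 3), b = c("x", "y", NA)))
res
```

Wrapping a repeated operation in a function like this keeps analysis scripts shorter and makes the operation easier to test and reuse.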

58.17 Always document your code!

Always remember to add in-line comments (#) to your functions, scripts, and Quarto documents, for your future self, your collaborators, and the world.
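
A short sketch of what this can look like in practice follows. The function is a made-up example; the comment layout is one possible convention, not a required format.

```r
# Convert temperatures from degrees Fahrenheit to degrees Celsius.
#   f: numeric vector of temperatures in degrees Fahrenheit
#   returns: numeric vector of the same length, in degrees Celsius
fahrenheit_to_celsius <- function(f) {
  # Remove the 32-degree offset, then rescale by 5/9
  (f - 32) * 5 / 9
}

# Freezing and boiling points of water
temps_c <- fahrenheit_to_celsius(c(32, 212))
temps_c
```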

58.18 Share your code on GitHub

Consider sharing your code on GitHub to allow review by others. You can do this at any time during your work, and you should especially consider publishing your code alongside published manuscripts.