25  Working with Data frames

Core data.frame operations:

Operation Functions
Filter
  • [
  • subset(x)
Find duplicate rows
  • Get index of duplicated: duplicated(x)
  • Keep only unique rows: unique(x)
Select
  • [; [[; $;
  • subset(df, select=)
Add column(s)
  • $, [, [[
  • cbind(x1, x2)
  • transform(x, new=)
Add row(s)
  • rbind(...)
Reshape
  • Long-to-Wide: reshape(direction = "wide")
  • Wide-to-Long: reshape(direction = "long")
Join
  • Inner: merge(x, y, all=FALSE)
  • Outer: merge(x, y, all=TRUE)
  • Left: merge(x, y, all.x=TRUE)
  • Right: merge(x, y, all.y=TRUE)

25.1 Synthetic Data

set.seed(2022)
n <- 10
dat <- data.frame(
  PID = 8001:8010,
  Sex = sample(c("Male", "Female"), size = n, replace = TRUE),
  Age = rnorm(n, mean = 58, sd = 12),
  Height = rnorm(n, mean = 178, sd = 10),
  Weight = rnorm(n, mean = 80, sd = 12),
  Group = sample(c("A", "B"), size = n, replace = TRUE)
)

25.2 Filter

Filter & select operations are performed by indexing rows and cases, respectively.

25.3 Column and row names

Let’s start with a simple example data.frame:

df <- data.frame(PID = c(111:119),
                 Hospital = c("UCSF", "HUP", "Stanford",
                             "Stanford", "UCSF", "HUP", 
                             "HUP", "Stanford", "UCSF"),
                Age = c(22, 34, 41, 19, 53, 21, 63, 22, 19),
                Sex = c(1, 1, 0, 1, 0, 0, 1, 0, 0))
df
  PID Hospital Age Sex
1 111     UCSF  22   1
2 112      HUP  34   1
3 113 Stanford  41   0
4 114 Stanford  19   1
5 115     UCSF  53   0
6 116      HUP  21   0
7 117      HUP  63   1
8 118 Stanford  22   0
9 119     UCSF  19   0

The optional row.names argument (see data.frame usage in the R documentation) can be used to define row names at the time of the data frame creation. It accepts either:

  • a single integer or a character specifying a column of the data.frame being created whose values should be used as row names, or
  • a vector of values (character or integer) of the row names to be used

For example, we can use the “PID” column:

df <- data.frame(PID = c(111:119),
                 Hospital = c("UCSF", "HUP", "Stanford",
                             "Stanford", "UCSF", "HUP", 
                             "HUP", "Stanford", "UCSF"),
                Age = c(22, 34, 41, 19, 53, 21, 63, 22, 19),
                Sex = c(1, 1, 0, 1, 0, 0, 1, 0, 0),
                row.names = "PID")
Note

It is recommended to not use/depend on row names to identify or index data.frames, and instead include a column of case IDs.

We can get column names and row names with colnames() and rownames(), respectively:

[1] "Hospital" "Age"      "Sex"     
[1] "111" "112" "113" "114" "115" "116" "117" "118" "119"

To set new column or row names use the form:

colnames(df) <- new.colnames

rownames(df) <- new.rownames

where new.colnames and new.rownames is a character vector.

You can rename all columns/rows or use indexing to replace specific names:

Rename all rows:

rownames(df) <- paste0("Patient_", 1:9)
df
          Hospital Age Sex
Patient_1     UCSF  22   1
Patient_2      HUP  34   1
Patient_3 Stanford  41   0
Patient_4 Stanford  19   1
Patient_5     UCSF  53   0
Patient_6      HUP  21   0
Patient_7      HUP  63   1
Patient_8 Stanford  22   0
Patient_9     UCSF  19   0

Rename first two columns:

colnames(df)[1:2] <- c("Center", "Age_at_Dx")
df
            Center Age_at_Dx Sex
Patient_1     UCSF        22   1
Patient_2      HUP        34   1
Patient_3 Stanford        41   0
Patient_4 Stanford        19   1
Patient_5     UCSF        53   0
Patient_6      HUP        21   0
Patient_7      HUP        63   1
Patient_8 Stanford        22   0
Patient_9     UCSF        19   0

25.4 Delete columns or rows

To delete a data.frame column, set it to NULL:

df$Sex <- NULL
df
            Center Age_at_Dx
Patient_1     UCSF        22
Patient_2      HUP        34
Patient_3 Stanford        41
Patient_4 Stanford        19
Patient_5     UCSF        53
Patient_6      HUP        21
Patient_7      HUP        63
Patient_8 Stanford        22
Patient_9     UCSF        19

To delete a data.frame row, you can “index it out”.

For example, to remove the third and fifths rows of the above data.frame using an integer index:

df <- df[-c(3, 5), ]
df
            Center Age_at_Dx
Patient_1     UCSF        22
Patient_2      HUP        34
Patient_4 Stanford        19
Patient_6      HUP        21
Patient_7      HUP        63
Patient_8 Stanford        22
Patient_9     UCSF        19

You can similarly exclude a row using a logical index. Logical indexing occurs usually following some filtering condition.

For example, exclude patients under 20 years old:

df <- df[!df$Age < 20, ]
df
            Center Age_at_Dx
Patient_1     UCSF        22
Patient_2      HUP        34
Patient_6      HUP        21
Patient_7      HUP        63
Patient_8 Stanford        22

25.5 subset()

subset() allows you to

  • filter cases that meet certain conditions using the subset argument
  • select columns using the select argument

(head() returns the first few lines of a data frame. We use it to avoid printing too many lines)

head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
iris_sl.gt.med <- subset(iris, Sepal.Length > median(Sepal.Length))

Note: You can use the column name Sepal.Length directly, i.e. unquoted and you don’t need to use iris$Sepal.Length.

This is called Non-Standard Evaluation (NSE).

x <- data.frame(one = 1:10,
                two = rnorm(10),
                group = c(rep("alpha", 4),  rep("beta", 6)))
subset(x, subset = two > 0, select = two)
          two
2  0.80776842
5  0.06035668
7  0.34027593
10 0.36817344
subset(x, two > 0, -one)
          two group
2  0.80776842 alpha
5  0.06035668  beta
7  0.34027593  beta
10 0.36817344  beta
subset(x, two > 0, two:one)
          two one
2  0.80776842   2
5  0.06035668   5
7  0.34027593   7
10 0.36817344  10
subset(x, two > 0, two:group)
          two group
2  0.80776842 alpha
5  0.06035668  beta
7  0.34027593  beta
10 0.36817344  beta

25.6 split()

Split a data frame into multiple data frames by groups defined by a factor:

x_by_group <- split(x, x$group)
x_by_group
$alpha
  one        two group
1   1 -0.2690377 alpha
2   2  0.8077684 alpha
3   3 -1.1247172 alpha
4   4 -1.4307880 alpha

$beta
   one         two group
5    5  0.06035668  beta
6    6 -0.79298250  beta
7    7  0.34027593  beta
8    8 -0.25946873  beta
9    9 -1.30484865  beta
10  10  0.36817344  beta

25.7 with()

Within a with() expression, you can access list elements or data.frame columns without quoting or using the $ operator:

with(x, one + two)
 [1]  0.7309623  2.8077684  1.8752828  2.5692120  5.0603567  5.2070175
 [7]  7.3402759  7.7405313  7.6951514 10.3681734
with(x, x[group == "alpha", ])
  one        two group
1   1 -0.2690377 alpha
2   2  0.8077684 alpha
3   3 -1.1247172 alpha
4   4 -1.4307880 alpha
with(x, x[two > 0, ])
   one        two group
2    2 0.80776842 alpha
5    5 0.06035668  beta
7    7 0.34027593  beta
10  10 0.36817344  beta

25.8 Feature transformation with transform()

Make up some data:

dat <- data.frame(Sex = c(0, 0, 1, 1, 0),
                  Height = c(1.5, 1.6, 1.55, 1.73, 1.8),
                  Weight = c(55, 70, 69, 76, 91))
dat <- transform(dat, BMI = Weight/Height^2)
dat
  Sex Height Weight      BMI
1   0   1.50     55 24.44444
2   0   1.60     70 27.34375
3   1   1.55     69 28.72008
4   1   1.73     76 25.39343
5   0   1.80     91 28.08642

transform() is probably not used too often, because it is trivial to do the same with direct assignment:

dat$BMI <- dat$Weight/dat$Height^2

but can be useful when adding multiple variables and/or used in a pipe:

dat |> 
  subset(Sex == 0) |> 
  transform(DeltaWeightFromMean = Weight - mean(Weight),
            BMI = Weight/Height^2,
            CI = Weight/Height^3)
  Sex Height Weight      BMI DeltaWeightFromMean       CI
1   0    1.5     55 24.44444                 -17 16.29630
2   0    1.6     70 27.34375                  -2 17.08984
5   0    1.8     91 28.08642                  19 15.60357

25.9 Identify and remove duplicated rows with duplicated() and unique()

When applied on a data.frame, the duplicated() function returns a logical index specifying the location of duplicated rows - specifically, a logical index of rows which are the duplicate of another row further up the data.frame. This means that if rows 20 and 23 are identical, duplicated() will return TRUE for row 23.

On the other hand, unique() will remove duplicate rows from a data.frame.

x <- data.frame(ID = c(203, 808, 909, 707, 808),
                Age = c(23, 44, 33, 42, 44))
[1] FALSE FALSE FALSE FALSE  TRUE
   ID Age
1 203  23
2 808  44
3 909  33
4 707  42

25.10 See also