25 Working with Data frames

Core data.frame operations:

Operation	Functions
Filter	`[` `subset(x)`
Find duplicate rows	Get index of duplicated: `duplicated(x)` Keep only unique rows: `unique(x)`
Select	`[`; `[[`; `$`; `subset(df, select=)`
Add column(s)	`$`, `[`, `[[` `cbind(x1, x2)` `transform(x, new=)`
Add row(s)	`rbind(...)`
Reshape	Long-to-Wide: `reshape(direction = "wide")` Wide-to-Long: `reshape(direction = "long")`
Join	Inner: `merge(x, y, all=FALSE)` Outer: `merge(x, y, all=TRUE)` Left: `merge(x, y, all.x=TRUE)` Right: `merge(x, y, all.y=TRUE)`

25.1 Synthetic Data

set.seed(2022)
n <- 10
dat <- data.frame(
  PID = 8001:8010,
  Sex = sample(c("Male", "Female"), size = n, replace = TRUE),
  Age = rnorm(n, mean = 58, sd = 12),
  Height = rnorm(n, mean = 178, sd = 10),
  Weight = rnorm(n, mean = 80, sd = 12),
  Group = sample(c("A", "B"), size = n, replace = TRUE)
)

25.2 Filter

Filter & select operations are performed by indexing rows and cases, respectively.

25.3 Column and row names

Let’s start with a simple example data.frame:

df <- data.frame(PID = c(111:119),
                 Hospital = c("UCSF", "HUP", "Stanford",
                             "Stanford", "UCSF", "HUP", 
                             "HUP", "Stanford", "UCSF"),
                Age = c(22, 34, 41, 19, 53, 21, 63, 22, 19),
                Sex = c(1, 1, 0, 1, 0, 0, 1, 0, 0))
df

  PID Hospital Age Sex
1 111     UCSF  22   1
2 112      HUP  34   1
3 113 Stanford  41   0
4 114 Stanford  19   1
5 115     UCSF  53   0
6 116      HUP  21   0
7 117      HUP  63   1
8 118 Stanford  22   0
9 119     UCSF  19   0

The optional row.names argument (see data.frame usage in the R documentation) can be used to define row names at the time of the data frame creation. It accepts either:

a single integer or a character specifying a column of the data.frame being created whose values should be used as row names, or
a vector of values (character or integer) of the row names to be used

For example, we can use the “PID” column:

df <- data.frame(PID = c(111:119),
                 Hospital = c("UCSF", "HUP", "Stanford",
                             "Stanford", "UCSF", "HUP", 
                             "HUP", "Stanford", "UCSF"),
                Age = c(22, 34, 41, 19, 53, 21, 63, 22, 19),
                Sex = c(1, 1, 0, 1, 0, 0, 1, 0, 0),
                row.names = "PID")

Note

It is recommended to not use/depend on row names to identify or index data.frames, and instead include a column of case IDs.

We can get column names and row names with colnames() and rownames(), respectively:

colnames(df)

[1] "Hospital" "Age"      "Sex"

rownames(df)

[1] "111" "112" "113" "114" "115" "116" "117" "118" "119"

To set new column or row names use the form:

colnames(df) <- new.colnames

rownames(df) <- new.rownames

where new.colnames and new.rownames is a character vector.

You can rename all columns/rows or use indexing to replace specific names:

Rename all rows:

rownames(df) <- paste0("Patient_", 1:9)
df

          Hospital Age Sex
Patient_1     UCSF  22   1
Patient_2      HUP  34   1
Patient_3 Stanford  41   0
Patient_4 Stanford  19   1
Patient_5     UCSF  53   0
Patient_6      HUP  21   0
Patient_7      HUP  63   1
Patient_8 Stanford  22   0
Patient_9     UCSF  19   0

Rename first two columns:

colnames(df)[1:2] <- c("Center", "Age_at_Dx")
df

            Center Age_at_Dx Sex
Patient_1     UCSF        22   1
Patient_2      HUP        34   1
Patient_3 Stanford        41   0
Patient_4 Stanford        19   1
Patient_5     UCSF        53   0
Patient_6      HUP        21   0
Patient_7      HUP        63   1
Patient_8 Stanford        22   0
Patient_9     UCSF        19   0

25.4 Delete columns or rows

To delete a data.frame column, set it to NULL:

df$Sex <- NULL
df

            Center Age_at_Dx
Patient_1     UCSF        22
Patient_2      HUP        34
Patient_3 Stanford        41
Patient_4 Stanford        19
Patient_5     UCSF        53
Patient_6      HUP        21
Patient_7      HUP        63
Patient_8 Stanford        22
Patient_9     UCSF        19

To delete a data.frame row, you can “index it out”.

For example, to remove the third and fifths rows of the above data.frame using an integer index:

df <- df[-c(3, 5), ]
df

            Center Age_at_Dx
Patient_1     UCSF        22
Patient_2      HUP        34
Patient_4 Stanford        19
Patient_6      HUP        21
Patient_7      HUP        63
Patient_8 Stanford        22
Patient_9     UCSF        19

You can similarly exclude a row using a logical index. Logical indexing occurs usually following some filtering condition.

For example, exclude patients under 20 years old:

df <- df[!df$Age < 20, ]
df

            Center Age_at_Dx
Patient_1     UCSF        22
Patient_2      HUP        34
Patient_6      HUP        21
Patient_7      HUP        63
Patient_8 Stanford        22

25.5 `subset()`

subset() allows you to

filter cases that meet certain conditions using the subset argument
select columns using the select argument

(head() returns the first few lines of a data frame. We use it to avoid printing too many lines)

head(iris)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

iris_sl.gt.med <- subset(iris, Sepal.Length > median(Sepal.Length))

Note: You can use the column name Sepal.Length directly, i.e. unquoted and you don’t need to use iris$Sepal.Length.

This is called Non-Standard Evaluation (NSE).

x <- data.frame(one = 1:10,
                two = rnorm(10),
                group = c(rep("alpha", 4),  rep("beta", 6)))
subset(x, subset = two > 0, select = two)

          two
2  0.80776842
5  0.06035668
7  0.34027593
10 0.36817344

subset(x, two > 0, -one)

          two group
2  0.80776842 alpha
5  0.06035668  beta
7  0.34027593  beta
10 0.36817344  beta

subset(x, two > 0, two:one)

          two one
2  0.80776842   2
5  0.06035668   5
7  0.34027593   7
10 0.36817344  10

subset(x, two > 0, two:group)

          two group
2  0.80776842 alpha
5  0.06035668  beta
7  0.34027593  beta
10 0.36817344  beta

25.6 `split()`

Split a data frame into multiple data frames by groups defined by a factor:

x_by_group <- split(x, x$group)
x_by_group

$alpha
  one        two group
1   1 -0.2690377 alpha
2   2  0.8077684 alpha
3   3 -1.1247172 alpha
4   4 -1.4307880 alpha

$beta
   one         two group
5    5  0.06035668  beta
6    6 -0.79298250  beta
7    7  0.34027593  beta
8    8 -0.25946873  beta
9    9 -1.30484865  beta
10  10  0.36817344  beta

25.7 `with()`

Within a with() expression, you can access list elements or data.frame columns without quoting or using the $ operator:

with(x, one + two)

 [1]  0.7309623  2.8077684  1.8752828  2.5692120  5.0603567  5.2070175
 [7]  7.3402759  7.7405313  7.6951514 10.3681734

with(x, x[group == "alpha", ])

  one        two group
1   1 -0.2690377 alpha
2   2  0.8077684 alpha
3   3 -1.1247172 alpha
4   4 -1.4307880 alpha

with(x, x[two > 0, ])

   one        two group
2    2 0.80776842 alpha
5    5 0.06035668  beta
7    7 0.34027593  beta
10  10 0.36817344  beta

25.8 Feature transformation with `transform()`

Make up some data:

dat <- data.frame(Sex = c(0, 0, 1, 1, 0),
                  Height = c(1.5, 1.6, 1.55, 1.73, 1.8),
                  Weight = c(55, 70, 69, 76, 91))

dat <- transform(dat, BMI = Weight/Height^2)
dat

  Sex Height Weight      BMI
1   0   1.50     55 24.44444
2   0   1.60     70 27.34375
3   1   1.55     69 28.72008
4   1   1.73     76 25.39343
5   0   1.80     91 28.08642

transform() is probably not used too often, because it is trivial to do the same with direct assignment:

dat$BMI <- dat$Weight/dat$Height^2

but can be useful when adding multiple variables and/or used in a pipe:

dat |> 
  subset(Sex == 0) |> 
  transform(DeltaWeightFromMean = Weight - mean(Weight),
            BMI = Weight/Height^2,
            CI = Weight/Height^3)

  Sex Height Weight      BMI DeltaWeightFromMean       CI
1   0    1.5     55 24.44444                 -17 16.29630
2   0    1.6     70 27.34375                  -2 17.08984
5   0    1.8     91 28.08642                  19 15.60357

25.9 Identify and remove duplicated rows with `duplicated()` and `unique()`

When applied on a data.frame, the duplicated() function returns a logical index specifying the location of duplicated rows - specifically, a logical index of rows which are the duplicate of another row further up the data.frame. This means that if rows 20 and 23 are identical, duplicated() will return TRUE for row 23.

On the other hand, unique() will remove duplicate rows from a data.frame.

x <- data.frame(ID = c(203, 808, 909, 707, 808),
                Age = c(23, 44, 33, 42, 44))

duplicated(x)

[1] FALSE FALSE FALSE FALSE  TRUE

unique(x)

25.10 See also

See the data frame section for an introduction to data frames.

25.1 Synthetic Data

25.2 Filter

25.3 Column and row names

25.4 Delete columns or rows

25.5 subset()

25.6 split()

25.7 with()

25.8 Feature transformation with transform()

25.9 Identify and remove duplicated rows with duplicated() and unique()

25.10 See also

25.5 `subset()`

25.6 `split()`

25.7 `with()`

25.8 Feature transformation with `transform()`

25.9 Identify and remove duplicated rows with `duplicated()` and `unique()`