25 Working with Data frames
Core data.frame operations:
Operation | Functions |
---|---|
Filter |
|
Find duplicate rows |
|
Select |
|
Add column(s) |
|
Add row(s) |
|
Reshape |
|
Join |
|
25.1 Synthetic Data
25.2 Filter
Filter & select operations are performed by indexing rows and cases, respectively.
25.3 Column and row names
Let’s start with a simple example data.frame:
df <- data.frame(PID = c(111:119),
Hospital = c("UCSF", "HUP", "Stanford",
"Stanford", "UCSF", "HUP",
"HUP", "Stanford", "UCSF"),
Age = c(22, 34, 41, 19, 53, 21, 63, 22, 19),
Sex = c(1, 1, 0, 1, 0, 0, 1, 0, 0))
df
PID Hospital Age Sex
1 111 UCSF 22 1
2 112 HUP 34 1
3 113 Stanford 41 0
4 114 Stanford 19 1
5 115 UCSF 53 0
6 116 HUP 21 0
7 117 HUP 63 1
8 118 Stanford 22 0
9 119 UCSF 19 0
The optional row.names
argument (see data.frame
usage in the R documentation) can be used to define row names at the time of the data frame creation. It accepts either:
- a single integer or a character specifying a column of the data.frame being created whose values should be used as row names, or
- a vector of values (character or integer) of the row names to be used
For example, we can use the “PID” column:
df <- data.frame(PID = c(111:119),
Hospital = c("UCSF", "HUP", "Stanford",
"Stanford", "UCSF", "HUP",
"HUP", "Stanford", "UCSF"),
Age = c(22, 34, 41, 19, 53, 21, 63, 22, 19),
Sex = c(1, 1, 0, 1, 0, 0, 1, 0, 0),
row.names = "PID")
It is recommended to not use/depend on row names to identify or index data.frames, and instead include a column of case IDs.
We can get column names and row names with colnames()
and rownames()
, respectively:
colnames(df)
[1] "Hospital" "Age" "Sex"
rownames(df)
[1] "111" "112" "113" "114" "115" "116" "117" "118" "119"
To set new column or row names use the form:
colnames(df) <- new.colnames
rownames(df) <- new.rownames
where new.colnames
and new.rownames
is a character vector.
You can rename all columns/rows or use indexing to replace specific names:
Rename all rows:
Hospital Age Sex
Patient_1 UCSF 22 1
Patient_2 HUP 34 1
Patient_3 Stanford 41 0
Patient_4 Stanford 19 1
Patient_5 UCSF 53 0
Patient_6 HUP 21 0
Patient_7 HUP 63 1
Patient_8 Stanford 22 0
Patient_9 UCSF 19 0
Rename first two columns:
25.4 Delete columns or rows
To delete a data.frame column, set it to NULL:
df$Sex <- NULL
df
Center Age_at_Dx
Patient_1 UCSF 22
Patient_2 HUP 34
Patient_3 Stanford 41
Patient_4 Stanford 19
Patient_5 UCSF 53
Patient_6 HUP 21
Patient_7 HUP 63
Patient_8 Stanford 22
Patient_9 UCSF 19
To delete a data.frame row, you can “index it out”.
For example, to remove the third and fifths rows of the above data.frame using an integer index:
df <- df[-c(3, 5), ]
df
Center Age_at_Dx
Patient_1 UCSF 22
Patient_2 HUP 34
Patient_4 Stanford 19
Patient_6 HUP 21
Patient_7 HUP 63
Patient_8 Stanford 22
Patient_9 UCSF 19
You can similarly exclude a row using a logical index. Logical indexing occurs usually following some filtering condition.
For example, exclude patients under 20 years old:
df <- df[!df$Age < 20, ]
df
Center Age_at_Dx
Patient_1 UCSF 22
Patient_2 HUP 34
Patient_6 HUP 21
Patient_7 HUP 63
Patient_8 Stanford 22
25.5 subset()
subset()
allows you to
-
filter cases that meet certain conditions using the
subset
argument -
select columns using the
select
argument
(head()
returns the first few lines of a data frame. We use it to avoid printing too many lines)
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
Note: You can use the column name Sepal.Length directly, i.e. unquoted and you don’t need to use iris$Sepal.Length
.
This is called Non-Standard Evaluation (NSE).
x <- data.frame(one = 1:10,
two = rnorm(10),
group = c(rep("alpha", 4), rep("beta", 6)))
subset(x, subset = two > 0, select = two)
two
2 0.80776842
5 0.06035668
7 0.34027593
10 0.36817344
subset(x, two > 0, -one)
two group
2 0.80776842 alpha
5 0.06035668 beta
7 0.34027593 beta
10 0.36817344 beta
subset(x, two > 0, two:one)
two one
2 0.80776842 2
5 0.06035668 5
7 0.34027593 7
10 0.36817344 10
subset(x, two > 0, two:group)
two group
2 0.80776842 alpha
5 0.06035668 beta
7 0.34027593 beta
10 0.36817344 beta
25.6 split()
Split a data frame into multiple data frames by groups defined by a factor:
x_by_group <- split(x, x$group)
x_by_group
$alpha
one two group
1 1 -0.2690377 alpha
2 2 0.8077684 alpha
3 3 -1.1247172 alpha
4 4 -1.4307880 alpha
$beta
one two group
5 5 0.06035668 beta
6 6 -0.79298250 beta
7 7 0.34027593 beta
8 8 -0.25946873 beta
9 9 -1.30484865 beta
10 10 0.36817344 beta
25.7 with()
Within a with()
expression, you can access list elements or data.frame columns without quoting or using the $
operator:
with(x, one + two)
[1] 0.7309623 2.8077684 1.8752828 2.5692120 5.0603567 5.2070175
[7] 7.3402759 7.7405313 7.6951514 10.3681734
with(x, x[group == "alpha", ])
one two group
1 1 -0.2690377 alpha
2 2 0.8077684 alpha
3 3 -1.1247172 alpha
4 4 -1.4307880 alpha
with(x, x[two > 0, ])
one two group
2 2 0.80776842 alpha
5 5 0.06035668 beta
7 7 0.34027593 beta
10 10 0.36817344 beta
25.8 Feature transformation with transform()
Make up some data:
dat <- data.frame(Sex = c(0, 0, 1, 1, 0),
Height = c(1.5, 1.6, 1.55, 1.73, 1.8),
Weight = c(55, 70, 69, 76, 91))
dat <- transform(dat, BMI = Weight/Height^2)
dat
Sex Height Weight BMI
1 0 1.50 55 24.44444
2 0 1.60 70 27.34375
3 1 1.55 69 28.72008
4 1 1.73 76 25.39343
5 0 1.80 91 28.08642
transform()
is probably not used too often, because it is trivial to do the same with direct assignment:
dat$BMI <- dat$Weight/dat$Height^2
but can be useful when adding multiple variables and/or used in a pipe:
25.9 Identify and remove duplicated rows with duplicated()
and unique()
When applied on a data.frame, the duplicated()
function returns a logical index specifying the location of duplicated rows - specifically, a logical index of rows which are the duplicate of another row further up the data.frame. This means that if rows 20 and 23 are identical, duplicated()
will return TRUE
for row 23.
On the other hand, unique()
will remove duplicate rows from a data.frame.
x <- data.frame(ID = c(203, 808, 909, 707, 808),
Age = c(23, 44, 33, 42, 44))
duplicated(x)
[1] FALSE FALSE FALSE FALSE TRUE
unique(x)
ID Age
1 203 23
2 808 44
3 909 33
4 707 42
25.10 See also
- See the data frame section for an introduction to data frames.