Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
dplyr
includes the following commands for performing table joins:
inner_join()
for inner joinsleft_join()
for left outer joinsright_join()
for right outer joinsfull_join()
for outer joinsUsing the same example data as in the base R example (Chapter 27):
a <- data.frame(PID = 101:109,
Hospital = c("UCSF", "HUP", "Stanford",
"Stanford", "UCSF", "HUP",
"HUP", "Stanford", "UCSF"),
Age = c(22, 34, 41, 19, 53, 21, 63, 22, 19),
Sex = c(1, 1, 0, 1, 0, 0, 1, 0, 0))
a
PID Hospital Age Sex
1 101 UCSF 22 1
2 102 HUP 34 1
3 103 Stanford 41 0
4 104 Stanford 19 1
5 105 UCSF 53 0
6 106 HUP 21 0
7 107 HUP 63 1
8 108 Stanford 22 0
9 109 UCSF 19 0
dim(a)
[1] 9 4
b <- data.frame(PID = 106:112,
V1 = c(153, 89, 112, 228, 91, 190, 101),
Department = c("Neurology", "Radiology",
"Emergency", "Cardiology",
"Surgery", "Neurology", "Psychiatry"))
dim(b)
[1] 7 3
b
PID V1 Department
1 106 153 Neurology
2 107 89 Radiology
3 108 112 Emergency
4 109 228 Cardiology
5 110 91 Surgery
6 111 190 Neurology
7 112 101 Psychiatry
ab_inner <- inner_join(a, b)
Joining with `by = join_by(PID)`
ab_inner
PID Hospital Age Sex V1 Department
1 106 HUP 21 0 153 Neurology
2 107 HUP 63 1 89 Radiology
3 108 Stanford 22 0 112 Emergency
4 109 UCSF 19 0 228 Cardiology
Note that the resulting table only contains cases found in both datasets, i.e. IDs 106 through 109
ab_outer <- full_join(a, b)
Joining with `by = join_by(PID)`
ab_outer
PID Hospital Age Sex V1 Department
1 101 UCSF 22 1 NA <NA>
2 102 HUP 34 1 NA <NA>
3 103 Stanford 41 0 NA <NA>
4 104 Stanford 19 1 NA <NA>
5 105 UCSF 53 0 NA <NA>
6 106 HUP 21 0 153 Neurology
7 107 HUP 63 1 89 Radiology
8 108 Stanford 22 0 112 Emergency
9 109 UCSF 19 0 228 Cardiology
10 110 <NA> NA NA 91 Surgery
11 111 <NA> NA NA 190 Neurology
12 112 <NA> NA NA 101 Psychiatry
Note that the resulting data frame contains all cases found in either dataset and missing values are represented with NA
.
ab_leftOuter <- left_join(a, b)
Joining with `by = join_by(PID)`
ab_leftOuter
PID Hospital Age Sex V1 Department
1 101 UCSF 22 1 NA <NA>
2 102 HUP 34 1 NA <NA>
3 103 Stanford 41 0 NA <NA>
4 104 Stanford 19 1 NA <NA>
5 105 UCSF 53 0 NA <NA>
6 106 HUP 21 0 153 Neurology
7 107 HUP 63 1 89 Radiology
8 108 Stanford 22 0 112 Emergency
9 109 UCSF 19 0 228 Cardiology
Note that the resulting data frame contains all cases present in the left input dataset (i.e. the one defined first in the arguments) only.
ab_rightOuter <- right_join(a, b)
Joining with `by = join_by(PID)`
ab_rightOuter
PID Hospital Age Sex V1 Department
1 106 HUP 21 0 153 Neurology
2 107 HUP 63 1 89 Radiology
3 108 Stanford 22 0 112 Emergency
4 109 UCSF 19 0 228 Cardiology
5 110 <NA> NA NA 91 Surgery
6 111 <NA> NA NA 190 Neurology
7 112 <NA> NA NA 101 Psychiatry
Note how the resulting data frame contains all cases present in the right input dataset (i.e. the one defined second in the arguments) only.
If the ID columns in the two data.frames to be merged do not have the same name, you can specify column names using the by
argument with a slightly peculiar syntax: c("left_by" = "right_by")
a <- data.frame(PID = 101:109,
Hospital = c("UCSF", "HUP", "Stanford",
"Stanford", "UCSF", "HUP",
"HUP", "Stanford", "UCSF"),
Age = c(22, 34, 41, 19, 53, 21, 63, 22, 19),
Sex = c(1, 1, 0, 1, 0, 0, 1, 0, 0))
a
PID Hospital Age Sex
1 101 UCSF 22 1
2 102 HUP 34 1
3 103 Stanford 41 0
4 104 Stanford 19 1
5 105 UCSF 53 0
6 106 HUP 21 0
7 107 HUP 63 1
8 108 Stanford 22 0
9 109 UCSF 19 0
b <- data.frame(PatientID = 106:112,
V1 = c(153, 89, 112, 228, 91, 190, 101),
Department = c("Neurology", "Radiology",
"Emergency", "Cardiology",
"Surgery", "Neurology", "Psychiatry"))
b
PatientID V1 Department
1 106 153 Neurology
2 107 89 Radiology
3 108 112 Emergency
4 109 228 Cardiology
5 110 91 Surgery
6 111 190 Neurology
7 112 101 Psychiatry
ab_inner <- inner_join(a, b, by = c("PID" = "PatientID"))
ab_inner
PID Hospital Age Sex V1 Department
1 106 HUP 21 0 153 Neurology
2 107 HUP 63 1 89 Radiology
3 108 Stanford 22 0 112 Emergency
4 109 UCSF 19 0 228 Cardiology