library(data.table)
options(datatable.print.class = TRUE)
38 Joins
data.table
allows you to perform table joins with either:
- the base
merge()
function using the same syntax as for data.frame objects or - Using bracket notation - unique to
data.table
objects.
a <- data.table(PID = 1:9,
Hospital = c("UCSF", "HUP", "Stanford",
"Stanford", "UCSF", "HUP",
"HUP", "Stanford", "UCSF"),
Age = c(22, 34, 41, 19, 53, 21, 63, 22, 19),
Sex = c(1, 1, 0, 1, 0, 0, 1, 0, 0),
key = "PID")
a
Key: <PID>
PID Hospital Age Sex
<int> <char> <num> <num>
1: 1 UCSF 22 1
2: 2 HUP 34 1
3: 3 Stanford 41 0
4: 4 Stanford 19 1
5: 5 UCSF 53 0
6: 6 HUP 21 0
7: 7 HUP 63 1
8: 8 Stanford 22 0
9: 9 UCSF 19 0
b <- data.table(PID = 6:12,
V1 = c(153, 89, 112, 228, 91, 190, 101),
Department = c("Neurology", "Radiology", "Emergency",
"Cardiology", "Surgery", "Neurology",
"Psychiatry"),
key = "PID")
b
Key: <PID>
PID V1 Department
<int> <num> <char>
1: 6 153 Neurology
2: 7 89 Radiology
3: 8 112 Emergency
4: 9 228 Cardiology
5: 10 91 Surgery
6: 11 190 Neurology
7: 12 101 Psychiatry
In the above command we use the key
argument to set PID
as key. This can be performed after the data.table
has been created using the setkey()
command:
setkey(a, PID)
Multiple keys can be set, in order, with the same setkey()
command, separated by commas, e.g.
setkey(a, PID, Hospital)
Keys sort the data.table by the corresponding columns and can be used to perform left and right joins with bracket notation seen later.
38.1 Inner join
merge(a, b)
Key: <PID>
PID Hospital Age Sex V1 Department
<int> <char> <num> <num> <num> <char>
1: 6 HUP 21 0 153 Neurology
2: 7 HUP 63 1 89 Radiology
3: 8 Stanford 22 0 112 Emergency
4: 9 UCSF 19 0 228 Cardiology
38.2 Outer join
merge(a, b, all = TRUE)
Key: <PID>
PID Hospital Age Sex V1 Department
<int> <char> <num> <num> <num> <char>
1: 1 UCSF 22 1 NA <NA>
2: 2 HUP 34 1 NA <NA>
3: 3 Stanford 41 0 NA <NA>
4: 4 Stanford 19 1 NA <NA>
5: 5 UCSF 53 0 NA <NA>
6: 6 HUP 21 0 153 Neurology
7: 7 HUP 63 1 89 Radiology
8: 8 Stanford 22 0 112 Emergency
9: 9 UCSF 19 0 228 Cardiology
10: 10 <NA> NA NA 91 Surgery
11: 11 <NA> NA NA 190 Neurology
12: 12 <NA> NA NA 101 Psychiatry
38.3 Left outer join
Using merge()
:
merge(a, b, all.x = TRUE)
Key: <PID>
PID Hospital Age Sex V1 Department
<int> <char> <num> <num> <num> <char>
1: 1 UCSF 22 1 NA <NA>
2: 2 HUP 34 1 NA <NA>
3: 3 Stanford 41 0 NA <NA>
4: 4 Stanford 19 1 NA <NA>
5: 5 UCSF 53 0 NA <NA>
6: 6 HUP 21 0 153 Neurology
7: 7 HUP 63 1 89 Radiology
8: 8 Stanford 22 0 112 Emergency
9: 9 UCSF 19 0 228 Cardiology
Using bracket notation:
b[a, ]
Key: <PID>
PID V1 Department Hospital Age Sex
<int> <num> <char> <char> <num> <num>
1: 1 NA <NA> UCSF 22 1
2: 2 NA <NA> HUP 34 1
3: 3 NA <NA> Stanford 41 0
4: 4 NA <NA> Stanford 19 1
5: 5 NA <NA> UCSF 53 0
6: 6 153 Neurology HUP 21 0
7: 7 89 Radiology HUP 63 1
8: 8 112 Emergency Stanford 22 0
9: 9 228 Cardiology UCSF 19 0
If keys were not set for a and b, you could specify the column to match on using the on
argument:
b[a, on = "PID"]
Key: <PID>
PID V1 Department Hospital Age Sex
<int> <num> <char> <char> <num> <num>
1: 1 NA <NA> UCSF 22 1
2: 2 NA <NA> HUP 34 1
3: 3 NA <NA> Stanford 41 0
4: 4 NA <NA> Stanford 19 1
5: 5 NA <NA> UCSF 53 0
6: 6 153 Neurology HUP 21 0
7: 7 89 Radiology HUP 63 1
8: 8 112 Emergency Stanford 22 0
9: 9 228 Cardiology UCSF 19 0
The easy way to understand the bracket notation merges is to think that the data.table inside the bracket is used to index the data.table on the outside, therefore the resulting table will have rows dictated by the inside table’s key.
38.4 Right outer join
merge(a, b, all.y = TRUE)
Key: <PID>
PID Hospital Age Sex V1 Department
<int> <char> <num> <num> <num> <char>
1: 6 HUP 21 0 153 Neurology
2: 7 HUP 63 1 89 Radiology
3: 8 Stanford 22 0 112 Emergency
4: 9 UCSF 19 0 228 Cardiology
5: 10 <NA> NA NA 91 Surgery
6: 11 <NA> NA NA 190 Neurology
7: 12 <NA> NA NA 101 Psychiatry
Using bracket notation:
a[b, ]
Key: <PID>
PID Hospital Age Sex V1 Department
<int> <char> <num> <num> <num> <char>
1: 6 HUP 21 0 153 Neurology
2: 7 HUP 63 1 89 Radiology
3: 8 Stanford 22 0 112 Emergency
4: 9 UCSF 19 0 228 Cardiology
5: 10 <NA> NA NA 91 Surgery
6: 11 <NA> NA NA 190 Neurology
7: 12 <NA> NA NA 101 Psychiatry