16  Joins

using DataFrames
a = DataFrame(PID = collect(1:9),
              Hospital = ["UCSF", "HUP", "Stanford",
                          "Stanford", "UCSF", "HUP", 
                          "HUP", "Stanford", "UCSF"],
              Age = [22, 34, 41, 19, 53, 21, 63, 22, 19],
              Sex = [1, 1, 0, 1, 0, 0, 1, 0, 0])
9×4 DataFrame
Row PID Hospital Age Sex
Int64 String Int64 Int64
1 1 UCSF 22 1
2 2 HUP 34 1
3 3 Stanford 41 0
4 4 Stanford 19 1
5 5 UCSF 53 0
6 6 HUP 21 0
7 7 HUP 63 1
8 8 Stanford 22 0
9 9 UCSF 19 0
b = DataFrame(PID = collect(6:12),
              V1 = [153, 89, 112, 228,  91, 190, 101],
              Department = ["Neurology", "Radiology",
                            "Emergency", "Cardiology",
                            "Surgery", "Neurology", "Psychiatry"])
7×3 DataFrame
Row PID V1 Department
Int64 Int64 String
1 6 153 Neurology
2 7 89 Radiology
3 8 112 Emergency
4 9 228 Cardiology
5 10 91 Surgery
6 11 190 Neurology
7 12 101 Psychiatry

16.1 Inner join

ab_inner = innerjoin(a, b, on = :PID)
4×6 DataFrame
Row PID Hospital Age Sex V1 Department
Int64 String Int64 Int64 Int64 String
1 6 HUP 21 0 153 Neurology
2 7 HUP 63 1 89 Radiology
3 8 Stanford 22 0 112 Emergency
4 9 UCSF 19 0 228 Cardiology

Note that the resulting table only contains cases found in both data frames (i.e. IDs 6 through 9)

16.2 Outer join

ab_outer = outerjoin(a, b, on = :PID)
12×6 DataFrame
Row PID Hospital Age Sex V1 Department
Int64 String? Int64? Int64? Int64? String?
1 6 HUP 21 0 153 Neurology
2 7 HUP 63 1 89 Radiology
3 8 Stanford 22 0 112 Emergency
4 9 UCSF 19 0 228 Cardiology
5 1 UCSF 22 1 missing missing
6 2 HUP 34 1 missing missing
7 3 Stanford 41 0 missing missing
8 4 Stanford 19 1 missing missing
9 5 UCSF 53 0 missing missing
10 10 missing missing missing 91 Surgery
11 11 missing missing missing 190 Neurology
12 12 missing missing missing 101 Psychiatry

Note that the resulting data frame contains all IDs found in either input data frame and missing values are represented with missing

16.3 Left outer join

ab_leftouter = leftjoin(a, b, on = :PID)
9×6 DataFrame
Row PID Hospital Age Sex V1 Department
Int64 String Int64 Int64 Int64? String?
1 6 HUP 21 0 153 Neurology
2 7 HUP 63 1 89 Radiology
3 8 Stanford 22 0 112 Emergency
4 9 UCSF 19 0 228 Cardiology
5 1 UCSF 22 1 missing missing
6 2 HUP 34 1 missing missing
7 3 Stanford 41 0 missing missing
8 4 Stanford 19 1 missing missing
9 5 UCSF 53 0 missing missing

Note how the resulting data frame contains all IDs present in the left input DataFrame only.

16.4 Right outer join

ab_rightouter = rightjoin(a, b, on = :PID)
7×6 DataFrame
Row PID Hospital Age Sex V1 Department
Int64 String? Int64? Int64? Int64 String
1 6 HUP 21 0 153 Neurology
2 7 HUP 63 1 89 Radiology
3 8 Stanford 22 0 112 Emergency
4 9 UCSF 19 0 228 Cardiology
5 10 missing missing missing 91 Surgery
6 11 missing missing missing 190 Neurology
7 12 missing missing missing 101 Psychiatry

Note how the resulting data frame contains all IDs present in the right input DataFrame only.