15  Joins

import pandas as pd
a = pd.DataFrame({
    "PID": list(range(1, 10)),
    "Hospital": ["UCSF", "HUP", "Stanford",
                "Stanford", "UCSF", "HUP", 
                "HUP", "Stanford", "UCSF"],
    "Age": [22, 34, 41, 19, 53, 21, 63, 22, 19],
    "Sex": [1, 1, 0, 1, 0, 0, 1, 0, 0]
})
a.set_index("PID")
Hospital Age Sex
PID
1 UCSF 22 1
2 HUP 34 1
3 Stanford 41 0
4 Stanford 19 1
5 UCSF 53 0
6 HUP 21 0
7 HUP 63 1
8 Stanford 22 0
9 UCSF 19 0
b = pd.DataFrame({
    "PID": list(range(6, 13)),
    "V1": [153, 89, 112, 228,  91, 190, 101],
    "Department": ["Neurology", "Radiology",
                   "Emergency", "Cardiology",
                   "Surgery", "Neurology", "Psychiatry"]
})
# Optional, but supposed to help in large datasets
b.set_index("PID")
V1 Department
PID
6 153 Neurology
7 89 Radiology
8 112 Emergency
9 228 Cardiology
10 91 Surgery
11 190 Neurology
12 101 Psychiatry

15.1 Inner join

how argument default to “left”:

pd.merge(a, b, on = "PID", how = "inner")
PID Hospital Age Sex V1 Department
0 6 HUP 21 0 153 Neurology
1 7 HUP 63 1 89 Radiology
2 8 Stanford 22 0 112 Emergency
3 9 UCSF 19 0 228 Cardiology

Since we have set an index on both DataFrames, we can skip ‘on’:

pd.merge(a, b, how = "inner")
PID Hospital Age Sex V1 Department
0 6 HUP 21 0 153 Neurology
1 7 HUP 63 1 89 Radiology
2 8 Stanford 22 0 112 Emergency
3 9 UCSF 19 0 228 Cardiology

Note: The pd.join syntax is simialr to polars, but unlike polars, it only works correctly under specific conditions and will often give wrong results with no error or warning - avoid.

15.2 Outer join

pd.merge(a, b, on = "PID", how = "outer")
PID Hospital Age Sex V1 Department
0 1 UCSF 22.0 1.0 NaN NaN
1 2 HUP 34.0 1.0 NaN NaN
2 3 Stanford 41.0 0.0 NaN NaN
3 4 Stanford 19.0 1.0 NaN NaN
4 5 UCSF 53.0 0.0 NaN NaN
5 6 HUP 21.0 0.0 153.0 Neurology
6 7 HUP 63.0 1.0 89.0 Radiology
7 8 Stanford 22.0 0.0 112.0 Emergency
8 9 UCSF 19.0 0.0 228.0 Cardiology
9 10 NaN NaN NaN 91.0 Surgery
10 11 NaN NaN NaN 190.0 Neurology
11 12 NaN NaN NaN 101.0 Psychiatry

15.3 Left outer join

pd.merge(a, b, on = "PID", how = "left")
PID Hospital Age Sex V1 Department
0 1 UCSF 22 1 NaN NaN
1 2 HUP 34 1 NaN NaN
2 3 Stanford 41 0 NaN NaN
3 4 Stanford 19 1 NaN NaN
4 5 UCSF 53 0 NaN NaN
5 6 HUP 21 0 153.0 Neurology
6 7 HUP 63 1 89.0 Radiology
7 8 Stanford 22 0 112.0 Emergency
8 9 UCSF 19 0 228.0 Cardiology

15.4 Right outer join

There is no how = "right" option, so switch places of DataFrames:

pd.merge(a, b, on = "PID", how = "right")
PID Hospital Age Sex V1 Department
0 6 HUP 21.0 0.0 153 Neurology
1 7 HUP 63.0 1.0 89 Radiology
2 8 Stanford 22.0 0.0 112 Emergency
3 9 UCSF 19.0 0.0 228 Cardiology
4 10 NaN NaN NaN 91 Surgery
5 11 NaN NaN NaN 190 Neurology
6 12 NaN NaN NaN 101 Psychiatry