import dask.dataframe as dd
import pandas as pd
19 Joins
We are using very small data frames as an example. Such operations would normally be done in dask instead of pandas for much bigger datasets.
= pd.DataFrame({
a "PID": list(range(1, 10)),
"Hospital": ["UCSF", "HUP", "Stanford",
"Stanford", "UCSF", "HUP",
"HUP", "Stanford", "UCSF"],
"Age": [22, 34, 41, 19, 53, 21, 63, 22, 19],
"Sex": [1, 1, 0, 1, 0, 0, 1, 0, 0]
})"PID")
a.set_index(= dd.from_pandas(a, npartitions=1)
a a
Dask DataFrame Structure:
PID | Hospital | Age | Sex | |
---|---|---|---|---|
npartitions=1 | ||||
0 | int64 | object | int64 | int64 |
8 | ... | ... | ... | ... |
Dask Name: from_pandas, 1 graph layer
= pd.DataFrame({
b "PID": list(range(6, 13)),
"V1": [153, 89, 112, 228, 91, 190, 101],
"Department": ["Neurology", "Radiology",
"Emergency", "Cardiology",
"Surgery", "Neurology", "Psychiatry"]
})# Optional, but supposed to help in large datasets
"PID")
b.set_index(= dd.from_pandas(b, npartitions=1)
b b
Dask DataFrame Structure:
PID | V1 | Department | |
---|---|---|---|
npartitions=1 | |||
0 | int64 | int64 | object |
6 | ... | ... | ... |
Dask Name: from_pandas, 1 graph layer
19.1 Inner join
Similar to Pandas
='PID', how='inner').compute() dd.merge(a, b, on
PID | Hospital | Age | Sex | V1 | Department | |
---|---|---|---|---|---|---|
0 | 6 | HUP | 21 | 0 | 153 | Neurology |
1 | 7 | HUP | 63 | 1 | 89 | Radiology |
2 | 8 | Stanford | 22 | 0 | 112 | Emergency |
3 | 9 | UCSF | 19 | 0 | 228 | Cardiology |
or:
='PID', how='inner').compute() a.merge(b, on
PID | Hospital | Age | Sex | V1 | Department | |
---|---|---|---|---|---|---|
0 | 6 | HUP | 21 | 0 | 153 | Neurology |
1 | 7 | HUP | 63 | 1 | 89 | Radiology |
2 | 8 | Stanford | 22 | 0 | 112 | Emergency |
3 | 9 | UCSF | 19 | 0 | 228 | Cardiology |
Since we have set an index on both DataFrames, we can skip ‘on’:
= "inner").compute() dd.merge(a, b, how
PID | Hospital | Age | Sex | V1 | Department | |
---|---|---|---|---|---|---|
0 | 6 | HUP | 21 | 0 | 153 | Neurology |
1 | 7 | HUP | 63 | 1 | 89 | Radiology |
2 | 8 | Stanford | 22 | 0 | 112 | Emergency |
3 | 9 | UCSF | 19 | 0 | 228 | Cardiology |
Note: The pd.join
syntax is simialr to polars, but unlike polars, it only works correctly under specific conditions and will often give wrong results with no error or warning - avoid.
19.2 Outer join
= "PID", how = "outer").compute() dd.merge(a, b, on
PID | Hospital | Age | Sex | V1 | Department | |
---|---|---|---|---|---|---|
0 | 1 | UCSF | 22.0 | 1.0 | NaN | NaN |
1 | 2 | HUP | 34.0 | 1.0 | NaN | NaN |
2 | 3 | Stanford | 41.0 | 0.0 | NaN | NaN |
3 | 4 | Stanford | 19.0 | 1.0 | NaN | NaN |
4 | 5 | UCSF | 53.0 | 0.0 | NaN | NaN |
5 | 6 | HUP | 21.0 | 0.0 | 153.0 | Neurology |
6 | 7 | HUP | 63.0 | 1.0 | 89.0 | Radiology |
7 | 8 | Stanford | 22.0 | 0.0 | 112.0 | Emergency |
8 | 9 | UCSF | 19.0 | 0.0 | 228.0 | Cardiology |
9 | 10 | NaN | NaN | NaN | 91.0 | Surgery |
10 | 11 | NaN | NaN | NaN | 190.0 | Neurology |
11 | 12 | NaN | NaN | NaN | 101.0 | Psychiatry |
19.3 Left outer join
= "PID", how = "left").compute() dd.merge(a, b, on
PID | Hospital | Age | Sex | V1 | Department | |
---|---|---|---|---|---|---|
0 | 1 | UCSF | 22 | 1 | NaN | NaN |
1 | 2 | HUP | 34 | 1 | NaN | NaN |
2 | 3 | Stanford | 41 | 0 | NaN | NaN |
3 | 4 | Stanford | 19 | 1 | NaN | NaN |
4 | 5 | UCSF | 53 | 0 | NaN | NaN |
5 | 6 | HUP | 21 | 0 | 153.0 | Neurology |
6 | 7 | HUP | 63 | 1 | 89.0 | Radiology |
7 | 8 | Stanford | 22 | 0 | 112.0 | Emergency |
8 | 9 | UCSF | 19 | 0 | 228.0 | Cardiology |
19.4 Right outer join
= "PID", how = "right").compute() dd.merge(a, b, on
PID | Hospital | Age | Sex | V1 | Department | |
---|---|---|---|---|---|---|
0 | 6 | HUP | 21.0 | 0.0 | 153 | Neurology |
1 | 7 | HUP | 63.0 | 1.0 | 89 | Radiology |
2 | 8 | Stanford | 22.0 | 0.0 | 112 | Emergency |
3 | 9 | UCSF | 19.0 | 0.0 | 228 | Cardiology |
4 | 10 | NaN | NaN | NaN | 91 | Surgery |
5 | 11 | NaN | NaN | NaN | 190 | Neurology |
6 | 12 | NaN | NaN | NaN | 101 | Psychiatry |