Simple Join Example#

Let’s suppose we had two datasets as follows:

main_df#

Fruit

ID Num

Banana

1

Orange

2

Apple

3

colour_df#

Fruit

ID Num

Colour

1

Yellow

Orange

Orange

Apple

3

Red

Apple

4

Green

We might want to join these two datasets together, but we have a problem: the colour_df dataset is missing a value in the Fruit column for the first row. Likewise, the it is missing a value in the ID Num column for the second row.

We might come up with the following strategy to join the datasets together:

../_images/flowchart_dark1.png ../_images/flowchart_light1.png

This would be the expected output for the join:

join_df#

Fruit

ID Num

Colour

Join Type

Banana

1

Yellow

ID Num

Orange

2

Orange

Fruit Name

Apple

3

Red

ID Num

How do we get that output?

  • The first row of main_df has a value of "Banana" in the "Fruit" column. We can see that the "Fruit" column in colour_df has no value equal to "Banana". So it instead, we use the value in the "ID Num" column. The value in the "ID Num" column in main_df is 1, which appears exactly once in the "ID Num" column in colour_df. So we use that value to join the two rows together, noting that the Join Type is "ID Num"-based:

Fruit

ID Num

Colour

Join Type

Banana

1

Yellow

ID Num

  • The second row of main_df has a value of "Orange" in the "Fruit" column. We can see that the "Fruit" column in colour_df has exactly one value equal to "Orange". So we use that value to join the two rows together, noting that the Join Type is "Fruit Name"-based:

Fruit

ID Num

Colour

Join Type

Orange

2

Orange

Fruit Name

  • The third row of main_df has a value of "Apple" in the "Fruit" column. We can see that the "Fruit" column in colour_df has two values equal to "Apple". This leaves an ambiguity, so in addition we begin to use the value in the "ID Num" column. The value in the "ID Num" column in main_df is 3, which appears exactly once in the "ID Num" column in colour_df. So we use that value to join the two rows together, noting that the Join Type is "ID Num"-based:

Fruit

ID Num

Colour

Join Type

Apple

3

Red

ID Num

And here’s how you’d set that up using a joinplex-style interface:

name_join = ColJoin(main_df, colour_df, on='Fruit')
id_join = ColJoin(main_df, colour_df, on='ID Num')

# A plex is a general object representing a table operation - e.g. joining or filtering.
plex = OneElseFunnel(
    name_join,
    one="Fruit Name",
    many_or_none=OneElseFunnel(
        id_join,
        one="ID Num",
        many_or_none=ValueError,
    )
)

join_df = plex.run(
    label_col = 'Join Type',
    left_cols = ['Fruit', 'ID Num']
)

And here’s equivalent pandas code:

main_df["row_id"] = range(len(main_df))
name_join_df = main_df.merge(colour_df, how="inner", on="Fruit", suffixes=("", "_Name"))
name_join_df.drop(columns=["ID Num_Name"], inplace=True)

dupl_idx = name_join_df.duplicated(subset=main_df.columns, keep=False)
dupl_name_join_df = name_join_df.loc[dupl_idx]
one_name_join_df = name_join_df.loc[~dupl_idx]

main_dupl_idx = main_df.index.isin(dupl_name_join_df.index)
dupl_df = main_df.loc[main_dupl_idx]
no_name_join_df = main_df.loc[~main_dupl_idx]

id_dupl_join_df = dupl_df.merge(colour_df, how="inner", on=["Fruit", "ID Num"])
id_no_join_df = no_name_join_df.merge(
    colour_df, how="inner", on=["ID Num"], suffixes=("", "_ID")
)
id_no_join_df.drop(columns=["Fruit_ID"], inplace=True)

one_name_join_df["Join Type"] = "Fruit Name"
id_dupl_join_df["Join Type"] = "ID Num"
id_no_join_df["Join Type"] = "ID Num"

join_df = pd.concat([one_name_join_df, id_dupl_join_df, id_no_join_df], axis="index")
join_df.sort_values("row_id", inplace=True)
join_df.reset_index(drop=True, inplace=True)
join_df.drop(columns="row_id", inplace=True)
main_df.drop(columns="row_id", inplace=True)

Oh No! Where to begin? This pandas code has some issues:

  • The main one is that it’s not very readable. It’s very long, circuitous, and has a lot of intermediate variables and tables. It’s hard to tell what’s going on at a glance - and hard to tell whether there are any bugs.

  • Another issue is that there’s plenty of other ways to create equivalent pandas code. There’s not “One, and only one, obvious way to do it”.