Simple Join Example ==================================== Let's suppose we had two datasets as follows: .. table:: ``main_df`` :class: longtable =============== ======== Fruit ID Num =============== ======== Banana 1 Orange 2 Apple 3 =============== ======== .. table:: ``colour_df`` :class: longtable =================== ========= ======== Fruit ID Num Colour =================== ========= ======== \ 1 Yellow Orange \ Orange Apple 3 Red Apple 4 Green =================== ========= ======== We might want to join these two datasets together, but we have a problem: the ``colour_df`` dataset is missing a value in the ``Fruit`` column for the first row. Likewise, the it is missing a value in the ``ID Num`` column for the second row. We might come up with the following strategy to join the datasets together: .. image:: ../_static/examples/simple_join/flowchart_dark.png :class: only-dark :align: center .. image:: ../_static/examples/simple_join/flowchart_light.png :class: only-light :align: center This would be the expected output for the join: .. table:: ``join_df`` :class: longtable =============== ======== ======== ============ Fruit ID Num Colour Join Type =============== ======== ======== ============ Banana 1 Yellow ID Num Orange 2 Orange Fruit Name Apple 3 Red ID Num =============== ======== ======== ============ How do we get that output? - The first row of ``main_df`` has a value of ``"Banana"`` in the ``"Fruit"`` column. We can see that the ``"Fruit"`` column in ``colour_df`` has no value equal to ``"Banana"``. So it instead, we use the value in the ``"ID Num"`` column. The value in the ``"ID Num"`` column in ``main_df`` is ``1``, which appears exactly once in the ``"ID Num"`` column in ``colour_df``. So we use that value to join the two rows together, noting that the ``Join Type`` is ``"ID Num"``-based: =============== ======== ======== ============ Fruit ID Num Colour Join Type =============== ======== ======== ============ Banana 1 Yellow ID Num =============== ======== ======== ============ - The second row of ``main_df`` has a value of ``"Orange"`` in the ``"Fruit"`` column. We can see that the ``"Fruit"`` column in ``colour_df`` has exactly one value equal to ``"Orange"``. So we use that value to join the two rows together, noting that the ``Join Type`` is ``"Fruit Name"``-based: =============== ======== ======== ============ Fruit ID Num Colour Join Type =============== ======== ======== ============ Orange 2 Orange Fruit Name =============== ======== ======== ============ - The third row of ``main_df`` has a value of ``"Apple"`` in the ``"Fruit"`` column. We can see that the ``"Fruit"`` column in ``colour_df`` has two values equal to ``"Apple"``. This leaves an ambiguity, so in addition we begin to use the value in the ``"ID Num"`` column. The value in the ``"ID Num"`` column in ``main_df`` is ``3``, which appears exactly once in the ``"ID Num"`` column in ``colour_df``. So we use that value to join the two rows together, noting that the ``Join Type`` is ``"ID Num"``-based: =============== ======== ======== ============ Fruit ID Num Colour Join Type =============== ======== ======== ============ Apple 3 Red ID Num =============== ======== ======== ============ And here's how you'd set that up using a ``joinplex``-style interface: .. code-block:: python name_join = ColJoin(main_df, colour_df, on='Fruit') id_join = ColJoin(main_df, colour_df, on='ID Num') # A plex is a general object representing a table operation - e.g. joining or filtering. plex = OneElseFunnel( name_join, one="Fruit Name", many_or_none=OneElseFunnel( id_join, one="ID Num", many_or_none=ValueError, ) ) join_df = plex.run( label_col = 'Join Type', left_cols = ['Fruit', 'ID Num'] ) And here's equivalent ``pandas`` code: .. code-block:: python main_df["row_id"] = range(len(main_df)) name_join_df = main_df.merge(colour_df, how="inner", on="Fruit", suffixes=("", "_Name")) name_join_df.drop(columns=["ID Num_Name"], inplace=True) dupl_idx = name_join_df.duplicated(subset=main_df.columns, keep=False) dupl_name_join_df = name_join_df.loc[dupl_idx] one_name_join_df = name_join_df.loc[~dupl_idx] main_dupl_idx = main_df.index.isin(dupl_name_join_df.index) dupl_df = main_df.loc[main_dupl_idx] no_name_join_df = main_df.loc[~main_dupl_idx] id_dupl_join_df = dupl_df.merge(colour_df, how="inner", on=["Fruit", "ID Num"]) id_no_join_df = no_name_join_df.merge( colour_df, how="inner", on=["ID Num"], suffixes=("", "_ID") ) id_no_join_df.drop(columns=["Fruit_ID"], inplace=True) one_name_join_df["Join Type"] = "Fruit Name" id_dupl_join_df["Join Type"] = "ID Num" id_no_join_df["Join Type"] = "ID Num" join_df = pd.concat([one_name_join_df, id_dupl_join_df, id_no_join_df], axis="index") join_df.sort_values("row_id", inplace=True) join_df.reset_index(drop=True, inplace=True) join_df.drop(columns="row_id", inplace=True) main_df.drop(columns="row_id", inplace=True) Oh No! Where to begin? This ``pandas`` code has some issues: - The main one is that it's not very readable. It's very long, circuitous, and has a lot of intermediate variables and tables. It's hard to tell what's going on at a glance - and hard to tell whether there are any bugs. - Another issue is that there's plenty of other ways to create equivalent ``pandas`` code. There's not "One, and only one, obvious way to do it".