Cheese Join Example ==================================== Let's suppose we had two datasets as follows: .. table:: ``main_df`` :class: longtable =============== ======== ================ Name Group ID Favourite Food =============== ======== ================ \ 2 Egg + Spam John Cleese Spam Eric Idle 3 Spam \ Egg Bacon + Spam Terry Jones Egg Terry Gilliam 2 Spam Carol Cleveland Egg Neil Innes 1 Spam =============== ======== ================ .. table:: ``cheese_df`` :class: longtable =================== ======== ================ =================== Name Group ID Favourite Food Cheese Choice =================== ======== ================ =================== Graham Chapman Egg + Spam Red Leicester John Marwood Cleese 3 Spam Tilsit Eric Idle 3 Eggs Bel Paese Michael Palin Egg Bacon + Spam Red Windsor Terry Jones 2 Egg Stitlton Winnie the Pooh 1 Honey Gruyère Terrence Gilliam 2 Spam Emmental Carol Cleveland 1 Egg Norwegian Jarlsberg Neil Innes 1 Egg Liptauer =================== ======== ================ =================== We might want to join these two datasets together, but we have a problem: the names don't match exactly. In this case, we could write some logic to automatically clean the names and then join the datasets together. However, for a more complicated example, this might not be possible so let's pretend we can't do that. In addition, some of the names are missing entirely. We might come up with the following strategy to join the datasets together: .. image:: ../_static/examples/cheese_join/flowchart_dark.png :class: only-dark :align: center .. image:: ../_static/examples/cheese_join/flowchart_light.png :class: only-light :align: center This would be the expected output for the join: .. table:: ``join_df`` :class: longtable ======== ================ =================== =================== ================= Group ID Favourite Food Name Cheese Choice Join Type ======== ================ =================== =================== ================= 2 Egg Graham Chapman Red Leicester Food 3 Spam John Marwood Cleese Tilsit Food and Group 3 Spam Eric Idle Bel Paese Name \ Egg Bacon + Spam Michael Palin Red Windsor Food 2 Egg Terry Jones Stitlton Name 2 Spam Terry Gilliam Emmental Name 1 Egg Carol Cleveland Norwegian Jarlsberg Name 1 Spam Neil Innes Liptauer Name ======== ================ =================== =================== ================= And here's how you'd set that up using a ``joinplex``-style interface: .. code-block:: python name_join = ColJoin(main_df, cheese_df, on='Name') food_join = ColJoin(main_df, cheese_df, on='Favourite Food') group_id_join = ColJoin(main_df, cheese_df, on='Group ID') # A plex is a general object representing a table operation - e.g. joining or filtering. plex = OneManyNoneFunnel( name_join, one="Name", many=ValueError, none=OneManyNoneFunnel( food_join, one="Food", none=ValueError, many=OneElseFunnel( group_id_join, one="Food and Group", many_or_none=ValueError, ) ) ) join_df = plex.run( label_col = 'Join Type', right_cols = ['Name'], combined_cols = ['Group ID'] ) Doing the same thing directly in ``pandas`` would be very tedious, and difficult to understand at a glance. This example has been fairly straightforward. Joinplex should be able to handle more complicated examples, such as: * Many-to-many, one-to-many and many-to-one joins * Joins with three tables or more involved * Funnelling based on the number of matches on both sides of the join (e.g. one-to-one distinguished from many-to-one) * Other join types than ``ColJoin``-based on a functional comparisons, spatial joins using ``geopandas``, etc. - More complex funnelling configurations involving cycles, multiple uses of the same join, etc., yet being economical with computation. * Precedence: require that one join outcome is reached before another join is completed, and then restrict the possible join candidates only to those which are not already joined. This is useful in situations where you can use “process of elimination” style logic.