Cheese Join Example#
Let’s suppose we had two datasets as follows:
Name |
Group ID |
Favourite Food |
|---|---|---|
2 |
Egg + Spam |
|
John Cleese |
Spam |
|
Eric Idle |
3 |
Spam |
Egg Bacon + Spam |
||
Terry Jones |
Egg |
|
Terry Gilliam |
2 |
Spam |
Carol Cleveland |
Egg |
|
Neil Innes |
1 |
Spam |
Name |
Group ID |
Favourite Food |
Cheese Choice |
|---|---|---|---|
Graham Chapman |
Egg + Spam |
Red Leicester |
|
John Marwood Cleese |
3 |
Spam |
Tilsit |
Eric Idle |
3 |
Eggs |
Bel Paese |
Michael Palin |
Egg Bacon + Spam |
Red Windsor |
|
Terry Jones |
2 |
Egg |
Stitlton |
Winnie the Pooh |
1 |
Honey |
Gruyère |
Terrence Gilliam |
2 |
Spam |
Emmental |
Carol Cleveland |
1 |
Egg |
Norwegian Jarlsberg |
Neil Innes |
1 |
Egg |
Liptauer |
We might want to join these two datasets together, but we have a problem: the names don’t match exactly. In this case, we could write some logic to automatically clean the names and then join the datasets together. However, for a more complicated example, this might not be possible so let’s pretend we can’t do that. In addition, some of the names are missing entirely.
We might come up with the following strategy to join the datasets together:
This would be the expected output for the join:
Group ID |
Favourite Food |
Name |
Cheese Choice |
Join Type |
|---|---|---|---|---|
2 |
Egg |
Graham Chapman |
Red Leicester |
Food |
3 |
Spam |
John Marwood Cleese |
Tilsit |
Food and Group |
3 |
Spam |
Eric Idle |
Bel Paese |
Name |
Egg Bacon + Spam |
Michael Palin |
Red Windsor |
Food |
|
2 |
Egg |
Terry Jones |
Stitlton |
Name |
2 |
Spam |
Terry Gilliam |
Emmental |
Name |
1 |
Egg |
Carol Cleveland |
Norwegian Jarlsberg |
Name |
1 |
Spam |
Neil Innes |
Liptauer |
Name |
And here’s how you’d set that up using a joinplex-style interface:
name_join = ColJoin(main_df, cheese_df, on='Name')
food_join = ColJoin(main_df, cheese_df, on='Favourite Food')
group_id_join = ColJoin(main_df, cheese_df, on='Group ID')
# A plex is a general object representing a table operation - e.g. joining or filtering.
plex = OneManyNoneFunnel(
name_join,
one="Name",
many=ValueError,
none=OneManyNoneFunnel(
food_join,
one="Food",
none=ValueError,
many=OneElseFunnel(
group_id_join,
one="Food and Group",
many_or_none=ValueError,
)
)
)
join_df = plex.run(
label_col = 'Join Type',
right_cols = ['Name'],
combined_cols = ['Group ID']
)
Doing the same thing directly in pandas would be very tedious, and difficult to
understand at a glance.
This example has been fairly straightforward. Joinplex should be able to handle more complicated examples, such as:
Many-to-many, one-to-many and many-to-one joins
Joins with three tables or more involved
Funnelling based on the number of matches on both sides of the join (e.g. one-to-one distinguished from many-to-one)
Other join types than
ColJoin-based on a functional comparisons, spatial joins usinggeopandas, etc. - More complex funnelling configurations involving cycles, multiple uses of the same join, etc., yet being economical with computation.Precedence: require that one join outcome is reached before another join is completed, and then restrict the possible join candidates only to those which are not already joined. This is useful in situations where you can use “process of elimination” style logic.