Is your feature request related to a problem?
I wish I could use the Pandas DataFrame.join
function to perform left semi joins: selecting each row from the left DataFrame that has at least one matching row in the right DataFrame.
Describe the solution you'd like
I would like to be able to pass leftsemi
to the how
parameter in the DataDrame.join
function to perform this type of join.
Additional context
For example: in SQL this might look like
SELECT *
FROM table1
WHERE EXISTS
(SELECT 1
FROM table2
WHERE table1.id = table2.id)
Comment From: jreback
this is just a join after you isin on the rhs - would take a PR for this naive implementation
Comment From: rhshadrach
If there are duplicate matches on the RHS, I think a left semijoin would only produce a single row. But I think this can be produced with a drop duplicates on the left columns.
Comment From: aneesh98
Hi I would like to work on this enhancement, if no one is working on this. I have few questions: 1) Incase of column name mentioned in 'on' param provided, a .isin must be performed on the same 'on' value in rhs df right for the unique values of the 'on' column in lhs?. 2) If no 'on' param value is provided, then the isin should be performed on the index of the rhs? 3) The columns of the rhs dataframe present in the resulting dataframe from .join be dropped? Because the query selects only the rows in lhs df for which there are matching rows in rhs df.
Comment From: debnathshoham
Hi @aneesh98, just wanted to check if you are working on this? Otherwise I would like to pick this up.
Comment From: aneesh98
Hi @debnathshoham, I am actually still working on it.
Comment From: debnathshoham
Great!
To answer the questions above:
1. Correct
2. Merge can happen using on
, left_on
& right_on
, right_index
& left_index
, left_index
& right_on
and right_index
& left_on
. If none of these are provided, it tries on
the common columns.
3. AFAIU semi join should only returns rows from left
that has matching rows in right
. It won't add columns from right
.
Hope that helps.
Comment From: aneesh98
Hi @debnathshoham. Thank you so much for reply. Can you please confirm if this approach is correct then? 1) If 'on' is None and join is to be done on index-to-index, then find out the rows present in the left dataframe which have matching index values in the right dataframe 2) If 'on' is provided, then perform an inner join based on the values provided in 'on', discard the columns from right dataframe and remove duplicates.
Please let me know if I am correct with this logic, or if I am missing any case or anything at all?. Thank you.
Comment From: debnathshoham
Merge can happen using on, left_on & right_on, right_index & left_index, left_index & right_on and right_index & left_on. If none of these are provided, it tries on the common columns.
Yes, actually you have just considered 2 cases from the 5 I mentioned (you will get a clear picture, if you play around with all these existing merging procedures).
The idea is really simple, in leftsemi
, check for the keys in left
(can be either index(s) or column(s)) isin
the keys in the right
(again, could be either index(s) or column(s)). And return those keys from left
.
You can check this example from stack. Hope this will clarify.
Comment From: kurzdev
Any updates on this?