Code Sample
import numpy as np
import pandas as pd
from random import randint

def bug_report(n=2000000, idmax=22750, prodmax=3414341):
    ids = [randint(1, idmax) for _ in range(n)]
    r = lambda: randint(1, prodmax)
    # The first two rows share the same tuple, so at least one product occurs twice
    prods = [(-1,-2,-3), (-1,-2,-3)] + [(r(), r(), r()) for _ in range(n-2)]
    df = pd.DataFrame({'ids': ids, 'products': prods})
    counts = df['products'].value_counts()
    # Keep only the products that appear at least twice
    counts_idxs = counts[counts >= 2].index
    idxs = df['products'].isin(counts_idxs)
    return df[idxs]
Problem description
There are several ways to trigger the bug, all of them resulting in isin returning all False whereas some entries should be True.
Take the example above: the tuple (-1,-2,-3) is repeated twice, and it can be checked that counts and counts_idxs contain 2 and (-1,-2,-3), respectively. Then, independently of the rest of the products, the dataset obtained by taking the idxs from isin should have at least 2 items. Calling the function as is, it does not. Explanation, causes and possible solutions below:
Manually importing isin with from pandas.core.algorithms import isin and setting idxs = isin(df['products'], counts[counts >= 2].index) results in the exact same behaviour.
I've tried to reproduce this behaviour without using tuples at all and have not succeeded.
Proposed solution
This seems to be a regression in 0.20.x, as the latest 0.19.x (0.19.2) works perfectly fine. Indeed, manually copying isin from 0.19.x and using it instead of the 0.20.x version works. One can see that a particular if was reversed/erased by comparing
https://github.com/pandas-dev/pandas/blob/master/pandas/core/algorithms.py#L414
and
https://github.com/pandas-dev/pandas/blob/v0.19.2/pandas/core/algorithms.py#L144
https://github.com/pandas-dev/pandas/blob/v0.19.2/pandas/core/algorithms.py#L161
As a result, 0.20.x relies on numpy.in1d, whereas 0.19.x used lib.ismember, which is equivalent to htable.ismember_object in 0.20.x. One can confirm this because:
htable = pandas._libs.hashtable
idxs = htable.ismember_object(df['products'].values, np.asarray(counts[counts >= 2].index))
df[idxs]
works fine, whereas
idxs = np.in1d(df['products'].values, np.asarray(counts[counts >= 2].index))
df[idxs]
silently fails.
Now, either this is temporarily fixed in pandas by not relying on in1d, or an issue is submitted to numpy (which I will do once I can take a look at in1d and see what's happening). Alternatively, one can work around it by not using tuples at all, for example by applying hash beforehand.
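A minimal sketch of the hashing workaround mentioned above, written in plain Python so it does not depend on any pandas version. The helper name duplicated_products and the min_count parameter are hypothetical, introduced only for illustration. It assumes hash collisions between distinct tuples are acceptable (a collision would produce a false positive):

```python
from collections import Counter
from random import Random

def duplicated_products(prods, min_count=2):
    """Return a boolean mask marking tuples that occur at least min_count times.

    Each tuple is replaced by its integer hash, so the membership test
    only ever compares plain integers (hypothetical helper, not pandas API).
    """
    keys = [hash(p) for p in prods]               # one integer per tuple
    counts = Counter(keys)                        # occurrences of each hash
    wanted = {k for k, c in counts.items() if c >= min_count}
    return [k in wanted for k in keys]

# Small stand-in for the data built in bug_report()
rng = Random(0)
prodmax = 3414341
prods = [(-1, -2, -3), (-1, -2, -3)] + [
    (rng.randint(1, prodmax), rng.randint(1, prodmax), rng.randint(1, prodmax))
    for _ in range(1000)
]

mask = duplicated_products(prods)
selected = [p for p, keep in zip(prods, mask) if keep]
```

With a DataFrame, the same idea would amount to storing the hashes in an integer column and calling isin on that column instead of on the tuples.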
I've narrowed the problem down a bit more, and it is related not only to n but also to prodmax:
Any combination with n > 1000000 && prodmax > 1986 produces an empty dataframe:
bug_report(n=1000001, prodmax=1987)
bug_report(n=1000001)
bug_report()
Whereas having n <= 1000000 or prodmax <= 1986 works just fine. Parameter values have been deduced from:
n: from https://github.com/pandas-dev/pandas/blob/master/pandas/core/algorithms.py#L414
prodmax: by binary search:
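def narrow():
    # Binary search over prodmax; assumes start works and end fails
    start = 256
    end = 2048
    while start + 1 < end:
        print(start, end)
        df = bug_report(n=1000001, prodmax=(start + end) // 2)
        if df.empty:
            # Midpoint reproduces the bug: the threshold is at or below it
            end = (start + end) // 2
        else:
            # Midpoint still works: the threshold is above it
            start = (start + end) // 2
    return start, df.empty
narrow()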
# (1896, False)
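The search above rebuilds a DataFrame with a million rows at every step. As a sketch of the same bisection decoupled from pandas, the version below takes the failure check as a function. The names last_working and fails are hypothetical, and the threshold of 1000 in the usage lines is a made-up stand-in, not a measured value. It assumes the check is monotone: it passes at lo, fails at hi, and never goes back to passing once it fails.

```python
def last_working(fails, lo, hi):
    """Return the largest value in [lo, hi) for which fails(value) is False.

    Assumes fails(lo) is False, fails(hi) is True, and fails is monotone.
    """
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if fails(mid):
            hi = mid      # bug reproduced: threshold is below mid
        else:
            lo = mid      # still working: threshold is at or above mid
    return lo

# Stand-in predicate with a made-up threshold, for illustration only
threshold = last_working(lambda prodmax: prodmax > 1000, 256, 2048)
print(threshold)
```

For the real case, fails could be something like lambda p: bug_report(n=1000001, prodmax=p).empty.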
Output of pd.show_versions()
This has been confirmed and tested on multiple PCs and environments, always with Python 3.x.
Comment From: jorisvandenbossche
There has been a related issue and fix (https://github.com/pandas-dev/pandas/issues/16012, https://github.com/pandas-dev/pandas/pull/16969). So it might be this is fixed in the meantime in master. Would you be able to test that with the 0.21.0 release candidate? (published a few days ago, see https://groups.google.com/forum/#!topic/pydata/kxVdd6ZHjvI)
Comment From: gpascualg
I did a fresh install of python 3.6 and upgraded to 0.21.0-rc1 and can confirm the issue is no longer there, it has been solved.
For reference:
I am closing the issue as it is already solved. If needed, feel free to reopen!
Thank you!
Comment From: jorisvandenbossche
Thanks for testing!