Code Sample
import time
import pandas as pd

# Set of the first 100,000,000 perfect squares.
squares = {a**2 for a in range(100000000)}
series = pd.Series(range(100))

# Time element-wise membership tests against the set.
start = time.time()
apply_result = series.apply(lambda x: x in squares)
apply_end = time.time()

# Time the same check through isin.
isin_result = series.isin(squares)
isin_end = time.time()

# Both approaches must give the same answer.
assert (apply_result == isin_result).all()
print("pandas.Series.apply() took {} seconds and pandas.Series.isin() took {} seconds.".format(apply_end - start, isin_end - apply_end))
Output:
pandas.Series.apply() took 0.0044422149658203125 seconds and pandas.Series.isin() took 72.23143887519836 seconds.
Problem description
When a set is passed to pandas.Series.isin, the set is converted to a list, before being converted back to a hash table. Consequently, the run time is linear in the size of the set, which is not ideal because one of the main reasons to use a set is that membership can be tested in constant time.
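The cost described here can be illustrated without pandas. This is only a minimal sketch, not pandas' actual implementation: `isin_via_list` is a hypothetical stand-in that copies the set into a list and rebuilds a hash table before probing, while `isin_direct` probes the existing set as is. The `touched` counter is an illustrative measure of how many elements are handled before the first lookup.

```python
def isin_via_list(queries, values):
    """Hypothetical stand-in for the reported behaviour: copy, rehash, then probe."""
    as_list = list(values)        # O(len(values)) copy of the set
    table = set(as_list)          # O(len(values)) rebuild of a hash table
    touched = 2 * len(as_list)    # elements handled before any lookup happens
    return [q in table for q in queries], touched


def isin_direct(queries, values):
    """Probe the existing set directly: O(1) expected per query, no copy."""
    return [q in values for q in queries], 0


# Much smaller than the 10**8 elements in the report, so this runs instantly.
squares = {a * a for a in range(10_000)}
queries = range(100)

slow_result, slow_touched = isin_via_list(queries, squares)
fast_result, fast_touched = isin_direct(queries, squares)

assert slow_result == fast_result   # same answers either way
print(slow_touched, fast_touched)   # setup work grows with the set, not the queries
```

The setup work in `isin_via_list` scales with the size of the set even though only 100 values are queried, which matches the shape of the timings reported above.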
Suggested improvements
The quick and dirty workaround is to use pandas.Series.apply (as in the above code sample) instead of pandas.Series.isin. I'm not familiar enough with pandas internals to know whether there are edge cases where this would fail or whether it would be a bad idea to incorporate this workaround into isin directly. I would suggest, however, that at a minimum the documentation for isin be updated to mention that a set will be converted to a list and that this has performance implications, so that users can choose an alternative approach. (I am happy to contribute the documentation if this is the preferred solution.)
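Until isin itself changes, the workaround could be wrapped in a small helper. The sketch below assumes a hypothetical function name, `isin_fast`, which is not part of pandas: it probes sets directly through `Series.apply` and defers to `Series.isin` for everything else. Missing values and mixed dtypes may behave differently from isin and are not handled here.

```python
import pandas as pd


def isin_fast(series, values):
    """Hypothetical helper: probe sets directly, defer to Series.isin otherwise."""
    if isinstance(values, (set, frozenset)):
        # Each lookup is an O(1) expected hash probe; the set is never copied.
        return series.apply(lambda x: x in values)
    return series.isin(values)


squares = {a * a for a in range(1_000)}
series = pd.Series(range(100))

result = isin_fast(series, squares)

# Sanity check against isin called with an explicit list.
assert (result == series.isin(list(squares))).all()
print(int(result.sum()))
```

The check against `series.isin(list(squares))` only covers plain integers; it says nothing about the edge cases mentioned above.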
Output of pd.show_versions()
Comment From: WillAyd
Makes sense. There is already an ASV benchmark for Series.isin, so if you want to submit a PR, you could try removing the list call, running the test suite to make sure there are no regressions, and posting the ASV output to confirm the performance improvement.
Comment From: Sujingqiao
Impressive.