Code Sample

import pandas as pd
import time

squares = set(a**2 for a in range(100000000))
series = pd.Series(range(100))

start = time.time()
apply_result = series.apply(lambda x: x in squares)
apply_end = time.time()
isin_result = series.isin(squares)
isin_end = time.time()

assert (apply_result == isin_result).all()

print("pandas.Series.apply() took {} seconds and pandas.Series.isin() took {} seconds.".format(apply_end - start, isin_end - apply_end))

Output:

pandas.Series.apply() took 0.0044422149658203125 seconds and pandas.Series.isin() took 72.23143887519836 seconds.

Problem description

When a set is passed to pandas.Series.isin, the set is converted to a list before being converted back to a hash table. Consequently, the run time is linear in the size of the set, regardless of how short the Series is. This is not ideal, because one of the main reasons to use a set is that membership can be tested in constant time.
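To make the cost concrete, here is a minimal pure-Python sketch of the behaviour described above. It is only a model under the assumption that the conversion works as stated, not pandas' actual implementation; the function names `isin_via_list` and `isin_direct` are hypothetical.

```python
def isin_via_list(series_values, candidates):
    """Model of the described behaviour (an assumption, not pandas' real code)."""
    as_list = list(candidates)   # copies every element: O(len(candidates))
    table = set(as_list)         # rebuilds a hash table: O(len(candidates)) again
    return [v in table for v in series_values]


def isin_direct(series_values, candidates):
    """Uses the existing set as-is: O(len(series_values)) lookups in total."""
    return [v in candidates for v in series_values]
```

Both functions return the same answers. The first does work proportional to the size of the set before it performs a single lookup, which is where the time goes when the set is huge and the Series is tiny.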

Suggested improvements

The quick and dirty workaround is to use pandas.Series.apply (as in the code sample above) instead of pandas.Series.isin. I'm not familiar enough with pandas internals to know whether there are edge cases where this would fail, or whether it would be a bad idea to incorporate this workaround into isin directly. At a minimum, however, I would suggest updating the documentation for isin to mention that a set is converted to a list and that this has performance implications, so that users can choose an alternative approach. (I am happy to contribute the documentation change if this is the preferred solution.)
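The workaround could be wrapped in a small helper along these lines. This is a sketch only: `isin_set` is a hypothetical name, not part of pandas, and it has not been checked against the edge cases mentioned above. In particular, missing values (NaN) and mixed-type data may be handled differently than by isin.

```python
import pandas as pd


def isin_set(series: pd.Series, candidates: set) -> pd.Series:
    """Hypothetical helper: test membership against an existing set.

    Relies on Python's `in` operator for each element, so the set is
    never copied or converted.
    """
    return series.apply(lambda x: x in candidates).astype(bool)
```

For example, `isin_set(pd.Series(range(100)), squares)` would stand in for `series.isin(squares)` in the code sample. The trade-off is a Python-level function call per element, so this is likely to pay off only when the set is much larger than the Series.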

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Darwin
OS-release: 18.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.24.1
pytest: 3.0.5
pip: 19.0.3
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.16.2
scipy: 0.18.1
pyarrow: None
xarray: None
IPython: 5.3.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.1
feather: None
matplotlib: 2.0.0
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml.etree: 3.7.2
bs4: 4.5.3
html5lib: 0.9999999
sqlalchemy: 1.1.5
pymysql: None
psycopg2: 2.7.4 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

Comment From: WillAyd

Makes sense. There is already an ASV benchmark for Series.isin, so if you want to submit a PR, you could try removing the list call, running the test suite to make sure there are no regressions, and posting the ASV output to confirm the performance improvement.
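For reference, a benchmark covering the set case might look roughly like the sketch below. It follows the general ASV convention of a class with a `setup` method and `time_`-prefixed methods; the class name, method name, and sizes are illustrative assumptions, not the existing pandas benchmark.

```python
import pandas as pd


class IsInSet:
    """Hypothetical ASV-style benchmark for Series.isin with a set argument."""

    def setup(self):
        # Small Series, large set: the case reported in this issue.
        self.series = pd.Series(range(100))
        self.values = set(range(1_000_000))

    def time_isin_set(self):
        # ASV times the body of methods whose names start with "time_".
        self.series.isin(self.values)
```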

Comment From: Sujingqiao