Pandas Series.all much slower than Series.values.all

Code Sample, a copy-pastable example if possible


import numpy as np
import pandas as pd

# Series of bools

s = pd.Series(np.random.randint(0, 2, 100000)).astype(bool)

# ~1.45 ms
%timeit s.any(skipna=True)

# ~1.35 ms
%timeit s.any(skipna=False)

# ~6.5 us - Note that I get a message about possible caching, but
# even after multiplying by worst case multiplier, still an order of
# magnitude faster than s.any()
%timeit s.values.any()


# Series of ints

s2 = pd.Series(np.random.randint(0, 2, 100000))

# ~330 us
%timeit s2.any(skipna=True)

# ~280 us
%timeit s2.any(skipna=False)

# ~90 us - No possible caching warning on this one
%timeit s2.values.any()

Problem description

Calling Series.any is much slower than calling Series.values.any on a series of bools. Interestingly, calling Series.any on a series of ints is quite a bit faster than on a series of bools, though even for a series of ints, Series.values.any is still faster.

I ran with both skipna=True and skipna=False in case it was an issue of how NaNs are being handled.

I see the same time differences with Series.all
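For anyone who wants to reproduce this outside IPython, here is a minimal sketch using the standard timeit module instead of %timeit. The helper name time_call, the fixed seed, and the repeat counts are my own choices for illustration, and the absolute numbers will depend on the machine and on the pandas/numpy versions installed.

```python
import timeit

import numpy as np
import pandas as pd


def time_call(func, number=200):
    """Return the best average time per call, in microseconds."""
    best = min(timeit.repeat(func, number=number, repeat=5))
    return best / number * 1e6


# Fixed seed so the data is the same on every run (an assumption, the
# original report used unseeded np.random.randint).
rng = np.random.RandomState(0)
bool_series = pd.Series(rng.randint(0, 2, 100000)).astype(bool)
int_series = pd.Series(rng.randint(0, 2, 100000))

timings = {}
for label, series in [("bool", bool_series), ("int", int_series)]:
    for method in ("any", "all"):
        # Reduction called through pandas.
        timings[(label, method, "Series")] = time_call(getattr(series, method))
        # Same reduction called directly on the underlying NumPy array.
        timings[(label, method, "ndarray")] = time_call(
            getattr(series.values, method)
        )

for key, microseconds in sorted(timings.items()):
    print(key, "%.1f us" % microseconds)
```

This times the calls with the default skipna only; passing skipna explicitly would need a small lambda wrapper around each call.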

Expected Output

I would expect the performance to be comparable. Maybe not exactly the same, but not order(s) of magnitude slower.

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.4.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 45 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.23.3
pytest: 3.3.0
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.1
numpy: 1.11.1
scipy: 0.18.0
pyarrow: None
xarray: None
IPython: 5.1.0
sphinx: 1.4.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.6.1
feather: None
matplotlib: 1.5.1
openpyxl: 2.5.6
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.2
lxml: 3.6.4
bs4: 4.5.1
html5lib: 0.9999999
sqlalchemy: 1.0.13
pymysql: None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.8
s3fs: 0.0.8
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Comment From: chris-b1

You're welcome to have a look here. Ultimately it will be due to the overhead of unpacking the array to deal with missing values, but that might be something that can be elided.

https://github.com/pandas-dev/pandas/blob/2f6b90aaaab6814c102eb160c5a9c11bc04a092e/pandas/core/nanops.py#L204

Comment From: ericstarr

I spent a little time looking into this and don't see an obvious way to make it faster. Potentially, in the case where the dtype of the Series is bool and skipna is True, you could bypass the _get_values call and go straight to Series.values.any, but that seems like a messy way of dealing with this.
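To make the idea concrete, here is a rough user-level sketch of that kind of bypass. It is not the pandas implementation and does not touch nanops; any_fast is a hypothetical helper name, and the dtype check is just one way the short-circuit could be expressed.

```python
import numpy as np
import pandas as pd


def any_fast(series, skipna=True):
    """Hypothetical helper: short-circuit to NumPy for plain bool Series."""
    values = series.values
    if isinstance(values, np.ndarray) and values.dtype == np.bool_:
        # A NumPy bool array cannot hold NaN, so skipna has no effect here
        # and the missing-value handling can be skipped entirely.
        return bool(values.any())
    # Every other dtype goes through the regular pandas reduction.
    return bool(series.any(skipna=skipna))


print(any_fast(pd.Series([False, True, False])))
print(any_fast(pd.Series([0, 0, 3])))
```

The special case only applies when the underlying data is a plain NumPy bool array; anything that might carry missing values still takes the normal path, which is part of why this feels like a patch rather than a fix.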

Comment From: jreback

This is already being addressed in #25070.

cc @qwhelan

Comment From: qwhelan

I fixed the root cause in numpy with numpy/numpy#12988 so 1.17 should be about 250x faster in this regard.

Comment From: arw2019

Below are the results on 1.2 master. In terms of the difference, booleans now outperform ints by ~4x for Series.any and ~25x for Series.values.any.

In [7]: s = pd.Series(np.random.randint(0, 2, 100000)).astype(bool) 

In [8]: %timeit s.any(skipna=True)
16.8 µs ± 226 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [9]: %timeit s.any(skipna=False) 
16.3 µs ± 102 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [10]: %timeit s.values.any() 
2.19 µs ± 43.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [11]: s2 = pd.Series(np.random.randint(0, 2, 100000)) 

In [12]: %timeit s2.any(skipna=True) 
71.6 µs ± 1.76 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [13]: %timeit s2.any(skipna=False) 
68.4 µs ± 204 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [14]: %timeit s2.values.any() 
53.4 µs ± 119 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
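The ratios quoted above can be checked from the reported means with a little arithmetic. The numbers in this sketch are copied from the skipna=True and .values runs in the output above; the dictionary layout and variable names are only for illustration.

```python
# Mean timings in microseconds, copied from the %timeit output above.
timings_us = {
    ("bool", "Series.any"): 16.8,
    ("bool", "Series.values.any"): 2.19,
    ("int", "Series.any"): 71.6,
    ("int", "Series.values.any"): 53.4,
}

methods = ("Series.any", "Series.values.any")

# How much faster the bool Series is than the int Series, per method.
bool_vs_int = {
    m: timings_us[("int", m)] / timings_us[("bool", m)] for m in methods
}

# How much slower Series.any is than Series.values.any, per dtype.
pandas_overhead = {
    dtype: timings_us[(dtype, "Series.any")]
    / timings_us[(dtype, "Series.values.any")]
    for dtype in ("bool", "int")
}

for method, ratio in bool_vs_int.items():
    print("%s: bool is %.1fx faster than int" % (method, ratio))
for dtype, ratio in pandas_overhead.items():
    print("%s: Series.any is %.1fx slower than values.any" % (dtype, ratio))
```

This gives roughly 4.3x and 24.4x for the bool versus int comparison. It also shows that Series.any is still about 7.7x slower than Series.values.any for bools in these runs, although in absolute terms the gap is now around 15 µs.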
