Code Sample, a copy-pastable example if possible
import numpy as np
import pandas as pd
# Series of bools
s = pd.Series(np.random.randint(0, 2, 100000)).astype(bool)
# ~1.45 ms
%timeit s.any(skipna=True)
# ~1.35 ms
%timeit s.any(skipna=False)
# ~6.5 us - Note that I get a message about possible caching, but
# even after multiplying by worst case multiplier, still an order of
# magnitude faster than s.any()
%timeit s.values.any()
# Series of ints
s2 = pd.Series(np.random.randint(0, 2, 100000))
# ~330 us
%timeit s2.any(skipna=True)
# ~280 us
%timeit s2.any(skipna=False)
# ~90 us - No possible caching warning on this one
%timeit s2.values.any()
Problem description
Calling Series.any is much slower than calling Series.values.any on a series of bools. Interestingly, calling Series.any on a series of ints is quite a bit faster than on a series of bools, though even for a series of ints, Series.values.any is still faster.
I ran with both skipna=True and skipna=False in case it was an issue of how NaNs are being handled.
I see the same time differences with Series.all.
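For anyone reproducing this outside IPython, here is a minimal sketch of the same comparison using the standard-library timeit module instead of %timeit. The helper name time_call, the fixed seed, and the repeat counts are my own choices for illustration and are not part of the original report; absolute numbers will vary by machine and by pandas/NumPy version.

```python
import timeit

import numpy as np
import pandas as pd


def time_call(func, number=200, repeat=5):
    """Hypothetical helper: best observed per-call time in seconds."""
    return min(timeit.repeat(func, number=number, repeat=repeat)) / number


rng = np.random.default_rng(0)
s_bool = pd.Series(rng.integers(0, 2, 100_000)).astype(bool)
s_int = pd.Series(rng.integers(0, 2, 100_000))

# Each entry mirrors one of the %timeit lines from the report.
cases = {
    "bool Series.any(skipna=True)": lambda: s_bool.any(skipna=True),
    "bool Series.any(skipna=False)": lambda: s_bool.any(skipna=False),
    "bool Series.values.any()": lambda: s_bool.values.any(),
    "int Series.any(skipna=True)": lambda: s_int.any(skipna=True),
    "int Series.any(skipna=False)": lambda: s_int.any(skipna=False),
    "int Series.values.any()": lambda: s_int.values.any(),
}

results = {name: time_call(func) for name, func in cases.items()}
for name, seconds in results.items():
    print(f"{name:<32} {seconds * 1e6:10.2f} us")
```

Swapping .any for .all in the lambdas gives the corresponding Series.all comparison.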
Expected Output
I would expect the performance to be comparable. Maybe not exactly the same, but not order(s) of magnitude slower.
Output of pd.show_versions()
Comment From: chris-b1
You're welcome to have a look here - ultimately it will be due to overhead of unpacking the array to deal with missing values, but might be something that can be elided
https://github.com/pandas-dev/pandas/blob/2f6b90aaaab6814c102eb160c5a9c11bc04a092e/pandas/core/nanops.py#L204
Comment From: ericstarr
I spent a little time looking into this and don't see an obvious way to make it faster. Potentially in the case where the dtype of the Series is bool and skipna is True you could bypass the _get_values call and go straight to df.values.any, but that seems like a messy way of dealing with this.
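To make the bypass idea concrete, here is a rough user-level sketch. The function fast_any is hypothetical and is not pandas internals; it only illustrates that a NumPy bool array cannot hold missing values, so the NaN handling can be skipped for that dtype, while every other dtype goes through the normal Series.any path.

```python
import numpy as np
import pandas as pd


def fast_any(series: pd.Series) -> bool:
    """Hypothetical helper: skip NaN handling when the dtype cannot hold NaN."""
    if series.dtype == np.bool_:
        # A plain NumPy bool array has no missing values, so skipna is
        # irrelevant and the raw ndarray reduction gives the same answer.
        return bool(series.to_numpy().any())
    # Every other dtype falls back to the regular pandas reduction.
    return bool(series.any(skipna=True))


print(fast_any(pd.Series([False, True, False])))
print(fast_any(pd.Series([0, 0, 0])))
```

Whether a special case like this belongs inside pandas itself is exactly the messiness concern raised above; as a wrapper in user code it is harmless.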
Comment From: jreback
this is already being addressed in #25070
cc @qwhelan
Comment From: qwhelan
I fixed the root cause in numpy with numpy/numpy#12988 so 1.17 should be about 250x faster in this regard.
Comment From: arw2019
Below are the results on 1.2 master. Booleans now outperform ints by ~4x for Series.any and ~25x for Series.values.any.
In [7]: s = pd.Series(np.random.randint(0, 2, 100000)).astype(bool)
In [8]: %timeit s.any(skipna=True)
16.8 µs ± 226 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [9]: %timeit s.any(skipna=False)
16.3 µs ± 102 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [10]: %timeit s.values.any()
2.19 µs ± 43.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [11]: s2 = pd.Series(np.random.randint(0, 2, 100000))
In [12]: %timeit s2.any(skipna=True)
71.6 µs ± 1.76 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [13]: %timeit s2.any(skipna=False)
68.4 µs ± 204 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [14]: %timeit s2.values.any()
53.4 µs ± 119 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
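The quoted ratios can be checked directly from the skipna=True and .values timings in the session above. This is only arithmetic on those reported means; the variable names are mine.

```python
# Mean timings in microseconds, copied from the session above.
timings_us = {
    "bool Series.any": 16.8,
    "bool Series.values.any": 2.19,
    "int Series.any": 71.6,
    "int Series.values.any": 53.4,
}

# How much faster bools are than ints for each call.
series_ratio = timings_us["int Series.any"] / timings_us["bool Series.any"]
values_ratio = (
    timings_us["int Series.values.any"] / timings_us["bool Series.values.any"]
)

# Remaining pandas overhead on top of the raw NumPy call for bools.
bool_overhead = timings_us["bool Series.any"] / timings_us["bool Series.values.any"]

print(f"int vs bool, Series.any:        {series_ratio:.1f}x")
print(f"int vs bool, Series.values.any: {values_ratio:.1f}x")
print(f"bool Series.any vs values.any:  {bool_overhead:.1f}x")
```

The last figure shows that for bools Series.any is still several times slower than Series.values.any, though in absolute terms the gap is now about 15 µs.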