- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [ ] (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
import numpy as np
import pandas as pd
pd.__version__
N = 10**6
z = pd.DataFrame({'a': np.random.randint(1, 10, N), 'b': np.random.randint(1, 100000, N)})
z['c'] = (z.b % 2).astype(bool)
%timeit z.groupby('a').c.transform(max)
%timeit z.groupby('a').c.transform(min)
%timeit z.groupby('a').c.transform(any)
%timeit z.groupby('a').c.transform(all)
Problem description
Logically, any and all on a boolean column should be at least as fast as max and min, which give exactly the same results. Instead they are roughly 8x slower:
In [7]: %timeit z.groupby('a').c.transform(max)
11.8 ms ± 86.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [8]: %timeit z.groupby('a').c.transform(min)
11.8 ms ± 78.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [9]: %timeit z.groupby('a').c.transform(any)
99.7 ms ± 383 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [10]: %timeit z.groupby('a').c.transform(all)
99.8 ms ± 665 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
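The claim that max/min and any/all agree on booleans can be checked with a small sketch. It uses the string aliases "max", "min", "any" and "all" rather than the builtins from the sample above, and a smaller frame with a seeded generator; both are choices made for this illustration, not part of the original report.

```python
import numpy as np
import pandas as pd

# Smaller, seeded version of the frame from the code sample (illustrative sizes).
rng = np.random.default_rng(0)
z = pd.DataFrame({"a": rng.integers(1, 10, 1000), "b": rng.integers(1, 100000, 1000)})
z["c"] = (z["b"] % 2).astype(bool)
gbc = z.groupby("a")["c"]

# On booleans, the group maximum is True exactly when any value is True,
# and the group minimum is True exactly when all values are True.
via_max = gbc.transform("max").to_numpy().astype(bool)
via_any = gbc.transform("any").to_numpy().astype(bool)
via_min = gbc.transform("min").to_numpy().astype(bool)
via_all = gbc.transform("all").to_numpy().astype(bool)

same_any = bool((via_max == via_any).all())
same_all = bool((via_min == via_all).all())
print(same_any, same_all)
```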
Expected Output
Times for any and all should be no higher than those for max and min.
Output of pd.show_versions()
Comment From: phofl
That's not really a bug. max/min are implemented in Cython, while any dispatches to the built-in function any. Same for all.
Comment From: JacekPliszka
OK, so it is more like a feature request to implement any and all in Cython. They should be faster, and naively this is the way I would write...
Comment From: jbrockmendel
GroupBy.any/all have Cython implementations, but it looks like .transform doesn't dispatch to them. With setup as in the OP:
gb = z.groupby("a")
gbc = gb["c"]
In [14]: %timeit gbc.any()
5.91 ms ± 159 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [15]: %timeit gbc.agg(any)
20.1 ms ± 746 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [16]: %timeit gbc.apply(any)
20.6 ms ± 1.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [17]: %timeit gbc.transform(any)
96.2 ms ± 877 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
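Until transform dispatches the builtin, two possible workarounds are sketched below. The assumption (worth confirming with %timeit on your own pandas version) is that the string alias "any" resolves to GroupBy.any and so avoids calling the Python builtin once per group. The second variant reduces once with gbc.any() and broadcasts the result back by group key. The frame is a smaller, seeded version of the OP's setup, chosen for illustration.

```python
import numpy as np
import pandas as pd

# Smaller, seeded version of the OP's setup (illustrative sizes).
rng = np.random.default_rng(0)
N = 10_000
z = pd.DataFrame({"a": rng.integers(1, 10, N), "b": rng.integers(1, 100000, N)})
z["c"] = (z["b"] % 2).astype(bool)
gbc = z.groupby("a")["c"]

# Slow path from the report: the Python builtin is called on each group.
slow_any = gbc.transform(any)

# Assumed fast path: the string alias is looked up as a GroupBy method.
fast_any = gbc.transform("any")

# Alternative: reduce once per group, then map the result back onto the keys.
mapped_any = z["a"].map(gbc.any())

# All three should give the same values, row for row.
agree = bool(
    (slow_any.to_numpy() == fast_any.to_numpy()).all()
    and (mapped_any.to_numpy() == fast_any.to_numpy()).all()
)
print(agree)
```

The same pattern should apply to all by swapping "any" for "all" and gbc.any() for gbc.all().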