• [x] I have checked that this issue has not already been reported.

  • [x] I have confirmed this bug exists on the latest version of pandas.

  • [ ] (optional) I have confirmed this bug exists on the master branch of pandas.



Code Sample, a copy-pastable example

import numpy as np
import pandas as pd
pd.__version__
N = 10**6
z = pd.DataFrame({'a': np.random.randint(1, 10, N), 'b': np.random.randint(1, 100000, N)})
z['c'] = (z.b % 2).astype(bool)
%timeit z.groupby('a').c.transform(max)
%timeit z.groupby('a').c.transform(min)
%timeit z.groupby('a').c.transform(any)
%timeit z.groupby('a').c.transform(all)

Problem description

Logically, any and all on bools should be at least as fast as max and min, which give exactly the same results on a boolean column.

In [7]: %timeit z.groupby('a').c.transform(max)                                                             
11.8 ms ± 86.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [8]: %timeit z.groupby('a').c.transform(min)                                                             
11.8 ms ± 78.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [9]: %timeit z.groupby('a').c.transform(any)                                                             
99.7 ms ± 383 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [10]: %timeit z.groupby('a').c.transform(all)                                                            
99.8 ms ± 665 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
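
A small check of the equivalence assumed here, not part of the original report: for a boolean column without missing values, the per-group max should match any and the per-group min should match all. The toy frame and variable names below are made up for illustration.

```python
import pandas as pd

# Toy data (hypothetical): group 1 is mixed, group 2 is all False, group 3 is all True.
z = pd.DataFrame({
    "a": [1, 1, 2, 2, 3, 3],
    "c": [True, False, False, False, True, True],
})
gbc = z.groupby("a")["c"]

# Reduce each group four ways and normalise to plain Python bools.
as_max = gbc.max().astype(bool).tolist()
as_any = gbc.any().astype(bool).tolist()
as_min = gbc.min().astype(bool).tolist()
as_all = gbc.all().astype(bool).tolist()

print(as_max == as_any)  # max and any agree on booleans
print(as_min == as_all)  # min and all agree on booleans
```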

Expected Output

Times for any and all are lower than or equal to those for max and min.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : 67a3d4241ab84419856b84fc3ebc9abcbe66c6b3
python : 3.8.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.8.18-200.fc32.x86_64
Version : #1 SMP Mon Nov 2 19:49:11 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.4
numpy : 1.19.4
pytz : 2020.4
dateutil : 2.8.1
pip : 19.3.1
setuptools : 41.6.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : 0.9.3
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.12.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.2.1
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.2.2
numexpr : 2.7.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 2.0.0
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : 3.5.2
tabulate : 0.8.7
xarray : None
xlrd : 1.2.0
xlwt : 1.1.2
numba : None

Comment From: phofl

That's not really a bug. max/min are implemented in Cython, while any dispatches to the built-in function any. The same goes for all.
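
One way to picture the dispatch described here is a lookup from Python callables to named kernels. The sketch below is an illustration only: FAST_KERNELS and resolve are hypothetical names, not pandas API, and the real lookup inside pandas may differ.

```python
import builtins

# Hypothetical lookup table: builtins that have a named fast kernel.
# any and all are deliberately absent, mirroring the behaviour reported here.
FAST_KERNELS = {builtins.max: "max", builtins.min: "min"}


def resolve(func):
    """Return the name of a fast kernel for func, or func itself if none is known."""
    return FAST_KERNELS.get(func, func)


print(resolve(max))         # the string "max": would take the fast path
print(resolve(any) is any)  # True: stays a plain Python callable, run per group
```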

Comment From: JacekPliszka

OK, so it is more like a feature request to implement any and all in Cython. They should be faster, and naively this is the way I would write...

Comment From: jbrockmendel

GroupBy.any/all have Cython implementations, but it looks like .transform doesn't dispatch to them. With setup as in the OP:

gb = z.groupby("a")
gbc = gb["c"]

In [14]: %timeit gbc.any()
5.91 ms ± 159 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [15]: %timeit gbc.agg(any)
20.1 ms ± 746 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [16]: %timeit gbc.apply(any)
20.6 ms ± 1.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [17]: %timeit gbc.transform(any)
96.2 ms ± 877 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
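
A possible workaround, sketched under the assumption that transform accepts the string alias "any" in the installed pandas version: pass the name instead of the builtin, or reduce with gbc.any() and broadcast the result back with map. The toy frame is made up for illustration, and no timings are claimed for it.

```python
import pandas as pd

# Toy data (hypothetical): group 1 is mixed, group 2 is all False, group 3 is all True.
z = pd.DataFrame({
    "a": [1, 1, 2, 2, 3, 3],
    "c": [True, False, False, False, True, True],
})
gbc = z.groupby("a")["c"]

# Baseline from the report: the Python builtin is called once per group.
slow = gbc.transform(any)

# Option 1 (assumed supported): refer to the reduction by name.
by_name = gbc.transform("any")

# Option 2: reduce once per group, then broadcast by looking up each row's key.
by_map = z["a"].map(gbc.any())

# Normalise to plain Python bools so the three results can be compared directly.
slow_list = slow.astype(bool).tolist()
by_name_list = by_name.astype(bool).tolist()
by_map_list = by_map.astype(bool).tolist()

print(slow_list == by_name_list == by_map_list)
```

The same pattern should apply to all by swapping the name. On the large frame from the OP, the three variants can be compared with %timeit.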