Pandas version checks

- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this issue exists on the latest version of pandas.
- [ ] I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
import string
import random
import pandas as pd
import numpy as np
from time import time

rows = 500000
temp = np.random.randint(10**6, size=(rows, 3))
symbols = string.ascii_uppercase + string.digits
string_col = [''.join(random.choices(symbols, k=16)) for _ in range(rows)]
res = np.concatenate((temp, np.array([string_col]).T), axis=1)
df = pd.DataFrame(res)

def _format(x):
    vals = x.values
    if len(vals) > 2:
        return '-'.join(map(str, vals[:-1]))
    return np.nan

def test2():
    for _ in range(3):
        start = time()
        result = df.groupby(3, sort=False).agg(
            col1=(2, _format),
            col2=(1, _format),
        )
        print(f"end: {time()-start}")

test2()
Installed Versions
Prior Performance

- 1.3.5: 4.9 sec
- 1.4.1: 19.8 sec
Comment From: rhshadrach
Thanks for the report. On my laptop 1.3.x takes ~3s and main takes ~15s. I ran two git bisects: one failing if the example took longer than 5s, and one failing if it took longer than 10s (a sketch of how such a run can be scripted follows the results below).
5s:

Author: jbrockmendel
Date: Fri Sep 10 17:17:41 2021 -0700
    REF: remove libreduction.SeriesGrouper (#43505)

10s:

Author: Joris Van den Bossche
Date: Sun Jan 16 17:58:29 2022 +0100
    BUG: correctly instantiate subclassed DataFrame/Series in groupby apply (#45363)
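For reference, a time-thresholded bisect like this can be scripted with git bisect run, which treats a nonzero exit status as "bad". A minimal sketch, where bisect_check.py, the threshold, and run_repro are illustrative stand-ins rather than the exact script used:

# bisect_check.py -- used as `git bisect run python bisect_check.py`.
# Exit 0 marks the commit good, exit 1 marks it bad, so exceeding the
# time threshold automatically flags slow commits.
import sys
from time import time

THRESHOLD = 5.0  # seconds; the second bisect used 10.0

def run_repro():
    # body of the reproducible example above
    pass

start = time()
run_repro()
elapsed = time() - start
print(f"elapsed: {elapsed:.1f}s")
sys.exit(0 if elapsed < THRESHOLD else 1)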
cc @jbrockmendel and @jorisvandenbossche
I will note that your other 4255 PRs are good @jbrockmendel, at least in regards to this issue.
Comment From: jbrockmendel
Hah, thanks @rhshadrach, a little bit of positive reinforcement goes a long way.
Not having looked at the function carefully, I'm assuming it's what we used to call the near-worst case, where the function does very little and all the time is spent iterating over the groups. I think in the original get-rid-of-libreduction PR I intentionally avoided using _constructor and __finalize__ to minimize the performance hit. IIRC these were later re-added in order to be Technically Correct.
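To make the near-worst-case pattern concrete, a minimal sketch (illustrative numbers, not a pandas benchmark): the 500,000 mostly-unique group keys above mean roughly a million length-1 Series get built, and the wrapper construction dwarfs a UDF body this cheap:

import numpy as np
import pandas as pd
from time import time

values = np.arange(100_000)
index = pd.RangeIndex(100_000)

# The cheap part: slicing the raw ndarray once per "group".
start = time()
for i in range(100_000):
    _ = values[i:i + 1]
print(f"raw slices:    {time() - start:.2f}s")

# The expensive part: wrapping each slice in a Series, as per-group
# iteration has to do once the libreduction fast path is gone.
start = time()
for i in range(100_000):
    _ = pd.Series(values[i:i + 1], index=index[i:i + 1])
print(f"Series slices: {time() - start:.2f}s")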
Can you report some profiling results?
Comment From: jbrockmendel
Profiling results on main vs 1.3.5
import string
import random
import pandas as pd
import numpy as np
from time import time

rows = 500000
temp = np.random.randint(10**6, size=(rows, 3))
symbols = string.ascii_uppercase + string.digits
string_col = [''.join(random.choices(symbols, k=16)) for _ in range(rows)]
res = np.concatenate((temp, np.array([string_col]).T), axis=1)
df = pd.DataFrame(res)

def _format(x):
    vals = x.values
    if len(vals) > 2:
        return '-'.join(map(str, vals[:-1]))
    return np.nan

gb = df.groupby(3, sort=False)
func = {"col1": (2, _format), "col2": (1, _format)}

%prun -s cumtime gb.agg(**func)
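For anyone profiling outside IPython, the %prun line above corresponds to the standard library's cProfile with the same sort key:

import cProfile

# Equivalent of `%prun -s cumtime gb.agg(**func)` without IPython magics.
cProfile.runctx("gb.agg(**func)", globals(), locals(), sort="cumtime")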
main
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 25.986 25.986 {built-in method builtins.exec}
1 0.001 0.001 25.986 25.986 <string>:1(<module>)
1 0.000 0.000 25.985 25.985 generic.py:874(aggregate)
1 0.001 0.001 25.975 25.975 apply.py:151(agg)
1 0.000 0.000 25.975 25.975 apply.py:453(agg_dict_like)
1 0.000 0.000 25.943 25.943 apply.py:486(<dictcomp>)
4/2 0.001 0.000 25.943 12.972 generic.py:257(aggregate)
2 0.000 0.000 25.943 12.971 generic.py:312(_aggregate_multiple_funcs)
2 0.002 0.001 25.941 12.970 groupby.py:1648(_python_agg_general)
2 0.013 0.006 25.938 12.969 ops.py:1040(agg_series)
2 1.982 0.991 25.859 12.930 ops.py:1082(_aggregate_series_pure_python)
1000002 0.743 0.000 18.458 0.000 ops.py:1313(__iter__)
1000000 1.878 0.000 17.631 0.000 ops.py:1335(_chop)
1000004 2.796 0.000 6.994 0.000 generic.py:5849(__finalize__)
1000000 2.478 0.000 5.822 0.000 managers.py:1997(get_slice)
1000000 0.485 0.000 3.988 0.000 groupby.py:1651(<lambda>)
1000000 0.657 0.000 3.503 0.000 <ipython-input-1-caf266d6801e>:16(_format)
1000000 0.411 0.000 2.737 0.000 series.py:664(values)
1000004 0.842 0.000 2.673 0.000 series.py:343(__init__)
1000006 0.786 0.000 2.540 0.000 series.py:659(name)
1000000 1.046 0.000 2.327 0.000 managers.py:2021(external_values)
1000000 0.722 0.000 2.107 0.000 base.py:5385(_getitem_slice)
1000008 0.471 0.000 1.753 0.000 common.py:1723(validate_all_hashable)
1000008 1.364 0.000 1.743 0.000 generic.py:260(__init__)
1000034 0.908 0.000 1.385 0.000 base.py:691(_simple_new)
1000035 0.424 0.000 1.283 0.000 {built-in method builtins.all}
1.3.5
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 6.553 6.553 {built-in method builtins.exec}
1 0.001 0.001 6.553 6.553 <string>:1(<module>)
1 0.000 0.000 6.552 6.552 generic.py:964(aggregate)
1 0.000 0.000 6.543 6.543 apply.py:143(agg)
1 0.000 0.000 6.543 6.543 apply.py:405(agg_dict_like)
1 0.000 0.000 6.512 6.512 apply.py:435(<dictcomp>)
4/2 0.000 0.000 6.512 3.256 generic.py:225(aggregate)
2 0.000 0.000 6.511 3.256 generic.py:278(_aggregate_multiple_funcs)
2 0.002 0.001 6.509 3.254 groupby.py:1315(_python_agg_general)
2 0.010 0.005 6.408 3.204 ops.py:1019(agg_series)
2 0.001 0.000 6.330 3.165 ops.py:1069(_aggregate_series_fast)
2 1.461 0.730 6.240 3.120 {method 'get_result' of 'pandas._libs.reduction.SeriesGrouper' objects}
1000000 0.357 0.000 2.675 0.000 groupby.py:1318(<lambda>)
1000000 0.487 0.000 2.317 0.000 <ipython-input-1-caf266d6801e>:16(_format)
1000004 0.303 0.000 1.754 0.000 series.py:632(values)
1000005 0.645 0.000 1.508 0.000 base.py:767(_engine)
1000004 0.434 0.000 1.451 0.000 managers.py:1670(external_values)
1000004 0.255 0.000 0.879 0.000 blocks.py:208(external_values)
4000375/3000291 0.446 0.000 0.663 0.000 {built-in method builtins.len}
1000004 0.369 0.000 0.624 0.000 blocks.py:2114(external_values)
1000000 0.526 0.000 0.596 0.000 managers.py:1712(set_values)
1000008 0.230 0.000 0.347 0.000 base.py:4405(_get_engine_target)
1000066 0.217 0.000 0.291 0.000 base.py:794(__len__)
2001531 0.256 0.000 0.256 0.000 {built-in method builtins.isinstance}
1000022 0.138 0.000 0.138 0.000 managers.py:1619(_block)
1000056 0.117 0.000 0.117 0.000 base.py:4379(_values)
So, consistent with the bisect results, the big changes are the removal of the libreduction code path and the added __finalize__ calls.
1) The perf hit we took by removing libreduction was expected and judged to be worth the improved correctness.
2) We could plausibly get away without __finalize__ in many cases. I'm not sure how we could identify those cases (one possible check is sketched after this list), cc @jorisvandenbossche?
3) It looks like calls to external_values are about 60% slower than they used to be; maybe we could claw some of that back.
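For point 2, one possible shape for such a check, as a sketch only (is_plain_pandas is a hypothetical helper, not existing pandas API):

import pandas as pd

def is_plain_pandas(obj) -> bool:
    # Hypothetical check: __finalize__ has nothing observable to do when the
    # object is an exact DataFrame/Series (no subclass _metadata), carries
    # no attrs, and has default flags.
    return (
        type(obj) in (pd.DataFrame, pd.Series)
        and not obj.attrs
        and obj.flags.allows_duplicate_labels
    )

# The decision could then be made once, before the per-group loop:
df = pd.DataFrame({"a": [1, 2]})
skip_finalize = is_plain_pandas(df)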
Comment From: jorisvandenbossche
Timing https://github.com/pandas-dev/pandas/issues/46505#issuecomment-1230958700 locally on main, __finalize__ takes around ~35% of the total time for me, but 60% of that finalize part is due to the Series name validation. This is something we control ourselves, and it does not come from subclasses (it's Series itself that has "name" in _metadata).
Changing "name" to "_name" reduces the finalize part to ~20% of the overall time according to the profiler.
The Python profiler can give a somewhat distorted picture, so confirming by actually timing this example: with current main I see ~13-14s, with "name" changed to "_name" I get ~11-12s, and with finalize removed from _chop entirely I get ~9-9.5s.
So finalize is certainly a significant part, but this specific example is still slow nonetheless.
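To make the "name" vs "_name" difference concrete, a minimal sketch (timings will vary; _name is internal API, shown here only to illustrate where finalize spends its time): Series.name is a property whose setter validates hashability on every assignment, which is the validate_all_hashable call visible in the profile above, while assigning the private attribute bypasses that:

import pandas as pd
from time import time

s = pd.Series([1, 2, 3])

# Through the property setter: runs validate_all_hashable each time.
start = time()
for _ in range(1_000_000):
    s.name = "a"
print(f"s.name:  {time() - start:.2f}s")

# Direct attribute assignment: skips the validation entirely.
start = time()
for _ in range(1_000_000):
    s._name = "a"
print(f"s._name: {time() - start:.2f}s")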
We could consider inspecting whether we have a subclass or not (once), and then only calling finalize if that's the case. That would avoid it for plain pandas DataFrames. Strictly speaking, though, finalize also does things for plain pandas objects that a user could rely on in the UDF (propagating flags and attrs), although relying on that might be quite unlikely. EDIT: I think additionally checking for those attributes as well is what @jbrockmendel is doing in https://github.com/pandas-dev/pandas/pull/51109
I will check whether the "name" -> "_name" change is feasible without breaking things.
Comment From: jorisvandenbossche
Opened https://github.com/pandas-dev/pandas/pull/51784 to see what the full test suite thinks of this idea.
Comment From: jbrockmendel
The name->_name thing is a good catch!