Pandas version checks

- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this issue exists on the latest version of pandas.
- [ ] I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
import string
import random
import pandas as pd
import numpy as np
from time import time

rows = 500000
temp = np.random.randint(10**6, size=(rows, 3))
symbols = string.ascii_uppercase + string.digits
string_col = [''.join(random.choices(symbols, k=16)) for _ in range(rows)]
res = np.concatenate((temp, np.array([string_col]).T), axis=1)
df = pd.DataFrame(res)

def _format(x):
    vals = x.values
    if len(vals) > 2:
        return '-'.join(map(str, vals[:-1]))
    return np.nan

def test2():
    for _ in range(3):
        start = time()
        result = df.groupby(3, sort=False).agg(
            col1=(2, _format),
            col2=(1, _format),
        )
        print(f"end: {time()-start}")

test2()
Installed Versions
Prior Performance

- 1.3.5: 4.9 sec
- 1.4.1: 19.8 sec
Comment From: rhshadrach
Thanks for the report. On my laptop 1.3.x takes ~3s and main takes ~15s. I ran two git bisects: one failing if the example took longer than 5s, and one failing if it took longer than 10s (a sketch of how such a run can be scripted follows the results below).
5s:

Author: jbrockmendel
Date: Fri Sep 10 17:17:41 2021 -0700
    REF: remove libreduction.SeriesGrouper (#43505)

10s:

Author: Joris Van den Bossche
Date: Sun Jan 16 17:58:29 2022 +0100
    BUG: correctly instantiate subclassed DataFrame/Series in groupby apply (#45363)
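For reference, a time-thresholded bisect like this can be scripted with git bisect run, which treats a nonzero exit status as "bad". A minimal sketch, where bisect_check.py, the threshold, and run_repro are illustrative stand-ins rather than the exact script used:

# bisect_check.py -- used as `git bisect run python bisect_check.py`.
# Exit 0 marks the commit good, exit 1 marks it bad, so exceeding the
# time threshold automatically flags slow commits.
import sys
from time import time

THRESHOLD = 5.0  # seconds; the second bisect used 10.0

def run_repro():
    # body of the reproducible example above
    pass

start = time()
run_repro()
elapsed = time() - start
print(f"elapsed: {elapsed:.1f}s")
sys.exit(0 if elapsed < THRESHOLD else 1)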
cc @jbrockmendel and @jorisvandenbossche
I will note that your other 4255 PRs are good @jbrockmendel, at least in regards to this issue.
Comment From: jbrockmendel
Hah, thanks @rhshadrach, a little bit of positive reinforcement goes a long way.
Not having looked at the function carefully, I'm assuming it's what we used to call the near-worst case, where the function does very little and all the time is spent iterating over the groups. I think in the original get-rid-of-libreduction PR I intentionally avoided using _constructor and __finalize__ to minimize the performance hit. IIRC these were later re-added in order to be Technically Correct.
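To make the near-worst-case pattern concrete, a minimal sketch (illustrative numbers, not a pandas benchmark): the 500,000 mostly-unique group keys above mean roughly a million length-1 Series get built, and the wrapper construction dwarfs a UDF body this cheap:

import numpy as np
import pandas as pd
from time import time

values = np.arange(100_000)
index = pd.RangeIndex(100_000)

# The cheap part: slicing the raw ndarray once per "group".
start = time()
for i in range(100_000):
    _ = values[i:i + 1]
print(f"raw slices:    {time() - start:.2f}s")

# The expensive part: wrapping each slice in a Series, as per-group
# iteration has to do once the libreduction fast path is gone.
start = time()
for i in range(100_000):
    _ = pd.Series(values[i:i + 1], index=index[i:i + 1])
print(f"Series slices: {time() - start:.2f}s")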
Can you report some profiling results?
Comment From: jbrockmendel
Profiling results on main vs 1.3.5
import string
import random
import pandas as pd
import numpy as np
from time import time

rows = 500000
temp = np.random.randint(10**6, size=(rows, 3))
symbols = string.ascii_uppercase + string.digits
string_col = [''.join(random.choices(symbols, k=16)) for _ in range(rows)]
res = np.concatenate((temp, np.array([string_col]).T), axis=1)
df = pd.DataFrame(res)

def _format(x):
    vals = x.values
    if len(vals) > 2:
        return '-'.join(map(str, vals[:-1]))
    return np.nan

gb = df.groupby(3, sort=False)
func = {"col1": (2, _format), "col2": (1, _format)}

%prun -s cumtime gb.agg(**func)
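For anyone profiling outside IPython, the %prun line above corresponds to the standard library's cProfile with the same sort key:

import cProfile

# Equivalent of `%prun -s cumtime gb.agg(**func)` without IPython magics.
cProfile.runctx("gb.agg(**func)", globals(), locals(), sort="cumtime")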
main
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 25.986 25.986 {built-in method builtins.exec}
1 0.001 0.001 25.986 25.986 <string>:1(<module>)
1 0.000 0.000 25.985 25.985 generic.py:874(aggregate)
1 0.001 0.001 25.975 25.975 apply.py:151(agg)
1 0.000 0.000 25.975 25.975 apply.py:453(agg_dict_like)
1 0.000 0.000 25.943 25.943 apply.py:486(<dictcomp>)
4/2 0.001 0.000 25.943 12.972 generic.py:257(aggregate)
2 0.000 0.000 25.943 12.971 generic.py:312(_aggregate_multiple_funcs)
2 0.002 0.001 25.941 12.970 groupby.py:1648(_python_agg_general)
2 0.013 0.006 25.938 12.969 ops.py:1040(agg_series)
2 1.982 0.991 25.859 12.930 ops.py:1082(_aggregate_series_pure_python)
1000002 0.743 0.000 18.458 0.000 ops.py:1313(__iter__)
1000000 1.878 0.000 17.631 0.000 ops.py:1335(_chop)
1000004 2.796 0.000 6.994 0.000 generic.py:5849(__finalize__)
1000000 2.478 0.000 5.822 0.000 managers.py:1997(get_slice)
1000000 0.485 0.000 3.988 0.000 groupby.py:1651(<lambda>)
1000000 0.657 0.000 3.503 0.000 <ipython-input-1-caf266d6801e>:16(_format)
1000000 0.411 0.000 2.737 0.000 series.py:664(values)
1000004 0.842 0.000 2.673 0.000 series.py:343(__init__)
1000006 0.786 0.000 2.540 0.000 series.py:659(name)
1000000 1.046 0.000 2.327 0.000 managers.py:2021(external_values)
1000000 0.722 0.000 2.107 0.000 base.py:5385(_getitem_slice)
1000008 0.471 0.000 1.753 0.000 common.py:1723(validate_all_hashable)
1000008 1.364 0.000 1.743 0.000 generic.py:260(__init__)
1000034 0.908 0.000 1.385 0.000 base.py:691(_simple_new)
1000035 0.424 0.000 1.283 0.000 {built-in method builtins.all}
1.3.5
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 6.553 6.553 {built-in method builtins.exec}
1 0.001 0.001 6.553 6.553 <string>:1(<module>)
1 0.000 0.000 6.552 6.552 generic.py:964(aggregate)
1 0.000 0.000 6.543 6.543 apply.py:143(agg)
1 0.000 0.000 6.543 6.543 apply.py:405(agg_dict_like)
1 0.000 0.000 6.512 6.512 apply.py:435(<dictcomp>)
4/2 0.000 0.000 6.512 3.256 generic.py:225(aggregate)
2 0.000 0.000 6.511 3.256 generic.py:278(_aggregate_multiple_funcs)
2 0.002 0.001 6.509 3.254 groupby.py:1315(_python_agg_general)
2 0.010 0.005 6.408 3.204 ops.py:1019(agg_series)
2 0.001 0.000 6.330 3.165 ops.py:1069(_aggregate_series_fast)
2 1.461 0.730 6.240 3.120 {method 'get_result' of 'pandas._libs.reduction.SeriesGrouper' objects}
1000000 0.357 0.000 2.675 0.000 groupby.py:1318(<lambda>)
1000000 0.487 0.000 2.317 0.000 <ipython-input-1-caf266d6801e>:16(_format)
1000004 0.303 0.000 1.754 0.000 series.py:632(values)
1000005 0.645 0.000 1.508 0.000 base.py:767(_engine)
1000004 0.434 0.000 1.451 0.000 managers.py:1670(external_values)
1000004 0.255 0.000 0.879 0.000 blocks.py:208(external_values)
4000375/3000291 0.446 0.000 0.663 0.000 {built-in method builtins.len}
1000004 0.369 0.000 0.624 0.000 blocks.py:2114(external_values)
1000000 0.526 0.000 0.596 0.000 managers.py:1712(set_values)
1000008 0.230 0.000 0.347 0.000 base.py:4405(_get_engine_target)
1000066 0.217 0.000 0.291 0.000 base.py:794(__len__)
2001531 0.256 0.000 0.256 0.000 {built-in method builtins.isinstance}
1000022 0.138 0.000 0.138 0.000 managers.py:1619(_block)
1000056 0.117 0.000 0.117 0.000 base.py:4379(_values)
So, consistent with the bisect results, the big changes are the removal of the libreduction code path and the added __finalize__ calls.
1) The perf hit we took by removing libreduction was expected and judged to be worth the improved correctness.
2) We could plausibly get away without __finalize__ in many cases. I'm not sure how we could identify those cases (one possible check is sketched after this list), cc @jorisvandenbossche?
3) It looks like calls to external_values are about 60% slower than they used to be; maybe we could claw some of that back.
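For point 2, one possible shape for such a check, as a sketch only (is_plain_pandas is a hypothetical helper, not existing pandas API):

import pandas as pd

def is_plain_pandas(obj) -> bool:
    # Hypothetical check: __finalize__ has nothing observable to do when the
    # object is an exact DataFrame/Series (no subclass _metadata), carries
    # no attrs, and has default flags.
    return (
        type(obj) in (pd.DataFrame, pd.Series)
        and not obj.attrs
        and obj.flags.allows_duplicate_labels
    )

# The decision could then be made once, before the per-group loop:
df = pd.DataFrame({"a": [1, 2]})
skip_finalize = is_plain_pandas(df)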
Comment From: jorisvandenbossche
Timing https://github.com/pandas-dev/pandas/issues/46505#issuecomment-1230958700 locally on main, __finalize__ takes around ~35% of the total time for me, but 60% of that finalize part is due to the Series name validation. This is something we control ourselves, and it does not come from subclasses (it's Series itself that has "name" in _metadata).
Changing "name" to "_name" reduces the finalize part to ~20% of the overall time according to the profiler.
The Python profiler can give a somewhat distorted picture, so confirming by actually timing this example: with current main I see ~13-14s, with "name" changed to "_name" I get ~11-12s, and with finalize removed from _chop entirely I get ~9-9.5s.
So finalize is certainly a significant part, but this specific example is still slow nonetheless.
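To make the "name" vs "_name" difference concrete, a minimal sketch (timings will vary; _name is internal API, shown here only to illustrate where finalize spends its time): Series.name is a property whose setter validates hashability on every assignment, which is the validate_all_hashable call visible in the profile above, while assigning the private attribute bypasses that:

import pandas as pd
from time import time

s = pd.Series([1, 2, 3])

# Through the property setter: runs validate_all_hashable each time.
start = time()
for _ in range(1_000_000):
    s.name = "a"
print(f"s.name:  {time() - start:.2f}s")

# Direct attribute assignment: skips the validation entirely.
start = time()
for _ in range(1_000_000):
    s._name = "a"
print(f"s._name: {time() - start:.2f}s")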
We could consider inspecting whether we have a subclass or not (once), and then only calling finalize if that's the case. That would avoid it for plain pandas DataFrames. Strictly speaking, though, finalize also does things for plain pandas objects that a user could rely on in the UDF (propagating flags and attrs), although relying on that might be quite unlikely. EDIT: I think additionally checking for those attributes as well is what @jbrockmendel is doing in https://github.com/pandas-dev/pandas/pull/51109
I will check whether the "name" -> "_name" change is feasible without breaking things.
Comment From: jorisvandenbossche
Opened https://github.com/pandas-dev/pandas/pull/51784 to see what the full test suite thinks of this idea.
Comment From: jbrockmendel
The name->_name thing is a good catch!