Code Sample

import time

import numpy as np
import pandas as pd

nrows, ncols = 1000, 100

# data frame with random values and a key to be grouped by
df = pd.DataFrame(np.random.rand(nrows, ncols))
df["key"] = range(nrows)

numeric_columns = list(range(ncols))

grouping = df.groupby(by="key")

# performance regression in apply()
start = time.time()
grouping[numeric_columns].apply(lambda x: x - x.mean())
end = time.time()

print("[pandas=={}] execution time: {:.4f} seconds".format(pd.__version__, end - start))

# [pandas==0.23.4] execution time: 0.8700 seconds
# [pandas==0.24.0] execution time: 24.3790 seconds
# [pandas==0.24.2] execution time: 23.9600 seconds

Problem description

The function GroupBy.apply is roughly 25 times slower in version 0.24.0 compared to the 0.23.4 release. The problem persists in the latest 0.24.2 release.

The code sample above demonstrates this performance regression. The purpose of the sample is to subtract the group mean from every element in its group.

The problem only occurs when the lambda passed to apply() returns a data frame. There are no performance issues with scalar return values, e.g. lambda x: x.mean().
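To make the contrast concrete, here is a rough sketch that times both return types on the same grouping (variable names mirror the sample above; absolute numbers will of course vary by machine and pandas version):

```python
import time

import numpy as np
import pandas as pd

nrows, ncols = 1000, 100
df = pd.DataFrame(np.random.rand(nrows, ncols))
df["key"] = range(nrows)

numeric_columns = list(range(ncols))
grouping = df.groupby(by="key")

# scalar-like return value per group: not affected by the regression
start = time.time()
grouping[numeric_columns].apply(lambda x: x.mean())
t_scalar = time.time() - start

# DataFrame return value per group: slow in 0.24.x
start = time.time()
grouping[numeric_columns].apply(lambda x: x - x.mean())
t_frame = time.time() - start

print("scalar-like return: {:.4f}s, DataFrame return: {:.4f}s".format(t_scalar, t_frame))
```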

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.24.0
pytest: None
pip: 10.0.1
setuptools: 39.2.0
Cython: None
numpy: 1.16.2
scipy: 1.2.1
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

Comment From: WillAyd

Can you isolate the frame operation from the groupby? Curious if the regression is noticeable in the former

Comment From: blu3r4y

You mean like this? Well, there is a slight difference (~ 30 % slower), but I wouldn't trust my "benchmark" here too much: I ran the code a few times and the numbers varied between 0.5 and 1 second.

import time

import numpy as np
import pandas as pd

nrows, ncols = 1000, 1000

df = pd.DataFrame(np.random.rand(nrows, ncols))

start = time.time()
for _ in range(100):
    df.apply(lambda x: x - x.mean())
end = time.time()

print("[pandas=={}] execution time: {:.4f} seconds".format(pd.__version__, end - start))
# [pandas==0.23.4] execution time: 25.8880 seconds
# [pandas==0.24.0] execution time: 36.0216 seconds
# [pandas==0.24.2] execution time: 34.6180 seconds

Additionally, I tested the original code sample with only one group. It's still about 2 times slower in 0.24+ than in 0.23.4, but not as drastic as with multiple groups.

nrows, ncols = 10000, 10000

df["key"] = [1] * nrows

# [pandas==0.23.4] execution time: 5.5250 seconds
# [pandas==0.24.0] execution time: 12.1590 seconds
# [pandas==0.24.2] execution time: 12.1540 seconds

Comment From: WillAyd

Right, I'm just trying to isolate potential regressions in GroupBy versus Frame operations. Given that you don't see the same regression with scalars, I'm inclined to believe it's the latter that may be at fault here.

Can you try your last example on master? I think those results might be misleading, as the apply operation would still get called twice even with only one group in 0.24.2 (see #24748, which just changed this behavior), so it might not be a clean comparison to make
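The double invocation mentioned above can be observed with a simple call counter; this is an illustrative sketch (the exact count depends on the pandas version, since #24748 removed the extra call on the first group):

```python
import pandas as pd

calls = []

def demean(g):
    calls.append(1)  # count how often apply() invokes the function
    return g - g.mean()

df = pd.DataFrame({"key": [1, 1, 2, 2], "val": [1.0, 2.0, 3.0, 4.0]})
df.groupby("key")[["val"]].apply(demean)

# before #24748, apply() evaluated the function on the first group twice
# to infer the result shape, so len(calls) could exceed the number of
# groups (e.g. 3 instead of 2 here); afterwards it is one call per group
print(len(calls))
```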

@TomAugspurger we were never able to get the ASV site back up running were we?

Comment From: blu3r4y

I think I got closer to the problem: it really seems to be related to data frame operations, i.e. subtracting a scalar from a data frame already shows the regression. And it's huge: on column-heavy shapes, this simple operation is 153 times slower ô.O

import timeit
import numpy as np  # 1.16.2
import pandas as pd

def benchmark():
    nrows, ncols = 100, 100
    df = pd.DataFrame(np.random.rand(nrows, ncols))
    _ = df - 1

time = timeit.timeit(benchmark, number=100)
print("# {:>8.4f} sec   pandas=={}".format(time, pd.__version__))

Here are my benchmarking results of the df - 1 operation with different shapes compared to master.

shape         0.23.4   3855a27be4f04d15e7ba7aee12f0220c93148d3d   factor
100, 100      0.0343     2.3835                                   x69
10000, 100    1.5811     5.0893                                   x3
100, 10000    1.4902   229.1662                                   x153
1000, 1000    1.5120    26.9160                                   x18

Concerning GroupBy.apply, there seems to be just a slight performance regression of up to 30 %, as indicated in my earlier comment.

import timeit
import numpy as np  # 1.16.2
import pandas as pd

def benchmark():
    nrows, ncols = 1000, 100
    df = pd.DataFrame(np.random.rand(nrows, ncols))
    df["key"] = range(nrows)

    numeric_columns = list(range(ncols))

    grouping = df.groupby(by="key")
    grouping[numeric_columns].apply(lambda x: x.mean())

time = timeit.timeit(benchmark, number=10)
print("# {:>8.4f} sec   pandas=={}".format(time, pd.__version__))

shape         0.23.4   3855a27be4f04d15e7ba7aee12f0220c93148d3d   factor
100, 100      0.2715    0.3211                                    x1.18
1000, 100    19.4875   26.0909                                    x1.34
100, 1000     0.6296    0.6103                                    x0.96
1000, 1000    5.0933    5.6909                                    x1.11

Comment From: chris-b1

Note that if this is your actual function, you can (and should) instead do the following, which has always been faster:

df[numeric_columns] - grouping[numeric_columns].transform('mean')
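A minimal sketch showing that the two approaches produce the same values (the data, the column names, and the group_keys=False flag are illustrative choices, not taken from the thread):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
nrows, ncols = 8, 3
df = pd.DataFrame(rng.random((nrows, ncols)))
df["key"] = [0, 0, 1, 1, 2, 2, 3, 3]

numeric_columns = list(range(ncols))
grouping = df.groupby("key", group_keys=False)

# slow path: apply() materializes one result frame per group
via_apply = grouping[numeric_columns].apply(lambda x: x - x.mean())

# fast path: transform() broadcasts the per-group means in one pass
via_transform = df[numeric_columns] - grouping[numeric_columns].transform("mean")

assert np.allclose(via_apply.sort_index(), via_transform.sort_index())
```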

Comment From: TomAugspurger

Can you check for duplicates? We have another issue for ops that were previously blockwise, but are now columnwise.

Comment From: WillAyd

Local ASV result comparing current head to the last commit on 0.23.4 confirms a regression for the frame ops:

       before           after         ratio
     [af7b0ba4]       [95c78d65]
                      <master>
+      3.30±0.2ms          249±3ms    75.37  binary_ops.Ops2.time_frame_float_div_by_zero
+      8.08±0.4ms         256±10ms    31.75  binary_ops.Ops2.time_frame_int_div_by_zero
+      12.6±0.2ms          261±7ms    20.78  binary_ops.Ops2.time_frame_float_floor_by_zero
+      30.2±0.4ms          106±1ms     3.52  binary_ops.Ops.time_frame_multi_and(False, 1)
+      29.9±0.7ms          102±4ms     3.43  binary_ops.Ops.time_frame_multi_and(False, 'default')
+      34.1±0.4ms          110±2ms     3.24  binary_ops.Ops.time_frame_multi_and(True, 1)
+      39.2±0.4ms        111±0.9ms     2.83  binary_ops.Ops.time_frame_multi_and(True, 'default')
+     3.85±0.06ms       5.18±0.1ms     1.35  binary_ops.Ops.time_frame_add(True, 1)
+      28.6±0.1μs       37.4±0.2μs     1.31  binary_ops.Ops2.time_series_dot
+     3.54±0.06ms       4.29±0.3ms     1.21  binary_ops.Ops.time_frame_add(False, 1)
+         512±1μs          600±3μs     1.17  binary_ops.Ops2.time_frame_series_dot
-         107±2ms         61.2±1ms     0.57  binary_ops.Ops.time_frame_comparison(False, 'default')
-         108±1ms       61.1±0.9ms     0.57  binary_ops.Ops.time_frame_comparison(False, 1)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

I've updated the title to reflect this, as I think that is the larger issue. @blu3r4y, if you can run ASVs for GroupBy.apply to confirm the regression there, that could be helpful as a separate issue

Comment From: blu3r4y

Can you check for duplicates? We have another issue for ops that were previously blockwise, but are now columnwise.

Yes, regression with DataFrame + Scalar Ops in 0.24+ have already been reported in #24990.

@WillAyd I will run ASV for GroupBy.apply soon, so that we keep this issue focused on GroupBy.apply, right? Or should I open a new issue then?

Comment From: WillAyd

That makes sense - thanks Mario!

Comment From: blu3r4y

As suggested by @WillAyd, I ran ASV for GroupBy.apply in order to observe only the impact there.

  • I ran the asv call 5 times for warm-up and 5 times for reporting, on a clean Debian 9.7 CPU-optimized droplet from DigitalOcean. The results differ drastically between runs, which is why I did multiple runs in the first place.
  • Between v0.23.4 and v0.24.0, I observed a regression in time_groupby_apply_dict_return (1.78x - 3.10x) and one in time_scalar_function_single_col (1.12x - 1.52x)
  • Comparing v0.23.4 and the current HEAD, these regressions are still observable: time_groupby_apply_dict_return (1.76x - 2.42x) and time_scalar_function_single_col (1.11x - 1.72x)

These findings must be taken with a grain of salt, since the benchmarks do not show stable results (maybe interesting for the discussion in #23412). Still, I think the impact on time_groupby_apply_dict_return and time_scalar_function_single_col is significant in any case.

I renamed this issue to focus on the performance regression in GroupBy.apply, since I reckon #24990 already covers the other root cause, namely the performance regression in DataFrame binary ops, which heavily affected my initial MWE on column-heavy shapes as well, as the earlier comments here showed.

v0.23.4 0409521665bd436a10aea7e06336066bf07ff057 vs. v0.24.0 83eb2428ceb6257042173582f3f436c2c887aa69 (5 warm-up runs + 5 reported runs)

$ for i in {1..10}; do asv continuous -f 1.1 v0.23.4 v0.24.0 -b groupby.Apply; done

       before           after         ratio
     [04095216]       [83eb2428]
+      85.6±0.2ms        152±0.8ms     1.78  groupby.ApplyDictReturn.time_groupby_apply_dict_return
+      48.1±0.2ms       54.4±0.2ms     1.13  groupby.Apply.time_scalar_function_multi_col
-         643±4ms          415±4ms     0.64  groupby.Apply.time_copy_overhead_single_col

       before           after         ratio
     [04095216]       [83eb2428]
+      49.4±0.1ms        153±0.1ms     3.10  groupby.ApplyDictReturn.time_groupby_apply_dict_return
+        9.25±3ms      14.1±0.07ms     1.52  groupby.Apply.time_scalar_function_single_col
+         646±2ms          752±1ms     1.16  groupby.Apply.time_copy_overhead_single_col

       before           after         ratio
     [04095216]       [83eb2428]
+      49.2±0.2ms         120±30ms     2.43  groupby.ApplyDictReturn.time_groupby_apply_dict_return
+     6.07±0.06ms      6.77±0.08ms     1.12  groupby.Apply.time_scalar_function_single_col

       before           after         ratio
     [04095216]       [83eb2428]
+       69.2±20ms        151±0.2ms     2.19  groupby.ApplyDictReturn.time_groupby_apply_dict_return
+       35.1±10ms       54.8±0.5ms     1.56  groupby.Apply.time_scalar_function_multi_col
+       1.30±0.4s       1.96±0.01s     1.51  groupby.Apply.time_copy_function_multi_col
+     12.4±0.06ms      13.9±0.04ms     1.12  groupby.Apply.time_scalar_function_single_col

       before           after         ratio
     [04095216]       [83eb2428]
+      49.4±0.4ms         121±40ms     2.44  groupby.ApplyDictReturn.time_groupby_apply_dict_return
+        914±20ms        1.49±0.5s     1.63  groupby.Apply.time_copy_function_multi_col
+     6.00±0.03ms      6.76±0.08ms     1.13  groupby.Apply.time_scalar_function_single_col

v0.23.4 0409521665bd436a10aea7e06336066bf07ff057 vs. HEAD 437efa6e974e506c7cc5f142d5186bf6a7f5ce13 (5 warm-up runs + 5 reported runs)

$ for i in {1..10}; do asv continuous -f 1.1 v0.23.4 HEAD -b groupby.Apply; done

       before           after         ratio
     [04095216]       [437efa6e]
+      86.6±0.1ms          152±1ms     1.76  groupby.ApplyDictReturn.time_groupby_apply_dict_return
+     5.99±0.03ms         10.3±3ms     1.72  groupby.Apply.time_scalar_function_single_col

       before           after         ratio
     [04095216]       [437efa6e]
+      49.0±0.1ms         119±30ms     2.42  groupby.ApplyDictReturn.time_groupby_apply_dict_return
+         908±8ms       1.08±0.01s     1.19  groupby.Apply.time_copy_function_multi_col
+     6.02±0.07ms      6.67±0.05ms     1.11  groupby.Apply.time_scalar_function_single_col

       before           after         ratio
     [04095216]       [437efa6e]
+     5.99±0.03ms         10.2±4ms     1.70  groupby.Apply.time_scalar_function_single_col
+         353±3ms          411±1ms     1.16  groupby.Apply.time_copy_overhead_single_col
+      22.0±0.1ms       24.7±0.2ms     1.12  groupby.Apply.time_scalar_function_multi_col

       before           after         ratio
     [04095216]       [437efa6e]
+      48.4±0.3ms       54.1±0.2ms     1.12  groupby.Apply.time_scalar_function_multi_col

       before           after         ratio
     [04095216]       [437efa6e]
+      49.2±0.5ms         118±30ms     2.41  groupby.ApplyDictReturn.time_groupby_apply_dict_return
+     6.04±0.05ms         10.2±4ms     1.69  groupby.Apply.time_scalar_function_single_col
+         352±1ms        592±200ms     1.68  groupby.Apply.time_copy_overhead_single_col

Comment From: pv

FWIW, on a desktop computer the benchmark numbers are fairly stable:

$ for i in {1..10}; do asv continuous -f 1 v0.23.4 v0.24.0 -b groupby.Apply --cpu-affinity 5; done
       before           after         ratio
     [04095216]       [83eb2428]
     <v0.23.4^0>       <v0.24.0^0>
+      57.8±0.2ms       94.1±0.3ms     1.63  groupby.ApplyDictReturn.time_groupby_apply_dict_return
+         465±5ms          522±5ms     1.12  groupby.Apply.time_copy_overhead_single_col
+      1.19±0.01s       1.34±0.02s     1.12  groupby.Apply.time_copy_function_multi_col

       before           after         ratio
     [04095216]       [83eb2428]
     <v0.23.4^0>       <v0.24.0^0>
+      56.2±0.1ms       93.5±0.2ms     1.66  groupby.ApplyDictReturn.time_groupby_apply_dict_return
+         451±2ms          526±4ms     1.17  groupby.Apply.time_copy_overhead_single_col
+         1.17±0s       1.35±0.01s     1.16  groupby.Apply.time_copy_function_multi_col
+     6.36±0.09ms       7.35±0.2ms     1.16  groupby.Apply.time_scalar_function_single_col
+      23.1±0.7ms       26.3±0.4ms     1.14  groupby.Apply.time_scalar_function_multi_col

       before           after         ratio
     [04095216]       [83eb2428]
     <v0.23.4^0>       <v0.24.0^0>
+      57.0±0.6ms       93.3±0.9ms     1.64  groupby.ApplyDictReturn.time_groupby_apply_dict_return
+         451±2ms          526±4ms     1.17  groupby.Apply.time_copy_overhead_single_col
+        23.0±1ms       26.7±0.1ms     1.16  groupby.Apply.time_scalar_function_multi_col
+      1.17±0.01s       1.35±0.01s     1.15  groupby.Apply.time_copy_function_multi_col
+      6.64±0.3ms      7.29±0.07ms     1.10  groupby.Apply.time_scalar_function_single_col

       before           after         ratio
     [04095216]       [83eb2428]
     <v0.23.4^0>       <v0.24.0^0>
+      56.0±0.2ms       93.4±0.8ms     1.67  groupby.ApplyDictReturn.time_groupby_apply_dict_return
+     6.38±0.08ms      7.47±0.03ms     1.17  groupby.Apply.time_scalar_function_single_col
+      23.4±0.2ms       27.0±0.5ms     1.16  groupby.Apply.time_scalar_function_multi_col
+         454±2ms          520±8ms     1.15  groupby.Apply.time_copy_overhead_single_col
+      1.18±0.01s       1.33±0.01s     1.13  groupby.Apply.time_copy_function_multi_col

If the accuracy is not sufficient, you can add -a processes=5 to run 5 rounds (instead of the default 2) to get a better sample of the fluctuations.

Comment From: rhshadrach

I'm seeing 0.2614 seconds on my machine on main, but that means relatively little in isolation. In any case, I think this issue is too old; comparisons with 0.23 performance are no longer useful.