Code Sample
import time
import numpy as np
import pandas as pd
nrows, ncols = 1000, 100
# data frame with random values and a key to be grouped by
df = pd.DataFrame(np.random.rand(nrows, ncols))
df["key"] = range(nrows)
numeric_columns = list(range(ncols))
grouping = df.groupby(by="key")
# performance regression in apply()
start = time.time()
grouping[numeric_columns].apply(lambda x: x - x.mean())
end = time.time()
print("[pandas=={}] execution time: {:.4f} seconds".format(pd.__version__, end - start))
# [pandas==0.23.4] execution time: 0.8700 seconds
# [pandas==0.24.0] execution time: 24.3790 seconds
# [pandas==0.24.2] execution time: 23.9600 seconds
Problem description
The function GroupBy.apply is roughly 25 times slower in version 0.24.0 than in the 0.23.4 release, and the problem persists in the latest 0.24.2 release.
The code sample above demonstrates this performance regression. The purpose of the sample is to subtract the group mean from all elements of each group.
The problem only occurs when the lambda passed to apply() returns a DataFrame. There are no performance issues with scalar return values, e.g. lambda x: x.mean().
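For reference, here is a minimal sketch of that scalar-returning variant, reusing df, grouping, and numeric_columns from the code sample above; this is the path that does not exhibit the regression:
# per-group mean (non-DataFrame return): no regression reported for this case
start = time.time()
grouping[numeric_columns].apply(lambda x: x.mean())
end = time.time()
print("[pandas=={}] execution time: {:.4f} seconds".format(pd.__version__, end - start))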
Output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.24.0
pytest: None
pip: 10.0.1
setuptools: 39.2.0
Cython: None
numpy: 1.16.2
scipy: 1.2.1
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
Comment From: WillAyd
Can you isolate the frame operation from the groupby? Curious whether the regression is noticeable in the former.
Comment From: blu3r4y
You mean like so? Well, there is a slight difference (~ 30 % slower), but I wouldn't trust my "benchmark" here too much; I ran the code a few times and the numbers varied between 0.5 and 1 second.
import time
import numpy as np
import pandas as pd
nrows, ncols = 1000, 1000
df = pd.DataFrame(np.random.rand(nrows, ncols))
start = time.time()
for _ in range(100):
    df.apply(lambda x: x - x.mean())
end = time.time()
print("[pandas=={}] execution time: {:.4f} seconds".format(pd.__version__, end - start))
# [pandas==0.23.4] execution time: 25.8880 seconds
# [pandas==0.24.0] execution time: 36.0216 seconds
# [pandas==0.24.2] execution time: 34.6180 seconds
Additionally, I tested the original code sample with only one group. It is still about 2 times slower in 0.24+ compared to 0.23.4, but not as drastic as with multiple groups.
nrows, ncols = 10000, 10000
df = pd.DataFrame(np.random.rand(nrows, ncols))
df["key"] = [1] * nrows  # a single group
# [pandas==0.23.4] execution time: 5.5250 seconds
# [pandas==0.24.0] execution time: 12.1590 seconds
# [pandas==0.24.2] execution time: 12.1540 seconds
Comment From: WillAyd
Right, I'm just trying to isolate potential regressions in GroupBy versus frame operations. Given that you don't see the same regression with scalars, I'm inclined to believe it's the latter that may be at fault here.
Can you try your last example on master? I think those results might be misleading, as the apply operation would still get called twice even with only one group in 0.24.2 (see #24748, which just changed this behavior), so it might not be a clean comparison to make.
@TomAugspurger we were never able to get the ASV site back up and running, were we?
Comment From: blu3r4y
I guess I got closer to the problem; it really does seem to be related to DataFrame operations, i.e. simply subtracting a scalar from a DataFrame already triggers the regression. And it's huge: on column-heavy shapes, this simple operation is 153 times slower ô.O
import timeit
import numpy as np # 1.16.2
import pandas as pd
def benchmark():
    nrows, ncols = 100, 100
    df = pd.DataFrame(np.random.rand(nrows, ncols))
    _ = df - 1
time = timeit.timeit(benchmark, number=100)
print("# {:>8.4f} sec pandas=={}".format(time, pd.__version__))
Here are my benchmarking results for the df - 1 operation with different shapes, comparing 0.23.4 to master.
| shape | 0.23.4 (sec) | 3855a27be4f04d15e7ba7aee12f0220c93148d3d (sec) | factor |
|---|---|---|---|
| 100, 100 | 0.0343 | 2.3835 | x69 |
| 10000, 100 | 1.5811 | 5.0893 | x3 |
| 100, 10000 | 1.4902 | 229.1662 | x153 |
| 1000, 1000 | 1.5120 | 26.9160 | x18 |
Concerning GroupBy.apply, there seems to be just a slight performance regression of up to 30 %, as indicated in my earlier comment.
import timeit
import numpy as np # 1.16.2
import pandas as pd
def benchmark():
    nrows, ncols = 1000, 100
    df = pd.DataFrame(np.random.rand(nrows, ncols))
    df["key"] = range(nrows)
    numeric_columns = list(range(ncols))
    grouping = df.groupby(by="key")
    grouping[numeric_columns].apply(lambda x: x.mean())
time = timeit.timeit(benchmark, number=10)
print("# {:>8.4f} sec pandas=={}".format(time, pd.__version__))
| shape | 0.23.4 (sec) | 3855a27be4f04d15e7ba7aee12f0220c93148d3d (sec) | factor |
|---|---|---|---|
| 100, 100 | 0.2715 | 0.3211 | x1.18 |
| 1000, 100 | 19.4875 | 26.0909 | x1.34 |
| 100, 1000 | 0.6296 | 0.6103 | x0.96 |
| 1000, 1000 | 5.0933 | 5.6909 | x1.11 |
Comment From: chris-b1
Note that if this is your actual function, you can/should instead do the following, which has always been faster:
df[numeric_columns] - grouping[numeric_columns].transform('mean')
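A minimal sketch on toy data (the frame and names below are illustrative, not from this thread) checking that the transform-based version produces the same demeaned values as the apply-based one; transform('mean') computes the group means once and broadcasts them back to the original shape instead of invoking the Python lambda per group:
import numpy as np
import pandas as pd

toy = pd.DataFrame({"key": [0, 0, 1, 1], "a": [1.0, 3.0, 2.0, 6.0]})
g = toy.groupby("key")
# apply-based: calls the lambda once per group and concatenates the results
via_apply = g[["a"]].apply(lambda x: x - x.mean())
# transform-based: one group-mean aggregation, broadcast back to the row shape
via_transform = toy[["a"]] - g[["a"]].transform("mean")
assert np.allclose(via_apply.to_numpy(), via_transform.to_numpy())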
Comment From: TomAugspurger
Can you check for duplicates? We have another issue for ops that were previously blockwise, but are now columnwise.
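For readers unfamiliar with the distinction, a rough illustration of why columnwise ops carry more overhead than blockwise ones (a sketch of the idea only, not pandas' actual internal code paths):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(100, 100))

# blockwise-style: one vectorized operation over the underlying 2-D array
blockwise = pd.DataFrame(df.to_numpy() - 1, index=df.index, columns=df.columns)

# columnwise-style: one operation per column, paying per-column overhead
# (this fixed per-column cost is what dominates on column-heavy shapes)
columnwise = pd.concat({c: df[c] - 1 for c in df.columns}, axis=1)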
Comment From: WillAyd
Local ASV result comparing the current HEAD to the last commit on 0.23.4 confirms a regression for the frame ops:
before after ratio
[af7b0ba4] [95c78d65]
<master>
+ 3.30±0.2ms 249±3ms 75.37 binary_ops.Ops2.time_frame_float_div_by_zero
+ 8.08±0.4ms 256±10ms 31.75 binary_ops.Ops2.time_frame_int_div_by_zero
+ 12.6±0.2ms 261±7ms 20.78 binary_ops.Ops2.time_frame_float_floor_by_zero
+ 30.2±0.4ms 106±1ms 3.52 binary_ops.Ops.time_frame_multi_and(False, 1)
+ 29.9±0.7ms 102±4ms 3.43 binary_ops.Ops.time_frame_multi_and(False, 'default')
+ 34.1±0.4ms 110±2ms 3.24 binary_ops.Ops.time_frame_multi_and(True, 1)
+ 39.2±0.4ms 111±0.9ms 2.83 binary_ops.Ops.time_frame_multi_and(True, 'default')
+ 3.85±0.06ms 5.18±0.1ms 1.35 binary_ops.Ops.time_frame_add(True, 1)
+ 28.6±0.1μs 37.4±0.2μs 1.31 binary_ops.Ops2.time_series_dot
+ 3.54±0.06ms 4.29±0.3ms 1.21 binary_ops.Ops.time_frame_add(False, 1)
+ 512±1μs 600±3μs 1.17 binary_ops.Ops2.time_frame_series_dot
- 107±2ms 61.2±1ms 0.57 binary_ops.Ops.time_frame_comparison(False, 'default')
- 108±1ms 61.1±0.9ms 0.57 binary_ops.Ops.time_frame_comparison(False, 1)
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
I've updated the title to reflect this, as I think that is the larger issue. @blu3r4y if you can run ASVs for GroupBy apply to confirm the regression there, that could be helpful as a separate issue.
Comment From: blu3r4y
Can you check for duplicates? We have another issue for ops that were previously blockwise, but are now columnwise.
Yes, the regression with DataFrame + scalar ops in 0.24+ has already been reported in #24990.
@WillAyd I will run ASV for GroupBy.apply soon, so that we keep this issue focused on GroupBy.apply, right? Or should I open a new issue then?
Comment From: WillAyd
That makes sense - thanks Mario!
Comment From: blu3r4y
As suggested by @WillAyd, I ran ASV for GroupBy.apply in order to observe the impact on that alone.
- I did the asv call 5 times for warm-up and 5 times for the sake of reporting, on a clean Debian 9.7 CPU-optimized droplet from Digital Ocean. The results differ drastically between runs, which is why I did multiple runs in the first place.
- I observed a regression between v0.23.4 and v0.24.0 in time_groupby_apply_dict_return (1.78x - 3.10x) and one in time_scalar_function_single_col (1.12x - 1.52x).
- Comparing v0.23.4 and the current HEAD, these regressions are still observable: time_groupby_apply_dict_return (1.76x - 2.42x) and time_scalar_function_single_col (1.11x - 1.72x).
These findings must be taken with a grain of salt, since the benchmarks do not show stable results (maybe interesting for the discussion in #23412). Still, I'd guess the impact on time_groupby_apply_dict_return and time_scalar_function_single_col is significant in any case.
I renamed this issue to focus on the performance regression in GroupBy.apply, since I reckon #24990 already covers the other root cause, namely the performance regression in DataFrame binary ops, which heavily affected my initial MWE on column-heavy shapes as well, as older comments here showed.
v0.23.4 0409521665bd436a10aea7e06336066bf07ff057 vs. v0.24.0 83eb2428ceb6257042173582f3f436c2c887aa69 (5 warm-up runs + 5 reported runs)
$ for i in {1..10}; do asv continuous -f 1.1 v0.23.4 v0.24.0 -b groupby.Apply; done
before after ratio
[04095216] [83eb2428]
+ 85.6±0.2ms 152±0.8ms 1.78 groupby.ApplyDictReturn.time_groupby_apply_dict_return
+ 48.1±0.2ms 54.4±0.2ms 1.13 groupby.Apply.time_scalar_function_multi_col
- 643±4ms 415±4ms 0.64 groupby.Apply.time_copy_overhead_single_col
before after ratio
[04095216] [83eb2428]
+ 49.4±0.1ms 153±0.1ms 3.10 groupby.ApplyDictReturn.time_groupby_apply_dict_return
+ 9.25±3ms 14.1±0.07ms 1.52 groupby.Apply.time_scalar_function_single_col
+ 646±2ms 752±1ms 1.16 groupby.Apply.time_copy_overhead_single_col
before after ratio
[04095216] [83eb2428]
+ 49.2±0.2ms 120±30ms 2.43 groupby.ApplyDictReturn.time_groupby_apply_dict_return
+ 6.07±0.06ms 6.77±0.08ms 1.12 groupby.Apply.time_scalar_function_single_col
before after ratio
[04095216] [83eb2428]
+ 69.2±20ms 151±0.2ms 2.19 groupby.ApplyDictReturn.time_groupby_apply_dict_return
+ 35.1±10ms 54.8±0.5ms 1.56 groupby.Apply.time_scalar_function_multi_col
+ 1.30±0.4s 1.96±0.01s 1.51 groupby.Apply.time_copy_function_multi_col
+ 12.4±0.06ms 13.9±0.04ms 1.12 groupby.Apply.time_scalar_function_single_col
before after ratio
[04095216] [83eb2428]
+ 49.4±0.4ms 121±40ms 2.44 groupby.ApplyDictReturn.time_groupby_apply_dict_return
+ 914±20ms 1.49±0.5s 1.63 groupby.Apply.time_copy_function_multi_col
+ 6.00±0.03ms 6.76±0.08ms 1.13 groupby.Apply.time_scalar_function_single_col
v0.23.4 0409521665bd436a10aea7e06336066bf07ff057 vs. HEAD 437efa6e974e506c7cc5f142d5186bf6a7f5ce13 (5 warm-up runs + 5 reported runs)
$ for i in {1..10}; do asv continuous -f 1.1 v0.23.4 HEAD -b groupby.Apply; done
before after ratio
[04095216] [437efa6e]
+ 86.6±0.1ms 152±1ms 1.76 groupby.ApplyDictReturn.time_groupby_apply_dict_return
+ 5.99±0.03ms 10.3±3ms 1.72 groupby.Apply.time_scalar_function_single_col
before after ratio
[04095216] [437efa6e]
+ 49.0±0.1ms 119±30ms 2.42 groupby.ApplyDictReturn.time_groupby_apply_dict_return
+ 908±8ms 1.08±0.01s 1.19 groupby.Apply.time_copy_function_multi_col
+ 6.02±0.07ms 6.67±0.05ms 1.11 groupby.Apply.time_scalar_function_single_col
before after ratio
[04095216] [437efa6e]
+ 5.99±0.03ms 10.2±4ms 1.70 groupby.Apply.time_scalar_function_single_col
+ 353±3ms 411±1ms 1.16 groupby.Apply.time_copy_overhead_single_col
+ 22.0±0.1ms 24.7±0.2ms 1.12 groupby.Apply.time_scalar_function_multi_col
before after ratio
[04095216] [437efa6e]
+ 48.4±0.3ms 54.1±0.2ms 1.12 groupby.Apply.time_scalar_function_multi_col
before after ratio
[04095216] [437efa6e]
+ 49.2±0.5ms 118±30ms 2.41 groupby.ApplyDictReturn.time_groupby_apply_dict_return
+ 6.04±0.05ms 10.2±4ms 1.69 groupby.Apply.time_scalar_function_single_col
+ 352±1ms 592±200ms 1.68 groupby.Apply.time_copy_overhead_single_col
Comment From: pv
FWIW, on a desktop computer the benchmark numbers are fairly stable:
$ for i in {1..10}; do asv continuous -f 1 v0.23.4 v0.24.0 -b groupby.Apply --cpu-affinity 5; done
before after ratio
[04095216] [83eb2428]
<v0.23.4^0> <v0.24.0^0>
+ 57.8±0.2ms 94.1±0.3ms 1.63 groupby.ApplyDictReturn.time_groupby_apply_dict_return
+ 465±5ms 522±5ms 1.12 groupby.Apply.time_copy_overhead_single_col
+ 1.19±0.01s 1.34±0.02s 1.12 groupby.Apply.time_copy_function_multi_col
before after ratio
[04095216] [83eb2428]
<v0.23.4^0> <v0.24.0^0>
+ 56.2±0.1ms 93.5±0.2ms 1.66 groupby.ApplyDictReturn.time_groupby_apply_dict_return
+ 451±2ms 526±4ms 1.17 groupby.Apply.time_copy_overhead_single_col
+ 1.17±0s 1.35±0.01s 1.16 groupby.Apply.time_copy_function_multi_col
+ 6.36±0.09ms 7.35±0.2ms 1.16 groupby.Apply.time_scalar_function_single_col
+ 23.1±0.7ms 26.3±0.4ms 1.14 groupby.Apply.time_scalar_function_multi_col
before after ratio
[04095216] [83eb2428]
<v0.23.4^0> <v0.24.0^0>
+ 57.0±0.6ms 93.3±0.9ms 1.64 groupby.ApplyDictReturn.time_groupby_apply_dict_return
+ 451±2ms 526±4ms 1.17 groupby.Apply.time_copy_overhead_single_col
+ 23.0±1ms 26.7±0.1ms 1.16 groupby.Apply.time_scalar_function_multi_col
+ 1.17±0.01s 1.35±0.01s 1.15 groupby.Apply.time_copy_function_multi_col
+ 6.64±0.3ms 7.29±0.07ms 1.10 groupby.Apply.time_scalar_function_single_col
before after ratio
[04095216] [83eb2428]
<v0.23.4^0> <v0.24.0^0>
+ 56.0±0.2ms 93.4±0.8ms 1.67 groupby.ApplyDictReturn.time_groupby_apply_dict_return
+ 6.38±0.08ms 7.47±0.03ms 1.17 groupby.Apply.time_scalar_function_single_col
+ 23.4±0.2ms 27.0±0.5ms 1.16 groupby.Apply.time_scalar_function_multi_col
+ 454±2ms 520±8ms 1.15 groupby.Apply.time_copy_overhead_single_col
+ 1.18±0.01s 1.33±0.01s 1.13 groupby.Apply.time_copy_function_multi_col
If the accuracy is not sufficient, you can add -a processes=5 to run 5 rounds (instead of the default 2) and get a better sample of the fluctuations.
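For example, combining that flag with the invocation above (same benchmark filter and CPU pinning, just more rounds):
$ asv continuous -f 1 v0.23.4 v0.24.0 -b groupby.Apply --cpu-affinity 5 -a processes=5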
Comment From: rhshadrach
I'm seeing 0.2614 seconds on my machine on main, but that means relatively little in isolation. However, I think this issue is too old; comparisons with 0.23 performance are no longer useful.