Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this issue exists on the latest version of pandas.
- [ ] I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
Calling df.agg([function]) is much slower than df.agg(function) when there are many columns and few rows.
I apologize if this is a known issue; I could not find a reference using the keywords that came to mind. Inspired by this SO question.
from timeit import timeit
import pandas as pd # version 1.4.0
import numpy as np # version 1.22.1
np.random.seed(0)
num_cols = 1000
df_ten_rows = pd.DataFrame(np.random.randint(0, 10, size=(10, num_cols)))
df_10k_rows = pd.DataFrame(np.random.randint(0, 10, size=(10_000, num_cols)))
assert pd.DataFrame(df_ten_rows.agg("sum").rename("sum")).T.equals(df_ten_rows.agg(["sum"]))
assert pd.DataFrame(df_10k_rows.agg("sum").rename("sum")).T.equals(df_10k_rows.agg(["sum"]))
setup = """
import pandas as pd # version 1.4.0
import numpy as np # version 1.22.1
np.random.seed(0)
num_cols = 1000
df_ten_rows = pd.DataFrame(np.random.randint(0, 10, size=(10, num_cols)))
df_10k_rows = pd.DataFrame(np.random.randint(0, 10, size=(10_000, num_cols)))
"""
number = 10 # number of repetitions
codes = {
"10 rows, agg on sum": 'pd.DataFrame(df_ten_rows.agg("sum").rename("sum")).T',
"10 rows, agg on [sum]": 'df_ten_rows.agg(["sum"])',
"10k rows, agg on sum": 'pd.DataFrame(df_10k_rows.agg("sum").rename("sum")).T',
"10k rows, agg on [sum]": 'df_10k_rows.agg(["sum"])',
}
times = {
description: timeit(code, setup=setup, number=number)/number
for description, code in codes.items()
}
# {
# '10 rows, agg on sum': 0.005952695199812297,
# '10 rows, agg on [sum]': 2.789351126600013,
# '10k rows, agg on sum': 0.018355781599893817,
# '10k rows, agg on [sum]': 2.341517233000195
# }
Installed Versions
Prior Performance
No response
Comment From: phofl
Please try on newer pandas, e.g. 1.4
Comment From: yakymp
Updated example using pandas 1.4
Comment From: rhshadrach
Thanks for the report!
One may think of DataFrame.agg(['sum']) as an alias for DataFrame.sum() (the former returning a DataFrame, the latter a Series), but it is not. Internally, supplying a list of aggregations breaks up the DataFrame into individual Series and aggregates each of these. This is the reason DataFrame.agg(['sum']) gives a transpose of DataFrame.sum(), and why it is so much less performant on wide frames. I think this should be fixed.
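To make the orientation difference concrete, here is a minimal sketch on a small all-integer frame (the frame and variable names are made up for illustration). It only shows the shapes and labels of the two results, not the internals.

```python
import pandas as pd

# Illustrative frame; any small numeric frame would do.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Frame-level reduction: a Series indexed by the column labels.
as_series = df.sum()

# List-style aggregation: a one-row DataFrame whose index is the function name.
as_frame = df.agg(["sum"])

print(as_series.shape, list(as_series.index))
print(as_frame.shape, list(as_frame.index), list(as_frame.columns))
```

The column labels sit on the index of the Series but on the columns of the frame, which is the transpose relationship described above.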
The trouble is that because agg allows for partial failures, you can't call DataFrame.sum() under the hood and retain the current behavior. Allowing partial failure is deprecated, and once it is removed, we can move forward. To maintain the current behavior, a hefty amount of reshaping would have to be done for the transpose (especially to get the index levels in the correct order when dealing with a MultiIndex), and I don't think it will be possible to handle all dtypes correctly. Instead, I think we should not be transposing the result when comparing .agg('sum') and .agg(['sum']).
However, we run into more trouble because the code paths for agg and apply are intertwined: sometimes agg internally calls apply and vice versa (https://github.com/pandas-dev/pandas/issues/41112#issuecomment-902366002). Because each of these can accept a UDF, it is hard to understand how we might be impacting user code. This is why I think an experimental option as proposed in #41112 is the right way to go.
This performance would be improved by #45557, when using the experimental option.
Comment From: jbrockmendel
Now that the deprecation @rhshadrach referred to is enforced, it may be viable to improve this.
Comment From: rhshadrach
I've looked into this some more. The issue is that using DataFrame reductions (instead of breaking up the frame into Series and using the Series reductions) requires taking a transpose. E.g., if you have a frame with a mixture of int/float dtypes, using DataFrame.sum results in a float Series, which upon transposing to a 1-row frame has the wrong dtypes. Since this needs to work for an arbitrary UDF, we don't have the necessary information to cast afterwards.
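The dtype problem can be seen on a tiny mixed int/float frame. This is only an illustrative sketch (the frame and variable names are invented here); it compares the dtypes after transposing a frame-level sum with the dtypes that list-style agg produces.

```python
import pandas as pd

# Illustrative frame with one integer column and one float column.
df = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]})

# Frame-level reduction: a single Series, so both sums share one dtype.
summed = df.sum()

# Transposing into a 1-row frame carries that common dtype to every column.
transposed = summed.to_frame("sum").T

# List-style agg reduces column by column, so each column keeps its own dtype.
listed = df.agg(["sum"])

print(transposed.dtypes)
print(listed.dtypes)
```

With a built-in reduction like sum one could cast back afterwards, but for an arbitrary UDF the intended result dtype is not known, which is the obstacle described above.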
I'd be in favor of not transposing the result, so that df.agg('sum') and df.agg(['sum']) give the same result except that the former is a Series and the latter a DataFrame. But this would require a deprecation, and I don't see how to do that without adding an argument to agg to allow adoption of future behavior (e.g. transpose_result).
Comment From: jbrockmendel
> DataFrame.agg(['sum']) gives a transpose of DataFrame.sum()
I would not have guessed this. Am I the only one who finds this weird?
> But this would require a deprecation and I don't see how to do that with without adding an argument [...]
I've been thinking lately that a pattern like pd.options.set_option("future.agg_behavior") might be useful for opting in to future behavior.
Comment From: rhshadrach
> I would not have guessed this. Am I the only one who finds this weird?
I think it goes back to allowing partial failure - that can only be done by first breaking up the frame into series and operating on each one. Once you've done this, it's a natural result of using concat.
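A rough sketch of why the orientation falls out of concat: aggregate each column as its own Series, then glue the pieces together side by side. This is an illustration of the idea with made-up names, not the actual pandas implementation.

```python
import pandas as pd

# Illustrative frame.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Aggregate each column separately; each result is a Series indexed by function name.
per_column = {name: df[name].agg(["sum"]) for name in df.columns}

# Concatenating along axis=1 puts the column labels on the columns
# and leaves the function names on the index.
combined = pd.concat(per_column, axis=1)

print(combined)
```

Because each per-column piece is indexed by the function name, placing them next to each other naturally yields one row per function.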
> I've been thinking lately that something a pattern like pd.options.set_option("future.agg_behavior") might be useful for opting in to future behavior
+1