Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this issue exists on the latest version of pandas.

  • [ ] I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

Calling df.agg([function]) is much slower than df.agg(function) when there are many columns and few rows.

I apologize if this is a known issue; I could not find a reference using the keywords that came to mind. Inspired by this SO question.

from timeit import timeit

import pandas as pd  # version 1.4.0
import numpy as np   # version 1.22.1

np.random.seed(0)

num_cols = 1000
df_ten_rows = pd.DataFrame(np.random.randint(0, 10, size=(10, num_cols)))
df_10k_rows = pd.DataFrame(np.random.randint(0, 10, size=(10_000, num_cols)))

assert pd.DataFrame(df_ten_rows.agg("sum").rename("sum")).T.equals(df_ten_rows.agg(["sum"]))
assert pd.DataFrame(df_10k_rows.agg("sum").rename("sum")).T.equals(df_10k_rows.agg(["sum"]))

setup = """
import pandas as pd  # version 1.4.0
import numpy as np   # version 1.22.1

np.random.seed(0)

num_cols = 1000
df_ten_rows = pd.DataFrame(np.random.randint(0, 10, size=(10, num_cols)))
df_10k_rows = pd.DataFrame(np.random.randint(0, 10, size=(10_000, num_cols)))
"""
number = 10  # number of repetitions

codes = {
  "10 rows, agg on sum": 'pd.DataFrame(df_ten_rows.agg("sum").rename("sum")).T',
  "10 rows, agg on [sum]": 'df_ten_rows.agg(["sum"])',
  "10k rows, agg on sum": 'pd.DataFrame(df_10k_rows.agg("sum").rename("sum")).T',
  "10k rows, agg on [sum]": 'df_10k_rows.agg(["sum"])',
}

times = {
  description: timeit(code, setup=setup, number=number) / number
  for description, code in codes.items()
}

# {
#   '10 rows, agg on sum':    0.005952695199812297, 
#   '10 rows, agg on [sum]':  2.789351126600013, 
#   '10k rows, agg on sum':   0.018355781599893817, 
#   '10k rows, agg on [sum]': 2.341517233000195
# }

Installed Versions

INSTALLED VERSIONS
------------------
commit : bb1f651536508cdfef8550f93ace7849b00046ee
python : 3.8.12.final.0
python-bits : 64
OS : Linux
OS-release : 5.11.0-1028-gcp
Version : #32~20.04.1-Ubuntu SMP Wed Jan 12 20:08:27 UTC 2022
machine : x86_64
processor :
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.4.0
numpy : 1.22.1
pytz : 2021.3
dateutil : 2.8.2
pip : 21.2.dev0
setuptools : 56.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.4.3
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.4.23
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None

Prior Performance

No response

Comment From: phofl

Please try on newer pandas, e.g. 1.4

Comment From: yakymp

Updated example using pandas 1.4

Comment From: rhshadrach

Thanks for the report!

One may think of DataFrame.agg(['sum']) as an alias for DataFrame.sum() (the former returning a DataFrame, the latter a Series), but it is not. Internally, supplying a list of aggregations breaks up the DataFrame into individual Series, and aggregates each of these. This is the reason DataFrame.agg(['sum']) gives a transpose of DataFrame.sum(), and why it is so much less performant on wide frames. I think this should be fixed.
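For readers who want to see the relationship described here, the following is a minimal sketch on a small, made-up frame (the column names and values are illustrative assumptions, not taken from the report). It only compares the two public calls and does not reproduce pandas internals.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; names and values are arbitrary.
df = pd.DataFrame(np.arange(6).reshape(2, 3), columns=["a", "b", "c"])

as_series = df.sum()        # Series indexed by the column labels
as_frame = df.agg(["sum"])  # one-row DataFrame whose row label is "sum"

# Reshape the Series into the one-row layout that agg(["sum"]) returns.
reshaped = as_series.to_frame(name="sum").T

print(as_series.shape, as_frame.shape)
print(reshaped.equals(as_frame))
```

The reshaping step mirrors the assertion in the original example, which compares the transposed result of agg("sum") with agg(["sum"]).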

The trouble is that because agg allows for partial failures, you can't call DataFrame.sum() under the hood and retain the current behavior. Allowing partial failure is deprecated, and once removed, would enable us to move forward. To maintain the current behavior, a hefty amount of reshaping would have to be done for the transpose (especially to get the index levels in the correct order when dealing with a MultiIndex), and I don't think it will be possible to handle all dtypes correctly. Instead, I think we should not be transposing the result when comparing .agg('sum') and .agg(['sum']).

However, we run into more trouble because the code paths for agg and apply are intertwined: sometimes agg internally calls apply and vice versa (https://github.com/pandas-dev/pandas/issues/41112#issuecomment-902366002). Because each of these can accept a UDF, it is hard to understand how we might be impacting user code. This is why I think an experimental option as proposed in #41112 is the right way to go.

This performance would be improved by #45557, when using the experimental option.

Comment From: jbrockmendel

Now that the deprecation @rhshadrach referred to is enforced, it may be viable to improve this.

Comment From: rhshadrach

I've looked into this some more. The issue is that using DataFrame reductions (instead of breaking up the frame into Series and using the Series reductions) requires taking a transpose. E.g., if you have a frame with a mixture of int/float dtypes, DataFrame.sum returns a float Series, and transposing that into a 1-row frame then gives the wrong dtypes. Since this needs to work for an arbitrary UDF, we don't have the necessary information to cast afterwards.
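The dtype problem can be seen on a tiny mixed frame. This is only an illustration with made-up column names; the last line prints what the existing column-by-column path returns so the two can be compared, and its output may depend on the pandas version.

```python
import pandas as pd

# Illustrative frame with one integer column and one float column.
df = pd.DataFrame({"ints": [1, 2, 3], "floats": [0.5, 1.5, 2.5]})

# A single Series has to hold both results, so it is upcast to float64.
reduced = df.sum()

# Transposing into a one-row frame keeps float64 for every column,
# including the one that started out as integers.
one_row = reduced.to_frame(name="sum").T

print(reduced.dtype)
print(one_row.dtypes)
print(df.agg(["sum"]).dtypes)  # existing per-column path, for comparison
```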

I'd be in favor of not transposing the result, so that df.agg('sum') and df.agg(['sum']) give the same result except the former is a Series and the latter a DataFrame. But this would require a deprecation and I don't see how to do that without adding an argument to agg to allow adoption of future behavior (e.g. transpose_result).
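To make the two layouts concrete, here is a rough sketch using a hypothetical helper, agg_single. Neither the helper nor a transpose_result argument exists in pandas; the name is borrowed from the example above purely for illustration, and the helper assumes func is a string such as "sum".

```python
import pandas as pd


def agg_single(df, func, transpose_result=True):
    """Hypothetical helper (not a pandas API) contrasting the two layouts."""
    if transpose_result:
        # Current layout: one row per aggregation, one column per input column.
        return df.agg([func])
    # Proposed layout: same orientation as df.agg(func), but as a DataFrame.
    return df.agg(func).to_frame(name=func)


df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

current = agg_single(df, "sum")
proposed = agg_single(df, "sum", transpose_result=False)

print(current.shape, proposed.shape)
```

In the proposed layout the result is indexed by the original column labels, so no transpose is needed after the DataFrame reduction.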

Comment From: jbrockmendel

DataFrame.agg(['sum']) gives a transpose of DataFrame.sum()

I would not have guessed this. Am I the only one who finds this weird?

But this would require a deprecation and I don't see how to do that without adding an argument [...]

I've been thinking lately that a pattern like pd.options.set_option("future.agg_behavior") might be useful for opting in to future behavior

Comment From: rhshadrach

I would not have guessed this. Am I the only one who finds this weird?

I think it goes back to allowing partial failure - that can only be done by first breaking up the frame into series and operating on each one. Once you've done this, it's a natural result of using concat.
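As a rough illustration of why concat leads to the one-row layout, the per-column approach can be imitated by hand. This is a simplified sketch of the idea, not the actual pandas implementation, and the frame is made up.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Aggregate each column on its own; every piece is a Series indexed by "sum".
pieces = {name: col.agg(["sum"]) for name, col in df.items()}

# Gluing the pieces side by side yields one row per aggregation
# and one column per original column.
manual = pd.concat(pieces, axis=1)

print(manual)
```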

I've been thinking lately that a pattern like pd.options.set_option("future.agg_behavior") might be useful for opting in to future behavior

+1