Code Sample, a copy-pastable example if possible

import pandas as pd
import random

SIZE = 100000
GROUPS = 10000  # The larger, the more extreme the timing differences
CARDINALITY = 10
CAT = pd.CategoricalDtype(list(range(CARDINALITY)))
df_int = pd.DataFrame({
    'group': [random.randint(0, GROUPS) for _ in range(SIZE)],
    'cat': [random.choice(CAT.categories) for _ in range(SIZE)],
})
df_cat = df_int.astype({'cat': CAT})    # same data, 'cat' as a Categorical
df_str = df_cat.astype({'cat': 'str'})  # same data, 'cat' as plain strings

%timeit df_int.groupby('group').last()
# 100 loops, best of 3: 5.51 ms per loop
%timeit df_str.groupby('group').last()
# 100 loops, best of 3: 11.4 ms per loop
%timeit df_cat.groupby('group').last()
# 1 loop, best of 3: 4.62 s per loop

Problem description

We have a dataframe like the one above in production and noticed that a particular groupby call was drastically slowing down our pipeline. We were able to work around the issue by converting the categorical columns to strings before the groupby operation.

The issue occurs when categorical values are in the grouped frame (not when they're included in the groupby index). The problem scales with the number of groups: the larger the GROUPS value in the snippet above, the worse the timing difference becomes.
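
To make that asymmetry concrete, a quick check along these lines (reusing the df_cat frame from the snippet above):

%timeit df_cat.groupby('cat')['group'].last()  # categorical as the grouping key: fast
%timeit df_cat.groupby('group')['cat'].last()  # categorical being aggregated: slow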

We didn't expect using categories to cause this kind of performance issue. Is this expected, or could it be something worth optimizing?

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : None
python : 3.6.9.final.0
python-bits : 64
OS : Linux
OS-release : 4.14.137+
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.25.3
numpy : 1.18.2
pytz : 2018.9
dateutil : 2.8.1
pip : 19.3.1
setuptools : 46.0.0
Cython : 0.29.15
pytest : 3.6.4
hypothesis : None
sphinx : 1.8.5
blosc : None
feather : 0.4.0
xlsxwriter : None
lxml.etree : 4.2.6
html5lib : 1.0.1
pymysql : None
psycopg2 : 2.7.6.1 (dt dec pq3 ext lo64)
jinja2 : 2.11.1
IPython : 5.5.0
pandas_datareader : 0.7.4
bs4 : 4.6.3
bottleneck : 1.3.2
fastparquet : None
gcsfs : None
lxml.etree : 4.2.6
matplotlib : 3.2.1
numexpr : 2.7.1
odfpy : None
openpyxl : 2.5.9
pandas_gbq : 0.11.0
pyarrow : 0.14.1
pytables : None
s3fs : 0.4.0
scipy : 1.4.1
sqlalchemy : 1.3.15
tables : 3.4.4
xarray : 0.15.0
xlrd : 1.1.0
xlwt : 1.3.0
xlsxwriter : None

Comment From: TomAugspurger

Initially, I thought this would be due to unobserved categories, but the issue is present even with observed=True. The slowdown seems to come from doing the categorical aggregation in Python (_aggregate_series_pure_python) rather than in Cython (_cython_agg_general). I suppose a Categorical cannot be passed to our Cython aggfuncs, which expect a 2-D ndarray.
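
For reference, the observed=True check mentioned above, as a sketch against the df_cat frame from the report:

%timeit df_cat.groupby('group', observed=True).last()
# per the comment above, still the slow pure-Python path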

Comment From: rben01

I'm also seeing this on pandas 1.0.3. As with OP, the performance difference is astronomical between strings and categories.

pd.show_versions():

INSTALLED VERSIONS
------------------
commit : None
python : 3.8.1.final.0
python-bits : 64
OS : Darwin
OS-release : 19.4.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.3
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 46.0.0.post20200309
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 7.13.0
pandas_datareader : None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

Comment From: rtlee9

BaseGrouper raises a NotImplementedError here for categorical dtypes but not for other dtypes. My understanding is that this results in a slower, Python-based aggregation here. Happy to open a PR, but I'm not sure of the best approach yet.

Comment From: TomAugspurger

Ideally we would pass the categorical codes down to Cython, group by those, and then convert back to a CategoricalIndex when we're done.
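
In the meantime, that idea can be approximated in user space. A minimal sketch (hypothetical helper name, reusing the df_cat frame from the report): run the groupby on the integer codes so it takes the Cython path, then rebuild the Categorical from the aggregated codes. Note the missing-value semantics differ slightly, since NaN is encoded as code -1 and is therefore not skipped by last().

import pandas as pd

def last_via_codes(df, key, cat_col):
    # Hypothetical user-space version of the codes idea: swap the Categorical
    # for its integer codes so the groupby takes the Cython path, then rebuild
    # the Categorical from the aggregated codes (-1 maps back to NaN).
    dtype = df[cat_col].dtype
    out = df.assign(**{cat_col: df[cat_col].cat.codes}).groupby(key).last()
    out[cat_col] = pd.Categorical.from_codes(out[cat_col], dtype=dtype)
    return out

last_via_codes(df_cat, 'group', 'cat')  # fast; matches df_cat.groupby('group').last() when nothing is missing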

Comment From: adbull

Also seeing this bug. A faster workaround seems to be to use merge/drop_duplicates:

cats = pd.Categorical(range(10**6))
df = pd.DataFrame(dict(x=cats, y=cats))
%timeit df.astype(str).groupby(['x', 'y']).count()
# 2.84 s ± 62.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.merge(df[['x', 'y']].drop_duplicates().reset_index()).groupby('index').count()
# 845 ms ± 2.87 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
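
The same trick can be wrapped in a small reusable helper (hypothetical name, generalizing the snippet above): precompute an integer id per unique key combination, then group by that id instead of the categorical keys.

def groupby_via_merge(df, keys):
    # Deduplicate the key columns, carrying each combination's first row label
    # along as an integer 'index' column, then group by that integer id.
    uniq = df[keys].drop_duplicates().reset_index()
    return df.merge(uniq).groupby('index')

groupby_via_merge(df, ['x', 'y']).count()  # equivalent to the merge-based line above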

Comment From: jbrockmendel

Closed by #52120. The df_cat.groupby("group").last() call is now slightly faster than the corresponding df_int call.