Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [X] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

# EXAMPLE 1

import numpy as np
import pandas as pd
import time

t0 = time.time()
col1_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 20)) for _ in range(700000)]
col2_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 10)) for _ in range(860)]

col1_values = np.random.choice(col1_possible_values, size=13000000, replace=True)
col2_values = np.random.choice(col2_possible_values, size=13000000, replace=True)

sample_df = pd.DataFrame(zip(col1_values, col2_values), columns=["col1", "col2"])
print(time.time()-t0)

t0 = time.time()
processed_df = sample_df.groupby("col1")["col2"].value_counts().unstack()
print(time.time()-t0)

# EXAMPLE 2

import numpy as np
import pandas as pd
import time

t0 = time.time()
col1_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 20)) for _ in range(700000)]
col2_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 10)) for _ in range(860)]

col1_values = np.random.choice(col1_possible_values, size=13000000, replace=True)
col2_values = np.random.choice(col2_possible_values, size=13000000, replace=True)

sample_df = pd.DataFrame(zip(col1_values, col2_values), columns=["col1", "col2"])
sample_df['col2'] = sample_df['col2'].astype('category')
print(time.time()-t0)

t0 = time.time()
processed_df = sample_df.groupby("col1")["col2"].value_counts().unstack()
print(time.time()-t0)

Issue Description

I have a 2022 MacBook Pro (M1 Pro), pandas 1.4.1, numpy 1.22.2.

I noticed a significant performance drop when trying to perform

sample_df.groupby("col1")["col2"].value_counts().unstack()

when col2 (or both col1 and col2) is of type 'category' instead of the default type 'object'. The operation in EXAMPLE 1 runs in around 25 seconds on my computer (similar on a 2019 MacBook Pro with an Intel processor). In EXAMPLE 2 I let the operation run for more than 20 minutes and it still did not finish (on the 2019 MacBook Pro with an Intel processor, the running time is similar for EXAMPLE 1 and EXAMPLE 2).

Moreover, on the M1 the peak memory usage for EXAMPLE 1 is around 8-10 GB, while for EXAMPLE 2 it well exceeds 30 GB.
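For reference, a minimal sketch of one way to check peak memory from within the process (illustrative only, not necessarily how the numbers above were obtained; note that ru_maxrss is reported in bytes on macOS and in kilobytes on Linux):

import resource

# peak resident set size of the current process so far
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS: {peak / 1e9:.1f} GB")  # on macOS ru_maxrss is in bytes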

Expected Behavior

on M1:

EXAMPLE 1 and EXAMPLE 2 should perform roughly the same, as they do on the Intel processor.

Installed Versions

INSTALLED VERSIONS
------------------
commit           : 06d230151e6f18fdb8139d09abf539867a8cd481
python           : 3.9.10.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 21.3.0
Version          : Darwin Kernel Version 21.3.0: Wed Jan 5 21:37:58 PST 2022; root:xnu-8019.80.24~20/RELEASE_ARM64_T6000
machine          : arm64
processor        : arm
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : None.UTF-8
pandas           : 1.4.1
numpy            : 1.22.2
pytz             : 2021.3
dateutil         : 2.8.2
pip              : 22.0.3
setuptools       : 60.9.3
Cython           : 0.29.28
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : 2.9.3
jinja2           : 3.0.3
IPython          : 8.0.1
pandas_datareader: None
bs4              : None
bottleneck       : None
fastparquet      : None
fsspec           : None
gcsfs            : None
matplotlib       : 3.5.1
numba            : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 7.0.0
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : 1.8.0
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
zstandard        : None

Comment From: samukweku

@whypandasslow What happens if you set observed=True in the groupby? Since you are dealing with categoricals, this might offer a performance improvement.
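For example, a minimal sketch of that change, reusing sample_df from the reproducer above:

# same operation as in EXAMPLE 2, but with observed=True on the groupby;
# the intent is to restrict the result to observed category combinations
processed_df = sample_df.groupby("col1", observed=True)["col2"].value_counts().unstack()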

Comment From: whypandasslow

Hey @samukweku. I tried it and saw almost no improvement at all. Whether it is turned on or off, the difference in running time between EXAMPLE 1 and EXAMPLE 2 is a factor of hundreds.

Comment From: rhshadrach

When value_counts was added to DataFrameGroupBy, I noticed that it seemed to perform better than the SeriesGroupBy implementation. That seems to be the case here when combined with observed=True. Can you verify the performance of:

processed_df = sample_df.groupby("col1", observed=True)[["col2"]].value_counts().unstack()

Comment From: whypandasslow

Hey @rhshadrach! I have checked 4 cases on a reduced sample of 130000 records:

• example 1 using the series approach: 0.356 seconds
• example 1 using the dataframe approach: 0.3622 seconds
• example 2 using the series approach: 2509 seconds
• example 2 using the dataframe approach: 43 seconds

I see only a slight improvement for the dataframe approach when the column is of type object; however, I see a HUGE improvement when the column is of type category. Still, leaving the column as object yields a much better running time (by a factor of ~100x) than when it is converted to category.

So the performance problem for the operation on columns of type 'category' still persists, no matter which approach is used.

Comment From: rhshadrach

@whypandasslow - I'm not seeing the same. Are you by chance missing the observed=True setting? I am using the following code

Code
import numpy as np
import pandas as pd
import time

size = 130000
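
# EXAMPLE 1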

t0 = time.time()
col1_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 20)) for _ in range(700000)]
col2_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 10)) for _ in range(860)]

col1_values = np.random.choice(col1_possible_values, size=size, replace=True)
col2_values = np.random.choice(col2_possible_values, size=size, replace=True)

sample_df = pd.DataFrame(zip(col1_values, col2_values), columns=["col1", "col2"])
print(time.time()-t0)

t0 = time.time()
processed_df = sample_df.groupby("col1")["col2"].value_counts().unstack()
print(time.time()-t0)

# EXAMPLE 2

t0 = time.time()
col1_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 20)) for _ in range(700000)]
col2_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 10)) for _ in range(860)]

col1_values = np.random.choice(col1_possible_values, size=size, replace=True)
col2_values = np.random.choice(col2_possible_values, size=size, replace=True)

sample_df = pd.DataFrame(zip(col1_values, col2_values), columns=["col1", "col2"])
sample_df['col2'] = sample_df['col2'].astype('category')
print(time.time()-t0)

t0 = time.time()
processed_df = sample_df.groupby("col1", observed=True)[["col2"]].value_counts().unstack()
print(time.time()-t0)

And getting the output

16.620705127716064
0.5117876529693604
16.49775242805481
0.5147149562835693

showing no significant difference between SeriesGroupBy with object dtype and DataFrameGroupBy with categorical dtype.

Comment From: whypandasslow

@rhshadrach, @samukweku instead of copy-pasting @rhshadrach's line, I had just added another layer of square brackets around "col2" to make it a dataframe; that is how I got my 43 seconds. Using both the dataframe approach and the parameter observed=True, I got results similar to @rhshadrach's. Conclusion: when working with a column of type 'category', one must use both observed=True and the dataframe approach. The default groupby on a series yields horrible results. A groupby on a dataframe improves the performance, but only combined with observed=True does it yield the best performance. What is more, combining observed=True with a series groupby gives no improvement over a series groupby without observed=True. In the end, using both a groupby on a dataframe and observed=True on columns of type 'category', I got better performance than with a groupby on a series on columns of type 'object'. Thank you @rhshadrach and @samukweku! My problem is solved!
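For concreteness, a minimal sketch of the combination that worked for me (reusing sample_df with a categorical col2 from the reproducer above):

# dataframe-style selection ([["col2"]]) combined with observed=True;
# this is the workaround that produced the good timings above
processed_df = (
    sample_df.groupby("col1", observed=True)[["col2"]]
    .value_counts()
    .unstack()
)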

Comment From: rhshadrach

@whypandasslow - thanks for the response. Using DataFrameGroupBy is merely a workaround. The performance issues in SeriesGroupBy should be fixed.

Comment From: LucasG0

Hi, I am interested in working on this issue. May I take it?

Comment From: LucasG0

In the output of SeriesGroupBy.value_counts with categoricals, each group has as many rows as there are categories, which produces many extra rows. I don't think this is expected, and fixing it should probably fix the performance too.

>>> df = pd.DataFrame(zip(["A", "B", "C"], ["X", "Y", "Z"]), columns=["col1", "col2"])
>>> df
  col1 col2
0    A    X
1    B    Y
2    C    Z
>>> df.groupby("col1")["col2"].value_counts()
col1  col2
A     X       1
B     Y       1
C     Z       1
Name: col2, dtype: int64

>>> df['col2'] = df['col2'].astype('category')
>>> df.groupby("col1")["col2"].value_counts()
col1   
A     X    1
      Y    0
      Z    0
B     Y    1
      X    0
      Z    0
C     Z    1
      X    0
      Y    0
Name: col2, dtype: int64

Comment From: samukweku

@LucasG0 if you pass observed=True to the groupby, it should reduce the number of rows

Comment From: LucasG0

observed=True does not reduce the number of rows if used with a SeriesGroupBy, but it does for a DataFrameGroupBy (above workaround).

>>> df2.groupby("col1", observed=True)["col2"].value_counts()
col1   
A     X    1
      Y    0
      Z    0
B     Y    1
      X    0
      Z    0
C     Z    1
      X    0
      Y    0
Name: col2, dtype: int64

>>> df2.groupby("col1")[["col2"]].value_counts()
col1  col2
A     X       1
      Y       0
      Z       0
B     Y       1
      X       0
      Z       0
C     Z       1
      X       0
      Y       0
dtype: int64
>>> df2.groupby("col1", observed=True)[["col2"]].value_counts()
col1  col2
A     X       1
B     Y       1
C     Z       1
dtype: int64

Comment From: rhshadrach

@LucasG0 - thanks for finding that SeriesGroupBy.value_counts does not implement observed=True properly. It appears that is a separate bug. Can you see if there is already an issue for this, and create one if there is not.

Comment From: LucasG0

@rhshadrach It does not seem to have been reported yet. I created the issue and also detailed why I have doubts about whether the observed parameter should have an effect here.

Comment From: LucasG0

I observed the same performance issue as above on 130k records, and I think the issue here is the difference between SeriesGroupBy.value_counts and DataFrameGroupBy.value_counts on categoricals:

%%time
df2.groupby("col1")[["col2"]].value_counts()
CPU times: total: 38.7 s
Wall time: 39.8 s

%%time
df2.groupby("col1")["col2"].value_counts()
CPU times: total: 15min 27s
Wall time: 15min 28s

Also, as discussed in #46357, the DataFrameGroupBy.value_counts result should not be impacted by the observed parameter in this case. Once that fix is in effect, the above workaround combining the DataFrameGroupBy version with observed=True won't be relevant anymore, and I think the workaround (though less efficient) will be to cast back to object dtype for this operation (unless an observed parameter is introduced to the value_counts API).
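For illustration, a rough sketch of that fallback (assuming sample_df from the reproducer, with col2 already cast to category):

# cast the categorical column back to object just for this operation,
# then group and count as in EXAMPLE 1
processed_df = (
    sample_df.assign(col2=sample_df["col2"].astype(object))
    .groupby("col1")["col2"]
    .value_counts()
    .unstack()
)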

Comment From: rhshadrach

Also, as discussed in #46357, the DataFrameGroupBy.value_counts result should not be impacted by the observed parameter in this case. Once that fix is in effect, the above workaround combining the DataFrameGroupBy version with observed=True won't be relevant anymore, and I think the workaround (though less efficient) will be to cast back to object dtype for this operation (unless an observed parameter is introduced to the value_counts API).

While this is partially true, I still think there is a significant performance issue with SeriesGroupBy. I changed the line

https://github.com/pandas-dev/pandas/blob/48d515958d5805f0e62e34b7424097e5575089a8/pandas/core/groupby/generic.py#L1731

to be observed=False. I think this fixes the bug in #46357. Doing

df = pd.DataFrame(zip(["A", "B", "C"], ["X", "Y", "Z"]), columns=["col1", "col2"])
df['col2'] = df['col2'].astype('category')
print(df.groupby("col1")[["col2"]].value_counts())

gives

col1  col2
A     X       1
      Y       0
      Z       0
B     Y       1
      X       0
      Z       0
C     Z       1
      X       0
      Y       0
dtype: int64

Now using a modified example from https://github.com/pandas-dev/pandas/issues/46202#issuecomment-1059654115

Code
import numpy as np
import pandas as pd
import time

size = 13000
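
# EXAMPLE 1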

col1_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 20)) for _ in range(700000)]
col2_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 10)) for _ in range(860)]

col1_values = np.random.choice(col1_possible_values, size=size, replace=True)
col2_values = np.random.choice(col2_possible_values, size=size, replace=True)

sample_df = pd.DataFrame(zip(col1_values, col2_values), columns=["col1", "col2"])

t0 = time.time()
processed_df = sample_df.groupby("col1")["col2"].value_counts().unstack()
print(time.time()-t0)

# EXAMPLE 2

col1_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 20)) for _ in range(700000)]
col2_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 10)) for _ in range(860)]

col1_values = np.random.choice(col1_possible_values, size=size, replace=True)
col2_values = np.random.choice(col2_possible_values, size=size, replace=True)

sample_df = pd.DataFrame(zip(col1_values, col2_values), columns=["col1", "col2"])
sample_df['col2'] = sample_df['col2'].astype('category')

t0 = time.time()
processed_df = sample_df.groupby("col1", observed=True)["col2"].value_counts().unstack()
print(time.time()-t0)

t0 = time.time()
processed_df = sample_df.groupby("col1", observed=True)[["col2"]].value_counts().unstack()
print(time.time()-t0)

I get the timings

0.047942399978637695 # SeriesGroupBy, object
25.37419080734253    # SeriesGroupBy, category
3.2430622577667236   # DataFrameGroupBy, category

Comment From: LucasG0

While this is partially true, I still think there is a significant performance issue with SeriesGroupBy

Yes, what I meant is that when both this issue and https://github.com/pandas-dev/pandas/issues/46357 are fixed, it won't be possible to retrieve only the observed values in order to get the best performance with the categorical dtype; instead, a prior cast to object dtype will be needed.

Comment From: LucasG0

take