Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
# EXAMPLE 1
import numpy as np
import pandas as pd
import time
t0 = time.time()
col1_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 20)) for _ in range(700000)]
col2_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 10)) for _ in range(860)]
col1_values = np.random.choice(col1_possible_values, size=13000000, replace=True)
col2_values = np.random.choice(col2_possible_values, size=13000000, replace=True)
sample_df = pd.DataFrame(zip(col1_values, col2_values), columns=["col1", "col2"])
print(time.time()-t0)
t0 = time.time()
processed_df = sample_df.groupby("col1")["col2"].value_counts().unstack()
print(time.time()-t0)
# EXAMPLE 2
import numpy as np
import pandas as pd
import time
t0 = time.time()
col1_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 20)) for _ in range(700000)]
col2_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 10)) for _ in range(860)]
col1_values = np.random.choice(col1_possible_values, size=13000000, replace=True)
col2_values = np.random.choice(col2_possible_values, size=13000000, replace=True)
sample_df = pd.DataFrame(zip(col1_values, col2_values), columns=["col1", "col2"])
sample_df['col2'] = sample_df['col2'].astype('category')
print(time.time()-t0)
t0 = time.time()
processed_df = sample_df.groupby("col1")["col2"].value_counts().unstack()
print(time.time()-t0)
Issue Description
I have a 2022 MacBook Pro (M1 Pro), pandas 1.4.1, numpy 1.22.2.
I noticed a significant performance drop when trying to perform
sample_df.groupby("col1")["col2"].value_counts().unstack()
when col2, or both col1 and col2, are of type 'category' instead of the default type 'object'. The operation in EXAMPLE 1 runs in around 25 seconds on my computer (similar on a 2019 MacBook Pro with an Intel processor). In EXAMPLE 2 I let the operation run for more than 20 minutes and it still did not finish (on the 2019 MacBook Pro with the Intel processor the running time is similar for EXAMPLE 1 and EXAMPLE 2).
Moreover, on the M1 the peak memory usage for EXAMPLE 1 is around 8-10 GB, while for EXAMPLE 2 it well exceeds 30 GB.
Expected Behavior
on M1:
EXAMPLE 1 and EXAMPLE 2 should perform roughly the same as they do on an Intel processor.
Installed Versions
Comment From: samukweku
@whypandasslow What happens if you set observed=True in the groupby? Since you are dealing with categoricals, this might offer a performance improvement.
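For reference, a minimal sketch of that suggestion, reusing sample_df from the reproducible example above (whether it actually helps is exactly what is being asked):
# sketch: pass observed=True to the groupby so only observed categories are counted per group
processed_df = sample_df.groupby("col1", observed=True)["col2"].value_counts().unstack()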
Comment From: whypandasslow
Hey @samukweku. I tried it and saw almost no improvement at all. Whether it is turned on or off, the difference in running time between EXAMPLE 1 and EXAMPLE 2 is a factor of hundreds.
Comment From: rhshadrach
When value_counts was added to DataFrameGroupBy, I noticed that it seemed to perform better than the SeriesGroupBy implementation. That seems to be the case here when combined with observed=True. Can you verify the performance of:
processed_df = sample_df.groupby("col1", observed=True)[["col2"]].value_counts().unstack()
Comment From: whypandasslow
Hey @rhshadrach! I have checked 4 cases on a reduced sample of 130000 records:
- example 1 using the series approach - 0.356 seconds
- example 1 using the dataframe approach - 0.3622 seconds
- example 2 using the series approach - 2509 seconds
- example 2 using the dataframe approach - 43 seconds
I see only a slight improvement for the dataframe approach when the column is of type object, but a HUGE improvement when the column is of type category. Still, leaving the column as object yields a much better running time (by a factor of ~100x) than when it is converted to category.
So the performance problem for this operation on columns of type 'category' still persists no matter which approach is used.
Comment From: rhshadrach
@whypandasslow - I'm not seeing the same. Are you by chance not setting observed=True? I am using the following code
Code
import numpy as np
import pandas as pd
import time
size = 130000
t0 = time.time()
col1_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 20)) for _ in range(700000)]
col2_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 10)) for _ in range(860)]
col1_values = np.random.choice(col1_possible_values, size=size, replace=True)
col2_values = np.random.choice(col2_possible_values, size=size, replace=True)
sample_df = pd.DataFrame(zip(col1_values, col2_values), columns=["col1", "col2"])
print(time.time()-t0)
t0 = time.time()
processed_df = sample_df.groupby("col1")["col2"].value_counts().unstack()
print(time.time()-t0)
# EXAMPLE 2
t0 = time.time()
col1_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 20)) for _ in range(700000)]
col2_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 10)) for _ in range(860)]
col1_values = np.random.choice(col1_possible_values, size=size, replace=True)
col2_values = np.random.choice(col2_possible_values, size=size, replace=True)
sample_df = pd.DataFrame(zip(col1_values, col2_values), columns=["col1", "col2"])
sample_df['col2'] = sample_df['col2'].astype('category')
print(time.time()-t0)
t0 = time.time()
processed_df = sample_df.groupby("col1", observed=True)[["col2"]].value_counts().unstack()
print(time.time()-t0)
And getting the output
16.620705127716064
0.5117876529693604
16.49775242805481
0.5147149562835693
showing no significant difference between SeriesGroupBy with object dtype and DataFrameGroupBy with categorical dtype.
Comment From: whypandasslow
@rhshadrach, @samukweku instead of copy-pasting @rhshadrach's line, I had just added another layer of square brackets around "col2" to make it a DataFrame; that is how I got my 43 seconds. Using both the DataFrame approach and the parameter observed=True, I got results similar to @rhshadrach's.
Conclusion: when working with a column of type 'category', one must use both observed=True and the DataFrame approach. The default groupby on a Series yields horrible results. A groupby on a DataFrame improves the performance, but only when combined with observed=True does it yield the best performance. What is more, combining observed=True with the Series groupby gives no improvement over the Series groupby without observed=True. In the end, using both the groupby on a DataFrame and observed=True on columns of type 'category', I got better performance than with the groupby on a Series on columns of type 'object'. Thank you @rhshadrach and @samukweku! My problem is solved!
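For concreteness, the combination described above (reusing sample_df from the reproducible example; the timing code is omitted):
# double brackets select a one-column DataFrame, so value_counts runs on a DataFrameGroupBy;
# observed=True drops the unobserved category combinations
processed_df = sample_df.groupby("col1", observed=True)[["col2"]].value_counts().unstack()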
Comment From: rhshadrach
@whypandasslow - thanks for the response. Using DataFrameGroupBy is merely a workaround. The performance issues in SeriesGroupBy should be fixed.
Comment From: LucasG0
Hi, I am interested in working on this issue. May I take it?
Comment From: LucasG0
In the output of SeriesGroupBy.value_counts with categoricals, each group has as many rows as the number of categories, thus producing many extra rows. I don't think this is expected, and fixing it should probably fix the performance too.
>>> df = pd.DataFrame(zip(["A", "B", "C"], ["X", "Y", "Z"]), columns=["col1", "col2"])
>>> df
col1 col2
0 A X
1 B Y
2 C Z
>>> df.groupby("col1")["col2"].value_counts()
col1 col2
A X 1
B Y 1
C Z 1
Name: col2, dtype: int64
>>> df['col2'] = df['col2'].astype('category')
>>> df.groupby("col1")["col2"].value_counts()
col1
A X 1
Y 0
Z 0
B Y 1
X 0
Z 0
C Z 1
X 0
Y 0
Name: col2, dtype: int64
Comment From: samukweku
@LucasG0 if you pass observed=True to the groupby, it should reduce the number of rows
Comment From: LucasG0
observed=True does not reduce the number of rows if used with a SeriesGroupBy, but it does for a DataFrameGroupBy (above workaround).
>>> df2.groupby("col1", observed=True)["col2"].value_counts()
col1
A X 1
Y 0
Z 0
B Y 1
X 0
Z 0
C Z 1
X 0
Y 0
Name: col2, dtype: int64
>>> df2.groupby("col1")[["col2"]].value_counts()
col1 col2
A X 1
Y 0
Z 0
B Y 1
X 0
Z 0
C Z 1
X 0
Y 0
dtype: int64
>>> df2.groupby("col1", observed=True)[["col2"]].value_counts()
col1 col2
A X 1
B Y 1
C Z 1
dtype: int64
Comment From: rhshadrach
@LucasG0 - thanks for finding that SeriesGroupBy.value_counts does not implement observed=True properly. It appears that is a separate bug. Can you see if there is already an issue for this, and create one if there is not.
Comment From: LucasG0
@rhshadrach It does not seem to have been reported yet. I created the issue and also detailed why I have doubts about whether the observed parameter should have an effect here.
Comment From: LucasG0
I observed the same performance issue as above on 130k records, and I think the issue here is the gap between SeriesGroupBy.value_counts and DataFrameGroupBy.value_counts on categoricals:
%%time
df2.groupby("col1")[["col2"]].value_counts()
CPU times: total: 38.7 s
Wall time: 39.8 s
%%time
df2.groupby("col1")["col2"].value_counts()
CPU times: total: 15min 27s
Wall time: 15min 28s
Also, as discussed in #46357, the DataFrameGroupBy.value_counts result should not be impacted by the observed parameter in this case. Once that fix is effective, the above workaround combining the DataFrameGroupBy version with observed=True won't be relevant anymore, and I think the workaround (though less efficient) will be to go back to object dtype for this operation (unless an observed parameter is introduced to the value_counts API).
Comment From: rhshadrach
Also, as discussed in #46357, the DataFrameGroupBy.value_counts result should not be impacted by the observed parameter in this case. Once that fix is effective, the above workaround combining the DataFrameGroupBy version with observed=True won't be relevant anymore, and I think the workaround (though less efficient) will be to go back to object dtype for this operation (unless an observed parameter is introduced to the value_counts API).
While this is partially true, I still think there is a significant performance issue with SeriesGroupBy. I changed the line
https://github.com/pandas-dev/pandas/blob/48d515958d5805f0e62e34b7424097e5575089a8/pandas/core/groupby/generic.py#L1731
to be observed=False. I think this fixes the bug in #46357. Doing
df = pd.DataFrame(zip(["A", "B", "C"], ["X", "Y", "Z"]), columns=["col1", "col2"])
df['col2'] = df['col2'].astype('category')
print(df.groupby("col1")[["col2"]].value_counts())
gives
col1 col2
A X 1
Y 0
Z 0
B Y 1
X 0
Z 0
C Z 1
X 0
Y 0
dtype: int64
Now using a modified example from https://github.com/pandas-dev/pandas/issues/46202#issuecomment-1059654115
Code
import numpy as np
import pandas as pd
import time
size = 13000
col1_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 20)) for _ in range(700000)]
col2_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 10)) for _ in range(860)]
col1_values = np.random.choice(col1_possible_values, size=size, replace=True)
col2_values = np.random.choice(col2_possible_values, size=size, replace=True)
sample_df = pd.DataFrame(zip(col1_values, col2_values), columns=["col1", "col2"])
t0 = time.time()
processed_df = sample_df.groupby("col1")["col2"].value_counts().unstack()
print(time.time()-t0)
# EXAMPLE 2
col1_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 20)) for _ in range(700000)]
col2_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 10)) for _ in range(860)]
col1_values = np.random.choice(col1_possible_values, size=size, replace=True)
col2_values = np.random.choice(col2_possible_values, size=size, replace=True)
sample_df = pd.DataFrame(zip(col1_values, col2_values), columns=["col1", "col2"])
sample_df['col2'] = sample_df['col2'].astype('category')
t0 = time.time()
processed_df = sample_df.groupby("col1", observed=True)["col2"].value_counts().unstack()
print(time.time()-t0)
t0 = time.time()
processed_df = sample_df.groupby("col1", observed=True)[["col2"]].value_counts().unstack()
print(time.time()-t0)
I get the timings
0.047942399978637695 # SeriesGroupBy, object
25.37419080734253 # SeriesGroupBy, category
3.2430622577667236 # DataFrameGroupBy, category
Comment From: LucasG0
While this is partially true, I still think there is a significant performance issue with SeriesGroupBy
Yes, what I meant is that when both this issue and https://github.com/pandas-dev/pandas/issues/46357 are fixed, it won't be possible to retrieve only the observed values in order to get the best performance with the categorical dtype, and a prior cast to object dtype will be needed.
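For illustration, a sketch of that fallback, assuming sample_df from the original example with col2 already cast to category:
# fallback sketch: cast the categorical column back to object before grouping,
# trading the categorical dtype for the faster object-dtype code path
sample_df["col2"] = sample_df["col2"].astype(object)
processed_df = sample_df.groupby("col1")["col2"].value_counts().unstack()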
Comment From: LucasG0
take