Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
import numpy as np
df = pd.DataFrame({
    "cat_1": pd.Categorical(list("AB"), categories=list("ABCDE"), ordered=True),
    "cat_2": pd.Categorical([1, 2], categories=[1, 2, 3], ordered=True),
    "value_1": np.random.uniform(size=2),
})
chunk1 = df[df.cat_1 == "A"]
chunk2 = df[df.cat_1 == "B"]
df1 = chunk1.groupby("cat_1", observed=False).min()
df2 = chunk2.groupby("cat_1", observed=False).min()
df3 = pd.concat([df1, df2], ignore_index=False)
res3 = df3.groupby(level=0, observed=False).min()
print(f"\n{res3}")
Issue Description
When performing a groupby/min with a categorical dtype and observed=False, the results differ between 1.5.3 (and 2.0) and 2.1.
Output with 1.5.3 or 2.0:
       cat_2   value_1
cat_1
A          1  0.384993
B          2  0.955231
C        NaN       NaN
D        NaN       NaN
E        NaN       NaN
Output with the latest main:
       cat_2   value_1
cat_1
A          1  0.297557
B          1  0.081856
C          1       NaN
D          1       NaN
E          1       NaN
The change can be traced to this PR:
- https://github.com/pandas-dev/pandas/pull/52120
Expected Behavior
I'm not sure if the changed behavior is intended. Please advise.
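For reference, here is a deterministic variant of the example above that checks for the 1.5.3/2.0 behavior, where the unobserved categories C, D and E come back as NaN in every column. This is only a sketch: the fixed values in value_1, the names partials, combined, result and regressed, and the check itself are my own additions, not part of any pandas test suite.

```python
import pandas as pd

# Same frame as in the reproducible example, but with fixed values
# instead of np.random.uniform so the check is deterministic.
df = pd.DataFrame({
    "cat_1": pd.Categorical(list("AB"), categories=list("ABCDE"), ordered=True),
    "cat_2": pd.Categorical([1, 2], categories=[1, 2, 3], ordered=True),
    "value_1": [0.25, 0.75],
})

# Aggregate each chunk separately, keeping unobserved categories.
partials = [
    df[df.cat_1 == key].groupby("cat_1", observed=False).min()
    for key in ["A", "B"]
]

# Combine the partial results and aggregate again on the index level.
combined = pd.concat(partials)
result = combined.groupby(level=0, observed=False).min()

# With the 1.5.3/2.0 behavior, rows C, D and E are entirely NaN.
unobserved = result.loc[["C", "D", "E"]]
regressed = not unobserved.isna().all().all()

print(result)
print("regression present:", regressed)
```

If regressed is True, the unobserved categories were filled with a real category value, as in the output from main shown above.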
Installed Versions
Comment From: DeaMariaLeon
Thank you for reporting @j-bennet.
I get this when I run your example on pandas version 2.1.0.dev0+171.gc293caf2e9:
       cat_2   value_1
cat_1
A          1  0.837825
B          2  0.687282
C        NaN       NaN
D        NaN       NaN
E        NaN       NaN
Comment From: MarcoGorelli
I can't reproduce this either; I'm also getting NaN for categories C, D and E with the latest release candidate:
       cat_2   value_1
cat_1
A          1  0.251036
B          2  0.204336
C        NaN       NaN
D        NaN       NaN
E        NaN       NaN
Comment From: j-bennet
It's reproducible with the latest main, 2.1.0.dev0+295.g25c1942c39, or with the latest nightly, 2.1.0.dev0+293.gd22d1f2db0.
Comment From: MarcoGorelli
ah yes, can reproduce with the latest nightly, thanks!
so, the regression is between 2.0 and 2.1 - updating the title then, thanks for the report!
cc @jbrockmendel
Comment From: j-bennet
git bisect says it was introduced here:
https://github.com/pandas-dev/pandas/commit/14affe0808078ee380a868df544701f372b6b02f
Comment From: jbrockmendel
Not intended.
In WrappedCythonOp._ea_wrap_cython_operation there is a case for isinstance(values, Categorical). The last few lines of that case are
    if self.how in self.cast_blocklist:
        return res_values
    return values._from_backing_data(res_values)
I think the fix here is:
  if self.how in self.cast_blocklist:
      return res_values
+ elif self.how in ["first", "last", "min", "max"]:
+     res_values[result_mask==1] = -1
  return values._from_backing_data(res_values)
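To illustrate why writing -1 into the masked positions would help: in a Categorical, the code -1 is the sentinel for a missing value. The sketch below uses only the public Categorical.from_codes constructor, not the internal groupby code path. The arrays res_codes and result_mask are hypothetical stand-ins for the internal buffers, and the placeholder 0 for empty groups is an assumption made for the illustration.

```python
import numpy as np
import pandas as pd

dtype = pd.CategoricalDtype([1, 2, 3], ordered=True)

# Hypothetical stand-ins for the internal arrays: one code per group,
# with empty groups assumed to hold a placeholder code of 0.
res_codes = np.array([0, 1, 0, 0, 0], dtype=np.int8)
# 1 marks groups that had no rows (the unobserved categories).
result_mask = np.array([0, 0, 1, 1, 1], dtype=np.uint8)

# Without masking, the placeholder 0 is read as the first category.
unmasked = pd.Categorical.from_codes(res_codes, dtype=dtype)

# With the masked positions set to -1, they are read as missing.
masked_codes = res_codes.copy()
masked_codes[result_mask == 1] = -1
masked = pd.Categorical.from_codes(masked_codes, dtype=dtype)

print(unmasked)
print(masked)
```

Here unmasked maps the empty groups to category 1, which resembles the wrong output reported above, while masked reports them as NaN.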
Comment From: DeaMariaLeon
@j-bennet do you want to do the PR to fix it? If you don't I'll do it. Could you let me know please?
Comment From: j-bennet
@DeaMariaLeon Go ahead, I'm not familiar enough with the codebase. Thank you.
Comment From: j-bennet
Thank you @DeaMariaLeon .