Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
import numpy as np
pd.show_versions()  # prints the version report itself and returns None
df_normal = pd.DataFrame({"x": ["a", "b", "b", np.nan]})
print(df_normal, end="\n\n")
print(df_normal.groupby("x", dropna=False).size(), end="\n\n")
df_categorical = pd.DataFrame({"x": ["a", "b", "b", np.nan]}).astype(
{"x": pd.CategoricalDtype()}
)
print(df_categorical, end="\n\n")
print(df_categorical.groupby("x", dropna=False).size(), end="\n\n")
Issue Description
Even with dropna=False, groupby drops the NA group when the column is categorical: NaN does not appear in the resulting Series.
With a non-categorical dtype, groupby correctly keeps the NA group, as df_normal shows.
Expected Behavior
- Pandas does not drop the NA in the categorical column in the groupby when dropna=False
- The normal and the categorical DataFrames behave the same
Installed Versions
Comment From: erg0dic
First, as a disclaimer: I'm not a regular here, but I looked into this and I think it is expected behavior. It could perhaps be highlighted better, for example with a warning.
From what I found, NaN cannot be a valid category type. Also, when you specify restricted categorical types like pd.CategoricalDtype(['a', 'b']), any value in the pd.Series that is not among the listed categories automatically becomes NaN and isn't considered by group operations. Have a look at this issue, and at the test files for categorical types. There are example test cases that you might find convincing.
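To illustrate the point about restricted categories, here is a minimal sketch (the variable names are made up for illustration). It casts a Series to a CategoricalDtype that lists only 'a' and 'b'; the value 'c' is not a category, so it ends up as a missing value with code -1.

```python
import pandas as pd

# Only 'a' and 'b' are declared as valid categories.
dtype = pd.CategoricalDtype(["a", "b"])

# 'c' is not among the categories, so the cast turns it into NaN.
s = pd.Series(["a", "b", "c"]).astype(dtype)

codes = s.cat.codes.tolist()         # integer codes; -1 marks a missing value
n_missing = int(s.isna().sum())      # number of values that became NaN
categories = list(s.cat.categories)  # NaN is never listed as a category

print(s)
print(codes, n_missing, categories)
```

NaN is represented by the sentinel code -1 rather than by a category of its own, which is why it needs special handling in group operations.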
Comment From: AkshayJain1995
Hi @goodyguts , I would like to pick this up. Can you assign it to me?
Comment From: radekSF
Is this a duplicate of #36327 ?
Comment From: adam-carruthers
Yes it is
Comment From: CBMollerup
#36327 has been closed by #49652.
Comment From: rhshadrach
Thanks @CBMollerup; the output of the last line in the OP is now:
x
a 1
b 2
NaN 1
dtype: int64
which is correct. As identified, this was closed by #49652.
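For anyone who wants to check their own installation, a small sketch along these lines may help. The helper nan_group_size is hypothetical and not part of pandas. It counts the rows that land in the NaN group for both the plain and the categorical frame. Per this thread, the two numbers should agree once the fix from #49652 is included; on affected versions the categorical count is expected to be 0.

```python
import numpy as np
import pandas as pd

values = ["a", "b", "b", np.nan]
plain = pd.DataFrame({"x": values})
categorical = plain.astype({"x": "category"})


def nan_group_size(df):
    """Return how many rows fall into the NaN group (hypothetical helper)."""
    # observed=False is passed explicitly so the result does not depend on
    # the version-specific default for categorical groupers.
    sizes = df.groupby("x", dropna=False, observed=False).size()
    # Select the entry (if any) whose group key is missing.
    return int(sizes[sizes.index.isna()].sum())


print(pd.__version__)
print("plain:", nan_group_size(plain))
print("categorical:", nan_group_size(categorical))
```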
Comment From: mdruiter
The bug is back in 1.5.1. I didn't find an open issue to fix this in a future release. What's the plan?
Comment From: radekSF
@mdruiter Perhaps @rhshadrach can confirm, but https://github.com/pandas-dev/pandas/pull/49652 is in milestone 2.0, so the bug will not be fixed in the 1.5 series. The fix has been implemented, however, and will be released at some point in the future.
Comment From: rhshadrach
Yes - that's correct. I believe we're aiming to have a release candidate in early February and then release 2.0 several weeks later.