Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [ ] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np

print(pd.show_versions())

df_normal = pd.DataFrame({"x": ["a", "b", "b", np.nan]})

print(df_normal, end="\n\n")

print(df_normal.groupby("x", dropna=False).size(), end="\n\n")

df_categorical = pd.DataFrame({"x": ["a", "b", "b", np.nan]}).astype(
    {"x": pd.CategoricalDtype()}
)

print(df_categorical, end="\n\n")

print(df_categorical.groupby("x", dropna=False).size(), end="\n\n")

Issue Description

Even though dropna=False is passed, groupby drops the NA group when the grouping column is categorical: the NaN row is missing from the resulting Series.

If the data type is not categorical, the groupby correctly keeps the NA in the column, as shown by df_normal.

Expected Behavior

  • Pandas does not drop the NA in the categorical column in the groupby when dropna=False
  • The normal and the categorical DataFrame behave the same
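
The expectation above can be written as a small check. This is only a sketch: `observed=True` is not part of the original report and is passed here as an assumption, to keep the categorical result independent of version-specific defaults, and the variable names `normal_sizes` and `categorical_sizes` are illustrative.

```python
import numpy as np
import pandas as pd

values = ["a", "b", "b", np.nan]
df_normal = pd.DataFrame({"x": values})
df_categorical = df_normal.astype({"x": "category"})

# Object-dtype key: the NaN group is kept when dropna=False
normal_sizes = df_normal.groupby("x", dropna=False).size()

# Categorical key: observed=True is an assumption added for this sketch
categorical_sizes = df_categorical.groupby(
    "x", dropna=False, observed=True
).size()

# Compare the group counts positionally, ignoring the index dtype
print("normal:     ", normal_sizes.tolist())
print("categorical:", categorical_sizes.tolist())
print("match:      ", normal_sizes.tolist() == categorical_sizes.tolist())
```

On an affected version the two lists differ because the categorical result has no NaN group; on a version with the fix they should match.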

Installed Versions

INSTALLED VERSIONS
------------------
commit           : ca60aab7340d9989d9428e11a51467658190bb6b
python           : 3.9.13.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
Version          : 10.0.19043
machine          : AMD64
processor        : Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : en_GB.UTF-8
LOCALE           : English_United Kingdom.1252

pandas           : 1.4.4
numpy            : 1.23.3
pytz             : 2022.2.1
dateutil         : 2.8.2
setuptools       : 58.1.0
pip              : 22.0.4
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
brotli           : None
fastparquet      : None
fsspec           : None
gcsfs            : None
markupsafe       : None
matplotlib       : None
numba            : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : None
snappy           : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
zstandard        : None
None

Comment From: erg0dic

First, as a disclaimer: I'm not a regular here, but I looked into this and I think it is expected behavior, although it could be highlighted better, perhaps with a warning.

From what I found, NaN cannot be a valid category. Also, when you specify a restricted categorical type like pd.CategoricalDtype(['a', 'b']), any value in the pd.Series that is not one of those categories automatically becomes NaN and isn't considered by group operations. Have a look at this issue and at the test files for categorical types; they contain example test cases that you might find convincing.
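
A minimal sketch of the two points above, using the same `pd.CategoricalDtype(['a', 'b'])` from the comment. The Series contents and variable names are made up for illustration. As far as I know, passing NaN as an explicit category raises a `ValueError`, which the last part of the snippet probes rather than assumes.

```python
import numpy as np
import pandas as pd

restricted = pd.CategoricalDtype(["a", "b"])
s = pd.Series(["a", "b", "c"], dtype=restricted)

# "c" is not one of the declared categories, so it is stored as missing
print(s.isna().tolist())       # [False, False, True]
print(list(s.cat.categories))  # ['a', 'b']
print(s.cat.codes.tolist())    # [0, 1, -1]  (-1 marks a missing value)

# Probe whether NaN can be declared as a category
try:
    pd.CategoricalDtype(["a", np.nan])
    nan_category_allowed = True
except ValueError:
    nan_category_allowed = False
print("NaN allowed as a category:", nan_category_allowed)
```

Missing values in a categorical are represented by the code -1 rather than by an entry in `categories`, which is why they need special handling in group operations.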

Comment From: AkshayJain1995

Hi @goodyguts, I would like to pick this up. Can you assign it to me?

Comment From: radekSF

Is this a duplicate of #36327?

Comment From: adam-carruthers

Yes it is

Comment From: CBMollerup

#36327 has been closed by #49652.

Comment From: rhshadrach

Thanks @CBMollerup; the output of the last line in the OP is now:

x
a      1
b      2
NaN    1
dtype: int64

which is correct. As identified, this was closed by #49652.

Comment From: mdruiter

The bug is back in 1.5.1. I didn't find an open issue to fix this in a future release. What's the plan?

Comment From: radekSF

@mdruiter Perhaps @rhshadrach can confirm whether that's the case, but https://github.com/pandas-dev/pandas/pull/49652 is in the 2.0 milestone, so the bug will not be fixed in the 1.5 series. The fix has been implemented, however, and will be released in a future version.

Comment From: rhshadrach

Yes - that's correct. I believe we're aiming to have a release candidate in early February and then release 2.0 several weeks later.
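
For anyone who has to stay on a 1.x release in the meantime, one possible workaround is to group by an object-dtype copy of the categorical column. It was not proposed in this thread and is offered only as a sketch. The `key` variable is hypothetical, and the conversion discards the categorical dtype, including any unobserved categories, for the purpose of the grouping.

```python
import numpy as np
import pandas as pd

df_categorical = pd.DataFrame({"x": ["a", "b", "b", np.nan]}).astype(
    {"x": "category"}
)

# Hypothetical workaround: use an object-dtype copy of the key so that
# dropna=False keeps the NaN group, as it does for df_normal in the OP
key = df_categorical["x"].astype(object)
sizes = df_categorical.groupby(key, dropna=False).size()

print(sizes.tolist())  # [1, 2, 1]
```

The original DataFrame is left unchanged; only the grouping key is converted.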