Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [X] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np

pd.DataFrame({'data':[np.nan,1,1,3,np.nan]}).groupby('data').cumcount(ascending=True)

Issue Description

I'm not sure if this is known or expected behavior, but pandas.DataFrame.groupby.cumcount() returns floats and NaN values when the groupby column contains np.nan. This differs from past versions (I compared with v1.1.0), where rows with a NaN key were counted and the result was returned as int. In both the older and the current version, ints are returned if no NaNs are present, but in v2.0.0 the returned dtype depends on whether or not the groupby column contains a NaN.

Expected Behavior

I think the expected behavior would be as in past versions, where all values returned are ints, for example:

>>> pd.__version__
'1.1.0'
>>> pd.DataFrame({'data':[np.nan,1,1,3,np.nan]}).groupby('data').cumcount(ascending=True)
0    0
1    0
2    1
3    0
4    1
dtype: int64

Installed Versions

/usr/local/lib/python3.8/dist-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")

INSTALLED VERSIONS
------------------
commit : 478d340667831908b5b4bf09a2787a11a14560c9
python : 3.8.10.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-1034-aws
Version : #38~20.04.1-Ubuntu SMP Wed Mar 29 19:48:16 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : C.UTF-8
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.0.0
numpy : 1.23.5
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.6.1
pip : 20.0.2
Cython : 0.29.34
pytest : 7.3.0
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.6
jinja2 : 3.1.2
IPython : 8.12.0
pandas_datareader: None
bs4 : 4.12.2
bottleneck : None
brotli : 1.0.9
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.7.1
numba : 0.56.4
numexpr : 2.8.4
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.10.1
snappy : None
sqlalchemy : None
tables : 3.8.0
tabulate : None
xarray : 2023.1.0
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

Comment From: rhshadrach

Thanks for the report. This change was highlighted here: https://pandas.pydata.org/docs/whatsnew/v1.5.0.html#using-dropna-true-with-groupby-transforms

groupby has a dropna argument that defaults to True. As your example highlights, transforms such as cumcount did not respect this argument before v1.5.0, and now they do. You can get the desired output by setting dropna=False:

pd.DataFrame({'data':[np.nan,1,1,3,np.nan]}).groupby('data', dropna=False).cumcount(ascending=True)
# 0    0
# 1    0
# 2    1
# 3    0
# 4    1
# dtype: int64
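
For a side-by-side view, here is a minimal sketch that runs the same frame through both settings. The variable names (df, dropped, kept) are only illustrative. What dropna=True returns for the NaN-keyed rows depends on the pandas version, so the sketch prints the dtypes rather than assuming them:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"data": [np.nan, 1, 1, 3, np.nan]})

# Default dropna=True: NaN is not treated as a group key. On pandas 2.0.0
# (as reported in this issue) the NaN-keyed rows come back as NaN, which
# forces a float result.
dropped = df.groupby("data").cumcount()

# dropna=False: NaN forms its own group, so every row receives a count.
kept = df.groupby("data", dropna=False).cumcount()

print(dropped.dtype, kept.dtype)
print(kept.tolist())
```

With dropna=False every row belongs to a group, so no missing values are needed in the result and it can remain an integer Series.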

Does this resolve your issue?

Comment From: mluck

Yes, that's exactly what I needed.

Thanks so much!