Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
import numpy as np
pd.DataFrame({'data':[np.nan,1,1,3,np.nan]}).groupby('data').cumcount(ascending=True)
Issue Description
I'm not sure if this is known or expected behavior, but pandas.DataFrame.groupby.cumcount() returns floats and NaN values when the groupby column contains np.nan. This is a change from past versions (compared with v1.1.0), where the count for NaN keys was returned as an int. In both the older and the current version, ints are returned if no NaNs are present, but the dtype returned in v2.0.0 depends on whether or not there is a NaN.
Expected Behavior
I think the expected behavior would be as in past versions, where all values returned are ints, for example:
>>> pd.__version__
'1.1.0'
>>> pd.DataFrame({'data':[np.nan,1,1,3,np.nan]}).groupby('data').cumcount(ascending=True)
0 0
1 0
2 1
3 0
4 1
dtype: int64
Installed Versions
Comment From: rhshadrach
Thanks for the report. This change was highlighted here: https://pandas.pydata.org/docs/whatsnew/v1.5.0.html#using-dropna-true-with-groupby-transforms
groupby has a dropna argument that defaults to True. As your example highlights, this argument was not being respected, and now it is. You can get the desired output by setting dropna=False:
pd.DataFrame({'data':[np.nan,1,1,3,np.nan]}).groupby('data', dropna=False).cumcount(ascending=True)
# 0 0
# 1 0
# 2 1
# 3 0
# 4 1
# dtype: int64
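For comparison, here is a minimal sketch that runs both settings side by side on the frame from the report. The variable names `dropped` and `kept` are only illustrative, and the dtype you see for the default case may depend on your pandas version (the report above describes floats with NaN on 2.0.0).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"data": [np.nan, 1, 1, 3, np.nan]})

# Default (dropna=True): rows whose key is NaN are not assigned to any group.
dropped = df.groupby("data").cumcount(ascending=True)

# dropna=False: NaN is kept as its own group, so every row gets an integer count.
kept = df.groupby("data", dropna=False).cumcount(ascending=True)

print("dropna=True :", dropped.dtype)
print("dropna=False:", kept.dtype)
print(kept.tolist())
```

Checking the dtype of each result is a quick way to see which behavior a given pandas version applies.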
Does this resolve your issue?
Comment From: mluck
Yes, that's exactly what I needed.
Thanks so much!