Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
test = pd.DataFrame({'id':list("abc")*10, "value": range(0,30)})
test['id'] = test.id.astype('category')
test.groupby('id').ngroups
# 3
a = test.iloc[:2].copy()
a.groupby('id').ngroups
# 3
Issue Description
Even though the original DF (test) is copied and truncated into a new DF (a), the new DF's ngroups still seems to reference the original DF's groups. The issue goes away when the grouping variable (id) is set to string.
Expected Behavior
groupby() must reference the grouping variable in the new DF, not the one in the original DF from which the new DF is copied.
Installed Versions
Comment From: rhshadrach
Thanks for the report. With a categorical grouper, observed defaults to False, so you get a group for every category, even for categories that do not appear in the input.
test = pd.DataFrame({'id':list("abc")*10, "value": range(0,30)})
test['id'] = test.id.astype('category')
a = test.iloc[:2].copy()
print(a.groupby('id').sum())
#     value
# id
# a       0
# b       1
# c       0
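The empty group c shows up because the categorical dtype of the slice still declares all three categories, even though only two values are present. A minimal check of this, reusing the example above (the names present and declared are only illustrative):

```python
import pandas as pd

test = pd.DataFrame({"id": list("abc") * 10, "value": range(0, 30)})
test["id"] = test["id"].astype("category")
a = test.iloc[:2].copy()

# Values actually present in the two-row slice
present = a["id"].unique().tolist()
# Categories declared by the dtype, which slicing and copying preserve
declared = a["id"].cat.categories.tolist()

print(present)   # ['a', 'b']
print(declared)  # ['a', 'b', 'c']
```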
You can set observed=True:
print(a.groupby('id', observed=True).ngroups)
# 2
print(a.groupby('id', observed=True).sum())
#     value
# id
# a       0
# b       1
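If the goal is for the sliced frame itself to forget the unused categories, one alternative that may fit is to prune them with Series.cat.remove_unused_categories() before grouping. This is only a sketch on the same example data; observed=False is passed explicitly here so the result does not depend on the default, and n_groups is just an illustrative name:

```python
import pandas as pd

test = pd.DataFrame({"id": list("abc") * 10, "value": range(0, 30)})
test["id"] = test["id"].astype("category")
a = test.iloc[:2].copy()

# Drop categories that have no rows in the slice
a["id"] = a["id"].cat.remove_unused_categories()

# Even with unobserved categories included, only a and b remain
n_groups = a.groupby("id", observed=False).ngroups
print(n_groups)  # 2
```

Compared with observed=True, this changes the column's dtype rather than the behavior of a single groupby call, so it also affects later operations on a.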
Comment From: rhshadrach
Closing. Comment here if you think something is not correct and we can reopen.