Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [X] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
test = pd.DataFrame({'id':list("abc")*10, "value": range(0,30)})
test['id'] = test.id.astype('category')
test.groupby('id').ngroups
# 3

a = test.iloc[:2].copy()
a.groupby('id').ngroups
# 3

Issue Description

Even though the original DataFrame (test) is sliced and copied into a new DataFrame (a), ngroups on the new DataFrame still seems to reflect the groups of the original. The issue goes away when the grouping column (id) is cast to string.
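A minimal sketch of the string workaround described above, assuming the same test frame as in the reproducible example. The variable names categories and ngroups_str are only illustrative. It also shows that the sliced copy's categorical dtype still lists all three categories, which appears to be where the extra group comes from.

```python
import pandas as pd

test = pd.DataFrame({"id": list("abc") * 10, "value": range(0, 30)})
test["id"] = test["id"].astype("category")
a = test.iloc[:2].copy()

# The categorical dtype keeps every category after slicing and copying,
# even though only 'a' and 'b' occur in the two remaining rows.
categories = list(a["id"].cat.categories)

# Casting the grouping column to str drops the categorical dtype,
# so only the values actually present form groups.
ngroups_str = a.assign(id=a["id"].astype(str)).groupby("id").ngroups
```

Here categories still contains 'a', 'b' and 'c', while ngroups_str counts only the two values present in a.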

Expected Behavior

groupby() must reference the grouping variable in the new DF, not the one in the original DF from which the new DF is copied.

Installed Versions

INSTALLED VERSIONS
------------------
commit : 2e218d10984e9919f0296931d92ea851c6a6faf5
python : 3.8.15.final.0
python-bits : 64
OS : Darwin
OS-release : 22.2.0
Version : Darwin Kernel Version 22.2.0: Fri Nov 11 02:03:51 PST 2022; root:xnu-8792.61.2~4/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.5.3
numpy : 1.22.4
pytz : 2022.7
dateutil : 2.8.2
setuptools : 59.8.0
pip : 22.3.1
Cython : 0.29.28
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : 0.4.1
xlsxwriter : None
lxml.etree : 4.9.2
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.7.0
pandas_datareader: 0.10.0
bs4 : 4.11.1
bottleneck : None
brotli :
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.5.1
numba : None
numexpr : None
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : 8.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.7.3
snappy : None
sqlalchemy : 1.4.46
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : 2021.5

Comment From: rhshadrach

Thanks for the report. When grouping by a categorical column, observed defaults to False, so you get a group for every category, even for categories that do not appear in the input.

import pandas as pd

test = pd.DataFrame({'id':list("abc")*10, "value": range(0,30)})
test['id'] = test.id.astype('category')
a = test.iloc[:2].copy()
# 'c' is still a category of a['id'], so it gets an (empty) group
print(a.groupby('id').sum())
#     value
# id
# a       0
# b       1
# c       0

To get only the groups that are present in the data, you can set observed=True:

print(a.groupby('id', observed=True).ngroups)
# 2
print(a.groupby('id', observed=True).sum())
#     value
# id
# a       0
# b       1
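As an alternative to observed=True, the unused categories can be dropped from the sliced copy before grouping. This is only a sketch, assuming the same test frame as above. The names remaining and ngroups_trimmed are illustrative, and observed=False is passed explicitly so the result does not depend on the default.

```python
import pandas as pd

test = pd.DataFrame({"id": list("abc") * 10, "value": range(0, 30)})
test["id"] = test["id"].astype("category")
a = test.iloc[:2].copy()

# Drop categories that no longer occur in the sliced copy ('c' here).
a["id"] = a["id"].cat.remove_unused_categories()
remaining = list(a["id"].cat.categories)

# Even with observed=False, only the remaining categories form groups.
ngroups_trimmed = a.groupby("id", observed=False).ngroups
```

This changes the dtype of a['id'] itself, so it fits best when the trimmed categories are wanted everywhere downstream. observed=True leaves the dtype untouched and only affects the one groupby call.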

Comment From: rhshadrach

Closing. Comment here if you think something is not correct and we can reopen.