Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [x] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

n = 32

df = pd.DataFrame({'label': [1]*n,'bool_column': [True]*(n)})

print(df.assign(bool_column_dupe=(df.bool_column==True)).groupby("label").bool_column_dupe.sum())

Issue Description

Summing a boolean column after a groupby gives an unreasonably large and incorrect sum. This is the result from the above code:

1 8128
Name: bool_column_dupe, dtype: int64

The issue does not occur if I instead add an astype(int): print(df.assign(bool_column_dupe=(df.bool_column==True).astype(int)).groupby("label").bool_column_dupe.sum())

The issue does not occur if I don't use the .assign: print(df.groupby("label").bool_column.sum())

The issue does not occur with n<32.

The issue only occurs with numpy v1.24.0. It does not occur with numpy v1.23.5.
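
The astype(int) variant described above can be written out as a standalone script. This is only a sketch of the workaround reported in this issue, not a fix for the underlying cause; the variable names `dupe` and `result` are introduced here for illustration.

```python
import pandas as pd

n = 32
df = pd.DataFrame({"label": [1] * n, "bool_column": [True] * n})

# Cast the comparison result to int before it reaches groupby().sum(),
# as in the astype(int) variant from the issue description.
dupe = (df.bool_column == True).astype(int)

result = df.assign(bool_column_dupe=dupe).groupby("label").bool_column_dupe.sum()
print(result)
```

According to the report, this form gives the expected sum even with numpy 1.24.0.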

Expected Behavior

Expected to see: 1 32 Name: bool_column_dupe, dtype: int64

Installed Versions

INSTALLED VERSIONS
------------------
commit : 8dab54d6573f7186ff0c3b6364d5e4dd635ff3e7
python : 3.9.13.final.0
python-bits : 64
OS : Darwin
OS-release : 22.1.0
Version : Darwin Kernel Version 22.1.0: Sun Oct 9 20:14:54 PDT 2022; root:xnu-8792.41.9~2/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : en_GB.UTF-8
pandas : 1.5.2
numpy : 1.24.0
pytz : 2022.7
dateutil : 2.8.2
setuptools : 60.2.0
pip : 21.3.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None

Comment From: asishm

This might be a numpy issue:

In [77]: a = np.array([True] * 32)

In [78]: b = (a == True)

In [79]: b.view('uint8')
Out[79]:
array([254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254,
       254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254, 254,
       254, 254, 254, 254, 254, 254], dtype=uint8)

In [80]: np.__version__
Out[80]: '1.24.0'
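
The byte values above line up with the sum in the report: 32 bytes of 254 add up to 8128. The sketch below only checks that arithmetic. The array `raw` is a hypothetical stand-in built directly as uint8, so it does not depend on the numpy version, and it assumes (without confirming it against the pandas source) that the grouped sum ends up adding the raw bytes of the boolean buffer.

```python
import numpy as np

# Hypothetical stand-in for the buffer shown above: 32 bytes, each equal to 254.
raw = np.full(32, 254, dtype=np.uint8)

# Adding the raw bytes as integers reproduces the number from the report.
byte_sum = int(raw.astype(np.int64).sum())
print(byte_sum)  # 8128

# A boolean array whose True values are stored as the byte 1 sums to 32.
canonical = np.ones(32, dtype=bool).view(np.uint8)
canonical_sum = int(canonical.astype(np.int64).sum())
print(canonical_sum)  # 32
```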

Comment From: phofl

This works on main for me, so might need tests.

We made quite a lot of changes to the groupby algorithms to preserve dtypes better, so it could be related to that.

Comment From: asishm

I don't think this works on main

In [3]: pd.__version__, np.__version__
Out[3]: ('2.0.0.dev0+971.gca3e0c875f', '1.24.0')

In [4]: n = 32
   ...: df = pd.DataFrame({'label': 1, 'a': [True] * n})
   ...: df['b'] = df['a'] == True
   ...: df.groupby('label').sum()
Out[4]:
        a     b
label
1      32  8128

Comment From: phofl

Hm, good point, I did not use numpy 1.24.0. I agree that this looks like a numpy bug.

Comment From: agaventa

This is now fixed in numpy 1.24.1
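
For anyone who cannot upgrade right away, a guard along these lines could flag the affected release at import time. It is a minimal sketch: `numpy_is_affected` is a hypothetical helper, and it matches only the exact version string 1.24.0 named in this thread, not pre-releases or other builds.

```python
import numpy as np


def numpy_is_affected(version: str) -> bool:
    """Hypothetical helper: True only for the release reported in this thread."""
    return version == "1.24.0"


if numpy_is_affected(np.__version__):
    print("numpy 1.24.0 detected: grouped sums of boolean comparison results "
          "may be wrong; upgrading to 1.24.1 is reported to fix this.")
```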

Comment From: Daquisu

take