Is your feature request related to a problem?
In pandas version 1.2.5, using groupby.max() on a large matrix of int8 0/1 values, pandas casts the dataframe to int64, resulting in:
MemoryError: Unable to allocate 76.4 GiB for an array with shape (1915674, 5356) and data type int64
Traceback:
/python3.9/site-packages/pandas/core/dtypes/common.py in ensure_int_or_float(arr, copy)
143 try:
144 # error: Unexpected keyword argument "casting" for "astype"
--> 145 return arr.astype("int64", copy=copy, casting="safe") # type: ignore[call-arg]
146 except TypeError:
147 pass
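The traceback points at the "safe" widening cast inside ensure_int_or_float. A minimal sketch of the memory cost of that cast (array sizes here are illustrative, shrunk from the (1915674, 5356) matrix in the report):

```python
import numpy as np

# A small int8 matrix of 0/1 values, standing in for the large one
# that triggered the MemoryError.
arr = np.random.randint(0, 2, size=(1000, 50), dtype=np.int8)

# The same "safe" cast performed in ensure_int_or_float widens
# int8 -> int64, multiplying the memory footprint by 8.
upcast = arr.astype("int64", casting="safe")

print(arr.nbytes)     # 50000 bytes
print(upcast.nbytes)  # 400000 bytes
```

At the reported shape, that factor of 8 is the difference between roughly 9.6 GiB of int8 data and the 76.4 GiB allocation that fails.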
Describe the solution you'd like
Keep the original datatype, in this case int8.
Comment From: mzeitlin11
Thanks for reporting this @rd-andreas-lay! This happens because our groupby algorithms only support specific types, so we need to cast to one which is supported. It wouldn't be hard to support more types for group_min and group_max, but it would increase distribution size (since we effectively need one function per supported type).
Comment From: arubiales
Hi @mzeitlin11! I want to contribute to this issue in pandas. Do you want to add support for int8? Can I work on it?
Comment From: mzeitlin11
@arubiales that would be great!
Comment From: arubiales
Thanks @mzeitlin11 I will go for it!
Any useful information, such as which pandas modules or files are involved, or other things to consider, would be appreciated.
Comment From: mzeitlin11
This is a pretty complicated issue, so there are a lot of things to consider :), but please reach out if you'd like any help:
- The cython algorithm is here: https://github.com/pandas-dev/pandas/blob/b0082d2503a9c5b4732acfe09716f49670e0fe8f/pandas/_libs/groupby.pyx#L1173. To avoid needing to upcast, the fused type should be updated to be numeric
- A lot of the preprocessing (and where the upcast happens) is here: https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby/ops.py. I would recommend using a debugger to step through an example to figure out where/why the upcast occurs and how you can avoid it.
- Since the purpose of this issue is to reduce memory usage, we'll want to verify any patch with a memory benchmark, see something like https://github.com/pandas-dev/pandas/blob/12513c4cdb14fe70ec7226e12bdea70faccdc2cc/asv_bench/benchmarks/rolling.py#L192 for an example
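For the last point, a memory benchmark in the pandas ASV suite could look roughly like the sketch below. This is modeled on the linked rolling.py example; the class and method names here are illustrative, not the names used in the actual benchmark suite, and the data sizes are kept small:

```python
import numpy as np
import pandas as pd


# Hypothetical ASV-style benchmark: asv measures the peak resident
# memory of any method whose name starts with "peakmem_".
class GroupbyMaxMemory:
    def setup(self):
        n = 10_000
        # int8 0/1 matrix plus a grouping key, mirroring the report
        self.df = pd.DataFrame(
            np.random.randint(0, 2, size=(n, 10), dtype=np.int8)
        )
        self.df["key"] = np.random.randint(0, 100, size=n)

    def peakmem_groupby_max_int8(self):
        # With the upcast removed, peak memory here should stay close
        # to the int8 input size instead of 8x larger.
        self.df.groupby("key").max()
```

A patch would be verified by running this benchmark before and after and comparing the reported peak memory.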
Comment From: arubiales
Yes, I know it will take time, but I have strong knowledge of C and Cython, so I think that with time I can do it.
Thank you for the info; I'm going to review it and get an overall idea of how everything is connected.
Comment From: arubiales
@mzeitlin11 @rd-andreas-lay. Sorry, but I'm trying to reproduce the data type change with a minimal reproducible example and I can't, so I'm missing something here. I'm trying the following:
import numpy as np
import pandas as pd
# Create a dummy DF
df_prueba = pd.DataFrame(np.random.randint(0, 2, (100, 3), dtype=np.int8))
df_prueba["name"] = ["lion", "bird", "dog", "cat", "python"]*20
# keep the int8 type
df_group = df_prueba.groupby("name").max()
print(df_group.dtypes)
Output:
0 int8
1 int8
2 int8
dtype: object
Comment From: rd-andreas-lay
@arubiales In my understanding the final result is recast to the original data type later on; the conversion to int64 is just an intermediate step (still potentially causing memory allocation errors, in my example an increase from 10 GB to 70 GB).
I'd have to run an example through the debugger though to see where the recasting to int8 happens.
If you check your memory consumption while running the example on a larger dataframe, you should see an increase in memory during processing; the final result will again be smaller due to the recasting to int8. Basically an inverted V shape in memory usage.
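One rough way to observe that inverted-V pattern without an external profiler is tracemalloc, which tracks NumPy array allocations as well. A sketch, with sizes shrunk so it runs quickly (on pandas versions affected by this issue the peak reflects the int64 intermediate; on fixed versions it stays near the int8 size):

```python
import tracemalloc

import numpy as np
import pandas as pd

n = 100_000
df = pd.DataFrame(np.random.randint(0, 2, size=(n, 20), dtype=np.int8))
df["name"] = np.random.choice(["lion", "bird", "dog", "cat"], size=n)

tracemalloc.start()
result = df.groupby("name").max()
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# The final result is recast back to the original dtype either way.
print(result.dtypes.unique())
print(f"peak traced memory: {peak / 1e6:.1f} MB")
```

Comparing `peak` against `df.memory_usage(deep=True).sum()` shows how much larger the intermediate allocation is than either the input or the final result.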
Comment From: lithomas1
closed by #46745