Is your feature request related to a problem?
In pandas version 1.2.5, using groupby.max() on a large matrix of int8 0/1 values, pandas casts the dataframe to int64, resulting in:
MemoryError: Unable to allocate 76.4 GiB for an array with shape (1915674, 5356) and data type int64
Traceback:
/python3.9/site-packages/pandas/core/dtypes/common.py in ensure_int_or_float(arr, copy)
143 try:
144 # error: Unexpected keyword argument "casting" for "astype"
--> 145 return arr.astype("int64", copy=copy, casting="safe") # type: ignore[call-arg]
146 except TypeError:
147 pass
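The traceback points at the "safe" widening cast inside ensure_int_or_float. A minimal sketch of the memory cost of that cast (array sizes here are illustrative, shrunk from the (1915674, 5356) matrix in the report):

```python
import numpy as np

# A small int8 matrix of 0/1 values, standing in for the large one
# that triggered the MemoryError.
arr = np.random.randint(0, 2, size=(1000, 50), dtype=np.int8)

# The same "safe" cast performed in ensure_int_or_float widens
# int8 -> int64, multiplying the memory footprint by 8.
upcast = arr.astype("int64", casting="safe")

print(arr.nbytes)     # 50000 bytes
print(upcast.nbytes)  # 400000 bytes
```

At the reported shape, that factor of 8 is the difference between roughly 9.6 GiB of int8 data and the 76.4 GiB allocation that fails.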
Describe the solution you'd like
Keep the original datatype, in this case int8.
Comment From: mzeitlin11
Thanks for reporting this @rd-andreas-lay! This happens because our groupby algorithms only support specific types, so we need to cast to one which is supported. It wouldn't be hard to support more types for group_min and group_max, but it would increase distribution size (since we effectively need one function per supported type).
Comment From: arubiales
Hi @mzeitlin11! I want to contribute to this issue in pandas. Do you want to add support for int8? Can I work on it?
Comment From: mzeitlin11
@arubiales that would be great!
Comment From: arubiales
Thanks @mzeitlin11 I will go for it!
Any useful information, such as which pandas modules or files are involved, or other things to consider, would be appreciated.
Comment From: mzeitlin11
This is a pretty complicated issue, so there are a lot of things to consider :), but please reach out if you'd like any help:
- The cython algorithm is here: https://github.com/pandas-dev/pandas/blob/b0082d2503a9c5b4732acfe09716f49670e0fe8f/pandas/_libs/groupby.pyx#L1173. To avoid needing to upcast, the fused type should be updated to be numeric
- A lot of the preprocessing (and where the upcast happens) is here: https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby/ops.py. I would recommend using a debugger to step through an example to figure out where/why the upcast occurs and how you can avoid it.
- Since the purpose of this issue is to reduce memory usage, we'll want to verify any patch with a memory benchmark, see something like https://github.com/pandas-dev/pandas/blob/12513c4cdb14fe70ec7226e12bdea70faccdc2cc/asv_bench/benchmarks/rolling.py#L192 for an example
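For the last point, a memory benchmark in the pandas ASV suite could look roughly like the sketch below. This is modeled on the linked rolling.py example; the class and method names here are illustrative, not the names used in the actual benchmark suite, and the data sizes are kept small:

```python
import numpy as np
import pandas as pd


# Hypothetical ASV-style benchmark: asv measures the peak resident
# memory of any method whose name starts with "peakmem_".
class GroupbyMaxMemory:
    def setup(self):
        n = 10_000
        # int8 0/1 matrix plus a grouping key, mirroring the report
        self.df = pd.DataFrame(
            np.random.randint(0, 2, size=(n, 10), dtype=np.int8)
        )
        self.df["key"] = np.random.randint(0, 100, size=n)

    def peakmem_groupby_max_int8(self):
        # With the upcast removed, peak memory here should stay close
        # to the int8 input size instead of 8x larger.
        self.df.groupby("key").max()
```

A patch would be verified by running this benchmark before and after and comparing the reported peak memory.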
Comment From: arubiales
Yes, I know it will take time, but I have strong knowledge of C and Cython, so I think that with time I can do it.
Thank you for the info; I'm going to review it and get an overall idea of how everything is connected.
Comment From: arubiales
@mzeitlin11 @rd-andreas-lay. Sorry, but I'm trying to reproduce the data type change with a minimal reproducible example and I can't, so I'm missing something here. I'm trying the following:
import numpy as np
import pandas as pd
# Create a dummy DF
df_prueba = pd.DataFrame(np.random.randint(0, 2, (100, 3), dtype=np.int8))
df_prueba["name"] = ["lion", "bird", "dog", "cat", "python"]*20
# keep the int8 type
df_group = df_prueba.groupby("name").max()
print(df_group.dtypes)
Output:
0 int8
1 int8
2 int8
dtype: object
Comment From: rd-andreas-lay
@arubiales In my understanding the final result is recast to the original data type later on; the conversion to int64 is just an intermediate step (still potentially causing memory allocation errors, in my example an increase from 10 GB to 70 GB).
I'd have to run an example through the debugger though to see where the recasting to int8 happens.
If you check your memory consumption while running the example on a larger dataframe, you should see an increase in memory during processing; the final result will again be smaller due to the recasting to int8. Basically an inverted V shape in memory usage.
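One rough way to observe that inverted-V pattern without an external profiler is tracemalloc, which tracks NumPy array allocations as well. A sketch, with sizes shrunk so it runs quickly (on pandas versions affected by this issue the peak reflects the int64 intermediate; on fixed versions it stays near the int8 size):

```python
import tracemalloc

import numpy as np
import pandas as pd

n = 100_000
df = pd.DataFrame(np.random.randint(0, 2, size=(n, 20), dtype=np.int8))
df["name"] = np.random.choice(["lion", "bird", "dog", "cat"], size=n)

tracemalloc.start()
result = df.groupby("name").max()
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# The final result is recast back to the original dtype either way.
print(result.dtypes.unique())
print(f"peak traced memory: {peak / 1e6:.1f} MB")
```

Comparing `peak` against `df.memory_usage(deep=True).sum()` shows how much larger the intermediate allocation is than either the input or the final result.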
Comment From: lithomas1
closed by #46745