Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this issue exists on the latest version of pandas.
- [X] I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
The following code reproduces my issue:
import numpy as np
import pandas as pd
# generate a bunch of sample data for later use
n = 5000000
s_samples = [f"s_{i}" for i in range(1, 101)]
i_samples = [f"i_{i}" for i in range(1, 201)]
bool_samples = [True, False]
ssamples = np.random.choice(s_samples, n)
isamples = np.random.choice(i_samples, n)
d_values = np.random.randn(3, n)
b_values = np.random.choice(bool_samples, n)
df = pd.DataFrame(
    dict(s=ssamples, i=isamples, v1=d_values[0], v2=d_values[1], v3=d_values[2], f1=b_values, f2=b_values)
)
df.to_csv("sample.csv", index=None)
# read in data with different engine
df_new = pd.read_csv("sample.csv", engine="pyarrow", dtype_backend="pyarrow")
df_old = pd.read_csv("sample.csv")
# do the benchmark
%timeit df_new.groupby("s")["v1"].sum()
# >> 660 ms ± 20.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df_old.groupby("s")["v1"].sum()
# >> 311 ms ± 13.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The new engine is 2x slower than the old engine.
Installed Versions
Prior Performance
No response
Comment From: phofl
This is expected. GroupBy isn't implemented for arrow yet
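A possible interim workaround (a sketch only, assuming the one-off conversion cost is acceptable for your workload) is to move the arrow-backed columns to the numpy-nullable dtypes before grouping:

# Sketch: convert away from the arrow backend so groupby takes the existing
# non-arrow path; the conversion itself costs time and memory.
df_converted = df_new.convert_dtypes(dtype_backend="numpy_nullable")
df_converted.groupby("s")["v1"].sum()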
Comment From: wegamekinglc
@phofl thanks for your comments
Comment From: topper-123
Is there an explanation of the current limitations when using arrow? There have been several blog posts talking up the benefits of having arrow in Pandas, and I think it would be a good idea to lay out the current performance limitations of using arrow in Pandas.
Comment From: jbrockmendel
I'm working on a patch here and having trouble making a performant conversion from the ArrowArray to MaskedArray. The non-working method looks like:
def _to_masked(self):
    pa_dtype = self._pa_array.type
    if pa.types.is_floating(pa_dtype):
        nbits = pa_dtype.bit_width
        dtype = f"Float{nbits}"
        np_dtype = dtype.lower()
        from pandas.core.arrays import FloatingArray as arr_cls
    elif pa.types.is_unsigned_integer(pa_dtype):
        nbits = pa_dtype.bit_width
        dtype = f"UInt{nbits}"
        np_dtype = dtype.lower()
        from pandas.core.arrays import IntegerArray as arr_cls
    elif pa.types.is_signed_integer(pa_dtype):
        nbits = pa_dtype.bit_width
        dtype = f"Int{nbits}"
        np_dtype = dtype.lower()
        from pandas.core.arrays import IntegerArray as arr_cls
    elif pa.types.is_boolean(pa_dtype):
        dtype = "boolean"
        np_dtype = "bool"
        from pandas.core.arrays import BooleanArray as arr_cls
    else:
        raise NotImplementedError
    data = self._pa_array.combine_chunks()
    buffs = data.buffers()
    assert len(buffs) == 2
    mask = self.isna()
    arr = np.array(buffs[1], dtype=np_dtype)
    return arr_cls(arr, mask)
But it looks like this is not the correct way to get arr. @jorisvandenbossche how do I get the correct ndarray here?
Comment From: phofl
Did you try to_numpy on the ArrowExtensionArray and setting the na_value to 1?
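For reference, a minimal sketch of what that suggestion looks like (illustrative data and variable names, not the actual patch):

import pandas as pd

ea = pd.array([1.5, None, 3.0], dtype="float64[pyarrow]")
vals = ea.to_numpy(dtype="float64", na_value=1)  # 1 is just a filler; the NA info lives in the mask
mask = ea.isna()
masked = pd.arrays.FloatingArray(vals, mask)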
Comment From: jbrockmendel
That seems to work, thanks. Branch is ready for once the EA._groupby_op PR is merged.
Comment From: jorisvandenbossche
Did you try to_numpy on the ArrowExtensionArray and setting the na_value to 1?
That doesn't give you a masked array, though (if that's what is needed). And will make a copy if there are missing values.
We normally already have this functionality in the __from_arrow__ methods on the MaskedArray classes. The easiest way to directly reuse this might be self._pa_array.to_pandas(types_mapper=pd.io._util._arrow_dtype_mapping().get). But the underlying functionality could maybe be refactored as a more general utility to use internally (pyarrow will return a Series here, which we don't need).
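For illustration, the dtype's __from_arrow__ can also be called directly, which skips the Series wrapper (a small sketch with illustrative data, not the actual refactor):

import pyarrow as pa
import pandas as pd

chunked = pa.chunked_array([[1.0, None, 3.0]])
masked = pd.Float64Dtype().__from_arrow__(chunked)  # FloatingArray: float64 values plus a bool mask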
Comment From: phofl
@jorisvandenbossche Regarding the copy: I am aware that we create a copy when we have missing values, but wouldn't arrow do the same?
Comment From: jorisvandenbossche
We only need to copy the bitmask (to convert it into a bytemask). The actual data (buffs[1] in Brock's snippet above) doesn't need to be copied, even if there are missing values. Relying on pyarrow's to_numpy / __array__, on the other hand, will always give a full copy of the actual data once there are missing values.
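As a rough sketch of that difference (assuming a single-chunk float64 array with zero offset; simplified, not production code):

import numpy as np
import pyarrow as pa

arr = pa.array([1.0, None, 3.0])
values = np.frombuffer(arr.buffers()[1], dtype=np.float64)[: len(arr)]  # zero-copy view of the value buffer
mask = arr.is_null().to_numpy(zero_copy_only=False)  # only the bitmask is materialized as bytes
full_copy = arr.to_numpy(zero_copy_only=False)  # copies all the data because of the null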
Comment From: phofl
@jorisvandenbossche I looked into this and to_pandas is 5 times slower compared to the to_numpy way. Not sure what's going on under the hood, but I guess one of the problems is that we get back a Series.
Comment From: lukemanley
I took a look at this too and noticed pd.read_csv was returning pyarrow chunked arrays with quite a few chunks (using the example in the OP):
df_new.apply(lambda x: x.array._pa_array.num_chunks)
s 383
i 383
v1 383
v2 383
v3 383
f1 383
f2 383
dtype: int64
That doesn't account for the full 5x, but I see some improvement if you combine chunks first:
arr = df_new["v1"].array._pa_array
%timeit pd.Float64Dtype().__from_arrow__(arr)
# 17.6 ms ± 1.34 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit pd.Float64Dtype().__from_arrow__(arr.combine_chunks())
# 12.1 ms ± 43.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Comment From: jorisvandenbossche
Also using the example from the OP, I see an even bigger difference:
In [36]: %timeit df_new["v1"].array._pa_array.to_numpy()
20.5 ms ± 539 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [38]: %timeit df_new["v1"].array._pa_array.to_pandas(types_mapper=pd.io._util._arrow_dtype_mapping().get)
270 ms ± 2.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
But according to a simple profile of the python (non-native) time, that slow time is almost entirely due to our Series(..) constructor. That seems to be a separate issue on our side that we should fix (it seems to consider the input masked array as a generic sequence or something like that).
Comparing directly with __from_arrow__ (instead of going through to_pandas) as Luke did is therefore indeed a more useful comparison. I see:
In [40]: arr = df_new["v1"].array._pa_array
In [41]: %timeit arr.to_numpy()
20.1 ms ± 917 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [42]: %timeit pd.Float64Dtype().__from_arrow__(arr)
35.5 ms ± 1.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [43]: %timeit pd.Float64Dtype().__from_arrow__(arr.combine_chunks())
24.9 ms ± 941 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Doing the combine_chunks indeed helps, and I am not sure why we actually don't call this inside __from_arrow__ instead of looping over each chunk, converting that, and then concatenating our extension arrays (calling _concat_same_type).
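A rough sketch of that alternative (a hypothetical helper just to make the idea concrete, reusing the pyarrow_array_to_numpy_and_mask utility that appears later in this comment):

import numpy as np
import pyarrow as pa
from pandas.core.arrays.arrow._arrow_utils import pyarrow_array_to_numpy_and_mask

def chunked_to_data_and_mask(chunked: pa.ChunkedArray, np_dtype: np.dtype):
    # combine once inside Arrow instead of converting chunk-by-chunk and
    # concatenating the resulting extension arrays in pandas
    arr = chunked.combine_chunks()
    return pyarrow_array_to_numpy_and_mask(arr, np_dtype)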
Further, even with combining chunks, this is still a bit slower, and looking at a basic profile of pd.Float64Dtype().__from_arrow__(arr.combine_chunks()), this seems largely due to an additional copy of the data (which doesn't happen in to_numpy()):
https://github.com/pandas-dev/pandas/blob/d182a3495a95c7067c13690cfe037597c3714d5d/pandas/core/arrays/numeric.py#L99-L100
In general, for __from_arrow__ on the masked arrays, that copy might make sense (to ensure the arrays are writeable/mutable, although if you have multiple chunks we concat anyway, so we should actually only do this copy if there is a single chunk), but of course in the context of just getting the data to feed into a groupby algo, this copy is wasteful.
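Expressed as a condition (a hypothetical helper; names are illustrative, not actual pandas internals):

def maybe_defensive_copy(values, mask, num_chunks):
    # multi-chunk input was already copied by the concatenation; only a
    # single-chunk conversion still aliases the Arrow buffers and needs a
    # copy to stay writeable
    if num_chunks == 1:
        values, mask = values.copy(), mask.copy()
    return values, mask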
The specific example data we are using here also doesn't have missing values. To illustrate that converting to data+mask is faster than to_numpy when you do have missing values (using pyarrow_array_to_numpy_and_mask directly, to avoid the chunk combining and the unnecessary copy in __from_arrow__):
In [66]: nparr = np.random.randn(10_000_000)
In [67]: nparr[0] = np.nan
In [68]: arr = pa.array(nparr, from_pandas=True)
In [69]: %timeit arr.to_numpy(zero_copy_only=False)
39.8 ms ± 1.59 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [70]: from pandas.core.arrays.arrow._arrow_utils import pyarrow_array_to_numpy_and_mask
In [71]: %timeit pyarrow_array_to_numpy_and_mask(arr, np.dtype("float64"))
15.3 ms ± 1.53 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
Comment From: jorisvandenbossche
But according to a simple profile of the python (non-native) time, that slow time is almost entirely due to our Series(..) constructor. That seems to be a separate issue on our side that we should fix (it seems to consider the input masked array as a generic sequence or something like that).
This was actually also an issue in pyarrow and not just pandas: in ChunkedArray.to_pandas(), we were not actually calling __from_arrow__, but just converting the array as normal (so in this case to a single numpy float array, just like to_numpy) and then passing this to the Series constructor with the specified dtype, i.e. doing pd.Series(nparr, dtype=pd.Float64Dtype()). And that is rather slow on our side (we have to check for NaNs to convert them to a mask, so not sure if that is avoidable).
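Schematically, the extra work that dtype-coerced constructor path has to do (a simplified sketch, not the actual constructor code):

import numpy as np
import pandas as pd

nparr = np.array([1.0, np.nan, 3.0])
mask = np.isnan(nparr)                          # an extra full pass over the data
masked = pd.arrays.FloatingArray(nparr, mask)   # roughly what Series(nparr, dtype=pd.Float64Dtype()) ends up building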
But this should be fixed on the pyarrow side in the upcoming 12.0 release, where ChunkedArray.to_pandas will now correctly go through __from_arrow__ (xref https://github.com/apache/arrow/pull/34559).