Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this issue exists on the latest version of pandas.
- [X] I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
The following code reproduces my issue:
import numpy as np
import pandas as pd
# generate a bunch of sample data for later use
n = 5000000
s_samples = [f"s_{i}" for i in range(1, 101)]
i_samples = [f"i_{i}" for i in range(1, 201)]
bool_samples = [True, False]
ssamples = np.random.choice(s_samples, n)
isamples = np.random.choice(i_samples, n)
d_values = np.random.randn(3, n)
b_values = np.random.choice(bool_samples, n)
df = pd.DataFrame(
    dict(s=ssamples, i=isamples, v1=d_values[0], v2=d_values[1], v3=d_values[2], f1=b_values, f2=b_values)
)
df.to_csv("sample.csv", index=None)
# read in data with different engine
df_new = pd.read_csv("sample.csv", engine="pyarrow", dtype_backend="pyarrow")
df_old = pd.read_csv("sample.csv")
# do the benchmark
%timeit df_new.groupby("s")["v1"].sum()
# >> 660 ms ± 20.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df_old.groupby("s")["v1"].sum()
# >> 311 ms ± 13.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The new engine is 2x slower than the old engine.
Installed Versions
Prior Performance
No response
Comment From: phofl
This is expected. GroupBy isn't implemented for arrow yet
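A possible interim workaround (a sketch only, assuming the one-off conversion cost is acceptable for your workload) is to move the arrow-backed columns to the numpy-nullable dtypes before grouping:

# Sketch: convert away from the arrow backend so groupby takes the existing
# non-arrow path; the conversion itself costs time and memory.
df_converted = df_new.convert_dtypes(dtype_backend="numpy_nullable")
df_converted.groupby("s")["v1"].sum()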
Comment From: wegamekinglc
@phofl thanks for your comments
Comment From: topper-123
Is there an explanation of the current limitations when using arrow? There have been several blog posts talking up the benefits of having arrow in Pandas, and I think it would be a good idea to lay out the current performance limitations of using arrow in Pandas.
Comment From: jbrockmendel
I'm working on a patch here and having trouble making a performant conversion from the ArrowArray to MaskedArray. The non-working method looks like:
def _to_masked(self):
    pa_dtype = self._pa_array.type
    if pa.types.is_floating(pa_dtype):
        nbits = pa_dtype.bit_width
        dtype = f"Float{nbits}"
        np_dtype = dtype.lower()
        from pandas.core.arrays import FloatingArray as arr_cls
    elif pa.types.is_unsigned_integer(pa_dtype):
        nbits = pa_dtype.bit_width
        dtype = f"UInt{nbits}"
        np_dtype = dtype.lower()
        from pandas.core.arrays import IntegerArray as arr_cls
    elif pa.types.is_signed_integer(pa_dtype):
        nbits = pa_dtype.bit_width
        dtype = f"Int{nbits}"
        np_dtype = dtype.lower()
        from pandas.core.arrays import IntegerArray as arr_cls
    elif pa.types.is_boolean(pa_dtype):
        dtype = "boolean"
        np_dtype = "bool"
        from pandas.core.arrays import BooleanArray as arr_cls
    else:
        raise NotImplementedError
    data = self._pa_array.combine_chunks()
    buffs = data.buffers()
    assert len(buffs) == 2
    mask = self.isna()
    arr = np.array(buffs[1], dtype=np_dtype)
    return arr_cls(arr, mask)
But it looks like this is not the correct way to get arr. @jorisvandenbossche how do I get the correct ndarray here?
Comment From: phofl
Did you try to_numpy on the ArrowExtensionArray and setting the na_value to 1?
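For reference, a minimal sketch of what that suggestion looks like (illustrative data and variable names, not the actual patch):

import pandas as pd

ea = pd.array([1.5, None, 3.0], dtype="float64[pyarrow]")
vals = ea.to_numpy(dtype="float64", na_value=1)  # 1 is just a filler; the NA info lives in the mask
mask = ea.isna()
masked = pd.arrays.FloatingArray(vals, mask)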
Comment From: jbrockmendel
That seems to work, thanks. Branch is ready for once the EA._groupby_op PR is merged.
Comment From: jorisvandenbossche
Did you try to_numpy on the ArrowExtensionArray and setting the na_value to 1?
That doesn't give you a masked array, though (if that's what is needed). And will make a copy if there are missing values.
We normally already have this functionality in the __from_arrow__ methods on the MaskedArray classes. The easiest way to directly reuse this might be self._pa_array.to_pandas(types_mapper=pd.io._util._arrow_dtype_mapping().get). But the underlying functionality could maybe be refactored as a more general utility to use internally (pyarrow will return a Series here, which we don't need).
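For illustration, the dtype's __from_arrow__ can also be called directly, which skips the Series wrapper (a small sketch with illustrative data, not the actual refactor):

import pyarrow as pa
import pandas as pd

chunked = pa.chunked_array([[1.0, None, 3.0]])
masked = pd.Float64Dtype().__from_arrow__(chunked)  # FloatingArray: float64 values plus a bool mask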
Comment From: phofl
@jorisvandenbossche Regarding the copy: I am aware that we create a copy when we have missing values, but wouldn't arrow do the same?
Comment From: jorisvandenbossche
We only need to copy the bitmask (to convert it into a bytemask). The actual data (buffs[1] in Brock's snippet above) doesn't need to be copied, even if there are missing values. Relying on pyarrow's to_numpy / __array__, on the other hand, will always give a full copy of the actual data once there are missing values.
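As a rough sketch of that difference (assuming a single-chunk float64 array with zero offset; simplified, not production code):

import numpy as np
import pyarrow as pa

arr = pa.array([1.0, None, 3.0])
values = np.frombuffer(arr.buffers()[1], dtype=np.float64)[: len(arr)]  # zero-copy view of the value buffer
mask = arr.is_null().to_numpy(zero_copy_only=False)  # only the bitmask is materialized as bytes
full_copy = arr.to_numpy(zero_copy_only=False)  # copies all the data because of the null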
Comment From: phofl
@jorisvandenbossche I looked into this and to_pandas is 5 times slower compared to the to_numpy way. Not sure what's going on under the hood, but I guess one of the problems is that we get back a Series.
Comment From: lukemanley
I took a look at this too and noticed pd.read_csv was returning pyarrow chunked arrays with quite a few chunks (using the example in the OP):
df_new.apply(lambda x: x.array._pa_array.num_chunks)
s 383
i 383
v1 383
v2 383
v3 383
f1 383
f2 383
dtype: int64
That doesn't account for the full 5x, but I see some improvement if you combine chunks first:
arr = df_new["v1"].array._pa_array
%timeit pd.Float64Dtype().__from_arrow__(arr)
# 17.6 ms ± 1.34 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit pd.Float64Dtype().__from_arrow__(arr.combine_chunks())
# 12.1 ms ± 43.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Comment From: jorisvandenbossche
Also using the example from the OP, I see an even bigger difference:
In [36]: %timeit df_new["v1"].array._pa_array.to_numpy()
20.5 ms ± 539 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [38]: %timeit df_new["v1"].array._pa_array.to_pandas(types_mapper=pd.io._util._arrow_dtype_mapping().get)
270 ms ± 2.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
But according to a simple profile of the python (non-native) time, that slow time is almost entirely due to our Series(..) constructor. That seems to be a separate issue on our side that we should fix (it seems to consider the input masked array as a generic sequence or something like that).
Comparing directly with __from_arrow__ (instead of going through to_pandas) as Luke did is therefore indeed a more useful comparison. I see:
In [40]: arr = df_new["v1"].array._pa_array
In [41]: %timeit arr.to_numpy()
20.1 ms ± 917 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [42]: %timeit pd.Float64Dtype().__from_arrow__(arr)
35.5 ms ± 1.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [43]: %timeit pd.Float64Dtype().__from_arrow__(arr.combine_chunks())
24.9 ms ± 941 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Doing the combine_chunks indeed helps, and I am not sure why we actually don't call this inside __from_arrow__ instead of looping over each chunk, converting that, and then concatenating our extension arrays (calling _concat_same_type).
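A rough sketch of that alternative (a hypothetical helper just to make the idea concrete, reusing the pyarrow_array_to_numpy_and_mask utility that appears later in this comment):

import numpy as np
import pyarrow as pa
from pandas.core.arrays.arrow._arrow_utils import pyarrow_array_to_numpy_and_mask

def chunked_to_data_and_mask(chunked: pa.ChunkedArray, np_dtype: np.dtype):
    # combine once inside Arrow instead of converting chunk-by-chunk and
    # concatenating the resulting extension arrays in pandas
    arr = chunked.combine_chunks()
    return pyarrow_array_to_numpy_and_mask(arr, np_dtype)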
Further, even with combining chunks, this is still a bit slower, and looking at a basic profile of pd.Float64Dtype().__from_arrow__(arr.combine_chunks()), this seems largely due to an additional copy of the data (which doesn't happen in to_numpy()):
https://github.com/pandas-dev/pandas/blob/d182a3495a95c7067c13690cfe037597c3714d5d/pandas/core/arrays/numeric.py#L99-L100
In general, for __from_arrow__ on the masked arrays, that copy might make sense (to ensure the arrays are writeable/mutable, although if you have multiple chunks we concat anyway, so we should actually only do this copy if there is a single chunk), but of course in the context of just getting the data to feed into a groupby algo, this copy is wasteful.
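Expressed as a condition (a hypothetical helper; names are illustrative, not actual pandas internals):

def maybe_defensive_copy(values, mask, num_chunks):
    # multi-chunk input was already copied by the concatenation; only a
    # single-chunk conversion still aliases the Arrow buffers and needs a
    # copy to stay writeable
    if num_chunks == 1:
        values, mask = values.copy(), mask.copy()
    return values, mask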
The specific example data we are using here also doesn't have missing values. To illustrate that converting to data+mask is faster than to_numpy when you do have missing values (using pyarrow_array_to_numpy_and_mask directly, to avoid the chunk combining and the unnecessary copy in __from_arrow__):
In [66]: nparr = np.random.randn(10_000_000)
In [67]: nparr[0] = np.nan
In [68]: arr = pa.array(nparr, from_pandas=True)
In [69]: %timeit arr.to_numpy(zero_copy_only=False)
39.8 ms ± 1.59 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [70]: from pandas.core.arrays.arrow._arrow_utils import pyarrow_array_to_numpy_and_mask
In [71]: %timeit pyarrow_array_to_numpy_and_mask(arr, np.dtype("float64"))
15.3 ms ± 1.53 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
Comment From: jorisvandenbossche
But according to a simple profile of the python (non-native) time, that slow time is almost entirely due to our Series(..) constructor. That seems to be a separate issue on our side that we should fix (it seems to consider the input masked array as a generic sequence or something like that).
This was actually also an issue in pyarrow and not just pandas: in ChunkedArray.to_pandas(), we were not actually calling __from_arrow__, but just converting the array as normal (so in this case to a single numpy float array, just like to_numpy) and then passing this to the Series constructor with the specified dtype, i.e. doing pd.Series(nparr, dtype=pd.Float64Dtype()). And that is rather slow on our side (we have to check for NaNs to convert them to a mask, so not sure if that is avoidable).
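Schematically, the extra work that dtype-coerced constructor path has to do (a simplified sketch, not the actual constructor code):

import numpy as np
import pandas as pd

nparr = np.array([1.0, np.nan, 3.0])
mask = np.isnan(nparr)                          # an extra full pass over the data
masked = pd.arrays.FloatingArray(nparr, mask)   # roughly what Series(nparr, dtype=pd.Float64Dtype()) ends up building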
But this should be fixed on the pyarrow side in the upcoming 12.0 release, where ChunkedArray.to_pandas will now correctly go through __from_arrow__ (xref https://github.com/apache/arrow/pull/34559).