Pandas version checks

  • [x] I have checked that this issue has not already been reported.

  • [x] I have confirmed this bug exists on the latest version of pandas.

  • [x] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

df = pd.DataFrame({"key": range(100000), "val": "test"})
%timeit df.groupby(["key"]).first();

pa_df = df.convert_dtypes(dtype_backend="pyarrow")
%timeit pa_df.groupby(["key"]).first();

pa_df = pa_df.astype({"val": pd.StringDtype("pyarrow")})
%timeit pa_df.groupby(["key"]).first();

Issue Description

Grouping and then aggregating a dataframe that contains ArrowDtype(pyarrow.string()) columns is orders of magnitude slower than performing the same operations on an equivalent dataframe whose string column uses any of the other accepted string types (e.g. string, StringDtype("python"), StringDtype("pyarrow")). This is particularly surprising because StringDtype("pyarrow") does not exhibit the same problem.

Note that in the bug reproduction example, DataFrame.convert_dtypes with dtype_backend="pyarrow" converts string columns to ArrowDtype(pyarrow.string()) rather than StringDtype("pyarrow").

Finally, here's a sample run, with dtypes printed out for clarity; I've reproduced this on both OS X and OpenSuse Tumbleweed for the listed pandas and pyarrow versions (as well as current main):

In [7]: import pandas as pd

In [8]: df = pd.DataFrame({"key": range(100000), "val": "test"})

In [9]: df["val"].dtype
Out[9]: dtype('O')

In [10]: %timeit df.groupby(["key"]).first();
8.37 ms ± 599 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [11]: pa_df = df.convert_dtypes(dtype_backend="pyarrow")

In [13]: type(pa_df["val"].dtype)
Out[13]: pandas.core.dtypes.dtypes.ArrowDtype

In [14]: %timeit pa_df.groupby(["key"]).first();
2.39 s ± 142 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [15]: pa_df = pa_df.astype({"val": pd.StringDtype("pyarrow")})
    ...:

In [16]: type(pa_df["val"].dtype)
Out[16]: pandas.core.arrays.string_.StringDtype

In [17]: %timeit pa_df.groupby(["key"]).first();
12.9 ms ± 306 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Expected Behavior

Aggregation performance on ArrowDtype(pyarrow.string()) columns should be comparable to aggregation performance on StringDtype("pyarrow") and other string-typed columns.

Installed Versions

INSTALLED VERSIONS
------------------
commit : 0691c5cf90477d3503834d983f69350f250a6ff7
python : 3.13.1
python-bits : 64
OS : Darwin
OS-release : 24.3.0
Version : Darwin Kernel Version 24.3.0: Thu Jan 2 20:24:16 PST 2025; root:xnu-11215.81.4~3/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LC_ALL : en_CA.UTF-8
LANG : None
LOCALE : en_CA.UTF-8
pandas : 2.2.3
numpy : 2.2.2
pytz : 2025.1
dateutil : 2.9.0.post0
pip : 24.3.1
Cython : None
sphinx : None
IPython : 8.32.0
adbc-driver-postgresql : None
adbc-driver-sqlite : None
bs4 : None
blosc : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2025.2.0
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : 3.1.5
lxml.etree : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
psycopg2 : None
pymysql : None
pyarrow : 19.0.0
pyreadstat : None
pytest : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlsxwriter : None
zstandard : None
tzdata : 2025.1
qtpy : None
pyqt5 : None

Comment From: rhshadrach

Thanks for the report!

Aggregation performance on ArrowDtype(pyarrow.string()) columns should be comparable to aggregation performance on StringDtype("pyarrow") and other string-typed columns.

I don't think this expectation is correct. ArrowExtensionArray (corresponding to ArrowDtype) is essentially a wrapper around general PyArrow data, whereas ArrowStringArray (corresponding to StringDtype) is a data container specially crafted to support PyArrow strings. I do not believe we aim for either feature or performance parity between the two. Users should prefer StringDtype.
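For reference, a minimal sketch of opting into the pyarrow-backed StringDtype explicitly (this just reuses the example from the report and assumes pyarrow is installed):

import pandas as pd

df = pd.DataFrame({"key": range(100_000), "val": "test"})

# Request the pyarrow-backed StringDtype rather than the ArrowDtype wrapper
df = df.astype({"val": pd.StringDtype("pyarrow")})

print(isinstance(df["val"].dtype, pd.StringDtype))  # True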

@jorisvandenbossche @WillAyd - should

df.convert_dtypes(dtype_backend="pyarrow")

be converting to StringDtype here?

Comment From: WillAyd

Hmm I don't think so - the StringDtype is associated with the "numpy_nullable" backend. The pyarrow backend has always provided back the ArrowExtensionArray wrapper types.

Do we know where the bottleneck is performance-wise? I agree with the OP that the performance should be equivalent between the two types, if not slightly faster for the ArrowExtensionArray

Comment From: rhshadrach

Hmm I don't think so - the StringDtype is associated with the "numpy_nullable" backend.

https://github.com/pandas-dev/pandas/blob/70edaa0b4661df6f251f2e3d3ae5c55ef371fc74/pandas/core/arrays/string_arrow.py#L128

Comment From: snitish

@WillAyd looks like the bottleneck is due to ArrowExtensionArray._groupby_op not properly handling the string[pyarrow] dtype. We have special handling for StringDtype, but string[pyarrow] falls through to to_masked(), which again does not handle this dtype. As a result, the groupby operation is performed in pure Python, which causes the slowness.
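A quick way to see the fallback is to profile the slow path from the reproduction above (a sketch; look for _agg_py_fallback near the top of the cumulative-time output):

import cProfile
import pandas as pd

df = pd.DataFrame({"key": range(100_000), "val": "test"})
pa_df = df.convert_dtypes(dtype_backend="pyarrow")

# For the ArrowDtype column, _agg_py_fallback dominates the profile
cProfile.run('pa_df.groupby("key").first()', sort="cumtime")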

Comment From: WillAyd

Thanks for that insight @snitish. Yeah, anything we can do to remove the layers of indirection, and particularly Arrow <> NumPy copies, would be very beneficial.

Comment From: WillAyd

FWIW I realize the OP is talking about strings, but this likely applies to the wrapped Arrow types in general. You can see the same performance issues using another Arrow type like decimal:

In [108]: df = pd.DataFrame({
     ...:     "key": pd.Series(range(100_000), dtype=pd.ArrowDtype(pa.int64())),
     ...:     "val": pd.Series(["3.14"] * 100_000, dtype=pd.ArrowDtype(pa.decimal128(10, 4)))
     ...: })
In [109]: %timeit df.groupby("key").sum()
2.83 s ± 131 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [110]: %timeit pa.TableGroupBy(pa.Table.from_pandas(df), "key").aggregate([("val", "sum")])
8.18 ms ± 589 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
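If a pandas result is needed, the fast PyArrow path above can be wrapped back into a DataFrame; a sketch, noting that PyArrow names the aggregated column following its own "<column>_<agg>" convention:

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({
    "key": pd.Series(range(100_000), dtype=pd.ArrowDtype(pa.int64())),
    "val": pd.Series(["3.14"] * 100_000, dtype=pd.ArrowDtype(pa.decimal128(10, 4))),
})

# Aggregate in PyArrow and convert the result back to pandas
result = (
    pa.TableGroupBy(pa.Table.from_pandas(df), "key")
    .aggregate([("val", "sum")])
    .to_pandas()
)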

Comment From: snitish

@WillAyd also noticed that .astype("string[pyarrow]") converts the dtype to StringDtype and the aggregation is much faster, while .convert_dtypes(dtype_backend="pyarrow") does not. Both dtypes display as string[pyarrow], but only one is a StringDtype. Is this expected?

In [2]: df = pd.DataFrame({"key": range(100000), "val": "test"}).astype({"val": "string[pyarrow]"})

In [3]: df["val"].dtype
Out[3]: string[pyarrow]

In [4]: isinstance(df["val"].dtype, pd.StringDtype)
Out[4]: True

In [6]: %timeit df.groupby("key").first()
16.5 ms ± 234 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [7]: pa_df = df.convert_dtypes(dtype_backend="pyarrow")

In [8]: pa_df["val"].dtype
Out[8]: string[pyarrow]

In [9]: isinstance(pa_df["val"].dtype, pd.StringDtype)
Out[9]: False

In [10]: %timeit pa_df.groupby("key").first()
4.73 s ± 44.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Comment From: kzvezdarov

I meant to open a separate issue about convert_dtypes but got carried away and forgot to, sorry about that. At least per the pandas.DataFrame.convert_dtypes doc, it seems that when convert_string is set to True, string objects should be converted to StringDtype, but in practice they appear to get converted to the wrapper type ArrowDtype(pyarrow.string()) instead. In both cases the dtype's name property resolves to string[pyarrow].

In [3]: df = pd.DataFrame({"val": ["test"]}).convert_dtypes(dtype_backend="pyarrow")

In [4]: type(df.dtypes["val"])
Out[4]: pandas.core.dtypes.dtypes.ArrowDtype

Comment From: jorisvandenbossche

Both dtypes display as string[pyarrow], but only one is a StringDtype. Is this expected?

Yes, that is expected (although it is very annoying that they have the same repr ..). See also https://pandas.pydata.org/pdeps/0014-string-dtype.html for context. The StringDtype variant will become the default in 3.0.
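Until then, an isinstance check is the reliable way to tell the two apart, since the repr is ambiguous (a small sketch):

import pandas as pd
import pyarrow as pa

s1 = pd.Series(["a"], dtype=pd.StringDtype("pyarrow"))
s2 = pd.Series(["a"], dtype=pd.ArrowDtype(pa.string()))

# Both dtypes render as string[pyarrow], but they are different types
print(isinstance(s1.dtype, pd.StringDtype))  # True
print(isinstance(s2.dtype, pd.ArrowDtype))   # True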

Comment From: jorisvandenbossche

At least per the pandas.DataFrame.convert_dtypes doc, it seems that when convert_string is set to True, string objects should be converted to StringDtype, but in practice they appear to get converted to the wrapper type ArrowDtype(pyarrow.string()) instead.

This is of course not a simple case with one obvious correct answer, but, in general the idea is that convert_dtypes() is meant to give you the nullable dtypes, and when additionally passing dtype_backend="pyarrow", you are asking to get ArrowDtype dtypes (that is also what the docstring says).

Of course, if more of the non-ArrowDtype nullable dtypes end up backed by pyarrow (like StringDtype now), this division gets a bit tricky (another reason to move forward with https://github.com/pandas-dev/pandas/pull/58455). I assume one issue here is that the default StringDtype currently uses the "python" string storage rather than pyarrow, which might be why you are asking for dtype_backend="pyarrow". On the dev version we have already switched the default storage of StringDtype to pyarrow, so starting with pandas 3.0, if you want nullable dtypes through convert_dtypes and want to use pyarrow just for the strings, you won't need to specify dtype_backend="pyarrow".
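Concretely, a small sketch of the difference described above (on pandas 2.x, where the default StringDtype storage is still "python"):

import pandas as pd

df = pd.DataFrame({"val": ["test"]})

# Default nullable dtypes: the string column becomes a StringDtype
print(isinstance(df.convert_dtypes()["val"].dtype, pd.StringDtype))  # True

# dtype_backend="pyarrow": the column becomes the ArrowDtype wrapper instead
print(isinstance(df.convert_dtypes(dtype_backend="pyarrow")["val"].dtype, pd.ArrowDtype))  # True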

Comment From: kzvezdarov

Makes sense, thanks for clarifying. I'd just misinterpreted the relation between convert_string=True and dtype_backend="pyarrow" to imply converting to the nullable extension type, set to the pyarrow backend.

Comment From: kzvezdarov

FWIW I realize the OP is talking about strings, but this likely applies to the wrapped Arrow types in general. You can see the same performance issues using another Arrow type like decimal:

In [108]: df = pd.DataFrame({
     ...:     "key": pd.Series(range(100_000), dtype=pd.ArrowDtype(pa.int64())),
     ...:     "val": pd.Series(["3.14"] * 100_000, dtype=pd.ArrowDtype(pa.decimal128(10, 4)))
     ...: })
In [109]: %timeit df.groupby("key").sum()
2.83 s ± 131 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [110]: %timeit pa.TableGroupBy(pa.Table.from_pandas(df), "key").aggregate([("val", "sum")])
8.18 ms ± 589 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

It seems to only happen for the decimal and string ArrowDtype; the rest I've tested with (float64, int64, boolean, timestamp) all seem to be only marginally slower. cProfile shows that for both ArrowDtype(pa.string()) and ArrowDtype(pa.decimal128(10, 4)) aggregation falls through to _agg_py_fallback (see attached profiler screenshot),

whereas for other types it continues on to _cython_operation (see attached profiler screenshot).

Making ArrowDtype(pyarrow.string()) follow the same handling as StringDtype seems to work, in the sense of fixing the performance issue, but I'm not that familiar with pandas internals, so I'm not sure whether this is a correct/acceptable approach:

diff --git a/pandas/core/arrays/arrow/array.py b/pandas/core/arrays/arrow/array.py
index 0b546bed1c..89952b8f96 100644
--- a/pandas/core/arrays/arrow/array.py
+++ b/pandas/core/arrays/arrow/array.py
@@ -2376,7 +2376,9 @@ class ArrowExtensionArray(
         ids: npt.NDArray[np.intp],
         **kwargs,
     ):
-        if isinstance(self.dtype, StringDtype):
+        if isinstance(self.dtype, StringDtype) or (
+            self.dtype.type == str and self.dtype.storage == "pyarrow"
+        ):
             if how in [
                 "prod",
                 "mean",
diff --git a/pandas/core/arrays/base.py b/pandas/core/arrays/base.py
index 33745438e2..f4824255d3 100644
--- a/pandas/core/arrays/base.py
+++ b/pandas/core/arrays/base.py
@@ -2608,7 +2608,9 @@ class ExtensionArray:
         op = WrappedCythonOp(how=how, kind=kind, has_dropped_na=has_dropped_na)

         # GH#43682
-        if isinstance(self.dtype, StringDtype):
+        if isinstance(self.dtype, StringDtype) or (
+            self.dtype.type == str and self.dtype.storage == "pyarrow"
+        ):
             # StringArray
             if op.how in [
                 "prod",
@@ -2648,7 +2650,9 @@ class ExtensionArray:
             #  through cython_operation
             return res_values

-        if isinstance(self.dtype, StringDtype):
+        if isinstance(self.dtype, StringDtype) or (
+            self.dtype.type == str and self.dtype.storage == "pyarrow"
+        ):
             dtype = self.dtype
             string_array_cls = dtype.construct_array_type()
             return string_array_cls._from_sequence(res_values, dtype=dtype)
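For what it's worth, a quick check of the condition the patch keys on (assuming ArrowDtype exposes .type and .storage the way I read it):

import pandas as pd
import pyarrow as pa

dtype = pd.ArrowDtype(pa.string())

# The patched branch treats pyarrow-backed string dtypes like StringDtype
print(dtype.type is str)   # expected: True
print(dtype.storage)       # expected: pyarrow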