Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"x": [1, 2, 3], "y": ["foo", "bar", "baz"]})

with pd.option_context("mode.dtype_backend", "pyarrow"):
    df = df.convert_dtypes()

assert df.y.dtype == pd.ArrowDtype(pa.string())  # this passes
assert df.y.dtype == pd.StringDtype("pyarrow")   # this raises
```
Issue Description
The `DataFrame.convert_dtypes()` docstring describes the `convert_string` keyword as:

> convert_string : bool, default True
>     Whether object dtypes should be converted to `StringDtype()`.

However, when using the pyarrow dtype backend, `convert_dtypes()` actually converts to `pd.ArrowDtype(pa.string())` instead of `pd.StringDtype("pyarrow")`, which differs from the docstring description. It's also my current understanding (which could totally be wrong) that we should generally prefer `pd.StringDtype("pyarrow")` today, as it's more feature complete (xref https://github.com/pandas-dev/pandas/issues/50074#issuecomment-1338299793).
cc @mroeschke @phofl for visibility
Expected Behavior
`DataFrame.convert_dtypes()` should convert to `pd.StringDtype("pyarrow")` instead of `pd.ArrowDtype(pa.string())` when using the pyarrow dtype backend.
Installed Versions
```
INSTALLED VERSIONS
------------------
commit : 7f2aa8f46a4a36937b1be8ec2498c4ff2dd4cf34
python : 3.10.4.final.0
python-bits : 64
OS : Darwin
OS-release : 22.2.0
Version : Darwin Kernel Version 22.2.0: Fri Nov 11 02:08:47 PST 2022; root:xnu-8792.61.2~4/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.0.0.dev0+1309.g7f2aa8f46a
numpy : 1.24.1
pytz : 2022.1
dateutil : 2.8.2
setuptools : 59.8.0
pip : 22.0.4
Cython : None
pytest : 7.1.3
hypothesis : None
sphinx : 4.5.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.2.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli :
fastparquet : 2022.12.1.dev6
fsspec : 2023.1.0+5.g012816b
gcsfs : None
matplotlib : 3.5.1
numba : None
numexpr : 2.8.0
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 11.0.0.dev316
pyreadstat : None
pyxlsb : None
s3fs : 2022.10.0
scipy : 1.9.0
snappy :
sqlalchemy : 1.4.35
tables : 3.7.0
tabulate : None
xarray : 2022.3.0
xlrd : None
zstandard : None
tzdata : None
qtpy : None
pyqt5 : None
```
Comment From: phofl
Using the `ArrowExtensionArray` is what we are doing for all the I/O methods, hence this is similar here (we should probably adjust the docstring).

That said, https://github.com/pandas-dev/pandas/pull/50325 is the only feature missing from `pd.ArrowDtype(pa.string())` compared to `StringDtype` (with pyarrow), and we should get that in for 2.0. Afterwards, using `pd.ArrowDtype(pa.string())` should be fine for downstream libraries.
Comment From: jrbourbeau
> Using the ArrowExtensionArray is what we are doing for all the I/O methods
>
> That said, https://github.com/pandas-dev/pandas/pull/50325 is the only feature missing from pd.ArrowDtype(pa.string()) compared to StringDtype (with pyarrow) and we should get that in for 2.0
Ah, I see. For some reason I thought strings were special-cased to still use `pd.StringDtype("pyarrow")` in, for example, `read_parquet`. In that case, I agree that a docstring update is probably all that's needed here.
Comment From: jrbourbeau
> That said, https://github.com/pandas-dev/pandas/pull/50325 is the only feature missing from pd.ArrowDtype(pa.string()) compared to StringDtype (with pyarrow) and we should get that in for 2.0. Afterwards using pd.ArrowDtype(pa.string()) should be fine for downstream libraries.
Just following up on this comment. Now that https://github.com/pandas-dev/pandas/pull/51207 is in, does this mean there's no longer a need to use `pd.StringDtype("pyarrow")`, as `pd.ArrowDtype(pa.string())` is now just as performant/feature complete?
Comment From: mroeschke
Correct, but there is a slight caveat.

In terms of string operations, `pd.StringDtype("pyarrow")` is more permissive and will fall back to numpy-based operations, while `pd.ArrowDtype(pa.string())` is stricter and will raise `NotImplementedError` if the associated pyarrow compute function is not available. The fallback behavior of `pd.StringDtype("pyarrow")` is essentially equivalent to using `pd.StringDtype("python")` for those methods.
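To illustrate the difference, a minimal sketch (`casefold` is just a stand-in here; exactly which methods lack a pyarrow compute kernel depends on the installed pandas/pyarrow versions):

```python
import pandas as pd
import pyarrow as pa

s_fallback = pd.Series(["foo", "bar"], dtype=pd.StringDtype("pyarrow"))
s_strict = pd.Series(["foo", "bar"], dtype=pd.ArrowDtype(pa.string()))

# StringDtype("pyarrow") silently falls back to the Python/numpy
# implementation when a pyarrow compute kernel is missing, so this succeeds.
s_fallback.str.casefold()

# ArrowDtype(pa.string()) raises instead of falling back.
try:
    s_strict.str.casefold()
except NotImplementedError:
    print("no pyarrow compute kernel for this method on this version")
```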
Would that significantly impact the Dask use case?
Comment From: jorisvandenbossche
I think `convert_dtypes` should prefer `StringDtype` over `ArrowDtype`. We have already been exposing this for a longer time, so people could already do the following (using pandas 1.5 here):
```python
In [1]: pd.options.mode.string_storage = "pyarrow"

In [2]: df = pd.DataFrame({'a': ["foo", "bar"]}, dtype=object)

In [3]: df2 = df.convert_dtypes()

In [5]: df2.dtypes
Out[5]:
a    string
dtype: object

In [6]: df2.dtypes[0]
Out[6]: string[pyarrow]
```
Changing that to `ArrowDtype` is a breaking change (since, as mentioned, the behaviour is not exactly the same for all operations). Of course this is still experimental, so we can make breaking changes. But personally, I think the `StringDtype` behaviour is actually better for most users, and it is also the nicer API.

(And IMO we should do that everywhere, so not only in `convert_dtypes` but also in the I/O methods; e.g. for `read_parquet` this is also a change in behaviour.)
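For concreteness, a minimal sketch of the I/O behaviour under discussion, using the `dtype_backend` keyword that the readers gained for pandas 2.0 (the file path is hypothetical):

```python
import pandas as pd

# Default: string columns come back numpy/object-backed (pre-2.0 behaviour).
df_default = pd.read_parquet("data.parquet")

# pyarrow backend: string columns come back as pd.ArrowDtype(pa.string()),
# not pd.StringDtype("pyarrow").
df_arrow = pd.read_parquet("data.parquet", dtype_backend="pyarrow")
```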
Comment From: phofl
Sorry, missed the button.

This is still the behavior on 2.0, but we are using the `ArrowExtensionArray` when the `dtype_backend` option is set on top of it (which is new for 2.0).
Comment From: mroeschke
Right, in the above example this is still `StringDtype("pyarrow")`:

```python
In [2]: type(df2.dtypes[0])
Out[2]: pandas.core.arrays.string_.StringDtype
```
While I'm not opposed to the `StringDtype("pyarrow")` API style in theory (though for consistency there should be an equivalent `Type("storage")` spelling for all dtypes), we should move users away from the `StringDtype("pyarrow")` array implementation, which sometimes falls back to or directly returns numpy-backed objects. As we've been working toward removing silent casting to different types in other operations, I think the same should apply here: if an operation is not supported by the arrow-backed implementation, it would be more explicit for the user to cast to the numpy-based implementation if the operation is supported there. A sketch of that explicit-cast pattern is below.
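Again using `casefold` as a stand-in for a method without a pyarrow kernel:

```python
import pandas as pd
import pyarrow as pa

s = pd.Series(["foo", "bar"], dtype=pd.ArrowDtype(pa.string()))

# Rather than relying on a silent fallback, opt in to the numpy-backed
# implementation explicitly, then cast back if the arrow-backed dtype
# is still wanted.
result = s.astype(pd.StringDtype("python")).str.casefold()
result = result.astype(pd.ArrowDtype(pa.string()))
```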
Comment From: phofl
I think we can close here? Since the addition of `dtype_backend`, this can be configured explicitly.
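For reference, a minimal sketch of that explicit configuration on pandas 2.0, via the `dtype_backend` keyword on `convert_dtypes`:

```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"x": [1, 2, 3], "y": ["foo", "bar", "baz"]})

# numpy-nullable backend (the default): "y" becomes pd.StringDtype()
df_nullable = df.convert_dtypes(dtype_backend="numpy_nullable")

# pyarrow backend: "y" becomes pd.ArrowDtype(pa.string())
df_arrow = df.convert_dtypes(dtype_backend="pyarrow")
assert df_arrow["y"].dtype == pd.ArrowDtype(pa.string())
```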
Comment From: phofl
Removing milestone and blocker for now