Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"x": [1, 2, 3], "y": ["foo", "bar", "baz"]})

with pd.option_context("mode.dtype_backend", "pyarrow"):
    df = df.convert_dtypes()

assert df.y.dtype == pd.ArrowDtype(pa.string())  # this passes
assert df.y.dtype == pd.StringDtype("pyarrow")   # this raises
```
Issue Description
The `DataFrame.convert_dtypes()` docstring describes the `convert_string` keyword as:

> convert_string : bool, default True
>     Whether object dtypes should be converted to `StringDtype()`.

However, when using the pyarrow dtype backend, `convert_dtypes()` actually converts to `pd.ArrowDtype(pa.string())` instead of `pd.StringDtype("pyarrow")`, which differs from the docstring description. It's also my current understanding (which could totally be wrong) that we should generally prefer `pd.StringDtype("pyarrow")` today, as it's more feature complete (xref https://github.com/pandas-dev/pandas/issues/50074#issuecomment-1338299793).
cc @mroeschke @phofl for visibility
Expected Behavior
`DataFrame.convert_dtypes()` should convert to `pd.StringDtype("pyarrow")` instead of `pd.ArrowDtype(pa.string())` when using the pyarrow dtype backend.
Installed Versions
```
INSTALLED VERSIONS
------------------
commit : 7f2aa8f46a4a36937b1be8ec2498c4ff2dd4cf34
python : 3.10.4.final.0
python-bits : 64
OS : Darwin
OS-release : 22.2.0
Version : Darwin Kernel Version 22.2.0: Fri Nov 11 02:08:47 PST 2022; root:xnu-8792.61.2~4/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.0.0.dev0+1309.g7f2aa8f46a
numpy : 1.24.1
pytz : 2022.1
dateutil : 2.8.2
setuptools : 59.8.0
pip : 22.0.4
Cython : None
pytest : 7.1.3
hypothesis : None
sphinx : 4.5.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.2.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli :
fastparquet : 2022.12.1.dev6
fsspec : 2023.1.0+5.g012816b
gcsfs : None
matplotlib : 3.5.1
numba : None
numexpr : 2.8.0
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 11.0.0.dev316
pyreadstat : None
pyxlsb : None
s3fs : 2022.10.0
scipy : 1.9.0
snappy :
sqlalchemy : 1.4.35
tables : 3.7.0
tabulate : None
xarray : 2022.3.0
xlrd : None
zstandard : None
tzdata : None
qtpy : None
pyqt5 : None
```
Comment From: phofl
Using the `ArrowExtensionArray` is what we are doing for all the I/O methods, hence this is similar here (we should probably adjust the docstring).

That said, https://github.com/pandas-dev/pandas/pull/50325 is the only feature missing from `pd.ArrowDtype(pa.string())` compared to `StringDtype` (with pyarrow), and we should get that in for 2.0. Afterwards, using `pd.ArrowDtype(pa.string())` should be fine for downstream libraries.
Comment From: jrbourbeau
> Using the ArrowExtensionArray is what we are doing for all the I/O methods
>
> That said, https://github.com/pandas-dev/pandas/pull/50325 is the only feature missing from pd.ArrowDtype(pa.string()) compared to StringDtype (with pyarrow) and we should get that in for 2.0
Ah, I see. For some reason I thought strings were special-cased to still use `pd.StringDtype("pyarrow")` in, for example, `read_parquet`. In that case, I agree that a docstring update is probably all that's needed here.
Comment From: jrbourbeau
> That said, https://github.com/pandas-dev/pandas/pull/50325 is the only feature missing from pd.ArrowDtype(pa.string()) compared to StringDtype (with pyarrow) and we should get that in for 2.0. Afterwards using pd.ArrowDtype(pa.string()) should be fine for downstream libraries.
Just following up on this comment. Now that https://github.com/pandas-dev/pandas/pull/51207 is in, does this mean there's no longer a need to use `pd.StringDtype("pyarrow")`, as `pd.ArrowDtype(pa.string())` is now just as performant/feature complete?
Comment From: mroeschke
Correct, but there is a slight caveat.

In terms of string operations, `pd.StringDtype("pyarrow")` is more permissive and will fall back to numpy-based operations, while `pd.ArrowDtype(pa.string())` is stricter and will raise `NotImplementedError` if the associated pyarrow compute function is not available. The fallback behavior of `pd.StringDtype("pyarrow")` is essentially equivalent to using `pd.StringDtype("python")` for those methods.
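To illustrate the difference, a minimal sketch (`casefold` is just a stand-in here; exactly which methods lack a pyarrow compute kernel depends on the installed pandas/pyarrow versions):

```python
import pandas as pd
import pyarrow as pa

s_fallback = pd.Series(["foo", "bar"], dtype=pd.StringDtype("pyarrow"))
s_strict = pd.Series(["foo", "bar"], dtype=pd.ArrowDtype(pa.string()))

# StringDtype("pyarrow") silently falls back to the Python/numpy
# implementation when a pyarrow compute kernel is missing, so this succeeds.
s_fallback.str.casefold()

# ArrowDtype(pa.string()) raises instead of falling back.
try:
    s_strict.str.casefold()
except NotImplementedError:
    print("no pyarrow compute kernel for this method on this version")
```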
Would that significantly impact the Dask use case?
Comment From: jorisvandenbossche
I think `convert_dtypes` should prefer `StringDtype` over `ArrowDtype`. We have already been exposing this for a longer time, so people could already do the following (using pandas 1.5 here):
```python
In [1]: pd.options.mode.string_storage = "pyarrow"

In [2]: df = pd.DataFrame({'a': ["foo", "bar"]}, dtype=object)

In [3]: df2 = df.convert_dtypes()

In [5]: df2.dtypes
Out[5]:
a    string
dtype: object

In [6]: df2.dtypes[0]
Out[6]: string[pyarrow]
```
Changing that to `ArrowDtype` is a breaking change (since, as mentioned, the behaviour is not exactly the same for all operations). Of course this is still experimental, so we can make breaking changes. But personally, I think the `StringDtype` behaviour is actually better for most users, and it is also the nicer API.

(And IMO we should do that everywhere, so not only in `convert_dtypes` but also in the I/O methods; e.g. for `read_parquet` this is also a change in behaviour.)
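For concreteness, a minimal sketch of the I/O behaviour under discussion, using the `dtype_backend` keyword that the readers gained for pandas 2.0 (the file path is hypothetical):

```python
import pandas as pd

# Default: string columns come back numpy/object-backed (pre-2.0 behaviour).
df_default = pd.read_parquet("data.parquet")

# pyarrow backend: string columns come back as pd.ArrowDtype(pa.string()),
# not pd.StringDtype("pyarrow").
df_arrow = pd.read_parquet("data.parquet", dtype_backend="pyarrow")
```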
Comment From: phofl
Sorry, missed the button.

This is still the behavior on 2.0, but we are using the `ArrowExtensionArray` when the `dtype_backend` option is set on top of it (which is new for 2.0).
Comment From: mroeschke
Right, in the above example this is still `StringDtype("pyarrow")`:

```python
In [2]: type(df2.dtypes[0])
Out[2]: pandas.core.arrays.string_.StringDtype
```
While I'm not opposed to the `StringDtype("pyarrow")` API style in theory (though for consistency there should be an equivalent `Type("storage")` spelling for all dtypes), we should move users away from the `StringDtype("pyarrow")` array implementation, which sometimes falls back to or directly returns numpy-backed objects. As we've been working toward removing silent casting to different types in other operations, I think the same should apply here: if an operation is not supported by the arrow-backed implementation, it would be more explicit for the user to cast to the numpy-based implementation if the operation is supported there. A sketch of that explicit-cast pattern is below.
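Again using `casefold` as a stand-in for a method without a pyarrow kernel:

```python
import pandas as pd
import pyarrow as pa

s = pd.Series(["foo", "bar"], dtype=pd.ArrowDtype(pa.string()))

# Rather than relying on a silent fallback, opt in to the numpy-backed
# implementation explicitly, then cast back if the arrow-backed dtype
# is still wanted.
result = s.astype(pd.StringDtype("python")).str.casefold()
result = result.astype(pd.ArrowDtype(pa.string()))
```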
Comment From: phofl
I think we can close here? Since the addition of `dtype_backend`, this can be configured explicitly.
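For reference, a minimal sketch of that explicit configuration on pandas 2.0, via the `dtype_backend` keyword on `convert_dtypes`:

```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"x": [1, 2, 3], "y": ["foo", "bar", "baz"]})

# numpy-nullable backend (the default): "y" becomes pd.StringDtype()
df_nullable = df.convert_dtypes(dtype_backend="numpy_nullable")

# pyarrow backend: "y" becomes pd.ArrowDtype(pa.string())
df_arrow = df.convert_dtypes(dtype_backend="pyarrow")
assert df_arrow["y"].dtype == pd.ArrowDtype(pa.string())
```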
Comment From: phofl
Removing milestone and blocker for now