Pandas version checks
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
from io import StringIO
csv = StringIO("""\
x
1
abc""")
# do not set dtype - let it be picked up automatically
df = pd.read_csv(csv, dtype_backend='pyarrow')
print('dtype after read_csv with guessed dtype:', df.x.dtype)
csv = StringIO("""\
x
1
abc""")
# explicitly set dtype
df2 = pd.read_csv(csv, dtype_backend='pyarrow', dtype={'x': 'string[pyarrow]'})
print('dtype after read_csv with explicit string[pyarrow] dtype:', df2.x.dtype)
"""
dtype of df2.x should be string[pyarrow] but it is string.
It is indeed marked as string[pyarrow] when called outside print() function: df2.x.dtype. But still with pd.to_numeric() df2.x is coerced to float64 (not to double[pyarrow] which is the case for df.x)
"""
print('df.x coerced to_numeric:',pd.to_numeric(df.x, errors='coerce').dtype)
print('df2.x coerced to_numeric:',pd.to_numeric(df2.x, errors='coerce').dtype)
Issue Description
The read_csv() function with the pyarrow backend returns data as string[pyarrow] when the dtype is guessed automatically. But when the dtype is set explicitly to string[pyarrow], the data is returned as string. The latter can be coerced with to_numeric() to float64, which is not the case for the former, which is coerced to double[pyarrow].
Expected Behavior
dtype of df2.x should be string[pyarrow]
Installed Versions
Comment From: mroeschke
Thanks for the report. Unfortunately, part of the issue here is that str of the older StringDtype("pyarrow") implementation returns "string", even though it is also backed by pyarrow:
In [17]: str(df2.x.dtype)
Out[17]: 'string'
In [18]: repr(df2.x.dtype)
Out[18]: 'string[pyarrow]'
In [19]: df2.x.array
Out[19]:
<ArrowStringArray>
['1', 'abc']
Length: 2, dtype: string
So specifying {'x': 'string[pyarrow]'} will set the dtype to StringDtype("pyarrow"), although dtype: string is shown in the series repr.
Comment From: almasgit
The issue goes beyond just representation. Subsequent coercion to numeric converts the data to float64, so it is not an arrow type anymore.
pd.to_numeric(df2.x, errors='coerce').array
<PandasArray>
[1.0, nan]
Length: 2, dtype: float64
pd.to_numeric(df.x, errors='coerce').array
<ArrowExtensionArray>
[1.0, nan]
Length: 2, dtype: double[pyarrow]
The analysis workflow becomes very different for df.x and df2.x after coercion, for example in NaN detection:
pd.to_numeric(df2.x, errors='coerce').isna().sum()
1
pd.to_numeric(df.x, errors='coerce').isna().sum()
0
So just the fact that the dtype {'x': 'string[pyarrow]'} was explicitly specified changes the workflow, which is unexpected.
Comment From: mroeschke
Okay, the core issue here then is that to_numeric does not return an ArrowExtensionArray for StringDtype("pyarrow") dtype data:
In [1]: pd.to_numeric(pd.Series(["1"], dtype="string[pyarrow]"))
Out[1]:
0 1
dtype: int64
In [2]: import pyarrow as pa
In [3]: pd.to_numeric(pd.Series(["1"], dtype=pd.ArrowDtype(pa.string())))
Out[3]:
0 1
dtype: int64[pyarrow]
Comment From: phofl
string[pyarrow] results in an ArrowStringArray, which is treated like the "numpy_nullable" dtypes, so it should return "Float64" instead of "float64". If you want to get an ArrowExtensionArray for strings, like for int64[pyarrow], then you have to use pd.ArrowDtype(pa.string()) instead of string[pyarrow].