Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [X] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

from io import StringIO
import pandas as pd

data = StringIO(
    """,ID,Team,Puzzler
0,ID_1,10,1
1,ID_2,11,0
2,ID_3,10,0
3,ID_4,11,1""")

df = pd.read_csv(data, dtype_backend='pyarrow', index_col=0)
df = df.astype({'Team': 'category'})
df.to_parquet('example.parquet')  # raises pyarrow.lib.ArrowInvalid

Issue Description

I get the error message

pyarrow.lib.ArrowInvalid: ('Could not convert <pyarrow.Int64Scalar: 10> with type pyarrow.lib.Int64Scalar: did not recognize Python value type when inferring an Arrow data type', 'Conversion failed for column Team with type category')

when calling .to_parquet on the DataFrame. I only get the error when setting the type with the .astype() method. If I instead load the DataFrame with df = pd.read_csv(data, dtype_backend='pyarrow', index_col=0, dtype={'Team': 'category'}), it works fine.
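
For reference, a minimal sketch of the working path described above (same CSV buffer and file name as in the reproducer):

from io import StringIO
import pandas as pd

data = StringIO(
    """,ID,Team,Puzzler
0,ID_1,10,1
1,ID_2,11,0
2,ID_3,10,0
3,ID_4,11,1""")

# Declaring the category dtype at read time (instead of via .astype()
# afterwards) writes the parquet file without raising.
df = pd.read_csv(data, dtype_backend='pyarrow', index_col=0,
                 dtype={'Team': 'category'})
df.to_parquet('example.parquet')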

Expected Behavior

To create a parquet file named 'example.parquet' without error.

Installed Versions

INSTALLED VERSIONS
------------------
commit            : aa1f96bc6b0bcd8413880be363c61177511d2481
python            : 3.10.6.final.0
python-bits       : 64
OS                : Linux
OS-release        : 5.19.0-40-generic
Version           : #41~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Mar 31 16:00:14 UTC 2
machine           : x86_64
processor         : x86_64
byteorder         : little
LC_ALL            : None
LANG              : en_GB.UTF-8
LOCALE            : en_GB.UTF-8

pandas            : 2.1.0.dev0+627.gaa1f96bc6b
numpy             : 1.24.3
pytz              : 2023.3
dateutil          : 2.8.2
setuptools        : 67.7.2
pip               : 23.1.1
Cython            : None
pytest            : None
hypothesis        : None
sphinx            : None
blosc             : None
feather           : None
xlsxwriter        : None
lxml.etree        : None
html5lib          : None
pymysql           : None
psycopg2          : None
jinja2            : None
IPython           : None
pandas_datareader : None
bs4               : None
bottleneck        : None
brotli            : None
fastparquet       : None
fsspec            : None
gcsfs             : None
matplotlib        : None
numba             : None
numexpr           : None
odfpy             : None
openpyxl          : None
pandas_gbq        : None
pyarrow           : 11.0.0
pyreadstat        : None
pyxlsb            : None
s3fs              : None
scipy             : None
snappy            : None
sqlalchemy        : None
tables            : None
tabulate          : None
xarray            : None
xlrd              : None
zstandard         : None
tzdata            : 2023.3
qtpy              : None
pyqt5             : None

Comment From: mroeschke

Here's a simpler reproducer (with numpy as np, pandas as pd, and pyarrow as pa imported); this looks to be an issue with pyarrow not being able to convert Categoricals whose values are backed by ArrowDtype.

In [9]: pa.Table.from_pandas(pd.DataFrame(pd.Categorical(np.array([1], dtype=object))))
Out[9]: 
pyarrow.Table
0: dictionary<values=int64, indices=int8, ordered=0>
----
0: [  -- dictionary:
[1]  -- indices:
[0]]

In [10]: pa.Table.from_pandas(pd.DataFrame(pd.Categorical(pd.Series([1], dtype="int64[pyarrow]"))))
ArrowInvalid: ('Could not convert <pyarrow.Int64Scalar: 1> with type pyarrow.lib.Int64Scalar: did not recognize Python value type when inferring an Arrow data type', 'Conversion failed for column 0 with type category')

Comment From: mroeschke

I think this will need to be fully addressed in pyarrow, so closing. Feel free to report this in Arrow.

Comment From: phofl

This should be fixed in the next Arrow release; we already have an issue tracking this.
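
Until a fixed Arrow release is available, one possible workaround (a sketch based on the analysis above, not something confirmed by the maintainers) is to cast the column to a NumPy-backed dtype before making it categorical, so the Categorical's categories are plain int64 values rather than pyarrow scalars:

from io import StringIO
import pandas as pd

data = StringIO(
    """,ID,Team,Puzzler
0,ID_1,10,1
1,ID_2,11,0
2,ID_3,10,0
3,ID_4,11,1""")

df = pd.read_csv(data, dtype_backend='pyarrow', index_col=0)
# Cast Team off the pyarrow backend first, then categorise; the resulting
# categories are NumPy int64 values, which pyarrow converts without error.
df = df.astype({'Team': 'int64'}).astype({'Team': 'category'})
df.to_parquet('example.parquet')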