Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": list("caaab")})
df["y"] = df["y"].astype(pd.StringDtype("pyarrow"))
df["y"] = df["y"].astype("category")
df.to_parquet("test-arrow.parquet")
Issue Description
I have a dataframe with string[pyarrow]. I can convert it to category, but the resulting dataframe can't be written with to_parquet:
Traceback (most recent call last):
File "/Users/jbennet/Library/Application Support/JetBrains/PyCharm2022.3/scratches/writeparquet_pyarrow.py", line 6, in <module>
df.to_parquet("test-arrow.parquet")
File "/Users/jbennet/mambaforge/envs/pandas-dev/lib/python3.8/site-packages/pandas/util/_decorators.py", line 211, in wrapper
return func(*args, **kwargs)
File "/Users/jbennet/mambaforge/envs/pandas-dev/lib/python3.8/site-packages/pandas/core/frame.py", line 2976, in to_parquet
return to_parquet(
File "/Users/jbennet/mambaforge/envs/pandas-dev/lib/python3.8/site-packages/pandas/io/parquet.py", line 430, in to_parquet
impl.write(
File "/Users/jbennet/mambaforge/envs/pandas-dev/lib/python3.8/site-packages/pandas/io/parquet.py", line 174, in write
table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
File "pyarrow/table.pxi", line 3557, in pyarrow.lib.Table.from_pandas
File "/Users/jbennet/mambaforge/envs/pandas-dev/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 611, in dataframe_to_arrays
arrays = [convert_column(c, f)
File "/Users/jbennet/mambaforge/envs/pandas-dev/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 611, in <listcomp>
arrays = [convert_column(c, f)
File "/Users/jbennet/mambaforge/envs/pandas-dev/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 598, in convert_column
raise e
File "/Users/jbennet/mambaforge/envs/pandas-dev/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 592, in convert_column
result = pa.array(col, type=type_, from_pandas=True, safe=safe)
File "pyarrow/array.pxi", line 310, in pyarrow.lib.array
File "pyarrow/array.pxi", line 2608, in pyarrow.lib.DictionaryArray.from_arrays
File "pyarrow/array.pxi", line 320, in pyarrow.lib.array
File "pyarrow/array.pxi", line 39, in pyarrow.lib._sequence_to_array
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: ("Could not convert <pyarrow.StringScalar: 'a'> with type pyarrow.lib.StringScalar: did not recognize Python value type when inferring an Arrow data type", 'Conversion failed for column y with type category')
Expected Behavior
Expected a parquet file to be created.
Installed Versions
Comment From: phofl
Hi, thanks for your report. This looks like a bug in pyarrow: their conversion routine for categoricals can't deal with categories backed by pyarrow arrays and chunked arrays.
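A minimal sketch to isolate the failing step: the traceback above ends inside pa.Table.from_pandas, so calling it directly should reproduce the error without writing a file (behaviour depends on the installed pyarrow version).

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"y": list("caaab")})
df["y"] = df["y"].astype(pd.StringDtype("pyarrow")).astype("category")

# On affected pyarrow versions this raises the same ArrowInvalid error
# as to_parquet, since to_parquet goes through Table.from_pandas.
pa.Table.from_pandas(df)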
Comment From: j-bennet
Filed a corresponding issue in pyarrow: https://github.com/apache/arrow/issues/34449.
Comment From: phofl
Small side-note: this happens for any pd.ArrowDtype, e.g. int64[pyarrow] has the same problem. Not exclusive to strings.
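For illustration, a hedged variant of the reproducer with an Arrow-backed integer column (the file name is arbitrary):

import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
df["x"] = df["x"].astype("int64[pyarrow]")  # pd.ArrowDtype-backed column
df["x"] = df["x"].astype("category")        # categories backed by an Arrow extension array

# Raises the same ArrowInvalid on affected pyarrow versions.
df.to_parquet("test-arrow-int64.parquet")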
Comment From: j-bennet
The original issue in pyarrow: https://github.com/apache/arrow/issues/33727.
This is the one to keep an eye on. https://github.com/apache/arrow/issues/34449 is a duplicate and now closed.
Comment From: jrbourbeau
Just following up here. https://github.com/apache/arrow/issues/33727 has been resolved upstream in arrow 🎉. I just tried the original failing code snippet using the latest nightly pyarrow and it now runs successfully. Probably still worth adding a regression test for this case though.
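A rough sketch of what such a regression test could look like (the test name and use of pytest's tmp_path are assumptions, and the exact assertion may need relaxing given the categories-dtype issue noted in the next comment):

import pandas as pd
import pandas._testing as tm

def test_to_parquet_roundtrip_pyarrow_backed_categories(tmp_path):
    # Same setup as the original reproducer.
    df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": list("caaab")})
    df["y"] = df["y"].astype(pd.StringDtype("pyarrow"))
    df["y"] = df["y"].astype("category")

    path = tmp_path / "test-arrow.parquet"
    df.to_parquet(path)  # should no longer raise with a fixed pyarrow
    result = pd.read_parquet(path)

    # May need relaxing: categories come back as object dtype (see below).
    tm.assert_frame_equal(result, df)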
Comment From: jrbourbeau
Looking a bit closer, the dtype of the categories isn't preserved when round-tripping through parquet:
In [1]: import pandas as pd
...:
...: df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": list("caaab")})
...: df["y"] = df["y"].astype(pd.StringDtype("pyarrow"))
...: df["y"] = df["y"].astype("category")
...: df.to_parquet("test-arrow.parquet")
In [2]: df2 = pd.read_parquet("test-arrow.parquet")
In [3]: df.y.dtype.categories
Out[3]: Index(['a', 'b', 'c'], dtype='string')
In [4]: df2.y.dtype.categories
Out[4]: Index(['a', 'b', 'c'], dtype='object')
The original categories are pyarrow strings, but reading the parquet dataset in gives object categories. I'm not sure if this is coming from the pandas or pyarrow side, though.
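One possible workaround (a sketch, not a fix) is to cast the categories back to pyarrow-backed strings after reading; rename_categories only swaps the categories index, leaving the codes untouched:

import pandas as pd

df2 = pd.read_parquet("test-arrow.parquet")

# Cast the object-dtype categories back to Arrow-backed strings;
# rename_categories replaces the categories index without recoding.
df2["y"] = df2["y"].cat.rename_categories(
    df2["y"].cat.categories.astype(pd.StringDtype("pyarrow"))
)

df2.y.dtype.categories  # Index(['a', 'b', 'c'], dtype='string')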