Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [ ] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import datetime
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
s = pd.Series(pa.array([datetime.date.today(), datetime.date.today(), datetime.date.today()]), dtype='date32[pyarrow]') 
df = pd.DataFrame({'c1': s, 'c2': s})
pq.write_to_dataset(pa.Table.from_pandas(df, preserve_index=False), 'dataset', ['c1'])
ret = pd.read_parquet('dataset') # exception

Issue Description

When partitioning is used, the pyarrow date32 column is written into the path and read back as a dictionary of strings instead of a dictionary of date32 values (or simply date32; I was surprised that dataset writing automatically converts to a category type). When the strings are then cast back to date32, an exception is thrown.

Expected Behavior

Something similar to this:

import datetime
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
s = pd.Series(pa.array([datetime.date.today(), datetime.date.today(), datetime.date.today()]), dtype='date32[pyarrow]') 
df = pd.DataFrame({'c1': s, 'c2': s})
t = pa.Table.from_pandas(df, preserve_index=False)
pq.write_to_dataset(t, 'dataset', ['c1'])
dataset = pq.ParquetDataset('dataset/', schema=t.schema)
ret = dataset.read().to_pandas()

Which returns the original DataFrame.

Installed Versions

pandas : 2.0.1
pyarrow : 11.0.0

Comment From: alippai

There is a suggestion here: https://github.com/apache/arrow/issues/22510 that the pandas metadata could be used to specify the schema needed for the pyarrow Dataset.

Comment From: alippai

I'm not familiar with the pandas read_parquet() implementation yet, but something like this could work:

import pyarrow.parquet as pq

dataset_pre = pq.ParquetDataset('dataset/')
# partition columns whose original pandas dtype is recorded in the pandas metadata
columns_to_cast = [
    c for c in dataset_pre.schema.pandas_metadata['columns']
    if c['field_name'] in dataset_pre.partitioning.schema.names
]

correct_schema = dataset_pre.schema
for column in columns_to_cast:
    idx = correct_schema.get_field_index(column['field_name'])
    field = correct_schema.field(idx)
    # convert_pandas_to_pyarrow is a placeholder for mapping the stored dtype
    # string (column['numpy_type']) to a pyarrow type
    correct_schema = correct_schema.set(idx, field.with_type(convert_pandas_to_pyarrow(column['numpy_type'])))

dataset_final = pq.ParquetDataset('dataset/', schema=correct_schema)
table = dataset_final.read()
df = table.to_pandas()
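
A possible concrete stand-in for the convert_pandas_to_pyarrow placeholder above (an assumption on my part, not an existing helper): for pyarrow-backed columns, the dtype string recorded in the pandas metadata can be resolved through pandas and the underlying pyarrow type read off the resulting ArrowDtype.

import pandas as pd

# Hypothetical mapping for the placeholder: parse the stored dtype string
# (e.g. 'date32[pyarrow]' from the example) back into a pandas ArrowDtype
# and take its pyarrow type.
dtype = pd.api.types.pandas_dtype('date32[pyarrow]')
pa_type = dtype.pyarrow_dtype  # -> date32[day]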

Would you be interested in changing the PyArrowImpl to use something like the above or should this be part of pyarrow? I can draft a PR if you think this is desired.

Comment From: alippai

@phofl @jorisvandenbossche @mroeschke I see you are the main contributors to this code.

Comment From: rhshadrach

For the partitioning column, I believe the data is not stored in the parquet files. When reading back in, the column is constructed from the file paths instead, so roundtripping with dtypes is not possible.
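
For illustration (a minimal sketch; the actual directory names depend on the values written): with hive-style partitioning, the c1 values from the example end up only in directory names such as dataset/c1=<value>/, while the parquet files inside contain only c2.

import os

# Walk the dataset written by the reproducible example; the partition values
# are visible only in the directory names, not inside the files.
for root, dirs, files in os.walk('dataset'):
    print(root, dirs, files)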

In any case, this is a pyarrow issue and not pandas.

Comment From: alippai

@rhshadrach pandas stores metadata info in the parquet files, and this extra data makes it possible to restore the schema and dtypes. Since it needs pandas-specific info, it's debatable in which of the two (pyarrow or pandas) this should be implemented.
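
For example, the stored pandas metadata can be inspected directly from the dataset schema (this is the same metadata the sketch above reads):

import pyarrow.parquet as pq

# The pandas metadata written by pyarrow records, for every column (including
# the partition column 'c1'), the original pandas dtype string.
print(pq.ParquetDataset('dataset/').schema.pandas_metadata['columns'])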

This is the same suggestion as @jorisvandenbossche had in the linked JIRA issue.

Comment From: rhshadrach

pandas stores metadata info in the parquet files

I am no expert here, so correct me if this is wrong, but I believe this is done on the pyarrow side and not within pandas.

Comment From: alippai

@rhshadrach looks like it is there. I guess in this case the only question is whether it's acceptable that storing and then loading a pandas DataFrame raises an exception. Note that storing and loading the pyarrow table has no issues.

Regardless, this can be closed, adding the notes to the pyarrow issue now.

Comment From: jorisvandenbossche

I think there are multiple aspects that interact:

  • When reading a partitioned dataset, the column that originates from the partitioning (embedded in the file names) is indeed reconstructed from the strings. PyArrow itself only does very little type inference when parsing those strings (I think just numbers or strings).
  • The default for this partitioning field is to reconstruct it as a dictionary array (categorical), because it always results in highly repetitive values (one value for all rows that come from the same file). This can be turned off in pyarrow; I would have to check whether you can also do this through pd.read_parquet.
  • When converting a pandas DataFrame to pyarrow Table (when writing), pyarrow does store metadata about the original pandas data types, and this metadata already gets used on the conversion back from pyarrow to pandas.

And it's the combination of those three items that results in having a dictionary typed column with string categories (result of the first two bullet points) that we want to try to convert to a pd.ArrowDtype("date32[day]") (because that is what the stored metadata says the original was, result of the third bullet point).

And then it's actually in the pandas implementation of this conversion (ArrowDtype.__from_arrow__) that pandas tries to convert the pyarrow array it gets passed (the dictionary array) to the pyarrow type of the ArrowDtype (date32) using pyarrow's cast function. That cast then fails with "Cast type date32[day] incompatible with dictionary type string" (which is a difficult way of saying that the cast from a dictionary-of-strings type to date32[day] is not implemented).
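
A minimal sketch of that failing cast and of a decode-first variant (assuming a recent pyarrow; the exact error wording may differ):

import pyarrow as pa

# Dictionary-of-strings array, similar to what the partition column becomes
dict_arr = pa.array(['2023-05-03', '2023-05-03']).dictionary_encode()

# The direct cast is not implemented and raises, as described above:
# dict_arr.cast(pa.date32())

# Decoding first (casting to the dictionary's value type), then casting, works:
dates = dict_arr.cast(pa.string()).cast(pa.date32())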

Comment From: rhshadrach

@jorisvandenbossche - is there something that should be done here on the pandas side?

Comment From: jorisvandenbossche

I was still contemplating that, it might depend on the exact behaviour we would prefer to see in practice?

  • If we want the dictionary string array to actually get converted to date32, then this conversion could be fixed on the pandas side within __from_arrow__ (by special-casing dictionary input, i.e. decoding the dictionary or casting only its dictionary values); see the sketch after this list.
  • Another option could be to say that on the pyarrow side, if a conversion for a column based on calling __from_arrow__ (when inferred from the stored pandas metadata) fails, pyarrow falls back to the normal, default conversion (as if there were no metadata). A potential issue with this is that it might also hide errors that ideally bubble up to the user.
  • If we want the result to be a dictionary of date32, then in theory pyarrow could know that if the original pandas metadata indicated a non-dictionary type, it should cast the dictionary values to that type, and then create a pandas.Categorical for it. But that might get a bit complicated. A problem is that at the point of converting the pyarrow.Table to pandas.DataFrame, pyarrow doesn't know anymore that a certain dictionary typed column originates from a partitioning field, so it's difficult to do something specific to this use case.
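
A rough sketch of the kind of special-casing the first option describes (a hypothetical helper, not the actual pandas implementation):

import pyarrow as pa

def cast_with_dictionary_fallback(arr, target_type):
    # Hypothetical helper: try the direct cast; if pyarrow has no cast from the
    # dictionary type to the target type, decode the dictionary (cast it to its
    # value type) and cast the decoded values instead.
    try:
        return arr.cast(target_type)
    except pa.ArrowNotImplementedError:
        if pa.types.is_dictionary(arr.type):
            return arr.cast(arr.type.value_type).cast(target_type)
        raise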

Comment From: rhshadrach

As a user, I desire round tripping regardless of whether partitions are used or not. For example:

df1 = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5]})
df2 = pd.DataFrame({'a': ['1', '1', '2'], 'b': [3, 4, 5]})

If I call df1.to_parquet('test.parquet', partition_cols=['a']); pd.read_parquet('test.parquet'), then I get a DataFrame with a as int64, and likewise doing the same with df2 gives a as object containing strings. Similarly for any other dtype that is supported by parquet and has a faithful repr.

I personally do not find getting dictionary array / Categorical for partition columns to be a feature.
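
For what it's worth, the dictionary conversion and the string inference can be sidestepped on the pyarrow side by declaring the partition schema explicitly. A sketch for the date32 example above, assuming pd.read_parquet forwards the keyword to pyarrow.parquet.read_table (not verified end-to-end):

import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds

# Declare the partition field's type so pyarrow neither infers a string type
# nor wraps the partition column in a dictionary when reading it back.
part = ds.partitioning(pa.schema([('c1', pa.date32())]), flavor='hive')
ret = pd.read_parquet('dataset', partitioning=part)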

Comment From: alippai

I don't have a strong opinion on categorical vs non-categorical (but if defaulting to convert, then a flag to disable it is always nice). The performance improvements are usually non-negligible, so the extra conversion makes sense.

The int vs string vs date issue is more annoying (mainly because it raises an exception). I've just realized that I had similar issues with pyarrow partition filters as well (the behavior has changed in recent versions): https://github.com/apache/arrow/issues/34727

Comment From: jorisvandenbossche

If I call df1.to_parquet('test.parquet', partition_cols=['a']); pd.read_parquet('test.parquet'), then I get a DataFrame with a as int64, and likewise doing the same with df2 gives a as object containing strings. Similarly for any other dtype that is supported by parquet and has a faithful repr.

Hmm, that's not what I see:

In [8]: df1.to_parquet('test.parquet', partition_cols=['a']); df1_roundtrip = pd.read_parquet('test.parquet')

In [9]: df1_roundtrip.dtypes
Out[9]: 
b       int64
a    category
dtype: object

In [10]: df1_roundtrip["a"].dtype
Out[10]: CategoricalDtype(categories=[1, 2], ordered=False)

In [11]: df1_roundtrip["a"].dtype.categories
Out[11]: Int64Index([1, 2], dtype='int64')

So the categories are correctly converted back to integers, but the column itself is categorical.

(above is with pyarrow 11.0.0 and pandas 1.5.3)

Comment From: rhshadrach

Hmm, that's not what I see:

@jorisvandenbossche: https://github.com/pandas-dev/pandas/issues/53008#issuecomment-1533986068 was my desired behavior, not current state. But I'd be more than happy with a flag to disable categorical conversion, and even really okay if I'm stuck with categories. It's the dtype conversion (string -> int), especially the stripping of leading 0s, that is a pain point for me.

So the categories are correctly converted back to integers, but the column itself is categorical.

The categories of df2['a'] are also converted to integers, even though they start out as strings.