Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
import pyarrow as pa
data = [0, 123456789, None, 2**63 - 1, -123456789]
pa_type = pa.timestamp('ns', tz='UTC')
arr = pa.array(data, type=pa_type)
table = pa.Table.from_arrays([arr], ['ts_col'])
dtype = pd.DatetimeTZDtype(unit='ns', tz='UTC')
table.to_pandas(types_mapper=lambda _: dtype)
Output:
File /opt/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/pyarrow/pandas_compat.py:778, in _reconstruct_block(item, columns, extension_columns)
776 pandas_dtype = extension_columns[name]
777 if not hasattr(pandas_dtype, '__from_arrow__'):
--> 778 raise ValueError("This column does not support to be converted "
779 "to a pandas ExtensionArray")
780 pd_ext_arr = pandas_dtype.__from_arrow__(arr)
781 block = _int.make_block(pd_ext_arr, placement=placement)
ValueError: This column does not support to be converted to a pandas ExtensionArray
Issue Description
Since DatetimeTZDtype does not support __from_arrow__, this extension dtype can't be used in pyarrow's to_pandas method.
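For contrast, a minimal sketch (values and column name made up) showing that the same types_mapper pattern works for a dtype that does implement __from_arrow__, such as Int64Dtype, while DatetimeTZDtype lacks the hook that pyarrow checks for:
import pandas as pd
import pyarrow as pa
# Int64Dtype defines __from_arrow__, so types_mapper can return it.
int_table = pa.Table.from_arrays([pa.array([1, 2, None], type=pa.int64())], ['x'])
int_table.to_pandas(types_mapper=lambda _: pd.Int64Dtype())  # column dtype: Int64
# DatetimeTZDtype does not define the hook, which is exactly what pyarrow checks.
hasattr(pd.DatetimeTZDtype(unit='ns', tz='UTC'), '__from_arrow__')  # False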
Expected Behavior
The example should return a pandas DataFrame with a DatetimeTZDtype column.
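For comparison, a rough sketch of the result I'd expect; pyarrow's native conversion (without a types_mapper) already yields this dtype for nanosecond data:
import pyarrow as pa
arr = pa.array([0, 123456789, None], type=pa.timestamp('ns', tz='UTC'))
pa.Table.from_arrays([arr], ['ts_col']).to_pandas().dtypes
# ts_col    datetime64[ns, UTC]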
Installed Versions
Comment From: jbrockmendel
Shouldn't users use an ArrowDtype instead?
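For reference, a minimal sketch of that route using the table from the reproducible example, assuming a pandas version (2.x) that ships ArrowDtype; pd.ArrowDtype accepts a pyarrow type, so it can be passed directly as the types_mapper:
import pandas as pd
import pyarrow as pa
arr = pa.array([0, 123456789, None], type=pa.timestamp('ns', tz='UTC'))
table = pa.Table.from_arrays([arr], ['ts_col'])
df = table.to_pandas(types_mapper=pd.ArrowDtype)
df.dtypes  # ts_col    timestamp[ns, tz=UTC][pyarrow]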
Comment From: tswast
@jbrockmendel Some additional context: I'm working on some improvements to interoperability between BigQuery and pandas. BigQuery represents TIMESTAMP data as microseconds UTC, so it corresponds most closely to DatetimeTZDtype(unit='us', tz='UTC').
BigQuery can send query response data back as arrow record batches, so going directly from pyarrow to a corresponding pandas dtype would be ideal.
Indeed we could use the arrow dtype for everything, but I presume the Series.dt accessor and datetime indexes would be more cumbersome to use at the moment with an arrow dtype wrapping an arrow timestamp, right?
Comment From: jbrockmendel
Indeed we could use the arrow dtype for everything, but I presume the Series.dt accessor and datetime indexes would be more cumbersome to use at the moment with an arrow dtype wrapping an arrow timestamp, right?
@mroeschke recently implemented Series.dt with timestamp-ArrowDtype. You wouldn't get a DatetimeIndex, but you should have Index.dt available.
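A minimal sketch of that, assuming pandas 2.0+ (the value is made up):
import pandas as pd
import pyarrow as pa
s = pd.Series([pd.Timestamp('2023-01-01', tz='UTC')],
              dtype=pd.ArrowDtype(pa.timestamp('us', tz='UTC')))
s.dt.year  # the .dt accessor works on ArrowDtype-backed timestamps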
Comment From: tswast
Series.dt with timestamp-ArrowDtype
Very cool! I'll have to try that out.
I think it still makes sense to have DatetimeTZDtype.__from_arrow__ available for consistency with the other extension dtypes, but this does unblock me.
Comment From: jorisvandenbossche
For historical context, the reason that DatetimeTZDtype is missing __from_arrow__ is that this data type has always been handled natively by pyarrow itself in its to_pandas conversion, while __from_arrow__ was introduced for custom dtypes. Now, it can certainly be added for consistency, and to enable the use case you bring up (although pyarrow will change behaviour here to preserve the unit in the near future, so this specific example should no longer need to use types_mapper).
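A hedged sketch of what that should mean for the microsecond use case, assuming a future pyarrow default that preserves the unit (and a pandas 2.x version with non-nanosecond support):
import pandas as pd
import pyarrow as pa
us_arr = pa.array([pd.Timestamp('2023-01-01', tz='UTC')], type=pa.timestamp('us', tz='UTC'))
pa.Table.from_arrays([us_arr], ['ts_col']).to_pandas().dtypes
# Expected: ts_col    datetime64[us, UTC] once the unit is preserved
# (older pyarrow versions coerce to datetime64[ns, UTC]).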