Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pyarrow as pa
import pandas as pd
from datetime import datetime
arr = pa.array([datetime(1, 1, 1)], pa.timestamp('s', tz='America/New_York'))
table = pa.table({'a': arr})
table.to_pandas(safe=False)
# a
# 0 1754-08-30 17:47:41.128654848-04:56
table.column('a').to_pandas(safe=False)
# Traceback (most recent call last):
# File "<stdin>", line 1, in <module>
# File "pyarrow/array.pxi", line 837, in pyarrow.lib._PandasConvertible.to_pandas
# return self._to_pandas(options, categories=categories,
# File "pyarrow/table.pxi", line 469, in pyarrow.lib.ChunkedArray._to_pandas
# arr = pandas_dtype.__from_arrow__(self)
# File "/Users/alenkafrim/repos/pyarrow-dev/lib/python3.10/site-packages/pandas/core/dtypes/dtypes.py", line 848, in __from_arrow__
# array = array.cast(pyarrow.timestamp(unit=self._unit), safe=True)
# File "pyarrow/table.pxi", line 551, in pyarrow.lib.ChunkedArray.cast
# return _pc().cast(self, target_type, safe=safe, options=options)
# File "/Users/alenkafrim/repos/arrow/python/pyarrow/compute.py", line 400, in cast
# return call_function("cast", [arr], options, memory_pool)
# File "pyarrow/_compute.pyx", line 572, in pyarrow._compute.call_function
# return func.call(args, options=options, memory_pool=memory_pool,
# File "pyarrow/_compute.pyx", line 367, in pyarrow._compute.Function.call
# result = GetResultValue(
# File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
# return check_status(status)
# File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
# raise ArrowInvalid(message)
# pyarrow.lib.ArrowInvalid: Casting from timestamp[s, tz=America/New_York] to timestamp[ns] would result in out of bounds timestamp: -62135596800
# /Users/alenkafrim/repos/arrow/cpp/src/arrow/compute/exec.cc:920 kernel_->exec(kernel_ctx_, input, &output)
# /Users/alenkafrim/repos/arrow/cpp/src/arrow/compute/function.cc:276 executor->Execute(input, &listener)
arr = pa.array([1, 2, 3], pa.timestamp("ms", "Europe/Brussels"))
table = pa.table({"name": arr})
result = table.column("name").to_pandas()
result.name
# is None, but should be "name"
Issue Description
Some tests in PyArrow started failing about a day ago (https://github.com/apache/arrow/issues/35235). Converting a pyarrow Table to pandas works fine for timestamps with timezones, but converting a single column to a Series fails if the timezone is not None.
The tests are failing on the Arrow code freeze branch made at the beginning of this week. The same tests pass with the latest pandas release, but fail with the development (nightly) version.
We suspect this could be connected with https://github.com/pandas-dev/pandas/pull/52677 but are not sure.
@mroeschke do you maybe have an idea of what could be triggering this?
Expected Behavior
The conversion from table column to pandas should not fail.
Installed Versions
Comment From: mroeschke
array = array.cast(pyarrow.timestamp(unit=self._unit), safe=True)
From this line, looks like https://github.com/pandas-dev/pandas/pull/52201 was the change.
I guess this is tricky because the __from_arrow__ can essentially ignore the safe=False in the to_pandas call. I don't think it would be good to end up with 1754-08-30 17:47:41.128654848-04:56 even if safe=False?
Comment From: mroeschke
is None, but should be "name"
Guessing this also is due to going through __from_arrow__ now, but since __from_arrow__ does not have access to the column name I'm guessing this is a pyarrow issue?
Comment From: AlenkaF
From this line, looks like https://github.com/pandas-dev/pandas/pull/52201 was the change.
Thank you for this info, it is correct! 👍
I guess this is tricky because the __from_arrow__ can essentially ignore the safe=False in the to_pandas call. I don't think it would be good to end up with 1754-08-30 17:47:41.128654848-04:56 even if safe=False?
After talking to Joris, we think this is actually connected to the fact that pyarrow still casts the datetime to ns if the resolutions differ (https://github.com/apache/arrow/issues/34462). pandas therefore receives the datetime with ns resolution and performs a cast that is no longer needed (https://github.com/tswast/pandas/blob/8c509257eaee411fbbe9a7e696bfad3d01bb1e1f/pandas/core/dtypes/dtypes.py#L845), and that cast fails for out-of-bounds values.
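For reference, the out-of-bounds arithmetic can be sketched in plain Python, without pyarrow or pandas. This is only an illustration under one assumption: that an unsafe cast lets the int64 nanosecond value wrap around in two's complement. The helper `wrap_int64` is hypothetical and not part of either library.

```python
from datetime import datetime, timedelta

INT64_MIN, INT64_MAX = -2**63, 2**63 - 1
EPOCH = datetime(1970, 1, 1)


def wrap_int64(value):
    """Hypothetical helper: emulate two's-complement int64 wraparound."""
    return (value - INT64_MIN) % 2**64 + INT64_MIN


# datetime(1, 1, 1) as whole seconds since the Unix epoch.
# This is the value quoted in the ArrowInvalid message above.
seconds = (datetime(1, 1, 1) - EPOCH) // timedelta(seconds=1)

# Rescaling to nanoseconds no longer fits in int64, which is why a
# cast with safe=True has to raise.
nanoseconds = seconds * 10**9
fits = INT64_MIN <= nanoseconds <= INT64_MAX

# Assumption: an unsafe cast wraps instead of raising.
wrapped = wrap_int64(nanoseconds)
whole_seconds, frac_ns = divmod(wrapped, 10**9)
wrapped_utc = EPOCH + timedelta(seconds=whole_seconds)

print(seconds, fits)
print(wrapped_utc, frac_ns)
```

Under that assumption the wrapped value lands on 1754-08-30 in UTC with a fractional part of 128654848 ns, which lines up with the date and fraction in the table.to_pandas(safe=False) output from the report.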
Guessing this also is due to going through __from_arrow__ now, but since __from_arrow__ does not have access to the column name I'm guessing this is a pyarrow issue?
Correct.
I will close this issue, as the work to update the code after https://github.com/pandas-dev/pandas/pull/52201 needs to be done on the pyarrow side.