Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [X] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pyarrow as pa
import pandas as pd
from datetime import datetime

arr = pa.array([datetime(1, 1, 1)], pa.timestamp('s', tz='America/New_York'))
table = pa.table({'a': arr})

table.to_pandas(safe=False)
#                                     a
# 0 1754-08-30 17:47:41.128654848-04:56
table.column('a').to_pandas(safe=False)
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
#   File "pyarrow/array.pxi", line 837, in pyarrow.lib._PandasConvertible.to_pandas
#     return self._to_pandas(options, categories=categories,
#   File "pyarrow/table.pxi", line 469, in pyarrow.lib.ChunkedArray._to_pandas
#     arr = pandas_dtype.__from_arrow__(self)
#   File "/Users/alenkafrim/repos/pyarrow-dev/lib/python3.10/site-packages/pandas/core/dtypes/dtypes.py", line 848, in __from_arrow__
#     array = array.cast(pyarrow.timestamp(unit=self._unit), safe=True)
#   File "pyarrow/table.pxi", line 551, in pyarrow.lib.ChunkedArray.cast
#     return _pc().cast(self, target_type, safe=safe, options=options)
#   File "/Users/alenkafrim/repos/arrow/python/pyarrow/compute.py", line 400, in cast
#     return call_function("cast", [arr], options, memory_pool)
#   File "pyarrow/_compute.pyx", line 572, in pyarrow._compute.call_function
#     return func.call(args, options=options, memory_pool=memory_pool,
#   File "pyarrow/_compute.pyx", line 367, in pyarrow._compute.Function.call
#     result = GetResultValue(
#   File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
#     return check_status(status)
#   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
#     raise ArrowInvalid(message)
# pyarrow.lib.ArrowInvalid: Casting from timestamp[s, tz=America/New_York] to timestamp[ns] would result in out of bounds timestamp: -62135596800
# /Users/alenkafrim/repos/arrow/cpp/src/arrow/compute/exec.cc:920  kernel_->exec(kernel_ctx_, input, &output)
# /Users/alenkafrim/repos/arrow/cpp/src/arrow/compute/function.cc:276  executor->Execute(input, &listener)

arr = pa.array([1, 2, 3], pa.timestamp("ms", "Europe/Brussels"))
table = pa.table({"name": arr})
result = table.column("name").to_pandas()
result.name
# is None, but should be "name"

Issue Description

Some tests in PyArrow started failing about a day ago (https://github.com/apache/arrow/issues/35235). The conversion from a pyarrow table to pandas works fine for timestamps with timezones, but converting a column to a Series fails if the timezone is not None.

The tests are failing on the Arrow code freeze branch made at the beginning of this week. The same tests pass with the latest pandas release, but fail with the development/nightly version.

We suspect this could be connected to https://github.com/pandas-dev/pandas/pull/52677 but are not sure.

@mroeschke do you maybe have an idea of what could be triggering this?

Expected Behavior

The conversion from a table column to a pandas Series should not fail.

Installed Versions

INSTALLED VERSIONS
------------------
commit           : d91b04c30e39ee6cc1c776f60631e1c37be547ef
python           : 3.10.10.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 22.4.0
Version          : Darwin Kernel Version 22.4.0: Mon Mar 6 21:00:41 PST 2023; root:xnu-8796.101.5~3/RELEASE_ARM64_T8103
machine          : arm64
processor        : arm
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 2.1.0.dev0+548.gd91b04c30e
numpy            : 1.21.6
pytz             : 2022.7.1
dateutil         : 2.8.2
setuptools       : 67.2.0
pip              : 23.0.1
Cython           : 0.29.33
pytest           : 7.2.2
hypothesis       : 6.70.0
sphinx           : 5.3.0
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 3.1.2
IPython          : 8.11.0
pandas_datareader: None
bs4              : 4.12.0
bottleneck       : None
brotli           : None
fastparquet      : None
fsspec           : 2023.3.0
gcsfs            : None
matplotlib       : 3.7.1
numba            : 0.56.4
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 12.0.0.dev418+g6432a2382e.d20230414
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : None
snappy           : None
sqlalchemy       : None
tables           : None
tabulate         : 0.9.0
xarray           : 2023.2.0
xlrd             : None
zstandard        : None
tzdata           : 2023.3
qtpy             : None

Comment From: mroeschke

array = array.cast(pyarrow.timestamp(unit=self._unit), safe=True)

From this line, looks like https://github.com/pandas-dev/pandas/pull/52201 was the change.

I guess this is tricky because __from_arrow__ can essentially ignore the safe=False passed to the to_pandas call. That said, I don't think it would be good to end up with 1754-08-30 17:47:41.128654848-04:56 even with safe=False?

Comment From: mroeschke

is None, but should be "name"

Guessing this is also due to going through __from_arrow__ now, but since __from_arrow__ does not have access to the column name, I'm guessing this is a pyarrow issue?

Comment From: AlenkaF

From this line, looks like https://github.com/pandas-dev/pandas/pull/52201 was the change.

Thank you for this info, it is correct! 👍

I guess this is tricky because the __from_arrow__ can essentially ignore the safe=False in the to_pandas call. I don't think it would be good to end up with 1754-08-30 17:47:41.128654848-04:56 even if safe=False?

After talking to Joris, we think this is actually connected to the fact that pyarrow still casts the datetime to ns if the resolutions differ (https://github.com/apache/arrow/issues/34462). As a result, pandas receives the datetime with ns resolution and performs a cast that isn't needed anymore (https://github.com/tswast/pandas/blob/8c509257eaee411fbbe9a7e696bfad3d01bb1e1f/pandas/core/dtypes/dtypes.py#L845) and that fails for out-of-bounds values.

Guessing this also is due to going through __from_arrow__ now, but since __from_arrow__ does not have access to the column name I'm guessing this is a pyarrow issue?

Correct.

I will close this issue, as the work to update the code after https://github.com/pandas-dev/pandas/pull/52201 needs to be done on the pyarrow side.