On pandas 1.5 (so no deprecation warning):
>>> pd.Index(['2021-01-01 00:00:00+01:00', '2021-01-02 00:00:00+01:00']).astype("datetime64[ns]")
DatetimeIndex(['2020-12-31 23:00:00', '2021-01-01 23:00:00'], dtype='datetime64[ns]', freq=None)
On the main branch:
>>> pd.Index(['2021-01-01 00:00:00+01:00', '2021-01-02 00:00:00+01:00']).astype("datetime64[ns]")
...
TypeError: Cannot use .astype to convert from timezone-aware dtype to timezone-naive dtype.
Use obj.tz_localize(None) or obj.tz_convert('UTC').tz_localize(None) instead.
This started to fail a few days ago on pyarrow's CI. It comes up when you roundtrip a pandas DataFrame whose columns are a tz-aware DatetimeIndex: in Arrow the column names will be strings, and in the arrow->pandas conversion we try to cast those strings to datetime64 and then localize. We should probably cast directly to the tz-aware dtype.
From a quick look at the recent commits and the code path this takes, might be caused by https://github.com/pandas-dev/pandas/pull/50015 (cc @jbrockmendel)
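For reference, a minimal sketch of what parsing such string labels directly to a tz-aware index could look like, without going through `.astype("datetime64[ns]")`. This is only an illustration under assumptions: the variable names and the fixed +01:00 offset are made up for the example, and it is not necessarily what the arrow->pandas conversion does or should do.

```python
from datetime import timedelta, timezone

import pandas as pd

# Column labels as plain strings carrying a UTC offset (same values as above).
labels = pd.Index(["2021-01-01 00:00:00+01:00", "2021-01-02 00:00:00+01:00"])

# Parse straight to a tz-aware DatetimeIndex (normalized to UTC).
aware_utc = pd.to_datetime(labels, utc=True)

# Optionally convert to the original offset; +01:00 is assumed for illustration.
aware_local = aware_utc.tz_convert(timezone(timedelta(hours=1)))

# Naive UTC wall times, following the hint in the error message
# (tz_convert('UTC').tz_localize(None)); this matches the pandas 1.5 output.
naive_utc = aware_utc.tz_localize(None)
```

The last step reproduces the values that the old `.astype` call returned, but makes the drop of the timezone explicit.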
Comment From: jorisvandenbossche
And to be clear, this is certainly a dubious case where it is not clear if we actually want this behaviour. I think it might make sense to raise for this in the future. But if we want to do that, it should still be a conscious decision (not an accidental side effect of an unrelated PR), and should probably first be deprecated.
Comment From: MarcoGorelli
> From a quick look at the recent commits and the code path this takes, might be caused by https://github.com/pandas-dev/pandas/pull/50015 (cc @jbrockmendel)
git bisect confirms: https://www.kaggle.com/code/marcogorelli/pandas-regression-example?scriptVersionId=113357910
Comment From: jbrockmendel
Yah that particular .astype is definitely wonky. We should add a whatsnew note announcing the new behavior as an API change in 2.0.
Comment From: jorisvandenbossche
It's something we can easily deprecate first, I think?
Comment From: jorisvandenbossche
@jbrockmendel do you have time to take a look at this?
Comment From: jbrockmendel
will take a look
Comment From: simonjayhawkins
> It's something we can easily deprecate first, I think?
https://github.com/pandas-dev/pandas/pull/51514 closes this issue by documenting the new behavior as expected.
Comment From: jorisvandenbossche
To be explicit about the implications of doing this as a breaking change: this breaks pandas<->pyarrow roundtrips, and so e.g. also pandas<->parquet roundtripping (admittedly, in the somewhat corner case of using a DatetimeIndex for the column labels).
So users doing pd.read_parquet(..) with a file they wrote earlier will now get an error (with released pyarrow) that they can't easily solve themselves.