Pandas REGR: casting datetime strings with offset to tz-naive datetime64 fails

On pandas 1.5 (so no deprecation warning):

>>> pd.Index(['2021-01-01 00:00:00+01:00', '2021-01-02 00:00:00+01:00']).astype("datetime64[ns]")
DatetimeIndex(['2020-12-31 23:00:00', '2021-01-01 23:00:00'], dtype='datetime64[ns]', freq=None)

On the main branch:

>>> pd.Index(['2021-01-01 00:00:00+01:00', '2021-01-02 00:00:00+01:00']).astype("datetime64[ns]")
...
TypeError: Cannot use .astype to convert from timezone-aware dtype to timezone-naive dtype. 
Use obj.tz_localize(None) or obj.tz_convert('UTC').tz_localize(None) instead.

This started failing a few days ago on pyarrow's CI. It comes up when you roundtrip a pandas DataFrame whose columns are a tz-aware DatetimeIndex: in Arrow the column names become strings, and in the arrow->pandas conversion we try to cast those strings to datetime64 and then localize. (We should probably cast directly to the tz-aware dtype instead.)
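
For reference, a minimal sketch of a conversion that avoids the failing `.astype` call: parse the strings into a tz-aware index first, then drop the timezone explicitly, as the error message suggests. It assumes that normalizing every offset to UTC via `pd.to_datetime(..., utc=True)` is acceptable for the input. The variable names are illustrative, and this is not necessarily the fix pyarrow will adopt.

```python
import pandas as pd

# Strings carrying a +01:00 offset, as in the report above.
strings = pd.Index(["2021-01-01 00:00:00+01:00", "2021-01-02 00:00:00+01:00"])

# Parse to a tz-aware DatetimeIndex; utc=True converts every offset to UTC.
aware = pd.to_datetime(strings, utc=True)

# Drop the timezone explicitly instead of relying on .astype("datetime64[ns]").
# Because the index is already in UTC, this keeps the UTC wall times,
# i.e. the same values pandas 1.5 returned for the .astype call.
naive_utc = aware.tz_localize(None)

print(naive_utc)
```

If a specific timezone is needed rather than UTC, `aware.tz_convert(...)` can be applied before or instead of dropping the timezone.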

From a quick look at the recent commits and the code path this takes, might be caused by https://github.com/pandas-dev/pandas/pull/50015 (cc @jbrockmendel)

Comment From: jorisvandenbossche

And to be clear, this is certainly a dubious case where it is not clear whether we actually want this behaviour. I think it might make sense to raise for this in the future. But if we want to do that, it should still be a conscious decision (not an accidental side effect of an unrelated PR), and it should probably be deprecated first.

Comment From: MarcoGorelli

> From a quick look at the recent commits and the code path this takes, might be caused by https://github.com/pandas-dev/pandas/pull/50015 (cc @jbrockmendel)

git bisect confirms: https://www.kaggle.com/code/marcogorelli/pandas-regression-example?scriptVersionId=113357910

Comment From: jbrockmendel

Yah that particular .astype is definitely wonky. We should add a whatsnew note announcing the new behavior as an API change in 2.0.

Comment From: jorisvandenbossche

It's something we can easily deprecate first, I think?

Comment From: jorisvandenbossche

@jbrockmendel do you have time to take a look at this?

Comment From: jbrockmendel

will take a look

Comment From: simonjayhawkins

> It's something we can easily deprecate first, I think?

https://github.com/pandas-dev/pandas/pull/51514 closes this issue by documenting this behavior as expected.

Comment From: jorisvandenbossche

To be explicit about the implications of doing this as a breaking change: this breaks pandas<->pyarrow roundtrips, and so e.g. also pandas<->parquet roundtripping (admittedly, in the somewhat corner case of using a DatetimeIndex for the column labels).

So users doing pd.read_parquet(..) with a file they wrote earlier will now get an error (with released pyarrow) that they can't easily solve themselves.