Pandas BUG: dt.timezone pydatetime parsed differently for ISO vs non-ISO dates

Pandas version checks

[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[X] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

ts = dt.datetime(2020, 1, 1, 17, tzinfo=dt.timezone(-dt.timedelta(hours=1)))
ts_str = '2020-01-01 00:00:00-01:00'
print(to_datetime([ts, ts_str], format='%Y-%m-%d %H:%M:%S%z'))
print(to_datetime([ts, ts_str], format='%Y-%d-%m %H:%M:%S%z'))

Issue Description

This gives

DatetimeIndex(['2020-01-01 17:00:00-01:00', '2020-01-01 00:00:00-01:00'], dtype='datetime64[ns, pytz.FixedOffset(-60)]', freq=None)
Index([2020-01-01 17:00:00-01:00, 2020-01-01 00:00:00-01:00], dtype='object')

Expected Behavior

Not sure they should be different. I think ideally from a user's perspective, it shouldn't matter whether a format is ISO8601 or not

Installed Versions

INSTALLED VERSIONS ------------------ commit : 3377a6d33d813e3817df0786117eaa08d5dcbf96 python : 3.8.15.final.0 python-bits : 64 OS : Linux OS-release : 5.10.102.1-microsoft-standard-WSL2 Version : #1 SMP Wed Mar 2 00:30:59 UTC 2022 machine : x86_64 processor : x86_64 byteorder : little LC_ALL : None LANG : en_GB.UTF-8 LOCALE : en_GB.UTF-8 pandas : 2.0.0.dev0+817.g3377a6d33d numpy : 1.23.5 pytz : 2022.6 dateutil : 2.8.2 setuptools : 65.5.1 pip : 22.3.1 Cython : 0.29.32 pytest : 7.2.0 hypothesis : 6.59.0 sphinx : 4.5.0 blosc : None feather : None xlsxwriter : 3.0.3 lxml.etree : 4.9.1 html5lib : 1.1 pymysql : 1.0.2 psycopg2 : 2.9.3 jinja2 : 3.0.3 IPython : 8.7.0 pandas_datareader: 0.10.0 bs4 : 4.11.1 bottleneck : 1.3.5 brotli : fastparquet : 2022.11.0 fsspec : 2021.11.0 gcsfs : 2021.11.0 matplotlib : 3.6.2 numba : 0.56.4 numexpr : 2.8.3 odfpy : None openpyxl : 3.0.10 pandas_gbq : 0.17.9 pyarrow : 9.0.0 pyreadstat : 1.2.0 pyxlsb : 1.0.10 s3fs : 2021.11.0 scipy : 1.9.3 snappy : sqlalchemy : 1.4.44 tables : 3.7.0 tabulate : 0.9.0 xarray : 2022.11.0 xlrd : 2.0.1 zstandard : 0.19.0 tzdata : None qtpy : None pyqt5 : None None

Comment From: mroeschke

FWIW I think we want any fixed offset timezone to always parse to datetime.timezone objects instead of pytz.FixedOffset eventually https://github.com/pandas-dev/pandas/pull/49677

Comment From: MarcoGorelli

FWIW I think we want any fixed offset timezone to always parse to datetime.timezone objects instead of pytz.FixedOffset eventually #49677

thanks! looks like that'll be for pandas3.0 at least

for the next one though, do you agree that Index([2020-01-01 17:00:00-01:00, 2020-01-01 00:00:00-01:00], dtype='object') isn't right?

Comment From: mroeschke

do you agree that Index([2020-01-01 17:00:00-01:00, 2020-01-01 00:00:00-01:00], dtype='object') isn't right?

I'd actually say it's technically correct. The ts timezone is a dt.timezone object while ts_str timezone will be converted to a pytz.FixedOffset object even though they represent the same offset. I remember fixing some related bugs in the past with the philosophy of "don't coerce different timezone library objects into the same object unless an explicit operation like tz_convert("UTC") is called"

Similar reason why I think this is the correct behavior too

In [15]: pd.Index([pd.Timestamp("2022", tz="dateutil/US/Pacific"), pd.Timestamp("2022", tz="US/Pacific")])
Out[15]: Index([2022-01-01 00:00:00-08:00, 2022-01-01 00:00:00-08:00], dtype='object')

In [16]: pd.Index([pd.Timestamp("2022", tz="US/Pacific"), pd.Timestamp("2022", tz="US/Pacific")])
Out[16]: DatetimeIndex(['2022-01-01 00:00:00-08:00', '2022-01-01 00:00:00-08:00'], dtype='datetime64[ns, US/Pacific]', freq=None)

Comment From: MarcoGorelli

Thanks - yeah I can see that argument

OK so it's the ISO path that would need fixing

Comment From: MarcoGorelli

@jbrockmendel do you have thoughts on this?

I think the solution will need to be to combine the ISO and non-ISO paths (the only difference being how they handle string inputs), I'm just trying to figure out which parts to keep from which

Between this and https://github.com/pandas-dev/pandas/issues/50071, it looks like we need to aim for a mixture of them

Comment From: jbrockmendel

BTW @MarcoGorelli im getting a lot of pings and trying to triage. pls let me know if any are time-sensitive or blockers etc.

I think the solution will need to be to combine the ISO and non-ISO paths

Do you mean moving the array_strptime logic into array_to_datetime? if so, yikes.

looks like that'll be for pandas3.0 at least

id be very happy to revive that. #49677 only changed fixed-offsets, so didnt require zoneinfo. i.e. doesnt need to wait until we drop py38

I lean towards agreeing with Marco that the behavior should not hinge on what implementations of FixedOffset the user or our defaults use.

Comment From: mroeschke

id be very happy to revive that. https://github.com/pandas-dev/pandas/pull/49677 only changed fixed-offsets, so didnt require zoneinfo. i.e. doesnt need to wait until we drop py38

True, I think this would be an okay API change and signal that pandas is moving toward more stdlib tz implementations anyways

Comment From: MarcoGorelli

Here's another inconsistency: when there's a timezone-aware string and a timezone-naive datetime.

In the ISO case, the naive one is converted, whilst in the non-ISO case, we get an object Index.

In [2]: pd.to_datetime(["2020-01-01 01:00:00-01:00", datetime(2020, 1, 1, 3, 0)], format='%Y-%d-%m %H:%M:%S%z')
Out[2]: Index([2020-01-01 01:00:00-01:00, 2020-01-01 03:00:00], dtype='object')

In [3]: pd.to_datetime(["2020-01-01 01:00:00-01:00", datetime(2020, 1, 1, 3, 0)], format='%Y-%m-%d %H:%M:%S%z')
Out[3]: DatetimeIndex(['2020-01-01 01:00:00-01:00', '2020-01-01 02:00:00-01:00'], dtype='datetime64[ns, pytz.FixedOffset(-60)]', freq=None)

I think I prefer the first one

BTW @MarcoGorelli im getting a lot of pings and trying to triage. pls let me know if any are time-sensitive or blockers etc.

Thanks @jbrockmendel , and sorry for the all the pings. I'd really like to get PDEP4 in for pandas2.0, so if we could agree here on which path (ISO or non-ISO) is correct on what, then that'd help unblock it

Do you mean moving the array_strptime logic into array_to_datetime? if so, yikes.

I was thinking the opposite (calling string_to_dts from within array_strptime)

Comment From: jbrockmendel

Unless the user passed utc=True, then seeing a tznaive pydatetime and a tzaware-when-parsed string should raise in array_to_datetime, just like it would if the tzaware object were a pydatetime. So this should go down the relevant fallback path. i.e. i think the behavior in the first case looks more-correct.

Comment From: MarcoGorelli

closed in https://github.com/pandas-dev/pandas/pull/50242