Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import datetime as dt
from pandas import to_datetime

ts = dt.datetime(2020, 1, 1, 17, tzinfo=dt.timezone(-dt.timedelta(hours=1)))
ts_str = '2020-01-01 00:00:00-01:00'
print(to_datetime([ts, ts_str], format='%Y-%m-%d %H:%M:%S%z'))  # ISO8601 format
print(to_datetime([ts, ts_str], format='%Y-%d-%m %H:%M:%S%z'))  # non-ISO8601 format (day and month swapped)
Issue Description
This gives
DatetimeIndex(['2020-01-01 17:00:00-01:00', '2020-01-01 00:00:00-01:00'], dtype='datetime64[ns, pytz.FixedOffset(-60)]', freq=None)
Index([2020-01-01 17:00:00-01:00, 2020-01-01 00:00:00-01:00], dtype='object')
Expected Behavior
Not sure they should be different. I think ideally, from a user's perspective, it shouldn't matter whether a format is ISO8601 or not.
Installed Versions
Comment From: mroeschke
FWIW I think we want any fixed offset timezone to always parse to datetime.timezone objects instead of pytz.FixedOffset eventually: https://github.com/pandas-dev/pandas/pull/49677
Comment From: MarcoGorelli
FWIW I think we want any fixed offset timezone to always parse to datetime.timezone objects instead of pytz.FixedOffset eventually #49677
Thanks! Looks like that'll be for pandas 3.0 at least.
For the next one though, do you agree that Index([2020-01-01 17:00:00-01:00, 2020-01-01 00:00:00-01:00], dtype='object') isn't right?
Comment From: mroeschke
do you agree that Index([2020-01-01 17:00:00-01:00, 2020-01-01 00:00:00-01:00], dtype='object') isn't right?
I'd actually say it's technically correct. The ts timezone is a dt.timezone object while the ts_str timezone will be converted to a pytz.FixedOffset object, even though they represent the same offset. I remember fixing some related bugs in the past with the philosophy of "don't coerce different timezone library objects into the same object unless an explicit operation like tz_convert("UTC") is called".
It's a similar reason why I think this is the correct behavior too:
In [15]: pd.Index([pd.Timestamp("2022", tz="dateutil/US/Pacific"), pd.Timestamp("2022", tz="US/Pacific")])
Out[15]: Index([2022-01-01 00:00:00-08:00, 2022-01-01 00:00:00-08:00], dtype='object')
In [16]: pd.Index([pd.Timestamp("2022", tz="US/Pacific"), pd.Timestamp("2022", tz="US/Pacific")])
Out[16]: DatetimeIndex(['2022-01-01 00:00:00-08:00', '2022-01-01 00:00:00-08:00'], dtype='datetime64[ns, US/Pacific]', freq=None)
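A minimal stdlib-only sketch of the distinction described in this comment: two tzinfo objects can represent the same offset, and the datetimes carrying them can be the same instant, while the tzinfo objects themselves still compare unequal. `StandInFixedOffset` is a hypothetical class invented for illustration. It stands in for a third-party fixed-offset implementation such as pytz.FixedOffset and is not that class.

```python
import datetime as dt


class StandInFixedOffset(dt.tzinfo):
    """Hypothetical stand-in for a third-party fixed-offset tzinfo class."""

    def __init__(self, minutes):
        self._offset = dt.timedelta(minutes=minutes)

    def utcoffset(self, when):
        return self._offset

    def dst(self, when):
        return dt.timedelta(0)

    def tzname(self, when):
        return None


stdlib_tz = dt.timezone(dt.timedelta(hours=-1))
other_tz = StandInFixedOffset(-60)

a = dt.datetime(2020, 1, 1, 17, tzinfo=stdlib_tz)
b = dt.datetime(2020, 1, 1, 17, tzinfo=other_tz)

same_offset = a.utcoffset() == b.utcoffset()  # both are UTC-01:00
same_instant = a == b                         # aware datetimes compare by instant
same_tz_object = stdlib_tz == other_tz        # different classes, so not equal

print(same_offset, same_instant, same_tz_object)
```

Under the "don't coerce" philosophy, it is the last comparison that decides whether the values can share one timezone-aware dtype.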
Comment From: MarcoGorelli
Thanks - yeah I can see that argument
OK so it's the ISO path that would need fixing
Comment From: MarcoGorelli
@jbrockmendel do you have thoughts on this?
I think the solution will need to be to combine the ISO and non-ISO paths (the only difference being how they handle string inputs), I'm just trying to figure out which parts to keep from which
Between this and https://github.com/pandas-dev/pandas/issues/50071, it looks like we need to aim for a mixture of them
Comment From: jbrockmendel
BTW @MarcoGorelli I'm getting a lot of pings and trying to triage. Please let me know if any are time-sensitive or blockers etc.
I think the solution will need to be to combine the ISO and non-ISO paths
Do you mean moving the array_strptime logic into array_to_datetime? if so, yikes.
looks like that'll be for pandas3.0 at least
I'd be very happy to revive that. #49677 only changed fixed-offsets, so it didn't require zoneinfo, i.e. it doesn't need to wait until we drop py38.
I lean towards agreeing with Marco that the behavior should not hinge on which implementation of FixedOffset the user or our defaults use.
Comment From: mroeschke
I'd be very happy to revive that. https://github.com/pandas-dev/pandas/pull/49677 only changed fixed-offsets, so it didn't require zoneinfo, i.e. it doesn't need to wait until we drop py38.
True, I think this would be an okay API change and signal that pandas is moving toward more stdlib tz implementations anyways.
Comment From: MarcoGorelli
Here's another inconsistency: when there's a timezone-aware string and a timezone-naive datetime.
In the ISO case, the naive one is converted, whilst in the non-ISO case, we get an object Index.
In [2]: pd.to_datetime(["2020-01-01 01:00:00-01:00", datetime(2020, 1, 1, 3, 0)], format='%Y-%d-%m %H:%M:%S%z')
Out[2]: Index([2020-01-01 01:00:00-01:00, 2020-01-01 03:00:00], dtype='object')
In [3]: pd.to_datetime(["2020-01-01 01:00:00-01:00", datetime(2020, 1, 1, 3, 0)], format='%Y-%m-%d %H:%M:%S%z')
Out[3]: DatetimeIndex(['2020-01-01 01:00:00-01:00', '2020-01-01 02:00:00-01:00'], dtype='datetime64[ns, pytz.FixedOffset(-60)]', freq=None)
I think I prefer the first one
BTW @MarcoGorelli I'm getting a lot of pings and trying to triage. Please let me know if any are time-sensitive or blockers etc.
Thanks @jbrockmendel, and sorry for all the pings. I'd really like to get PDEP4 in for pandas 2.0, so if we could agree here on which path (ISO or non-ISO) is correct for which case, then that'd help unblock it.
Do you mean moving the array_strptime logic into array_to_datetime? if so, yikes.
I was thinking the opposite (calling string_to_dts from within array_strptime)
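A rough pure-Python sketch of that idea, assuming nothing about the real Cython internals: one loop handles every input, and only the string branch chooses between an ISO fast path and strptime. `parse_values` and `ISO_LIKE_FORMATS` are hypothetical names for illustration. `datetime.fromisoformat` stands in for string_to_dts and `datetime.strptime` for the array_strptime string handling.

```python
import datetime as dt

# Assumption: a tiny allow-list standing in for a real "is this format ISO8601?" check.
ISO_LIKE_FORMATS = {"%Y-%m-%d %H:%M:%S%z", "%Y-%m-%dT%H:%M:%S%z"}


def parse_values(values, fmt):
    """Parse values in one loop; only string inputs depend on the format."""
    use_iso_fast_path = fmt in ISO_LIKE_FORMATS
    out = []
    for value in values:
        if isinstance(value, dt.datetime):
            # Non-string inputs are treated identically on both paths.
            out.append(value)
        elif use_iso_fast_path:
            # Stand-in for the ISO parser (looser than a strict format match).
            out.append(dt.datetime.fromisoformat(value))
        else:
            out.append(dt.datetime.strptime(value, fmt))
    return out


ts = dt.datetime(2020, 1, 1, 17, tzinfo=dt.timezone(-dt.timedelta(hours=1)))
ts_str = "2020-01-01 00:00:00-01:00"

iso_result = parse_values([ts, ts_str], "%Y-%m-%d %H:%M:%S%z")
non_iso_result = parse_values([ts, ts_str], "%Y-%d-%m %H:%M:%S%z")
print(iso_result == non_iso_result)
```

With a single loop, everything other than string parsing (datetime objects, timezone handling, fallbacks) is decided in one place, so the result cannot differ by format.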
Comment From: jbrockmendel
Unless the user passed utc=True, seeing a tz-naive pydatetime and a tzaware-when-parsed string should raise in array_to_datetime, just like it would if the tzaware object were a pydatetime. So this should go down the relevant fallback path, i.e. I think the behavior in the first case looks more correct.
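The rule described here can be sketched in plain Python. `has_mixed_awareness` and `combine` are hypothetical helpers, not pandas functions. The sketch also assumes that, when utc=True, tz-naive values are treated as already being in UTC.

```python
import datetime as dt


def has_mixed_awareness(values):
    """Return True if `values` mixes tz-naive and tz-aware datetimes."""
    awareness = {v.utcoffset() is not None for v in values}
    return len(awareness) == 2


def combine(values, utc=False):
    """Hypothetical helper: normalize to UTC on request, otherwise refuse to guess."""
    if utc:
        # Assumption for this sketch: naive values are interpreted as UTC.
        return [
            v.replace(tzinfo=dt.timezone.utc)
            if v.utcoffset() is None
            else v.astimezone(dt.timezone.utc)
            for v in values
        ]
    if has_mixed_awareness(values):
        raise ValueError("cannot mix tz-naive and tz-aware datetimes")
    return list(values)


aware = dt.datetime(2020, 1, 1, 1, tzinfo=dt.timezone(-dt.timedelta(hours=1)))
naive = dt.datetime(2020, 1, 1, 3)

try:
    combine([aware, naive])
    raised = False
except ValueError:
    raised = True  # a caller could fall back to keeping the raw objects here

as_utc = combine([aware, naive], utc=True)
print(raised, as_utc)
```

In this sketch the caller decides what to do after the error, which mirrors the fallback path mentioned above: keep the original objects rather than silently localizing the naive value.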
Comment From: MarcoGorelli
closed in https://github.com/pandas-dev/pandas/pull/50242