Moving this from https://github.com/pandas-dev/pandas/issues/50411#issuecomment-1368935006 to its own issue.
PDEP-4 (http://pandas.pydata.org/pdeps/0004-consistent-to-datetime-parsing.html) has made the string->datetime parsing stricter. For example, using the example from the PDEP on the latest main branch:
In [1]: pd.to_datetime(['12-01-2000 00:00:00', '13-01-2000 00:00:00'])
...
ValueError: time data "13-01-2000 00:00:00" doesn't match format "%m-%d-%Y %H:%M:%S", at position 1
The above parses fine on pandas < 2, but parses both strings differently (month first vs day first), and so with latest pandas the above now raises an error.
But if I pass those strings to the DatetimeIndex(..) constructor, we still happily parse this inconsistently:
In [2]: pd.DatetimeIndex(['12-01-2000 00:00:00', '13-01-2000 00:00:00'])
Out[2]: DatetimeIndex(['2000-12-01', '2000-01-13'], dtype='datetime64[ns]', freq=None)
I don't think this was discussed in the PDEP discussion, but if we are changing this, shouldn't we make the parsing in DatetimeIndex consistent as well?
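As a minimal sketch of a workaround that does not depend on either behaviour, the same strings can be parsed with an explicit format (or with dayfirst=True) and the result handed to the constructor. The day-first reading of the strings is an assumption here; the variable names are only illustrative.

```python
import pandas as pd

strings = ["12-01-2000 00:00:00", "13-01-2000 00:00:00"]

# Explicit format: both strings are read day-first, nothing is guessed.
parsed = pd.to_datetime(strings, format="%d-%m-%Y %H:%M:%S")

# dayfirst=True is a lighter-weight alternative for this particular input.
parsed_dayfirst = pd.to_datetime(strings, dayfirst=True)

# Already-parsed datetimes can be passed to the constructor as-is,
# so no string parsing happens inside DatetimeIndex.
index = pd.DatetimeIndex(parsed)
```

With either call both elements end up in January 2000, instead of one in December and one in January.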
(sidenote: long term we might want to make parsing in DatetimeIndex even stricter, as brought up in https://github.com/pandas-dev/pandas/issues/50411#issuecomment-1387389894 and the comments below it, e.g. only accepting ISO strings, or not even accepting strings at all, since you can use to_datetime instead. But let's leave that discussion for a separate thread, since that's something that can be deprecated, and is not necessary for 2.0)
cc @MarcoGorelli
Comment From: jorisvandenbossche
Another note: I am writing this issue naively from observing the user behaviour. I am not very up to date on how DatetimeIndex(..) string parsing is implemented, and to what extent it has separate code paths for the actual parsing compared to to_datetime. So I don't know how straightforward it would be to make both parse strings similarly (for example, are there other aspects where DatetimeIndex(..) might behave differently when passed strings?)
Comment From: jorisvandenbossche
@jbrockmendel brought up the good point that there are some inconsistencies in keywords / capabilities between DatetimeIndex() and to_datetime() that complicate this.
DatetimeIndex generally doesn't support all keywords of to_datetime (which is good, I think, as people should use to_datetime for advanced string parsing. Specifically, it doesn't have the format keyword), but it also does have some keywords that to_datetime does not have:
- The tz keyword (to_datetime only has utc=True/False). For example, this allows pd.DatetimeIndex(["2022-01-01 09:00"], tz="Europe/Brussels") while with to_datetime you need to do pd.to_datetime(["2022-01-01 09:00:00"]).tz_localize("Europe/Brussels") (I seem to remember that in the past we have said that the latter is better for to_datetime because it is more explicit)
- Related to tz, the ambiguous keyword (I assume this works similarly to the keyword in tz_localize)
- The freq keyword, so you can infer or set the frequency. With to_datetime you need to do that as a subsequent step.
- The dtype keyword, which allows specifying the resolution (this is new, so I wouldn't worry about this, and we need to add some way to to_datetime to achieve this as well)
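To make the comparison concrete, here is a rough sketch of the to_datetime-based counterparts for the tz, ambiguous and freq keywords. The sample strings are made up for illustration, and re-wrapping in DatetimeIndex(..., freq="infer") is just one possible way to do the "subsequent step" for freq, not necessarily the recommended one.

```python
import pandas as pd

strings = ["2022-01-01 09:00", "2022-01-02 09:00", "2022-01-03 09:00"]

# tz keyword on the constructor vs. an explicit tz_localize step
via_constructor = pd.DatetimeIndex(strings, tz="Europe/Brussels")
via_to_datetime = pd.to_datetime(strings).tz_localize("Europe/Brussels")

# ambiguous: 02:30 occurs twice on 2022-10-30 in Brussels (DST ends),
# and tz_localize exposes the same kind of keyword to resolve it.
ambiguous_as_nat = pd.to_datetime(["2022-10-30 02:30"]).tz_localize(
    "Europe/Brussels", ambiguous="NaT"
)

# freq: parse first, then infer the frequency in a second step
with_freq = pd.DatetimeIndex(pd.to_datetime(strings), freq="infer")
```

The two tz variants hold the same timestamps; the difference is only whether localization happens inside the constructor or as a visible second step.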
This might make it harder to change DatetimeIndex (i.e. also make its string parsing strict), as it doesn't have the format keyword for the cases when the format can't be inferred, and you still want to make use of one of those keywords of DatetimeIndex.
Although I would argue that each of those keywords has a decent alternative with to_datetime. If you have a case where we start raising an error in 2.0, you will need to update your code anyway to accommodate this, so then you can also change it to use to_datetime if the fix you need is to specify the format.
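A sketch of what such an update could look like in user code, assuming strings in a format that needs to be spelled out. The helper parse_to_index is hypothetical (it is not a pandas function), and the "%d/%m/%Y %Hh%M" format is just an example.

```python
import pandas as pd


def parse_to_index(strings, fmt, tz=None, freq=None):
    """Hypothetical helper: parse with an explicit format, then apply
    the tz / freq that DatetimeIndex would otherwise take as keywords."""
    idx = pd.to_datetime(strings, format=fmt)
    if tz is not None:
        # Equivalent of DatetimeIndex(..., tz=tz) for naive input
        idx = idx.tz_localize(tz)
    if freq is not None:
        # Re-wrap to set (and validate) the frequency
        idx = pd.DatetimeIndex(idx, freq=freq)
    return idx


idx = parse_to_index(
    ["13/01/2000 09h00", "14/01/2000 09h00", "15/01/2000 09h00"],
    fmt="%d/%m/%Y %Hh%M",
    tz="Europe/Brussels",
    freq="D",
)
```

The parsing rule is stated once, explicitly, and the tz and freq handling stays available without relying on the constructor to guess how the strings should be read.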
Comment From: MarcoGorelli
removing the 2.0 milestone, but there's plenty of scope to take pdep4 further for 3.0 and beyond