Moving this from https://github.com/pandas-dev/pandas/issues/50411#issuecomment-1368935006 to its own issue.

PDEP-4 (http://pandas.pydata.org/pdeps/0004-consistent-to-datetime-parsing.html) has made the string->datetime parsing stricter. For example, using the example from the PDEP on the latest main branch:

In [1]: pd.to_datetime(['12-01-2000 00:00:00', '13-01-2000 00:00:00'])
...
ValueError: time data "13-01-2000 00:00:00" doesn't match format "%m-%d-%Y %H:%M:%S", at position 1

The above parses fine on pandas < 2, but parses both strings differently (month first vs day first), and so with latest pandas the above now raises an error.

But if I pass those strings to the DatetimeIndex(..) constructor, we still happily parse this inconsistently:

In [2]: pd.DatetimeIndex(['12-01-2000 00:00:00', '13-01-2000 00:00:00'])
Out[2]: DatetimeIndex(['2000-12-01', '2000-01-13'], dtype='datetime64[ns]', freq=None)

I don't think this was discussed in the PDEP discussion, but if we are changing this, shouldn't we make the parsing in DatetimeIndex consistent as well?

(sidenote: long term might want to make parsing in DatetimeIndex even stricter, as brought up in https://github.com/pandas-dev/pandas/issues/50411#issuecomment-1387389894 and comments below, eg only accepting ISO strings, or not even accepting strings at all, since you can use to_datetime instead. But let's leave that discussion for a separate thread, since that's something that can be deprecated, and not necessary for 2.0)

cc @MarcoGorelli

Comment From: jorisvandenbossche

Another note: I am writing this issue naively from observing the user behaviour. I am not very up to date on how DatetimeIndex(..) string parsing is implemented, and to what extent it has separate code paths for the actual parsing compared to to_datetime. So I don't know how straightforward it would be to make both parse strings similarly (for example, are there other aspects where DatetimeIndex(..) might behave differently when passed strings?)

Comment From: jorisvandenbossche

@jbrockmendel brought up the good point that there are some inconsistencies in keywords / capabilities between DatetimeIndex() and to_datetime() that complicates this.

DatetimeIndex generally doesn't support all keywords of to_datetime (which is good, I think, as people should use to_datetime for advanced string parsing. Specifically, it doesn't have the formatkeyword), but it also does have some keywords that to_datetime does not have:

  • The tz keyword (to_datetime only has utc=True/False). For example, this allows pd.DatetimeIndex(["2022-01-01 09:00"], tz="Europe/Brussels") while with to_datetime you need to do pd.to_datetime(["2022-01-01 09:00:00"]).tz_localize("Europe/Brussels") (I seem to remember that in the past we have said that the latter is better for to_datetime because it is more explicit)
  • Related to tz, the ambiguous keyword (I assume this works similarly as the keyword in tz_localize)
  • freq keyword, so you can infer or set the frequency. With to_datetime you need to do that as a subsequent step.
  • dtype keyword, which allows to specify the resolution (this is new, so I wouldn't worry about this, and we need to add some way to to_datetime to achieve this as well)

This might make it harder to change DatetimeIndex (i.e. also make its string parsing strict), as it doesn't have the format keyword for the cases when the format can't be inferred, and you still want to make use of one of those keywords of DatetimeIndex.
Although I would argue that each of those keywords has a decent alternative with to_datetime. If you have a case where we start raising an error in 2.0, you will need to update your code anyway to accommodate this, so then you can also change it to use to_datetime if the fix you need is to specify the format.

Comment From: MarcoGorelli

removing the 2.0 milestone, but there's plenty of scope to take pdep4 further for 3.0 and beyond