In pandas < 2, parsing datetime strings that are all strictly ISO formatted (so there is no ambiguity) but formatted slightly differently because of a different resolution works fine:
>>> pd.to_datetime(["2022-01-01T09:00:00", "2022-01-02T09:00:00.123"])
DatetimeIndex(['2022-01-01 09:00:00', '2022-01-02 09:00:00.123000'], dtype='datetime64[ns]', freq=None)
With the changes related to datetime parsing (PDEP-4), this now infers the format from the first element, and parsing the second one fails:
>>> pd.to_datetime(["2022-01-01T09:00:00", "2022-01-02T09:00:00.123"])
...
ValueError: time data "2022-01-02T09:00:00.123" at position 1 doesn't match format "%Y-%m-%dT%H:%M:%S"
This is of course expected and can be explained by the changes in behaviour. But I do wonder if we want to allow some way to still parse such data quickly.
For this specific case, you can't specify a format since the formatting is inconsistent. And before, this was actually fast because it took the fast ISO parsing path. With the current dev version of pandas, I don't see any way to parse this in a performant way?
(this also came up in pyarrow's CI, it was an easy fix on our side to update the strings in the test, but still wanted to raise this as a specific case to consider, since with real world data you can't necessarily easily update your data)
cc @MarcoGorelli
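For reference, the element-wise fallback mentioned later in the thread does still parse such data, just without the vectorized ISO fast path; a minimal sketch (the Series here is illustrative, not from the original report):

```python
import pandas as pd

mixed = pd.Series(["2022-01-01T09:00:00", "2022-01-02T09:00:00.123"])

# Parsing each element individually avoids format inference across the array,
# but is much slower than the vectorized ISO parsing path.
parsed = mixed.apply(pd.to_datetime)
print(parsed)
```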
Comment From: jorisvandenbossche
This is also a bit inconsistent with the case where we cannot infer the format. In such a case, we actually still parse each string individually (and just raise a warning about doing so). For example, from the original PR:
>>> pd.to_datetime(["20010101 123456", "201201"])
UserWarning: Could not infer format, so each element will be parsed individually by `dateutil`.
To ensure parsing is consistent and as-expected, please specify a format.
DatetimeIndex(['2001-01-01 12:34:56', '2001-12-20 00:00:00'], dtype='datetime64[ns]', freq=None)
Comment From: MarcoGorelli
Would `format='%Y-%m-%dT%H:%M:%S', exact=False` work for you?
Comment From: MarcoGorelli
that works for the example you posted, and is performant
In [8]: pd.to_datetime(["2022-01-01T09:00:00", "2022-01-02T09:00:00.123"], exact=False)
Out[8]: DatetimeIndex(['2022-01-01 09:00:00', '2022-01-02 09:00:00.123000'], dtype='datetime64[ns]', freq=None)
Comment From: jorisvandenbossche
That only worked because of the exact ordering that I used. Switching the values:
In [15]: pd.to_datetime(["2022-01-01T09:00:00.123", "2022-01-02T09:00:00"], exact=False)
...
ValueError: time data "2022-01-02T09:00:00" at position 1 doesn't match format "%Y-%m-%dT%H:%M:%S.%f"
In that case, you could still provide the shorter format manually:
In [16]: pd.to_datetime(["2022-01-01T09:00:00.123", "2022-01-02T09:00:00"], format="%Y-%m-%dT%H:%M:%S", exact=False)
Out[16]: DatetimeIndex(['2022-01-01 09:00:00.123000', '2022-01-02 09:00:00'], dtype='datetime64[ns]', freq=None)
But, I am wondering: isn't that another bug? I would have expected it to ignore the sub-second values (and so wanted to say that this is not a full replacement ..). But apparently we still do parse the fractional seconds, while with `exact=False` I expect (based on the docstring explanation) that it finds the format in the full string, ignoring anything else.

But maybe I am just misunderstanding the purpose of `exact=False` (never used it myself in practice before).
Comment From: MarcoGorelli
I don't think it's a bug: neither stdlib nor pyarrow would allow that:
In [9]: pa.compute.strptime('2022-01-01T09:00:00.123', '%Y-%m-%dT%H:%M:%S', unit='ns')
...
ArrowInvalid: Failed to parse string: '2022-01-01T09:00:00.123' as a scalar of type timestamp[ns]
In [10]: dt.datetime.strptime('2022-01-01T09:00:00.123', '%Y-%m-%dT%H:%M:%S')
...
ValueError: unconverted data remains: .123
> In that case, you could still provide the shorter format manually:

yup, sounds fine to me
Comment From: jorisvandenbossche
I opened https://github.com/pandas-dev/pandas/issues/50412 for the `exact=False` inconsistencies.

> I don't think it's a bug: neither stdlib nor pyarrow would allow that:

To be clear, I am not saying that those two examples should parse that. But pandas does if you pass `exact=False`. But let's continue that part of the discussion in https://github.com/pandas-dev/pandas/issues/50412

> > In that case, you could still provide the shorter format manually:
>
> yup, sounds fine to me

Depending on the outcome of https://github.com/pandas-dev/pandas/issues/50412, that's not actually a solution.
Comment From: MarcoGorelli
Ah I'd misunderstood which part you were considering a bug - https://github.com/pandas-dev/pandas/issues/50412 makes it clear, thanks!
OK I'll get back to you on this, will need to think about it carefully
Comment From: MarcoGorelli
OK, it should work to do `DatetimeIndex(dates)`. I don't think we need to add extra parameters then.
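A minimal sketch of that suggestion, using the strings from the original example (this assumes the constructor's ISO fast path, as described here):

```python
import pandas as pd

# mixed-resolution but all-ISO8601 strings from the original report
dates = ["2022-01-01T09:00:00", "2022-01-02T09:00:00.123"]

# The DatetimeIndex constructor parses these without inferring a single format
idx = pd.DatetimeIndex(dates)
print(idx)
```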
This seems like a rare use case anyway - ~~as a comparison, polars would error too~~ however, polars does allow it (EDIT: my bad, I'd used `pl.Date` instead of `pl.Datetime` when writing this comment the first time):
In [6]: dataset = pl.DataFrame({"date": ["2020-01-02", "2020-01-03", "2020-01-04 00:00:00"], "index": [1, 2, 3]})
In [7]: dataset.with_column(pl.col("date").str.strptime(pl.Datetime))
Out[7]:
shape: (3, 2)
┌─────────────────────┬───────┐
│ date ┆ index │
│ --- ┆ --- │
│ datetime[μs] ┆ i64 │
╞═════════════════════╪═══════╡
│ 2020-01-02 00:00:00 ┆ 1 │
│ 2020-01-03 00:00:00 ┆ 2 │
│ 2020-01-04 00:00:00 ┆ 3 │
└─────────────────────┴───────┘
Comment From: MarcoGorelli
Are we OK to close this one then?
EDIT: I'll add better docs for this, but I don't think it warrants an extra parameter
Comment From: jorisvandenbossche
> OK, it should work to do `DatetimeIndex(dates)`

I would find it surprising that this works for `DatetimeIndex(..)` and not `pd.to_datetime`.
But that's maybe a more general question (for another issue): what do we want to do about `DatetimeIndex`? PDEP-4 has been focused on making `to_datetime` strict, but should we apply everything discussed for `to_datetime` to `DatetimeIndex` as well (when passing strings)? Or do we want to explicitly keep `DatetimeIndex` more flexible?
Comment From: MarcoGorelli
Long-term (3.0?), here's what I'd really like to get to:
- `to_datetime` accepts or infers `format`, and its parsing will be consistent according to that format (see https://github.com/pandas-dev/pandas/issues/12649 for an example of where people really want this strictness)
- `DatetimeIndex` doesn't accept `format`, but only accepts ISO8601 strings (so, no issue with mixed-ISO8601 elements)

For 2.0, I'd suggest noting in the `to_datetime` doc (and in the user guide) that if someone has mixed-ISO8601 strings, then they should pass them directly to `DatetimeIndex`.

An alternative would be to add an extra argument to `to_datetime`, but I'm pretty -1 on that:
- `to_datetime` is already quite complex
- this opens the door to allowing other formats to be less strict
@jbrockmendel @mroeschke any thoughts on this?
Comment From: mroeschke
Semantically, in my mind, I always thought `to_datetime` should be the "more generic" date parsing option compared to `DatetimeIndex`, since `to_datetime` contains the myriad of parsing options, so I find it weird that `DatetimeIndex` would contain date-parsing functionality that `to_datetime` doesn't have.
Comment From: MarcoGorelli
Sure, thanks for commenting
How about if the `format` argument in `to_datetime` can take `'ISO8601'`, in which case it'll parse any ISO8601 string (without regard to `exact`)?
So in the above example:
pd.to_datetime(["2022-01-01T09:00:00", "2022-01-02T09:00:00.123"], format='ISO8601')
Comment From: mroeschke
Yeah, that would satisfy my aspiration for `to_datetime`.

(FWIW my bold opinion is that I think `Timestamp` & `DatetimeIndex` shouldn't accept strings or do any parsing, and `to_datetime` should be meant for that)
Comment From: MarcoGorelli
OK thanks
@jorisvandenbossche if you have no objections I'll make a PR (either this or next week, but definitely before 2.0)
Comment From: jbrockmendel
Lots going on here.
Without any non-shared keywords passed, I would expect DatetimeIndex and to_datetime to accept the same things and behave the same. That is not necessarily a good thing zen-of-python-wise. If that were to be changed, I'd want to make one strictly more strict than the other, not mix-and-match (i.e. avoid "DatetimeIndex accepts this but not that, to_datetime the other way around")
I'd be on board with getting keywords out of the DatetimeIndex constructor and telling users to use to_datetime instead (also analogous for TDI/PI), but that might necessitate adding keywords to to_datetime which isn't great.
(Deprecating string parsing in Timestamp is indeed bold, not sure what I think about that)
Also:
I'm finding more and more places where the dateutil parser is not strict enough. I'm leaning towards supporting just a small(ish) collection of formats and doing the parsing internally.
Comment From: MarcoGorelli
Yeah, totally - I'd love to get to the point where `dateutil` is only used for guessing formats, and not for parsing, and will try to get things started towards that after 2.0.

But for 2.0, what do you think we should do about mixed ISO8601 formats? OK with just allowing `format='ISO8601'`?

I don't think allowing mixed formats with `format=None` is an option, as the format might be guessed to be non-ISO8601, and then there'd be no mechanism for allowing mixed formats (at least, not without removing dateutil)
Comment From: jbrockmendel
> But for 2.0, what do you think we should do about mixed ISO8601 formats? OK with just allowing `format='ISO8601'`?

I haven't paid much attention. Is the OP still the best summary of the topic? Since we're low on time, what's the least invasive option?
Comment From: MarcoGorelli
Summary: there's no performant way to parse mixed ISO8601 strings with `to_datetime`, e.g. `pd.to_datetime(['2020-01-01', '2020-01-01 00:00:00'])`.

Least invasive option I can think of: allow `format='ISO8601'`, and then treat that the way `require_iso8601` used to be treated.
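A sketch of the strictness that implies (this assumes the behaviour as it later shipped): mixed-resolution ISO8601 strings parse on the fast path, while non-ISO strings are rejected rather than falling back to flexible parsing.

```python
import pandas as pd  # assumes pandas >= 2.0

# Mixed resolutions, all ISO8601: parses in one vectorized pass
pd.to_datetime(["2020-01-01", "2020-01-01 00:00:00"], format="ISO8601")

# A non-ISO string should raise, mirroring how require_iso8601 used to behave
try:
    pd.to_datetime(["2020-01-01", "01/02/2020"], format="ISO8601")
except ValueError as err:
    print(err)
```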
Comment From: jreback
Similar to @jbrockmendel's comment above, I would be +1 on making `DatetimeIndex` as strict as possible and leaving the parsing to `to_datetime`, e.g. limiting it to i8 and datetime64 inputs, ideally removing any string parsing.
Comment From: MarcoGorelli
Sure, sounds good (maybe for 3.0)
OK to support `format='ISO8601'` in `to_datetime` and let it accept anything ISO8601?
Comment From: jorisvandenbossche
A few reactions:
- Let's discuss the exact parsing capabilities we want to see long-term in general for DatetimeIndex (should it match to_datetime, only ISO formatted strings, or no strings at all? ..) in a separate issue
- For the original issue here, I would certainly not want the (temporary) answer to be: "if you want more flexible parsing, use DatetimeIndex instead of to_datetime" (as suggested in a comment above). I think we all agree that long term DatetimeIndex at least shouldn't have more parsing features than to_datetime, so we should also not point users to DatetimeIndex on the short term for that.
- I don't have a strong opinion on `format="ISO8601"`. If we want to support this use case, I think that is a nice solution for specifying it. At the same time, I don't know how important this use case is (I only noticed it in the pyarrow tests, where we wrote those strings in longer/shorter ISO form just for convenience, to not pad everything with zeros). I just checked, and the pyarrow csv reader actually has such an option to say "ISO8601" in general, and that will parse a mix of ISO formatted strings.
- Short term for 2.0, for DatetimeIndex, I think we should still decide if we want to apply the PDEP-4 changes in string parsing also to DatetimeIndex (I will open a separate issue for this)
Comment From: jorisvandenbossche
> OK to support `format='ISO8601'` in `to_datetime` and let it accept anything ISO8601?

As I wrote above, that sounds good to me (we have something similar in pyarrow). One question though: above you say this won't be performant? (https://github.com/pandas-dev/pandas/issues/50411#issuecomment-1398059502) And can it be strict? (allow a mixture of different ISO formats, but still only ISO compliant formats)
Comment From: MarcoGorelli
This would be performant, sorry - what wouldn't be performant is the currently-suggested alternative (`ser.apply(lambda x: to_datetime(x))`)
And yes, it should be strict with respect to ISO8601 (so, the issue which PDEP4 addressed wouldn't be undone by this)
Cool, I'll make a PR then, thanks all for your inputs
Comment From: jorisvandenbossche
Sounds good, great!
Comment From: jorisvandenbossche
> - Short term for 2.0, for DatetimeIndex, I think we should still decide if we want to apply the PDEP-4 changes in string parsing also to DatetimeIndex (I will open a separate issue for this)

-> opened https://github.com/pandas-dev/pandas/issues/50894 for this
Comment From: jbrockmendel
I may be late to the party but just encountered this when troubleshooting pyarrow xfails: if a user calls to_datetime without passing `format` and it fails with an exception that a format is not matched, that seems weird to me. It also isn't obvious what they are supposed to do instead. This is especially important because it can be reached via csv parsing when the user isn't calling to_datetime directly.
Comment From: MarcoGorelli
What do you think should happen if `format=None` (the default) is passed?

`read_csv` wouldn't throw an error anyway, because if it errors then it just doesn't convert, e.g.
In [4]: pd.read_csv(io.StringIO('date,value\n13-01-2000,3\n2000-01-01,4'), parse_dates=['date']).dtypes
<ipython-input-4-ec4c09ad3e02>:1: UserWarning: Parsing dates in %d-%m-%Y format when dayfirst=False (the default) was specified. Pass `dayfirst=True` or specify a format to silence this warning.
pd.read_csv(io.StringIO('date,value\n13-01-2000,3\n2000-01-01,4'), parse_dates=['date']).dtypes
Out[4]:
date object
value int64
dtype: object
Comment From: jbrockmendel
> What do you think should happen if `format=None` (the default) is passed?

I'd expect it to attempt guessing a format for a fastpath, and fall back to going through `array_to_datetime`.
> `read_csv` wouldn't throw an error anyway, because if it errors then it just doesn't convert, e.g.

Haven't looked at it too closely, but `tests.extension.test_arrow.TestBaseParsing.test_EA_types` has xfailed cases for `pa.types.is_timestamp(pa_dtype)` and `pa_dtype.unit in ("us", "ns")` (that point to #49767) that are fixed if I go into the to_datetime code and do
```diff
+ guessed = False
  if format is None:
      format = _guess_datetime_format_for_array(arg, dayfirst=dayfirst)
+     guessed = True
  if format is not None:
+     try:
          return _array_strptime_with_fallback(arg, name, utc, format, exact, errors)
+     except AppropriateException:
+         if not guessed:
+             raise
```
Comment From: MarcoGorelli
> I'd expect it to attempt guessing a format for a fastpath, and fall back to going through `array_to_datetime`

Then we'd be back to `['12/01/2000', '13/01/2000']` being parsed as "1st December 2000, 13th January 2000", which would bring back the decade-long bug which PDEP4 fixed
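For context, a sketch of the pre-PDEP-4 behaviour being referred to (pandas 1.x, output reconstructed from memory):

```python
import pandas as pd  # pandas < 2.0 behaviour sketched below

pd.to_datetime(["12/01/2000", "13/01/2000"])
# DatetimeIndex(['2000-12-01', '2000-01-13'], dtype='datetime64[ns]', freq=None)
# i.e. 1st December 2000 followed by 13th January 2000: each element silently
# switched between month-first and day-first interpretation.
```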
Comment From: jbrockmendel
> which would bring back the decade-long bug which PDEP4 fixed

Yep, that's a tough one. Was this breakage discussed-and-dismissed in the lead-up to PDEP4?

In cases where PDEP4 breaks user code (e.g. the Jan 13 case above), we should at the very least have an exception message that tells users how to fix it.
Comment From: MarcoGorelli
I'm OK with mixed ISO8601 parsing.

It's mixed not-necessarily-ISO8601 that I'm really keen to avoid.