In pandas < 2, parsing datetime strings that are all strictly ISO formatted (so there is no ambiguity) but formatted slightly differently because of a different resolution works fine:
>>> pd.to_datetime(["2022-01-01T09:00:00", "2022-01-02T09:00:00.123"])
DatetimeIndex(['2022-01-01 09:00:00', '2022-01-02 09:00:00.123000'], dtype='datetime64[ns]', freq=None)
With the changes related to datetime parsing (PDEP-4), this now infers the format from the first element, and parsing the second one fails:
>>> pd.to_datetime(["2022-01-01T09:00:00", "2022-01-02T09:00:00.123"])
...
ValueError: time data "2022-01-02T09:00:00.123" at position 1 doesn't match format "%Y-%m-%dT%H:%M:%S"
This is of course expected and can be explained by the changes in behaviour. But I do wonder if we want to allow some way to still parse such data quickly.
For this specific case, you can't specify a format since the formatting is inconsistent. And before, this was actually fast because it took the fast ISO parsing path. With the current dev version of pandas, I don't see any way to parse this in a performant way?
(this also came up in pyarrow's CI, it was an easy fix on our side to update the strings in the test, but still wanted to raise this as a specific case to consider, since with real world data you can't necessarily easily update your data)
cc @MarcoGorelli
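For reference, the element-wise fallback mentioned later in the thread does still parse such data, just without the vectorized ISO fast path; a minimal sketch (the Series here is illustrative, not from the original report):

```python
import pandas as pd

mixed = pd.Series(["2022-01-01T09:00:00", "2022-01-02T09:00:00.123"])

# Parsing each element individually avoids format inference across the array,
# but is much slower than the vectorized ISO parsing path.
parsed = mixed.apply(pd.to_datetime)
print(parsed)
```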
Comment From: jorisvandenbossche
This is also a bit inconsistent with the case where we cannot infer the format. In such a case, we actually still parse each string individually (and just raise a warning about doing so). For example, from the original PR:
>>> pd.to_datetime(["20010101 123456", "201201"])
UserWarning: Could not infer format, so each element will be parsed individually by `dateutil`.
To ensure parsing is consistent and as-expected, please specify a format.
DatetimeIndex(['2001-01-01 12:34:56', '2001-12-20 00:00:00'], dtype='datetime64[ns]', freq=None)
Comment From: MarcoGorelli
Would `format='%Y-%m-%dT%H:%M:%S', exact=False` work for you?
Comment From: MarcoGorelli
that works for the example you posted, and is performant
In [8]: pd.to_datetime(["2022-01-01T09:00:00", "2022-01-02T09:00:00.123"], exact=False)
Out[8]: DatetimeIndex(['2022-01-01 09:00:00', '2022-01-02 09:00:00.123000'], dtype='datetime64[ns]', freq=None)
Comment From: jorisvandenbossche
That only worked because of the exact ordering that I used. Switching the values:
In [15]: pd.to_datetime(["2022-01-01T09:00:00.123", "2022-01-02T09:00:00"], exact=False)
...
ValueError: time data "2022-01-02T09:00:00" at position 1 doesn't match format "%Y-%m-%dT%H:%M:%S.%f"
In that case, you could still provide the shorter format manually:
In [16]: pd.to_datetime(["2022-01-01T09:00:00.123", "2022-01-02T09:00:00"], format="%Y-%m-%dT%H:%M:%S", exact=False)
Out[16]: DatetimeIndex(['2022-01-01 09:00:00.123000', '2022-01-02 09:00:00'], dtype='datetime64[ns]', freq=None)
But, I am wondering: isn't that another bug? I would have expected it to ignore the sub-second values (and so wanted to say that this is not a full replacement ..). But apparently we still do parse the fractional seconds, while with `exact=False` I expect (based on the docstring explanation) that it finds the format in the full string, ignoring anything else.

But maybe I am just misunderstanding the purpose of `exact=False` (never used it myself in practice before).
Comment From: MarcoGorelli
I don't think it's a bug: neither stdlib nor pyarrow would allow that:
In [9]: pa.compute.strptime('2022-01-01T09:00:00.123', '%Y-%m-%dT%H:%M:%S', unit='ns')
...
ArrowInvalid: Failed to parse string: '2022-01-01T09:00:00.123' as a scalar of type timestamp[ns]
In [10]: dt.datetime.strptime('2022-01-01T09:00:00.123', '%Y-%m-%dT%H:%M:%S')
...
ValueError: unconverted data remains: .123
> In that case, you could still provide the shorter format manually:

yup, sounds fine to me
Comment From: jorisvandenbossche
I opened https://github.com/pandas-dev/pandas/issues/50412 for the `exact=False` inconsistencies.

> I don't think it's a bug: neither stdlib nor pyarrow would allow that:

To be clear, I am not saying that those two examples should parse that. But pandas does if you pass `exact=False`. But let's continue that part of the discussion in https://github.com/pandas-dev/pandas/issues/50412

> > In that case, you could still provide the shorter format manually:
>
> yup, sounds fine to me

Depending on the outcome of https://github.com/pandas-dev/pandas/issues/50412, that's not actually a solution.
Comment From: MarcoGorelli
Ah I'd misunderstood which part you were considering a bug - https://github.com/pandas-dev/pandas/issues/50412 makes it clear, thanks!
OK I'll get back to you on this, will need to think about it carefully
Comment From: MarcoGorelli
OK, it should work to do `DatetimeIndex(dates)`. I don't think we need to add extra parameters then.
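A minimal sketch of that suggestion, using the strings from the original example (this assumes the constructor's ISO fast path, as described here):

```python
import pandas as pd

# mixed-resolution but all-ISO8601 strings from the original report
dates = ["2022-01-01T09:00:00", "2022-01-02T09:00:00.123"]

# The DatetimeIndex constructor parses these without inferring a single format
idx = pd.DatetimeIndex(dates)
print(idx)
```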
This seems like a rare use case anyway - ~~as a comparison, polars would error too~~ however, polars does allow it (EDIT: my bad, I'd used `pl.Date` instead of `pl.Datetime` when writing this comment the first time):
In [6]: dataset = pl.DataFrame({"date": ["2020-01-02", "2020-01-03", "2020-01-04 00:00:00"], "index": [1, 2, 3]})
In [7]: dataset.with_column(pl.col("date").str.strptime(pl.Datetime))
Out[7]:
shape: (3, 2)
┌─────────────────────┬───────┐
│ date ┆ index │
│ --- ┆ --- │
│ datetime[μs] ┆ i64 │
╞═════════════════════╪═══════╡
│ 2020-01-02 00:00:00 ┆ 1 │
│ 2020-01-03 00:00:00 ┆ 2 │
│ 2020-01-04 00:00:00 ┆ 3 │
└─────────────────────┴───────┘
Comment From: MarcoGorelli
Are we OK to close this one then?
EDIT: I'll add better docs for this, but I don't think it warrants an extra parameter
Comment From: jorisvandenbossche
> OK, it should work to do `DatetimeIndex(dates)`

I would find it surprising that this works for `DatetimeIndex(..)` and not `pd.to_datetime`.
But that's maybe a more general question (for another issue): what do we want to do about `DatetimeIndex`? PDEP-4 has been focused on making `to_datetime` strict, but should we apply everything discussed for `to_datetime` to `DatetimeIndex` as well (when passing strings)? Or do we want to explicitly keep `DatetimeIndex` more flexible?
Comment From: MarcoGorelli
Long-term (3.0?), here's what I'd really like to get to:
- `to_datetime` accepts or infers `format`, and its parsing will be consistent according to that format (see https://github.com/pandas-dev/pandas/issues/12649 for an example of where people really want this strictness)
- `DatetimeIndex` doesn't accept `format`, but only accepts ISO8601 strings (so, no issue with mixed-ISO8601 elements)

For 2.0, I'd suggest noting in the `to_datetime` doc (and in the user guide) that if someone has mixed-ISO8601 strings, then they should pass them directly to `DatetimeIndex`.

An alternative would be to add an extra argument to `to_datetime`, but I'm pretty -1 on that:
- `to_datetime` is already quite complex
- this opens the door to allowing other formats to be less strict
@jbrockmendel @mroeschke any thoughts on this?
Comment From: mroeschke
Semantically, in my mind, I always thought `to_datetime` should be the "more generic" date parsing option compared to `DatetimeIndex`, since `to_datetime` contains the myriad of parsing options, so I find it weird that `DatetimeIndex` would contain date-parsing functionality that `to_datetime` doesn't have.
Comment From: MarcoGorelli
Sure, thanks for commenting
How about if the `format` argument in `to_datetime` can take `'ISO8601'`, in which case it'll parse any ISO8601 string (without regard to `exact`)?
So in the above example:
pd.to_datetime(["2022-01-01T09:00:00", "2022-01-02T09:00:00.123"], format='ISO8601')
Comment From: mroeschke
Yeah, that would satisfy my aspiration for `to_datetime`.

(FWIW my bold opinion is that I think `Timestamp` & `DatetimeIndex` shouldn't accept strings or do any parsing, and `to_datetime` should be meant for that)
Comment From: MarcoGorelli
OK thanks
@jorisvandenbossche if you have no objections I'll make a PR (either this or next week, but definitely before 2.0)
Comment From: jbrockmendel
Lots going on here.
Without any non-shared keywords passed, I would expect DatetimeIndex and to_datetime to accept the same things and behave the same. That is not necessarily a good thing zen-of-python-wise. If that were to be changed, I'd want to make one strictly more strict than the other, not mix-and-match (i.e. avoid "DatetimeIndex accepts this but not that, to_datetime the other way around")
I'd be on board with getting keywords out of the DatetimeIndex constructor and telling users to use to_datetime instead (also analogous for TDI/PI), but that might necessitate adding keywords to to_datetime which isn't great.
(Deprecating string parsing in Timestamp is indeed bold, not sure what I think about that)
Also:
I'm finding more and more places where the dateutil parser is not strict enough. I'm leaning towards supporting just a small(ish) collection of formats and doing the parsing internally.
Comment From: MarcoGorelli
Yeah, totally - I'd love to get to the point where `dateutil` is only used for guessing formats, and not for parsing, and will try to get things started towards that after 2.0.

But for 2.0, what do you think we should do about mixed ISO8601 formats? OK with just allowing `format='ISO8601'`?

I don't think allowing mixed formats with `format=None` is an option, as the format might be guessed to be non-ISO8601, and then there'd be no mechanism for allowing mixed formats (at least, not without removing dateutil)
Comment From: jbrockmendel
> But for 2.0, what do you think we should do about mixed ISO8601 formats? OK with just allowing `format='ISO8601'`?

I haven't paid much attention. Is the OP still the best summary of the topic? Since we're low on time, what's the least invasive option?
Comment From: MarcoGorelli
Summary: there's no performant way to parse mixed ISO8601 strings with `to_datetime`, e.g. `pd.to_datetime(['2020-01-01', '2020-01-01 00:00:00'])`.

Least invasive option I can think of: allow `format='ISO8601'`, and then treat that the way `require_iso8601` used to be treated.
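A sketch of the strictness that implies (this assumes the behaviour as it later shipped): mixed-resolution ISO8601 strings parse on the fast path, while non-ISO strings are rejected rather than falling back to flexible parsing.

```python
import pandas as pd  # assumes pandas >= 2.0

# Mixed resolutions, all ISO8601: parses in one vectorized pass
pd.to_datetime(["2020-01-01", "2020-01-01 00:00:00"], format="ISO8601")

# A non-ISO string should raise, mirroring how require_iso8601 used to behave
try:
    pd.to_datetime(["2020-01-01", "01/02/2020"], format="ISO8601")
except ValueError as err:
    print(err)
```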
Comment From: jreback
Similar to @jbrockmendel's comment above, I would be +1 on making `DatetimeIndex` as strict as possible and leaving the parsing to `to_datetime`, e.g. limiting it to i8 and datetime64 inputs, ideally removing any string parsing.
Comment From: MarcoGorelli
Sure, sounds good (maybe for 3.0)
OK to support `format='ISO8601'` in `to_datetime` and let it accept anything ISO8601?
Comment From: jorisvandenbossche
A few reactions:
- Let's discuss the exact parsing capabilities we want to see long-term in general for DatetimeIndex (should it match to_datetime, only ISO formatted strings, or no strings at all? ..) in a separate issue
- For the original issue here, I would certainly not want the (temporary) answer to be: "if you want more flexible parsing, use DatetimeIndex instead of to_datetime" (as suggested in a comment above). I think we all agree that long term DatetimeIndex at least shouldn't have more parsing features than to_datetime, so we should also not point users to DatetimeIndex on the short term for that.
- I don't have a strong opinion on `format="ISO8601"`. If we want to support this use case, I think that is a nice solution for specifying it. At the same time, I don't know how important this use case is (I only noticed it in the pyarrow tests, where we wrote those strings in longer/shorter ISO form just for convenience, to not pad everything with zeros). I just checked, and the pyarrow csv reader actually has such an option to say "ISO8601" in general, and that will parse a mix of ISO formatted strings.
- Short term for 2.0, for DatetimeIndex, I think we should still decide if we want to apply the PDEP-4 changes in string parsing also to DatetimeIndex (I will open a separate issue for this)
Comment From: jorisvandenbossche
> OK to support `format='ISO8601'` in `to_datetime` and let it accept anything ISO8601?

As I wrote above, that sounds good to me (we have something similar in pyarrow). One question though: above you say this won't be performant? (https://github.com/pandas-dev/pandas/issues/50411#issuecomment-1398059502) And can it be strict? (allow a mixture of different ISO formats, but still only ISO compliant formats)
Comment From: MarcoGorelli
This would be performant, sorry - what wouldn't be performant is the currently-suggested alternative (`ser.apply(lambda x: to_datetime(x))`)
And yes, it should be strict with respect to ISO8601 (so, the issue which PDEP4 addressed wouldn't be undone by this)
Cool, I'll make a PR then, thanks all for your inputs
Comment From: jorisvandenbossche
Sounds good, great!
Comment From: jorisvandenbossche
> - Short term for 2.0, for DatetimeIndex, I think we should still decide if we want to apply the PDEP-4 changes in string parsing also to DatetimeIndex (I will open a separate issue for this)

-> opened https://github.com/pandas-dev/pandas/issues/50894 for this
Comment From: jbrockmendel
I may be late to the party but just encountered this when troubleshooting pyarrow xfails: if a user calls to_datetime without passing `format` and it fails with an exception that a format is not matched, that seems weird to me. It also isn't obvious what they are supposed to do instead. This is especially important because it can be reached via csv parsing when the user isn't calling to_datetime directly.
Comment From: MarcoGorelli
What do you think should happen if `format=None` (the default) is passed?

`read_csv` wouldn't throw an error anyway, because if it errors then it just doesn't convert, e.g.
In [4]: pd.read_csv(io.StringIO('date,value\n13-01-2000,3\n2000-01-01,4'), parse_dates=['date']).dtypes
<ipython-input-4-ec4c09ad3e02>:1: UserWarning: Parsing dates in %d-%m-%Y format when dayfirst=False (the default) was specified. Pass `dayfirst=True` or specify a format to silence this warning.
pd.read_csv(io.StringIO('date,value\n13-01-2000,3\n2000-01-01,4'), parse_dates=['date']).dtypes
Out[4]:
date object
value int64
dtype: object
Comment From: jbrockmendel
> What do you think should happen if `format=None` (the default) is passed?

I'd expect it to attempt guessing a format for a fastpath, and fall back to going through `array_to_datetime`.
> `read_csv` wouldn't throw an error anyway, because if it errors then it just doesn't convert, e.g.

Haven't looked at it too closely, but `tests.extension.test_arrow.TestBaseParsing.test_EA_types` has xfailed cases for `pa.types.is_timestamp(pa_dtype)` and `pa_dtype.unit in ("us", "ns")` (that point to #49767) that are fixed if I go into the to_datetime code and do
```diff
+ guessed = False
  if format is None:
      format = _guess_datetime_format_for_array(arg, dayfirst=dayfirst)
+     guessed = True
  if format is not None:
+     try:
          return _array_strptime_with_fallback(arg, name, utc, format, exact, errors)
+     except AppropriateException:
+         if not guessed:
+             raise
```
Comment From: MarcoGorelli
> I'd expect it to attempt guessing a format for a fastpath, and fall back to going through `array_to_datetime`

Then we'd be back to `['12/01/2000', '13/01/2000']` being parsed as "1st December 2000, 13th January 2000", which would bring back the decade-long bug which PDEP4 fixed
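For context, a sketch of the pre-PDEP-4 behaviour being referred to (pandas 1.x, output reconstructed from memory):

```python
import pandas as pd  # pandas < 2.0 behaviour sketched below

pd.to_datetime(["12/01/2000", "13/01/2000"])
# DatetimeIndex(['2000-12-01', '2000-01-13'], dtype='datetime64[ns]', freq=None)
# i.e. 1st December 2000 followed by 13th January 2000: each element silently
# switched between month-first and day-first interpretation.
```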
Comment From: jbrockmendel
> which would bring back the decade-long bug which PDEP4 fixed

Yep, that's a tough one. Was this breakage discussed-and-dismissed in the lead-up to PDEP4?

In cases where PDEP4 breaks user code (e.g. the Jan 13 case above), we should at the very least have an exception message that tells users how to fix it.
Comment From: MarcoGorelli
I'm OK with mixed ISO8601 parsing.

It's mixed not-necessarily-ISO8601 that I'm really keen to avoid.