Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [X] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

df['date'] = pd.to_datetime(df['date'], infer_datetime_format=True, utc=True, errors='ignore')

Issue Description

When I run this code, I get the following warning:

"UserWarning: Parsing '15/09/1979' in DD/MM/YYYY format. Provide format or specify infer_datetime_format=True for consistent parsing."

This is strange, because I have already set infer_datetime_format=True.

Expected Behavior

You would expect no warning at all, since "infer_datetime_format=True" is already provided as an argument.

Installed Versions

pymysql : None
html5lib : None
lxml.etree : 4.8.0
xlsxwriter : None
feather : None
blosc : None
sphinx : None
hypothesis : None
pytest : 6.2.5
Cython : None
setuptools : 58.1.0
pip : 22.0.3
dateutil : 2.8.2
pytz : 2021.3
numpy : 1.22.2
pandas : 1.4.1

Comment From: rhshadrach

Can you provide a small DataFrame that reproduces this warning? I attempted with

df = pd.DataFrame({'date': ['15/09/1979']})

on main and did not get a warning.

Comment From: MarcoGorelli

closing for now, will reopen if you provide a reproducible example @moojen

Comment From: benoit9126

@MarcoGorelli Here is a small example to reproduce this warning:

pd.to_datetime(['01/01/2000','31/05/2000','31/05/2001', '01/02/2000'], infer_datetime_format=True)

It leads to this output

<input>:1: UserWarning: Parsing '31/05/2000' in DD/MM/YYYY format. Provide format or specify infer_datetime_format=True for consistent parsing.
<input>:1: UserWarning: Parsing '31/05/2001' in DD/MM/YYYY format. Provide format or specify infer_datetime_format=True for consistent parsing.
DatetimeIndex(['2000-01-01', '2000-05-31', '2001-05-31', '2000-01-02'], dtype='datetime64[ns]', freq=None)

The problem is the following: the infer_datetime_format option takes the first non-NaN object in the array and uses it to guess the date format for the whole array. Here, with "01/01/2000", the guessed format is MM/DD/YYYY.

Later in the conversion function, there is an internal error because '31/05/2000' cannot be converted using the guessed format, so the format is changed locally (per object, not for the entire array) to the correct one, DD/MM/YYYY. This leads to one warning per unique date in the array for which month and day must be swapped to convert the date.

As you can see in my example, the last date in the array, "01/02/2000", is converted using the guessed format into '2000-01-02' (without warning), whereas I expected 1 February 2000 ('2000-02-01').

The user warning is annoying because it is emitted per object in the array. With a large array, it leads to thousands of warning lines: one per value when there is no other option than to change the initial guess to parse the date.
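
To make the behaviour described above concrete, here is a toy model that uses only the standard library. It is not pandas' actual implementation: the function name parse_with_guessed_format and its parameters are hypothetical, and the guessed format is hard-coded to MM/DD/YYYY to mirror what happens when the first value is "01/01/2000".

```python
import warnings
from datetime import datetime


def parse_with_guessed_format(values, guessed="%m/%d/%Y", fallback="%d/%m/%Y"):
    """Toy model: parse with one guessed format, fall back per value with a warning."""
    parsed = []
    for value in values:
        try:
            parsed.append(datetime.strptime(value, guessed))
        except ValueError:
            # The guess does not fit this value, so only this value is re-parsed.
            warnings.warn(f"Parsing {value!r} with fallback format {fallback}", UserWarning)
            parsed.append(datetime.strptime(value, fallback))
    return parsed


dates = ["01/01/2000", "31/05/2000", "31/05/2001", "01/02/2000"]
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = parse_with_guessed_format(dates)

print([d.date().isoformat() for d in result])
print(len(caught), "warnings")
```

In this sketch the two '31/05/...' values each trigger a warning, while "01/02/2000" silently fits the guessed format and comes out as 2 January, which is the same inconsistency as in the pandas output above.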

This issue seems to be related to #12585

Comment From: rhshadrach

Agreed that the warning should only be emitted once; and the message saying to pass infer_datetime_format=True is confusing. The trickier issues about how to infer better seems to be well captured by #12585, so I'd recommend scoping this issue on just the warning message and count.
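
One way to emit the warning once per call rather than once per value is to record that a fallback happened and warn after the loop. The following is only a standard-library sketch under that assumption; parse_warn_once is a hypothetical name, and this is not how pandas is implemented.

```python
import warnings
from datetime import datetime


def parse_warn_once(values, guessed="%m/%d/%Y", fallback="%d/%m/%Y"):
    """Toy model: fall back per value, but emit at most one warning per call."""
    parsed = []
    used_fallback = False
    for value in values:
        try:
            parsed.append(datetime.strptime(value, guessed))
        except ValueError:
            parsed.append(datetime.strptime(value, fallback))
            used_fallback = True
    if used_fallback:
        # A single summary warning, regardless of how many values needed the fallback.
        warnings.warn(
            "Some values did not match the guessed format and were parsed with "
            f"{fallback}. Specify a format to ensure consistent parsing.",
            UserWarning,
            stacklevel=2,
        )
    return parsed


with warnings.catch_warnings(record=True) as caught_once:
    warnings.simplefilter("always")
    parse_warn_once(["01/01/2000", "31/05/2000", "31/05/2001", "01/02/2000"])

print(len(caught_once), "warning")
```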

Comment From: DrNickBailey

infer_datetime_format seems flawed. Is it assuming the silly American MM-DD-YYYY format? (why is it not taking my locality into account?).

But I'm not sure what it's doing as when I tried with the above sample and just switched the order of the first two dates this happens:

pd.to_datetime(['15/02/2006','01/01/2000','31/05/2001','01/02/2000'], infer_datetime_format=True)

DatetimeIndex(['2006-02-15', '2000-01-01', '2001-05-31', '2000-02-01'], dtype='datetime64[ns]', freq=None)

but.....

pd.to_datetime(['01/01/2000','15/02/2006','31/05/2001','01/02/2000'], infer_datetime_format=True)

/var/folders/95/4q5tdld13b72xd3t35759k140000gq/T/ipykernel_39130/1894679187.py:1: UserWarning: Parsing '15/02/2006' in DD/MM/YYYY format. Provide format or specify infer_datetime_format=True for consistent parsing.
  pd.to_datetime(['01/01/2000','15/02/2006','31/05/2001','01/02/2000'], infer_datetime_format=True)
/var/folders/95/4q5tdld13b72xd3t35759k140000gq/T/ipykernel_39130/1894679187.py:1: UserWarning: Parsing '31/05/2001' in DD/MM/YYYY format. Provide format or specify infer_datetime_format=True for consistent parsing.
  pd.to_datetime(['01/01/2000','15/02/2006','31/05/2001','01/02/2000'], infer_datetime_format=True)
DatetimeIndex(['2000-01-01', '2006-02-15', '2001-05-31', '2000-01-02'], dtype='datetime64[ns]', freq=None)

What's going on? It's got the right date format for the first three but has oddly switched the 4th date around?

Comment From: MarcoGorelli

01/01/2000 is ambiguous (is it DD/MM/YYYY or MM/DD/YYYY?) and the format can't be inferred from it. If it was, say, 14/01/2000, then you wouldn't get the warning:

>>> pd.to_datetime(['14/01/2000', '15/01/2000'], infer_datetime_format=True)
DatetimeIndex(['2000-01-14', '2000-01-15'], dtype='datetime64[ns]', freq=None)
>>> pd.to_datetime(['11/01/2000', '15/01/2000'], infer_datetime_format=True)
<stdin>:1: UserWarning: Parsing '15/01/2000' in DD/MM/YYYY format. Provide format or specify infer_datetime_format=True for consistent parsing.
DatetimeIndex(['2000-11-01', '2000-01-15'], dtype='datetime64[ns]', freq=None)
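
The ambiguity can be checked without pandas. The sketch below uses only datetime.strptime and a hypothetical helper, matching_formats, to list which of the two candidate formats a string fits; a string that fits both cannot reveal the format on its own.

```python
from datetime import datetime

# The two layouts under discussion: month-first and day-first.
CANDIDATE_FORMATS = ("%m/%d/%Y", "%d/%m/%Y")


def matching_formats(value):
    """Return every candidate format that parses `value` without error."""
    matches = []
    for fmt in CANDIDATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            continue
        matches.append(fmt)
    return matches


for value in ["01/01/2000", "11/01/2000", "14/01/2000"]:
    print(value, matching_formats(value))
```

Only '14/01/2000' is limited to a single format here, because 14 is not a valid month.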

I'm tempted to suggest changing the warning to

Parsing '15/02/2006' in DD/MM/YYYY format. Provide format to ensure consistent parsing.

, as infer_datetime_format isn't always able to infer the format (as the docstring says, it only tries to). And yes, also only emitting the warning once. I'll take another look at this if I get a chance

> Is it assuming the silly American MM-DD-YYYY format? (why is it not taking my locality into account?).

By default, dayfirst is False, so if you pass an ambiguous string without specifying a format, that'll be the format it assumes
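
For day-first data, two ways to avoid relying on the default are sketched below: passing an explicit format, or passing dayfirst=True. Both are existing pd.to_datetime parameters; the sample list is the one from earlier in this thread, and the variable names are just for illustration.

```python
import pandas as pd

dates = ["01/01/2000", "31/05/2000", "31/05/2001", "01/02/2000"]

# An explicit format leaves nothing to guess: every value is read day-first.
explicit = pd.to_datetime(dates, format="%d/%m/%Y")

# dayfirst=True asks the parser to read ambiguous strings day-first.
day_first = pd.to_datetime(dates, dayfirst=True)

print(explicit.strftime("%Y-%m-%d").tolist())
print(day_first.strftime("%Y-%m-%d").tolist())
```

With either option the last value should come out as 1 February 2000 rather than 2 January 2000. An explicit format is the stricter of the two, since it is applied to every value.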

Comment From: DrNickBailey

In your example the wording Parsing '15/01/2006' in DD/MM/YYYY format. Provide format to ensure consistent parsing. still makes no sense to me as a (British/international) user as I already thought '11/01/2000' was in DD/MM/YYYY, so the “warning” is confusing.

I’d also wager that dayfirst as an argument name is unclear – what day is first? dayfirst could mean a number of things to the inexperienced pandas user. monthfirst would be more understandable to the international audience. Though datetime_dayfirst. And while I'm at it format should be datetime_format for clarity.

Sorry for my hate of MM-DD. ISO 8601 is a standard for a reason and solves all ambiguity with dates!

Comment From: MarcoGorelli

Reckon this would be clearer?

>>> import pandas as pd
>>> pd.to_datetime(['01/01/2000','31/05/2000','31/05/2001', '01/02/2000'], infer_datetime_format=True)
<stdin>:1: UserWarning: Parsing dates in DD/MM/YYYY format when dayfirst=False (the default) was specified. This may lead to inconsistently-parsed dates! Specify a format to ensure consistent parsing.
DatetimeIndex(['2000-01-01', '2000-05-31', '2001-05-31', '2000-01-02'], dtype='datetime64[ns]', freq=None)

Like this, the warning would only be emitted once, and (I think) would be a bit clearer

> And while I'm at it format should be datetime_format for clarity.

This'd have to go through a deprecation cycle, not sure it'd be worth it

Comment From: TheSwallowCoder

@MarcoGorelli Here is a small example to reproduce this warning:

pd.to_datetime(['01/01/2000','31/05/2000','31/05/2001', '01/02/2000'], infer_datetime_format=True)

It leads to this output

<input>:1: UserWarning: Parsing '31/05/2000' in DD/MM/YYYY format. Provide format or specify infer_datetime_format=True for consistent parsing.
<input>:1: UserWarning: Parsing '31/05/2001' in DD/MM/YYYY format. Provide format or specify infer_datetime_format=True for consistent parsing.
DatetimeIndex(['2000-01-01', '2000-05-31', '2001-05-31', '2000-01-02'], dtype='datetime64[ns]', freq=None)

The problem is the following: the infer_datetime_format option takes the first non-NaN object in the array and uses it to guess the date format for the whole array. Here, with "01/01/2000", the guessed format is MM/DD/YYYY.

Later in the conversion function, there is an internal error because '31/05/2000' cannot be converted using the guessed format, so the format is changed locally (per object, not for the entire array) to the correct one, DD/MM/YYYY. This leads to one warning per unique date in the array for which month and day must be swapped to convert the date.

As you can see in my example, the last date in the array, "01/02/2000", is converted using the guessed format into '2000-01-02' (without warning), whereas I expected 1 February 2000 ('2000-02-01').

The user warning is annoying because it is emitted per object in the array. With a large array, it leads to thousands of warning lines: one per value when there is no other option than to change the initial guess to parse the date.

This issue seems to be related to #12585

Solution

Comment From: MarcoGorelli

@TheSwallowCoder can you check on upstream/main (or on the pandas 1.5.0 release candidate)?

Comment From: TheSwallowCoder

@MarcoGorelli pandas_1 5rc

Comment From: MarcoGorelli

Can't reproduce, here's the output I get from 1.5.0rc:

(.venv) marco@marco-Predator-PH315-52:~/tmp$ cat t.py 
import pandas as pd
pd.to_datetime(['01/01/2000','31/05/2000','31/05/2001', '01/02/2000'], infer_datetime_format=True)

(.venv) marco@marco-Predator-PH315-52:~/tmp$ python -c 'import pandas; print(pandas.__version__)'
1.5.0rc0
(.venv) marco@marco-Predator-PH315-52:~/tmp$ python t.py 
t.py:2: UserWarning: Parsing dates in DD/MM/YYYY format when dayfirst=False (the default) was specified. This may lead to inconsistently parsed dates! Specify a format to ensure consistent parsing.
  pd.to_datetime(['01/01/2000','31/05/2000','31/05/2001', '01/02/2000'], infer_datetime_format=True)