Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
df['date'] = pd.to_datetime(df['date'], infer_datetime_format=True, utc=True, errors='ignore')
Issue Description
When I run this code, I get the following warning:
"UserWarning: Parsing '15/09/1979' in DD/MM/YYYY format. Provide format or specify infer_datetime_format=True for consistent parsing."
This is strange, because I have already set infer_datetime_format=True.
Expected Behavior
You would expect no warning at all, since "infer_datetime_format=True" is already provided as an argument.
Installed Versions
Comment From: rhshadrach
Can you provide a small DataFrame that reproduces this warning? I attempted with
df = pd.DataFrame({'date': ['15/09/1979']})
on main and did not get a warning.
Comment From: MarcoGorelli
closing for now, will reopen if you provide a reproducible example @moojen
Comment From: benoit9126
@MarcoGorelli Here is a small example to reproduce this warning:
pd.to_datetime(['01/01/2000','31/05/2000','31/05/2001', '01/02/2000'], infer_datetime_format=True)
It leads to this output
<input>:1: UserWarning: Parsing '31/05/2000' in DD/MM/YYYY format. Provide format or specify infer_datetime_format=True for consistent parsing.
<input>:1: UserWarning: Parsing '31/05/2001' in DD/MM/YYYY format. Provide format or specify infer_datetime_format=True for consistent parsing.
DatetimeIndex(['2000-01-01', '2000-05-31', '2001-05-31', '2000-01-02'], dtype='datetime64[ns]', freq=None)
The problem is the following: the infer_datetime_format option takes the first non-NaN object in the array and uses it to guess the format of the dates in the provided array. Here, with "01/01/2000", the guessed format is MM/DD/YYYY.
Later in the conversion function, there is an internal error because '31/05/2000' cannot be converted using the guessed format, so the format is changed locally (per object, not for the entire array) to the correct one, DD/MM/YYYY. This leads to one warning per unique date in the array for which the month and day must be swapped to convert the date.
As you can see in my example, the last date in the array, "01/02/2000", is converted using the guessed format into '2000-01-02' (without warning), while I hoped for "01-02-2000"...
The user warning is annoying because it is emitted per object in the array. With a large array, it leads to thousands of warning lines: one for every value that can only be parsed by abandoning the initial guess.
This issue seems to be related to #12585
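To make the behaviour described above concrete, here is a minimal sketch that imitates it with the standard library only. It is not pandas' actual implementation: the function name `parse_with_guess` and the hard-coded pair of formats are assumptions made purely for illustration. The guessed format is tried on every element, and only the elements that fail are re-parsed with the day-first alternative.

```python
from datetime import datetime


def parse_with_guess(values, guessed="%m/%d/%Y", fallback="%d/%m/%Y"):
    """Hypothetical helper: parse with a guessed format, falling back per element."""
    parsed, fell_back = [], []
    for value in values:
        try:
            parsed.append(datetime.strptime(value, guessed))
        except ValueError:
            # The guess does not fit this element, so only this element is
            # re-parsed with the alternative format (not the whole array).
            parsed.append(datetime.strptime(value, fallback))
            fell_back.append(value)
    return parsed, fell_back


dates = ['01/01/2000', '31/05/2000', '31/05/2001', '01/02/2000']
parsed, fell_back = parse_with_guess(dates)
iso_dates = [d.strftime('%Y-%m-%d') for d in parsed]
print(iso_dates)   # ['2000-01-01', '2000-05-31', '2001-05-31', '2000-01-02']
print(fell_back)   # ['31/05/2000', '31/05/2001']
```

The ambiguous "01/02/2000" parses without complaint under the month-first guess, which is why it ends up as 2 January while its neighbours are read day-first.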
Comment From: rhshadrach
Agreed that the warning should only be emitted once, and the message saying to pass infer_datetime_format=True is confusing. The trickier issue of how to infer better seems to be well captured by #12585, so I'd recommend scoping this issue to just the warning message and count.
Comment From: DrNickBailey
infer_datetime_format seems flawed. Is it assuming the silly American MM-DD-YYYY format? (why is it not taking my locality into account?).
But I'm not sure what it's doing: when I tried the above sample and just switched the order of the first two dates, this happens:
pd.to_datetime(['15/02/2006','01/01/2000','31/05/2001','01/02/2000'], infer_datetime_format=True)
DatetimeIndex(['2006-02-15', '2000-01-01', '2001-05-31', '2000-02-01'], dtype='datetime64[ns]', freq=None)
but.....
pd.to_datetime(['01/01/2000','15/02/2006','31/05/2001','01/02/2000'], infer_datetime_format=True)
/var/folders/95/4q5tdld13b72xd3t35759k140000gq/T/ipykernel_39130/1894679187.py:1: UserWarning: Parsing '15/02/2006' in DD/MM/YYYY format. Provide format or specify infer_datetime_format=True for consistent parsing.
  pd.to_datetime(['01/01/2000','15/02/2006','31/05/2001','01/02/2000'], infer_datetime_format=True)
/var/folders/95/4q5tdld13b72xd3t35759k140000gq/T/ipykernel_39130/1894679187.py:1: UserWarning: Parsing '31/05/2001' in DD/MM/YYYY format. Provide format or specify infer_datetime_format=True for consistent parsing.
  pd.to_datetime(['01/01/2000','15/02/2006','31/05/2001','01/02/2000'], infer_datetime_format=True)
DatetimeIndex(['2000-01-01', '2006-02-15', '2001-05-31', '2000-01-02'], dtype='datetime64[ns]', freq=None)
What's going on? It's got the right date format for the first three but has oddly switched the 4th date around?
Comment From: MarcoGorelli
01/01/2000 is ambiguous (is it DD/MM/YYYY or MM/DD/YYYY?) and the format can't be inferred from it. If it was, say, 14/01/2000, then you wouldn't get the warning:
>>> pd.to_datetime(['14/01/2000', '15/01/2000'], infer_datetime_format=True)
DatetimeIndex(['2000-01-14', '2000-01-15'], dtype='datetime64[ns]', freq=None)
>>> pd.to_datetime(['11/01/2000', '15/01/2000'], infer_datetime_format=True)
<stdin>:1: UserWarning: Parsing '15/01/2000' in DD/MM/YYYY format. Provide format or specify infer_datetime_format=True for consistent parsing.
DatetimeIndex(['2000-11-01', '2000-01-15'], dtype='datetime64[ns]', freq=None)
I'm tempted to suggest changing the warning to
Parsing '15/02/2006' in DD/MM/YYYY format. Provide format to ensure consistent parsing.
as infer_datetime_format isn't always able to infer the format (as the docstring says, it only tries to). And yes, also only emitting the warning once. I'll take another look at this if I get a chance.
"Is it assuming the silly American MM-DD-YYYY format? (why is it not taking my locality into account?)."
By default, dayfirst is False, so if you pass an ambiguous string without specifying a format, that'll be the format it assumes.
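For readers who hit this warning, the two arguments mentioned in this thread can be used as follows. This is a small sketch rather than an official recommendation; it assumes the input is known to be day-first, which is an assumption about the data, not something pandas can verify.

```python
import pandas as pd

dates = ['01/01/2000', '31/05/2000', '31/05/2001', '01/02/2000']

# An explicit format removes the guessing step entirely: every element is
# parsed as DD/MM/YYYY, and a string that does not fit raises an error.
explicit = pd.to_datetime(dates, format='%d/%m/%Y')

# dayfirst=True only states a preference for ambiguous strings such as
# '01/02/2000'; it is not a strict format.
day_first = pd.to_datetime(dates, dayfirst=True)

explicit_iso = explicit.strftime('%Y-%m-%d').tolist()
print(explicit_iso)  # ['2000-01-01', '2000-05-31', '2001-05-31', '2000-02-01']
```

With the explicit format, the last date is read as 1 February rather than 2 January.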
Comment From: DrNickBailey
In your example, the wording "Parsing '15/01/2006' in DD/MM/YYYY format. Provide format to ensure consistent parsing." still makes no sense to me as a (British/international) user, as I already thought '11/01/2000' was in DD/MM/YYYY, so the "warning" is confusing.
I'd also wager that dayfirst as an argument name is unclear: what day is first? dayfirst could mean a number of things to the inexperienced pandas user. monthfirst would be more understandable to the international audience. Though datetime_dayfirst. And while I'm at it, format should be datetime_format for clarity.
Sorry for my hate of MM-DD. ISO 8601 is a standard for a reason and solves all ambiguity with dates!
Comment From: MarcoGorelli
Reckon this would be clearer?
>>> import pandas as pd
>>> pd.to_datetime(['01/01/2000','31/05/2000','31/05/2001', '01/02/2000'], infer_datetime_format=True)
<stdin>:1: UserWarning: Parsing dates in DD/MM/YYYY format when dayfirst=False (the default) was specified. This may lead to inconsistently-parsed dates! Specify a format to ensure consistent parsing.
DatetimeIndex(['2000-01-01', '2000-05-31', '2001-05-31', '2000-01-02'], dtype='datetime64[ns]', freq=None)
Like this, the warning would only be emitted once, and (I think) would be a bit clearer.
"And while I'm at it format should be datetime_format for clarity."
This'd have to go through a deprecation cycle, not sure it'd be worth it.
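The "emit the warning only once" idea can be sketched with the standard library alone. This is an illustration of the pattern, not the change made in pandas: the helper name `parse_warn_once` and the two hard-coded formats are hypothetical, and the message text is borrowed from the proposal above. Fallbacks are recorded while looping, and a single warning is raised after the loop.

```python
import warnings
from datetime import datetime


def parse_warn_once(values, guessed="%m/%d/%Y", fallback="%d/%m/%Y"):
    """Hypothetical helper: per-element fallback, but at most one warning."""
    parsed = []
    needed_fallback = False
    for value in values:
        try:
            parsed.append(datetime.strptime(value, guessed))
        except ValueError:
            parsed.append(datetime.strptime(value, fallback))
            needed_fallback = True  # remember it, but do not warn yet
    if needed_fallback:
        # One warning for the whole array instead of one per element.
        warnings.warn(
            "Parsing dates in DD/MM/YYYY format when dayfirst=False (the "
            "default) was specified. Specify a format to ensure consistent "
            "parsing.",
            UserWarning,
            stacklevel=2,
        )
    return parsed


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = parse_warn_once(
        ['01/01/2000', '31/05/2000', '31/05/2001', '01/02/2000']
    )

print(len(caught))  # 1
```

Two elements need the fallback here, yet only one warning is recorded.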
Comment From: TheSwallowCoder
@MarcoGorelli Here is a small example to reproduce this warning:
pd.to_datetime(['01/01/2000','31/05/2000','31/05/2001', '01/02/2000'], infer_datetime_format=True)
It leads to this output
<input>:1: UserWarning: Parsing '31/05/2000' in DD/MM/YYYY format. Provide format or specify infer_datetime_format=True for consistent parsing.
<input>:1: UserWarning: Parsing '31/05/2001' in DD/MM/YYYY format. Provide format or specify infer_datetime_format=True for consistent parsing.
DatetimeIndex(['2000-01-01', '2000-05-31', '2001-05-31', '2000-01-02'], dtype='datetime64[ns]', freq=None)
The problem is the following: the infer_datetime_format option takes the first non-NaN object in the array and uses it to guess the format of the dates in the provided array. Here, with "01/01/2000", the guessed format is MM/DD/YYYY.
Later in the conversion function, there is an internal error because '31/05/2000' cannot be converted using the guessed format, so the format is changed locally (per object, not for the entire array) to the correct one, DD/MM/YYYY. This leads to one warning per unique date in the array for which the month and day must be swapped to convert the date.
As you can see in my example, the last date in the array, "01/02/2000", is converted using the guessed format into '2000-01-02' (without warning), while I hoped for "01-02-2000"...
The user warning is annoying because it is emitted per object in the array. With a large array, it leads to thousands of warning lines: one for every value that can only be parsed by abandoning the initial guess.
This issue seems to be related to #12585
Comment From: MarcoGorelli
@TheSwallowCoder can you check on upstream/main (or on the pandas 1.5.0 release candidate)?
Comment From: TheSwallowCoder
@MarcoGorelli
Comment From: MarcoGorelli
Can't reproduce, here's the output I get from 1.5.0rc:
(.venv) marco@marco-Predator-PH315-52:~/tmp$ cat t.py
import pandas as pd
pd.to_datetime(['01/01/2000','31/05/2000','31/05/2001', '01/02/2000'], infer_datetime_format=True)
(.venv) marco@marco-Predator-PH315-52:~/tmp$ python -c 'import pandas; print(pandas.__version__)'
1.5.0rc0
(.venv) marco@marco-Predator-PH315-52:~/tmp$ python t.py
t.py:2: UserWarning: Parsing dates in DD/MM/YYYY format when dayfirst=False (the default) was specified. This may lead to inconsistently parsed dates! Specify a format to ensure consistent parsing.
pd.to_datetime(['01/01/2000','31/05/2000','31/05/2001', '01/02/2000'], infer_datetime_format=True)