Code Sample, a copy-pastable example if possible
import pandas as pd
series_1 = pd.Series(['13/03/2000', '01/03/2000', '02/03/2000']) # dates in European format
series_2 = pd.Series(['01/03/2000', '13/03/2000', '02/03/2000']) # same dates, reordered
converted_1 = pd.to_datetime(series_1, infer_datetime_format=True)
converted_2 = pd.to_datetime(series_2, infer_datetime_format=True)
print sorted(series_1) == sorted(series_2) # True
print sorted(converted_1) == sorted(converted_2) # False
Problem description
One would hope to obtain the same result when applying pd.to_datetime
to the same series twice, but shuffled.
Expected Output
My understanding is that the format determined when setting infer_datetime_format=True
is obtained from the first non-null value of the Series (see function _guess_datetime_format_for_array
in tseries.tools
), which explains the result above. I understand this logic in terms of optimizing the operation, and it does work as expected most of the time.
However, I feel like the example provided above is fairly generic. Ideally, the function would find the best datetime_format
for the entire Series. Any ideas on how to implement this ?
Output of pd.show_versions()
Comment From: jreback
xref https://github.com/pandas-dev/pandas/issues/3341
you need to be explicit. with ambiguous dayfirst dates.
In [19]: series_1 = pd.Series(['13/03/2000', '01/03/2000', '02/03/2000']) # dates in European format
...: series_2 = pd.Series(['01/03/2000', '13/03/2000', '02/03/2000']) # same dates, reordered
...:
...: converted_1 = pd.to_datetime(series_1, dayfirst=True)
...: converted_2 = pd.to_datetime(series_2, dayfirst=True)
...:
...:
In [20]: converted_1
Out[20]:
0 2000-03-13
1 2000-03-01
2 2000-03-02
dtype: datetime64[ns]
In [21]: converted_2
Out[21]:
0 2000-03-01
1 2000-03-13
2 2000-03-02
dtype: datetime64[ns]
Comment From: j-abi
I understand that dayfirst=True
solves my issue ... if I already know whether my dates are in European format.
The point of this feature request was not to report a bug but, more generally, whether we would be able to determine this automatically based on the input Series: the dates individually are ambiguous, but the Series as a whole is not.
Perhaps this is not feasible within reasonable computation cost, but as far as I am concerned it would be a big improvement.
Comment From: jreback
see this issue: https://github.com/pandas-dev/pandas/issues/12585
we have explicity discussed this. its is just waiting for someone to come along and do it. its not very difficult actually, pull -requests are welcome.
Comment From: jorisvandenbossche
@jreback It is not completely related to issue #12585.
I suspect that in pd.to_datetime(series_2, infer_datetime_format=True)
, the inferring of the format fails (or at least, you get an error with the inferred format based on the first date), and as a result it falls back to default pd.to_datetime(series_2)
parsing with dateutil, and this has indeed the known inconsistent parsing -> #12585
But here, there are also other issues:
- what @abiteboul asks: can the inferring of the datetime be based on more than the first value? (so the format inference would not fail in this case). Should look it up in the original issues/PRs discussing this feature, but probably performance is a good reason to not do this.
- The failure to infer the datetime format is silent in this case, and for this reason the inconsistent parsing is maybe even more suprising, because the user thinks a format has been inferred.
Comment From: jreback
yeah I suppose the format could be guess better here, but (maybe take up to three elements) and see if they match. BUT you still can contrive an example where this fails, so I would not be in favor of changing this, rather having the parser raise if the day/year first are violated, which is #12585
Comment From: j-abi
I suspect that in pd.to_datetime(series_2, infer_datetime_format=True), the inferring of the format fails (or at least, you get an error with the inferred format based on the first date), and as a result it falls back to default pd.to_datetime(series_2) parsing with dateutil, and this has indeed the known inconsistent parsing -> #12585
That is correct, I tried checking with timeit
and it does seem that with the second series the inferring fails (on bigger series than in my example, setting infer_datetime_format=True
leads to a ~50x speedup for the first series, and no speedup for the second).
Since this option is mainly there for speedup, it probably makes sense not to infer from more than the first value, and maybe some kind of warning would be appropriate when the inferring fails.
If we agree on this point, then I guess my issue is indeed basically the same as #12585. Sorry for mixing up two different problems, and thanks for your comments.