Pandas Determine datetime_format from more than one element

Code Sample, a copy-pastable example if possible

import pandas as pd

series_1 = pd.Series(['13/03/2000', '01/03/2000', '02/03/2000'])  # dates in European format
series_2 = pd.Series(['01/03/2000', '13/03/2000', '02/03/2000'])  # same dates, reordered

converted_1 = pd.to_datetime(series_1, infer_datetime_format=True)
converted_2 = pd.to_datetime(series_2, infer_datetime_format=True)

print sorted(series_1) == sorted(series_2)  # True
print sorted(converted_1) == sorted(converted_2)  # False

Problem description

One would hope to obtain the same result when applying pd.to_datetime to the same series twice, but shuffled.

Expected Output

My understanding is that the format determined when setting infer_datetime_format=True is obtained from the first non-null value of the Series (see function _guess_datetime_format_for_array in tseries.tools), which explains the result above. I understand this logic in terms of optimizing the operation, and it does work as expected most of the time. However, I feel like the example provided above is fairly generic. Ideally, the function would find the best datetime_format for the entire Series. Any ideas on how to implement this ?

Output of `pd.show_versions()`

INSTALLED VERSIONS ------------------ commit: None python: 2.7.12.final.0 python-bits: 64 OS: Linux OS-release: 4.4.0-57-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: en_US.utf8 LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 pandas: 0.19.2 nose: 1.3.7 pip: 9.0.1 setuptools: 28.8.0 Cython: None numpy: 1.11.3 scipy: 0.18.0 statsmodels: None xarray: None IPython: 2.4.1 sphinx: None patsy: None dateutil: 2.6.0 pytz: 2016.10 blosc: None bottleneck: None tables: None numexpr: None matplotlib: 1.5.2 openpyxl: None xlrd: None xlwt: None xlsxwriter: None lxml: None bs4: None html5lib: None httplib2: None apiclient: None sqlalchemy: None pymysql: None psycopg2: None jinja2: None boto: None pandas_datareader: None

Comment From: jreback

xref https://github.com/pandas-dev/pandas/issues/3341

you need to be explicit. with ambiguous dayfirst dates.

In [19]: series_1 = pd.Series(['13/03/2000', '01/03/2000', '02/03/2000'])  # dates in European format
    ...: series_2 = pd.Series(['01/03/2000', '13/03/2000', '02/03/2000'])  # same dates, reordered
    ...:
    ...: converted_1 = pd.to_datetime(series_1, dayfirst=True)
    ...: converted_2 = pd.to_datetime(series_2, dayfirst=True)
    ...:
    ...:

In [20]: converted_1
Out[20]:
0   2000-03-13
1   2000-03-01
2   2000-03-02
dtype: datetime64[ns]

In [21]: converted_2
Out[21]:
0   2000-03-01
1   2000-03-13
2   2000-03-02
dtype: datetime64[ns]

Comment From: j-abi

I understand that dayfirst=True solves my issue ... if I already know whether my dates are in European format. The point of this feature request was not to report a bug but, more generally, whether we would be able to determine this automatically based on the input Series: the dates individually are ambiguous, but the Series as a whole is not. Perhaps this is not feasible within reasonable computation cost, but as far as I am concerned it would be a big improvement.

Comment From: jreback

see this issue: https://github.com/pandas-dev/pandas/issues/12585

we have explicity discussed this. its is just waiting for someone to come along and do it. its not very difficult actually, pull -requests are welcome.

Comment From: jorisvandenbossche

@jreback It is not completely related to issue #12585. I suspect that in pd.to_datetime(series_2, infer_datetime_format=True), the inferring of the format fails (or at least, you get an error with the inferred format based on the first date), and as a result it falls back to default pd.to_datetime(series_2) parsing with dateutil, and this has indeed the known inconsistent parsing -> #12585

But here, there are also other issues:

what @abiteboul asks: can the inferring of the datetime be based on more than the first value? (so the format inference would not fail in this case). Should look it up in the original issues/PRs discussing this feature, but probably performance is a good reason to not do this.
The failure to infer the datetime format is silent in this case, and for this reason the inconsistent parsing is maybe even more suprising, because the user thinks a format has been inferred.

Comment From: jreback

yeah I suppose the format could be guess better here, but (maybe take up to three elements) and see if they match. BUT you still can contrive an example where this fails, so I would not be in favor of changing this, rather having the parser raise if the day/year first are violated, which is #12585

Comment From: j-abi

I suspect that in pd.to_datetime(series_2, infer_datetime_format=True), the inferring of the format fails (or at least, you get an error with the inferred format based on the first date), and as a result it falls back to default pd.to_datetime(series_2) parsing with dateutil, and this has indeed the known inconsistent parsing -> #12585

That is correct, I tried checking with timeit and it does seem that with the second series the inferring fails (on bigger series than in my example, setting infer_datetime_format=True leads to a ~50x speedup for the first series, and no speedup for the second). Since this option is mainly there for speedup, it probably makes sense not to infer from more than the first value, and maybe some kind of warning would be appropriate when the inferring fails.

If we agree on this point, then I guess my issue is indeed basically the same as #12585. Sorry for mixing up two different problems, and thanks for your comments.

Pandas Determine datetime_format from more than one element

Code Sample, a copy-pastable example if possible

Problem description

Expected Output

Output of pd.show_versions()

Output of `pd.show_versions()`