Pandas pd.to_datetime/pd.DatetimeIndex unexpected behavior

I don't know if this is a bug or not (I guess not), but I think this behavior might be very confusing to many users who like me until now might not be aware of this.

When converting a string column to Date using pd.to_datetime or pd.DatetimeIndex the way the date is interpreted might change from row to row. I experienced this today where March 1st was interpreted as January 3rd. Probably pandas does this simply on a per row set considering only one sample at a time. One way to improve this would be to first go through the whole

At least I'd suggest to implement a warning if none of the format/dayfirst parameters are used.

Here's a screenshot to clarify:

http://snag.gy/mXT7g.jpg

output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None python: 3.5.1.final.0 python-bits: 64 OS: Darwin OS-release: 15.3.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: de_DE.UTF-8

pandas: 0.17.1 nose: 1.3.7 pip: 8.0.2 setuptools: 19.6.2 Cython: 0.23.4 numpy: 1.10.4 scipy: 0.17.0 statsmodels: 0.6.1 IPython: 4.1.1 sphinx: 1.3.5 patsy: 0.4.0 dateutil: 2.4.2 pytz: 2015.7 blosc: None bottleneck: 1.0.0 tables: 3.2.2 numexpr: 2.4.6 matplotlib: 1.5.1 openpyxl: 2.3.2 xlrd: 0.9.4 xlwt: 1.0.0 xlsxwriter: 0.8.4 lxml: 3.5.0 bs4: 4.4.1 html5lib: None httplib2: None apiclient: None sqlalchemy: 1.0.11 pymysql: None psycopg2: None Jinja2: 2.8

Comment From: jreback

pls show a copy-pastable example, not a screen shot.

Comment From: marcomayer

Sure, you can use this:


data = pd.Series(['29.02.2016 03:30:00', '29.02.2016 04:00:00', '01.03.2016 03:30:00', '01.03.2016 04:00:00'])

pd.to_datetime(data)

will make difference choices regarding day or month first:


0   2016-02-29 03:30:00
1   2016-02-29 04:00:00
2   2016-01-03 03:30:00
3   2016-01-03 04:00:00
dtype: datetime64[ns]

Comment From: jorisvandenbossche

This is a known issue with to_datetime, which is caused by the (too much) flexibility of dateutil. You can 'solve' it by providing a format argument.

See also https://github.com/pydata/pandas/issues/7348

Comment From: jreback

Well you have an option to control this, dayfirst. Having mixed day-first, not day-first data will require a pre-processing step.

In [18]: pd.to_datetime(data, dayfirst=True)
Out[18]: 
0   2016-02-29 03:30:00
1   2016-02-29 04:00:00
2   2016-03-01 03:30:00
3   2016-03-01 04:00:00
dtype: datetime64[ns]

In [19]: pd.to_datetime(data, dayfirst=False)
Out[19]: 
0   2016-02-29 03:30:00
1   2016-02-29 04:00:00
2   2016-01-03 03:30:00
3   2016-01-03 04:00:00
dtype: datetime64[ns]

Comment From: jorisvandenbossche

@marcomayer But I agree that this can be very confusing!

Note that using dayfirst is also not strict: https://github.com/pydata/pandas/issues/3341

Comment From: marcomayer

yes, a warning or so might be something to consider I think.

The way I work is that I read in millions of records, then convert the dates etc. and usually just do a head() to check if things are fine. My guess is that others do the same thing...

In a perfect world, it would check if there is any ambiguity and if so at least warn or just stop and require format.

Pandas pd.to_datetime/pd.DatetimeIndex unexpected behavior

output of pd.show_versions()

INSTALLED VERSIONS

output of `pd.show_versions()`