I don't know if this is a bug or not (I guess not), but I think this behavior might be very confusing to many users who like me until now might not be aware of this.
When converting a string column to Date using pd.to_datetime or pd.DatetimeIndex the way the date is interpreted might change from row to row. I experienced this today where March 1st was interpreted as January 3rd. Probably pandas does this simply on a per row set considering only one sample at a time. One way to improve this would be to first go through the whole
At least I'd suggest to implement a warning if none of the format/dayfirst parameters are used.
Here's a screenshot to clarify:
http://snag.gy/mXT7g.jpg
output of pd.show_versions()
INSTALLED VERSIONS
commit: None python: 3.5.1.final.0 python-bits: 64 OS: Darwin OS-release: 15.3.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: de_DE.UTF-8
pandas: 0.17.1 nose: 1.3.7 pip: 8.0.2 setuptools: 19.6.2 Cython: 0.23.4 numpy: 1.10.4 scipy: 0.17.0 statsmodels: 0.6.1 IPython: 4.1.1 sphinx: 1.3.5 patsy: 0.4.0 dateutil: 2.4.2 pytz: 2015.7 blosc: None bottleneck: 1.0.0 tables: 3.2.2 numexpr: 2.4.6 matplotlib: 1.5.1 openpyxl: 2.3.2 xlrd: 0.9.4 xlwt: 1.0.0 xlsxwriter: 0.8.4 lxml: 3.5.0 bs4: 4.4.1 html5lib: None httplib2: None apiclient: None sqlalchemy: 1.0.11 pymysql: None psycopg2: None Jinja2: 2.8
Comment From: jreback
pls show a copy-pastable example, not a screen shot.
Comment From: marcomayer
Sure, you can use this:
data = pd.Series(['29.02.2016 03:30:00', '29.02.2016 04:00:00', '01.03.2016 03:30:00', '01.03.2016 04:00:00'])
pd.to_datetime(data)
will make difference choices regarding day or month first:
0 2016-02-29 03:30:00
1 2016-02-29 04:00:00
2 2016-01-03 03:30:00
3 2016-01-03 04:00:00
dtype: datetime64[ns]
Comment From: jorisvandenbossche
This is a known issue with to_datetime
, which is caused by the (too much) flexibility of dateutil
. You can 'solve' it by providing a format
argument.
See also https://github.com/pydata/pandas/issues/7348
Comment From: jreback
Well you have an option to control this, dayfirst
. Having mixed day-first, not day-first data will require a pre-processing step.
In [18]: pd.to_datetime(data, dayfirst=True)
Out[18]:
0 2016-02-29 03:30:00
1 2016-02-29 04:00:00
2 2016-03-01 03:30:00
3 2016-03-01 04:00:00
dtype: datetime64[ns]
In [19]: pd.to_datetime(data, dayfirst=False)
Out[19]:
0 2016-02-29 03:30:00
1 2016-02-29 04:00:00
2 2016-01-03 03:30:00
3 2016-01-03 04:00:00
dtype: datetime64[ns]
Comment From: jorisvandenbossche
@marcomayer But I agree that this can be very confusing!
Note that using dayfirst
is also not strict: https://github.com/pydata/pandas/issues/3341
Comment From: marcomayer
yes, a warning or so might be something to consider I think.
The way I work is that I read in millions of records, then convert the dates etc. and usually just do a head() to check if things are fine. My guess is that others do the same thing...
In a perfect world, it would check if there is any ambiguity and if so at least warn or just stop and require format.