I ran in to this bug while trying to parse the few dates through parse_dates
of pandas.read_csv()
. In the following code snippet, I'm trying to parse dates that have format dd/mm/yy
which is resulting me an improper conversion. For some cases, the date field is considered as month and vice versa.
To keep it simple, for some cases dd/mm/yy
get converted to yyyy-dd-mm
instead of yyyy-mm-dd
.
Case 1:
04/10/96 is parsed as 1996-04-10, which is wrong.
Case 2:
15/07/97 is parsed as 1997-07-15, which is correct.
Case 3:
10/12/97 is parsed as 1997-10-12, which is wrong.
Code Sample
import pandas as pd
df = pd.read_csv('date_time.csv')
print 'Data in csv:'
print df
print df['start_date'].dtypes
print '----------------------------------------------'
df = pd.read_csv('date_time.csv', parse_dates = ['start_date'])
print 'Data after parsing:'
print df
print df['start_date'].dtypes
Current Output
----------------------
Data in csv:
----------------------
start_date
0 04/10/96
1 15/07/97
2 10/12/97
3 06/03/99
4 //1994
5 /02/1967
object
----------------------
Data after parsing:
----------------------
start_date
0 1996-04-10
1 1997-07-15
2 1997-10-12
3 1999-06-03
4 1994-01-01
5 1967-02-01
datetime64[ns]
Expected Output
----------------------
Data in csv:
----------------------
start_date
0 04/10/96
1 15/07/97
2 10/12/97
3 06/03/99
4 //1994
5 /02/1967
object
----------------------
Data after parsing:
----------------------
start_date
0 1996-10-04
1 1997-07-15
2 1997-12-10
3 1999-03-06
4 1994-01-01
5 1967-02-01
datetime64[ns]
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None python: 2.7.11.final.0 python-bits: 64 OS: Windows OS-release: 8.1 machine: AMD64 processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel byteorder: little LC_ALL: None LANG: en_US
pandas: 0.17.1 nose: 1.3.7 pip: 7.1.2 setuptools: 18.5 Cython: 0.23.4 numpy: 1.10.1 scipy: 0.16.0 statsmodels: None IPython: 4.0.1 sphinx: 1.3.1 patsy: 0.4.0 dateutil: 2.4.2 pytz: 2015.7 blosc: None bottleneck: 1.0.0 tables: 3.2.2 numexpr: 2.4.4 matplotlib: 1.5.0 openpyxl: 2.2.6 xlrd: 0.9.4 xlwt: 1.0.0 xlsxwriter: 0.7.7 lxml: 3.4.4 bs4: 4.4.1 html5lib: None httplib2: None apiclient: None sqlalchemy: 1.0.9 pymysql: None psycopg2: None Jinja2: None
More Comments:
I could use date_parser
or pandas.to_datetime()
to specify the proper format for date. But in my case, I have few date fields like ['//1997', '/02/1967'] for which I need to convert ['01/01/1997','01/02/1967']. The parse_dates
helps me converting those type of date fields to the expected format without making me to write extra line of code.
Comment From: jorisvandenbossche
I suppose this related to the SO question: http://stackoverflow.com/questions/36998540/pd-read-csv-not-correctly-parsing-date-month-field-when-set-parse-date-colum
As was commented over there by ajcr
, this is not a bug. The date format you are using ("dd/mm/yy") is not an international standard format (but a European one), and the default of the parser is to try to parse such dates as "mm/dd/yy" (more an American format I think). However, if this gives an error, it will try the other format. This is the reason you get inconsistent results between the rows in the csv.
You can pass dayfirst=True
to to_datetime
, and then the preference will be given to "dd/mm/yy" instead of to "mm/dd/yy".
Typically you can also use format="%d/%m/%y"
, however (as you note) this will not work with the "//1994" dates (since this is another format).