I'm sorry if this is the expected behavior, but I think this is odd and may be an issue.
Parsing the dates automatically using read_csv (with the parse_dates and date_parser keywords) is actually slower than reading the file without parsing any dates, manually putting the date columns together, and then using the pandas.to_datetime function on the newly created column to parse the dates.
Basically (assuming you have date and hour in two different columns in the csv) doing this
df = pd.read_csv("..\\file.csv", parse_dates = [['YYYYMMDD', 'HH']],
index_col = 0,
date_parser=parse)
is 3 to 5 times slower than this
df = pd.read_csv("..\\file.csv")
format = "%Y%m%d %H"
times = pd.to_datetime(df.YYYYMMDD + ' ' + df.HH, format=format)
df.set_index(times, inplace=True)
df = df.drop(['YYYYMMDD','HH'], axis=1)
I first found this while comparing the two answers to this question and thought it was odd, but figured it might be due to some feature of that dataset. However, while the person answering reported about a 3x difference between the two approaches, when I ran the comparison myself the difference was about 5x.
This seems like an issue to me, since using the function's built-in options should be faster, or at least take the same time.
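For reference, a rough sketch of the kind of timing comparison described above (the file name and the string dtypes of the YYYYMMDD and HH columns are assumptions, not the exact benchmark):
import time
import pandas as pd

# time read_csv doing the date parsing itself
start = time.perf_counter()
pd.read_csv("file.csv", parse_dates=[['YYYYMMDD', 'HH']], index_col=0)
t_parse_dates = time.perf_counter() - start

# time reading without parsing, then combining and parsing the columns manually
start = time.perf_counter()
df = pd.read_csv("file.csv")
pd.to_datetime(df.YYYYMMDD + ' ' + df.HH, format="%Y%m%d %H")
t_manual = time.perf_counter() - start

print(t_parse_dates / t_manual)  # reported above as roughly 3x to 5x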
Comment From: jreback
not sure what gave you that impression. this is purely a convenience option.
to_datetime has many more options, some of which allow hints to make parsing much faster. read_csv has a very limited set of options by definition. of course YMMV.
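For example, a minimal sketch of the kind of hints meant here (the sample strings are made up):
import pandas as pd

s = pd.Series(['20150204 10', '20150204 11'])
# an explicit format skips per-element format inference
times = pd.to_datetime(s, format='%Y%m%d %H')
# alternatively, infer_datetime_format=True lets pandas guess the format once
times = pd.to_datetime(s, infer_datetime_format=True)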
Comment From: jreback
http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#assembling-datetimes provides the speediest way to assemble a DatetimeIndex, but again it cannot be done inside read_csv without adding even more options.
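For reference, a minimal sketch of the assembly approach described at that link (the sample data is made up; the component column names are the ones that API expects):
import pandas as pd

# pd.to_datetime can assemble datetimes from component columns
# named year/month/day/hour/... (pandas >= 0.18.1)
df = pd.DataFrame({'year': [2015, 2015],
                   'month': [2, 2],
                   'day': [4, 4],
                   'hour': [10, 11]})
times = pd.to_datetime(df[['year', 'month', 'day', 'hour']])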
Comment From: tomchor
I see. I assumed doing it within read_csv had to be faster. Sorry for the ignorance! And thank you for the information.