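Caching (memoizing) the date_parser function in read_csv might be an easy performance improvement. As far as I can tell its results are not cached, unless I am missing something. In the session below, parsing a column of 1,000,000 identical timestamps roughly doubles the read time (1.58 s vs. 774 ms), while mapping the same column through a one-entry dict takes only about 99 ms.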
In [55]: df = pd.DataFrame([datetime.datetime.today()] * 1000000)
In [56]: df.to_csv('j', index=False)
In [57]: !gzip j
In [58]: %time df = pd.read_csv('j.gz')
CPU times: user 703 ms, sys: 68.7 ms, total: 772 ms
Wall time: 774 ms
In [59]: d = {df['0'][0]: datetime.datetime.today()}
In [60]: %time s = df['0'].map(d)
CPU times: user 84.8 ms, sys: 14.8 ms, total: 99.6 ms
Wall time: 99.2 ms
In [61]: %time df = pd.read_csv('j.gz', parse_dates=['0'])
CPU times: user 1.49 s, sys: 88.7 ms, total: 1.58 s
Wall time: 1.58 s
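To make the idea concrete, here is a minimal sketch of the kind of memoization I have in mind: parse each distinct string once and reuse the result for repeats. This is plain Python, not pandas internals; `parse_column_memoized` and its `parser` argument are hypothetical names, and `datetime.datetime.fromisoformat` only stands in for whatever date_parser would be used.

```python
import datetime


def parse_column_memoized(values, parser=datetime.datetime.fromisoformat):
    """Parse each distinct string once and reuse the result for repeats.

    Hypothetical helper for illustration only; `parser` stands in for an
    expensive date_parser callable.
    """
    cache = {}   # maps raw string -> parsed datetime
    parsed = []
    for text in values:
        if text not in cache:
            # Only the first occurrence of a string pays the parsing cost.
            cache[text] = parser(text)
        parsed.append(cache[text])
    return parsed


# Usage: a column dominated by repeated values triggers very few parser calls.
column = ["2016-03-01 12:00:00.000001"] * 5 + ["2016-03-02 00:00:00"]
result = parse_column_memoized(column)
```

The gain depends entirely on how many duplicates the column holds. With mostly unique values the dict lookups are pure overhead, so a real implementation would probably want to check for duplicates before enabling the cache.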
Comment From: chris-b1
Yep, caching certainly could help. This is a duplicate of #11665. PR welcome!