Pandas PERF: parse non-ambiguous date formats in c

Date parsing is slow on dates of the form "%m.%d.%Y" compared to "%Y-%m-%d". I am using 0.17.1 so maybe it's improved in 0.18 but I didn't see anything about it. I provide a benchmark below to compare the two datetime formats.

Some observations: - Letting pandas figure it out is unacceptably slow. In small files it won't make much of a difference but it became a massive bottleneck in an application. - In my particular use case, I know each timestamp in my file will have the same date so it is very easy to do a string replace. And this turns out to be by far the fastest option here. Even with the actual string replacement time. - Regex is slow. I'm sure that would have been obvious but I tried it out for comparison. - infer_datetime_format is the next best option but it appears to be a massive slowdown. Compare the relative times on the large file to the relative times of the pure read on the small file and date parsing with infer_datetime_format. - A custom date parser doesn't help. Not sure why it doesn't help. Maybe it's not compiled?

Code Sample

test.csv (1 million lines)

2016-03-12, 1
2016-03-12, 1
2016-03-12, 1
2016-03-12, 1
...

test2.csv (5000 lines)

03.12.2016, 1
03.12.2016, 1
03.12.2016, 1
03.12.2016, 1
...

import pandas as pd
from cStringIO import StringIO
import datetime
import re
print "Large file: pure read"
%timeit pd.read_csv('test.csv')

print "Large file: parse dates"
%timeit pd.read_csv('test.csv', parse_dates=[0])

print "Small file: pure read"
%timeit pd.read_csv('test2.csv')

print "Small file: parse dates"
%timeit pd.read_csv('test2.csv', parse_dates=[0])

print "Small file: parse dates with infer datetime format"
%timeit pd.read_csv('test2.csv', parse_dates=[0], infer_datetime_format=True)

print "Small file: string replace then parse dates "
def test_replace():
    s = file('test2.csv', 'r').read()
    s = s.replace('03.12.2016', '2016-03-12')
    pd.read_csv(StringIO(s), parse_dates=[0])
%timeit test_replace()

print "Small file: parse dates with custom date parser"
date_parser=lambda t: datetime.datetime.strptime(t, "%m.%d.%Y")
%timeit pd.read_csv('test2.csv', parse_dates=[0], date_parser=date_parser)

print "Small file: regex sub then parse dates"
def test_re_sub():
    s = file('test2.csv', 'r').read()
    p = re.compile('(?P<month> \d{2}).(?P<day> \d{2}).(?P<year> \d{4})')
    s = p.sub('\g<year>-\g<month>-\g<day>', s)
    pd.read_csv(StringIO(s), parse_dates=[0])
%timeit test_re_sub()

Output (2012 MacBook Air)

Large file: pure read
1 loop, best of 3: 253 ms per loop
Large file: parse dates
1 loop, best of 3: 790 ms per loop

Small file: pure read
100 loops, best of 3: 2.9 ms per loop
Small file: parse dates
1 loop, best of 3: 1.22 s per loop
Small file: parse dates with infer datetime format
10 loops, best of 3: 32.7 ms per loop
Small file: string replace then parse dates 
100 loops, best of 3: 4.97 ms per loop
Small file: parse dates with custom date parser
10 loops, best of 3: 90.6 ms per loop
Small file: regex sub then parse dates
1 loop, best of 3: 1.08 s per loop

output of `pd.show_versions()`

>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 14.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.17.1
nose: 1.3.7
pip: 8.1.0
setuptools: 20.2.2
Cython: 0.23.4
numpy: 1.10.4
scipy: 0.17.0
statsmodels: None
IPython: 4.1.2
sphinx: 1.3.6
patsy: None
dateutil: 2.5.1
pytz: 2016.1
blosc: None
bottleneck: None
tables: None
numexpr: 2.4.6
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
Jinja2: None

Comment From: jreback

See the 0.18.0 perf improvements

In [16]: pd.__version__
Out[16]: u'0.18.0'

In [7]: s = pd.concat([Series(pd.date_range('20150101',periods=1000))]*1000)

In [8]: len(s)
Out[8]: 1000000

In [9]: s1 = s.astype(str)

In [10]: s1 = s.dt.strftime('%Y-%m-%d')

In [11]: s2 = s.dt.strftime('%Y.%m.%d')

In [12]: s1.head()
Out[12]: 
0    2015-01-01
1    2015-01-02
2    2015-01-03
3    2015-01-04
4    2015-01-05
dtype: object

In [13]: s2.head()
Out[13]: 
0    2015.01.01
1    2015.01.02
2    2015.01.03
3    2015.01.04
4    2015.01.05
dtype: object

In [14]: %timeit pd.to_datetime(s1)
10 loops, best of 3: 170 ms per loop

In [15]: %timeit pd.to_datetime(s2)
10 loops, best of 3: 175 ms per loop

Comment From: jreback

note if this were implemented: https://github.com/pydata/pandas/issues/11665

this would easily be 10x improvement here as there are lots of repeats.

pull-requests on that are welcome!

Comment From: jreback

I realize you actualy want %m.%d.%Y. This is not an ISO8601 format. So not really sure what/if anything could be done here. (The reason its not is its ambiguous to dayfirst)

Comment From: jreback

In [38]: s3 = s.dt.strftime('%m.%d.%Y')
s3.head()

In [39]: s3.head()
Out[39]: 
0    01.01.2015
1    01.02.2015
2    01.03.2015
3    01.04.2015
4    01.05.2015
dtype: object

# this may work for you
In [36]: %timeit pd.to_datetime(s3.str[6:10] + s3.str[0:2] + s3.str[3:5])
1 loop, best of 3: 1.24 s per loop

In [37]: %timeit pd.to_datetime(s3,format='%m.%d.%Y')
1 loop, best of 3: 3.35 s per loop

Comment From: jreback

btw, even though parse_dates does dispatch to pd.to_datetime, its meant for the simplest cases. Any heavy lifing you should do after with pd.to_datetime, which provides a bunch more options.

Comment From: jreback

so there was a bunch of discussion on #12585 on how this could be fixed. The primary way would be to allow additional c-code to parse some other non-ambiguous formats.

Comment From: apapanico

Jeff, thanks for taking the time to look at this.

Honestly, if this is a non ISO format then you guys should probably not bother supporting it.

It just so happens that our data storage ends up dumping the timestamps in this format. Luckily for me, the simple string replace method works incredibly well when I read in the csv string.

I brought this up because I just thought there might be something out there that can create a kind of custom pre-compiled format that you can define (like I tried to do in my date parser example) and I think this is what you're getting at in your last comment.

If someone takes this on and looks to make a more generalizeable date parser, then great. But like I said, if it's a non standard format, it should probably be ignored for the sake of everyone's sanity.

Alex On Mar 18, 2016 12:51 PM, "Jeff Reback" notifications@github.com wrote:

so there was a bunch of discussion on #12585 https://github.com/pydata/pandas/issues/12585 on how this could be fixed. The primary way would be to allow additional c-code to parse some other non-ambiguous formats.

— You are receiving this because you authored the thread. Reply to this email directly or view it on GitHub https://github.com/pydata/pandas/issues/12667#issuecomment-198516285

Comment From: apapanico

I should clarify, I think the support in pandas provided for messy/non standard formats is great. I love it. I meant with respect performance for those formats. On Mar 18, 2016 12:59 PM, "Alex Papanicolaou" alex.papanic@gmail.com wrote:

Jeff, thanks for taking the time to look at this.

Honestly, if this is a non ISO format then you guys should probably not bother supporting it.

It just so happens that our data storage ends up dumping the timestamps in this format. Luckily for me, the simple string replace method works incredibly well when I read in the csv string.

I brought this up because I just thought there might be something out there that can create a kind of custom pre-compiled format that you can define (like I tried to do in my date parser example) and I think this is what you're getting at in your last comment.

If someone takes this on and looks to make a more generalizeable date parser, then great. But like I said, if it's a non standard format, it should probably be ignored for the sake of everyone's sanity.

Alex On Mar 18, 2016 12:51 PM, "Jeff Reback" notifications@github.com wrote:

so there was a bunch of discussion on #12585 https://github.com/pydata/pandas/issues/12585 on how this could be fixed. The primary way would be to allow additional c-code to parse some other non-ambiguous formats.

— You are receiving this because you authored the thread. Reply to this email directly or view it on GitHub https://github.com/pydata/pandas/issues/12667#issuecomment-198516285

Comment From: jbrockmendel

FWIW format caching is pretty high on the python-dateutil todo list

Comment From: MarcoGorelli

caching is now supported, and this format isn't ISO8601, so I think we can close this

thanks for the suggestion!

Pandas PERF: parse non-ambiguous date formats in c

Code Sample

Output (2012 MacBook Air)

output of pd.show_versions()

output of `pd.show_versions()`