Pandas PERF: consider changing default of infer_datetime_format to True

xref #12060

Comment From: jreback

If someone would run the benchmarks with this turned on, we could see whether it affects anything.

Comment From: jreback

@chris-b1 can you do a PR for this?

and see if any of the asv timings change?

Comment From: chris-b1

@jreback here's the general tradeoff: it adds some overhead (large in relative terms, small in absolute terms) to parsing ISO 8601 strings, and of course speeds up non-ISO 8601 parsing a lot. Does that seem worth it?

In [35]: s = pd.date_range('1900-1-1', periods=1000).strftime('%Y-%m-%d')

In [36]: %timeit pd.to_datetime(s)
1000 loops, best of 3: 316 µs per loop

In [37]: %timeit pd.to_datetime(s, infer_datetime_format=True)
1000 loops, best of 3: 642 µs per loop

In [38]: s_noniso = pd.date_range('1900-1-1', periods=1000).strftime('%m/%d/%Y')

In [42]: %timeit pd.to_datetime(s_noniso)
10 loops, best of 3: 79.9 ms per loop

In [43]: %timeit pd.to_datetime(s_noniso, infer_datetime_format=True)
100 loops, best of 3: 3.82 ms per loop
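
For readers without a pandas checkout handy, the shape of this tradeoff can be reproduced with the standard library alone. This is only a sketch, not pandas's implementation: `CANDIDATE_FORMATS`, `guess_format`, `parse_inferred`, and `parse_per_element` are hypothetical names, and the candidate list is an assumption made for illustration.

```python
from datetime import date, datetime, timedelta
import timeit

# Hypothetical candidate list; real format inference is far more general.
CANDIDATE_FORMATS = ["%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y"]


def guess_format(sample):
    """Return the first candidate format that parses `sample`, else None."""
    for fmt in CANDIDATE_FORMATS:
        try:
            datetime.strptime(sample, fmt)
            return fmt
        except ValueError:
            continue
    return None


def parse_inferred(values):
    """Guess the format once from the first element, then reuse it for all."""
    fmt = guess_format(values[0])
    if fmt is None:
        raise ValueError("could not infer a format from the first element")
    return [datetime.strptime(v, fmt) for v in values]


def parse_per_element(values):
    """Try every candidate format again for each element (the slow path)."""
    out = []
    for v in values:
        for fmt in CANDIDATE_FORMATS:
            try:
                out.append(datetime.strptime(v, fmt))
                break
            except ValueError:
                continue
        else:
            raise ValueError(f"no candidate format matches {v!r}")
    return out


# Same shape of input as s_noniso above: 1000 consecutive days, month first.
start = date(1900, 1, 1)
s_noniso = [(start + timedelta(days=i)).strftime("%m/%d/%Y") for i in range(1000)]

for fn in (parse_per_element, parse_inferred):
    seconds = timeit.timeit(lambda: fn(s_noniso), number=5) / 5
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per loop")
```

The one-time guess is a fixed cost, which is why it would show up as relative overhead on input that already has a fast path, while input that otherwise needs a per-element fallback benefits from it.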

Comment From: jreback

So is the ISO parsing overhead just a couple of function calls? Can you update with 10x (or 100x) the size as well, to see how this scales?

Comment From: chris-b1

Below is the same test at different sizes. diff is the increase (or decrease, if negative) in time with inference on; ratio is (infer) / (don't infer).

Inference only looks at the first element, so I'm not completely sure why the overhead isn't flat.

n	iso_diff	iso_ratio	noniso_diff	noniso_ratio
1	0.000282747	4.23485	0.00024401	2.40649
100	0.000324795	4.44211	-0.00697929	0.0946573
1000	0.000332079	2.02872	-0.0772536	0.0462967
100000	0.00309632	1.11218	-7.62098	0.0440372
1000000	0.0303454	1.10112	-82.5801	0.0421022
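
As a sanity check on how the columns are defined, here is a small sketch that derives diff and ratio from the n=1000 `%timeit` numbers posted earlier in the thread. It assumes the table's diff columns are in seconds, which the thread does not state; `compare` is a hypothetical helper name.

```python
def compare(no_infer_s, infer_s):
    """Return (diff, ratio): diff = infer - no_infer, ratio = infer / no_infer."""
    return infer_s - no_infer_s, infer_s / no_infer_s


# n=1000 timings from the earlier %timeit session, converted to seconds
# (the unit is an assumption; the table does not label it).
iso_diff, iso_ratio = compare(316e-6, 642e-6)
noniso_diff, noniso_ratio = compare(79.9e-3, 3.82e-3)

print(f"iso:    diff={iso_diff:.6f} s  ratio={iso_ratio:.3f}")
print(f"noniso: diff={noniso_diff:.6f} s  ratio={noniso_ratio:.4f}")
```

These land close to the n=1000 row, though not exactly on it, since the table comes from a separate run.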

Comment From: jreback

Would it be dumb to change the default to None, then set it based on the length of the passed array? E.g. <= 100 -> False, > 100 -> True. (Of course, if it's explicitly passed then just use that.)

This way you get the perf benefit without overhead on smaller samples.
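
A minimal sketch of that rule, to make the proposal concrete. `resolve_infer` and its `threshold` parameter are hypothetical names for illustration; nothing like this is claimed to exist in pandas.

```python
def resolve_infer(infer_datetime_format, n, threshold=100):
    """Decide whether to infer the format for an input of length n.

    An explicit True/False from the caller always wins; None means
    "choose based on size", using the 100-element cutoff suggested above.
    """
    if infer_datetime_format is not None:
        return infer_datetime_format
    return n > threshold


print(resolve_infer(None, 100))       # at the cutoff: do not infer
print(resolve_infer(None, 101))       # above the cutoff: infer
print(resolve_infer(False, 10**6))    # explicit value is respected
```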

Comment From: jreback

@chris-b1 so what do you think?

Comment From: chris-b1

It didn't seem unreasonable, but I got to thinking: there are actually cases with ambiguous dates where False and True don't do the same thing (e.g. below). So now I'm not sure what to do.


In [3]: pd.to_datetime(['24/1/2015', '1/2/2015'], infer_datetime_format=True)
Out[3]: DatetimeIndex(['2015-01-24', '2015-02-01'], dtype='datetime64[ns]', freq=None)

In [4]: pd.to_datetime(['24/1/2015', '1/2/2015'])
Out[4]: DatetimeIndex(['2015-01-24', '2015-01-02'], dtype='datetime64[ns]', freq=None)
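
The difference can be imitated with the standard library. This sketch assumes the non-inferring path parses each string independently, month first, and only falls back to day first when that fails; that is an approximation of the behaviour seen above, not pandas's actual code path. `parse_fixed` and `parse_month_first_with_fallback` are hypothetical names.

```python
from datetime import datetime


def parse_fixed(values, fmt):
    """Apply one format to every element, as after inferring from the first."""
    return [datetime.strptime(v, fmt) for v in values]


def parse_month_first_with_fallback(values):
    """Parse each element on its own: month first, day first only if needed."""
    out = []
    for v in values:
        try:
            out.append(datetime.strptime(v, "%m/%d/%Y"))
        except ValueError:
            # e.g. '24/1/2015': 24 cannot be a month, so retry as day first.
            out.append(datetime.strptime(v, "%d/%m/%Y"))
    return out


values = ["24/1/2015", "1/2/2015"]

# '24/1/2015' only fits day-first, so that format is used for everything.
inferred = parse_fixed(values, "%d/%m/%Y")
per_element = parse_month_first_with_fallback(values)

print([d.date().isoformat() for d in inferred])
print([d.date().isoformat() for d in per_element])
```

The second element is where the two disagree: a single inferred format reads it as 1 February, while per-element parsing reads it as 2 January.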

Comment From: jreback

Ahh, so yearfirst and dayfirst may not be respected. So I think these need to be of 'standard' format (e.g. dayfirst=False and yearfirst=False), or we should raise if infer_datetime_format=True, as these are incompatible (or just make it False).

Comment From: MarcoGorelli

As of PDEP4, this parameter is deprecated and a stricter version is the default, so closing.