xref #12060
Comment From: jreback
if someone would run the benchmarks with this turned on we could prove whether this affects anything.....
Comment From: jreback
@chris-b1 can you do a PR for this?
and see if any of the asv timings change?
Comment From: chris-b1
@jreback - here's the general tradeoff - adds some overhead (large relative, small absolute) to parsing iso8601, and of course speeds up non-iso8601 a lot. Does that seem worth it?
In [35]: s = pd.date_range('1900-1-1', periods=1000).strftime('%Y-%m-%d')
In [36]: %timeit pd.to_datetime(s)
1000 loops, best of 3: 316 µs per loop
In [37]: %timeit pd.to_datetime(s, infer_datetime_format=True)
1000 loops, best of 3: 642 µs per loop
In [38]: s_noniso = pd.date_range('1900-1-1', periods=1000).strftime('%m/%d/%Y')
In [42]: %timeit pd.to_datetime(s_noniso)
10 loops, best of 3: 79.9 ms per loop
In [43]: %timeit pd.to_datetime(s_noniso, infer_datetime_format=True)
100 loops, best of 3: 3.82 ms per loop
Comment From: jreback
so is the iso parsing overhead just a couple of function calls? can u update with 10x (or 100) size as well to see how this scales
Comment From: chris-b1
Below is the same test at different sizes. diff is the increase (decrease) in time; ratio is (infer) / (don't infer).
Infer only looks at the first element, so not completely sure why the overhead isn't flat.
n | iso_diff | iso_ratio | noniso_diff | noniso_ratio |
---|---|---|---|---|
1 | 0.000282747 | 4.23485 | 0.00024401 | 2.40649 |
100 | 0.000324795 | 4.44211 | -0.00697929 | 0.0946573 |
1000 | 0.000332079 | 2.02872 | -0.0772536 | 0.0462967 |
100000 | 0.00309632 | 1.11218 | -7.62098 | 0.0440372 |
1000000 | 0.0303454 | 1.10112 | -82.5801 | 0.0421022 |
Comment From: jreback
would it be dumb to change the default to None, then set it based on the length of the passed array? e.g. say <= 100 -> False, > 100 -> True (of course if it's explicitly passed then just use that).
This way you get the perf benefit without overhead on smaller samples.
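A minimal sketch of that idea in plain Python, assuming a hypothetical helper named resolve_infer_flag and a cutoff of 100 taken from the numbers above. Neither the helper nor the constant exists in pandas; this only illustrates the None-means-decide-by-length logic.

```python
from typing import Optional, Sequence

# Hypothetical cutoff, taken from the "<= 100 -> False, > 100 -> True" suggestion.
INFER_THRESHOLD = 100


def resolve_infer_flag(
    values: Sequence[str],
    infer_datetime_format: Optional[bool] = None,
    threshold: int = INFER_THRESHOLD,
) -> bool:
    """Decide whether to infer the format when the caller did not say."""
    # An explicit True/False from the caller always wins.
    if infer_datetime_format is not None:
        return infer_datetime_format
    # Otherwise only pay the inference overhead on larger inputs.
    return len(values) > threshold


print(resolve_infer_flag(["2015-01-01"] * 50))    # small input, no inference
print(resolve_infer_flag(["2015-01-01"] * 5000))  # large input, inference on
```

The explicit check against None matters: a caller passing False on a large array should still get False.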
Comment From: jreback
@chris-b1 so what do you think?
Comment From: chris-b1
It didn't seem unreasonable - but I got to thinking - there are actually cases with ambiguous dates where False and True don't do the same thing (e.g. below). So now I'm not sure what to do.
In [3]: pd.to_datetime(['24/1/2015', '1/2/2015'], infer_datetime_format=True)
Out[3]: DatetimeIndex(['2015-01-24', '2015-02-01'], dtype='datetime64[ns]', freq=None)
In [4]: pd.to_datetime(['24/1/2015', '1/2/2015'])
Out[4]: DatetimeIndex(['2015-01-24', '2015-01-02'], dtype='datetime64[ns]', freq=None)
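The difference can be reproduced with the standard library alone. The sketch below is not how pandas parses dates internally; it assumes two hypothetical strategies (try month-first then day-first for each element, versus pick one format from the first element and apply it to all) purely to show why the two modes can disagree on '1/2/2015'.

```python
from datetime import datetime

MONTH_FIRST = "%m/%d/%Y"
DAY_FIRST = "%d/%m/%Y"


def _try(value, fmt):
    """Return a datetime if value matches fmt, else None."""
    try:
        return datetime.strptime(value, fmt)
    except ValueError:
        return None


def parse_each(values):
    """Per-element parsing: prefer month-first, fall back to day-first."""
    out = []
    for v in values:
        parsed = _try(v, MONTH_FIRST)
        if parsed is None:
            parsed = _try(v, DAY_FIRST)
        if parsed is None:
            raise ValueError(f"cannot parse {v!r}")
        out.append(parsed.date().isoformat())
    return out


def parse_with_first_format(values):
    """Pick the first format that fits values[0], then apply it to everything."""
    if not values:
        return []
    fmt = None
    for candidate in (MONTH_FIRST, DAY_FIRST):
        if _try(values[0], candidate) is not None:
            fmt = candidate
            break
    if fmt is None:
        raise ValueError(f"cannot infer a format from {values[0]!r}")
    return [datetime.strptime(v, fmt).date().isoformat() for v in values]


dates = ["24/1/2015", "1/2/2015"]
print(parse_each(dates))               # ['2015-01-24', '2015-01-02']
print(parse_with_first_format(dates))  # ['2015-01-24', '2015-02-01']
```

Because '24/1/2015' can only be day-first, the single-format strategy reads '1/2/2015' as 1 February, while the per-element strategy reads it as 2 January.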
Comment From: jreback
ahh, so the yearfirst & dayfirst may not be respected. So I think these need to be of 'standard' format (e.g. dayfirst=False and yearfirst=False) or we should raise if infer_datetime_format=True as these are incompatible (or just make it False)
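One reading of the "raise" option, sketched as a standalone check. The function name check_infer_compatible is hypothetical and this is not pandas code; it only assumes that inference should be rejected whenever dayfirst or yearfirst is set.

```python
def check_infer_compatible(infer_datetime_format, dayfirst=False, yearfirst=False):
    """Reject format inference combined with non-default ordering hints."""
    # Assumption: only the 'standard' ordering (both flags False) is safe to infer.
    if infer_datetime_format and (dayfirst or yearfirst):
        raise ValueError(
            "infer_datetime_format=True cannot be combined with "
            "dayfirst=True or yearfirst=True"
        )


check_infer_compatible(True)                   # fine: default ordering
check_infer_compatible(False, dayfirst=True)   # fine: no inference requested
```

The alternative mentioned above, silently switching inference off, would replace the raise with returning False.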
Comment From: MarcoGorelli
as of PDEP4, this parameter is deprecated and a stricter version is the default - closing then