Code Sample, a copy-pastable example if possible
@dsm054 noticed that business-day (`B`) `date_range` generation is slow relative to daily (`D`). These two produce equivalent results:
```python
In [23]: %timeit pd.date_range("1956-01-31", "2017-05-16", freq="B")
278 ms ± 4.76 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [24]: %timeit i2 = pd.date_range("1956-01-31", "2017-05-16", freq="D"); i2 = i2[i2.dayofweek < 5]
1.44 ms ± 39.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
If we go this route, we'll have to special-case the handling when `periods` is passed, expanding it by enough to cover the weekends that get masked out. But that's a pretty nice performance boost for a pretty common operation.
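A minimal sketch of how that `periods` special case could work, over-generating daily dates and then trimming (the helper name and the exact buffer size here are mine, not a pandas API):

```python
import math

import pandas as pd

def bdate_range_fast(start, periods):
    """Hypothetical fastpath for freq="B" when `periods` is given:
    over-generate daily dates, mask out weekends, then trim.
    ceil(periods * 7 / 5) daily dates plus a couple of days of slack
    is always enough to contain `periods` weekdays."""
    n_daily = math.ceil(periods * 7 / 5) + 2
    i = pd.date_range(start, periods=n_daily, freq="D")
    return i[i.dayofweek < 5][:periods]
```

This matches `pd.bdate_range(start, periods=periods)` even when `start` falls on a weekend, since the first kept date rolls forward to the next weekday just as the `B` offset does.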
Comment From: dsm054
pd.date_range is just a convenience wrapper around DatetimeIndex, so the same problem affects resamples to B as well, because building the binner calls into DatetimeIndex along the way. There will be lots of secondary benefits as well.
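As a concrete illustration of the kind of resample affected (a tiny sketch with made-up data; the binner built internally goes through the same DatetimeIndex path):

```python
import numpy as np
import pandas as pd

# Daily data starting on a Monday; resampling to business days builds
# a business-day binner via the same date_range machinery.
s = pd.Series(np.arange(10.0),
              index=pd.date_range("2017-01-02", periods=10, freq="D"))
b = s.resample("B").sum()  # weekend values fold into the preceding Friday bin
```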
Comment From: chris-b1
xref: #11214, a slightly more generic issue. In general, any kind of bin generation coarser than D is not performant right now.
Comment From: dsm054
@chris-b1: wow, when you're right, you're right.
This is all kinds of crazysauce.
```python
In [58]: %timeit z = pd.DatetimeIndex(start='1956-01-31', end='2017-05-16', freq='3min')
10 loops, best of 3: 40.4 ms per loop

In [59]: %timeit z = pd.DatetimeIndex(start='1956-01-31', end='2017-05-16', freq='M')
10 loops, best of 3: 34.9 ms per loop

In [60]: len(pd.DatetimeIndex(start='1956-01-31', end='2017-05-16', freq='3min'))
Out[60]: 10745281

In [61]: len(pd.DatetimeIndex(start='1956-01-31', end='2017-05-16', freq='M'))
Out[61]: 736
```
That's a wild ratio.
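The same ratio reproduces with pd.date_range (the `pd.DatetimeIndex(start=..., end=..., freq=...)` constructor form shown above was later deprecated in favor of date_range):

```python
import pandas as pd

fine = pd.date_range("1956-01-31", "2017-05-16", freq="3min")
coarse = pd.date_range("1956-01-31", "2017-05-16", freq="M")

# ~10.7 million elements vs 736, yet similar construction cost above
assert len(fine) == 10745281
assert len(coarse) == 736
```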
I think it should be possible to provide a vectorized (not even cythonized) version for each of the some-filter-on-daily offsets, both anchored and not, without too much trouble. But even if not, we should at least fix the big ones. I grepped through just one of my codebases and found hundreds of D->B resamples on either Series or few-column frames where this is apparently the bottleneck. :-/
Comment From: rohanp
I can try giving this a go!
Comment From: jreback
@chris-b1 this is actually not that surprising: all this is doing is an np.arange under the hood.
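For fixed (tick) frequencies like "3min" this is easy to see; a sketch building the same index directly from an int64 nanosecond np.arange:

```python
import numpy as np
import pandas as pd

# A tick frequency like "3min" is just evenly spaced int64 nanoseconds,
# so the index can be built from a plain np.arange.
start = pd.Timestamp("1956-01-31").value   # nanoseconds since the epoch
end = pd.Timestamp("1956-02-02").value
step = 3 * 60 * 1_000_000_000              # 3 minutes in nanoseconds
manual = pd.DatetimeIndex(np.arange(start, end + 1, step))

assert (manual == pd.date_range("1956-01-31", "1956-02-02", freq="3min")).all()
```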
Comment From: rohanp
From what I can tell, cythonizing `generate_range` from offsets.py would solve this problem?
Comment From: dsm054
I think that's the right place to look, but I don't know how much of the slowness is due to the Python-level looping in generate_range versus the Python-level operations in the offset's .apply that get called. I suspect we're only getting a few milliseconds of overhead from the Python generator itself, in which case we'd need to cythonize a lot of other code too to see benefits.
Whether that's a better direction than just taking advantage of preexisting fast code, I don't know.
Comment From: chris-b1
Yeah @dsm054 is right, there is a whole chain of pure-python stuff behind the offsets, so not that much benefit to cythonizing the existing code without a major rewrite.
Much easier would be to define some fastpaths for "easy" offsets, using existing vectorized datetime methods like suggested at the top.
```python
i = pd.date_range('2010', '2015', freq='D')

bday = i[i.dayofweek < 5]
month_end = i[i.is_month_end]
month_begin = i[i.is_month_start]  # etc.
```
The hardest part is probably getting the periods/start/end logic correct.
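A quick sanity check of those masks against the offsets' definitions (just a sketch; a real fastpath would also need the periods handling to be right):

```python
import pandas as pd

# Daily grid over six full years, then mask down to each offset.
i = pd.date_range("2010-01-01", "2015-12-31", freq="D")
bday = i[i.dayofweek < 5]
month_end = i[i.is_month_end]
month_begin = i[i.is_month_start]

# Every kept date satisfies the offset's definition.
assert (bday.dayofweek < 5).all()
assert (month_begin.day == 1).all()
assert ((month_end + pd.Timedelta(days=1)).day == 1).all()
assert len(month_end) == len(month_begin) == 72  # 6 years x 12 months
```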
Comment From: dsm054
@rohanp: are you still interested in looking at this even though it may not involve cython? ;-) If not, I have a prototype which looks promising although my testing of it has been pretty shallow so far (and as usual, it'll probably take longer to test than to implement..)
Comment From: rohanp
I am still interested, but feel free to share your prototype.
Comment From: jbrockmendel
generate_range and the relevant DateOffset.apply code have all been moved into cython, so some of this has improved. Generating with date_range(..., freq="D") and then masking as in the OP is fine for "B", where you're not discarding much, but it would get expensive as you discard more. Instead, I think we can now use period_range:
```python
In [3]: %timeit z = pd.date_range(start='1956-01-31', end='2017-05-16', freq='M')
5.46 ms ± 67.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [4]: %timeit pd.period_range(start="1956-01-31", end="2017-05-16", freq="M").to_timestamp()
839 µs ± 7.72 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```
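One wrinkle worth noting: to_timestamp() defaults to period *starts*, while freq="M" means month ends, and the last period can overshoot `end`, so matching the masked-daily result exactly takes a little care. A sketch:

```python
import pandas as pd

start, end = "1956-01-31", "2017-05-16"

# to_timestamp(how="end") lands on the last instant of each month;
# normalize() drops the 23:59:59.999999999 time-of-day component.
fast = pd.period_range(start, end, freq="M").to_timestamp(how="end").normalize()
# the final period (2017-05) overshoots `end`, so trim to the window
fast = fast[fast <= pd.Timestamp(end)]

# should match the month ends of the equivalent daily range
daily = pd.date_range(start, end, freq="D")
assert (fast == daily[daily.is_month_end]).all()
```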