Code Sample, a copy-pastable example if possible
@dsm054 noticed that business-day (`B`) `date_range` generation is slow relative to daily (`D`). These two produce equivalent results:
```python
In [23]: %timeit pd.date_range("1956-01-31", "2017-05-16", freq="B")
278 ms ± 4.76 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [24]: %timeit i2 = pd.date_range("1956-01-31", "2017-05-16", freq="D"); i2 = i2[i2.dayofweek < 5]
1.44 ms ± 39.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
If we go this route, we'll have to special-case the handling when `periods` is passed, expanding it by enough to cover the weekends that get masked out. But that's a pretty nice performance boost for a pretty common operation.
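A minimal sketch of how that `periods` special case could work, over-generating daily dates and then trimming (the helper name and the exact buffer size here are mine, not a pandas API):

```python
import math

import pandas as pd

def bdate_range_fast(start, periods):
    """Hypothetical fastpath for freq="B" when `periods` is given:
    over-generate daily dates, mask out weekends, then trim.
    ceil(periods * 7 / 5) daily dates plus a couple of days of slack
    is always enough to contain `periods` weekdays."""
    n_daily = math.ceil(periods * 7 / 5) + 2
    i = pd.date_range(start, periods=n_daily, freq="D")
    return i[i.dayofweek < 5][:periods]
```

This matches `pd.bdate_range(start, periods=periods)` even when `start` falls on a weekend, since the first kept date rolls forward to the next weekday just as the `B` offset does.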
Comment From: dsm054
pd.date_range is just a convenience wrapper around DatetimeIndex, so the same problem affects resamples to B as well, because building the binner calls into DatetimeIndex along the way. There will be lots of secondary benefits as well.
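As a concrete illustration of the kind of resample affected (a tiny sketch with made-up data; the binner built internally goes through the same DatetimeIndex path):

```python
import numpy as np
import pandas as pd

# Daily data starting on a Monday; resampling to business days builds
# a business-day binner via the same date_range machinery.
s = pd.Series(np.arange(10.0),
              index=pd.date_range("2017-01-02", periods=10, freq="D"))
b = s.resample("B").sum()  # weekend values fold into the preceding Friday bin
```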
Comment From: chris-b1
xref: #11214, a slightly more generic issue. In general, any kind of bin generation coarser than D is not performant right now.
Comment From: dsm054
@chris-b1: wow, when you're right, you're right.
This is all kinds of crazysauce.
```python
In [58]: %timeit z = pd.DatetimeIndex(start='1956-01-31', end='2017-05-16', freq='3min')
10 loops, best of 3: 40.4 ms per loop

In [59]: %timeit z = pd.DatetimeIndex(start='1956-01-31', end='2017-05-16', freq='M')
10 loops, best of 3: 34.9 ms per loop

In [60]: len(pd.DatetimeIndex(start='1956-01-31', end='2017-05-16', freq='3min'))
Out[60]: 10745281

In [61]: len(pd.DatetimeIndex(start='1956-01-31', end='2017-05-16', freq='M'))
Out[61]: 736
```
That's a wild ratio.
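The same ratio reproduces with pd.date_range (the `pd.DatetimeIndex(start=..., end=..., freq=...)` constructor form shown above was later deprecated in favor of date_range):

```python
import pandas as pd

fine = pd.date_range("1956-01-31", "2017-05-16", freq="3min")
coarse = pd.date_range("1956-01-31", "2017-05-16", freq="M")

# ~10.7 million elements vs 736, yet similar construction cost above
assert len(fine) == 10745281
assert len(coarse) == 736
```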
I think it should be possible to provide a vectorized (not even cythonized) version for each of the some-filter-on-daily offsets, both anchored and not, without too much trouble. But even if not, we should at least fix the big ones. I grepped through just one of my codebases and found hundreds of D->B resamples on either Series or few-column frames where this is apparently the bottleneck. :-/
Comment From: rohanp
I can try giving this a go!
Comment From: jreback
@chris-b1 this is actually not that surprising: all this is doing is an np.arange under the hood.
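For fixed (tick) frequencies like "3min" this is easy to see; a sketch building the same index directly from an int64 nanosecond np.arange:

```python
import numpy as np
import pandas as pd

# A tick frequency like "3min" is just evenly spaced int64 nanoseconds,
# so the index can be built from a plain np.arange.
start = pd.Timestamp("1956-01-31").value   # nanoseconds since the epoch
end = pd.Timestamp("1956-02-02").value
step = 3 * 60 * 1_000_000_000              # 3 minutes in nanoseconds
manual = pd.DatetimeIndex(np.arange(start, end + 1, step))

assert (manual == pd.date_range("1956-01-31", "1956-02-02", freq="3min")).all()
```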
Comment From: rohanp
From what I can tell, cythonizing `generate_range` from offsets.py would solve this problem?
Comment From: dsm054
I think that's the right place to look, but I don't know how much of the slowness is due to the Python-level looping in generate_range versus the Python-level operations in the offset's .apply that get called. I suspect we're only getting a few milliseconds of overhead from the Python generator itself, in which case we'd need to cythonize a lot of other code too to see benefits.
Whether that's a better direction than just taking advantage of preexisting fast code, I don't know.
Comment From: chris-b1
Yeah @dsm054 is right, there is a whole chain of pure-python stuff behind the offsets, so not that much benefit to cythonizing the existing code without a major rewrite.
Much easier would be to define some fastpaths for "easy" offsets, using existing vectorized datetime methods like suggested at the top.
```python
i = pd.date_range('2010', '2015', freq='D')

bday = i[i.dayofweek < 5]
month_end = i[i.is_month_end]
month_begin = i[i.is_month_start]  # etc.
```
The hardest part is probably getting the periods/start/end logic correct.
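A quick sanity check of those masks against the offsets' definitions (just a sketch; a real fastpath would also need the periods handling to be right):

```python
import pandas as pd

# Daily grid over six full years, then mask down to each offset.
i = pd.date_range("2010-01-01", "2015-12-31", freq="D")
bday = i[i.dayofweek < 5]
month_end = i[i.is_month_end]
month_begin = i[i.is_month_start]

# Every kept date satisfies the offset's definition.
assert (bday.dayofweek < 5).all()
assert (month_begin.day == 1).all()
assert ((month_end + pd.Timedelta(days=1)).day == 1).all()
assert len(month_end) == len(month_begin) == 72  # 6 years x 12 months
```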
Comment From: dsm054
@rohanp: are you still interested in looking at this even though it may not involve cython? ;-) If not, I have a prototype which looks promising although my testing of it has been pretty shallow so far (and as usual, it'll probably take longer to test than to implement..)
Comment From: rohanp
I am still interested, but feel free to share your prototype.
Comment From: jbrockmendel
generate_range and the relevant DateOffset.apply code have all been moved into cython, so some of this has improved. Generating with date_range(..., freq="D") and then masking as in the OP is fine for "B", where you're not discarding much, but it would get expensive as you discard more. Instead, I think we can now use period_range:
```python
In [3]: %timeit z = pd.date_range(start='1956-01-31', end='2017-05-16', freq='M')
5.46 ms ± 67.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [4]: %timeit pd.period_range(start="1956-01-31", end="2017-05-16", freq="M").to_timestamp()
839 µs ± 7.72 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```
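One wrinkle worth noting: to_timestamp() defaults to period *starts*, while freq="M" means month ends, and the last period can overshoot `end`, so matching the masked-daily result exactly takes a little care. A sketch:

```python
import pandas as pd

start, end = "1956-01-31", "2017-05-16"

# to_timestamp(how="end") lands on the last instant of each month;
# normalize() drops the 23:59:59.999999999 time-of-day component.
fast = pd.period_range(start, end, freq="M").to_timestamp(how="end").normalize()
# the final period (2017-05) overshoots `end`, so trim to the window
fast = fast[fast <= pd.Timestamp(end)]

# should match the month ends of the equivalent daily range
daily = pd.date_range(start, end, freq="D")
assert (fast == daily[daily.is_month_end]).all()
```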