There are a few IntervalIndex methods that convert to a MultiIndex as an intermediate step, and then use the associated MultiIndex method to compute the result. This likely introduces overhead that could be avoided via a more direct IntervalIndex implementation.
Methods that currently require a MultiIndex conversion:
- [x] IntervalIndex.is_monotonic (#25820)
- [x] IntervalIndex.is_monotonic_increasing (#25820)
- [x] IntervalIndex.is_monotonic_decreasing (#25820)
- [x] IntervalIndex.is_unique (#26391)
- [ ] IntervalIndex._get_loc_only_exact_matches (Update: no longer exists)
- [ ] IntervalIndex.union
- [x] IntervalIndex.intersection (#26225)
- [ ] IntervalIndex.difference
- [ ] IntervalIndex.symmetric_difference
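
To illustrate the kind of overhead described above, here is a minimal sketch that contrasts a MultiIndex round trip with a direct check on the left and right endpoints. The helper `is_monotonic_increasing_direct` is hypothetical and is not the implementation from any of the linked PRs; it assumes the index has no missing intervals.

```python
import numpy as np
import pandas as pd


def is_monotonic_increasing_direct(ii: pd.IntervalIndex) -> bool:
    """Hypothetical helper: lexicographic check on (left, right) pairs.

    Assumes there are no missing (NaN) intervals in the index.
    """
    left = np.asarray(ii.left)
    right = np.asarray(ii.right)
    if len(left) < 2:
        return True
    # Each consecutive pair must satisfy
    # (left[i], right[i]) <= (left[i + 1], right[i + 1]).
    left_lt = left[:-1] < left[1:]
    left_eq = left[:-1] == left[1:]
    right_le = right[:-1] <= right[1:]
    return bool(np.all(left_lt | (left_eq & right_le)))


ii = pd.IntervalIndex.from_tuples([(0, 1), (0, 2), (1, 3)])

# Intermediate-MultiIndex approach: build a two-level index, then ask it.
via_multiindex = pd.MultiIndex.from_arrays([ii.left, ii.right]).is_monotonic_increasing

# Direct approach: compare the endpoint arrays without building a MultiIndex.
direct = is_monotonic_increasing_direct(ii)
```

Both approaches should agree on the answer; the direct version only skips constructing the intermediate MultiIndex.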
Comment From: stevenbw
@jschendel I will look into this.
Comment From: vfilimonov
Hello
Do I understand correctly that the slow DataFrame.mul, DataFrame.add, DataFrame.div, and DataFrame.sub all belong here (similar to #30267), or is it a separate issue?
df = pd.DataFrame(np.random.randn(500, 1000))
xx = pd.Series(100, index=df.columns)
%timeit df.mul(xx, axis=1) # 328 ms ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.multiply(df, xx) # 935 µs ± 61.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Comment From: simonjayhawkins
> Do I understand correctly that slow DataFrame.mul, DataFrame.add, DataFrame.div, DataFrame.sub all belong here (similar to #30267) or is it a separate issue?
I'm not sure why this issue was mentioned in #30267.
Timings for #30267 are now comparable on master (the comment line below each call is the timing on master):
%timeit x1 = df * 50 # 258 ms ± 14.6 ms per loop
# 2.78 ms ± 91.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit x2 = df * df # 1.57 ms ± 9.16 µs per loop
# 2.67 ms ± 55.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit x3 = np.multiply(df, 50) # 878 µs ± 71.7 µs per loop
# 3.28 ms ± 39.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I'm also getting comparable timings with df.mul on master:
%timeit df.mul(xx, axis=1) # 328 ms ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# 2.93 ms ± 48.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.multiply(df, xx) # 935 µs ± 61.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# 3.49 ms ± 88.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Comment From: vfilimonov
Great, thank you @simonjayhawkins. I look forward to 1.1! (And it looks like pandas is now faster than numpy?!)
P.S. Just to make sure: on 1.0.5, full DataFrame multiplication is also 2-3 times slower:
df = pd.DataFrame(np.random.randn(500, 1000))
%timeit df * df # 1.8 ms ± 149 µs per loop
%timeit df.mul(df) # 1.74 ms ± 134 µs per loop
%timeit np.multiply(df, df) # 742 µs ± 49.2 µs per loop
Is it now comparable in master as well?
Comment From: simonjayhawkins
yep, getting comparable numbers for those too.
%timeit df * df # 1.8 ms ± 149 µs per loop
# 2.82 ms ± 226 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.mul(df) # 1.74 ms ± 134 µs per loop
# 2.99 ms ± 219 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.multiply(df, df) # 742 µs ± 49.2 µs per loop
# 3.11 ms ± 42.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Comment From: vfilimonov
Wonderful! Thanks a lot!
Comment From: mroeschke
I don't see the remaining ops in the checklist dispatching to MultiIndex, so I think we can close this one.