Master:
In [14]: idx = pd.interval_range(0, 1000, 1000)
In [15]: %timeit getattr(idx, '_ndarray_values', idx)
1.29 ms ± 30.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [16]: %timeit idx.closed
321 ns ± 2.66 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
while on 0.25.3:
In [13]: idx = pd.interval_range(0, 1000, 1000)
In [14]: %timeit getattr(idx, '_ndarray_values', idx)
90.5 ns ± 2.09 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
In [15]: %timeit idx.closed
105 ns ± 1.61 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
(just checked a few attributes, didn't check if it is related to those specific ones or getattr in general)
I think this is a cause / one of the causes of several regressions that can currently be seen at https://pandas.pydata.org/speed/pandas/ (eg https://pandas.pydata.org/speed/pandas/#reshape.Cut.time_cut_timedelta?p-bins=1000&commits=6efc2379-b9de33e3)
Comment From: jorisvandenbossche
cc @jbrockmendel this might be related: https://github.com/pandas-dev/pandas/pull/30605
Comment From: jbrockmendel
30605 changed _ndarray_values from a cache_readonly to a property. Easy to update.
Comment From: jorisvandenbossche
That might be easy to update, but I am not sure that this explains the slowdown in the other attributes / slowdown in dependent functionality? Eg pd.cut (the actual benchmark I was investigating):
bins = 1000
N = 10 ** 5
timedelta_series = pd.Series(np.random.randint(N, size=N), dtype="timedelta64[ns]")
pd.cut(timedelta_series, bins)
This pd.cut
call also shows a slowdown, so would need to check if the caching improves that again (note, I also didn't fully investigate if the slowdown here is caused by the IntervalIndex attributes slowdown, there might also be other reasons).
Comment From: TomAugspurger
On master right now, CategoricalIndex._ndarray_values
is using Index._ndarray_values
, which is something like 6x slower than the old implementation (just an attribute lookup).
https://github.com/pandas-dev/pandas/pull/30717/files is adding an ExtensionIndex._ndarray_values
which would I think fix at least one source of the slowdown. Or we can have a targeted fix restoring CategoricalIndex._ndarray_values as _data.values
.
Comment From: jorisvandenbossche
Reopening this because the other attributes are still slower (the _ndarray_values
was solved in #30797), eg the original example in the top post for idx.closes
is still 3x slower.
Comment From: jorisvandenbossche
The original cut
benchmark (see example in https://github.com/pandas-dev/pandas/issues/30742#issuecomment-571199909) seems to be fixed with #30797!
Comment From: jreback
The original
cut
benchmark (see example in #30742 (comment)) seems to be fixed with #30797!
is this fixed @jorisvandenbossche
Comment From: jorisvandenbossche
Not in general, only the specific case of the _ndarray_values
attribute
Comment From: TomAugspurger
Any suggestions for the .closed
regression?
IIUC, the additional overhead comes from going through accessor.PandasDelegate
. Some options
- Just manually write the definition of IntervalIndex.closed
- Make
IntervalIndex.closed
cache readonly - Some simpler mixin, such that the definition of IntervalIndex.closed becomes just
._data.closed
, rather than the dynamic_delegate_property_get
?
Comment From: jbrockmendel
indexes.extension.inherit_names has a cache kwarg to turn a property into a cache_readonly. I can update #30860 to make closed (and anything else?) cached.
Comment From: TomAugspurger
I think just closed
IntervalIndex.
Comment From: jorisvandenbossche
I don't think it's just closed
, it's all attributes that use the same mechanism.
I think we could also have a version of PandasDelagate that is specific for the indexes that have _data
as the underlying data as extension array. I think that way, it can be written more specific / more efficient.
Comment From: TomAugspurger
@jbrockmendel does https://github.com/pandas-dev/pandas/pull/30860 close this, or is the caching being done in a separate PR?
Or do we want a PandasDelegate subclass that's specific to cases where Index.attr
is just Index._data.attr
?
Comment From: jorisvandenbossche
BTW, I don't think this should necessarily be a blocker for 1.0. But we should keep this in mind for all index->EA delegation refactor that is happening, and look into improving this.
Comment From: jbrockmendel
30860 did the caching for closed
but there may be other properties that can be made cached
Comment From: simonjayhawkins
moved off 1.1.1 milestone (scheduled for this week) as no PRs to fix in the pipeline
Comment From: simonjayhawkins
moved off 1.1.2 milestone (scheduled for this week) as no PRs to fix in the pipeline
Comment From: simonjayhawkins
moved off 1.1.3 milestone (overdue) as no PRs to fix in the pipeline
Comment From: simonjayhawkins
moved off 1.1.4 milestone (scheduled for release tomorrow) as no PRs to fix in the pipeline
Comment From: jbrockmendel
Closing as fixed