Using the test cases in tests.dtypes.test_generic, checking isinstance(item, ABCIndex) takes ~2x as long as isinstance(item, Index). The same holds for MultiIndex, DataFrame, and Series. These measurements are pretty stable if we restrict to cases where isinstance returns True.
import timeit

import numpy as np
import pandas as pd
from pandas.core.dtypes import generic as gt

tuples = [[1, 2, 2], ['red', 'blue', 'red']]
multi_index = pd.MultiIndex.from_arrays(tuples, names=('number', 'color'))
datetime_index = pd.to_datetime(['2000/1/1', '2010/1/1'])
timedelta_index = pd.to_timedelta(np.arange(5), unit='s')
period_index = pd.period_range('2000/1/1', '2010/1/1/', freq='M')
categorical = pd.Categorical([1, 2, 3], categories=[2, 3, 1])
categorical_df = pd.DataFrame({"values": [1, 2, 3]}, index=categorical)
df = pd.DataFrame({'names': ['a', 'b', 'c']}, index=multi_index)
sparse_series = pd.Series([1, 2, 3]).to_sparse()
sparse_array = pd.SparseArray(np.random.randn(10))

items = [tuples, multi_index, datetime_index, timedelta_index,
         period_index, categorical, categorical_df, df,
         sparse_series, sparse_array]

pairs = [
    (pd.Index, gt.ABCIndex),
    (pd.MultiIndex, gt.ABCMultiIndex),
    (pd.DataFrame, gt.ABCDataFrame),
    (pd.Series, gt.ABCSeries),
]

for pair in pairs:
    times = [0, 0]
    for item in items:
        times[0] += timeit.timeit('isinstance(item, pair[0])', setup='from __main__ import item, pair')
        times[1] += timeit.timeit('isinstance(item, pair[1])', setup='from __main__ import item, pair')
    print(pair[0].__name__, times)
Using pandas 0.20.2 on a 2011-era MacBook Pro, this gave:
('Index', [2.9498720169067383, 7.1935436725616455])
('MultiIndex', [2.844989061355591, 6.8693037033081055])
('DataFrame', [2.8556389808654785, 7.1646740436553955])
('Series', [2.988542079925537, 6.987249851226807])
Using pandas 0.20.2 on a 2015/2016-era Ubuntu desktop:
('Index', [1.9884922504425049, 4.220286846160889])
('MultiIndex', [1.6177773475646973, 3.439270496368408])
('DataFrame', [1.5225791931152344, 3.534085512161255])
('Series', [1.7198824882507324, 3.554945468902588])
I don't have any bright ideas for improving the ABC performance, but wanted to put these numbers out there. These isinstance checks happen all over the place, so the cumulative cost might be non-trivial.
Comment From: jreback
Well, you would have to show a case where this actually matters. These ABC instance checks are to avoid complicated and interlocking circular imports, which they do very nicely; the code is MUCH simpler generally with these.
I am not sure what you are advocating here.
Comment From: jbrockmendel
> These ABC instance checks are to avoid complicated and interlocking circular imports
Totally on board with this.
> I am not sure what you are advocating here.
Mostly intended to park these numbers somewhere for the next time micro-optimizations come up.
Comment From: jbrockmendel
Parking some more observations.
When profiling the test suite (more specifically, running pd.test() with a cProfile.Profile enabled), isinstance calls took about 10% of the total runtime.
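A measurement of this kind can be reproduced on a small scale with the standard library alone. The following is a minimal sketch, not the script used for the figure above: workload is a hypothetical toy stand-in for pd.test(), and isinstance_share is an illustrative helper name. It reads the raw pstats table and sums the cumulative time attributed to the isinstance builtin.

```python
import cProfile
import pstats


def workload(n=20000):
    """Toy stand-in for a test suite: isinstance checks mixed with other work."""
    hits = 0
    for i in range(n):
        if isinstance(i, int):
            hits += 1
        hits += len(str(i)) % 2
    return hits


def isinstance_share(func):
    """Profile func() and return (number of isinstance calls, share of total time)."""
    prof = cProfile.Profile()
    prof.enable()
    func()
    prof.disable()

    stats = pstats.Stats(prof)
    calls, cumulative = 0, 0.0
    # stats.stats maps (filename, lineno, name) -> (cc, nc, tottime, cumtime, callers)
    for (_, _, name), (_, ncalls, _, cumtime, _) in stats.stats.items():
        if "isinstance" in name:
            calls += ncalls
            cumulative += cumtime
    share = cumulative / stats.total_tt if stats.total_tt else 0.0
    return calls, share


calls, share = isinstance_share(workload)
print("isinstance calls: %d, share of runtime: %.1f%%" % (calls, 100 * share))
```

Cumulative time is used so that any Python-level __instancecheck__ triggered by isinstance is counted too. Profiler overhead inflates cheap builtin calls, so the share is best read as a rough indicator.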
Recall __instancecheck__ is here defined as:
@classmethod
def _check(cls, inst):
    return getattr(inst, attr, '_typ') in comp
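To make the cost visible, here is a self-contained sketch of how a check like the one above can be hooked into isinstance through a metaclass. It is an illustration under assumptions rather than the pandas source: make_abc_type, FakeIndex, and FakeMultiIndex are hypothetical names, and no pandas import is needed. For an ordinary class, isinstance stays in C; with a custom metaclass, every call goes through a Python-level function.

```python
def make_abc_type(name, attr, comp):
    """Build a class whose isinstance() check inspects an attribute, not the MRO."""

    def _check(cls, inst):
        # Same shape as the check quoted above: look up `attr`, test membership.
        return getattr(inst, attr, "_typ") in comp

    # isinstance() looks up __instancecheck__ on the metaclass, so it lives there.
    meta = type("ABCBase", (type,), {"__instancecheck__": _check})
    return meta(name, (), {})


class FakeIndex:
    """Hypothetical stand-in for pd.Index."""
    _typ = "index"


class FakeMultiIndex(FakeIndex):
    """Hypothetical stand-in for pd.MultiIndex."""
    _typ = "multiindex"


ABCFakeIndex = make_abc_type("ABCFakeIndex", "_typ", ("index", "multiindex"))
ABCFakeMultiIndex = make_abc_type("ABCFakeMultiIndex", "_typ", ("multiindex",))

idx, mi = FakeIndex(), FakeMultiIndex()
print(isinstance(idx, ABCFakeIndex), isinstance(mi, ABCFakeIndex))
print(isinstance(idx, ABCFakeMultiIndex), isinstance(mi, ABCFakeMultiIndex))
print(isinstance([1, 2], ABCFakeIndex))
```

Note that the ABC classes never import or reference the concrete classes, which is what lets this pattern sidestep circular imports.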
Some attempts to speed this up that didn't work:
- Make comp a set instead of a list.
- In many cases comp is a singleton. In those cases, have it check for equality with that singleton instead of inclusion.
- In all the pertinent cases attr is "_typ", which is short enough to be automatically interned. Check that this is the case, and if so, check for identity rather than equality.
The fact that none of these made a perceptible difference is surprising. I wouldn't mind someone double-checking on this.
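For anyone who does want to double-check, the first two variants can be compared in isolation with a sketch like the one below. Everything here is an assumption made for illustration: Tagged is a hypothetical stand-in for a pandas object, the check functions are plain functions rather than metaclass hooks, and the identity-based variant is left out because it depends on string interning, which is a CPython implementation detail.

```python
import timeit

ATTR = "_typ"
COMP_LIST = ["multiindex"]       # singleton, as in many of the real cases
COMP_SET = frozenset(COMP_LIST)
ONLY = COMP_LIST[0]


class Tagged:
    """Hypothetical stand-in for a pandas object carrying a _typ tag."""

    def __init__(self, typ):
        self._typ = typ


def check_list(inst):
    return getattr(inst, ATTR, "_typ") in COMP_LIST


def check_set(inst):
    return getattr(inst, ATTR, "_typ") in COMP_SET


def check_equal(inst):
    return getattr(inst, ATTR, "_typ") == ONLY


match, other = Tagged("multiindex"), Tagged("series")
variants = {"list": check_list, "set": check_set, "equality": check_equal}

timings = {}
for label, fn in variants.items():
    # Time both a hit and a miss so neither path is favoured.
    timings[label] = timeit.timeit(lambda: (fn(match), fn(other)), number=100000)

for label, seconds in timings.items():
    print("%-8s %.4f" % (label, seconds))
```

One plausible reading of the result reported above is that the function call and the getattr lookup dominate, so the final comparison barely registers; the timings printed here can help confirm or refute that on a given machine.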
Comment From: jreback
Closing, but it's searchable.
Comment From: jbrockmendel
One easy change that did make a significant difference is pinning _is_multi=False to pd.Index and _is_multi=True to pd.MultiIndex. Then a ton of isinstance(ax, MultiIndex) calls can be replaced with ax._is_multi (in cases where we know we're looking at some form of Index). This includes at least 13 calls in core.frame (including at the top of DataFrame.__getitem__), 16 calls in core.indexing, 6 in core.series, at which point I stopped counting.
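The call-site change described above might look roughly like this. It is a sketch with hypothetical stand-in classes (FakeIndex, FakeMultiIndex, FakeRangeIndex) and a made-up helper pair (nlevels_isinstance, nlevels_flag), not code from core.frame or core.indexing.

```python
class FakeIndex:
    """Hypothetical stand-in for pd.Index."""
    _is_multi = False

    def __init__(self, values):
        self.values = list(values)


class FakeMultiIndex(FakeIndex):
    """Hypothetical stand-in for pd.MultiIndex; overrides the class-level flag."""
    _is_multi = True


class FakeRangeIndex(FakeIndex):
    """Other subclasses inherit _is_multi = False without redefining it."""


def nlevels_isinstance(ax):
    # Current style: a type check on every call.
    if isinstance(ax, FakeMultiIndex):
        return len(ax.values[0])
    return 1


def nlevels_flag(ax):
    # Proposed style: a plain attribute lookup.
    # Only safe when ax is already known to be some kind of Index.
    if ax._is_multi:
        return len(ax.values[0])
    return 1


flat = FakeIndex([1, 2, 3])
ranged = FakeRangeIndex(range(3))
multi = FakeMultiIndex([(1, "red"), (2, "blue")])

for ax in (flat, ranged, multi):
    print(type(ax).__name__, nlevels_isinstance(ax), nlevels_flag(ax))
```

The flag lives on the class, so instances carry no extra state. The trade-off is that an object without the attribute raises AttributeError instead of returning False, hence the restriction to known Index objects.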
Profiled on a 2011-era MacBook and a 2015/2016-era Ubuntu desktop. Times are in seconds per 1,000,000 calls (the timeit default); ratio is _is_multi/isinstance. Results first, code below.
                             2011 MBP                        2015 Ubuntu
                             isinstance  _is_multi  ratio    isinstance  _is_multi  ratio
Int64Index                   0.6525      0.2955     0.4528   0.2249      0.0847     0.3764
RangeIndex                   0.6975      0.3189     0.4572   0.2251      0.0841     0.3739
DatetimeIndex                0.7131      0.2981     0.4180   0.2281      0.0842     0.3690
PeriodIndex                  0.6557      0.2894     0.4413   0.2282      0.0843     0.3693
Index                        0.6638      0.3028     0.4561   0.2208      0.0841     0.3808
(Int64Index, Int64Index)     0.5206      0.3071     0.5899   0.1399      0.0855     0.6111
(Int64Index, Int64Index)     0.4929      0.3150     0.6390   0.1399      0.0849     0.6065
(Int64Index, DatetimeIndex)  0.5282      0.3155     0.5974   0.1400      0.0850     0.6067
(Int64Index, PeriodIndex)    0.5220      0.3054     0.5851   0.1400      0.0841     0.6007
(Int64Index, Index)          0.4987      0.2957     0.5930   0.1408      0.0841     0.5973
import pandas as pd
import timeit

pd.Index._is_multi = False
pd.MultiIndex._is_multi = True

idxs = [
    pd.Index(range(30)),
    pd.RangeIndex(10**4),
    pd.date_range('2016-01-01', periods=15),
    pd.date_range('2016-01-01', periods=15, freq='Q').map(lambda x: pd.Period(x, freq='Q')),
    pd.Index(list('ABCDEFG')),
]

midxs = []
for idx1 in idxs:
    for idx2 in idxs:
        mi = pd.MultiIndex.from_product([idx1, idx2])
        midxs.append(mi)

for item in idxs+midxs:
    def inst():
        return isinstance(item, pd.MultiIndex)

    def attr():
        return item._is_multi

    v1 = timeit.timeit(inst)
    v2 = timeit.timeit(attr)
    print(round(v1, 4), round(v2, 4))
Comment From: gfyoung
@jbrockmendel : If it works and passes all tests, give it a shot!