Pandas PERF: isinstance(x, ABCFoo) vs isinstance(x, Foo)

Using the test cases in tests.dtypes.test_generic, checking isinstance(item, ABCIndex) takes ~2x as long as isinstance(item, Index). Same for MultiIndex, DataFrame, Series. These measurements are pretty stable if we restrict to cases where the isinstance is True.

import timeit
import numpy as np
import pandas as pd
from pandas.core.dtypes import generic as gt

tuples = [[1, 2, 2], ['red', 'blue', 'red']]
multi_index = pd.MultiIndex.from_arrays(tuples, names=('number', 'color'))
datetime_index = pd.to_datetime(['2000/1/1', '2010/1/1'])
timedelta_index = pd.to_timedelta(np.arange(5), unit='s')
period_index = pd.period_range('2000/1/1', '2010/1/1/', freq='M')
categorical = pd.Categorical([1, 2, 3], categories=[2, 3, 1])
categorical_df = pd.DataFrame({"values": [1, 2, 3]}, index=categorical)
df = pd.DataFrame({'names': ['a', 'b', 'c']}, index=multi_index)
sparse_series = pd.Series([1, 2, 3]).to_sparse()
sparse_array = pd.SparseArray(np.random.randn(10))

items = [tuples, multi_index, datetime_index, timedelta_index,
    period_index, categorical, categorical_df, df,
    sparse_series, sparse_array]

pairs = [
    (pd.Index, gt.ABCIndex),
    (pd.MultiIndex, gt.ABCMultiIndex),
    (pd.DataFrame, gt.ABCDataFrame),
    (pd.Series, gt.ABCSeries),
]

for pair in pairs:
    times = [0, 0]
    for item in items:
        times[0] += timeit.timeit('isinstance(item, pair[0])', setup='from __main__ import item, pair')
        times[1] += timeit.timeit('isinstance(item, pair[1])', setup='from __main__ import item, pair')
    print(pair[0].__name__, times)

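Using pandas 0.20.2 on a 2011-era MacBook Pro, this gave the following (seconds, summed over all items, at the timeit default of 1,000,000 calls per item; each pair is [concrete class, ABC]):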

('Index', [2.9498720169067383, 7.1935436725616455])
('MultiIndex', [2.844989061355591, 6.8693037033081055])
('DataFrame', [2.8556389808654785, 7.1646740436553955])
('Series', [2.988542079925537, 6.987249851226807])

Using pandas 0.20.2 on a 2015-16 era Ubuntu desktop:

('Index', [1.9884922504425049, 4.220286846160889])
('MultiIndex', [1.6177773475646973, 3.439270496368408])
('DataFrame', [1.5225791931152344, 3.534085512161255])
('Series', [1.7198824882507324, 3.554945468902588])

I don't have any bright ideas for improving the ABC performance, but wanted to put these numbers out there. These isinstance checks happen all over the place, so the mileage might be non-trivial.

Comment From: jreback

well you would have to show a case where this actually matters. These ABC instance checks are to avoid complicated and interlocking circular imports, which they do very nicely; the code is MUCH simpler generally with these.

I am not sure what you are advocating here.

Comment From: jbrockmendel

> These ABC instance checks are to avoid complicated and interlocking circular imports

Totally on board with this.

> I am not sure what you are advocating here.

Mostly intended to park these numbers somewhere for the next time micro-optimizations come up.

Comment From: jbrockmendel

Parking some more observations.

Profiling the test suite (more specifically, running pd.test() with a cProfile.Profile enabled), isinstance calls took about 10% of the total runtime.
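For anyone who wants to reproduce that kind of measurement, here is a rough sketch of pulling the isinstance share out of a cProfile run. It is not the exact script used for the number above: workload() is a made-up stand-in for pd.test(), and isinstance_share() is a hypothetical helper. It sums cumulative time for the isinstance builtin, on the assumption that time spent inside a Python-level __instancecheck__ should count towards the check.

```python
import cProfile
import pstats


class Base:
    pass


def workload(n=1000):
    # Hypothetical stand-in for the test suite: a loop of isinstance checks.
    items = [Base(), 1, "a", 2.0]
    hits = 0
    for _ in range(n):
        for item in items:
            if isinstance(item, Base):
                hits += 1
    return hits


def isinstance_share(func):
    """Profile func() and return (isinstance call count, share of total time)."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()

    stats = pstats.Stats(profiler)
    calls = 0
    cumulative = 0.0
    # stats.stats maps (filename, lineno, funcname) -> (cc, nc, tt, ct, callers)
    for (_, _, funcname), (_, nc, _, ct, _) in stats.stats.items():
        if "isinstance" in funcname:
            calls += nc
            cumulative += ct
    share = cumulative / stats.total_tt if stats.total_tt else 0.0
    return calls, share


calls, share = isinstance_share(workload)
print(calls, round(share, 3))
```

The share for this toy loop will be far higher than for a real test suite, since the loop does almost nothing else.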

Recall that __instancecheck__ is defined here as:

    @classmethod
    def _check(cls, inst):
        return getattr(inst, attr, '_typ') in comp
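To make the mechanism concrete without importing pandas, here is a minimal sketch of how a check like this can be attached to a class. make_abc_type, FakeIndex and FakeSeries are hypothetical names for illustration, and this is a simplification rather than the actual pandas source. The point is that isinstance() against such a class goes through a Python-level __instancecheck__ on the metaclass, which is one plausible reason it costs more than a check against a concrete class.

```python
def make_abc_type(name, attr, comp):
    """Build a class whose isinstance() check looks at a tag attribute."""

    def _check(cls, inst):
        # Same test as the snippet above: read the tag, compare against comp.
        return getattr(inst, attr, "_typ") in comp

    # The hook has to live on the metaclass for isinstance() to find it.
    meta = type("ABCBase", (type,), {"__instancecheck__": _check})
    return meta(name, (), {})


class FakeIndex:
    _typ = "index"


class FakeSeries:
    _typ = "series"


ABCFakeIndex = make_abc_type("ABCFakeIndex", "_typ", ("index",))

print(isinstance(FakeIndex(), ABCFakeIndex))   # True
print(isinstance(FakeSeries(), ABCFakeIndex))  # False
print(isinstance(object(), ABCFakeIndex))      # False
```

Nothing here needs to import FakeIndex to test for it, which is the circular-import benefit mentioned earlier in the thread.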

Some attempts to speed this up that didn't work:

Make comp a set instead of a list.
In many cases comp is a singleton. In those cases, have it check for equality with that singleton instead of inclusion.
In all the pertinent cases attr is "_typ", which is short enough to be automatically interned. Check that this is the case, and if so, check for identity rather than equality.

The fact that none of these made a perceptible difference is surprising. I wouldn't mind someone double-checking on this.
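In case it helps with the double-checking, here is a standalone sketch that times the three variants against the baseline. It does not touch pandas: Tagged, the check_* functions and the iteration count are all made up for illustration. The strings are interned explicitly with sys.intern so that the identity variant does not depend on interpreter details.

```python
import sys
import timeit


class Tagged:
    _typ = sys.intern("index")


class Other:
    _typ = sys.intern("series")


comp_tuple = (sys.intern("index"),)
comp_set = frozenset(comp_tuple)
single = comp_tuple[0]


def check_tuple(inst):
    # Baseline: membership test against a sequence.
    return getattr(inst, "_typ", "_typ") in comp_tuple


def check_set(inst):
    # Variant 1: membership test against a set.
    return getattr(inst, "_typ", "_typ") in comp_set


def check_eq(inst):
    # Variant 2: comp is a singleton, so compare for equality.
    return getattr(inst, "_typ", "_typ") == single


def check_is(inst):
    # Variant 3: rely on interning and compare by identity.
    return getattr(inst, "_typ", "_typ") is single


inst = Tagged()
for func in (check_tuple, check_set, check_eq, check_is):
    elapsed = timeit.timeit(lambda: func(inst), number=100000)
    print(func.__name__, round(elapsed, 4))
```

All four variants still pay for the getattr call and the function call itself, so it would not be shocking if that overhead swamps the comparison. That is a guess, not something the numbers above establish.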

Comment From: jreback

closing, but it's searchable

Comment From: jbrockmendel

One easy change that did make a significant difference is pinning _is_multi=False to pd.Index and _is_multi=True to pd.MultiIndex. Then a ton of isinstance(ax, MultiIndex) calls can be replaced with ax._is_multi (in cases where we know we're looking at some form of Index). This includes at least 13 calls in core.frame (including at the top of DataFrame.__getitem__), 16 calls in core.indexing, and 6 in core.series, at which point I stopped counting.
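To show what a converted call site might look like, here is a toy sketch. ToyIndex, ToyMultiIndex and describe_axis are hypothetical and only mimic the class relationship. They are not pandas code.

```python
class ToyIndex:
    # Class-level flag: every plain index answers False with one attribute lookup.
    _is_multi = False


class ToyMultiIndex(ToyIndex):
    # The subclass overrides the flag, so no isinstance call is needed.
    _is_multi = True


def describe_axis(ax):
    """Hypothetical call site that already knows `ax` is some kind of index."""
    # Before the change this would read: if isinstance(ax, ToyMultiIndex): ...
    if ax._is_multi:
        return "multi"
    return "flat"


print(describe_axis(ToyIndex()))       # flat
print(describe_axis(ToyMultiIndex()))  # multi
```

The caveat is the one already noted: this only works where ax is known to be an index. Anything else (a list, an ndarray) has no _is_multi and would raise AttributeError, so those call sites would have to keep the isinstance check.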

Profiled on a 2011-era MacBook and a 2015(6)-era Ubuntu desktop. Timings are in seconds per 1,000,000 calls (the timeit default); the ratio column is _is_multi/isinstance. Results first, code below.

                               2011MBP                   2015Ubuntu                  
                            isinstance _is_multi   ratio isinstance _is_multi   ratio
Int64Index                      0.6525    0.2955  0.4528     0.2249    0.0847  0.3764
RangeIndex                      0.6975    0.3189  0.4572     0.2251    0.0841  0.3739
DatetimeIndex                   0.7131    0.2981  0.4180     0.2281    0.0842  0.3690
PeriodIndex                     0.6557    0.2894  0.4413     0.2282    0.0843  0.3693
Index                           0.6638    0.3028  0.4561     0.2208    0.0841  0.3808
(Int64Index, Int64Index)        0.5206    0.3071  0.5899     0.1399    0.0855  0.6111
(Int64Index, Int64Index)        0.4929    0.3150  0.6390     0.1399    0.0849  0.6065
(Int64Index, DatetimeIndex)     0.5282    0.3155  0.5974     0.1400    0.0850  0.6067
(Int64Index, PeriodIndex)       0.5220    0.3054  0.5851     0.1400    0.0841  0.6007
(Int64Index, Index)             0.4987    0.2957  0.5930     0.1408    0.0841  0.5973

import pandas as pd
import timeit

pd.Index._is_multi = False
pd.MultiIndex._is_multi = True

idxs = [
    pd.Index(range(30)),
    pd.RangeIndex(10**4),
    pd.date_range('2016-01-01', periods=15),
    pd.date_range('2016-01-01', periods=15, freq='Q').map(lambda x: pd.Period(x, freq='Q')),
    pd.Index(list('ABCDEFG')),
    ]

midxs = []
for idx1 in idxs:
    for idx2 in idxs:
        mi = pd.MultiIndex.from_product([idx1, idx2])
        midxs.append(mi)

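for item in idxs + midxs:
    # Both closures read the current loop variable `item`.
    def inst():
        return isinstance(item, pd.MultiIndex)
    def attr():
        return item._is_multi
    # timeit defaults to 1,000,000 calls; results are in seconds.
    v1 = timeit.timeit(inst)
    v2 = timeit.timeit(attr)
    # Third column is the ratio _is_multi/isinstance, as in the table above.
    print(round(v1, 4), round(v2, 4), round(v2 / v1, 4))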

Comment From: gfyoung

@jbrockmendel : If it works and passes all tests, give it a shot!