Pandas PERF: indexing on a CategoricalIndex

see SO here

A CategoricalIndex naturally contains duplicate labels. The underlying code handles these with a pretty inefficient indexing method.

import pandas as pd
import numpy as np
import string

np.random.seed(1234)
df = pd.DataFrame({'categories': np.tile(np.random.choice(list(string.ascii_letters), 100000), 100),
                     'values': np.tile(np.random.choice([0,1], 100000), 100)})
df['categories'] = df['categories'].astype('category')

# using the group indexer 
In [63]: %timeit -n 3 -r 1 df.take(np.sort(np.concatenate([v for k, v in df.groupby('categories').indices.items() if k in string.ascii_lowercase])))
361 ms +- 0 ns per loop (mean +- std. dev. of 1 run, 3 loops each)

# direct map
In [51]: %timeit -n 3 -r 1 df[df.categories.map(lambda x: x in string.ascii_lowercase)]
625 ms +- 0 ns per loop (mean +- std. dev. of 1 run, 3 loops each)

# isin
In [52]: %timeit -n 3 -r 1 df[df['categories'].isin(list(string.ascii_lowercase))]
456 ms +- 0 ns per loop (mean +- std. dev. of 1 run, 3 loops each)

# groupby with filter
In [53]: %timeit -n 3 -r 1 df.groupby('categories', as_index=False).filter(lambda x: x.name in string.ascii_lowercase)
1.1 s +- 0 ns per loop (mean +- std. dev. of 1 run, 3 loops each)

# loc-style selection via reindex (fixme!)
In [54]: %timeit -n 3 -r 1 df.set_index('categories').reindex(list(string.ascii_lowercase)).reset_index()
15.4 s +- 0 ns per loop (mean +- std. dev. of 1 run, 3 loops each)
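
For comparison, here is a minimal sketch of selecting the same rows by working on the integer category codes instead of the labels. This is not how pandas implements indexing internally; `select_by_codes` is a hypothetical helper written for illustration, and it uses a smaller frame than the benchmark above so it runs quickly. No timing is claimed for it.

```python
import string

import numpy as np
import pandas as pd

np.random.seed(1234)
# Smaller frame than the benchmark above, just to illustrate the idea.
df = pd.DataFrame({
    'categories': np.random.choice(list(string.ascii_letters), 1000),
    'values': np.random.choice([0, 1], 1000),
})
df['categories'] = df['categories'].astype('category')


def select_by_codes(frame, column, wanted):
    """Hypothetical helper: filter rows by comparing integer category codes."""
    cat = frame[column].cat
    # Map the wanted labels to their positions in the (unique) categories.
    # get_indexer returns -1 for labels that are not categories.
    wanted_codes = cat.categories.get_indexer(list(wanted))
    wanted_codes = wanted_codes[wanted_codes >= 0]
    # Compare small integer codes instead of the labels themselves.
    mask = np.isin(cat.codes.to_numpy(), wanted_codes)
    return frame[mask]


result = select_by_codes(df, 'categories', string.ascii_lowercase)
expected = df[df['categories'].isin(list(string.ascii_lowercase))]
```

The result should match the `isin` variant row for row, since both build a boolean mask over the same column.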

Comment From: mroeschke

Looks like we banned reindex on a duplicate Index, so I am not sure this case is still relevant. Closing.
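
A small sketch of the behaviour mentioned above, using a plain (non-categorical) index with duplicate labels, where `reindex` raises `ValueError`. Whether the `CategoricalIndex` case from the benchmark raises as well depends on the pandas version, so treat that part as an assumption. The frame and labels here are made up for illustration.

```python
import pandas as pd

# Toy frame whose index contains the label 'a' twice.
frame = pd.DataFrame({'values': [1, 2, 3]}, index=['a', 'a', 'b'])

try:
    frame.reindex(['a', 'b'])
    raised = False
except ValueError:
    # reindex refuses to align against an axis with duplicate labels.
    raised = True

# Label-based selection still works with duplicates: .loc returns every
# row that carries a requested label.
selected = frame.loc[['a', 'b']]

# A boolean mask gives the same rows without going through reindex.
masked = frame[frame.index.isin(['a', 'b'])]
```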