see SO here
These are naturally duplicated indexes. These are handled in the underyling code using a pretty inefficient method of indexing.
import pandas as pd
import numpy as np
import string
np.random.seed(1234)
df = pd.DataFrame({'categories': np.tile(np.random.choice(list(string.ascii_letters), 100000), 100),
'values': np.tile(np.random.choice([0,1], 100000), 100)})
df['categories'] = df['categories'].astype('category')
# using the group indexer
In [63]: %timeit -n 3 -r 1 df.take(np.sort(np.concatenate([v for k, v in df.groupby('categories').indices.items() if k in string.ascii_lowercase])))
361 ms +- 0 ns per loop (mean +- std. dev. of 1 run, 3 loops each)
# direct map
In [51]: %timeit -n 3 -r 1 df[df.categories.map(lambda x: x in string.ascii_lowercase)]
625 ms +- 0 ns per loop (mean +- std. dev. of 1 run, 3 loops each)
# isin
In [52]: %timeit -n 3 -r 1 df[df['categories'].isin(list(string.ascii_lowercase))]
456 ms +- 0 ns per loop (mean +- std. dev. of 1 run, 3 loops each)
# groupby with filter
In [53]: %timeit -n 3 -r 1 df.groupby('categories', as_index=False).filter(lambda x: x.name in string.ascii_lowercase)
1.1 s +- 0 ns per loop (mean +- std. dev. of 1 run, 3 loops each)
# loc (fixme!)
In [54]: %timeit -n 3 -r 1 df.set_index('categories').reindex(list(string.ascii_lowercase)).reset_index()
15.4 s +- 0 ns per loop (mean +- std. dev. of 1 run, 3 loops each)
Comment From: mroeschke
Looks like we banned a reindex on a duplicate Index so I am not sure if this case is relevant anymore so closing