Code Sample
In [79]: import pandas as pd
In [80]: from itertools import groupby
In [81]: sd_series = pd.Series(range(1000), index=pd.MultiIndex.from_product([range(100), range(10)]))
In [82]: def f_loc():
    ...:     for l1 in sd_series.index.levels[0]:
    ...:         _ = sd_series.loc[l1]
    ...:
In [83]: def f_iloc():
    ...:     for i, l1 in enumerate(sd_series.index.levels[0]):
    ...:         _ = sd_series.iloc[10*i:10*(i+1)]
    ...:
In [84]: def f_pd_groupby():
    ...:     for l1, group in sd_series.groupby(level=0):
    ...:         _ = group
    ...:
In [85]: def f_groupby():
    ...:     for l1, group in groupby(sd_series.items(), lambda it: it[0][0]):
    ...:         _ = group
    ...:
In [86]: %timeit f_loc()
55.7 ms ± 1.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [87]: %timeit f_iloc()
18.8 ms ± 665 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [88]: %timeit f_pd_groupby()
17.5 ms ± 230 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [89]: %timeit f_groupby()
374 µs ± 6.77 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Problem description
There seems to be a significant performance issue when extracting data from a Series with a MultiIndex: it is actually faster (by a factor of 100) to use itertools.groupby than either loc, iloc, or pd.groupby.
I've found some issues with MultiIndex in the past with get_loc, but that has been fixed. Apologies if I missed something else.
Output of pd.show_versions()
Comment From: jreback
Not sure what you are trying to show here.
In [18]: %timeit sd_series.take([l1 for l1 in sd_series.index.levels[0]])
87.1 µs ± 887 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [19]: %timeit [(l1, DataFrame(list(group))) for l1, group in groupby(sd_series.items(), lambda it: it[0][0])]
30.4 ms ± 400 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
So your indexing doesn't actually do anything but a split, which you can accomplish in many ways. This is what groupby actually does (it's even worse: it's just returning an iterator, so it's really not doing anything).
So you are comparing indexing & construction vs. a split, in a completely non-idiomatic way.
[18] is idiomatic for pandas.
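To make the point above concrete, here is a minimal sketch (function names are mine, not from the thread): the fast itertools.groupby timing discards each group iterator, so it never builds a per-group pandas object; adding the Series construction back in is where the time goes.

from itertools import groupby

import pandas as pd

sd_series = pd.Series(range(1000),
                      index=pd.MultiIndex.from_product([range(100), range(10)]))

def split_only():
    # Each group is a lazy iterator that is thrown away; the underlying
    # items are still scanned once to find the next key, but no per-group
    # pandas object is ever constructed.
    for l1, group in groupby(sd_series.items(), lambda it: it[0][0]):
        _ = group

def split_and_construct():
    # Materializing each group into a Series restores the construction
    # cost that the pandas-based variants were paying all along.
    for l1, group in groupby(sd_series.items(), lambda it: it[0][0]):
        keys, values = zip(*group)
        _ = pd.Series(values, index=keys)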
Comment From: fcolas
Thank you for your answer.
I actually want a split, and I might not have found the idiomatic way. More precisely, I have some data that I sometimes want to iterate over by groups of the first level, and sometimes simply over all values. These iterations yield non-pandas objects consumed by some outside code. (Of course I also do some pandas processing on this data.)
A more precise look at what I'm doing here:
In [102]: import collections
In [103]: MyClass = collections.namedtuple('MyClass', list('Labcde'))
In [104]: sd_series = pd.Series(range(1000, 1500), index=pd.MultiIndex.from_product([range(100, 200), list('abcde')]))
In [105]: def f_loc():
     ...:     for l1 in sd_series.index.levels[0]:
     ...:         _ = MyClass(l1, **sd_series.loc[l1])
     ...:
In [106]: def f_iloc():
     ...:     for i, l1 in enumerate(sd_series.index.levels[0]):
     ...:         _ = MyClass(l1, *sd_series.iloc[5*i:5*(i+1)])
     ...:
In [107]: def f_pd_groupby():
     ...:     for l1, group in sd_series.groupby(level=0):
     ...:         _ = MyClass(l1, **{n: v for (_, n), v in group.items()})
     ...:
In [108]: def f_groupby():
     ...:     for l1, group in groupby(sd_series.items(), lambda it: it[0][0]):
     ...:         _ = MyClass(l1, **{n: v for (_, n), v in group})
     ...:
In [109]: def f_take():
     ...:     for i, l1 in enumerate(sd_series.index.levels[0]):
     ...:         _ = MyClass(l1, **{n: v for (_, n), v in sd_series.take(range(5*i, 5*(i+1))).items()})
     ...:
In [110]: %timeit f_loc()
10 loops, best of 3: 38.7 ms per loop
In [111]: %timeit f_iloc()
100 loops, best of 3: 11 ms per loop
In [112]: %timeit f_pd_groupby()
10 loops, best of 3: 23.7 ms per loop
In [113]: %timeit f_groupby()
1000 loops, best of 3: 252 µs per loop
In [114]: %timeit f_take()
10 loops, best of 3: 25.4 ms per loop
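For what it's worth, since this MultiIndex is a regular product (exactly five second-level entries per first-level key, stored contiguously in index order), one could also sidestep the per-element .items() path entirely by reshaping the underlying values. A sketch under that regularity assumption (f_reshape is my name, not from the thread):

import collections

import pandas as pd

MyClass = collections.namedtuple('MyClass', list('Labcde'))
sd_series = pd.Series(range(1000, 1500),
                      index=pd.MultiIndex.from_product([range(100, 200), list('abcde')]))

def f_reshape():
    # One bulk extraction of the data, then plain-Python construction.
    # Only valid because every first-level key has exactly 5 entries
    # laid out contiguously in index order.
    values = sd_series.values.reshape(-1, 5)
    for l1, row in zip(sd_series.index.levels[0], values):
        _ = MyClass(l1, *row)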
But indeed, things change if I want to get a pd.Series (disregarding differences in index):
In [139]: def g_loc():
     ...:     for l1 in sd_series.index.levels[0]:
     ...:         _ = sd_series.loc[l1]
     ...:
In [140]: def g_iloc():
     ...:     for i, l1 in enumerate(sd_series.index.levels[0]):
     ...:         _ = sd_series.iloc[5*i:5*(i+1)]
     ...:
In [141]: def g_pd_groupby():
     ...:     for _, group in sd_series.groupby(level=0):
     ...:         _ = group
     ...:
In [142]: def g_groupby():
     ...:     for l1, group in groupby(sd_series.items(), lambda it: it[0][0]):
     ...:         keys, values = zip(*group)
     ...:         _ = pd.Series(values, index=keys)
     ...:
In [143]: def g_take():
     ...:     for i, l1 in enumerate(sd_series.index.levels[0]):
     ...:         _ = sd_series.take(range(5*i, 5*(i+1)))
     ...:
In [144]: def g_xs():
     ...:     for l1 in sd_series.index.levels[0]:
     ...:         _ = sd_series.xs(l1, level=0)
     ...:
In [145]: %timeit g_loc()
10 loops, best of 3: 30.4 ms per loop
In [146]: %timeit g_iloc()
100 loops, best of 3: 8.72 ms per loop
In [147]: %timeit g_pd_groupby()
100 loops, best of 3: 8.05 ms per loop
In [148]: %timeit g_groupby()
100 loops, best of 3: 13.6 ms per loop
In [149]: %timeit g_take()
100 loops, best of 3: 9.56 ms per loop
In [150]: %timeit g_xs()
10 loops, best of 3: 27.1 ms per loop
What is unclear to me is:
- why do loc and xs take this long (3x compared to pd.groupby)?
- what takes time in creating a pd.Series? (see the sketch after this list)
- is there a more idiomatic way to iterate over a Series by group?
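Regarding the second question, one way to see where the time goes (my sketch, no numbers from the thread) is to construct a single group's Series in isolation. Each constructor call pays for dtype inference, a copy of the Python values into a fresh ndarray, and the creation of an Index object: fixed overhead repeated once per group.

import pandas as pd

# One group's worth of data, as the itertools path would produce it.
keys = list('abcde')
values = [1000, 1001, 1002, 1003, 1004]

# Timing just this line (e.g. with %timeit) isolates the per-group
# construction cost that separates g_groupby from the lazy split.
s = pd.Series(values, index=keys)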
Comment From: jreback
Your sample size is very small. You are turning tuples into a Series; this involves a copy to underlying storage. I suppose there could be more info in a performance section, but I do recall several issues / nice explanations on SO.
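A sketch of the sample-size point (my construction, not from the thread): with only five rows per group, fixed per-group overhead dominates every pandas operation; scale the groups up and the balance shifts.

import pandas as pd

# Same experiment shape, but 1,000 rows per first-level key instead of 5:
# the fixed per-group overhead of groupby/.loc is now amortized over many
# rows, while the pure-Python itertools path must touch every tuple.
big = pd.Series(range(100 * 1000),
                index=pd.MultiIndex.from_product([range(100), range(1000)]))

def g_pd_groupby_big():
    for _, group in big.groupby(level=0):
        _ = group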
Comment From: jreback
If you want to contribute docs, that would be fine.