Code Sample
In [79]: import pandas as pd
In [80]: from itertools import groupby
In [81]: sd_series = pd.Series(range(1000), index=pd.MultiIndex.from_product([range(100), range(10)]))
In [82]: def f_loc():
    ...:     for l1 in sd_series.index.levels[0]:
    ...:         _ = sd_series.loc[l1]
    ...:
In [83]: def f_iloc():
    ...:     for i, l1 in enumerate(sd_series.index.levels[0]):
    ...:         _ = sd_series.iloc[10*i:10*(i+1)]
    ...:
In [84]: def f_pd_groupby():
    ...:     for l1, group in sd_series.groupby(level=0):
    ...:         _ = group
    ...:
In [85]: def f_groupby():
    ...:     for l1, group in groupby(sd_series.items(), lambda it: it[0][0]):
    ...:         _ = group
    ...:
In [86]: %timeit f_loc()
55.7 ms ± 1.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [87]: %timeit f_iloc()
18.8 ms ± 665 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [88]: %timeit f_pd_groupby()
17.5 ms ± 230 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [89]: %timeit f_groupby()
374 µs ± 6.77 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Problem description
There seems to be a significant performance issue when extracting data from a Series with a MultiIndex: it is actually faster (by a factor of 100) to use itertools.groupby than either loc, iloc, or pd.groupby.
I've found some issues with MultiIndex in the past with get_loc, but that has been fixed. Apologies if I missed something else.
Output of pd.show_versions()
Comment From: jreback
Not sure what you are trying to show here.
In [18]: %timeit sd_series.take([l1 for l1 in sd_series.index.levels[0]])
87.1 µs ± 887 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [19]: %timeit [(l1, DataFrame(list(group))) for l1, group in groupby(sd_series.items(), lambda it: it[0][0])]
30.4 ms ± 400 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
So your indexing doesn't actually do anything but a split, which you can accomplish in many ways. This is what groupby actually does (it's even worse: it's just returning an iterator, so it's really not doing anything).
So you are comparing indexing & construction vs. a split, in a completely non-idiomatic way.
[18] is idiomatic for pandas.
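To make the point above concrete, here is a minimal sketch (function names are mine, not from the thread): the fast itertools.groupby timing discards each group iterator, so it never builds a per-group pandas object; adding the Series construction back in is where the time goes.

from itertools import groupby

import pandas as pd

sd_series = pd.Series(range(1000),
                      index=pd.MultiIndex.from_product([range(100), range(10)]))

def split_only():
    # Each group is a lazy iterator that is thrown away; the underlying
    # items are still scanned once to find the next key, but no per-group
    # pandas object is ever constructed.
    for l1, group in groupby(sd_series.items(), lambda it: it[0][0]):
        _ = group

def split_and_construct():
    # Materializing each group into a Series restores the construction
    # cost that the pandas-based variants were paying all along.
    for l1, group in groupby(sd_series.items(), lambda it: it[0][0]):
        keys, values = zip(*group)
        _ = pd.Series(values, index=keys)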
Comment From: fcolas
Thank you for your answer.
I actually want a split, and I might not have found the idiomatic way. More precisely, I have some data that I sometimes want to iterate over by groups of the first level, and sometimes simply over all values. These iterations yield non-pandas objects consumed by some outside code. (Of course I also do some pandas processing on this data.)
A more precise look at what I'm doing here:
In [102]: import collections
In [103]: MyClass = collections.namedtuple('MyClass', list('Labcde'))
In [104]: sd_series = pd.Series(range(1000, 1500), index=pd.MultiIndex.from_product([range(100, 200), list('abcde')]))
In [105]: def f_loc():
     ...:     for l1 in sd_series.index.levels[0]:
     ...:         _ = MyClass(l1, **sd_series.loc[l1])
     ...:
In [106]: def f_iloc():
     ...:     for i, l1 in enumerate(sd_series.index.levels[0]):
     ...:         _ = MyClass(l1, *sd_series.iloc[5*i:5*(i+1)])
     ...:
In [107]: def f_pd_groupby():
     ...:     for l1, group in sd_series.groupby(level=0):
     ...:         _ = MyClass(l1, **{n: v for (_, n), v in group.items()})
     ...:
In [108]: def f_groupby():
     ...:     for l1, group in groupby(sd_series.items(), lambda it: it[0][0]):
     ...:         _ = MyClass(l1, **{n: v for (_, n), v in group})
     ...:
In [109]: def f_take():
     ...:     for i, l1 in enumerate(sd_series.index.levels[0]):
     ...:         _ = MyClass(l1, **{n: v for (_, n), v in sd_series.take(range(5*i, 5*(i+1))).items()})
     ...:
In [110]: %timeit f_loc()
10 loops, best of 3: 38.7 ms per loop
In [111]: %timeit f_iloc()
100 loops, best of 3: 11 ms per loop
In [112]: %timeit f_pd_groupby()
10 loops, best of 3: 23.7 ms per loop
In [113]: %timeit f_groupby()
1000 loops, best of 3: 252 µs per loop
In [114]: %timeit f_take()
10 loops, best of 3: 25.4 ms per loop
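For what it's worth, since this MultiIndex is a regular product (exactly five second-level entries per first-level key, stored contiguously in index order), one could also sidestep the per-element .items() path entirely by reshaping the underlying values. A sketch under that regularity assumption (f_reshape is my name, not from the thread):

import collections

import pandas as pd

MyClass = collections.namedtuple('MyClass', list('Labcde'))
sd_series = pd.Series(range(1000, 1500),
                      index=pd.MultiIndex.from_product([range(100, 200), list('abcde')]))

def f_reshape():
    # One bulk extraction of the data, then plain-Python construction.
    # Only valid because every first-level key has exactly 5 entries
    # laid out contiguously in index order.
    values = sd_series.values.reshape(-1, 5)
    for l1, row in zip(sd_series.index.levels[0], values):
        _ = MyClass(l1, *row)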
But indeed, things change if I want to get a pd.Series (disregarding differences in index):
In [139]: def g_loc():
     ...:     for l1 in sd_series.index.levels[0]:
     ...:         _ = sd_series.loc[l1]
     ...:
In [140]: def g_iloc():
     ...:     for i, l1 in enumerate(sd_series.index.levels[0]):
     ...:         _ = sd_series.iloc[5*i:5*(i+1)]
     ...:
In [141]: def g_pd_groupby():
     ...:     for _, group in sd_series.groupby(level=0):
     ...:         _ = group
     ...:
In [142]: def g_groupby():
     ...:     for l1, group in groupby(sd_series.items(), lambda it: it[0][0]):
     ...:         keys, values = zip(*group)
     ...:         _ = pd.Series(values, index=keys)
     ...:
In [143]: def g_take():
     ...:     for i, l1 in enumerate(sd_series.index.levels[0]):
     ...:         _ = sd_series.take(range(5*i, 5*(i+1)))
     ...:
In [144]: def g_xs():
     ...:     for l1 in sd_series.index.levels[0]:
     ...:         _ = sd_series.xs(l1, level=0)
     ...:
In [145]: %timeit g_loc()
10 loops, best of 3: 30.4 ms per loop
In [146]: %timeit g_iloc()
100 loops, best of 3: 8.72 ms per loop
In [147]: %timeit g_pd_groupby()
100 loops, best of 3: 8.05 ms per loop
In [148]: %timeit g_groupby()
100 loops, best of 3: 13.6 ms per loop
In [149]: %timeit g_take()
100 loops, best of 3: 9.56 ms per loop
In [150]: %timeit g_xs()
10 loops, best of 3: 27.1 ms per loop
What is unclear to me is:
- why do loc and xs take this long (3x compared to pd.groupby)?
- what takes time in creating a pd.Series? (see the sketch after this list)
- is there a more idiomatic way to iterate over a Series by group?
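Regarding the second question, one way to see where the time goes (my sketch, no numbers from the thread) is to construct a single group's Series in isolation. Each constructor call pays for dtype inference, a copy of the Python values into a fresh ndarray, and the creation of an Index object: fixed overhead repeated once per group.

import pandas as pd

# One group's worth of data, as the itertools path would produce it.
keys = list('abcde')
values = [1000, 1001, 1002, 1003, 1004]

# Timing just this line (e.g. with %timeit) isolates the per-group
# construction cost that separates g_groupby from the lazy split.
s = pd.Series(values, index=keys)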
Comment From: jreback
Your sample size is very small. You are turning tuples into a Series; this involves a copy to underlying storage. I suppose there could be more info in a performance section, but I do recall several issues / nice explanations on SO.
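A sketch of the sample-size point (my construction, not from the thread): with only five rows per group, fixed per-group overhead dominates every pandas operation; scale the groups up and the balance shifts.

import pandas as pd

# Same experiment shape, but 1,000 rows per first-level key instead of 5:
# the fixed per-group overhead of groupby/.loc is now amortized over many
# rows, while the pure-Python itertools path must touch every tuple.
big = pd.Series(range(100 * 1000),
                index=pd.MultiIndex.from_product([range(100), range(1000)]))

def g_pd_groupby_big():
    for _, group in big.groupby(level=0):
        _ = group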
Comment From: jreback
If you want to contribute docs, that would be fine.