Code Sample, a copy-pastable example if possible
from __future__ import print_function
import numpy as np
import pandas as pd
idx = pd.MultiIndex.from_product([['foo', 'bar'], ['a', 'b']],
                                 names=['first', 'second'])
df = pd.DataFrame(np.random.randn(4, 2), index=idx).sort_index()
bool_idx = pd.Series([True, False], index=['foo', 'bar'])
print('dataframe')
print(df)
print('\nbool idx')
print(bool_idx)
try:
    df.loc[(bool_idx, slice(None)), :]
except ValueError as e:
    import traceback
    print('\nbool index in MultiIndex failed')
    traceback.print_exc()
no_multi_idx = df.reset_index(level=1)
print('\nno multi index')
print(no_multi_idx)
print('successful indexing')
print(no_multi_idx.loc[bool_idx])
Problem description
A level of a MultiIndex cannot be sliced by a boolean array shorter than the level. A single level of a MultiIndex can contain repeated values (as one level isn't the entire index), so slicing by a smaller array can be necessary. A single Index with repeated values can currently be sliced by a boolean array without repeated Index values. It would be a smoother user experience if the same feature existed for MultiIndex, especially since this is less of a corner case with MultiIndex (because it doesn't require repeating index values). I would guess that the same or very similar code to the single Index case could be used to handle the MultiIndex case.
Expected Output
                     0         1
first second
foo   a      -0.569935 -1.146863
      b      -2.087799 -0.506962
This can currently be computed by calling reset_index, slicing, and then calling set_index with append=True.
Output of pd.show_versions()
Comment From: sinhrks
Yeah, this needs logic to handle a tuple that contains boolean array-likes.
df.loc[(['foo'], slice(None)), :]
#                      0         1
# first second
# foo   a      -0.168657 -0.495094
#       b       0.396207 -1.310770
df.loc[([True, False], slice(None)), :]
# ValueError: cannot index with a boolean indexer that is not the same length as the index
Comment From: jorisvandenbossche
@sinhrks Note that the original example used a Series, not a plain list. For a boolean list, the above is correct: it should have the same length as the index. Only a boolean Series is aligned before indexing.
Comment From: jreback
This would be a major API change, and I don't even think this is possible, meaning that this could lead to ambiguous circumstances. Further, this is not at all convenient; constructing a boolean index that is exactly equal to the length of THAT particular level is obvious. The use case generally is this:
In [17]: df.loc[(df[0]>0, slice(None)), :]
Out[17]:
                     0         1
first second
bar   a       0.357749  1.191444
foo   a       0.334568  0.979802
      b       0.188320  0.085018
@lightcatcher can you show a use case where this is actually useful? (pre-constructing a boolean index the same length as THAT level).
I keep emphasizing THAT level, because in this case the levels are the same length (2) and easy enough. But most multi-indexes are not like that. The levels are different and usually bigger.
Comment From: eamartin
I was attempting to do this type of indexing instead of a join. I had no reason to choose this over a join, it just came to mind first. A join seems like the more idiomatic way to go, but the support for single Index but not MultiIndex made me think it was worth opening an issue.
Example use case:
I have dataframes foo, bar, and foobar. foo has index foo_id, bar has index bar_id, and foobar has index (foo_id, bar_id).
I want to select the rows of foobar that have a foo_id such that foo.loc[foo_id][col] > 0 for a col that's stored on foo but not on foobar.
I attempted to do this with result = foobar.loc[(foo[col] > 0, slice(None)), :], and this led to the error message. Notably, this would work if foobar's index was just foo_id rather than (foo_id, bar_id).
I think the more standard way to do this computation is:
joined = foobar.join(foo)
result = joined[joined[col] > 0][foobar.columns]
Algorithmically (at least in this case), the unequal length series index lookup is equivalent to a join. Maybe the boolean indexing this issue is about can be viewed as sugar on top of a join with a boolean series?
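To make the equivalence between the join and the per-level lookup concrete, here is a minimal sketch. The frames, the column names col and val, and the id values are made up for illustration; only the join and reindex calls come from the discussion.

```python
import pandas as pd

# Hypothetical data: foo is indexed by foo_id, foobar by (foo_id, bar_id).
foo = pd.DataFrame({"col": [1.0, -1.0]},
                   index=pd.Index([10, 20], name="foo_id"))
foobar = pd.DataFrame(
    {"val": [1, 2, 3, 4]},
    index=pd.MultiIndex.from_product([[10, 20], ["x", "y"]],
                                     names=["foo_id", "bar_id"]),
)

# Join approach: bring col onto foobar, filter on it, then drop it again.
joined = foobar.join(foo)
via_join = joined[joined["col"] > 0][foobar.columns]

# Mask approach: look up the per-foo_id boolean once for every row of foobar.
mask = (foo["col"] > 0).reindex(foobar.index.get_level_values("foo_id"))
via_mask = foobar[mask.to_numpy()]
```

Both results keep only the rows whose foo_id is 10, which is the sense in which the shorter boolean Series behaves like a join key.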
Comment From: jorisvandenbossche
@lightcatcher trying to translate your indexing-approach to code that already works now:
In [21]: bool_idx
Out[21]:
foo     True
bar    False
dtype: bool
In [22]: bool_idx_aligned = bool_idx.reindex(df.index.get_level_values(0))
In [23]: bool_idx_aligned
Out[23]:
first
bar    False
bar    False
foo     True
foo     True
dtype: bool
In [24]: df.loc[(bool_idx_aligned, slice(None)), :]
Out[24]:
                     0         1
first second
foo   a       1.420829  0.464567
      b      -0.393165 -0.166931
So here I do the 'alignment' manually, so the passed indexer has the correct length and order.
And I think your question is then whether df.loc[(bool_idx, slice(None)), :] should do such reindexing on the indexer automatically.
Comment From: jreback
@jorisvandenbossche
I was making the point that it's harder to actually create the bool_idx in the first place. E.g. it's much easier simply to use .isin() (or to use the indexers directly, meaning what you want to select).
I was trying to understand why @lightcatcher created the bool_idx (other than to try things out). In the real world, you almost always programmatically create a boolean indexer.
Comment From: eamartin
@jorisvandenbossche
Thanks for the reindex example.
This issue is to decide
(1) whether df.loc[(bool_idx, slice(None)), :] should be supported when len(bool_idx) != len(df.index), and
(2) if so, how to implement it. It appears reindex is an option for (2).
@jreback
Do you understand where my initial example (with df and bool_idx) came from? I submitted this as a reduced test case for the foo, bar, foobar dataframes problem I later described.
Is using join the idiomatic way to do the foo, bar, foobar example?
You mentioned you think this is a major API change that could lead to ambiguities. Do these concerns apply if there's just a new step to reindex the boolean indexer to the appropriate multiindex level if needed?
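For reference, the extra reindexing step being asked about could look roughly like the sketch below. align_level_mask is a hypothetical helper written for this comment, not a pandas API, and treating labels that are missing from the mask as False is an assumption.

```python
import pandas as pd

def align_level_mask(index, mask, level):
    """Hypothetical helper: expand a boolean Series keyed by the labels of
    one MultiIndex level into one boolean per row of the index."""
    level_values = index.get_level_values(level)
    # Assumption: labels absent from the mask are treated as False.
    return mask.reindex(level_values, fill_value=False).to_numpy(dtype=bool)

idx = pd.MultiIndex.from_product([["foo", "bar"], ["a", "b"]],
                                 names=["first", "second"])
df = pd.DataFrame({"x": range(4)}, index=idx).sort_index()
bool_idx = pd.Series([True, False], index=["foo", "bar"])

# Full-length row mask, applied with ordinary boolean indexing.
row_mask = align_level_mask(df.index, bool_idx, "first")
selected = df[row_mask]
```

The helper sidesteps the ambiguity question by producing a full-length mask up front, so the selection itself uses only behaviour that already exists.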
Comment From: jorisvandenbossche
@jreback yes, I get that, and @lightcatcher explained why he has such a strange bool_idx, and I think it is a valid use case (which does not necessarily mean that I think enabling such automatic behaviour is a good idea; IMO it is too much magic).
So I am just looking for ways to do this indexing approach without running into the originally reported error. Reindexing the indexer is one possibility (as shown above); isin can indeed be another:
In [32]: df.loc[df.index.get_level_values(0).isin(bool_idx.index[bool_idx])]
Out[32]:
                     0         1
first second
foo   a       1.420829  0.464567
      b      -0.393165 -0.166931
(the bool_idx.index[bool_idx] can of course be simplified if you have the underlying dataframe from which this bool_idx is created)
@lightcatcher I understand your expectation that something that works on a single-indexed frame should also work on a multi-indexed frame. However, the alignment is something rather magical, and it is not clear that it should necessarily only align to the current index level, and not to the full multi-index. Therefore, I am not sure I am a fan of adding this functionality.
In any case, the situations where your boolean indexer is a Series with an index deviating from the dataframe's index are rather underspecified and not well documented (also for a single-level index). So this area can certainly be improved.
Comment From: jreback
@lightcatcher I understand what you are trying to do.
As @jorisvandenbossche indicated, indexing is already quite complicated. This does not seem like a natural extension, as it's really very different from boolean slicing as it already exists.
Comment From: shoyer
@jreback writes:
The use case generally is this:
In [17]: df.loc[(df[0]>0, slice(None)), :]
Out[17]:
                     0         1
first second
bar   a       0.357749  1.191444
foo   a       0.334568  0.979802
      b       0.188320  0.085018
I am actually surprised that even this works -- it is not at all obvious to me what it means to index a level of a MultiIndex with a boolean with length equal to the entire index. A less ambiguous way to write this would be df.loc[df[0]>0, :].
In fact, if I had to guess at how df.loc[(bool_array, slice(None)), :] works, it seems at least as natural to use bool_array to index along df.index.levels[0]. Of course, this would still be confusing to many users (especially when not all levels are used in the actual index), so I don't really recommend it either.
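A small sketch of the levels[0] reading, and of why unused levels make it confusing. The frame, the column x, and the label values are invented for illustration; this is not something pandas does automatically.

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([["bar", "baz", "foo"], ["a", "b"]],
                                 names=["first", "second"])
df = pd.DataFrame({"x": range(6)}, index=idx)

# Drop every "baz" row; the label still lingers in levels[0] as an unused level.
sub = df[df.index.get_level_values("first") != "baz"]

# One flag per entry of levels[0], including the unused "baz" entry.
level_mask = np.array([True, False, False])
chosen = sub.index.levels[0][level_mask]

# Translate the chosen labels back into a row selection.
picked = sub[sub.index.get_level_values(0).isin(chosen)]
```

Here the mask needs three entries even though only two labels actually appear in sub, which is the kind of surprise mentioned above.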
Given the importance of preserving backwards compatibility, we shouldn't change this. But for pandas 2, I would consider making all attempts to index a level of a MultiIndex with a boolean raise an error.
Comment From: eamartin
@jorisvandenbossche Thanks for the explanation. You said:
However, the alignment is something rather magical, and it is not clear that it should necessarily only align to the current index level, and not to the full multi-index.
This seems clear with the slicer syntax I'm using. The slicer syntax is df.loc[(level_0_indexer, level_1_indexer, ...), :], where : could be replaced by another tuple of slices for the column levels. My (very uneducated) guess is that the level_i_indexer could be reindexed against df.index.get_level_values(i) safely.
Is this not the case?
@shoyer I'm also surprised A = df.loc[(df[0]>0, slice(None)), :] works, as either B = df.loc[df[0]>0, :] or C = df.loc[(df.loc[df[0] > 0, :].index.get_level_values(0), slice(None)), :] makes more sense to me. From evaluating, it appears A == B and A != C. Personally, I would expect that A == C, because otherwise what's the point of using boolean series with per-level indexers?
However, this is tangential to the initial issue I opened.
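The difference between the row-wise reading (B) and the level-wise reading (what C is aiming at) can be sketched as follows. The values are chosen by hand so the outcome is deterministic, and the level-wise selection is written with isin rather than the nested tuple so that it does not depend on how .loc treats repeated labels.

```python
import pandas as pd

idx = pd.MultiIndex.from_product([["bar", "foo"], ["a", "b"]],
                                 names=["first", "second"])
df = pd.DataFrame({0: [1.0, -1.0, 2.0, 3.0]}, index=idx)

row_mask = df[0] > 0

# Row-wise reading (B): keep exactly the rows where the mask is True.
B = df.loc[row_mask, :]

# Level-wise reading: keep every row whose first-level label has
# at least one row where the mask is True.
labels = df.index[row_mask.to_numpy()].get_level_values(0).unique()
C_like = df.loc[df.index.get_level_values(0).isin(labels)]
```

B drops the (bar, b) row, while the level-wise version keeps it because bar has another row that passes the filter.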
Comment From: jreback
The per-level slicing syntax is meant to support a non-chained version that allows things like this (from the docs: http://pandas-docs.github.io/pandas-docs-travis/advanced.html#using-slicers):
In [56]: mask = dfmi[('a','foo')]>200
In [57]: dfmi.loc[idx[mask,:,['C1','C3']],idx[:,'foo']]
Out[57]:
lvl0           a    b
lvl1         foo  foo
A3 B0 C1 D1  204  206
      C3 D0  216  218
         D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254
There is no other way to set values without chained indexing when you need simultaneous multi-level label selection and a boolean mask on a column.
Furthermore, it is syntactically hard to even specify this, e.g. what if you wanted to simultaneously index all of the levels in some way AND use a boolean mask?
It's probably not used very much, or this would have come up earlier.
Comment From: eamartin
@jreback If I understand the example from the docs plus your explanation correctly, then dfmi.loc[idx[mask,:,['C1','C3']],idx[:,'foo']] is the same as dfmi.loc[idx[:, mask, ['C1','C3']], idx[:, 'foo']]. Is that correct?
This syntax exists because the alternative would be to do dfmi.loc[mask].loc[idx[:, :, ['C1', 'C3']], idx[:, :]], and this can't be used as an lvalue for assignment because of chaining?
If avoiding chaining is the issue, could that dereference be written as
mask = (dfmi[('a', 'foo')] > 200).loc[idx[:, :, ['C1','C3']], idx[:, :]]
dfmi.loc[mask, :] = ...
Regardless, I guess it doesn't matter because changing the behavior of dfmi.loc[idx[mask,:,['C1','C3']],idx[:,'foo']] would break existing code, which presumably you'd like to avoid.
To summarize my understanding of this issue:
* indexing one level of a multilevel with a boolean indexer series shorter than the level does not work
* the easiest way to fix this is to reindex the boolean indexer with the level to be indexed
* this reindex call cannot easily be added to the codepath because of support for things like dfmi.loc[idx[mask,:,['C1','C3']],idx[:,'foo']], which uses a mask indexed by the full multiindex as the indexer for a single level.
Comment From: jreback
@lightcatcher yes this syntax is not the best, but I don't think we can / should change anything ATM.
if you have ideas on how to express something like:
mask = (dfmi[('a', 'foo')] > 200).loc[idx[:, :, ['C1','C3']], idx[:, :]]
dfmi.loc[mask, :] = ...
then pls open / add to issues https://github.com/pandas-dev/pandas2/issues