Pandas Boolean slicing on MultiIndex only works if bool index is same length as MultiIndex

Code Sample, a copy-pastable example if possible

from __future__ import print_function
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([['foo', 'bar'], ['a', 'b']],
                                 names=['first', 'second'])
df = pd.DataFrame(np.random.randn(4, 2), index=idx).sort_index()

bool_idx = pd.Series([True, False], index=['foo', 'bar'])

print('dataframe')
print(df)

print('\nbool idx')
print(bool_idx)

try:
    df.loc[(bool_idx, slice(None)), :]
except ValueError as e:
    import traceback
    print('\nbool index in MultiIndex failed')
    traceback.print_exc()

no_multi_idx = df.reset_index(level=1)
print('\nno multi index')
print(no_multi_idx)

print('successful indexing')
print(no_multi_idx.loc[bool_idx])

Problem description

A level of a MultiIndex cannot be sliced by a boolean array shorter than the level. A single level of a MultiIndex can contain repeated values (as one level isn't the entire index), so slicing by a smaller array can be necessary. A single Index with repeated values can currently be sliced by a boolean Series whose own index has no repeated values. It would be a smoother user experience if the same feature existed for MultiIndex, especially since this is less of a corner case with a MultiIndex (because it doesn't require repeated index values). I would guess that the same or very similar code to the single Index case could be used to handle the MultiIndex case.

Expected Output

                     0         1
first second                    
foo   a      -0.569935 -1.146863
      b      -2.087799 -0.506962

This can currently be computed by calling reset_index, slicing, and then calling set_index with append=True.
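A minimal sketch of that workaround, using deterministic values in place of the random data above. The variable names flat, selected and result are only illustrative:

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([["foo", "bar"], ["a", "b"]],
                                 names=["first", "second"])
# Deterministic values (instead of np.random.randn) so the result is reproducible.
df = pd.DataFrame(np.arange(8.0).reshape(4, 2), index=idx).sort_index()
bool_idx = pd.Series([True, False], index=["foo", "bar"])

# Move 'second' out of the index so that only 'first' remains as a plain Index.
flat = df.reset_index(level="second")
# The boolean Series is aligned against the (repeated) 'first' labels.
selected = flat.loc[bool_idx]
# Put 'second' back to restore the original MultiIndex.
result = selected.set_index("second", append=True)
print(result)
```

This keeps only the rows whose first level is 'foo', at the cost of a round trip through reset_index and set_index.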

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Linux
OS-release: 3.19.0-15-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None
pandas: 0.19.1
nose: 1.3.4
pip: 9.0.1
setuptools: 18.0.1
Cython: 0.21.1
numpy: 1.11.2
scipy: 0.14.1
statsmodels: None
xarray: None
IPython: 3.2.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.7
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.4.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.3.2
html5lib: 0.999
httplib2: 0.9.2
apiclient: None
sqlalchemy: 1.0.8
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: None
pandas_datareader: None

Comment From: sinhrks

Yeah, this needs logic to handle a tuple that contains boolean array-likes.

df.loc[(['foo'], slice(None)), :]
#                      0         1
# first second
# foo   a      -0.168657 -0.495094
#       b       0.396207 -1.310770

df.loc[([True, False], slice(None)), :]
# ValueError: cannot index with a boolean indexer that is not the same length as the index

Comment From: jorisvandenbossche

@sinhrks Note that in the original example a Series was used, not a plain list. As for a boolean list, the above is correct: it should have the same length. It is only when having a boolean Series, that the Series is aligned before indexing.
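A small sketch of that difference on a single (non-Multi) Index with repeated labels. The names s, by_label and list_error are made up for illustration, and the exception type raised for the plain list may differ between pandas versions, so both likely types are caught:

```python
import pandas as pd

# A single-level index with repeated labels.
s = pd.Series([10, 20, 30, 40], index=["bar", "bar", "foo", "foo"])

# A boolean Series is aligned on its labels before indexing,
# so it may be shorter than the object being indexed.
by_label = pd.Series([True, False], index=["foo", "bar"])
aligned_result = s.loc[by_label]

# A plain boolean list carries no labels, so it has to match the length.
try:
    s.loc[[True, False]]
    list_error = None
except (IndexError, ValueError) as exc:
    list_error = exc

print(aligned_result)
print(type(list_error).__name__)
```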

Comment From: jreback

This would be a major API change, and I don't even think this is possible, meaning that this could lead to ambiguous circumstances. Further, this is not at all convenient; constructing a boolean index that is exactly equal to the length of THAT particular level is not obvious. The use case generally is this:

In [17]: df.loc[(df[0]>0, slice(None)), :]
Out[17]:
                     0         1
first second
bar   a       0.357749  1.191444
foo   a       0.334568  0.979802
      b       0.188320  0.085018

@lightcatcher can you show a use case where this is actually useful? (pre-constructing a boolean index the same length as THAT level).

I keep emphasizing THAT level, because in this case the levels are the same length (2) and easy enough. But most multi-indexes are not like that. The levels are different and usually bigger.

Comment From: eamartin

I was attempting to do this type of indexing instead of a join. I had no reason to choose this over a join, it just came to mind first. A join seems like the more idiomatic way to go, but the support for single Index but not MultiIndex made me think it was worth opening an issue.

Example use case: I have dataframes foo, bar, and foobar. foo has index foo_id, bar has index bar_id, and foobar has index (foo_id, bar_id). I want to select the rows of foobar that have a foo_id such that foo.loc[foo_id][col] > 0 for a col that's stored on foo but not on foobar. I attempted to do this with result = foobar.loc[(foo[col] > 0, slice(None)), :], and this led to the error message. Notably, this would work if foobar's index was just foo_id rather than (foo_id, bar_id).

I think the more standard way to do this computation is:

joined = foobar.join(foo)
result = joined[joined[col] > 0][foobar.columns]

Algorithmically (at least in this case), the unequal length series index lookup is equivalent to a join. Maybe the boolean indexing this issue is about can be viewed as sugar on top of a join with a boolean series?
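A runnable sketch of the join approach on toy data. The frames foo and foobar and the column names score and value are invented here purely for illustration; score plays the role of col:

```python
import pandas as pd

# Hypothetical data: 'score' lives on foo only.
foo = pd.DataFrame({"score": [1.0, -1.0]},
                   index=pd.Index([10, 20], name="foo_id"))
foobar = pd.DataFrame(
    {"value": [1, 2, 3, 4]},
    index=pd.MultiIndex.from_product([[10, 20], [1, 2]],
                                     names=["foo_id", "bar_id"]),
)

# The join matches foo's index against the 'foo_id' level of foobar's index.
joined = foobar.join(foo)
# Filter on the joined column, then keep only foobar's original columns.
result = joined.loc[joined["score"] > 0, foobar.columns]
print(result)
```

Only the rows with foo_id 10 survive, since that is the only foo row with a positive score.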

Comment From: jorisvandenbossche

@lightcatcher trying to translate your indexing-approach to code that already works now:

In [21]: bool_idx
Out[21]: 
foo     True
bar    False
dtype: bool

In [22]: bool_idx_aligned = bool_idx.reindex(df.index.get_level_values(0))

In [23]: bool_idx_aligned
Out[23]: 
first
bar    False
bar    False
foo     True
foo     True
dtype: bool

In [24]: df.loc[(bool_idx_aligned, slice(None)), :]
Out[24]: 
                     0         1
first second                    
foo   a       1.420829  0.464567
      b      -0.393165 -0.166931

So here I do the 'alignment' manually, so the passed indexer has the correct length and order. And I think your question is then if df.loc[(bool_idx, slice(None)), :] should do such reindexing on the indexer automatically.
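The manual alignment could be wrapped in a small helper. This is only a sketch: select_by_level is a hypothetical name, not a pandas function, and it assumes the boolean Series has a unique index. It indexes with a plain full-length boolean array rather than the per-level tuple syntax:

```python
import numpy as np
import pandas as pd


def select_by_level(df, bool_idx, level=0):
    """Hypothetical helper: keep rows whose label at `level` is True in bool_idx."""
    level_values = df.index.get_level_values(level)
    # Reindex the short boolean Series against the repeated level labels.
    aligned = bool_idx.reindex(level_values)
    if aligned.isna().any():
        raise KeyError("bool_idx is missing labels that occur in the level")
    # A full-length boolean array is a plain row mask for .loc.
    return df.loc[aligned.to_numpy(dtype=bool)]


idx = pd.MultiIndex.from_product([["foo", "bar"], ["a", "b"]],
                                 names=["first", "second"])
df = pd.DataFrame(np.arange(8.0).reshape(4, 2), index=idx).sort_index()
bool_idx = pd.Series([True, False], index=["foo", "bar"])

result = select_by_level(df, bool_idx, level="first")
print(result)
```

Raising on missing labels is a design choice made for this sketch; treating them as False would be another reasonable option.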

Comment From: jreback

@jorisvandenbossche

I was making the point that it's harder to actually create the bool_idx in the first place. E.g. it's much easier simply to use .isin() (or directly use the indexers, meaning what you want to select).

I was trying to get why @lightcatcher created the bool_idx (other than to try things out). In the real world, you almost always programmatically create a boolean indexer.

Comment From: eamartin

@jorisvandenbossche Thanks for the reindex example. This issue is to decide (1) should df.loc[(bool_idx, slice(None)), :] be supported when len(bool_idx) != len(df.index) and (2) if so, how to implement it. It appears reindex is an option for (2).

@jreback Do you understand where my initial example (with df and bool_idx) came from? I submitted this as a reduced test case for the foo, bar, foobar dataframes problem I later described. Is using join the idiomatic way to do the foo, bar, foobar example? You mentioned you think this is a major API change that could lead to ambiguities. Do these concerns apply if there's just a new step to reindex the boolean indexer to the appropriate multiindex level if needed?

Comment From: jorisvandenbossche

@jreback yes, I get that, and @lightcatcher explained why he has such a strange bool_idx, and I think it is a valid use case (which does not necessarily mean that I think enabling such automatic behaviour is a good idea, IMO it is too much magic).

So I am just looking for ways to do this indexing approach without running into the original reported error. Reindexing the indexer is one possibility (as shown above), isin can indeed be another:

In [32]: df.loc[df.index.get_level_values(0).isin(bool_idx.index[bool_idx])]
Out[32]: 
                     0         1
first second                    
foo   a       1.420829  0.464567
      b      -0.393165 -0.166931

(the bool_idx.index[bool_idx] can of course be simplified if you have the underlying dataframe based on which this bool_idx is created)
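As a sketch of that simplification, assuming the foo / foobar layout described earlier in the thread (the data and the column names score and value are made up for illustration), the wanted labels can be taken straight from the underlying frame:

```python
import pandas as pd

# Hypothetical frames mirroring the foo / foobar description.
foo = pd.DataFrame({"score": [1.0, -1.0]},
                   index=pd.Index([10, 20], name="foo_id"))
foobar = pd.DataFrame(
    {"value": [1, 2, 3, 4]},
    index=pd.MultiIndex.from_product([[10, 20], [1, 2]],
                                     names=["foo_id", "bar_id"]),
)

# Labels to keep, taken directly from foo without building a bool_idx first.
wanted = foo[foo["score"] > 0].index
# Full-length boolean mask over foobar's rows.
mask = foobar.index.get_level_values("foo_id").isin(wanted)
result = foobar[mask]
print(result)
```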

@lightcatcher I understand your expectation that something that works on a single-indexed frame should also work on a multi-indexed frame. However, the alignment is something rather magical, and it is not clear that it should necessarily only align to the current index level, and not to the full multi-index. Therefore, I am not sure I am a fan of adding this functionality.

In any case, those situations where your boolean indexer is a Series with an index deviating from the dataframe's index are rather underspecified and not documented well (also for a single level index). So this area can certainly be improved.

Comment From: jreback

@lightcatcher I understand what you are trying to do.

As @jorisvandenbossche indicated, indexing is already quite complicated. This does not seem like a natural extension, as it's really very different from boolean slicing as it already exists.

Comment From: shoyer

@jreback writes:

The use case generally is this:

In [17]: df.loc[(df[0]>0, slice(None)), :]
Out[17]:
                     0         1
first second
bar   a       0.357749  1.191444
foo   a       0.334568  0.979802
      b       0.188320  0.085018

I am actually surprised that even this works -- it is not at all obvious to me what it means to index a level of a MultiIndex with a boolean with length equal to the entire index. A less ambiguous way to write this would be df.loc[df[0]>0, :].

In fact, if I had to guess at how df.loc[(bool_array, slice(None)), :] works, it seems at least as natural to use bool_array to index along df.index.levels[0]. Of course, this would still be confusing to many users (especially when not all levels are used in the actual index), so I don't really recommend it either.

Given the importance of preserving backwards compatibility, we shouldn't change this. But for pandas 2, I would consider making all attempts to index a level of a MultiIndex with a boolean raise an error.

Comment From: eamartin

@jorisvandenbossche Thanks for the explanation. You said:

However, the alignment is something rather magical, and it is not clear that it should necessarily only align to the current index level, and not to the full multi-index.

This seems clear with the slicer syntax I'm using. The slicer syntax is df.loc[(level_0_indexer, level_1_indexer, ...), :], where : could be replaced by another tuple of slices for the column levels. My (very uneducated) guess is that the level_i_indexer could be reindexed against df.index.get_level_values(i) safely. Is this not the case?

@shoyer I'm also surprised A = df.loc[(df[0]>0, slice(None)), :] works, as either B = df.loc[df[0]>0, :] or C = df.loc[(df.loc[df[0] > 0, :].index.get_level_values(0), slice(None)), :] makes more sense to me. From evaluating, it appears A == B and A != C. Personally, I would expect that A == C, because otherwise what's the point of using boolean series with per-level indexers? However, this is tangential to the initial issue I opened.
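A sketch that reproduces the A / B / C comparison on small deterministic data. The values are chosen only so that A and C differ, and first_labels is an illustrative name. The comparison reflects the behaviour described in this thread and may not hold for every pandas version:

```python
import pandas as pd

idx = pd.MultiIndex.from_product([["bar", "foo"], ["a", "b"]],
                                 names=["first", "second"])
# Column 0 is negative only for ('bar', 'a').
df = pd.DataFrame({0: [-1.0, 1.0, 2.0, 3.0], 1: [0.0, 0.0, 0.0, 0.0]},
                  index=idx)

mask = df[0] > 0

# A: full-length boolean mask placed in the first slot of the per-level tuple.
A = df.loc[(mask, slice(None)), :]
# B: the same mask used as a plain row indexer.
B = df.loc[mask, :]
# C: select every row whose first-level label has at least one positive value.
first_labels = list(df[mask].index.get_level_values("first").unique())
C = df.loc[(first_labels, slice(None)), :]

print(A.equals(B))
print(len(A), len(C))
```

Here A and B both drop the row ('bar', 'a'), while C keeps it because 'bar' also has a row with a positive value.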

Comment From: jreback

The slicing syntax per-level is meant to support a non-chained version that allows things like this: from the docs: http://pandas-docs.github.io/pandas-docs-travis/advanced.html#using-slicers

In [56]: mask = dfmi[('a','foo')]>200

In [57]: dfmi.loc[idx[mask,:,['C1','C3']],idx[:,'foo']]
Out[57]: 
lvl0           a    b
lvl1         foo  foo
A3 B0 C1 D1  204  206
      C3 D0  216  218
         D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254

There is no other way to set values without chained indexing when you need simultaneous multi-level label selection and a column boolean.

Furthermore, it is syntactically hard to even specify this, e.g. what if you wanted to simultaneously index all of the levels in some way AND use a boolean mask?

It's probably not used very much, or this would have come up earlier.

Comment From: eamartin

@jreback If I understand the example from docs + your explanation correctly, then dfmi.loc[idx[mask,:,['C1','C3']],idx[:,'foo']] is the same as dfmi.loc[idx[:, mask, ['C1','C3']], idx[:, 'foo']]. Is that correct?

This syntax exists because the alternative would be to do dfmi.loc[mask].loc[idx[:, :, ['C1', 'C3']], idx[:, :]] and this can't be used as an lval for assignment because of chaining?

If avoiding chaining is the issue, could that dereference be written as

mask = (dfmi[('a', 'foo')] > 200).loc[idx[:, :, ['C1','C3']], idx[:, :]]
dfmi.loc[mask, :] = ...

Regardless, I guess it doesn't matter because changing the behavior of dfmi.loc[idx[mask,:,['C1','C3']],idx[:,'foo']] would break existing code, which presumably you'd like to avoid.

To summarize my understanding of this issue:

* indexing one level of a multilevel with a boolean indexer series shorter than the level does not work
* the easiest way to fix this is to reindex the boolean indexer with the level to be indexed
* this reindex call cannot easily be added to the codepath because of support for things like dfmi.loc[idx[mask,:,['C1','C3']],idx[:,'foo']] which is using a multiindex as the indexer for a single level.

Comment From: jreback

@lightcatcher yes this syntax is not the best, but I don't think we can / should change anything ATM.

if you have ideas on how to express something like:

mask = (dfmi[('a', 'foo')] > 200).loc[idx[:, :, ['C1','C3']], idx[:, :]]
dfmi.loc[mask, :] = ...

then pls open / add to issues https://github.com/pandas-dev/pandas2/issues
