Using pandas 0.10. If we create a DataFrame with a multi-index, then delete all the rows with value X, we'd expect the index to no longer show value X. But it does. Note the apparent inconsistency between "index" and "index.levels" -- one shows the values have been deleted but the other doesn't.
import pandas
x = pandas.DataFrame([['deleteMe',1, 9],['keepMe',2, 9],['keepMeToo',3, 9]], columns=['first','second', 'third'])
x = x.set_index(['first','second'], drop=False)
x = x[x['first'] != 'deleteMe'] #Chop off all the 'deleteMe' rows
print x.index #Good: Index no longer has any rows with 'deleteMe'. But....
print x.index.levels #Bad: index still shows the "deleteMe" values are there. But why? We deleted them.
x.groupby(level='first').sum() #Bad: it's creating a dummy row for the rows we deleted!
We don't want the deleted values to show up in that groupby. Can we eliminate them?
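For reference, a runnable Python 3 version of the reproduction (a sketch; on current pandas the stale levels are still observable, though the dummy-row groupby behavior discussed below was fixed long ago):

```python
import pandas as pd

x = pd.DataFrame(
    [['deleteMe', 1, 9], ['keepMe', 2, 9], ['keepMeToo', 3, 9]],
    columns=['first', 'second', 'third'],
)
x = x.set_index(['first', 'second'], drop=False)
x = x[x['first'] != 'deleteMe']  # chop off all the 'deleteMe' rows

print(x.index)         # good: no rows with 'deleteMe' remain
print(x.index.levels)  # bad: 'deleteMe' still appears in the levels
```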
Comment From: ghost
related #2655 (maybe)
Comment From: wesm
This is kind of a tricky problem, e.g. when should you "recompute" the levels? Have to table this until I have a chance to look a bit more deeply. Another solution is to exclude levels with no observations in such a groupby
Comment From: darindillon
Well, is there any easy workaround I can use? Like if I know I have this problem, can I manually call a .rebuild_index() or something? I've played around with all the obvious possibilities (short of creating a brand new dataframe) and can't find any workaround. It's the last line (the .groupby(...).sum() ) that I care about -- that's the one I need to make the bad data go away.
EDIT: better clarification. I have one function that builds the dataset and drops the rows. At that point, I know I'm in the situation described in this issue, and I'd like to do my workaround there. But then the .groupby().sum() happens much much later in a different function. I could easily hack that second function as you say (exclude levels with no observations) but it makes more sense to keep my workaround code in the first function. Any ideas?
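One way to keep the cleanup inside the function that builds the frame is to rebuild the MultiIndex from the surviving rows right after dropping them. A sketch (the `build_frame` name is illustrative; `from_tuples` is used because it predates the `remove_unused_levels` method that later versions added):

```python
import pandas as pd

def build_frame():
    x = pd.DataFrame(
        [['deleteMe', 1, 9], ['keepMe', 2, 9], ['keepMeToo', 3, 9]],
        columns=['first', 'second', 'third'],
    )
    x = x.set_index(['first', 'second'], drop=False)
    x = x[x['first'] != 'deleteMe']
    # Rebuild the index from the remaining rows so stale levels
    # don't leak into groupbys done later, in other functions.
    x.index = pd.MultiIndex.from_tuples(list(x.index), names=x.index.names)
    return x

x = build_frame()
print(x.index.levels)  # only 'keepMe' and 'keepMeToo' remain
```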
Comment From: michaelaye
How about the workaround that I proposed for #2655 ? In your case maybe
x.groupby(x.index.get_level_values(1)).sum()
should do the correct thing, if I'm not wrong? I don't know why, but the result of this function delivers updated values.
Comment From: darindillon
Yes, that works; but the code that does .groupby().sum() is in one function and the code that removes the value from the table is in another function. It would be much clearer to use a workaround that cleans up the problem with the dataframe in the function that creates it -- that way any other function could use the dataframe without having to apply your trick.
Comment From: michaelaye
Ehm, can you confirm that this problem still exists with 0.10.1? I just tried your example, and I don't see a dummy row with index "deleteMe".
In [9]: print x.index.levels
[Index([deleteMe, keepMe, keepMeToo], dtype=object), Int64Index([1, 2, 3], dtype=int64)]
In [10]: x.groupby(level='first').sum()
Out[10]:
second third
first
keepMe 2 9
keepMeToo 3 9
Comment From: jreback
is this closable? @tavistmorph does this exist in 0.11-dev?
Comment From: darindillon
It's still an issue. It still happens for me in 0.10.1 and 0.11 (as of the last time I pulled, at least). Just run the code snippet in my original post and you can see it.
Michael -- the deleted row appears in your output above on step #9 ("deleteMe" should not be there since we deleted it) and then it appears in the output for step 10 ("first" should not appear since all the rows with the "first" value were deleted).
Comment From: wesm
This isn't really a bug. Perhaps an option should be added to return an array of observed values in a particular level in the index (which is what you're after)?
Comment From: michaelaye
Can you be more precise about what you mean by observed? Do you mean that the object is a view into the original object (I don't know if it is), and that's why it still contains the 'deleteMe' index?
Comment From: wesm
The levels are not computed from the actual observed values. For example, in R you can have a factor (categorical variable) in which some distinct values are not observed:
> d
[1] b c b c
Levels: a b c
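pandas has the same concept in its Categorical type, where declared categories can exist with no observations (a small illustrative sketch, not from the original discussion):

```python
import pandas as pd

# 'a' is a declared category that is never observed,
# just like the 'a' level in the R factor above.
s = pd.Series(pd.Categorical(['b', 'c', 'b', 'c'], categories=['a', 'b', 'c']))
print(list(s.cat.categories))  # ['a', 'b', 'c']
print(list(s.unique()))        # ['b', 'c'] -- only the observed values
```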
Comment From: michaelaye
Version: '0.12.0-1184-gc73b957'. The MultiIndex still shows all previously existing index values, and is therefore still confusing to the user who looks at it after chopping off the 'deleteMe' rows:
In [10]: x.index
Out[10]:
MultiIndex(levels=[[u'deleteMe', u'keepMe', u'keepMeToo'], [1, 2, 3]],
labels=[[1, 2], [1, 2]],
names=[u'first', u'second'])
but at least the groupby does not create an empty row anymore for previously existing indices:
In [12]: x.groupby(level='first').sum()
Out[12]:
second third
first
keepMe 2 9
keepMeToo 3 9
[2 rows x 2 columns]
So the discussion now boils down to the confusion of looking at df.index. Since I often look at the index to see what I am working with, I would argue that I would still be very puzzled by the index showing old values, and from that point on I would not trust the results anymore.
Comment From: jtratner
If you print that MultiIndex, it looks like what you want:
In [7]: mi
Out[7]:
MultiIndex(levels=[[u'deleteMe', u'keepMe', u'keepMeToo'], [1, 2, 3]],
labels=[[1, 2], [1, 2]],
names=[u'first', u'second'])
In [8]: print mi
first second
keepMe 2
keepMeToo 3
Thus, a simple way to handle this is to examine your indices with print rather than the repr that IPython shows you.
The MultiIndex repr isn't really intuitive in any case, unless you understand that it's a categorical and that labels represent the integer positions of the levels at each location. You shouldn't need to care about that as a consumer of a MultiIndex. And if you understand the internal representation, you can then also understand why it doesn't matter whether there are extra levels.
The issue becomes clearer with a more complicated MI:
In [2]: ind = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b', 'c'], ['d', 'e', 'f', 'g', 'h']])
In [3]: ind
Out[3]:
MultiIndex(levels=[[u'a', u'b', u'c'], [u'd', u'e', u'f', u'g', u'h']],
labels=[[0, 0, 1, 1, 2], [0, 1, 2, 3, 4]])
In [4]: print ind
a d
e
b f
g
c h
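(In current pandas `labels` has been renamed to `codes`, but the decomposition is the same.) A sketch of reassembling the row values from the levels and codes, to show that extra levels are harmless as long as no code points at them:

```python
import pandas as pd

ind = pd.MultiIndex.from_arrays(
    [['a', 'a', 'b', 'b', 'c'], ['d', 'e', 'f', 'g', 'h']]
)

# Each row is just (levels[0][i], levels[1][j]); the codes are
# integer positions into the level arrays.
rebuilt = [
    (ind.levels[0][i], ind.levels[1][j])
    for i, j in zip(ind.codes[0], ind.codes[1])
]
print(rebuilt)  # [('a', 'd'), ('a', 'e'), ('b', 'f'), ('b', 'g'), ('c', 'h')]
```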
Comment From: jtratner
Based on your previous comment, it seems like the key issue here (groupby showing unused levels) is now resolved. Can we close this or edit this issue to be a feature request? (e.g., method to allow MI to consolidate its levels)
As an aside, my perspective is that it's more intuitive to have the entire level set remain, because it makes slices very clear (and you can share the memory for storing levels):
In [15]: ind
Out[15]:
MultiIndex(levels=[[u'a', u'b', u'c'], [u'd', u'e', u'f', u'g', u'h']],
labels=[[0, 0, 1, 1, 2], [0, 1, 2, 3, 4]])
In [16]: ind[:2]
Out[16]:
MultiIndex(levels=[[u'a', u'b', u'c'], [u'd', u'e', u'f', u'g', u'h']],
labels=[[0, 0], [0, 1]])
In [17]: ind[2:4]
Out[17]:
MultiIndex(levels=[[u'a', u'b', u'c'], [u'd', u'e', u'f', u'g', u'h']],
labels=[[1, 1], [2, 3]])
In [18]: ind[4:5]
Out[18]:
MultiIndex(levels=[[u'a', u'b', u'c'], [u'd', u'e', u'f', u'g', u'h']],
labels=[[2], [4]])
Comment From: michaelaye
I would argue that using repr to examine pandas objects is the default and advertised use case -- the pandas docs are full of it -- so I don't find it really satisfying, and a tad inconsistent, that I have to resort to printing an object for clarity while repr works for all (most?) other cases. Also, I don't really understand what you want to show in your last comment. What does it have to do with deleted indexes remaining in the index?
Comment From: michaelaye
I also would like to point out that the pandas core team does not seem to have come to a consistent conclusion on how to handle this, as we have three issues related to it, and in one (#2655) the claim is made that it is not a bug, while the two others (#3686 and this one) have been marked as bugs. Maybe you guys should have an internal discussion about it.
Comment From: ghost
@michaelaye, I think you (legitimately) missed the point wes was making. My guess is that you're under a misconception about the role of levels. It is not the equivalent of a regular Index's labels; that equivalent is mi.labels.
wes made this point to you, and so has jtratner https://github.com/pydata/pandas/issues/2655#issuecomment-29418847 to no avail.
Test yourself with this example:
In [10]: MultiIndex.from_tuples([[0,1],[0,2]])
Out[10]:
MultiIndex(levels=[[0], [1, 2]],
labels=[[0, 0], [0, 1]])
Do you understand why the first element in levels only has one item?
Have you noticed that the number of elements in levels is not directly related to the number of rows in the frame? Then it makes sense that rows could be deleted without levels logically having to change.
You may find it counter-intuitive (I did in the past), but then the problem to be addressed is the misunderstanding, not the implementation. Hopefully, now you know.
The fact that a groupby emitted a group for entries that appear in levels but not in labels (what wes meant by "unobserved") was a bug, and it has been fixed.
I would venture a guess that the reason this non-bug issue has lingered for so long is a lack of time or significance or, indeed, of patience to spell things out like this, and not a paucity of managerial suggestions or demerits from you.
Also:
1. Issue labels are not holy scripture (I removed the "bug" label). The extra groupby row was a bug, though.
I agree with @jtratner, we can close this. If someone wants that consolidate method he suggested, open an issue. Personally, I don't see the need.
Comment From: michaelaye
Thank you for your efforts. I was indeed puzzled by the meaning of 'unobserved' and 'observed' and finally understand Wes' comment. Still, there are API calls that take levels as an argument, e.g. groupby(). If other users don't find it confusing to have a list of levels not representing the current state, then it must be me. I am sad to see that my effort to bring these issues forward is interpreted as a demerit of your excellent work, and I apologize if I upset anyone.
Comment From: jtratner
If you're finding something wrong with groupby (i.e. you end up with spurious levels in the final output), can you post it?
Comment From: michaelaye
I don't have anything showing up wrong, and I didn't mean to imply that. My work style relies heavily on looking at indices and columns with __repr__, because most of the time my dataframes are just too big to display in full. Your suggestion of printing solves the potential confusion from glimpsing the content of levels, but it is now the only object I would need to print. Any chance the definitions of __str__ and __repr__ could be swapped for the MultiIndex, or would that mess up other things?
Comment From: jtratner
@michaelaye unlikely to happen -- repr is set to be something that's potentially eval'able. Makes it much easier to reproduce indexes when their repr can be copy/pasted.
Comment From: michaelaye
Understood. Thanks for your patience.
Comment From: ghost
The pandas API doesn't fit in my head anymore. For reference, df.index.get_level_values might be relevant for whatever use case this was a problem for. It does the right thing.
x = pandas.DataFrame([['deleteMe',1, 9],['keepMe',2, 9],['keepMeToo',3, 9]], columns=['first','second', 'third'])
x = x.set_index(['first','second'], drop=False)
print x.index.get_level_values(0)
x = x[x['first'] != 'deleteMe'] # Chop off all the 'deleteMe' rows
print x.index.get_level_values(0)
Index([u'deleteMe', u'keepMe', u'keepMeToo'], dtype='object')
Index([u'keepMe', u'keepMeToo'], dtype='object')
Comment From: robertmuil
I'm afraid I don't understand why there is no easy way of instructing the dataframe to drop absent levels.
For a clear use case: counting the number of unique levels present in a data series is a computationally intensive task. If it could be done once, when drop or similar is called, then one need only look at index.levels to determine the unique levels, rather than use the clunky, and slow, data.index.get_level_values(level='whatever').unique().
I know this has been discussed considerably, but I did not see any mention of this 'unique counting' use case, which I see as rather important.
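On later pandas versions (0.20 and up), that one-time recompute is exactly what `MultiIndex.remove_unused_levels()` provides: pay the rebuild cost once after dropping, then read `index.levels` directly. A sketch:

```python
import pandas as pd

x = pd.DataFrame(
    [['deleteMe', 1, 9], ['keepMe', 2, 9], ['keepMeToo', 3, 9]],
    columns=['first', 'second', 'third'],
).set_index(['first', 'second'])

x = x.drop('deleteMe', level='first')

# Pay the recompute cost once...
x.index = x.index.remove_unused_levels()

# ...then the cheap levels lookup agrees with the slow unique() scan.
print(len(x.index.levels[0]))                            # 2
print(len(x.index.get_level_values('first').unique()))   # 2
```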
Comment From: jreback
@robertmuil this is an efficiency argument. It could be done on selection, but the user may be doing multiple things (e.g., selections or whatnot), and it is inefficient to recompute the levels each time (it can be an expensive operation). So it is completely up to the user to rebuild or not.
Comment From: robertmuil
@jreback I see, that makes a lot of sense. Would it not be possible, though, to provide a flag to the drop function which causes this rebuild to occur? Even if drop is used under the hood by the slicing / selection operators, as long as the default is not to rebuild, this shouldn't have any impact at all; it would just allow the user to request the rebuild when dropping items, and thus get a more intuitively consistent DataFrame from the drop operation.
It is also unclear to me what the recommended way to rebuild the levels of the index is: is it reset_index().set_index(...)?
Thanks very much for your quick response, by the way.
Comment From: jreback
can you put up a simple example of what you are doing?
Comment From: robertmuil
Sure:
> x = pandas.DataFrame([['deleteMe',1, 9],['keepMe',2, 9],['keepMeToo',3, 9]], columns=['first','second', 'third'])
> x.set_index(['first', 'second'], inplace=True)
Before dropping levels, the following are consistent:
> len(x.index.levels[0])
3
> len(x.index.get_level_values(level='first').unique())
3
But afterward, not:
> x.drop('deleteMe', level='first', inplace=True)
> len(x.index.levels[0])
3
> len(x.index.get_level_values(level='first').unique())
2
And this can be important, because get_level_values().unique() is hundreds of times slower. So I thought a possibility would be to allow the drop() function to be told explicitly to reindex:
> x.drop('deleteMe', level='first', inplace=True, reindex=True)
> len(x.index.levels[0])
2
Comment From: jreback
@robertmuil sorry, forgot to respond to you.
Here is an easy way to do this: create a new frame.
(FYI, in general doing things inplace is IMHO confusing to the user and doesn't help with speed at all.)
In [43]: dropped = x.drop('deleteMe', level='first')
In [44]: dropped
Out[44]:
third
first second
keepMe 2 9
keepMeToo 3 9
In [46]: dropped.index.get_level_values(level='first').unique()
Out[46]: array(['keepMe', 'keepMeToo'], dtype=object)
This returns a new frame (You can assign alternatively if you want)
In [47]: dropped.set_index(pd.MultiIndex.from_tuples(dropped.index.values))
Out[47]:
third
keepMe 2 9
keepMeToo 3 9
In [50]: dropped.set_index(pd.MultiIndex.from_tuples(dropped.index.values)).index.get_level_values(level=0).unique()
Out[50]: array(['keepMe', 'keepMeToo'], dtype=object)
# has only the new values
In [51]: dropped.set_index(pd.MultiIndex.from_tuples(dropped.index.values)).index.levels[0]
Out[51]: Index([u'keepMe', u'keepMeToo'], dtype='object')
This is pretty cheap to do (though not completely free).
I suppose you could add this as an option to drop if you'd like (and I would say reindex would be a fine keyword for it).
Would you like to do a pull request?
Comment From: pjadzinsky
Hi, I know this is a somewhat old thread, but I wanted to pitch in since I see some inconsistencies, and I'd like to learn whether this is my misunderstanding of something fundamental or whether it should be treated as a bug.
Here is a simple example.
In [22]: df = pd.DataFrame(np.random.randint(0,4,(3,2)))
In [23]: df2 = pd.concat([df,df,df], keys=['a','b','c'], names=[1,2])
In [24]: df2
Out[24]:
0 1
1 2
a 0 2 0
1 0 2
2 2 0
b 0 2 0
1 0 2
2 2 0
c 0 2 0
1 0 2
2 2 0
And now if I do:
In [27]: df2.loc[['a']].index
Out[27]:
MultiIndex(levels=[[u'a'], [0, 1, 2]],
labels=[[0, 0, 0], [0, 1, 2]],
names=[1, 2])
In [28]: df2.loc[['a','b']].index
Out[28]:
MultiIndex(levels=[[u'a', u'b', u'c'], [0, 1, 2]],
labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]],
names=[1, 2])
I was expecting the MultiIndex to be either identical in both cases (based on the discussion in this thread) or that the last one (Out[28]) should look like this:
MultiIndex(levels=[[u'a', u'b'], [0, 1, 2]],
labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]],
names=[1, 2])
Thanks in advance for clarifying my question and for an awesome package.
Comment From: jreback
@pjadzinsky this is an implementation detail ATM.
# this is what you are expecting
In [81]: df2.loc[[('a',0)]].index
Out[81]:
MultiIndex(levels=[['a', 'b', 'c'], [0, 1, 2]],
labels=[[0], [0]],
names=[1, 2])
# this is able to get an entire level so it takes a different path
In [82]: df2.loc[['a']].index
Out[82]:
MultiIndex(levels=[['a'], [0, 1, 2]],
labels=[[0, 0, 0], [0, 1, 2]],
names=[1, 2])
Comment From: wesm
It is slightly odd that the levels are modified in one case but not the other.
Comment From: toobaz
I think this can be closed: the default behavior is as intended, and the method MultiIndex.remove_unused_levels() has been added as a simple fix for whoever doesn't like the default behavior.
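A sketch of that solution applied to the example from the top of the thread (requires pandas >= 0.20):

```python
import pandas as pd

x = pd.DataFrame(
    [['deleteMe', 1, 9], ['keepMe', 2, 9], ['keepMeToo', 3, 9]],
    columns=['first', 'second', 'third'],
).set_index(['first', 'second'], drop=False)

x = x[x['first'] != 'deleteMe']
print(x.index.levels[0])  # stale: still includes 'deleteMe'

x.index = x.index.remove_unused_levels()
print(x.index.levels[0])  # only 'keepMe' and 'keepMeToo' remain
```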
Comment From: jreback
Yep, this is now the accepted solution.
Comment From: vldbnc
How could MultiIndex.remove_unused_levels() be the accepted solution?
You might filter a dataframe based on a level and create a new df or series:
df_new = df[df.index.isin(['VALUE'], level=0)]
The newly created df_new will correctly show a MultiIndex with df_new.index having only 'VALUE' on level=0, but df_new.index.levels[0] still shows all the level-0 index values from the original df.