Pandas sort_index behavior differs for the same DataFrame?

I feel like I've encountered a bug. In the following scenario, the first sort_index call behaves as expected, but the second does not. Does someone know what the difference is here?

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '0.15.2'

In [3]: tuples = [(' foo', 'bar'), ('foo', 'bar'), (' foo ()', 'bar')]

In [4]: cols = pd.MultiIndex.from_tuples(tuples)

In [5]: df = pd.DataFrame(index=cols, data={'baz': [0, 1, 2]})

In [6]: df
Out[6]: 
             baz
 foo    bar    0
foo     bar    1
 foo () bar    2

In [7]: df.sort_index()
Out[7]: 
             baz
 foo    bar    0
 foo () bar    2
foo     bar    1

In [8]: tuples = [(' foo', 'bar'), ('foo', 'bar')]

In [9]: cols = pd.MultiIndex.from_tuples(tuples)

In [10]: df = pd.DataFrame(index=cols, data={'baz': [0, 1]})

In [11]: df
Out[11]: 
          baz
 foo bar    0
foo  bar    1

In [12]: df.ix[(' foo ()', 'bar'), 'baz'] = 2

In [13]: df
Out[13]: 
             baz
 foo    bar    0
foo     bar    1
 foo () bar    2

In [14]: df.sort_index()
Out[14]: 
             baz
 foo    bar    0
foo     bar    1
 foo () bar    2

Comment From: tlmaloney

FWIW:

~
tmaloney@aal-lpc-2-03$  locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Comment From: jreback

under the hood this uses something like this: http://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html

this is kind=quicksort by default, which does not preserve the original order for ties, while mergesort does. It is not possible to pass thru the option for kind in sort_index, you can specify it for sort though.

So i'll mark this an enhancement if you'd like to put forth a PR.

Comment From: tlmaloney

I'll give a crack at it, thanks.

Comment From: tlmaloney

I do see that kind is a variable that can be passed into sort_index, see docs here. Although it says that it does not apply to a MultiIndex.

I'm not sure kind is the reason for the issue. I was able to get the behavior I want, although through a roundabout way, still using quicksort:

In [1]: import pandas as pd

In [2]: tuples = [(' foo', 'bar'), ('foo', 'bar')]

In [3]: cols = pd.MultiIndex.from_tuples(tuples)

In [4]: df = pd.DataFrame(index=cols, data={'baz': [0, 1]})

In [5]: df.ix[(' foo ()', 'bar'), 'baz'] = 2

In [6]: df
Out[6]: 
             baz
 foo    bar    0
foo     bar    1
 foo () bar    2

In [7]: df.reset_index()
Out[7]: 
   level_0 level_1  baz
0      foo     bar    0
1      foo     bar    1
2   foo ()     bar    2

In [8]: df.reset_index().sort(columns=['level_0', 'level_1'])
Out[8]: 
   level_0 level_1  baz
0      foo     bar    0
2   foo ()     bar    2
1      foo     bar    1

In [9]: df.reset_index().sort(columns=['level_0', 'level_1']).set_index(['level_0', 'level_1'])
Out[9]: 
                 baz
level_0 level_1     
 foo    bar        0
 foo () bar        2
foo     bar        1

Comment From: tlmaloney

This is also odd behavior. Adding a row doesn't simply append it to the bottom, it also switches the labels in the original two rows of the index.

In [1]: import pandas as pd

In [2]: tuples = [('foo', 'bar'), (' foo', 'bar')]

In [3]: cols = pd.MultiIndex.from_tuples(tuples)

In [4]: df = pd.DataFrame(index=cols, data={'baz': [0, 1]})

In [5]: df
Out[5]: 
          baz
foo  bar    0
 foo bar    1

In [6]: df.index
Out[6]: 
MultiIndex(levels=[[u' foo', u'foo'], [u'bar']],
           labels=[[1, 0], [0, 0]])

In [7]: df.ix[(' foo ()', 'bar'), 'baz'] = 2

In [8]: df
Out[8]: 
             baz
 foo    bar    1
foo     bar    0
 foo () bar    2

In [9]: df.index
Out[9]: 
MultiIndex(levels=[[u' foo', u'foo', u' foo ()'], [u'bar']],
           labels=[[0, 1, 2], [0, 0, 0]])

Comment From: behzadnouri

This is a bug, and the problem is when the frame is modified, df.index.lexsort_depth becomes invalid, and that breaks .sort_index:

>>> pd.__version__
'0.15.2-68-ge2b014c'
>>> df = pd.DataFrame([['b', 'c', 1]],
...                   columns=['1st', '2nd', '3rd']).set_index(['1st', '2nd'])
>>> df
         3rd
1st 2nd     
b   c      1
>>> df.ix[('a', 'b'), '3rd'] = 0
>>> df
         3rd
1st 2nd     
b   c      1
a   b      0
>>> df.index.lexsort_depth  #  <<< this is the BUG
2
>>> df.sort_index()
         3rd
1st 2nd     
b   c      1
a   b      0

Comment From: jreback

hmm, looks like sortorder needs to be invalidated. (however, then we de-facto have a new index at that point, so may need to copy the index in this case).

Comment From: tlmaloney

It looks like this has been an issue at least going back to v0.13.1:

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '0.13.1'

In [3]: df = pd.DataFrame([['b', 'c', 1]], columns=['1st', '2nd', '3rd']).set_index(['1st', '2nd'])

In [4]: df.ix[('a', 'b'), '3rd'] = 0

In [5]: df.index.is_lexsorted()
Out[5]: True

In [6]: df
Out[6]: 
         3rd
1st 2nd     
b   c      1
a   b      0

In v0.12.0, setting values was not working with .ix[...] =, but using set_value to set the new row, sort_index does work as expected.

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '0.12.0'

In [3]: df = pd.DataFrame([['b', 'c', 1]], columns=['1st', '2nd', '3rd']).set_index(['1st', '2nd'])

In [4]: df.ix[('a', 'b'), '3rd'] = 0
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-4-b91d80d789ee> in <module>()
----> 1 df.ix[('a', 'b'), '3rd'] = 0

/home/tmaloney/vedev/pandas-test-02/lib/python2.7/site-packages/pandas-0.12.0-py2.7-linux-x86_64.egg/pandas/core/indexing.pyc in __setitem__(self, key, value)
     82                 raise IndexingError('only tuples of length <= %d supported',
     83                                     self.ndim)
---> 84             indexer = self._convert_tuple(key)
     85         else:
     86             indexer = self._convert_to_indexer(key)

/home/tmaloney/vedev/pandas-test-02/lib/python2.7/site-packages/pandas-0.12.0-py2.7-linux-x86_64.egg/pandas/core/indexing.pyc in _convert_tuple(self, key)
     94         keyidx = []
     95         for i, k in enumerate(key):
---> 96             idx = self._convert_to_indexer(k, axis=i)
     97             keyidx.append(idx)
     98         return tuple(keyidx)

/home/tmaloney/vedev/pandas-test-02/lib/python2.7/site-packages/pandas-0.12.0-py2.7-linux-x86_64.egg/pandas/core/indexing.pyc in _convert_to_indexer(self, obj, axis)
    608                 mask = check == -1
    609                 if mask.any():
--> 610                     raise KeyError('%s not in index' % objarr[mask])
    611 
    612                 return indexer

KeyError: "['a'] not in index"

In [5]: df.set_value(('a', 'b'), '3rd', 0)
Out[5]: 
     3rd
b c    1
a b    0

In [6]: df = df.set_value(('a', 'b'), '3rd', 0)

In [7]: df
Out[7]: 
     3rd
b c    1
a b    0

In [8]: df.sort_index()
Out[8]: 
     3rd
a b    0
b c    1

Comment From: tlmaloney

Looks like GH4039 discussion is relevant. @jreback Should this issue be tagged as Bug instead of Enhancement?

Comment From: jreback

The main issue I have with this is that you cannot simply resort the index after an append. It will be completely non-performant. So you can simply try to invalidate sortorder which will recompute the sort depth when it is called for.

Comment From: tlmaloney

Agreed. How do you invalidate the sortorder? I don't know what that means.

Comment From: jreback

sorry...misspoke, you need to invalid the cache on lexsort_depth you can call mi._reset_cache()

Comment From: tlmaloney

I'm not sure that works.

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '0.15.2'

In [3]: df = pd.DataFrame([['b', 'c', 1]], columns=['1st', '2nd', '3rd']).set_index(['1st', '2nd'])

In [4]: df.ix[('a', 'b'), '3rd'] = 0

In [5]: df.index.is_lexsorted()
Out[5]: True

In [6]: df.index._cache
Out[6]: 
{'_engine': <pandas.index.ObjectEngine at 0x351c3f8>,
 'is_unique': True,
 'lexsort_depth': 2}

In [7]: df.index._res
df.index._reset_cache     df.index._reset_identity  

In [7]: df.index._reset_cache()

In [8]: df.index.is_lexsorted()
Out[8]: True

In [9]: df
Out[9]: 
         3rd
1st 2nd     
b   c      1
a   b      0

In [10]: df.index._cache
Out[10]: {'lexsort_depth': 2}

In [11]: df.index._reset_cache()

In [12]: df.index._cache
Out[12]: {}

In [13]: df
Out[13]: 
         3rd
1st 2nd     
b   c      1
a   b      0

In [14]: df.sort_index()
Out[14]: 
         3rd
1st 2nd     
b   c      1
a   b      0

Comment From: behzadnouri

so this does not depend on cache, but also occurs on a freshly generated index:

In [24]: mi = MultiIndex(levels=[['b', 'a'], ['b', 'a']],
   ....:                 labels=[[0, 1], [0, 1]])

In [25]: df = DataFrame([[0], [1]], index=mi)

In [26]: df
Out[26]:
     0
b b  0
a a  1

In [27]: df.sort_index()
Out[27]:
     0
b b  0
a a  1

lexsort_depth is only based on labels ( not levels ), so here the labels are in fact lexically sorted:

In [29]: df.index.labels
Out[29]: FrozenList([[0, 1], [0, 1]])

In [30]: df.index.lexsort_depth
Out[30]: 2

it is worth inspecting that if anywhere else in the code base breaks if labels are lexically sorted, but not the levels. (i.e. if they make the assumption that labels are always sorted)

sort_index also only looks at the labels; and since they are lexically sorted it stops there.

if other places in the code also make the assumption that the labels are sorted then this should be fixed in MultiIndex.__new__; otherwise sort_index should also check levels in addition to labels.

computationally, former path, is not very cheap, so it is worth confirming first that other places in the code depend on levels being sorted and break otherwise.

Comment From: tlmaloney

@behzadnouri Thanks for your looking more into this.

@jreback

I have a couple of comments. (FWIW, I've been using pandas since 0.9.1 but haven't had a need to dig in until now since it has really just worked. I hope to one day make a contribution myself. My mental model may be out of date with the fast pace of development.) 1. This issue might have raised a couple of distinct issues, which may be worth breaking apart into separate issues. 2. I'm trying to get the correct mental model of the MultiIndex. From here on out I'm thinking in terms of a MultiIndex in the df.index sense and not the df.columns sense. Essentially a MultiIndex is made up of labels (the rows, in my mind). A MultiIndex has a number of levels (the columns, in my mind). So when I see something like this I get confused:

In [5]: mi = pd.MultiIndex(levels=[['a', 'b'], ['c', 'd'], ['e', 'f']], labels=[[0, 1], [0, 1], [0, 1]])

In [6]: mi
Out[6]: 
MultiIndex(levels=[[u'a', u'b'], [u'c', u'd'], [u'e', u'f']],
           labels=[[0, 1], [0, 1], [0, 1]])

In [7]: print mi

a  c  e
b  d  f

Because I think of ('a', 'c', 'e') and ('b', 'd', 'f') as the labels. I understand the above __repr__ (line 6) is an internal representation. Am I wrong to think there's two labels, and not three? Perhaps this is a naming bug on MultiIndex.labels.

Comment From: jreback

@tlmaloney If you'd like to create a separate issue for a distince issue/bug, pls do so, keeping in mind that they should have reproducible examples. can always xref back to here if needed. Generally having 1 issue per 'thing' is a good idea.

Using another example with differnt label lenghts to clarify your mental model

In [16]: mi = pd.MultiIndex(levels=[['a', 'b'], ['c', 'd'], ['e', 'f']], labels=[[0, 1], [0, 1], [0, 1]])

In [18]: mi.values
Out[18]: array([('a', 'c', 'e'), ('b', 'd', 'f')], dtype=object)

In [20]: mi2 = pd.MultiIndex(levels=[['a', 'b'], ['c', 'd'], ['e', 'f']], labels=[[0, 1, 1], [0, 1, 1], [0, 0, 1]])

In [22]: mi2.values
Out[22]: array([('a', 'c', 'e'), ('b', 'd', 'e'), ('b', 'd', 'f')], dtype=object)

So you see that the labels define how long the combinations are, while the length of the labels/levels themselves are the number of levels in the MI. The labels are an indexer INTO the levels array. This is conceptually what a Categorical is (and is actually implemented like this, kind of an unrolled Categorical).

Note that this has nothing to do with df.index/df.columns, they BOTH could be MultiIndexes or Index. The result of mi2.values are the column labels (or index labels), each TUPLE is a label (though its shown in a sparsified way and not a tuple).

Comment From: tlmaloney

@jreback That's really interesting, I now understand what's going on a lot better, thanks. There is some cognitive dissonance in me with these two definitions: 1. Index.labels as the indexer into the levels array 2. label as the key value in an Index

Do you also see how it could be a bit confusing? I think there is a naming bug, but since naming is hard and is unrelated to the original issue, if you agree I can create a separate issue and xref this one.

Comment From: jreback

.labels only exists on MultiIndex. I agree these are somewhat confusing, but these are internal references (e.g. .levels/.labels).

We did change these for Categorical, e.g. .categories and .codes are the .levels and .labels. However for back-compat I don't think we are likely to change them for MultiIndex. (and even if we did, I think .levels is very natural, so I suppose labels->codes IS possible).

These are came original from R, fyi.

Comment From: jreback

you can see somewhat of a related discussion in #3268

Comment From: tlmaloney

The last comment by @behzadnouri gets at the heart of the problem I think. I'm not sure why the sorting methods look at .labels instead of .levels. Referencing .labels also seems to be an issue with sortlevel:

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '0.15.2-103-gfda5012'

In [3]: tuples = [('a', 'x'), ('c', 'x')]

In [4]: idx = pd.MultiIndex.from_tuples(tuples)

In [5]: df = pd.DataFrame(index=idx, data={'baz': [0, 1]})

In [6]: df
Out[6]: 
     baz
a x    0
c x    1

In [7]: df.index
Out[7]: 
MultiIndex(levels=[[u'a', u'c'], [u'x']],
           labels=[[0, 1], [0, 0]])

In [8]: df.ix[('b', 'x'), 'baz'] = 2

In [9]: df
Out[9]: 
     baz
a x    0
c x    1
b x    2

In [10]: df.index
Out[10]: 
MultiIndex(levels=[[u'a', u'c', u'b'], [u'x']],
           labels=[[0, 1, 2], [0, 0, 0]])

In [11]: df.sort_index()
Out[11]: 
     baz
a x    0
c x    1
b x    2

In [12]: df.ix[('a', 'y'), 'baz'] = 3

In [13]: df.index
Out[13]: 
MultiIndex(levels=[[u'a', u'c', u'b'], [u'x', u'y']],
           labels=[[0, 1, 2, 0], [0, 0, 0, 1]])

In [14]: df
Out[14]: 
     baz
a x    0
c x    1
b x    2
a y    3

In [15]: df.sort_index()
Out[15]: 
     baz
a x    0
  y    3
b x    2
c x    1

In [16]: df
Out[16]: 
     baz
a x    0
c x    1
b x    2
a y    3

In [17]: df.sortlevel(0)
Out[17]: 
     baz
a x    0
  y    3
c x    1
b x    2

Comment From: tlmaloney

Similar discussion in #8017.

Comment From: tlmaloney

@jreback @behzadnouri

The below highlights different behavior between the MultiIndex append method and setting with enlargement on a DataFrame with .ix[...] =. Compare idx3 and df.index after line 13. Do you think this is getting closer to the problem?

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '0.15.2-103-gfda5012'

In [3]: tuples1 = [('a', 'x'), ('c', 'x')]

In [4]: tuples2 = [('b', 'x')]

In [5]: idx1 = pd.MultiIndex.from_tuples(tuples1)

In [6]: idx2 = pd.MultiIndex.from_tuples(tuples2)

In [7]: idx3 = idx1.append(idx2)

In [8]: idx1
Out[8]: 
MultiIndex(levels=[[u'a', u'c'], [u'x']],
           labels=[[0, 1], [0, 0]])

In [9]: idx2
Out[9]: 
MultiIndex(levels=[[u'b'], [u'x']],
           labels=[[0], [0]])

In [10]: idx3
Out[10]: 
MultiIndex(levels=[[u'a', u'b', u'c'], [u'x']],
           labels=[[0, 2, 1], [0, 0, 0]])

In [11]: df = pd.DataFrame(index=idx1, data={'baz': [0, 1]})

In [12]: df
Out[12]: 
     baz
a x    0
c x    1

In [13]: df.ix[('b', 'x'), 'baz'] = 2

In [14]: df
Out[14]: 
     baz
a x    0
c x    1
b x    2

In [15]: df.index
Out[15]: 
MultiIndex(levels=[[u'a', u'c', u'b'], [u'x']],
           labels=[[0, 1, 2], [0, 0, 0]])

In [16]: idx3.values
Out[16]: array([('a', 'x'), ('c', 'x'), ('b', 'x')], dtype=object)

In [17]: idx3.sortlevel(0)
Out[17]: 
(MultiIndex(levels=[[u'a', u'b', u'c'], [u'x']],
            labels=[[0, 1, 2], [0, 0, 0]]), array([0, 2, 1]))

In [18]: df.index.sortlevel(0)
Out[18]: 
(MultiIndex(levels=[[u'a', u'c', u'b'], [u'x']],
            labels=[[0, 1, 2], [0, 0, 0]]), array([0, 1, 2]))

Comment From: tlmaloney

Quick question: what does "associated factor" mean here?

Comment From: 8one6

What's the status on this one now? Is anyone actively working on a fix of the underlying issue? And, separately, is there any workaround for when "you just need to get the darn dataframe sorted" while we wait for/work on a more permanent fix for sorting?

Comment From: jreback

@8one6 well its an active issue with no pr, so no-one is working on it. you are welcome to. .sortlevel() works just fine.

Comment From: 8one6

Got it. So in a pinch, I can just sortlevel on each level (starting from innermost) and that'll get the job done?

Comment From: jreback

no just sortlevel, you don't need to do anything special. this is a quite degenerate case.

Comment From: jreback

duplicate of #13431