I feel like I've encountered a bug. In the following scenario, the first sort_index
call behaves as expected, but the second does not. Does someone know what the difference is here?
In [1]: import pandas as pd
In [2]: pd.__version__
Out[2]: '0.15.2'
In [3]: tuples = [(' foo', 'bar'), ('foo', 'bar'), (' foo ()', 'bar')]
In [4]: cols = pd.MultiIndex.from_tuples(tuples)
In [5]: df = pd.DataFrame(index=cols, data={'baz': [0, 1, 2]})
In [6]: df
Out[6]:
baz
foo bar 0
foo bar 1
foo () bar 2
In [7]: df.sort_index()
Out[7]:
baz
foo bar 0
foo () bar 2
foo bar 1
In [8]: tuples = [(' foo', 'bar'), ('foo', 'bar')]
In [9]: cols = pd.MultiIndex.from_tuples(tuples)
In [10]: df = pd.DataFrame(index=cols, data={'baz': [0, 1]})
In [11]: df
Out[11]:
baz
foo bar 0
foo bar 1
In [12]: df.ix[(' foo ()', 'bar'), 'baz'] = 2
In [13]: df
Out[13]:
baz
foo bar 0
foo bar 1
foo () bar 2
In [14]: df.sort_index()
Out[14]:
baz
foo bar 0
foo bar 1
foo () bar 2
Comment From: tlmaloney
FWIW:
~
tmaloney@aal-lpc-2-03$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
Comment From: jreback
Under the hood this uses something like numpy.argsort: http://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html
The default is kind=quicksort, which does not preserve the original order for ties, while mergesort does. It is not possible to pass the kind option through sort_index; you can specify it for sort, though.
So I'll mark this as an enhancement if you'd like to put forth a PR.
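To illustrate the point about ties, here is a minimal sketch that calls numpy.argsort directly. The array values are made up for illustration, and this is not the pandas internals, only the stability property that the kind argument controls:

```python
import numpy as np

# Sort keys with ties: positions 1 and 3 share key 0; positions 0, 2, 4 share key 1.
keys = np.array([1, 0, 1, 0, 1])

# mergesort is stable: tied keys keep their original relative order.
stable_order = np.argsort(keys, kind="mergesort")
print(stable_order.tolist())  # [1, 3, 0, 2, 4]

# quicksort still sorts the keys, but the order among ties is not guaranteed.
quick_order = np.argsort(keys, kind="quicksort")
print(keys[quick_order].tolist())  # [0, 0, 1, 1, 1]
```

Only the sortedness of the keys is checked for quicksort, since its tie order may or may not match the stable one on a given input.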
Comment From: tlmaloney
I'll give a crack at it, thanks.
Comment From: tlmaloney
I do see that kind is a variable that can be passed into sort_index, see docs here. Although it says that it does not apply to a MultiIndex.
I'm not sure kind is the reason for the issue. I was able to get the behavior I want, although through a roundabout way, still using quicksort:
In [1]: import pandas as pd
In [2]: tuples = [(' foo', 'bar'), ('foo', 'bar')]
In [3]: cols = pd.MultiIndex.from_tuples(tuples)
In [4]: df = pd.DataFrame(index=cols, data={'baz': [0, 1]})
In [5]: df.ix[(' foo ()', 'bar'), 'baz'] = 2
In [6]: df
Out[6]:
baz
foo bar 0
foo bar 1
foo () bar 2
In [7]: df.reset_index()
Out[7]:
level_0 level_1 baz
0 foo bar 0
1 foo bar 1
2 foo () bar 2
In [8]: df.reset_index().sort(columns=['level_0', 'level_1'])
Out[8]:
level_0 level_1 baz
0 foo bar 0
2 foo () bar 2
1 foo bar 1
In [9]: df.reset_index().sort(columns=['level_0', 'level_1']).set_index(['level_0', 'level_1'])
Out[9]:
baz
level_0 level_1
foo bar 0
foo () bar 2
foo bar 1
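For reference, the order obtained above is just the plain string order of the index tuples. A minimal pure-Python sketch (no pandas involved; the rows name is only for illustration) shows why ' foo ()' lands between ' foo' and 'foo': the leading space sorts before any letter.

```python
# Index tuples paired with their 'baz' values, in insertion order.
rows = [((" foo", "bar"), 0), (("foo", "bar"), 1), ((" foo ()", "bar"), 2)]

# Sort by the tuple of strings; a leading space (0x20) sorts before 'f'.
ordered = sorted(rows, key=lambda row: row[0])

for key, baz in ordered:
    print(repr(key), baz)
```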
Comment From: tlmaloney
This is also odd behavior. Adding a row doesn't simply append it to the bottom, it also switches the labels in the original two rows of the index.
In [1]: import pandas as pd
In [2]: tuples = [('foo', 'bar'), (' foo', 'bar')]
In [3]: cols = pd.MultiIndex.from_tuples(tuples)
In [4]: df = pd.DataFrame(index=cols, data={'baz': [0, 1]})
In [5]: df
Out[5]:
baz
foo bar 0
foo bar 1
In [6]: df.index
Out[6]:
MultiIndex(levels=[[u' foo', u'foo'], [u'bar']],
labels=[[1, 0], [0, 0]])
In [7]: df.ix[(' foo ()', 'bar'), 'baz'] = 2
In [8]: df
Out[8]:
baz
foo bar 1
foo bar 0
foo () bar 2
In [9]: df.index
Out[9]:
MultiIndex(levels=[[u' foo', u'foo', u' foo ()'], [u'bar']],
labels=[[0, 1, 2], [0, 0, 0]])
Comment From: behzadnouri
This is a bug. The problem is that when the frame is modified, df.index.lexsort_depth becomes invalid, and that breaks .sort_index:
>>> pd.__version__
'0.15.2-68-ge2b014c'
>>> df = pd.DataFrame([['b', 'c', 1]],
... columns=['1st', '2nd', '3rd']).set_index(['1st', '2nd'])
>>> df
3rd
1st 2nd
b c 1
>>> df.ix[('a', 'b'), '3rd'] = 0
>>> df
3rd
1st 2nd
b c 1
a b 0
>>> df.index.lexsort_depth # <<< this is the BUG
2
>>> df.sort_index()
3rd
1st 2nd
b c 1
a b 0
Comment From: jreback
hmm, looks like sortorder needs to be invalidated. (However, we then de facto have a new index at that point, so we may need to copy the index in this case.)
Comment From: tlmaloney
It looks like this has been an issue at least going back to v0.13.1:
In [1]: import pandas as pd
In [2]: pd.__version__
Out[2]: '0.13.1'
In [3]: df = pd.DataFrame([['b', 'c', 1]], columns=['1st', '2nd', '3rd']).set_index(['1st', '2nd'])
In [4]: df.ix[('a', 'b'), '3rd'] = 0
In [5]: df.index.is_lexsorted()
Out[5]: True
In [6]: df
Out[6]:
3rd
1st 2nd
b c 1
a b 0
In v0.12.0, setting values with .ix[...] = was not working, but when set_value is used to set the new row, sort_index does work as expected.
In [1]: import pandas as pd
In [2]: pd.__version__
Out[2]: '0.12.0'
In [3]: df = pd.DataFrame([['b', 'c', 1]], columns=['1st', '2nd', '3rd']).set_index(['1st', '2nd'])
In [4]: df.ix[('a', 'b'), '3rd'] = 0
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-4-b91d80d789ee> in <module>()
----> 1 df.ix[('a', 'b'), '3rd'] = 0
/home/tmaloney/vedev/pandas-test-02/lib/python2.7/site-packages/pandas-0.12.0-py2.7-linux-x86_64.egg/pandas/core/indexing.pyc in __setitem__(self, key, value)
82 raise IndexingError('only tuples of length <= %d supported',
83 self.ndim)
---> 84 indexer = self._convert_tuple(key)
85 else:
86 indexer = self._convert_to_indexer(key)
/home/tmaloney/vedev/pandas-test-02/lib/python2.7/site-packages/pandas-0.12.0-py2.7-linux-x86_64.egg/pandas/core/indexing.pyc in _convert_tuple(self, key)
94 keyidx = []
95 for i, k in enumerate(key):
---> 96 idx = self._convert_to_indexer(k, axis=i)
97 keyidx.append(idx)
98 return tuple(keyidx)
/home/tmaloney/vedev/pandas-test-02/lib/python2.7/site-packages/pandas-0.12.0-py2.7-linux-x86_64.egg/pandas/core/indexing.pyc in _convert_to_indexer(self, obj, axis)
608 mask = check == -1
609 if mask.any():
--> 610 raise KeyError('%s not in index' % objarr[mask])
611
612 return indexer
KeyError: "['a'] not in index"
In [5]: df.set_value(('a', 'b'), '3rd', 0)
Out[5]:
3rd
b c 1
a b 0
In [6]: df = df.set_value(('a', 'b'), '3rd', 0)
In [7]: df
Out[7]:
3rd
b c 1
a b 0
In [8]: df.sort_index()
Out[8]:
3rd
a b 0
b c 1
Comment From: tlmaloney
Looks like GH4039 discussion is relevant. @jreback Should this issue be tagged as Bug instead of Enhancement?
Comment From: jreback
The main issue I have with this is that you cannot simply re-sort the index after an append. It would be completely non-performant. So you can simply try to invalidate sortorder, which will recompute the sort depth when it is called for.
Comment From: tlmaloney
Agreed. How do you invalidate the sortorder? I don't know what that means.
Comment From: jreback
sorry...misspoke, you need to invalidate the cache on lexsort_depth. You can call mi._reset_cache()
Comment From: tlmaloney
I'm not sure that works.
In [1]: import pandas as pd
In [2]: pd.__version__
Out[2]: '0.15.2'
In [3]: df = pd.DataFrame([['b', 'c', 1]], columns=['1st', '2nd', '3rd']).set_index(['1st', '2nd'])
In [4]: df.ix[('a', 'b'), '3rd'] = 0
In [5]: df.index.is_lexsorted()
Out[5]: True
In [6]: df.index._cache
Out[6]:
{'_engine': <pandas.index.ObjectEngine at 0x351c3f8>,
'is_unique': True,
'lexsort_depth': 2}
In [7]: df.index._res
df.index._reset_cache df.index._reset_identity
In [7]: df.index._reset_cache()
In [8]: df.index.is_lexsorted()
Out[8]: True
In [9]: df
Out[9]:
3rd
1st 2nd
b c 1
a b 0
In [10]: df.index._cache
Out[10]: {'lexsort_depth': 2}
In [11]: df.index._reset_cache()
In [12]: df.index._cache
Out[12]: {}
In [13]: df
Out[13]:
3rd
1st 2nd
b c 1
a b 0
In [14]: df.sort_index()
Out[14]:
3rd
1st 2nd
b c 1
a b 0
Comment From: behzadnouri
so this does not depend on cache, but also occurs on a freshly generated index:
In [24]: mi = MultiIndex(levels=[['b', 'a'], ['b', 'a']],
....: labels=[[0, 1], [0, 1]])
In [25]: df = DataFrame([[0], [1]], index=mi)
In [26]: df
Out[26]:
0
b b 0
a a 1
In [27]: df.sort_index()
Out[27]:
0
b b 0
a a 1
lexsort_depth is based only on labels (not levels), so here the labels are in fact lexically sorted:
In [29]: df.index.labels
Out[29]: FrozenList([[0, 1], [0, 1]])
In [30]: df.index.lexsort_depth
Out[30]: 2
It is worth inspecting whether anything else in the code base breaks if the labels are lexically sorted but the levels are not (i.e. whether it makes the assumption that labels are always sorted).
sort_index also only looks at the labels; and since they are lexically sorted it stops there.
If other places in the code also make the assumption that the labels are sorted, then this should be fixed in MultiIndex.__new__; otherwise sort_index should also check levels in addition to labels.
Computationally, the former path is not very cheap, so it is worth confirming first that other places in the code depend on levels being sorted and break otherwise.
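The distinction can be sketched without pandas. Assuming a MultiIndex is stored as levels plus integer labels, as in the example above, the following compares "are the label codes sorted?" with "are the actual values sorted?". Note that is_sorted is a hypothetical helper written for this illustration, not a pandas function:

```python
levels = [["b", "a"], ["b", "a"]]   # level values, NOT in sorted order
labels = [[0, 1], [0, 1]]           # integer codes pointing into each level

def is_sorted(seq):
    """Return True if seq is in non-decreasing order."""
    return all(x <= y for x, y in zip(seq, seq[1:]))

# One tuple of codes per row, and the values those codes refer to.
code_rows = list(zip(*labels))
value_rows = [tuple(level[c] for level, c in zip(levels, row)) for row in code_rows]

print(is_sorted(code_rows))   # True: the codes are lexically sorted
print(is_sorted(value_rows))  # False: ('b', 'b') comes before ('a', 'a')
```

A check that only looks at the codes therefore reports "sorted" for this index even though its values are not in lexical order.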
Comment From: tlmaloney
@behzadnouri Thanks for looking more into this.
@jreback
I have a couple of comments. (FWIW, I've been using pandas since 0.9.1 but haven't had a need to dig in until now since it has really just worked. I hope to one day make a contribution myself. My mental model may be out of date with the fast pace of development.)
1. This issue might have raised a couple of distinct issues, which may be worth breaking apart into separate issues.
2. I'm trying to get the correct mental model of the MultiIndex. From here on out I'm thinking in terms of a MultiIndex in the df.index sense and not the df.columns sense. Essentially a MultiIndex is made up of labels (the rows, in my mind). A MultiIndex has a number of levels (the columns, in my mind). So when I see something like this I get confused:
In [5]: mi = pd.MultiIndex(levels=[['a', 'b'], ['c', 'd'], ['e', 'f']], labels=[[0, 1], [0, 1], [0, 1]])
In [6]: mi
Out[6]:
MultiIndex(levels=[[u'a', u'b'], [u'c', u'd'], [u'e', u'f']],
labels=[[0, 1], [0, 1], [0, 1]])
In [7]: print mi
a c e
b d f
Because I think of ('a', 'c', 'e') and ('b', 'd', 'f') as the labels. I understand the above __repr__ (line 6) is an internal representation. Am I wrong to think there are two labels, and not three? Perhaps this is a naming bug on MultiIndex.labels.
Comment From: jreback
@tlmaloney If you'd like to create a separate issue for a distinct issue/bug, please do so, keeping in mind that each should have a reproducible example. You can always xref back to here if needed. Generally having 1 issue per 'thing' is a good idea.
Using another example with different label lengths to clarify your mental model:
In [16]: mi = pd.MultiIndex(levels=[['a', 'b'], ['c', 'd'], ['e', 'f']], labels=[[0, 1], [0, 1], [0, 1]])
In [18]: mi.values
Out[18]: array([('a', 'c', 'e'), ('b', 'd', 'f')], dtype=object)
In [20]: mi2 = pd.MultiIndex(levels=[['a', 'b'], ['c', 'd'], ['e', 'f']], labels=[[0, 1, 1], [0, 1, 1], [0, 0, 1]])
In [22]: mi2.values
Out[22]: array([('a', 'c', 'e'), ('b', 'd', 'e'), ('b', 'd', 'f')], dtype=object)
So you see that the length of each labels array defines how many entries the index has, while the number of labels/levels arrays is the number of levels in the MI. The labels are an indexer INTO the levels array. This is conceptually what a Categorical is (and it is actually implemented like this, kind of an unrolled Categorical).
Note that this has nothing to do with df.index/df.columns; they BOTH could be MultiIndexes or Index. The result of mi2.values are the column labels (or index labels); each TUPLE is a label (though it's shown in a sparsified way and not as a tuple).
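The "indexer INTO the levels array" idea can be sketched in plain Python, using the levels and labels of mi2 from above. This is only an illustration of the lookup, not how pandas computes .values internally:

```python
levels = [["a", "b"], ["c", "d"], ["e", "f"]]
labels = [[0, 1, 1], [0, 1, 1], [0, 0, 1]]

# Row i of the index is built by looking up labels[k][i] in levels[k] for each level k.
values = [
    tuple(level[code] for level, code in zip(levels, row))
    for row in zip(*labels)
]
print(values)
# [('a', 'c', 'e'), ('b', 'd', 'e'), ('b', 'd', 'f')]
```

There are three labels arrays because there are three levels, and each array has three entries because the index has three rows.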
Comment From: tlmaloney
@jreback That's really interesting, I now understand what's going on a lot better, thanks. There is some cognitive dissonance in me with these two definitions:
1. Index.labels as the indexer into the levels array
2. label as the key value in an Index
Do you also see how it could be a bit confusing? I think there is a naming bug, but since naming is hard and is unrelated to the original issue, if you agree I can create a separate issue and xref this one.
Comment From: jreback
.labels only exists on MultiIndex. I agree these are somewhat confusing, but these are internal references (e.g. .levels/.labels).
We did change these for Categorical, e.g. .categories and .codes are the .levels and .labels. However, for back-compat I don't think we are likely to change them for MultiIndex (and even if we did, I think .levels is very natural, so I suppose labels->codes IS possible).
These originally came from R, fyi.
Comment From: jreback
you can see somewhat of a related discussion in #3268
Comment From: tlmaloney
The last comment by @behzadnouri gets at the heart of the problem, I think. I'm not sure why the sorting methods look at .labels instead of .levels. Referencing .labels also seems to be an issue with sortlevel:
In [1]: import pandas as pd
In [2]: pd.__version__
Out[2]: '0.15.2-103-gfda5012'
In [3]: tuples = [('a', 'x'), ('c', 'x')]
In [4]: idx = pd.MultiIndex.from_tuples(tuples)
In [5]: df = pd.DataFrame(index=idx, data={'baz': [0, 1]})
In [6]: df
Out[6]:
baz
a x 0
c x 1
In [7]: df.index
Out[7]:
MultiIndex(levels=[[u'a', u'c'], [u'x']],
labels=[[0, 1], [0, 0]])
In [8]: df.ix[('b', 'x'), 'baz'] = 2
In [9]: df
Out[9]:
baz
a x 0
c x 1
b x 2
In [10]: df.index
Out[10]:
MultiIndex(levels=[[u'a', u'c', u'b'], [u'x']],
labels=[[0, 1, 2], [0, 0, 0]])
In [11]: df.sort_index()
Out[11]:
baz
a x 0
c x 1
b x 2
In [12]: df.ix[('a', 'y'), 'baz'] = 3
In [13]: df.index
Out[13]:
MultiIndex(levels=[[u'a', u'c', u'b'], [u'x', u'y']],
labels=[[0, 1, 2, 0], [0, 0, 0, 1]])
In [14]: df
Out[14]:
baz
a x 0
c x 1
b x 2
a y 3
In [15]: df.sort_index()
Out[15]:
baz
a x 0
y 3
b x 2
c x 1
In [16]: df
Out[16]:
baz
a x 0
c x 1
b x 2
a y 3
In [17]: df.sortlevel(0)
Out[17]:
baz
a x 0
y 3
c x 1
b x 2
Comment From: tlmaloney
Similar discussion in #8017.
Comment From: tlmaloney
@jreback @behzadnouri
The below highlights different behavior between the MultiIndex append method and setting with enlargement on a DataFrame with .ix[...] =. Compare idx3 and df.index after line 13. Do you think this is getting closer to the problem?
In [1]: import pandas as pd
In [2]: pd.__version__
Out[2]: '0.15.2-103-gfda5012'
In [3]: tuples1 = [('a', 'x'), ('c', 'x')]
In [4]: tuples2 = [('b', 'x')]
In [5]: idx1 = pd.MultiIndex.from_tuples(tuples1)
In [6]: idx2 = pd.MultiIndex.from_tuples(tuples2)
In [7]: idx3 = idx1.append(idx2)
In [8]: idx1
Out[8]:
MultiIndex(levels=[[u'a', u'c'], [u'x']],
labels=[[0, 1], [0, 0]])
In [9]: idx2
Out[9]:
MultiIndex(levels=[[u'b'], [u'x']],
labels=[[0], [0]])
In [10]: idx3
Out[10]:
MultiIndex(levels=[[u'a', u'b', u'c'], [u'x']],
labels=[[0, 2, 1], [0, 0, 0]])
In [11]: df = pd.DataFrame(index=idx1, data={'baz': [0, 1]})
In [12]: df
Out[12]:
baz
a x 0
c x 1
In [13]: df.ix[('b', 'x'), 'baz'] = 2
In [14]: df
Out[14]:
baz
a x 0
c x 1
b x 2
In [15]: df.index
Out[15]:
MultiIndex(levels=[[u'a', u'c', u'b'], [u'x']],
labels=[[0, 1, 2], [0, 0, 0]])
In [16]: idx3.values
Out[16]: array([('a', 'x'), ('c', 'x'), ('b', 'x')], dtype=object)
In [17]: idx3.sortlevel(0)
Out[17]:
(MultiIndex(levels=[[u'a', u'b', u'c'], [u'x']],
labels=[[0, 1, 2], [0, 0, 0]]), array([0, 2, 1]))
In [18]: df.index.sortlevel(0)
Out[18]:
(MultiIndex(levels=[[u'a', u'c', u'b'], [u'x']],
labels=[[0, 1, 2], [0, 0, 0]]), array([0, 1, 2]))
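The difference between the two indexes can be sketched in plain Python. Assuming the same levels/labels layout, the hypothetical helper sort_level_values below re-sorts one level's values and remaps its codes, turning the layout produced by setting with enlargement into the one produced by append. It is only an illustration, not what pandas does internally:

```python
def sort_level_values(level, codes):
    """Sort a level's values and remap codes so each row keeps its value."""
    sorted_level = sorted(level)
    position = {value: i for i, value in enumerate(sorted_level)}
    new_codes = [position[level[c]] for c in codes]
    return sorted_level, new_codes

# First level of df.index after df.ix[('b', 'x'), 'baz'] = 2
level, codes = ["a", "c", "b"], [0, 1, 2]

new_level, new_codes = sort_level_values(level, codes)
print(new_level)  # ['a', 'b', 'c']
print(new_codes)  # [0, 2, 1], the same layout as idx3

# With sorted level values, ordering rows by code is ordering by value.
row_order = sorted(range(len(new_codes)), key=new_codes.__getitem__)
print(row_order)  # [0, 2, 1]
```

The resulting row order matches the indexer that idx3.sortlevel(0) returns above, whereas the codes of df.index are already in increasing order, so a sort that only looks at them leaves the rows where they are.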
Comment From: tlmaloney
Quick question: what does "associated factor" mean here?
Comment From: 8one6
What's the status on this one now? Is anyone actively working on a fix of the underlying issue? And, separately, is there any workaround for when "you just need to get the darn dataframe sorted" while we wait for/work on a more permanent fix for sorting?
Comment From: jreback
@8one6 well, it's an open issue with no PR, so no one is working on it. You are welcome to. .sortlevel() works just fine.
Comment From: 8one6
Got it. So in a pinch, I can just sortlevel on each level (starting from innermost) and that'll get the job done?
Comment From: jreback
no, just sortlevel; you don't need to do anything special. This is a quite degenerate case.
Comment From: jreback
duplicate of #13431