Code Sample, a copy-pastable example if possible
import pandas as pd

d = pd.DataFrame(
    [
        [1, 2, 3],
        [1, 2, 4],
        [2, 3, 5],
        [1, None, 5],
    ],
    columns=['i', 'j', 'k']
)
print(d.dtypes)
d = d.set_index(['i', 'j'])['k']
#d = d.sort_index(level=[0,1])
d = d.sort_index()
print(d.loc[(2, 3)])  # PerformanceWarning even though sorted
Problem description
This behavior is very confusing. The docs ask me to sort MultiIndexes to avoid the PerformanceWarning, but if I do, I still get the PerformanceWarning. If I print the dataframe, I see that the data was indeed sorted.
This is resolved by explicitly specifying the levels to sort: d.sort_index(level=[0,1]).
As a user, I'm confused. Isn't d.sort_index() the same thing as d.sort_index(level=...all levels...)? Inspecting my dataframe, that appeared to be the case (the index was sorted and nested into a hierarchy).
Shouldn't d.sort_index() simply be the same as d.sort_index(level=...all levels...)? I'm surprised there's any behavioral difference.
It seems to be due to:
d = d.sort_index()
print(d.index.is_lexsorted())  # False
d = d.sort_index(level=[0,1])
print(d.index.is_lexsorted())  # True
Even then, the two results only differ if the index contains nulls.
Why there's a difference doesn't seem to be documented.
Expected Output
Either: d.sort_index() === d.sort_index(level=...all levels...),
or the DataFrame.sort_index documentation should explain when fully specifying the level parameter differs from not specifying level.
I'd prefer: d.sort_index() === d.sort_index(level=d.index.names), because:
- it would allow functions that work on dataframes with vanilla indexes and call sort_index() to also work optimally when given dataframes with MultiIndexes.
- it's cumbersome and error-prone to have to remember to specify the level names, since this is only needed in the MultiIndex case, and then only when the index contains NULLs.
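Until one of the above happens, the workaround described earlier (passing every level explicitly) could be wrapped in a small helper. This is only a sketch: sort_index_all_levels is a hypothetical name, not a pandas function, and whether it avoids the PerformanceWarning may depend on the pandas version.

```python
import pandas as pd


def sort_index_all_levels(obj):
    """Hypothetical helper: sort by every index level explicitly.

    For a MultiIndex this passes all level positions to sort_index();
    for a plain index it falls back to the default sort_index().
    """
    if isinstance(obj.index, pd.MultiIndex):
        return obj.sort_index(level=list(range(obj.index.nlevels)))
    return obj.sort_index()


# Plain index: behaves like an ordinary sort_index().
plain = sort_index_all_levels(pd.Series([1, 2, 3], index=[3, 1, 2]))

# MultiIndex with a missing value in level 'j', as in the report.
d = pd.DataFrame(
    [[1, 2, 3], [1, 2, 4], [2, 3, 5], [1, None, 5]],
    columns=['i', 'j', 'k'],
)
multi = sort_index_all_levels(d.set_index(['i', 'j'])['k'])
```

Callers can then use the same function for both kinds of index without remembering the level names.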
Output of pd.show_versions()
Comment From: chris-b1
Thanks for the example / report, this is a duplicate of #17931
There is a fundamental problem of a leaky internal detail: the default sort puts missing data in the last position, but the integer coding backing a MultiIndex uses -1 for missing values, so to be properly lexsorted, NA must be in the first position.
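To make the -1 coding concrete, here is a rough sketch. It assumes a pandas version where MultiIndex exposes the codes attribute (older releases called it labels). codes_are_lexsorted is a hypothetical helper written for illustration, not the check pandas performs internally, and the two lists of code rows are written out by hand rather than taken from pandas.

```python
import pandas as pd

# Same index as in the report: level 'j' contains one missing value.
mi = pd.MultiIndex.from_arrays(
    [[1, 1, 2, 1], [2.0, 2.0, 3.0, float("nan")]], names=["i", "j"]
)
# Integer codes backing level 'j'; the missing value is coded as -1.
j_codes = mi.codes[1].tolist()


def codes_are_lexsorted(code_rows):
    """Hypothetical helper: True if the rows of codes are non-decreasing."""
    return all(a <= b for a, b in zip(code_rows, code_rows[1:]))


# (i, j) codes for the four rows, written out by hand for illustration.
na_last = [(0, 0), (0, 0), (0, -1), (1, 1)]   # missing value placed last within i=1
na_first = [(0, -1), (0, 0), (0, 0), (1, 1)]  # missing value placed first within i=1

print(j_codes)
print(codes_are_lexsorted(na_last), codes_are_lexsorted(na_first))
```

With the missing value last, the codes within i=1 run 0, 0, -1, which is not non-decreasing. That is why an index that looks sorted when printed can still fail a lexsort check on its codes.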