Pandas df.sort_index() leaves us with df.index.is_lexsorted()==False

Code Sample, a copy-pastable example if possible

d=pd.DataFrame(
    [
        [1,2,3],
        [1,2,4],
        [2,3,5],
        [1,None,5],
    ],
    columns=['i','j','k']
)
print d.dtypes
d = d.set_index(['i','j'])['k']
#d = d.sort_index(level=[0,1])
d = d.sort_index()
print d.loc[(2,3)]   # PerformanceWarning even though sorted

Problem description

This behavior his is very confusing. The docs ask me to sort multiindexes to avoid the performance warning,

If I do it, I still get the PerformanceWarning. If I print the dataframe, I see that data was indeed sorted.

This is resolved by explicitly specifying the levels to sort: d.sort_index(level=[0,1]).

As a user, I'm confused. Wasn't d.sort_index() the same thing as d.sort_index(level=...all levels...)? Inspecting my dataframe, it appeared to be the, case (the index was sorted, and nested into a hierarchy).

Shouldn't d.sort_index() simply be the same as d.sort_index(level=...all levels...)

I'm surprised there's any behavioral difference to be?

It seems due to:

d = d.sort_index()
print d.index.is_lexsorted()  # False

d = d.sort_index(level=[0,1])
print d.index.is_lexsorted()  # True

and then, they are only different if the index contains nulls.

Why there's a difference doesn't seem to be documented.

Expected Output

Either: it should be that d.sort_index() === d.sort_index(level=...all levels...) or, dataframe.sort_index documentation could stand improvement and explain when the fully specified level parameter is different to not specifying level.

I'd prefer: d.sort_index() === d.sort_index(level=d.index.names), because: - this would allow functions function working on dataframes with vanilla indexes and calling sort_index(), also work optimally when provided dataframes with multindexes. - its cumbersome and error-prone to need to remember to specify the level names, since this is only needed in the multi-index case, and then also only in the case where the index contains NULLs.

Output of `pd.show_versions()`

INSTALLED VERSIONS ------------------ commit: None python: 2.7.6.final.0 python-bits: 64 OS: Linux OS-release: 3.13.0-24-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: None.None pandas: 0.22.0 pytest: None pip: 9.0.1 setuptools: 36.0.1 Cython: 0.27.3 numpy: 1.14.0 scipy: 1.0.0 pyarrow: None xarray: None IPython: 5.3.0 sphinx: None patsy: 0.2.1 dateutil: 2.6.1 pytz: 2017.2 blosc: None bottleneck: None tables: 3.1.1 numexpr: 2.6.2 feather: None matplotlib: 2.0.2 openpyxl: 1.7.0 xlrd: 0.9.2 xlwt: 0.7.5 xlsxwriter: None lxml: None bs4: None html5lib: 0.999999999 sqlalchemy: 1.0.15 pymysql: None psycopg2: 2.7.3.2 (dt dec pq3 ext lo64) jinja2: 2.9.6 s3fs: None fastparquet: None pandas_gbq: None pandas_datareader: None

Comment From: chris-b1

Thanks for the example / report, this is a duplicate of #17931

There is a fundamental problem of a leaky internal detail - default sort puts missing data in the last position, but the integer coding backing a MultiIndex uses -1, so to be properly lexsorted, NA must be in the first position.

Pandas df.sort_index() leaves us with df.index.is_lexsorted()==False

Code Sample, a copy-pastable example if possible

Problem description

Expected Output

Output of pd.show_versions()

Output of `pd.show_versions()`