Code Sample
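import numpy as np
import pandas as pd

mi = pd.MultiIndex.from_tuples([('A0', 'B0'), ('A0', 'B1'),
                                ('A1', 'B0'), ('A1', 'B1'), ('A3', np.nan)],
                               names=['ia', 'ib'])
df = pd.DataFrame(np.arange(10).reshape(5, 2), index=mi, columns=['bar', 'foo'])
df2 = df.copy()
# Swap the values of the second level; the integer codes stay the same.
# (Originally written as set_levels(..., inplace=True); assigning the result also works.)
df2.index = df2.index.set_levels(['B1', 'B0'], level=1)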
df2
        bar  foo
ia ib
A0 B1     0    1
   B0     2    3
A1 B1     4    5
   B0     6    7
A3 NaN    8    9
# sort_index on df2 replaces the NaN in the (A3, NaN) row with B0; sort_index on df does not
df2.sort_index()
        bar  foo
ia ib
A0 B0     2    3
   B1     0    1
A1 B0     6    7
   B1     4    5
A3 B0     8    9
Problem description
I suspect that the unsorted level in the DataFrame's MultiIndex causes the NaN label to be filled in automatically.
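One way to look into this suspicion is to inspect how the MultiIndex stores its labels. The sketch below rebuilds `df2` from the code sample (using the assignment form of `set_levels`, an assumption made so that it also runs on pandas versions without `inplace`) and prints the values of the second level next to its integer codes. A missing label is stored as code -1 rather than as an entry in the level, and the level itself is not sorted.

```python
import numpy as np
import pandas as pd

mi = pd.MultiIndex.from_tuples(
    [("A0", "B0"), ("A0", "B1"), ("A1", "B0"), ("A1", "B1"), ("A3", np.nan)],
    names=["ia", "ib"],
)
df2 = pd.DataFrame(np.arange(10).reshape(5, 2), index=mi, columns=["bar", "foo"])
# Replace the values of level 'ib' without touching the codes.
df2.index = df2.index.set_levels(["B1", "B0"], level=1)

ib_level = list(df2.index.levels[1])
ib_codes = [int(c) for c in df2.index.codes[1]]
ib_level_sorted = bool(df2.index.levels[1].is_monotonic_increasing)

print(ib_level)         # ['B1', 'B0']
print(ib_codes)         # [0, 1, 0, 1, -1]  (-1 marks the NaN label)
print(ib_level_sorted)  # False
```

This only shows the internal state of the index; it does not by itself establish where in `sort_index` the NaN gets replaced.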
Note: I ended up with a DataFrame that has an unsorted MultiIndex level after a broadcast operation. Here is the code:
def mklbl(prefix, n):
    return ["%s%s" % (prefix, i) for i in range(n)]

miindex = pd.MultiIndex.from_product([mklbl('A', 3),
                                      mklbl('B', 2),
                                      mklbl('C', 2),
                                      mklbl('D', 2)],
                                     names=['ia', 'ib', 'ic', 'id'])
micolumns = ['foo', 'bah']
dfmi = pd.DataFrame(np.arange(len(miindex) * len(micolumns))
                    .reshape((len(miindex), len(micolumns))),
                    index=miindex,
                    columns=micolumns).sort_index().sort_index(axis=1)
dfmi = dfmi.drop('A2')
# The first index level now lists more values than the index actually uses,
# and this may lead to the level changing from ['A0', 'A1', 'A2'] to ['A1', 'A0'].
bs = dfmi.loc[('A0', 'B0')].copy().rename({'D1': 'D2'})
# Broadcasting bs against dfmi produces NaN in some rows.
(dfmi + bs).sort_index()
               bah   foo
ic id ia ib
C0 D0 A0 B0    2.0   0.0
         B1   10.0   8.0
      A1 B0   18.0  16.0
         B1   26.0  24.0
   D1 A0 B0    NaN   NaN
         B1    NaN   NaN
      A1 B0    NaN   NaN
         B1    NaN   NaN
   D2 A0 NaN   NaN   NaN
C1 D0 A0 B0   10.0   8.0
         B1   18.0  16.0
      A1 B0   26.0  24.0
         B1   34.0  32.0
   D1 A0 B0    NaN   NaN
         B1    NaN   NaN
      A1 B0    NaN   NaN
         B1    NaN   NaN
   D2 A0 NaN   NaN   NaN  # again, the (D2, NaN, NaN) row has been changed
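The remark in the code above, that the first level lists more values than the index actually uses after `drop`, can be checked on a smaller frame. The sketch below uses a two-level index (the names `frame` and `dropped` are illustrative, not from the report) and compares the level before and after `MultiIndex.remove_unused_levels()`. Whether pruning the level changes the `sort_index` behaviour described here is not something this snippet demonstrates.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [["A0", "A1", "A2"], ["B0", "B1"]], names=["ia", "ib"]
)
frame = pd.DataFrame({"foo": np.arange(len(index))}, index=index)
dropped = frame.drop("A2")

# The level still lists 'A2' even though no row uses it any more.
levels_after_drop = list(dropped.index.levels[0])
# remove_unused_levels() returns a new index with only the values in use.
levels_after_prune = list(dropped.index.remove_unused_levels().levels[0])

print(levels_after_drop)   # ['A0', 'A1', 'A2']
print(levels_after_prune)  # ['A0', 'A1']
```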
Expected Output
sort_index should not change the contents of the DataFrame.
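This expectation can be written as a small check: sorting may reorder the index keys, but the collection of keys should stay the same. The helpers below (`normalized_keys` and `sort_preserves_labels` are hypothetical names, not pandas API) compare the keys before and after `sort_index`, replacing missing labels with a marker string so that NaN compares equal to itself. This is a minimal sketch that assumes the sorted axis is a MultiIndex.

```python
import numpy as np
import pandas as pd

def normalized_keys(index):
    """Return the index keys in sorted order, with missing labels replaced by a marker."""
    return sorted(
        tuple("<NaN>" if pd.isna(v) else str(v) for v in key) for key in index
    )

def sort_preserves_labels(df, axis=0):
    """True if sort_index only reorders the keys on `axis` and does not alter them."""
    before = normalized_keys(df.axes[axis])
    after = normalized_keys(df.sort_index(axis=axis).axes[axis])
    return before == after

# A frame with an ordinary MultiIndex and no missing labels.
clean_index = pd.MultiIndex.from_tuples(
    [("A1", "B1"), ("A0", "B0"), ("A0", "B1")], names=["ia", "ib"]
)
clean = pd.DataFrame({"bar": [0, 1, 2]}, index=clean_index)
clean_ok = sort_preserves_labels(clean)
print(clean_ok)
```

Calling `sort_preserves_labels(df2)` with the frame from the code sample would be expected to return False on an affected version, because the (A3, NaN) key comes back as (A3, B0).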
Output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.24.1
pytest: 4.1.0
pip: 10.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.16.2
scipy: None
pyarrow: None
xarray: None
IPython: 7.3.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.3
openpyxl: None
xlrd: 1.2.0
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: 4.7.1
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
Comment From: erezinman
Another, perhaps simpler example:
import pandas as pd
import numpy as np
idx = pd.MultiIndex(levels=[['a', 'b', 'c'], [16., 32., 4.]], codes=[[0, 1, 2, 2], [-1, -1, 1, 1]])
df = pd.DataFrame(np.arange(16).reshape(4, 4), columns=idx)
df
#      a    b     c
#    NaN  NaN  32.0  32.0
# 0    0    1     2     3
# 1    4    5     6     7
# 2    8    9    10    11
# 3   12   13    14    15
df.sort_index(axis=1)
#      a    b     c
#    4.0  4.0  32.0  32.0
# 0    0    1     2     3
# 1    4    5     6     7
# 2    8    9    10    11
# 3   12   13    14    15
The problem still persists three years later.