Pandas column multiindex and reindex inconsistency

In [3]: df1 = pd.DataFrame(np.random.randint(0,10,(4,3)), columns=['a','b','c'])

In [4]: df1
Out[4]: 
   a  b  c
0  9  3  2
1  9  0  2
2  7  7  6
3  3  4  2

In [5]: df2 = df1.reindex(columns=['c','a','d'])

In [6]: df2
Out[6]: 
   c  a   d
0  2  9 NaN
1  2  9 NaN
2  6  7 NaN
3  2  3 NaN

In [7]: df2.columns
Out[7]: Index([u'c', u'a', u'd'], dtype='object')

In [8]: ci = pd.MultiIndex.from_product([['x','y'],['a','b','c']])

In [10]: df3 = pd.DataFrame(np.random.randint(0,10,(4,6)), columns=ci)

In [11]: df3
Out[11]: 
   x        y      
   a  b  c  a  b  c
0  3  1  5  3  8  7
1  2  7  6  8  7  4
2  8  3  5  7  1  1
3  7  5  8  7  8  7

In [12]: df4 = df3.reindex(columns=['c','a','d'], level=1)

In [13]: df4
Out[13]: 
   x     y   
   c  a  c  a
0  5  3  7  3
1  6  2  4  8
2  5  8  1  7
3  8  7  7  7

In [14]: df4.columns
Out[14]: 
MultiIndex(levels=[[u'x', u'y'], [u'c', u'a', u'd']],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

When passing a nonexistent column name to reindex on a dataframe without multiindex columns, the result is:
- a NaN column with the "new" column name
- the columns attribute matches the columns in the dataframe

The same action on a multiindex dataframe produces different results:
- there are no NaN columns (this may not be a problem)
- the columns attribute of the resulting dataframe does not match the dataframe column names (this appears to be a bug)

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Darwin
OS-release: 16.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None
pandas: 0.19.1
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.1
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.4.8
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.7
blosc: None
bottleneck: 1.1.0
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.4.0
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.4
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: 2.43.0
pandas_datareader: None

Comment From: TomAugspurger

the columns attribute of the resulting dataframe does not match the dataframe column names (this appears to be a bug)

If you look at df4.columns, you'll notice that 'd' is present:

In [65]: df4.columns.levels[1]
Out[65]: Index(['c', 'a', 'd'], dtype='object')

Going back to your first point, that there are no NaN columns in the output: that's because no labels point at 'd' on that level (notice that there aren't any 2s in the labels of your Out[14]). If you want behavior similar to the flat-index case, you'd have to reindex on both levels:

In [77]: df3.reindex(columns=[('x', 'a'), ('x', 'd')])
Out[77]:
   x
   a   d
0  2 NaN
1  9 NaN
2  8 NaN
3  7 NaN

In [78]: df3.reindex(columns=[('x', 'a'), ('x', 'd')]).columns
Out[78]:
MultiIndex(levels=[['x'], ['a', 'd']],
           labels=[[0, 0], [0, 1]])

Make sense?
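
To make the levels-versus-labels distinction concrete, here is a minimal sketch. It rebuilds the column index from Out[14] by hand instead of relying on the exact output of reindex(..., level=1), and it assumes pandas 0.24 or later, where the `labels` attribute shown above is named `codes`. The data in `df3` is a deterministic stand-in for the random integers in the report, and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Rebuild the column index from Out[14] by hand. Newer pandas calls the
# integer positions "codes"; the 0.19 output above calls them "labels".
cols = pd.MultiIndex(
    levels=[["x", "y"], ["c", "a", "d"]],
    codes=[[0, 0, 1, 1], [0, 1, 0, 1]],
)

# Every value the second level knows about, used or not.
level1_defined = list(cols.levels[1])
# One value per actual column.
level1_used = list(cols.get_level_values(1))

# 'd' sits at position 2 in the level, but no code refers to position 2.
d_position = level1_defined.index("d")
d_is_referenced = bool((cols.codes[1] == d_position).any())

# Reindexing with full tuples asks for the ("x", "d") column explicitly,
# so the result contains that column, filled with NaN.
ci = pd.MultiIndex.from_product([["x", "y"], ["a", "b", "c"]])
df3 = pd.DataFrame(np.arange(24).reshape(4, 6), columns=ci)
wanted = pd.MultiIndex.from_tuples([("x", "a"), ("x", "d")])
both_levels = df3.reindex(columns=wanted)

print(level1_defined, level1_used, d_is_referenced)
print(both_levels)
```

The level acts as a lookup table of possible values, while the codes decide which of those values appear as columns, which is why 'd' can be defined without being displayed.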

Comment From: jreback

Duplicate of https://github.com/pandas-dev/pandas/issues/12319.

ATM this is a deliberate choice. Feel free to comment on the other issue.

Comment From: dfolch

The NaN column issue is tricky for sure. Not everyone wants the same lower level columns under every higher level column.

The columns attribute still seems a little weird, since it can contain column names that are not in the dataframe. In my use case I did a lot of manipulations on a dataframe and then wanted to see which column names remained. Below is my solution:

In [62]: ci2 = pd.MultiIndex.from_tuples([('x','a'),('y','a'),('x','b'),('y','b'),('x','c')])

In [63]: df5 = pd.DataFrame(np.random.randint(0,10,(4,5)), columns=ci2)  

In [64]: df6 = df5.drop(('x','c'), axis=1)

In [65]: df6
Out[65]: 
   x  y  x  y
   a  a  b  b
0  0  0  8  4
1  7  2  0  4
2  4  2  2  8
3  6  1  7  5

In [66]: df6.columns
Out[66]: 
MultiIndex(levels=[[u'x', u'y'], [u'a', u'b', u'c']],
           labels=[[0, 1, 0, 1], [0, 0, 1, 1]])

In [67]: df6.columns.levels[1][pd.unique(df6.columns.labels[1])]
Out[67]: Index([u'a', u'b'], dtype='object')
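
For readers on newer pandas: `MultiIndex.labels` was renamed to `MultiIndex.codes` in pandas 0.24, so In [67] needs a small change there. The sketch below repeats the same workaround under that assumption and adds an alternative based on `get_level_values`. The deterministic data and the variable names are illustrative, not part of the original session.

```python
import numpy as np
import pandas as pd

ci2 = pd.MultiIndex.from_tuples(
    [("x", "a"), ("y", "a"), ("x", "b"), ("y", "b"), ("x", "c")]
)
df5 = pd.DataFrame(np.arange(20).reshape(4, 5), columns=ci2)
df6 = df5.drop(("x", "c"), axis=1)

# Same idea as In [67], with .codes in place of .labels: look up only the
# level positions that some column actually refers to.
used_via_codes = df6.columns.levels[1][pd.unique(df6.columns.codes[1])]

# Alternative that skips the integer positions: materialise the level
# value of every column, then de-duplicate.
used_via_values = df6.columns.get_level_values(1).unique()

print(list(used_via_codes))
print(list(used_via_values))
```

Both expressions give the second-level names that still appear in the dataframe, here 'a' and 'b'.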

Comment From: jreback

This is defined behavior for how a MultiIndex works; see http://pandas.pydata.org/pandas-docs/stable/advanced.html#defined-levels. If you want the levels to reflect only what is used, look at remove_unused_levels.

Comment From: dfolch

I see now that there was a debate on this in #2770. remove_unused_levels is exactly what I needed. Thanks for implementing that @jreback.
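
A minimal sketch of that approach, assuming pandas 0.20 or later (where `MultiIndex.remove_unused_levels` is available) and reusing the df5/df6 setup from the earlier comment with deterministic data in place of the random integers:

```python
import numpy as np
import pandas as pd

ci2 = pd.MultiIndex.from_tuples(
    [("x", "a"), ("y", "a"), ("x", "b"), ("y", "b"), ("x", "c")]
)
df5 = pd.DataFrame(np.arange(20).reshape(4, 5), columns=ci2)
df6 = df5.drop(("x", "c"), axis=1)

# remove_unused_levels returns a new MultiIndex whose levels contain only
# the values that some column refers to; the columns themselves are unchanged.
columns_before = list(df6.columns)
df6.columns = df6.columns.remove_unused_levels()
levels_after = list(df6.columns.levels[1])

print(levels_after)
```

The call is explicit rather than automatic, so it has to be repeated after each operation that may leave unused values behind.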

Comment From: jreback

@dfolch great!

Yeah, it's not automatic, but at least it's possible now (in an easy way).