In [3]: df1 = pd.DataFrame(np.random.randint(0,10,(4,3)), columns=['a','b','c'])
In [4]: df1
Out[4]:
   a  b  c
0  9  3  2
1  9  0  2
2  7  7  6
3  3  4  2
In [5]: df2 = df1.reindex(columns=['c','a','d'])
In [6]: df2
Out[6]:
   c  a   d
0  2  9 NaN
1  2  9 NaN
2  6  7 NaN
3  2  3 NaN
In [7]: df2.columns
Out[7]: Index([u'c', u'a', u'd'], dtype='object')
In [8]: ci = pd.MultiIndex.from_product([['x','y'],['a','b','c']])
In [10]: df3 = pd.DataFrame(np.random.randint(0,10,(4,6)), columns=ci)
In [11]: df3
Out[11]:
   x        y
   a  b  c  a  b  c
0  3  1  5  3  8  7
1  2  7  6  8  7  4
2  8  3  5  7  1  1
3  7  5  8  7  8  7
In [12]: df4 = df3.reindex(columns=['c','a','d'], level=1)
In [13]: df4
Out[13]:
   x     y
   c  a  c  a
0  5  3  7  3
1  6  2  4  8
2  5  8  1  7
3  8  7  7  7
In [14]: df4.columns
Out[14]:
MultiIndex(levels=[[u'x', u'y'], [u'c', u'a', u'd']],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
Passing a nonexistent column name to reindex on a dataframe without MultiIndex columns gives:
- a NaN column with the "new" column name
- a columns attribute that matches the columns in the dataframe
The same call on a dataframe with MultiIndex columns produces different results:
- there are no NaN columns (this may not be a problem)
- the columns attribute of the resulting dataframe does not match the dataframe column names (this appears to be a bug)
Comment From: TomAugspurger
the columns attribute of the resulting dataframe does not match the dataframe column names (this appears to be a bug)
If you look at df4.columns, you'll notice that 'd' is present:
In [65]: df4.columns.levels[1]
Out[65]: Index(['c', 'a', 'd'], dtype='object')
Going back to your first point, that there are no NaN columns in the output: that's because no column uses that label in that level (notice that there aren't any 2s in the labels of your Out[14]). If you want similar behavior, you'd have to reindex on both levels:
In [77]: df3.reindex(columns=[('x', 'a'), ('x', 'd')])
Out[77]:
   x
   a   d
0  2 NaN
1  9 NaN
2  8 NaN
3  7 NaN
In [78]: df3.reindex(columns=[('x', 'a'), ('x', 'd')]).columns
Out[78]:
MultiIndex(levels=[['x'], ['a', 'd']],
           labels=[[0, 0], [0, 1]])
Make sense?
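To make the levels/labels distinction concrete, here is a minimal sketch that rebuilds the column index from Out[14] by hand. It assumes a pandas version where the `labels` attribute has been renamed to `codes` (0.24 or later), and it uses `np.arange` data instead of the random values from the report, so the numbers will not match the outputs above.

```python
import numpy as np
import pandas as pd

# Rebuild the column index shown in Out[14] by hand.
# Assumption: pandas >= 0.24, where `labels` is spelled `codes`.
cols = pd.MultiIndex(
    levels=[["x", "y"], ["c", "a", "d"]],
    codes=[[0, 0, 1, 1], [0, 1, 0, 1]],
)

# The level stores every label it knows about, including the unused "d".
level_labels = list(cols.levels[1])

# The codes say which labels the columns actually use; 2 ("d") never appears.
used_codes = sorted(set(cols.codes[1].tolist()))

# get_level_values resolves codes to labels, so it only reports labels in use.
labels_in_use = list(cols.get_level_values(1))

# Reindexing with full tuples asks for specific columns, so a column that
# does not exist yet is materialised as an all-NaN column.
ci = pd.MultiIndex.from_product([["x", "y"], ["a", "b", "c"]])
df3 = pd.DataFrame(np.arange(24).reshape(4, 6), columns=ci)
with_nan = df3.reindex(columns=[("x", "a"), ("x", "d")])
```

Here `level_labels` contains 'd' while `labels_in_use` does not, which is the mismatch described in the original report.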
Comment From: jreback
Duplicate of https://github.com/pandas-dev/pandas/issues/12319.
At the moment this is a deliberate choice. Feel free to comment on the other issue.
Comment From: dfolch
The NaN column issue is tricky for sure. Not everyone wants the same lower level columns under every higher level column.
The columns attribute still seems a little weird, since it can contain column names that are not in the dataframe. In my use case I did a lot of manipulations to a dataframe and then wanted to see which column names remained. Below is my solution:
In [62]: ci2 = pd.MultiIndex.from_tuples([('x','a'),('y','a'),('x','b'),('y','b'),('x','c')])
In [63]: df5 = pd.DataFrame(np.random.randint(0,10,(4,5)), columns=ci2)
In [64]: df6 = df5.drop(('x','c'), axis=1)
In [65]: df6
Out[65]:
   x  y  x  y
   a  a  b  b
0  0  0  8  4
1  7  2  0  4
2  4  2  2  8
3  6  1  7  5
In [66]: df6.columns
Out[66]:
MultiIndex(levels=[[u'x', u'y'], [u'a', u'b', u'c']],
           labels=[[0, 1, 0, 1], [0, 0, 1, 1]])
In [67]: df6.columns.levels[1][pd.unique(df6.columns.labels[1])]
Out[67]: Index([u'a', u'b'], dtype='object')
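For readers on newer pandas, the lookup in In [67] can be sketched as follows. This assumes pandas 0.24 or later, where `labels` is spelled `codes`; the `np.arange` data is a stand-in for the random values used above, and `get_level_values` is shown as one possible shorter spelling of the same lookup.

```python
import numpy as np
import pandas as pd

ci2 = pd.MultiIndex.from_tuples(
    [("x", "a"), ("y", "a"), ("x", "b"), ("y", "b"), ("x", "c")]
)
df5 = pd.DataFrame(np.arange(20).reshape(4, 5), columns=ci2)
df6 = df5.drop(("x", "c"), axis=1)

cols = df6.columns

# Same idea as In [67], spelled with `codes` (assumed: pandas >= 0.24).
via_codes = list(cols.levels[1][pd.unique(cols.codes[1])])

# get_level_values does the code-to-label lookup itself, so the labels
# still in use can also be read off directly.
via_values = list(cols.get_level_values(1).unique())
```

Both `via_codes` and `via_values` hold only 'a' and 'b', even though `cols.levels[1]` still lists 'c'.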
Comment From: jreback
This is defined behavior for how a MultiIndex works. See http://pandas.pydata.org/pandas-docs/stable/advanced.html#defined-levels, which covers remove_unused_levels.
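A minimal sketch of that approach, applied to the df6 example above. The data is an `np.arange` stand-in rather than the random values from the thread, and `remove_unused_levels` returns a new MultiIndex rather than modifying the existing one, so the result has to be assigned back.

```python
import numpy as np
import pandas as pd

ci2 = pd.MultiIndex.from_tuples(
    [("x", "a"), ("y", "a"), ("x", "b"), ("y", "b"), ("x", "c")]
)
df5 = pd.DataFrame(np.arange(20).reshape(4, 5), columns=ci2)
df6 = df5.drop(("x", "c"), axis=1)

# After the drop, the level still remembers "c".
before = list(df6.columns.levels[1])

# remove_unused_levels returns a new MultiIndex; assign it back to apply it.
df6.columns = df6.columns.remove_unused_levels()
after = list(df6.columns.levels[1])
```

After the reassignment, `df6.columns.levels[1]` lists only the labels that some column actually uses; the columns themselves are unchanged.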
Comment From: dfolch
I see now that there was a debate on this in #2770. remove_unused_levels is exactly what I needed. Thanks for implementing that @jreback.
Comment From: jreback
@dfolch great!
Yeah, it's not automatic, but at least it's possible now (in an easy way).