In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: df1 = pd.DataFrame(np.random.randint(0,10,(4,3)), columns=['a','b','c'])
In [4]: df1
Out[4]:
   a  b  c
0  9  3  2
1  9  0  2
2  7  7  6
3  3  4  2
In [5]: df2 = df1.reindex(columns=['c','a','d'])
In [6]: df2
Out[6]:
   c  a   d
0  2  9 NaN
1  2  9 NaN
2  6  7 NaN
3  2  3 NaN
In [7]: df2.columns
Out[7]: Index([u'c', u'a', u'd'], dtype='object')
In [8]: ci = pd.MultiIndex.from_product([['x','y'],['a','b','c']])
In [10]: df3 = pd.DataFrame(np.random.randint(0,10,(4,6)), columns=ci)
In [11]: df3
Out[11]:
   x        y
   a  b  c  a  b  c
0  3  1  5  3  8  7
1  2  7  6  8  7  4
2  8  3  5  7  1  1
3  7  5  8  7  8  7
In [12]: df4 = df3.reindex(columns=['c','a','d'], level=1)
In [13]: df4
Out[13]:
   x     y
   c  a  c  a
0  5  3  7  3
1  6  2  4  8
2  5  8  1  7
3  8  7  7  7
In [14]: df4.columns
Out[14]:
MultiIndex(levels=[[u'x', u'y'], [u'c', u'a', u'd']],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
When passing a nonexistent column name to reindex on a dataframe without multiindex columns, the result is:
- a NaN column with the "new" column name
- the columns attribute matches the columns in the dataframe
The same action on a multiindex dataframe produces different results:
- there are no NaN columns (this may not be a problem)
- the columns attribute of the resulting dataframe does not match the dataframe column names (this appears to be a bug)
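The mismatch described above can be checked directly by comparing the labels that actually appear in the columns with the labels stored in the level. This is only a minimal sketch, assuming a reasonably recent pandas; it uses deterministic data from np.arange in place of the random values above, and the names present and stored are illustrative.

```python
import numpy as np
import pandas as pd

# Same column layout as df3 above, but with deterministic values.
ci = pd.MultiIndex.from_product([["x", "y"], ["a", "b", "c"]])
df3 = pd.DataFrame(np.arange(24).reshape(4, 6), columns=ci)

# Reindex only the second column level, asking for a nonexistent "d".
df4 = df3.reindex(columns=["c", "a", "d"], level=1)

# Labels that actually occur in the second level of the columns.
present = list(df4.columns.get_level_values(1).unique())

# Labels stored in the level's lookup table; in the session above
# this still listed 'd' even though no column uses it.
stored = list(df4.columns.levels[1])

print("present:", present)
print("stored: ", stored)
```

Comparing present with stored shows whether the level carries labels that no column uses.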
Comment From: TomAugspurger
> the columns attribute of the resulting dataframe does not match the dataframe column names (this appears to be a bug)
If you look at df4.columns, you'll notice that 'd' is present:
In [65]: df4.columns.levels[1]
Out[65]: Index(['c', 'a', 'd'], dtype='object')
Going back to your first point, that there are no NaN columns in the output: that's because there are no labels for that level (notice that there aren't any 2s in your Out[14]). If you want similar behavior, you'd have to reindex on both levels:
In [77]: df3.reindex(columns=[('x', 'a'), ('x', 'd')])
Out[77]:
   x
   a   d
0  2 NaN
1  9 NaN
2  8 NaN
3  7 NaN
In [78]: df3.reindex(columns=[('x', 'a'), ('x', 'd')]).columns
Out[78]:
MultiIndex(levels=[['x'], ['a', 'd']],
           labels=[[0, 0], [0, 1]])
Make sense?
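To get a NaN column for 'd' under every top-level label, one option is to build the full set of (top, sub) pairs and reindex against that. The following is a sketch under the assumption that every top-level label should get the same sub-columns; target and df_full are illustrative names, and the data is deterministic instead of random.

```python
import numpy as np
import pandas as pd

ci = pd.MultiIndex.from_product([["x", "y"], ["a", "b", "c"]])
df3 = pd.DataFrame(np.arange(24).reshape(4, 6), columns=ci)

# Spell out every (top, sub) pair, including the nonexistent "d".
target = pd.MultiIndex.from_product([["x", "y"], ["c", "a", "d"]])

# Reindexing on full tuples adds all-NaN columns for missing pairs.
df_full = df3.reindex(columns=target)

print(df_full)
```

Because the pairs are explicit, different top-level labels could also be given different sub-columns by passing a list of tuples instead of a product.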
Comment From: jreback
duplicate of this issue: https://github.com/pandas-dev/pandas/issues/12319.
This is a deliberate choice ATM. Feel free to comment on the other issue.
Comment From: dfolch
The NaN column issue is tricky for sure. Not everyone wants the same lower level columns under every higher level column.
The columns attribute still seems a little weird, since it can contain column names that are not in the dataframe. In my use case I did a lot of manipulations to a dataframe, and then wanted to see which column names remained. Below is my solution:
In [62]: ci2 = pd.MultiIndex.from_tuples([('x','a'),('y','a'),('x','b'),('y','b'),('x','c')])
In [63]: df5 = pd.DataFrame(np.random.randint(0,10,(4,5)), columns=ci2)
In [64]: df6 = df5.drop(('x','c'), axis=1)
In [65]: df6
Out[65]:
   x  y  x  y
   a  a  b  b
0  0  0  8  4
1  7  2  0  4
2  4  2  2  8
3  6  1  7  5
In [66]: df6.columns
Out[66]:
MultiIndex(levels=[[u'x', u'y'], [u'a', u'b', u'c']],
           labels=[[0, 1, 0, 1], [0, 0, 1, 1]])
In [67]: df6.columns.levels[1][pd.unique(df6.columns.labels[1])]
Out[67]: Index([u'a', u'b'], dtype='object')
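Note that the labels attribute used in In [67] was renamed to codes in later pandas releases, so that line may not run as written on a current install. A version-independent way to get the same answer is to ask for the level values that actually occur. This is a minimal sketch with deterministic data; remaining is an illustrative name.

```python
import numpy as np
import pandas as pd

ci2 = pd.MultiIndex.from_tuples(
    [("x", "a"), ("y", "a"), ("x", "b"), ("y", "b"), ("x", "c")]
)
df5 = pd.DataFrame(np.arange(20).reshape(4, 5), columns=ci2)
df6 = df5.drop(("x", "c"), axis=1)

# get_level_values returns the label used by each column, so taking
# the unique values lists only labels that are still in use.
remaining = list(df6.columns.get_level_values(1).unique())

print(remaining)
```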
Comment From: jreback
This is defined behavior for how a MultiIndex works; see http://pandas.pydata.org/pandas-docs/stable/advanced.html#defined-levels, which covers remove_unused_levels.
Comment From: dfolch
I see now that there was a debate on this in #2770. remove_unused_levels is exactly what I needed. Thanks for implementing that @jreback.
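For anyone landing here later, a short sketch of how remove_unused_levels can be applied to the df6 case above. It assumes a pandas version that provides MultiIndex.remove_unused_levels; the variable names before, trimmed, and after are illustrative, and the data is deterministic instead of random.

```python
import numpy as np
import pandas as pd

ci2 = pd.MultiIndex.from_tuples(
    [("x", "a"), ("y", "a"), ("x", "b"), ("y", "b"), ("x", "c")]
)
df5 = pd.DataFrame(np.arange(20).reshape(4, 5), columns=ci2)
df6 = df5.drop(("x", "c"), axis=1)

# Level labels as stored on the index after the drop.
before = list(df6.columns.levels[1])

# remove_unused_levels returns a new MultiIndex whose levels contain
# only labels referenced by at least one column.
trimmed = df6.columns.remove_unused_levels()
after = list(trimmed.levels[1])

# Assign it back so the frame's columns attribute reflects the change.
df6.columns = trimmed

print("before:", before)
print("after: ", after)
```

The trimmed index compares equal to the original one; only the stored levels change, so assigning it back does not alter the data.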
Comment From: jreback
@dfolch great!
Yeah, it's not automatic, but at least it's possible now (in an easy way).