Pandas Inconsistencies in DataFrame construction with MultiIndex.from_tuples

It's not clear to me that this is a "bug" per se, but it's certainly not intuitive behavior. My thought was that these two methods would result in identical DataFrames.

import numpy as np
import pandas as pd

def multiindex_version_1(data):
    # Build all column labels up front; the last tuple, ('b',), is one element short.
    n, m = data.shape
    tuples = [('a', _m) for _m in range(m-1)] + [('b', )]
    columns = pd.MultiIndex.from_tuples(tuples)
    df = pd.DataFrame(data, columns=columns)
    return df

def multiindex_version_2(data):
    # Build only the 'a' columns with a MultiIndex, then add 'b' by assignment.
    n, m = data.shape
    tuples = [('a', _m) for _m in range(m-1)]
    columns = pd.MultiIndex.from_tuples(tuples)
    df = pd.DataFrame(data[:, :(m-1)], columns=columns)
    df['b'] = data[:, m-1]
    return df

def test_goofy_multiindexing():
    shape = (10, 4)
    data = np.random.rand(*shape)
    df1 = multiindex_version_1(data)
    print("============= version_1 =============")
    print(df1)
    df2 = multiindex_version_2(data)
    print("============= version_2 =============")
    print(df2)
    assert (df1==df2).all()

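Note the NaN in the second column level under b in the version_1 output immediately below; version_2 shows an empty label there instead.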

============= version_1 =============
          a                             b
          0         1         2       NaN
0  0.742381  0.981698  0.976639  0.212344
1  0.278328  0.188071  0.320450  0.236994
2  0.543064  0.786865  0.065103  0.233676
3  0.084130  0.081981  0.295404  0.113241
4  0.025517  0.522754  0.238385  0.284278
5  0.263382  0.176789  0.289079  0.472668
6  0.933632  0.328850  0.006974  0.191654
7  0.557348  0.313619  0.582076  0.897741
8  0.271253  0.815123  0.113981  0.787208
9  0.595073  0.887483  0.350284  0.815767
============= version_2 =============
          a                             b
          0         1         2          
0  0.742381  0.981698  0.976639  0.212344
1  0.278328  0.188071  0.320450  0.236994
2  0.543064  0.786865  0.065103  0.233676
3  0.084130  0.081981  0.295404  0.113241
4  0.025517  0.522754  0.238385  0.284278
5  0.263382  0.176789  0.289079  0.472668
6  0.933632  0.328850  0.006974  0.191654
7  0.557348  0.313619  0.582076  0.897741
8  0.271253  0.815123  0.113981  0.787208
9  0.595073  0.887483  0.350284  0.815767
E
======================================================================
ERROR: test_scatter.test_goofy_multiindexing
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/br/test/test_scatter.py", line 99, in test_goofy_multiindexing
    assert (df1==df2).all()
  File "/home/br/.virtualenvs/cloudlab/local/lib/python2.7/site-packages/pandas/core/ops.py", line 912, in f
    return self._compare_frame(other, func, str_rep)
  File "/home/br/.virtualenvs/cloudlab/local/lib/python2.7/site-packages/pandas/core/frame.py", line 3136, in _compare_frame
    raise ValueError('Can only compare identically-labeled '
ValueError: Can only compare identically-labeled DataFrame objects
-------------------- >> begin captured logging << --------------------
rospy.topics: INFO: topicmanager initialized
--------------------- >> end captured logging << ---------------------

----------------------------------------------------------------------
Ran 1 test in 0.010s

FAILED (errors=1)

Comment From: jreback

What you are asking is too magical / not supported.

There IS a difference between a level labeled NaN and one labeled ''. To be honest this is a bit odd, but I'm not sure whether we can or should do anything about it. MultiIndexes are fully leveled; that is, every entry has a value at every level. There could be a new kind of MultiIndex that handles this better, but that would be quite a challenge, I think.
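The distinction between a NaN label and a '' label can be checked directly. This is a minimal sketch, not code from the thread: the names nan_cols and blank_cols are made up for illustration, and the NaN is passed explicitly so that nothing depends on how short tuples are handled.

```python
import numpy as np
import pandas as pd

# Two column indexes that differ only in the second-level label for 'b'.
nan_cols = pd.MultiIndex.from_tuples([('a', 0), ('b', np.nan)])
blank_cols = pd.MultiIndex.from_tuples([('a', 0), ('b', '')])

# Pull out the second-level label of the 'b' entry from each index.
second_nan = nan_cols[1][1]
second_blank = blank_cols[1][1]

# The two indexes are not considered identically labeled.
labels_match = nan_cols.equals(blank_cols)

print(pd.isna(second_nan), repr(second_blank), labels_match)
```

Because the labels differ, comparing two DataFrames that carry these column indexes with == is expected to raise the "identically-labeled" error shown in the traceback above.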

Comment From: brianthelion

@jreback

A point that I am trying -- and maybe failing -- to illustrate here is that I did not, myself, insert the NaN; that was done by .from_tuples(...). You'll note that in multiindex_version_1(...) I didn't give a secondary index level for column b, but the secondary index level for b was nonetheless filled in with a NaN. This is undocumented behavior, and I would hazard that most folks will assume that they'll end up with the version_2 column scheme.

Comment From: jreback

@brianthelion By definition, pandas fills missing things with NaN. That is fully documented in the missing data section; you are welcome to take a look there. I think it would be much more surprising to fill the NaNs with something else. Auto-filling with anything else would not be general at all.

Comment From: brianthelion

@jreback

Sorry, maybe I'm still not explaining myself clearly here:

Let's start with the version_2 DataFrame and assume that that's the kind of column index structure that we want. Given the tuples that I passed to .from_tuples(...) in multiindex_version_1, my hope was that that was the kind of column structure I would get. Instead, I got a result in which pandas assumed that data were missing. The undocumented behavior here, then, is that .from_tuples assumes that data are missing if it doesn't get tuples of equal length.

The ultimate question is: how do I get the version_2 DataFrame without resorting to assignment, e.g., via a MultiIndex class constructor?

Comment From: jreback

Specify full-length tuples, e.g. ('b', ''), if you really want to use an empty string.
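Applied to the example from the original post, that suggestion might look like the sketch below. It assumes a recent pandas on Python 3; the deterministic data array and the names df1, df2 and same_labels are illustrative choices, not taken from the thread.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the random data in the original post.
data = np.arange(40, dtype=float).reshape(10, 4)
m = data.shape[1]

# Give 'b' an explicit empty-string second level instead of the short tuple ('b',).
tuples = [('a', i) for i in range(m - 1)] + [('b', '')]
df1 = pd.DataFrame(data, columns=pd.MultiIndex.from_tuples(tuples))

# version_2-style construction for comparison: build the 'a' columns, then assign 'b'.
df2 = pd.DataFrame(data[:, :m - 1],
                   columns=pd.MultiIndex.from_tuples(tuples[:-1]))
df2['b'] = data[:, m - 1]

# Compare the column labels produced by the two routes.
same_labels = list(df1.columns) == list(df2.columns)
print(same_labels)
```

If the labels match, df1 == df2 no longer raises. Note that the original assert would still need something like (df1 == df2).all().all(), or pandas.testing.assert_frame_equal, because .all() on a DataFrame returns a Series rather than a single boolean.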