It's not clear to me that this is a "bug" per se, but it's certainly not intuitive behavior. My thought was that these two methods would result in identical DataFrames.
import numpy as np
import pandas as pd

def multiindex_version_1(data):
    n, m = data.shape
    tuples = [('a', _m) for _m in xrange(m-1)] + [('b', )]
    columns = pd.MultiIndex.from_tuples(tuples)
    df = pd.DataFrame(data, columns=columns)
    return df

def multiindex_version_2(data):
    n, m = data.shape
    tuples = [('a', _m) for _m in xrange(m-1)]
    columns = pd.MultiIndex.from_tuples(tuples)
    df = pd.DataFrame(data[:, :(m-1)], columns=columns)
    df['b'] = data[:, m-1]
    return df

def test_goofy_multiindexing():
    shape = (10, 4)
    data = np.random.rand(*shape)
    df1 = multiindex_version_1(data)
    print "============= version_1 ============="
    print df1
    df2 = multiindex_version_2(data)
    print "============= version_2 ============="
    print df2
    assert (df1==df2).all()
Note the NaN immediately below.
============= version_1 =============
          a                             b
          0         1         2       NaN
0  0.742381  0.981698  0.976639  0.212344
1  0.278328  0.188071  0.320450  0.236994
2  0.543064  0.786865  0.065103  0.233676
3  0.084130  0.081981  0.295404  0.113241
4  0.025517  0.522754  0.238385  0.284278
5  0.263382  0.176789  0.289079  0.472668
6  0.933632  0.328850  0.006974  0.191654
7  0.557348  0.313619  0.582076  0.897741
8  0.271253  0.815123  0.113981  0.787208
9  0.595073  0.887483  0.350284  0.815767
============= version_2 =============
          a                             b
          0         1         2
0  0.742381  0.981698  0.976639  0.212344
1  0.278328  0.188071  0.320450  0.236994
2  0.543064  0.786865  0.065103  0.233676
3  0.084130  0.081981  0.295404  0.113241
4  0.025517  0.522754  0.238385  0.284278
5  0.263382  0.176789  0.289079  0.472668
6  0.933632  0.328850  0.006974  0.191654
7  0.557348  0.313619  0.582076  0.897741
8  0.271253  0.815123  0.113981  0.787208
9  0.595073  0.887483  0.350284  0.815767
E
======================================================================
ERROR: test_scatter.test_goofy_multiindexing
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/br/test/test_scatter.py", line 99, in test_goofy_multiindexing
    assert (df1==df2).all()
  File "/home/br/.virtualenvs/cloudlab/local/lib/python2.7/site-packages/pandas/core/ops.py", line 912, in f
    return self._compare_frame(other, func, str_rep)
  File "/home/br/.virtualenvs/cloudlab/local/lib/python2.7/site-packages/pandas/core/frame.py", line 3136, in _compare_frame
    raise ValueError('Can only compare identically-labeled '
ValueError: Can only compare identically-labeled DataFrame objects
-------------------- >> begin captured logging << --------------------
rospy.topics: INFO: topicmanager initialized
--------------------- >> end captured logging << ---------------------
----------------------------------------------------------------------
Ran 1 test in 0.010s
FAILED (errors=1)
Comment From: jreback
What you are asking is too magical / not supported.
There IS a difference between a level labeled NaN and one labeled ''. To be honest this is a bit odd, but I'm not really sure whether we can, or should, do anything about it. Multi-indexes are fully leveled, in that every value has each level represented. There could be a new MultiIndex which can handle this better, but that would be quite a challenge I think.
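The distinction between the two labels can be checked directly. The following is only a minimal sketch, assuming Python 3 and a reasonably recent pandas; the names mi_nan and mi_empty are made up for illustration and do not appear in the original report.

```python
import numpy as np
import pandas as pd

# Second-level label for 'b' is explicitly missing (NaN).
mi_nan = pd.MultiIndex.from_tuples([('a', 0), ('b', np.nan)])

# Second-level label for 'b' is the empty string.
mi_empty = pd.MultiIndex.from_tuples([('a', 0), ('b', '')])

# The two indexes carry different labels, so pandas treats them as unequal.
indexes_match = mi_nan.equals(mi_empty)

# Inspect the second-level label attached to 'b' in each case.
nan_label = mi_nan.get_level_values(1)[1]
empty_label = mi_empty.get_level_values(1)[1]
print(indexes_match, pd.isna(nan_label), repr(empty_label))
```

Because the labels differ, two frames built on these column indexes are not "identically labeled", which is consistent with the ValueError in the traceback above.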
Comment From: brianthelion
@jreback
A point that I am trying -- and maybe failing -- to illustrate here is that I did not, myself, insert the NaN; that was done by .from_tuples(...). You'll note that in multiindex_version_1(...) I didn't give a secondary index level for column b, but the secondary index level for b was nonetheless filled in with a NaN. This is undocumented behavior, and I would hazard that most folks will assume that they'll end up with the version_2 column scheme.
Comment From: jreback
@brianthelion by definition pandas fills missing things with NaN. That is quite fully documented in the missing data section. You are welcome to take a look there. I think it would be much more surprising to fill the NaNs with something else. It is not general at all to auto-fill things.
Comment From: brianthelion
@jreback
Sorry, maybe I'm still not explaining myself clearly here:
Let's start with the version_2 DataFrame and assume that that's the kind of column index structure that we want. Given the tuples that I put in to .from_tuples(...) in multiindex_version_1, my hope was that that was the kind of column structure I would get. Instead, I got a result in which pandas assumed that data were missing. The undocumented behavior here, then, is that .from_tuples assumes that data are missing if it doesn't get tuples of equal length.
The ultimate question is: how do I get the version_2 DataFrame without resorting to assignment, e.g., via a MultiIndex class constructor?
Comment From: jreback
Specify the tuples in full, e.g. ('b', ''), if you really want to use an empty string.
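As a rough illustration of that suggestion, the sketch below pads short tuples with '' before calling from_tuples and compares the result with the assignment-based version_2 frame. It assumes Python 3 and a reasonably recent pandas; pad_tuples is a hypothetical helper written for this example, not part of pandas.

```python
import numpy as np
import pandas as pd


def pad_tuples(tuples, fill=''):
    """Pad every tuple to the length of the longest one (hypothetical helper)."""
    width = max(len(t) for t in tuples)
    return [tuple(t) + (fill,) * (width - len(t)) for t in tuples]


data = np.random.rand(10, 4)
m = data.shape[1]

# Constructor route: ('b',) becomes ('b', '') before the index is built.
tuples = pad_tuples([('a', i) for i in range(m - 1)] + [('b',)])
df1 = pd.DataFrame(data, columns=pd.MultiIndex.from_tuples(tuples))

# Assignment route, as in multiindex_version_2 from the original report.
cols = pd.MultiIndex.from_tuples([('a', i) for i in range(m - 1)])
df2 = pd.DataFrame(data[:, :m - 1], columns=cols)
df2['b'] = data[:, m - 1]

# Compare the column labels and the underlying values of the two frames.
same_labels = list(df1.columns) == list(df2.columns)
same_values = np.array_equal(df1.values, df2.values)
print(same_labels, same_values)
```

Padding explicitly keeps the choice of filler visible in the calling code, rather than relying on whatever from_tuples does with tuples of unequal length.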