I filed #13415, in which it was explained that DataFrame(recarray, columns=MultiIndex) reindexes and therefore only selects matching columns for the resulting frame. I can see how this might be a backward-compatibility constraint. However, I have discovered a similar but distinct case which still seems broken:
import numpy as np
import pandas as pd

arr = np.zeros(3, [('q', [('x', float), ('y', int)])])
ind = pd.MultiIndex.from_tuples([('q', 'x'), ('q', 'y')])
pd.DataFrame(arr, columns=ind)
This creates a 3x2 array of zeros, but results in a 3x2 DataFrame of NaNs. Note that the column names essentially match: the NumPy array has a top-level q with subitems x and y, and so does the MultiIndex. If the top-level name in the MultiIndex is changed to something other than q, the result is an empty DataFrame, meaning that there is some recognized correspondence between the input data and the requested columns. But the data is lost nevertheless, with NaNs where there should be zeros.
Either the columns are considered non-matching, in which case the result should be an empty DataFrame, or they do match, in which case the result should be a DataFrame with contents from the input array.
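For concreteness, here is a sketch of the frame I would expect, built by indexing each leaf field of the nested dtype by hand (this is standard structured-array indexing, not a pandas feature):

```python
import numpy as np
import pandas as pd

arr = np.zeros(3, [('q', [('x', float), ('y', int)])])
ind = pd.MultiIndex.from_tuples([('q', 'x'), ('q', 'y')])

# The frame I would expect: the zeros from the array under the
# matching MultiIndex columns, built here by pulling each leaf
# field out of the nested dtype manually.
expected = pd.DataFrame({('q', 'x'): arr['q']['x'],
                         ('q', 'y'): arr['q']['y']})
expected.columns = ind
```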
Comment From: jorisvandenbossche
The issue is rather that pandas does not parse that hierarchical dtype as you expect:
In [74]: arr = np.zeros(3, [('q', [('x',float), ('y',int)])])
In [76]: pd.DataFrame(arr)
Out[76]:
q
0 (0.0, 0)
1 (0.0, 0)
2 (0.0, 0)
Given the above result, the rest (the empty frame when providing columns) is logical again. However, I am not sure what the correct way to convert such a recarray should be. The above also seems to make sense: since the records of the recarray consist of tuples, the resulting dataframe contains tuples as well.
BTW, I closed the previous issue, but that does not mean it is prohibited to ask further questions on that topic over there :-)
Comment From: jreback
This sort of works with the only constructor that accepts rec-arrays.
In [4]: pd.DataFrame.from_records(arr, columns=ind)
Out[4]:
q
0 (0.0, 0)
1 (0.0, 0)
2 (0.0, 0)
Comment From: jreback
this is essentially another case of #7893
Comment From: jzwinck
I disagree that this is another case of #7893. As I tried to explain:
Either the columns are considered non-matching, in which case the result should be an empty DataFrame, or they do match, in which case the result should be a DataFrame with contents from the input array.
The current behavior is that an erroneous DataFrame is created, which does not contain data from the input array, but is also not empty. If Pandas recognizes that the column names match, it should use the input data; if it believes the names don't match then the result should be an empty DataFrame. The current behavior is half-and-half.
Comment From: jreback
and that's a bug I agree
we don't need another issue that covers the same material as an existing one - it would just get even more lost. If you would like to address that issue, you can include this as a test case
Comment From: jzwinck
What I would like more than anything is a simple way to take a hierarchical recarray (such as my example arr) and get it into a DataFrame with a MultiIndex. I think you see what I am trying to do--can you offer a workaround?
Comment From: jorisvandenbossche
@jreback the result you show from from_records is exactly the same as from DataFrame():
In [1]: arr = np.zeros(3, [('q', [('x',float), ('y',int)])])
In [2]: pd.DataFrame(arr)
Out[2]:
q
0 (0.0, 0)
1 (0.0, 0)
2 (0.0, 0)
In [3]: pd.DataFrame.from_records(arr)
Out[3]:
q
0 (0.0, 0)
1 (0.0, 0)
2 (0.0, 0)
In [4]: ind = pd.MultiIndex.from_tuples([('q','x'),('q','y')])
In [5]: pd.DataFrame.from_records(arr, columns=ind)
Out[5]:
q
0 (0.0, 0)
1 (0.0, 0)
2 (0.0, 0)
So in the last line, the columns=ind is actually ignored, which rather looks like a bug
Comment From: jreback
assign the columns directly
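A sketch of that workaround, assuming the nested q field is extracted first (from_records otherwise leaves the tuples packed into a single column):

```python
import numpy as np
import pandas as pd

arr = np.zeros(3, [('q', [('x', float), ('y', int)])])
ind = pd.MultiIndex.from_tuples([('q', 'x'), ('q', 'y')])

# arr['q'] is a flat structured array with fields x and y, so
# from_records unpacks it into two columns; the MultiIndex is
# then assigned directly, as suggested above.
df = pd.DataFrame.from_records(arr['q'])
df.columns = ind
```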
Comment From: jreback
not even sure why you would work with rec arrays to begin with - they are not very friendly (not to mention they have an inefficient memory repr)
Comment From: shoyer
@jreback I agree that rec arrays don't work very well, but I disagree that they are memory inefficient -- the data is all packed together in the dtype, so that seems perfectly reasonable to me.
Comment From: jzwinck
@jreback to use a non-hierarchical example, let's say I have received from another library a big list of tuples and I have a dtype list which corresponds to them, e.g.:
data = [(1.2, 'foo'), (3.4, 'bar')] # in reality wider and very long, comes from another library
dtype = [('value', float), ('name', 'S3')]
Now in NumPy I do this:
np.array(data, dtype)
And I get something useful:
array([(1.2, 'foo'), (3.4, 'bar')],
dtype=[('value', '<f8'), ('name', 'S3')])
I can then construct a DataFrame from that array. Is there a better way to construct a DataFrame with explicit, heterogeneous column types? I don't want Pandas to guess the column types.
Comment From: jreback
this is exactly what .from_records() does; simply assign the columns afterward if they are MultiIndexes (needing to do so is a bug)
they are memory inefficient because pandas has to convert them to a columnar layout
Comment From: jzwinck
This doesn't work--the dtype cannot be specified:
data = [(1.2, 5), (3.4, 6)]
dtype = [('value', float), ('name', 'i2')]
pd.DataFrame.from_records(data)._data
It gives:
Axis 1: RangeIndex(start=0, stop=2, step=1)
FloatBlock: slice(0, 1, 1), 1 x 2, dtype: float64
IntBlock: slice(1, 2, 1), 1 x 2, dtype: int64
Only by using NumPy do I get what I want:
pd.DataFrame.from_records(np.array(data, dtype))._data
Axis 1: RangeIndex(start=0, stop=2, step=1)
FloatBlock: slice(0, 1, 1), 1 x 2, dtype: float64
IntBlock: slice(1, 2, 1), 1 x 2, dtype: int16
Note we now see int16 rather than int64. You have said that using recarray is memory-inefficient, but I am struggling because in my use case, not using recarray causes inefficiency in Pandas.
Is there a way to construct a DataFrame with multiple columns of different types efficiently from a sequence of tuples? Obviously I don't have an efficient way to get one column at a time from the tuples, so I can't easily construct a bunch of Series etc.
Comment From: jorisvandenbossche
That pd.DataFrame.from_records(data, dtype) does not give the desired result is expected, as the second argument is index (so you are setting the dtype list as the index values).
There is no way (as far as I know) to pass directly a compound dtype without making a numpy array first.
Comment From: shoyer
@jzwinck This gives data in the form you want:
dtype = [('value', float), ('name', 'i2')]
data = np.array([(1.2, 5), (3.4, 6)], dtype)
pd.DataFrame.from_records(data).dtypes
You need to make the numpy array with the proper dtype before passing it to from_records.
Comment From: jzwinck
@jorisvandenbossche and @shoyer Right, so what you and I are all saying is that constructing a NumPy recarray (structured array) is a prerequisite to constructing a Pandas DataFrame. Yet above I am being told that recarrays are bad and inefficient. So I don't really understand what to take away from all this.
Comment From: jorisvandenbossche
@jzwinck It is only a prerequisite when you want to specify a compound dtype. Otherwise, you can pass the list of tuples directly to DataFrame() and it will work without making a recarray first.
Further, you only have to worry about this if memory/performance of constructing your frame is a bottleneck.
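A minimal sketch of that direct construction (the column names value/name follow the earlier example; pandas infers the types itself, typically float64 for the numbers and object for the strings):

```python
import pandas as pd

data = [(1.2, 'foo'), (3.4, 'bar')]

# Without a compound dtype, the list of tuples goes straight into
# DataFrame(); pandas infers each column's type from the values.
df = pd.DataFrame(data, columns=['value', 'name'])
```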