I previously posted this as a question (not knowing it was a bug) here: http://stackoverflow.com/questions/37732403/pandas-dataframe-from-multiindex-and-numpy-structured-array-recarray
First I create a two-level MultiIndex:
import numpy as np
import pandas as pd
ind = pd.MultiIndex.from_product([('X','Y'), ('a','b')])
I can use it like this:
pd.DataFrame(np.zeros((3,4)), columns=ind)
Which gives:
     X         Y
     a    b    a    b
0  0.0  0.0  0.0  0.0
1  0.0  0.0  0.0  0.0
2  0.0  0.0  0.0  0.0
But now I'm trying to do this:
dtype = [('Xa','f8'), ('Xb','i4'), ('Ya','f8'), ('Yb','i4')]
pd.DataFrame(np.zeros(3, dtype), columns=ind)
But that gives me an empty DataFrame!
Empty DataFrame
Columns: [(X, a), (X, b), (Y, a), (Y, b)]
Index: []
I expected it to do the same thing as this:
df = pd.DataFrame(np.zeros(3, dtype))
df.columns = ind
df
Which is:
     X     Y
     a  b    a  b
0  0.0  0  0.0  0
1  0.0  0  0.0  0
2  0.0  0  0.0  0
INSTALLED VERSIONS
commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-86-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.18.0
pip: 8.1.1
setuptools: 20.7.0
numpy: 1.10.0
scipy: 0.16.0
statsmodels: 0.6.1
IPython: 3.2.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
tables: 3.2.2
numexpr: 2.5.2
matplotlib: 1.4.3
Comment From: jorisvandenbossche
This is a common pitfall: currently, passing columns to DataFrame() performs a reindex rather than overwriting the column labels.
If your data already carries column name information, pd.DataFrame(np.zeros(3, dtype), columns=ind) behaves more like:
df = pd.DataFrame(np.zeros(3, dtype))
df = df.reindex(columns=ind)
rather than
df = pd.DataFrame(np.zeros(3, dtype))
df.columns = ind
as you expected.
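To make the difference between selecting and relabeling concrete, here is a minimal sketch using a plain dict instead of a structured array. The column names "a", "b", and "c" are made up for illustration; the point is only that columns= picks existing labels (filling missing ones with NaN), whereas assigning to .columns relabels by position.

```python
import pandas as pd

# Hypothetical data with two named columns.
data = {"a": [1, 2], "b": [3, 4]}

# columns= selects by label: "a" is kept, "c" does not exist and is filled with NaN.
via_constructor = pd.DataFrame(data, columns=["a", "c"])

# An explicit reindex on the columns produces the same selection.
via_reindex = pd.DataFrame(data).reindex(columns=["a", "c"])

# Assigning to .columns relabels by position: the data of "b" is now called "c".
overwritten = pd.DataFrame(data)
overwritten.columns = ["a", "c"]

print(via_constructor)
print(via_reindex)
print(overwritten)
```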
Given this, the output you see is expected: the reindex finds no matching column names and returns an empty DataFrame. There are some related issues about this, and some discussion of changing the behavior (though it is an open question whether that is worth the breaking change).
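Until the behavior changes, one way to get the intended result is to build the frame from the structured array first and relabel afterwards. The sketch below reuses ind and dtype from the original post. The name-to-tuple mapping is an illustrative assumption (it pairs field names with MultiIndex entries in order) and is not part of any pandas API.

```python
import numpy as np
import pandas as pd

ind = pd.MultiIndex.from_product([("X", "Y"), ("a", "b")])
dtype = [("Xa", "f8"), ("Xb", "i4"), ("Ya", "f8"), ("Yb", "i4")]
arr = np.zeros(3, dtype)

# Option 1: let pandas use the field names, then overwrite the labels by position.
df = pd.DataFrame(arr)
df.columns = ind

# Option 2 (assumed helper mapping): tie each field name to its target tuple
# explicitly, so the relabeling does not silently depend on field order.
mapping = dict(zip(arr.dtype.names, ind))  # e.g. "Xa" -> ("X", "a")
df2 = pd.DataFrame(arr)
df2.columns = pd.MultiIndex.from_tuples([mapping[name] for name in df2.columns])

print(df)
```

Both options keep the per-field dtypes of the structured array, since the data is never reindexed.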
Comment From: jorisvandenbossche
xref discussion in #9237