read_excel()
seems to give the same results for a multi-index column file if index_col=None
or index_col=0
. Using the file from /pandas/test/io/data:
>>> df = pd.read_excel(
"testmultiindex.xlsx", header=[0, 1], index_col=None,
sheetname='mi_column_name')
>>> print(df.columns)
MultiIndex(levels=[[u'bar', u'foo'], [u'a', u'b']],
labels=[[1, 1, 0, 0], [0, 1, 0, 1]],
names=[u'c1', u'c2'])
>>> print(df)
c1 foo bar
c2 a b a b
0 1 2.5 2015-01-01 True
1 2 3.5 2015-01-02 False
2 3 4.5 2015-01-03 False
3 4 5.5 2015-01-04 True
Expected Output
With index_col=None
I would have expected the result to look something like:
MultiIndex(levels=[[u'bar', u'c1', u'foo'], [u'a', u'b', u'c2']],
labels=[[1, 2, 2, 0, 0], [2, 0, 1, 0, 1]],
names=[u'', u''])
c1 foo bar
c2 a b a b
0 0 1 2.5 2015-01-01 True
1 1 2 3.5 2015-01-02 False
2 2 3 4.5 2015-01-03 False
3 3 4 5.5 2015-01-04 True
Problem description
This behavior is due to pandas.io.excel._pop_header_name()
.
def _pop_header_name(row, index_col):
""" (header, new_data) for header rows in MultiIndex parsing"""
none_fill = lambda x: None if x == '' else x
if index_col is None:
# no index col specified, trim data for inference path
return none_fill(row[0]), row[1:]
else:
# pop out header name and fill w/ blank
i = index_col if not is_list_like(index_col) else max(index_col)
return none_fill(row[i]), row[:i] + [''] + row[i + 1:]
There are two things in the function that I do not understand
index_col is None
pulls out the first column and then treats it as an index.index_col == 0
returns something similar but adds a column of''
to the front of the return row. Why are these almost the same?
Output of pd.show_versions()
Comment From: jreback
@chris-b1 if you'd have a look.
Comment From: chris-b1
Thanks for the report, this is a duplicate of #11733, definitely would like to solve this.
index_col=None
, contrary to what you could might expect, is treated as "infer whether or not there is an index column" and is also the default for read_excel
. So as mentioned in that issue, I think the right approach is change the default for index_col
to a sentinel (e.g. 'infer'
) so that index_col=None
can mean something. PRs / ideas for the api welcome!