Pandas read_excel with multi-index columns and index_col=None, returns an index column

read_excel() seems to give the same results for a multi-index column file if index_col=None or index_col=0. Using the file from /pandas/test/io/data:

>>> df = pd.read_excel(
        "testmultiindex.xlsx", header=[0, 1], index_col=None,
        sheetname='mi_column_name')

>>> print(df.columns)
MultiIndex(levels=[[u'bar', u'foo'], [u'a', u'b']],
       labels=[[1, 1, 0, 0], [0, 1, 0, 1]],
       names=[u'c1', u'c2'])

>>> print(df)

c1 foo             bar       
c2   a    b          a      b
0    1  2.5 2015-01-01   True
1    2  3.5 2015-01-02  False
2    3  4.5 2015-01-03  False
3    4  5.5 2015-01-04   True

Expected Output

With index_col=None I would have expected the result to look something like:

MultiIndex(levels=[[u'bar', u'c1', u'foo'], [u'a', u'b', u'c2']],
           labels=[[1, 2, 2, 0, 0], [2, 0, 1, 0, 1]],
           names=[u'', u''])

  c1 foo             bar       
  c2   a    b          a      b
0  0   1  2.5 2015-01-01   True
1  1   2  3.5 2015-01-02  False
2  2   3  4.5 2015-01-03  False
3  3   4  5.5 2015-01-04   True

Problem description

This behavior is due to pandas.io.excel._pop_header_name().


def _pop_header_name(row, index_col):
    """ (header, new_data) for header rows in MultiIndex parsing"""
    none_fill = lambda x: None if x == '' else x

    if index_col is None:
        # no index col specified, trim data for inference path
        return none_fill(row[0]), row[1:]
    else:
        # pop out header name and fill w/ blank
        i = index_col if not is_list_like(index_col) else max(index_col)
        return none_fill(row[i]), row[:i] + [''] + row[i + 1:]

There are two things in the function that I do not understand

index_col is None pulls out the first column and then treats it as an index.
index_col == 0 returns something similar but adds a column of '' to the front of the return row. Why are these almost the same?

Output of `pd.show_versions()`

# Paste the output here pd.show_versions() here INSTALLED VERSIONS ------------------ commit: None python: 2.7.13.final.0 python-bits: 32 OS: Windows OS-release: 7 machine: AMD64 processor: Intel64 Family 6 Model 69 Stepping 1, GenuineIntel byteorder: little LC_ALL: None LANG: None LOCALE: None.None pandas: 0.19.2 nose: 1.3.7 pip: 9.0.1 setuptools: 30.3.0 Cython: 0.24 numpy: 1.12.0 scipy: 0.13.3 xarray: None sphinx: 1.5.1 patsy: 0.4.1 dateutil: 2.6.0 pytz: 2016.10 blosc: 1.3.2 bottleneck: 1.0.0 tables: 3.1.0 numexpr: 2.5.2 matplotlib: 1.5.1 openpyxl: 2.3.4 xlrd: 0.9.4 xlwt: 1.0.0 xlsxwriter: None lxml: 3.6.0 pandas_datareader: 0.3.0.post

Comment From: jreback

@chris-b1 if you'd have a look.

Comment From: chris-b1

Thanks for the report, this is a duplicate of #11733, definitely would like to solve this.

index_col=None, contrary to what you could might expect, is treated as "infer whether or not there is an index column" and is also the default for read_excel. So as mentioned in that issue, I think the right approach is change the default for index_col to a sentinel (e.g. 'infer') so that index_col=None can mean something. PRs / ideas for the api welcome!

Pandas read_excel with multi-index columns and index_col=None, returns an index column

Expected Output

Problem description

Output of pd.show_versions()

Output of `pd.show_versions()`