Pandas Reading from Excel mangles columns

As far as I can tell, both methods of reading an Excel sheet (through ExcelFile and read_excel) do not have the option to avoid mangling duplicate columns, which exists for read_csv (https://github.com/pydata/pandas/issues/3468).

I have an Excel file that looks like this:

  foo  foo  bar  bar  baz  baz
    A    B    A    B    A    B
  123  456  789   12  345  678
  901  234  567  890  123  456
  789   12  345  678  901  234

I initially encountered this problem while trying to find a workaround for not being able to specify a multirow-header (#4679). Reading in this Excel file with header = 0 (default) mangles the column names:

>>> df = pd.read_excel("dupe_cols.xlsx", header = 0)
>>> print df
   foo foo.1  bar bar.1  baz baz.1
0    A     B    A     B    A     B
1  123   456  789    12  345   678
2  901   234  567   890  123   456
3  789    12  345   678  901   234

Unlike read_csv, read_excel does not have the option to avoid mangling duplicate columns (using ExcelFile.parse works the same as far as I can see). Specifying header = None in read_excel and then assigning the column names to the first row will effectively allow you to avoid mangling the column names. You could also read in the Excel sheet with header = None, save a .csv with header = False and index = False, then read that csv and specify mangle_dupe_cols = False to get the dataframe you want. (Incidentally, this also allows you to specify multiple rows as the header, which is the behavior I was originally trying to emulate.)

Comment From: jreback

can you post a small example that can be used as a tests in the top section (e.g. create a frame, write to excel), show the arguments used to read back

Comment From: gfyoung

Because we pass in **kwds from read_excel to TextFileReader, we actually do accept mangle_dupe_cols now. 🎉 We just need to surface it explicitly.

Otherwise, this issue is superseded by #13262.

Comment From: geminizb

I'm afraid ExcelFile.parse is not changed yet

Comment From: dhimmel

With pandas 1.0.3 I got Setting mangle_dupe_cols=False is not supported yet.

~/anaconda3/lib/python3.6/site-packages/pandas/io/excel/_base.py in read_excel(io, sheet_name, header, names, index_col, usecols, squeeze, dtype, engine, converters, true_values, false_values, skiprows, nrows, na_values, keep_default_na, verbose, parse_dates, date_parser, thousands, comment, skipfooter, convert_float, mangle_dupe_cols, **kwds)
    332         convert_float=convert_float,
    333         mangle_dupe_cols=mangle_dupe_cols,
--> 334         **kwds,
    335     )
    336 

~/anaconda3/lib/python3.6/site-packages/pandas/io/excel/_base.py in parse(self, sheet_name, header, names, index_col, usecols, squeeze, converters, true_values, false_values, skiprows, nrows, na_values, parse_dates, date_parser, thousands, comment, skipfooter, convert_float, mangle_dupe_cols, **kwds)
    886             convert_float=convert_float,
    887             mangle_dupe_cols=mangle_dupe_cols,
--> 888             **kwds,
    889         )
    890 

~/anaconda3/lib/python3.6/site-packages/pandas/io/excel/_base.py in parse(self, sheet_name, header, names, index_col, usecols, squeeze, dtype, true_values, false_values, skiprows, nrows, na_values, verbose, parse_dates, date_parser, thousands, comment, skipfooter, convert_float, mangle_dupe_cols, **kwds)
    510                     usecols=usecols,
    511                     mangle_dupe_cols=mangle_dupe_cols,
--> 512                     **kwds,
    513                 )
    514 

~/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py in TextParser(*args, **kwds)
   2199     """
   2200     kwds["engine"] = "python"
-> 2201     return TextFileReader(*args, **kwds)
   2202 
   2203 

~/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
    865         self._currow = 0
    866 
--> 867         options = self._get_options_with_defaults(engine)
    868 
    869         self.chunksize = options.pop("chunksize", None)

~/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py in _get_options_with_defaults(self, engine)
    893             # see gh-12935
    894             if argname == "mangle_dupe_cols" and not value:
--> 895                 raise ValueError("Setting mangle_dupe_cols=False is not supported yet")
    896             else:
    897                 options[argname] = value

ValueError: Setting mangle_dupe_cols=False is not supported yet

Comment From: talatccan

I got ValueError: Setting mangle_dupe_cols=False is not supported yet for pandas version 1.0.4

Code:

df = pd.read_excel(DATA_PATH, mangle_dupe_cols=False)

Error Details:

 File "site-packages\pandas\io\excel\_base.py", line 334, in read_excel
    **kwds,
  File "site-packages\pandas\io\excel\_base.py", line 888, in parse
    **kwds,
  File "site-packages\pandas\io\excel\_base.py", line 512, in parse
    **kwds,
  File "site-packages\pandas\io\parsers.py", line 2201, in TextParser
    return TextFileReader(*args, **kwds)
  File "site-packages\pandas\io\parsers.py", line 867, in __init__
    options = self._get_options_with_defaults(engine)
  File "site-packages\pandas\io\parsers.py", line 895, in _get_options_with_defaults
    raise ValueError("Setting mangle_dupe_cols=False is not supported yet")
ValueError: Setting mangle_dupe_cols=False is not supported yet

Comment From: ronald8192

Using pandas-1.2.0, got an error: ValueError: Setting mangle_dupe_cols=False is not supported yet

>>> pd.read_excel('test.xlsx',mangle_dupe_cols=False)
<stdin>:1: FutureWarning: Your version of xlrd is 1.2.0. In xlrd >= 2.0, only the xls format is supported. As a result, the openpyxl engine will be used if it is installed and the engine argument is not specified. Install openpyxl instead.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/linuxbrew/.linuxbrew/opt/python@3.8/lib/python3.8/site-packages/pandas/util/_decorators.py", line 299, in wrapper
    return func(*args, **kwargs)
  File "/home/linuxbrew/.linuxbrew/opt/python@3.8/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 344, in read_excel
    data = io.parse(
  File "/home/linuxbrew/.linuxbrew/opt/python@3.8/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 1153, in parse
    return self._reader.parse(
  File "/home/linuxbrew/.linuxbrew/opt/python@3.8/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 532, in parse
    parser = TextParser(
  File "/home/linuxbrew/.linuxbrew/opt/python@3.8/lib/python3.8/site-packages/pandas/io/parsers.py", line 2224, in TextParser
    return TextFileReader(*args, **kwds)
  File "/home/linuxbrew/.linuxbrew/opt/python@3.8/lib/python3.8/site-packages/pandas/io/parsers.py", line 801, in __init__
    options = self._get_options_with_defaults(engine)
  File "/home/linuxbrew/.linuxbrew/opt/python@3.8/lib/python3.8/site-packages/pandas/io/parsers.py", line 829, in _get_options_with_defaults
    raise ValueError("Setting mangle_dupe_cols=False is not supported yet")
ValueError: Setting mangle_dupe_cols=False is not supported yet