Pandas FR: Allow duplicate column names in pandas.read_csv

Right now, pandas's read_csv() supports forcing column names read from CSV data to be unique:

>>> import pandas as pd
>>> pd.__version__
u'0.20.3'
>>> from StringIO import StringIO
>>> csv_data = """a,a,b
... 1,2,3
... 4,5,6"""
>>> df = pd.read_csv(StringIO(csv_data))
>>> df.columns.tolist()
['a', 'a.1', 'b']

The documentation suggests that passing mangle_dupe_cols=False to read_csv() will change this behavior to one where it'll overwrite data on load. That doesn't seem to be implemented as of this version:

>>> df = pd.read_csv(StringIO(csv_data), mangle_dupe_cols=False)
# snip
ValueError: Setting mangle_dupe_cols=False is not supported yet

However! Pandas doesn't fundamentally disallow duplicate column names. There's a simple (but somewhat convoluted) trick to get duplicate columns from a CSV file into your headers:

>>> df = pd.read_csv(StringIO(csv_data), header=None)
>>> df.columns = df.iloc[0]
>>> df = df.reindex(df.index.drop(0)).reset_index(drop=True)
>>> df
0  a  a  b
0  1  2  3
1  4  5  6

(Then again, maybe this isn't so simple; there's a funny "0" in there with the column headings and that seems weird.)
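The stray "0" appears to be the name of the columns index: df.iloc[0] is a Series whose name is its row label (0), and assigning it to df.columns carries that name along. A minimal sketch that avoids it by assigning a plain list instead, assuming Python 3 (hence io.StringIO rather than the StringIO module used above); the variable names raw and header are illustrative only:

```python
from io import StringIO

import pandas as pd

csv_data = "a,a,b\n1,2,3\n4,5,6"

# Read everything as data, so the header row arrives as row 0.
raw = pd.read_csv(StringIO(csv_data), header=None)

# A plain list carries no name, so the columns index stays unnamed.
header = raw.iloc[0].tolist()

# Drop the header row and renumber the remaining rows from 0.
df = raw.iloc[1:].reset_index(drop=True)
df.columns = header

print(df.columns.tolist())
print(df.columns.name)
```

One caveat with any header=None approach: because the header row is read as data, every column comes back as strings (object dtype), so numeric columns would need converting afterwards.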

Problem description

I would like a native way to read CSV files with repeated headers. In my application, it's literally so I can warn people about duplicated column headers. Yes, I could use python's built-in csv module for this, but then I'm using two methods to read CSV files and it gets weird.
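For reference, the csv-module fallback mentioned above might look roughly like this. It is only a sketch, assuming Python 3 and a comma-delimited file whose first row is the header; duplicated_headers is a hypothetical helper name, not part of pandas or the csv module:

```python
import csv
from collections import Counter
from io import StringIO


def duplicated_headers(fileobj):
    """Return header names that occur more than once, in first-seen order."""
    # Read only the first row; an empty file yields an empty header.
    header = next(csv.reader(fileobj), [])
    counts = Counter(header)
    return [name for name in counts if counts[name] > 1]


print(duplicated_headers(StringIO("a,a,b\n1,2,3\n4,5,6")))
```

This works, but it is exactly the second CSV-reading path described above: the file still has to be read again with read_csv() afterwards.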

Since mangle_dupe_cols=False is not yet implemented, I'd propose that passing it keep duplicate column names exactly as they appear in the file.

Output of `pd.show_versions()`

>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 17.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.20.3
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.2.7
Cython: None
numpy: 1.13.1
scipy: None
xarray: None
IPython: 5.4.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

Comment From: njvack

Probably a slightly better "turn the first row into column headers" incantation:

df = df.rename(columns=df.iloc[0], copy=False).iloc[1:].reset_index(drop=True)

Also, I've upgraded to pandas 0.22 and all this works in that version as well.

Comment From: chris-b1

Thanks for the report, this is a duplicate of #13262. A PR to support this would be welcome if you're interested!

Comment From: njvack

Thanks — I suspected I wasn't the first to report this, but somehow failed to find that issue. I'll try and work on a PR for this (it shouldn't be too hard?) once I can figure out how pandas's source works some; this is a pretty intimidating codebase...

Comment From: grofte

Since pandas does support non-unique column names it would be really great if pandas had some kind of function to warn about them. Maybe in df.info() and / or other functions commonly used when running into trouble.

Example of crashing: trying to use seaborn boxplot on a dataframe with duplicate column names. The crash comes from pandas but I don't know what the actual crash stems from.

Comment From: claresloggett

While this doesn't directly address the issue, for the specific use-case that @njvack is encountering - wanting to warn the user about duplicate columns - I've resorted to a pre-load of the header row with

pd.read_csv(myfile, header=None, nrows=1).iloc[0,:].value_counts()

and then any values with count > 1 can be reported to the user.
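A minimal sketch of that pre-load check wrapped in a function, assuming Python 3; duplicated_columns is a hypothetical helper name, and dtype=str is an addition not in the one-liner above, included so that numeric-looking header cells are compared as text:

```python
from io import StringIO

import pandas as pd


def duplicated_columns(source):
    """Return header values that appear more than once in the first row."""
    # Read only the header row, as data, keeping every cell as a string.
    header = pd.read_csv(source, header=None, nrows=1, dtype=str).iloc[0]
    counts = header.value_counts()
    return counts[counts > 1].index.tolist()


print(duplicated_columns(StringIO("a,a,b\n1,2,3\n4,5,6")))
```

If source is an open file object rather than a path, it will need a seek(0) before the real read_csv() call, since the pre-load consumes part of the stream.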

As a general issue though, mangle_dupe_cols still needs implementation by the looks of it (as of Pandas version 1.4.1).
