Right now, pandas's read_csv() supports forcing column names read from CSV data to be unique:
>>> import pandas as pd
>>> pd.__version__
u'0.20.3'
>>> from StringIO import StringIO
>>> csv_data = """a,a,b
... 1,2,3
... 4,5,6"""
>>> df = pd.read_csv(StringIO(csv_data))
>>> df.columns.tolist()
['a', 'a.1', 'b']
The documentation suggests that passing mangle_dupe_cols=False to read_csv() will change this behavior so that duplicate columns overwrite one another on load. That doesn't seem to be implemented as of this version:
>>> df = pd.read_csv(StringIO(csv_data), mangle_dupe_cols=False)
# snip
ValueError: Setting mangle_dupe_cols=False is not supported yet
However! Pandas doesn't fundamentally disallow duplicate column names. There's a simple (but somewhat convoluted) trick to get duplicate columns from a CSV file into your headers:
>>> df = pd.read_csv(StringIO(csv_data), header=None)
>>> df.columns = df.iloc[0]
>>> df = df.reindex(df.index.drop(0)).reset_index(drop=True)
>>> df
0  a  a  b
0  1  2  3
1  4  5  6
(Then again, maybe this isn't so simple; there's a funny "0" in there with the column headings and that seems weird.)
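The stray "0" is the name of the columns Index: df.iloc[0] is a Series whose name is its row label, 0, and assigning it to df.columns carries that name along. A minimal sketch of one way to avoid it, assuming Python 3 (hence io.StringIO) and a plain list for the header; the variable names raw and header are only illustrative:

```python
from io import StringIO

import pandas as pd

csv_data = "a,a,b\n1,2,3\n4,5,6"

# Read everything as data so pandas never gets a chance to mangle the names.
raw = pd.read_csv(StringIO(csv_data), header=None)

# Take the first row as a plain list; a list has no name, so no stray "0".
header = raw.iloc[0].tolist()

# Drop the header row from the data and renumber the index from 0.
df = raw.iloc[1:].reset_index(drop=True)
df.columns = header

print(df.columns.tolist())
```

Because the header row was parsed as data, the remaining values stay as strings here; converting them to numbers would be a separate step.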
Problem description
I would like a native way to read CSV files with repeated headers. In my application, it's literally so I can warn people about duplicated column headers. Yes, I could use Python's built-in csv module for this, but then I'm using two methods to read CSV files and it gets weird.
Since mangle_dupe_cols=False is not yet implemented, I'd propose that it behave this way: keep the duplicate column names exactly as they appear in the file.
Comment From: njvack
Probably a slightly better "turn the first row into column headers" incantation:
df = df.rename(columns=df.iloc[0], copy=False).iloc[1:].reset_index(drop=True)
Also, I've upgraded to pandas 0.22 and all this works in that version as well.
Comment From: chris-b1
Thanks for the report, this is a duplicate of #13262. A PR to support this would be welcome if you're interested!
Comment From: njvack
Thanks. I suspected I wasn't the first to report this, but somehow failed to find that issue. I'll try to work on a PR for this (it shouldn't be too hard?) once I've figured out a bit of how pandas's source works; this is a pretty intimidating codebase...
Comment From: grofte
Since pandas does support non-unique column names it would be really great if pandas had some kind of function to warn about them. Maybe in df.info() and / or other functions commonly used when running into trouble.
Example of crashing: trying to use seaborn boxplot on a dataframe with duplicate column names. The crash comes from pandas but I don't know what the actual crash stems from.
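Until something like that exists in pandas itself, a check along these lines can be run by hand. This is only a sketch: warn_on_duplicate_columns is a hypothetical helper, not a pandas function, and it relies on Index.duplicated() to flag repeated labels.

```python
import warnings

import pandas as pd


def warn_on_duplicate_columns(df):
    """Hypothetical helper: warn about, and return, repeated column names."""
    # Index.duplicated() marks every occurrence after the first one.
    dupes = df.columns[df.columns.duplicated()].unique().tolist()
    if dupes:
        warnings.warn("Duplicate column names: {}".format(dupes))
    return dupes


df = pd.DataFrame([[1, 2, 3]], columns=["a", "a", "b"])
dupes = warn_on_duplicate_columns(df)
```

Calling it right after loading, before handing the frame to a plotting library, would at least surface the problem earlier.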
Comment From: claresloggett
While this doesn't directly address the issue, for the specific use-case that @njvack is encountering - wanting to warn the user about duplicate columns - I've resorted to a pre-load of the header row with
pd.read_csv(myfile, header=None, nrows=1).iloc[0,:].value_counts()
and then any values with count > 1 can be reported to the user.
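Spelled out a little more, the pre-load check might look like the sketch below. The function name find_duplicate_headers and the in-memory sample data are assumptions for illustration; in practice the argument would be the file path.

```python
from io import StringIO

import pandas as pd


def find_duplicate_headers(source):
    """Return header names that appear more than once in the first row."""
    # header=None keeps the names as data; nrows=1 reads only the header row.
    header = pd.read_csv(source, header=None, nrows=1).iloc[0]
    counts = header.value_counts()
    return counts[counts > 1].index.tolist()


dupes = find_duplicate_headers(StringIO("a,a,b\n1,2,3\n4,5,6"))
if dupes:
    print("Duplicated column headers:", dupes)
```

This reads the header row a second time, but only that row, so the extra cost should be small even for large files.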
As a general issue though, mangle_dupe_cols=False still needs implementing, by the looks of it (as of pandas version 1.4.1).