Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
dup = 'duplic'
df = pd.DataFrame({'indicator': [dup, 'whatever', dup],
                   'time0': ['x11', 'sth', 'x12'],
                   'time1': ['x21', 'sth', 'x22']})
df = df.set_index('indicator') # now the index has duplicate values.
df = df.T # now, two columns are named identically. So far so good.
# Now, if we check the DataFrame, it looks as expected
print('\nDataframe\n', df)
# If we check df.columns, it looks as expected
print('\ndf.columns\n', df.columns)
# BUT if we check df[df.columns], it duplicates the already duplicated columns
print('\ndf[df.columns]\n', df[df.columns])
Issue Description
When a DataFrame has duplicate column names, indexing it with df[df.columns] duplicates the already duplicated columns.
When I print the dataframe, I get the expected:
indicator duplic whatever duplic
time0        x11      sth    x12
time1        x21      sth    x22
When I print df.columns, I get the expected:
Index(['duplic', 'whatever', 'duplic'], dtype='object', name='indicator')
But when I print df[df.columns], I get:
indicator duplic duplic whatever duplic duplic
time0        x11    x12      sth    x11    x12
time1        x21    x22      sth    x21    x22
(Of course I didn't just want to print out the dataframe; this is the shortest reproducible example I could think of. I discovered it while trying to apply some filters on the columns indexer.)
Expected Behavior
I expected df[df.columns] to return the same three columns as df, without repeating the duplicated ones.
Installed Versions
Comment From: MarcoGorelli
Looks right to me. If you print df['duplic'], you get both columns:
In [3]: df['duplic']
Out[3]:
indicator duplic duplic
time0        x11    x12
time1        x21    x22
So if you print df[['duplic', 'whatever', 'duplic']], it makes sense for duplic to appear 4 times, since each 'duplic' in the list selects both columns.
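A minimal sketch of that counting, rebuilding the frame from the report directly with duplicate column names (the variable names `single` and `listed` are only illustrative):

```python
import pandas as pd

# Same data as the reproducible example, built with duplicate columns directly.
df = pd.DataFrame(
    [["x11", "sth", "x12"], ["x21", "sth", "x22"]],
    index=["time0", "time1"],
    columns=["duplic", "whatever", "duplic"],
)

# One label that matches two columns returns both of them.
single = df["duplic"]

# A list of three labels: every label is expanded to all of its matches,
# so the two 'duplic' entries contribute two columns each (2 + 1 + 2 = 5).
listed = df[list(df.columns)]

print(single.shape)          # (2, 2)
print(listed.shape)          # (2, 5)
print(list(listed.columns))  # ['duplic', 'duplic', 'whatever', 'duplic', 'duplic']
```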
Comment From: ThanosGkou
Oh, I see. It's understandable if this behaviour is not considered a bug, especially since having duplicate indexes is not a great idea anyway.
Anyway, I'd agree that df['duplic'] should select both columns. But it would be more intuitive if df[['duplic', 'duplic']] also returned the same two columns, not two "copies" of the two columns.
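For anyone who wants each matching column only once, one possible workaround (a sketch, not something proposed in this thread) is to select with a boolean mask over the existing columns instead of a list of labels. `Index.isin` and `.loc` are standard pandas; the names `by_label` and `by_mask` are only illustrative:

```python
import pandas as pd

df = pd.DataFrame(
    [["x11", "sth", "x12"], ["x21", "sth", "x22"]],
    index=["time0", "time1"],
    columns=["duplic", "whatever", "duplic"],
)

# Label-based selection: each requested label expands to all of its matches,
# so asking for 'duplic' twice yields four columns.
by_label = df[["duplic", "duplic"]]

# Boolean mask: one True/False per existing column, so no column can be
# selected more than once, however often a label appears in the list.
mask = df.columns.isin(["duplic", "duplic"])
by_mask = df.loc[:, mask]

print(by_label.shape)  # (2, 4)
print(by_mask.shape)   # (2, 2)
```

The same idea applies to filters on the columns: df.loc[:, df.columns != 'whatever'] keeps the two 'duplic' columns once each.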
Comment From: MarcoGorelli
> especially since having duplicate indexes is not a great idea anyway,
exactly :) kinda wish we could disallow it, but that's a separate issue - closing then
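Duplicates are not disallowed globally, but pandas 1.2 and later let an individual object opt out of them through the `allows_duplicate_labels` flag. A small sketch, assuming the frame from the report (the `rejected` variable is only illustrative):

```python
import pandas as pd
from pandas.errors import DuplicateLabelError

df = pd.DataFrame(
    [["x11", "sth", "x12"], ["x21", "sth", "x22"]],
    index=["time0", "time1"],
    columns=["duplic", "whatever", "duplic"],
)

# Asking for a copy that forbids duplicate labels fails immediately,
# because the columns already contain 'duplic' twice.
try:
    df.set_flags(allows_duplicate_labels=False)
    rejected = False
except DuplicateLabelError:
    rejected = True

print(rejected)
```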