Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [X] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
dup = 'duplic'
df = pd.DataFrame({'indicator':[dup, 'whatever', dup],
                   'time0':['x11',"sth", 'x12'],
                   'time1':['x21','sth', 'x22']})

df = df.set_index('indicator') # now the index has duplicate values. 
df = df.T # now, two columns are named identically. So far so good.

# Now, if we check the dataframe, it looks as expected
print('\nDataframe\n', df)

# If we check the df.columns, looks as expected
print('\ndf.columns\n', df.columns)

# BUT if we check df[df.columns], the already duplicated columns are duplicated again
print('\ndf[df.columns]\n', df[df.columns])

Issue Description

When a dataframe has duplicate column names, indexing it with df[df.columns] duplicates the already duplicated columns.

When I print the dataframe, I get the expected:

indicator duplic whatever duplic
time0        x11      sth    x12
time1        x21      sth    x22

When I print df.columns, I get the expected:

 Index(['duplic', 'whatever', 'duplic'], dtype='object', name='indicator')

But when I print df[df.columns], I get:

 indicator duplic duplic whatever duplic duplic
time0        x11    x12      sth    x11    x12
time1        x21    x22      sth    x21    x22

(Of course I didn't just want to print out the dataframe; this is the shortest reproducible example I could think of. I discovered it while trying to apply some filters on the columns indexer.)

Expected Behavior

I expected df[df.columns] to return the same three columns as df, without repeating the duplicated ones.

Installed Versions

INSTALLED VERSIONS
------------------
commit : 2e218d10984e9919f0296931d92ea851c6a6faf5
python : 3.11.2.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 151 Stepping 2, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : Greek_Greece.1253
pandas : 1.5.3
numpy : 1.24.2
pytz : 2022.7.1
dateutil : 2.8.2
setuptools : 65.5.1
pip : 23.0.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : 4.12.0
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.7.1
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : None
zstandard : None
tzdata : 2023.2

Comment From: MarcoGorelli

looks right to me

if you print df['duplic'] you get both columns

In [3]: df['duplic']
Out[3]:
indicator duplic duplic
time0        x11    x12
time1        x21    x22

so if you print df[['duplic', 'whatever', 'duplic']] it makes sense for duplic to appear 4 times
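To make the reasoning above concrete, here is a minimal sketch that rebuilds the frame from the report directly (rather than via set_index and T, which is only a shortcut for this illustration) and counts the columns each selection returns. Each label in the list is looked up independently and brings back every column carrying that label.

```python
import pandas as pd

# Same data as in the report: two columns share the label 'duplic'.
df = pd.DataFrame(
    [["x11", "sth", "x12"], ["x21", "sth", "x22"]],
    index=["time0", "time1"],
    columns=pd.Index(["duplic", "whatever", "duplic"], name="indicator"),
)

# A single label matches two columns, so the result is a 2-column DataFrame.
single = df["duplic"]

# Each entry of the list is resolved separately and expands to all its matches:
# 2 (duplic) + 1 (whatever) + 2 (duplic) = 5 columns.
listed = df[["duplic", "whatever", "duplic"]]

print(single.shape)
print(list(listed.columns))
```

Since df.columns is exactly that list of labels, df[df.columns] goes through the same lookup and returns five columns.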

Comment From: ThanosGkou

Oh, I see. It's understandable if this behaviour is not considered a bug, especially since having duplicate indexes is not a great idea anyway.

Anyway, I'd agree that df['duplic'] should select both columns. But it would be more intuitive if df[['duplic', 'duplic']] also returned the same two columns, not two "copies" of the two columns.
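For filtering columns without the repetition, one possible approach (a sketch, not something proposed in this thread) is to select by boolean mask or by position instead of by label. Both address each column slot once, so duplicate labels are not expanded. The names mask, only_duplic and same are made up for this example.

```python
import pandas as pd

# Same data as in the report: two columns share the label 'duplic'.
df = pd.DataFrame(
    [["x11", "sth", "x12"], ["x21", "sth", "x22"]],
    index=["time0", "time1"],
    columns=pd.Index(["duplic", "whatever", "duplic"], name="indicator"),
)

# Boolean mask: one True/False per column position.
mask = df.columns.isin(["duplic"])
only_duplic = df.loc[:, mask]

# Positional selection: every column position is picked exactly once.
same = df.iloc[:, list(range(df.shape[1]))]

print(only_duplic.shape)
print(same.equals(df))
```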

Comment From: MarcoGorelli

> especially since having duplicate indexes is not a great idea anyway,

exactly :) kinda wish we could disallow it, but that's a separate issue - closing then
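Duplicate labels remain allowed by default, but pandas (1.2 and later) offers an opt-in flag per object. The sketch below assumes a small numeric frame with the same column labels as the report and shows that set_flags(allows_duplicate_labels=False) rejects it; the variable rejected exists only for this illustration.

```python
import pandas as pd
from pandas.errors import DuplicateLabelError

# Illustrative frame with the same duplicated column labels as in the report.
df = pd.DataFrame([[1, 2, 3]], columns=["duplic", "whatever", "duplic"])

try:
    # Opting out of duplicate labels validates the existing labels immediately.
    df.set_flags(allows_duplicate_labels=False)
    rejected = False
except DuplicateLabelError:
    rejected = True

print(rejected)
```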