Pandas inconsistent type returned on single (string) column subsetting

Code Sample, a copy-pastable example if possible

As expected:

import numpy as np
import pandas as pd
all_cols = list("abcdefg")

informative_cols = ["a","c","g"]

dfrandom = pd.DataFrame(np.random.rand(5, len(all_cols)), columns=all_cols)

field = informative_cols[0]
print(field)
print(type(dfrandom[field]))
print(type(dfrandom[informative_cols][field]))

returns:

a
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>

However, unexpectedly:

import numpy as np
import pandas as pd
all_cols = list("abcdefg")

informative_cols = ["a","c","g"]

dfrandom = pd.DataFrame(np.random.rand(5, len(all_cols)), columns=all_cols)

field = informative_cols[0]
print(field)
print(type(dfrandom[field]))
print(type(dfrandom[informative_cols][field]))

returns:

0_AnatomicRegionSequence_CodeMeaning
<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>

Problem description

Selecting a single column with a string label should return the same type regardless of copying or any previous subsetting. In the first example the two results agree (both are Series); in the second they do not (a Series versus a DataFrame).

Expected Output

0_AnatomicRegionSequence_CodeMeaning
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.2.0-42-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.20.2
pytest: None
pip: 9.0.1
setuptools: 36.0.1
Cython: None
numpy: 1.13.0
scipy: 0.19.0
xarray: None
IPython: 6.1.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

Comment From: TomAugspurger

Can you check your second example? I see

a
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>

for both.

Comment From: jorisvandenbossche

@DSLituiev I think you mistakenly pasted exactly the same code twice, for both the working and the failing example. Can you update this?

Comment From: DSLituiev

I beg your pardon. The failing example should have been:

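import numpy as np
import pandas as pd

all_cols = list("abcdefg")

# Pick columns "a", "c", "e", then repeat the list so every label appears twice
informative_cols = [all_cols[x] for x in [0,2,4]]
informative_cols = informative_cols + informative_cols

dfrandom = pd.DataFrame(np.random.rand(5, len(all_cols)), columns=all_cols)

field = informative_cols[0]
print(field)
print(type(dfrandom[field]))                    # Series
print(type(dfrandom[informative_cols][field]))  # DataFrame: two columns are named "a"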

The issue seems to be due to duplicated columns in the selection.

Comment From: DSLituiev

Maybe it would be good to issue a warning in this case?

Comment From: jorisvandenbossche

Ah, yes, that is expected. If you have duplicate column names, df[col_name] will select all columns with this name, and thus return a DataFrame instead of a Series.
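
To make the behaviour concrete, here is a minimal sketch of the same selection with and without repeated labels. The variable names (dup, first_a, deduped) are only for illustration, and the two workarounds at the end are common options rather than an official recommendation.

```python
import numpy as np
import pandas as pd

all_cols = list("abcdefg")
df = pd.DataFrame(np.random.rand(5, len(all_cols)), columns=all_cols)

# Selecting with a list that repeats labels yields duplicate column names.
dup = df[["a", "c", "e", "a", "c", "e"]]

unique_result = df["a"]   # exactly one column named "a": Series
dup_result = dup["a"]     # two columns named "a": DataFrame

print(type(unique_result).__name__)
print(type(dup_result).__name__, dup_result.shape)

# Workaround 1: take one of the matching columns by position.
first_a = dup.loc[:, "a"].iloc[:, 0]

# Workaround 2: drop repeated labels, keeping the first occurrence of each.
deduped = dup.loc[:, ~dup.columns.duplicated()]
print(type(deduped["a"]).__name__)
```

Either workaround gives back a Series for "a"; which one is appropriate depends on whether the repeated columns are meant to be there.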

Comment From: TomAugspurger

I don't think we'll issue a warning here. For better or worse, duplicates are allowed and used by some.

I'd recommend using something like engarde if you want runtime validation that you don't have duplicates.
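
If pulling in a validation library is more than you need, a plain pandas check can do the same job. This is only a rough sketch: require_unique_columns is a hypothetical helper name, not part of pandas or engarde, and it relies on nothing more than Index.is_unique and Index.duplicated.

```python
import pandas as pd


def require_unique_columns(df):
    """Hypothetical helper: raise if any column label is repeated."""
    if not df.columns.is_unique:
        # Collect each repeated label once for the error message.
        dupes = df.columns[df.columns.duplicated()].unique().tolist()
        raise ValueError(f"duplicate column labels: {dupes}")
    return df


# A frame with unique labels passes through unchanged.
ok = require_unique_columns(pd.DataFrame({"a": [1], "c": [2]}))

# A frame with a repeated label is rejected.
dup = pd.DataFrame([[1, 2, 3]], columns=["a", "c", "a"])
try:
    require_unique_columns(dup)
    message = None
except ValueError as exc:
    message = str(exc)
print(message)
```

Because the helper returns the frame, it can be dropped into a chain with df.pipe(require_unique_columns) before any label-based selection.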
