Code Sample, a copy-pastable example if possible
As expected:
import numpy as np
import pandas as pd
all_cols = list("abcdefg")
informative_cols = ["a","c","g"]
dfrandom = pd.DataFrame(np.random.rand(5, len(all_cols)), columns=all_cols)
field = informative_cols[0]
print(field)
print(type(dfrandom[field]))
print(type(dfrandom[informative_cols][field]))
returns:
a
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
However, unexpectedly:
import numpy as np
import pandas as pd
all_cols = list("abcdefg")
informative_cols = ["a","c","g"]
dfrandom = pd.DataFrame(np.random.rand(5, len(all_cols)), columns=all_cols)
field = informative_cols[0]
print(field)
print(type(dfrandom[field]))
print(type(dfrandom[informative_cols][field]))
returns:
0_AnatomicRegionSequence_CodeMeaning
<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
Problem description
Selecting a single column with a string label should return the same type regardless of copying or previous subsetting. The expected output is consistent with this; the unexpected output is not.
Expected Output
0_AnatomicRegionSequence_CodeMeaning
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
Output of pd.show_versions()
Comment From: TomAugspurger
Can you check your second example? I see
a
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
for both.
Comment From: jorisvandenbossche
@DSLituiev I think you mistakenly pasted exactly the same code twice, for both the working and the failing example. Can you update this?
Comment From: DSLituiev
I beg pardon.
import numpy as np
import pandas as pd
all_cols = list("abcdefg")
informative_cols = [all_cols[x] for x in [0,2,4]]
# Repeating the list puts every selected label into the selection twice.
informative_cols = informative_cols + informative_cols
dfrandom = pd.DataFrame(np.random.rand(5, len(all_cols)), columns=all_cols)
field = informative_cols[0]
print(field)
print(type(dfrandom[field]))
print(type(dfrandom[informative_cols][field]))
The issue seems to be due to duplicated columns in the selection.
Comment From: DSLituiev
Maybe it would be good to issue a warning in this case?
Comment From: jorisvandenbossche
Ah, yes, that is expected. If you have duplicate column names, df[col_name] will select all columns with this name, and thus return a DataFrame instead of a Series.
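A minimal sketch of that behaviour, using a small made-up frame (the names `df`, `dup`, `unique_result`, and `dup_result` are illustrative and not from the report):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(2, 3), columns=["a", "b", "c"])

# Selecting with a list that repeats "a" yields a frame with two "a" columns.
dup = df[["a", "b", "a"]]
print(list(dup.columns))

# A unique label gives a Series; a duplicated label gives a DataFrame
# holding every column that carries that label.
unique_result = dup["b"]
dup_result = dup["a"]
print(type(unique_result).__name__)
print(type(dup_result).__name__, dup_result.shape)
```

Here `dup["a"]` comes back as a two-column DataFrame, which matches the type change in the report.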
Comment From: TomAugspurger
I don't think we'll issue a warning here. For better or worse, duplicates are allowed and used by some.
I'd recommend using something like engarde if you want runtime validation that you don't have duplicates.
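Without an extra dependency, a check along these lines might be enough. This is only a sketch: `assert_unique_columns` is a hypothetical helper written for illustration, not part of pandas or engarde, and it relies only on `Index.is_unique` and `Index.duplicated`.

```python
import pandas as pd


def assert_unique_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: raise if any column label is repeated."""
    if not df.columns.is_unique:
        dupes = df.columns[df.columns.duplicated()].unique().tolist()
        raise ValueError(f"duplicate column labels: {dupes}")
    return df


df = pd.DataFrame([[1, 2, 3]], columns=["a", "c", "e"])
cols = ["a", "c", "e"] * 2  # same shape of selection as in the report

# Option 1: de-duplicate the selection list while preserving order.
unique_cols = list(dict.fromkeys(cols))
checked = assert_unique_columns(df[unique_cols])

# Option 2: fail loudly when duplicates slip through.
try:
    assert_unique_columns(df[cols])
    message = ""
except ValueError as exc:
    message = str(exc)
print(message)
```

De-duplicating the list up front keeps `df[cols][field]` returning a Series, while the check surfaces the problem early in code paths where duplicates are never intended.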