xref #46944
import pandas as pd

columns1 = pd.MultiIndex.from_tuples([("a", "a2"), ("b", "c")])
df1 = pd.DataFrame([[1, 2]], columns=columns1)
print(df1["a"])
#    a2
# 0   1

columns2 = pd.MultiIndex.from_tuples([("a", ""), ("b", "c")])
df2 = pd.DataFrame([[1, 2]], columns=columns2)
print(df2["a"])
# 0    1
# Name: a, dtype: int64
The first case produces a DataFrame, whereas the second case produces a Series. I don't think this is intentional. This gives rise to a difference in DataFrameGroupBy._selected_obj and DataFrameGroupBy._obj_with_exclusions, which can lead to erroneous results (#50804 is one example).
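A minimal sketch for checking the two cases programmatically rather than by reading the printed output. The helper name selection_kind is hypothetical and only used for illustration; per the report above, the first frame should yield a DataFrame and the second a Series.

```python
import pandas as pd


def selection_kind(df: pd.DataFrame, key: str) -> str:
    """Return the class name of the object that df[key] yields."""
    return type(df[key]).__name__


# Same frames as in the report: only the second level under "a" differs.
df1 = pd.DataFrame(
    [[1, 2]], columns=pd.MultiIndex.from_tuples([("a", "a2"), ("b", "c")])
)
df2 = pd.DataFrame(
    [[1, 2]], columns=pd.MultiIndex.from_tuples([("a", ""), ("b", "c")])
)

kinds = {"df1": selection_kind(df1, "a"), "df2": selection_kind(df2, "a")}
print(kinds)
```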
Currently, df2.groupby("a") is allowed whereas df1.groupby("a") raises. So returning a DataFrame in the 2nd case will resolve the groupby inconsistency as well.
One can do df1.groupby(("a", "a2")) successfully, so I don't think there is a worry about making certain ops not possible.
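The groupby behavior described above can be probed with a small sketch like the following. The helper try_groupby is hypothetical, and it catches a broad Exception on purpose because the exact exception type and message raised for the partial key may differ between pandas versions.

```python
import pandas as pd


def try_groupby(df: pd.DataFrame, key):
    """Attempt df.groupby(key) and report whether it succeeded."""
    try:
        sizes = df.groupby(key).size()
    except Exception as exc:  # broad on purpose; the exception type is an unknown here
        return False, f"{type(exc).__name__}: {exc}"
    return True, f"{len(sizes)} group(s)"


df1 = pd.DataFrame(
    [[1, 2]], columns=pd.MultiIndex.from_tuples([("a", "a2"), ("b", "c")])
)
df2 = pd.DataFrame(
    [[1, 2]], columns=pd.MultiIndex.from_tuples([("a", ""), ("b", "c")])
)

results = {
    'df1.groupby("a")': try_groupby(df1, "a"),
    'df2.groupby("a")': try_groupby(df2, "a"),
    'df1.groupby(("a", "a2"))': try_groupby(df1, ("a", "a2")),
}
for label, (ok, info) in results.items():
    print(f"{label}: ok={ok} ({info})")
```

Passing the full tuple key names exactly one column, which is why it remains a usable spelling regardless of how the partial key "a" is treated.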
cc @phofl for any thoughts
Comment From: phofl
Not really any thoughts, but we had an issue about this as well. I looked into why this was done the way it's done: it has been around forever and is tested explicitly.
Comment From: rhshadrach
I think this could be supported in groupby with just a few lines, but I do find it somewhat odd behavior. I'm guessing an empty string enables having some columns behave as if they are multi-indexed whereas other columns are not (or just have fewer levels). I didn't see it mentioned anywhere in the docs, but I could have missed it.
I'd support removing this special-case behavior, but I'm not going to push for it. It seems like added complexity that doesn't offer much in the way of benefits (but maybe it does and I just don't see it).
Comment From: rhshadrach
The recommendation of using _obj_with_exclusions instead of _selected_obj in https://github.com/pandas-dev/pandas/issues/46944#issuecomment-1409619368 has the added benefit of handling this behavior without any other change to groupby.