Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})  # inferred from the output below
def func(ser):
    return ser.rename('foo')
print(df.groupby('a')['b'].apply(func))
Issue Description
I get
a
1 0 4
2 1 5
3 2 6
Name: b, dtype: int64
Expected Behavior
I would've expected
a
1 0 4
2 1 5
3 2 6
Name: foo, dtype: int64
Looks like this is set very explicitly though:
https://github.com/pandas-dev/pandas/blob/2f751ad28ef29c6c4cea5c84503f6e07c378780f/pandas/core/groupby/generic.py#L404
Is this right?
Trying to do something similar with DataFrameGroupBy.apply gives
>>> df.groupby('a')[['b']].apply(lambda ser: ser.rename(columns=lambda _: 'foo'))
foo
a
1 0 4
2 1 5
3 2 6
which is more like what I was expecting
Related
https://github.com/pandas-dev/pandas/issues/46369
@rhshadrach thoughts here?
Comment From: rhshadrach
Agreed a user could find the result in the example above is unexpected. However I wonder if there are use cases where the name gets set naturally via pandas operations within the UDF but the current behavior of returning the name of the column is desired. For example, doing
df = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5]})
gb = df.groupby('a')['b']
result = gb.cumsum()
result has the name 'b', and this is the same if one instead does gb.apply(lambda x: x.cumsum()). This is complicated by the fact that the names of the Series passed to the UDF are the group values (1 and 2 in the above example).
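A minimal sketch of the point above, using the same frame. The helper `cumsum_udf` and the list `seen_names` are illustrative names, not part of pandas; they only record what name each per-group Series carries when it reaches the UDF.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5]})
gb = df.groupby('a')['b']

seen_names = []  # names of the per-group Series handed to the UDF


def cumsum_udf(ser):
    # Record the incoming name, then do the same work as gb.cumsum().
    seen_names.append(ser.name)
    return ser.cumsum()


direct = gb.cumsum()
applied = gb.apply(cumsum_udf)

print(direct.name)              # name of the built-in result
print(applied.name)             # name of the apply result
print(sorted(set(seen_names)))  # names the UDF saw on its inputs
```

If the behavior described in this thread holds on your pandas version, both results should be named after the selected column, while the UDF itself only ever sees the group values as names.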
Thinking about the various cases for the returned name (the group value, None, and all others), along with the possibility of different groups returning different cases (e.g. for one group the result is named None, another is named the group value, and a third has a different name altogether), it seems to me that trying to do anything more advanced is trying to guess at what the user might want. Perhaps we're likely to get more cases right, but at the expense of having a more complex description of what the resulting name is.
In situations like this, my preference would be to have a simple and easy to predict result - e.g. always the name of the Series. However I'm certainly willing to entertain other suggestions, and I'm still thinking about the difference in behavior between DataFrameGroupBy and SeriesGroupBy as highlighted in the OP.
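To make the mixed case concrete, here is a rough sketch in which each group returns a differently named Series: one named None, one left with the group value, and one given an arbitrary name. The UDF `mixed_names` is hypothetical and exists only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})


def mixed_names(ser):
    # ser.name is the group value when the UDF is called.
    out = ser.copy()
    if ser.name == 1:
        out.name = None    # first group: unnamed result
    elif ser.name == 3:
        out.name = 'foo'   # third group: arbitrary name
    return out             # second group: keeps the group value as its name


result = df.groupby('a')['b'].apply(mixed_names)
print(result.name)
```

Under the "always the name of the Series" rule there is nothing to reconcile here: the combined result simply takes the name of the selected column, whatever the individual groups returned.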
Comment From: MarcoGorelli
Thanks for sharing your thoughts!
> In situations like this, my preference would be to have a simple and easy to predict result - e.g. always the name of the Series.
Yeah, if it's intentional then this sounds like a nice rule. To explain how I came across it, it was in the context of #49912 - would you be OK with setting
# https://github.com/pandas-dev/pandas/issues/49909
expected = expected.rename("count")
in the test that applies value_counts both as df.groupby(col1)[col2].value_counts() and df.groupby(col1)[col2].apply(Series.value_counts)? If we accept the above rule, then it'd be expected that the results would differ (and IMO that'd be OK).
Comment From: rhshadrach
> would you be OK with... If we accept the above rule, then it'd be expected that the results would differ (and IMO that'd be OK)
Yes - most certainly. In general, we don't expect DataFrameGroupBy.apply(lambda x: x.method()) to agree with DataFrameGroupBy.method(). The former has to operate with the UDF as a black box whereas the latter can take the operation itself into account in order to produce better output.
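A small sketch of that contrast for value_counts on a SeriesGroupBy, using a made-up frame. The name of the built-in result depends on the pandas version, so the code prints it rather than assuming a value; only the apply path is expected to follow the column-name rule discussed here.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 3, 5]})
gb = df.groupby('a')['b']

via_method = gb.value_counts()                # built-in: knows the operation
via_apply = gb.apply(pd.Series.value_counts)  # UDF: treated as a black box

print(via_method.name)  # version-dependent name chosen by the built-in
print(via_apply.name)   # name assigned to the apply result
print(via_method.tolist() == via_apply.tolist())  # the counts themselves
```

The counts agree between the two paths; any difference is confined to metadata such as the result name, which is why a test comparing the two may need an explicit rename.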
Comment From: MarcoGorelli
makes sense, thanks! closing then