- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [ ] (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
import pandas as pd


def apply_func(group):
    # Aggregate one group into a single-row dataframe
    return pd.DataFrame([{'cnt': group.b.sum()}])


def func(df):
    # Group on 'a', aggregate each group, then move the group key back into a column
    return df.groupby(['a']).apply(apply_func).reset_index(level=0)


# OK : Apply the method to a one row dataframe with no category types
func(pd.DataFrame([{'a': 1, 'b': 2}]))
# As expected, output has column 'a'

# KO : Apply method to a one row dataframe where grouping column is of category type
df = pd.DataFrame([{'a': 1, 'b': 2}])
df.a = df.a.astype('category')
func(df)
# I expected column 'a' instead of column 'index'
Problem description
As soon as the input dataframe is not empty, the method func should return a dataframe with column 'a'. This seems to happen in all cases, except when the dataframe has only one row and the grouping column is categorical.
*NB: this does not seem to happen when not using apply.*
Expected Output
As shown, the result we expect for both method calls is the one we obtain when grouping column is not categorical:
#    a  cnt
# 0  1    2
Output of pd.show_versions()
Comment From: rhshadrach
Thanks for the report. There does indeed seem to be a bug, but let me also recommend using a line like this instead:
df.groupby('a').b.sum().to_frame('cnt').reset_index()
It avoids using apply and so will be significantly faster.
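For reference, here is a minimal sketch of that line applied to the one-row categorical dataframe from the report. Passing observed=True is an addition not present in the original line; it is assumed here only to keep the result limited to categories that actually appear in the data, whatever the default is in a given pandas version.

```python
import pandas as pd

# One-row dataframe with a categorical grouping column, as in the report
df = pd.DataFrame([{'a': 1, 'b': 2}])
df['a'] = df['a'].astype('category')

# observed=True is an assumption added for this sketch (not in the original line)
result = df.groupby('a', observed=True)['b'].sum().to_frame('cnt').reset_index()

# The grouping column keeps its name: the columns are 'a' and 'cnt'
print(list(result.columns))
```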
The bug seems to result from a difference between the fast (cython) and slow paths within apply. Here, core.groupby.ops._is_indexed_like is using equals to determine whether mutated should be True or False, whereas _libs.reduction.apply_frame_axis0 is using is. The two indices are:
RangeIndex(start=0, stop=1, step=1)
Int64Index([0], dtype='int64')
which is why equals returns True whereas the identity check (is) returns False. This causes two different paths to be taken in core.groupby.groupby._concat_objects, resulting in the different output.
It's unclear to me whether equals or is gives the preferred behavior.
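The difference between the two checks can be reproduced outside of groupby. This is only a sketch of the comparison itself, not of the pandas internals; pd.Index([0], dtype='int64') is used as a stand-in for Int64Index([0]), which is how older pandas versions display it.

```python
import pandas as pd

range_index = pd.RangeIndex(start=0, stop=1, step=1)
int_index = pd.Index([0], dtype='int64')  # displayed as Int64Index([0]) in older pandas

# equals compares the contents of the two indices
same_values = range_index.equals(int_index)

# is compares object identity: these are two distinct Python objects
same_object = range_index is int_index

print(same_values, same_object)  # True False
```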
Comment From: kevinadda
Thanks @rhshadrach!
In my situation, the aggregating function apply_func returns multiple values, some of which are computed from multiple columns: I believe .apply is the only way in this situation.
My workaround is simply to detect when 'index' appears in the output and manually replace it with column 'a'. It doesn't have much impact since it only happens with one-row dataframes.
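Roughly, the workaround looks like the sketch below. The helper name restore_group_column is made up for illustration, and the buggy frame is built by hand here so the snippet does not depend on which pandas version is installed.

```python
import pandas as pd


def restore_group_column(result, name='a'):
    """Hypothetical helper: rename a stray 'index' column back to the grouping column."""
    if 'index' in result.columns and name not in result.columns:
        return result.rename(columns={'index': name})
    return result


# Hand-built stand-in for the unexpected output (column 'index' instead of 'a')
buggy = pd.DataFrame({'index': [1], 'cnt': [2]})
fixed = restore_group_column(buggy)

# A result that already has column 'a' is returned unchanged
already_ok = restore_group_column(pd.DataFrame({'a': [1], 'cnt': [2]}))

print(list(fixed.columns))
```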
I can't help with the final questions, but reading your comment I still wonder why this happens only with one-row dataframes.
With multiple rows in the initial dataframe, the column 'a' shows up in the result. Why would equals and is give the same result in this situation? What is specific to the one-row dataframe case?
Here is the additional code for this case:
# OK : Apply method to a multiple rows dataframe where grouping column is of category type
df = pd.DataFrame([{'a': 1, 'b': 1}, {'a': 1, 'b': 1}])
df.a = df.a.astype('category')
func(df)
Comment From: rhshadrach
Thanks for the additional info @kevinadda. When there are multiple rows, the result of the groupby is reduced via your function and thus has a different index from the group_axes. Therefore there is no difference between equals and is.
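To make this concrete, the sketch below compares the index of a group with the index of what apply_func returns for it, using whole dataframes as stand-ins for a single group. It only mimics the comparison and does not call the internal pandas functions mentioned above.

```python
import pandas as pd


def apply_func(group):
    return pd.DataFrame([{'cnt': group.b.sum()}])


one_row = pd.DataFrame([{'a': 1, 'b': 2}])
two_rows = pd.DataFrame([{'a': 1, 'b': 1}, {'a': 1, 'b': 1}])

# One row in, one row out: the result index has the same contents as the group index,
# so equals says "same" while an identity check on two distinct objects would not
one_row_same = apply_func(one_row).index.equals(one_row.index)

# Two rows in, one row out: the indices differ, so both checks agree they are different
two_rows_same = apply_func(two_rows).index.equals(two_rows.index)

print(one_row_same, two_rows_same)  # True False
```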
Comment From: rhshadrach
I now get the expected output of the OP. This was resolved by the removal of libreduction. The two paths are now using the same logic to determine whether the index is the same.