• [x] I have checked that this issue has not already been reported.

• [x] I have confirmed this bug exists on the latest version of pandas.

• [ ] (optional) I have confirmed this bug exists on the master branch of pandas.



Code Sample, a copy-pastable example

import pandas as pd

def apply_func(group):
    return pd.DataFrame([{'cnt': group.b.sum()}])

def func(df):
    return df.groupby(['a']).apply(apply_func).reset_index(level=0)

# OK: apply the function to a one-row DataFrame with no categorical columns
func(pd.DataFrame([{'a': 1, 'b': 2}]))
# As expected, output has column 'a'

# Not OK: apply the function to a one-row DataFrame whose grouping column is categorical
df = pd.DataFrame([{'a': 1, 'b': 2}])
df.a = df.a.astype('category')
func(df)
# I expected column 'a' instead of column 'index'

Problem description

Whenever the input DataFrame is non-empty, func should return a DataFrame with a column 'a'. This seems to hold in all cases except when the DataFrame has exactly one row and the grouping column is categorical.

NB: this does not seem to happen when apply is not used.

Expected Output

As shown, the result we expect for both calls is the one we obtain when the grouping column is not categorical:

#    a  cnt
# 0  1    2

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : d9fff2792bf16178d4e450fe7384244e50635733
python : 3.6.8.final.0
python-bits : 64
OS : Darwin
OS-release : 17.7.0
Version : Darwin Kernel Version 17.7.0: Thu Jun 18 21:21:34 PDT 2020; root:xnu-4570.71.82.5~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.1.0
numpy : 1.17.3
pytz : 2019.3
dateutil : 2.8.0
pip : 20.2
setuptools : 40.6.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 7.9.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 2.2.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.0.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

Comment From: rhshadrach

Thanks for the report. There does indeed seem to be a bug, but let me also recommend using a line like this instead:

 df.groupby('a').b.sum().to_frame('cnt').reset_index()

It avoids using apply and so will be significantly faster.

The bug seems to result from a difference between the fast (Cython) and slow paths within apply. Here, core.groupby.ops._is_indexed_like uses equals to determine whether mutated should be True or False, whereas _libs.reduction.apply_frame_axis0 uses is. The two indices are:

RangeIndex(start=0, stop=1, step=1)
Int64Index([0], dtype='int64')

which is why equals is returning True whereas is is returning False. This causes two different paths to be taken in core.groupby.groupby._concat_objects, resulting in the different output.
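To make the distinction concrete, here is a minimal sketch that compares two indices like the ones above. It does not call the pandas internals named in this comment; the variable names are only for illustration, and the integer index is built with pd.Index, which is an assumption about how to reproduce the second index on current pandas versions.

```python
import pandas as pd

# Two indices holding the same single label, built in different ways.
range_idx = pd.RangeIndex(start=0, stop=1, step=1)
int_idx = pd.Index([0], dtype="int64")

# equals compares the labels, so the two indices are considered the same.
same_values = bool(range_idx.equals(int_idx))

# is compares object identity, so two separate objects are never the same.
same_object = range_idx is int_idx

print(same_values, same_object)
```

A check based on equals therefore treats the result as indexed like the group, while a check based on is does not.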

It's unclear to me if using equals or is is the preferred behavior.

Comment From: kevinadda

Thanks @rhshadrach!

In my situation, the aggregating function apply_func returns multiple values, some of which are computed from multiple columns. I believe .apply is the only way to do this.

My workaround is simply to detect when 'index' appears in the output and manually rename it to 'a'. It doesn't have much impact, since it only happens with one-row DataFrames.
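A rough sketch of that workaround, for anyone hitting the same thing. The helper name restore_group_column is hypothetical, and the broken frame below is a hand-built stand-in for the affected output rather than the result of an actual groupby call.

```python
import pandas as pd

def restore_group_column(result, name="a"):
    """Rename a stray 'index' column back to the grouping column name.

    Hypothetical helper: it only renames when 'index' is present and
    the expected column is missing, so correct output passes through.
    """
    if "index" in result.columns and name not in result.columns:
        return result.rename(columns={"index": name})
    return result

# Hand-built stand-in for the affected output (column 'index' instead of 'a').
broken = pd.DataFrame({"index": [1], "cnt": [2]})
fixed = restore_group_column(broken)
print(list(fixed.columns))
```

Applying the helper to output that already has column 'a' leaves it unchanged, so it can wrap every call to func.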

I can't help with the final question, but after reading your comment I still wonder why this happens only with one-row DataFrames. With multiple rows in the initial DataFrame, the column 'a' shows up in the result. Why would equals and is give the same result in that situation? What is specific to the one-row case?

Here is the additional code for this case:

# OK: apply the function to a multi-row DataFrame whose grouping column is categorical
df = pd.DataFrame([{'a': 1, 'b': 1}, {'a': 1, 'b': 1}])
df.a = df.a.astype('category')
func(df)

Comment From: rhshadrach

Thanks for the additional info @kevinadda. When there are multiple rows, the result of the groupby is reduced via your function and thus has a different index from the group_axes. Therefore there is no difference between equals and is.
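As a rough illustration, the sketch below calls apply_func directly on a whole frame, treating it as a single group. That is a simplification of what groupby does internally, and the variable names are only for illustration.

```python
import pandas as pd

def apply_func(group):
    return pd.DataFrame([{'cnt': group.b.sum()}])

one_row = pd.DataFrame([{'a': 1, 'b': 2}])
two_rows = pd.DataFrame([{'a': 1, 'b': 1}, {'a': 1, 'b': 1}])

# One row in, one row out: the result's labels match the group's labels.
one_row_same = bool(apply_func(one_row).index.equals(one_row.index))

# Two rows in, one row out: the labels differ, so equals is False too.
two_rows_same = bool(apply_func(two_rows).index.equals(two_rows.index))

print(one_row_same, two_rows_same)
```

Only in the one-row case can equals report a match that an identity check would miss.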

Comment From: rhshadrach

I now get the expected output of the OP. This was resolved by the removal of libreduction. The two paths are now using the same logic to determine whether the index is the same.