Code Sample
import pandas
# Works great
df1 = pandas.DataFrame([[20,'A'],[20,'B'],[10,'C']])
gb1 = df1.groupby(0).agg(pandas.Series.mode)
display(gb1)
# Exception: Must produce aggregated value
df2 = pandas.DataFrame([[20,'A'],[20,'B'],[30,'C']])
gb2 = df2.groupby(0).agg(pandas.Series.mode)
display(gb2)
Problem Description
As shown above, the code works for df1, returning the following result:
    1
0
10  C
20  [A, B]
(where C is a str, and [A, B] is a numpy.ndarray)
However, it doesn't work for df2, throwing the following exception:
...
C:\ProgramData\Anaconda3\envs\py36\lib\site-packages\pandas\core\groupby\generic.py in _aggregate_named(self, func, *args, **kwargs)
907 output = func(group, *args, **kwargs)
908 if isinstance(output, (Series, Index, np.ndarray)):
--> 909 raise Exception('Must produce aggregated value')
910 result[name] = self._try_cast(output, group)
911
Exception: Must produce aggregated value
It looks like the order in which groups are processed affects pandas' error checking: if the first result (GroupBy row) returns a numpy.ndarray, the above exception is thrown, but if the first result returns a str/scalar, processing continues and further such exceptions are suppressed.
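For context, pandas.Series.mode itself always returns a Series, never a scalar; the mixed scalar/array results above presumably come from groupby unwrapping length-1 outputs. A minimal sketch of mode's return shape:

```python
import pandas as pd

# pd.Series.mode always returns a Series; its length depends on ties.
multi = pd.Series(['A', 'B']).mode()    # 'A' and 'B' tie -> length-2 Series
single = pd.Series(['C', 'C']).mode()   # unique mode 'C' -> length-1 Series

print(list(multi))   # ['A', 'B']
print(list(single))  # ['C']
```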
In my opinion, this may be related to #2656 or one of its root causes ("fast apply vs. old pathway", as they put it), and to #24016 (which fails to accept numpy.ndarray as a return value). However, in our case, where pandas.Series.mode sometimes returns a scalar and sometimes a numpy.ndarray, the behavior is more elusive, more confusing, and therefore inconsistent (I spent ~2 hours debugging this, trying to understand why so many of my agg function calls work but only one doesn't).
Expected Output
    1
0
20  [A, B]
30  C
Output of pd.show_versions()
Comment From: WillAyd
Yea, I see where this is confusing, but generally we don't support agg reducing to anything but a scalar, so I would think the first example returning something is pure coincidence and, if anything, the second example would be what we should be doing here.
Others may disagree though so let's see. FWIW apply is a more general function so you'd be safer to do something like:
df2.groupby(0)[1].apply(lambda x: list(x.mode()))
Comment From: j-musial
As explained in https://github.com/pandas-dev/pandas/issues/19254, df2.groupby(0)[1].apply(lambda x: list(x.mode())) is really slow, so it would be beneficial to add a separate groupby.mode() function implemented in Cython. Such functionality is very often used for categorical data, so I am surprised that it still has not been implemented in pandas.
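Until such a Cython implementation exists, one vectorized workaround (a sketch; it picks a single mode per group and breaks ties arbitrarily) is to count (group, value) pairs once with groupby/size instead of calling mode inside a per-group Python lambda:

```python
import pandas as pd

df = pd.DataFrame({'group': ['a', 'a', 'b', 'b', 'b'],
                   'value': [1, 1, 2, 3, 3]})

# Count each (group, value) pair once, then keep the most frequent
# value per group. Ties are broken arbitrarily by sort order.
counts = df.groupby(['group', 'value']).size().reset_index(name='n')
modes = (counts.sort_values('n', ascending=False)
               .drop_duplicates('group')
               .set_index('group')['value'])

print(modes.sort_index())  # a -> 1, b -> 3
```

This stays entirely in vectorized pandas operations, so it avoids the per-group Python-level overhead that makes the apply-based version slow on many groups.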
Comment From: mroeschke
This looks to work on master now (albeit agreed that agg should just be a reducer). Could use a test
In [9]: df2 = pandas.DataFrame([[20,'A'],[20,'B'],[30,'C']])
...: gb2 = df2.groupby(0).agg(pandas.Series.mode)
In [10]: gb2
Out[10]:
1
0
20 [A, B]
30 C
Comment From: jointfull
So did someone fix this, or is it just another 'accidental' example of something that works but, given a different example, may fail again?
Comment From: jacobdadams
I am still seeing this behavior in 1.4.1.
If the first groupby result returns a single mode, a subsequent groupby result that contains multiple modes will insert a list of the multiple modes. However, if the first groupby result would return multiple modes, it raises ValueError: Must produce aggregated value, similar to the original exception.
In [2]: test_df = pd.DataFrame({
   ...:     'value': [1, 1, 2, 3],
   ...:     'group': ['a', 'a', 'b', 'b'],
   ...: })
In [4]: gb = test_df.groupby('group')
In [5]: mode_series = gb['value'].agg(pd.Series.mode)
In [6]: mode_series
Out[6]:
group
a 1
b [2, 3]
Name: value, dtype: object
In [7]: test2_df = pd.DataFrame({
   ...:     'value': [1, 2, 3, 3],
   ...:     'group': ['a', 'a', 'b', 'b'],
   ...: })
In [8]: gb2 = test2_df.groupby('group')
In [9]: mode_series2 = gb2['value'].agg(pd.Series.mode)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [9], in <cell line: 1>()
----> 1 mode_series2 = gb2['value'].agg(pd.Series.mode)
# <snip>
File ~\Miniconda3\envs\pandas\lib\site-packages\pandas\core\groupby\ops.py:1010, in BaseGrouper._aggregate_series_pure_python(self, obj, func)
1006 res = libreduction.extract_result(res)
1008 if not initialized:
1009 # We only do this validation on the first iteration
-> 1010 libreduction.check_result_array(res, group.dtype)
1011 initialized = True
1013 counts[i] = group.shape[0]
File ~\Miniconda3\envs\pandas\lib\site-packages\pandas\_libs\reduction.pyx:13, in pandas._libs.reduction.check_result_array()
File ~\Miniconda3\envs\pandas\lib\site-packages\pandas\_libs\reduction.pyx:21, in pandas._libs.reduction.check_result_array()
ValueError: Must produce aggregated value
pd.show_versions()
INSTALLED VERSIONS
------------------
commit           : 06d230151e6f18fdb8139d09abf539867a8cd481
python           : 3.10.0.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
Version          : 10.0.19044
machine          : AMD64
processor        : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : English_United States.1252
pandas           : 1.4.1
numpy            : 1.21.5
pytz             : 2021.3
dateutil         : 2.8.2
pip              : 21.2.4
setuptools       : 58.0.4
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : 8.1.1
pandas_datareader: None
bs4              : None
bottleneck       : 1.3.2
fastparquet      : None
fsspec           : None
gcsfs            : None
matplotlib       : None
numba            : None
numexpr          : 2.8.1
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
zstandard        : None
Comment From: jreback
this is an open issue
community patches are how pandas is updated
Comment From: nbarone1
take
Comment From: nbarone1
What is the structure supposed to be for the result?
Comment From: srotondo
take
Comment From: amanlai
FWIW, this issue still persists as of pandas 1.5.3. Here's an example that reproduces the issue:
df1 = pd.DataFrame({'grouper': ['a', 'a', 'a', 'b', 'b'], 'value': [1, 0, 0, 0, 1]})
df2 = pd.DataFrame({'grouper': ['b', 'b', 'b', 'a', 'a'], 'value': [1, 0, 0, 0, 1]})
print(df1.assign(grouper=df1['grouper'].map({'a':'b', 'b':'a'})).equals(df2)) # True
print(df1.groupby('grouper')['value'].agg(pd.Series.mode)) # OK
print(df2.groupby('grouper')['value'].agg(pd.Series.mode)) # ValueError: Must produce aggregated value
It works fine with groupby.apply(), however.
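For completeness, a version-independent workaround (a sketch) is to go through apply and always return a list, so every group produces the same kind of object regardless of processing order:

```python
import pandas as pd

df2 = pd.DataFrame({'grouper': ['b', 'b', 'b', 'a', 'a'],
                    'value': [1, 0, 0, 0, 1]})

# apply tolerates non-scalar returns; returning a list for every group
# sidesteps the scalar-vs-ndarray inconsistency entirely.
result = df2.groupby('grouper')['value'].apply(lambda s: s.mode().tolist())

print(result)  # a -> [0, 1], b -> [0]
```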