• [x] I have checked that this issue has not already been reported.

• [x] I have confirmed this bug exists on the latest version of pandas.

• [x] (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 1, 2], "b": [2, 1, 1, 2]})
g = df.groupby("a")

for aggfunc in ["unique", pd.unique, pd.Series.unique]:
    try:
        print(g.agg({"b": aggfunc}))
    except ValueError as e:
        print(f"{aggfunc!r} failed with {e!r}")

Running this script produces

        b
a        
1  [2, 1]
2  [1, 2]
<function unique at 0x7f6885cc7310> failed with ValueError('Must produce aggregated value')
<function Series.unique at 0x7f688369f8b0> failed with ValueError('Must produce aggregated value')

Problem description

When the string "unique" is passed as the aggfunc, it is interpreted internally as pd.Series.unique. I would therefore expect identical results from "unique", pd.Series.unique, and pd.unique, but only the string form succeeds; the two callables raise ValueError('Must produce aggregated value').

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : f904213ffcc36d3ae29c37ae28fbb1844a0769ab
python : 3.9.1.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.11-arch1-1
Version : #1 SMP PREEMPT Wed, 27 Jan 2021 13:53:16 +0000
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.0.dev0+772.gf904213ffc
numpy : 1.20.1
pytz : 2021.1
dateutil : 2.8.1
pip : 20.2.3
setuptools : 49.2.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

Comment From: topper-123

Yes, I think the groupby with "unique" should also fail, because it doesn't aggregate.

Comment From: rhshadrach

I think this can be a useful op. I am in favor of treating the result of any provided function as if it were an aggregation - i.e. treat the result as if it were a scalar rather than try to infer if it is a scalar or not.

Related: #51445, #35725 (and others)
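
To make the "treat the result as a scalar" idea concrete with the current API, here is a minimal sketch, not a proposed implementation. It assumes that returning a plain Python list (rather than an ndarray) is enough for agg to accept the value as one opaque result per group, as with the common agg(list) idiom. The helper name unique_as_list is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 1, 2], "b": [2, 1, 1, 2]})
g = df.groupby("a")


def unique_as_list(s: pd.Series) -> list:
    # Hypothetical helper: hand back the unique values as a plain list so the
    # result is stored as a single object per group instead of an ndarray.
    return s.unique().tolist()


# One entry per group, indexed by the group keys of "a".
result = g["b"].agg(unique_as_list)
print(result)
```

The result is uniquely indexed by the groups, which is the property under discussion; whether agg should accept the ndarray directly is the open question.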

Comment From: topper-123

I expected it to give the grouped version of DataFrame.agg("unique"), i.e. in this case I'd expect the result to be:

   b
a
1  2
1  1
2  1
2  2

rather than:

a
1    [2, 1]
2    [1, 2]
Name: b, dtype: object

I.e., should the grouped op behave differently from the ungrouped one?
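
For comparison, the two shapes above are related by an explode step. The following is only a sketch to make the difference concrete, assuming the array-valued result of SeriesGroupBy.unique as the starting point; it does not suggest what agg itself should return.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 1, 2], "b": [2, 1, 1, 2]})

# One row per group; each cell holds that group's unique values.
per_group = df.groupby("a")["b"].unique()

# One row per (group, unique value); the group key repeats in the index.
per_value = per_group.explode()
print(per_value)
```

Here per_group is uniquely indexed by the group keys, while per_value repeats each key once per unique value.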

Comment From: rhshadrach

I think first and foremost, the result of agg with as_index=True should always be uniquely indexed by the groups. I think that's what it means for an operation to be an aggregation.

If we are to assume that (I'm still open to arguments to the contrary), then it seems to me that

a
1    [2, 1]
2    [1, 2]
Name: b, dtype: object

is a reasonable and useful result.

An interesting complication is that df.agg("unique") doesn't act on a column-by-column basis. Under the hood, we're calling np.unique with the default axis=None. pandas typically defaults to acting column-by-column, whereas NumPy defaults to acting across all axes. So to get the "expected pandas default" (or at least what I think it should be), you'd need to pass axis=1. However, you can't currently pass axis=1 to the NumPy function because agg itself has an axis argument (xref: https://github.com/pandas-dev/pandas/issues/40112#issuecomment-1445240531). If you could, you'd get back a 2D NumPy array instead. All of this is making me wonder whether we really should be passing things like this through to NumPy.
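
The NumPy semantics described above can be checked directly on the underlying array, without going through agg. This is a small sketch using the frame from the original report; the per-column dict is just one illustrative way to spell "column-by-column", not an existing pandas code path.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 1, 2], "b": [2, 1, 1, 2]})
arr = df.to_numpy()

# Default axis=None: the array is flattened before deduplicating.
flat = np.unique(arr)

# axis=1: whole columns are deduplicated and a 2D array comes back.
by_axis1 = np.unique(arr, axis=1)

# Column-by-column uniques, computed explicitly for each column.
per_column = {name: df[name].unique() for name in df.columns}

print(flat)
print(by_axis1.shape)
print(per_column)
```

With axis=None the column structure is lost entirely, and with axis=1 the result is a 2D array rather than one set of unique values per column.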