- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [x] (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 1, 2], "b": [2, 1, 1, 2]})
g = df.groupby("a")
for aggfunc in ["unique", pd.unique, pd.Series.unique]:
    try:
        print(g.agg({"b": aggfunc}))
    except ValueError as e:
        print(f"{aggfunc!r} failed with {e!r}")
Running this script produces
        b
a
1  [2, 1]
2  [1, 2]
<function unique at 0x7f6885cc7310> failed with ValueError('Must produce aggregated value')
<function Series.unique at 0x7f688369f8b0> failed with ValueError('Must produce aggregated value')
Problem description
When the string "unique" is passed as the aggfunc parameter, it is internally interpreted as pd.Series.unique. I would therefore expect identical results when passing "unique", pd.Series.unique, or pd.unique, but they are not identical: the string succeeds while the callables raise a ValueError.
Output of pd.show_versions()
Comment From: topper-123
Yes, I think the groupby with "unique" should fail also, because it doesn't aggregate.
Comment From: rhshadrach
I think this can be a useful op. I am in favor of treating the result of any provided function as if it were an aggregation, i.e. treating the result as if it were a scalar rather than trying to infer whether it is one.
Related: #51445, #35725 (and others)
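A small sketch of what "treat the result as a scalar" can look like from the user side today. This is a workaround, not the proposed change, and it assumes the aggregation check only rejects array-like results, so wrapping them in a plain list slips through:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 1, 2], "b": [2, 1, 1, 2]})
g = df.groupby("a")

# Returning a plain Python list per group is treated as a single (scalar-like)
# value, so the callable ends up behaving like the string "unique" path.
print(g.agg({"b": lambda s: list(pd.unique(s))}))
```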
Comment From: topper-123
I expected it to give the grouped version of DataFrame.agg("unique"), i.e. in this case I'd expect the result to be:
   b
a
1  2
1  1
2  1
2  2
rather than:
a
1    [2, 1]
2    [1, 2]
Name: b, dtype: object
I.e., is the grouping op expected to be different?
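For comparison, a sketch of one way to produce that exploded shape with current pandas, using the working string/method path plus Series.explode (not a claim about how agg should behave):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 1, 2], "b": [2, 1, 1, 2]})

# One array of uniques per group, then one row per unique value,
# which reproduces the repeated-group-key shape shown above.
print(df.groupby("a")["b"].unique().explode())
```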
Comment From: rhshadrach
I think first and foremost, the result of agg with as_index=True should always be uniquely indexed by the groups. I think that's what it means for an operation to be an aggregation.
If we are to assume that (I'm still open to arguments to the contrary), then it seems to me that
a
1    [2, 1]
2    [1, 2]
Name: b, dtype: object
is a reasonable and useful result.
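A minimal sketch checking the property described above, i.e. that the list-per-group result has exactly one row per group key:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 1, 2], "b": [2, 1, 1, 2]})

res = df.groupby("a").agg({"b": "unique"})
print(res)
# Exactly one row per group: the result is uniquely indexed by the group keys.
print(res.index.is_unique)  # True
```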
An interesting complication is that df.agg("unique") doesn't act on a column-by-column basis. Under the hood, we're calling np.unique with the default axis=None. pandas typically defaults to acting column-by-column whereas NumPy defaults to acting across all axes. So to get the "expected pandas default" (or, at least what I think it should be), you'd need to pass axis=1. However you can't currently pass axis=1 to the NumPy function because agg itself has an axis argument (xref: https://github.com/pandas-dev/pandas/issues/40112#issuecomment-1445240531). If you could, you'd get back a 2d NumPy array instead. All of this is making me wonder if we really should be passing things like this through to NumPy.
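A small illustration of the NumPy-default versus column-by-column distinction drawn above (this uses pd.unique per column purely for illustration; it is not what df.agg("unique") does internally):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 1, 2], "b": [2, 1, 1, 2]})

# NumPy's default (axis=None) flattens across all axes before taking uniques.
print(np.unique(df.to_numpy()))  # [1 2]

# Acting column-by-column, the usual pandas convention, keeps one set of
# uniques per column instead.
print({col: pd.unique(df[col]) for col in df})
```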