Code Sample, a copy-pastable example if possible
import pandas as pd

# Case 1: no duplicate values of 'a' within either group of 'b'
a = pd.DataFrame({'a': [1,2,3,1,2,3], 'b': [2,1,2,1,2,1]})
print(a['a'].groupby(a['b']).apply(lambda x: x.drop_duplicates()))
# Case 2: the group b == 2 contains a duplicate value of 'a'
a = pd.DataFrame({'a': [1,2,1,1,2,3], 'b': [2,1,2,1,2,1]})
print(a['a'].groupby(a['b']).apply(lambda x: x.drop_duplicates()))
Output:
0    1
1    2
2    3
3    1
4    2
5    3
Name: a, dtype: int64
b
1  1    2
   3    1
   5    3
2  0    1
   4    2
Name: a, dtype: int64
Problem description
The two outputs have different schemas even though only the data changed. I expected both to have the same number of index levels. The first output should not "reset index" and completely lose the groupby context.
Comment From: WillAyd
Haven't looked in detail but my guess is that the former infers a transform and aligns back to the original index since there are no duplicates. The second infers a reduction of some sort and subsequently uses the grouper's values to piece the result back together.
Inference on apply here is kind of tricky, but if you see a way to make it work universally, we would take a PR.
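To make the guess above concrete, here is a minimal sketch of the kind of check it describes: whether the applied function returns, for every group, an object with the same index as the group itself. This is only an illustration, not pandas's actual implementation, and `looks_transform_like` is a hypothetical helper name.

```python
import pandas as pd


def looks_transform_like(series, by, func):
    """Hypothetical helper: True if func keeps every group's index unchanged."""
    return all(
        func(group).index.equals(group.index)
        for _, group in series.groupby(by)
    )


no_dupes = pd.DataFrame({"a": [1, 2, 3, 1, 2, 3], "b": [2, 1, 2, 1, 2, 1]})
with_dupes = pd.DataFrame({"a": [1, 2, 1, 1, 2, 3], "b": [2, 1, 2, 1, 2, 1]})

# Same function, different data: only the first case keeps every row.
first = looks_transform_like(no_dupes["a"], no_dupes["b"], lambda x: x.drop_duplicates())
second = looks_transform_like(with_dupes["a"], with_dupes["b"], lambda x: x.drop_duplicates())
print(first, second)
```

The same lambda is classified differently depending only on whether a group happens to contain duplicates, which is consistent with the data-dependent schema reported in this issue.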
Comment From: simonjayhawkins
On main, the first case now issues a warning:
/tmp/ipykernel_14904/126178882.py:1: FutureWarning: Not prepending group keys to the result index of transform-like apply. In the future, the group keys will be included in the index, regardless of whether the applied function returns a like-indexed object.
To preserve the previous behavior, use
>>> .groupby(..., group_keys=False)
To adopt the future behavior and silence this warning, use
>>> .groupby(..., group_keys=True)
print(grp.apply(lambda x: x.drop_duplicates()))
Regarding the request that the first output should not "reset index" and completely lose the groupby context: this can now be achieved with group_keys=True, which will be the default behavior in the future.
import pandas as pd
pd.__version__
# '1.5.0.dev0+867.gdf8acf4201'
a = pd.DataFrame({"a": [1, 2, 3, 1, 2, 3], "b": [2, 1, 2, 1, 2, 1]})
print(a["a"].groupby(a["b"], group_keys=True).apply(lambda x: x.drop_duplicates()))
# b
# 1  1    2
#    3    1
#    5    3
# 2  0    1
#    2    3
#    4    2
# Name: a, dtype: int64
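As a quick check that this addresses the original report, the sketch below runs both datasets from the issue through the same call and compares the number of index levels. It assumes pandas 1.5 or later, where an explicit group_keys=True is honored for transform-like results; `dedupe_per_group` is a hypothetical helper name used only for illustration.

```python
import pandas as pd


def dedupe_per_group(frame):
    # Explicit group_keys=True asks pandas to prepend the group key to the
    # result index whether or not the function returns a like-indexed object.
    grouped = frame["a"].groupby(frame["b"], group_keys=True)
    return grouped.apply(lambda x: x.drop_duplicates())


no_dupes = pd.DataFrame({"a": [1, 2, 3, 1, 2, 3], "b": [2, 1, 2, 1, 2, 1]})
with_dupes = pd.DataFrame({"a": [1, 2, 1, 1, 2, 3], "b": [2, 1, 2, 1, 2, 1]})

r1 = dedupe_per_group(no_dupes)
r2 = dedupe_per_group(with_dupes)
print(r1.index.nlevels, r2.index.nlevels)

# If the original row labels are wanted instead, drop the group level.
flat = r1.droplevel("b").sort_index()
print(flat)
```

If the flat, original-index form is preferred, dropping the "b" level afterwards should give it back without relying on apply's inference.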