I was trying to document my experiences with the inconsistencies of DataFrame.groupby.apply
(see #22545), and one of them was the following:
import numpy as np
import pandas as pd

N = 5
df = pd.DataFrame(index=range(N), columns=['id', 'x', 'y', 'z'])
df.loc[:, ['x', 'y', 'z']] = np.arange(N*3).reshape(N, 3)
df['id'] = np.random.randint(0, 3, (N,)) + 10  # random group ids in {10, 11, 12}
df
#    id   x   y   z
# 0  11   0   1   2
# 1  10   3   4   5
# 2  10   6   7   8
# 3  12   9  10  11
# 4  12  12  13  14
Then, even though the result returned by the function is exactly the same, the following outputs are different:
df.groupby('id', as_index=True).apply(lambda gr: gr)
#    id   x   y   z
# 0  11   0   1   2
# 1  10   3   4   5
# 2  10   6   7   8
# 3  12   9  10  11
# 4  12  12  13  14
df.groupby('id', as_index=True).apply(lambda gr: gr.iloc[:10 ** 6])
#        id   x   y   z
# id
# 10 1   10   3   4   5
#    2   10   6   7   8
# 11 0   11   0   1   2
# 12 3   12   9  10  11
#    4   12  12  13  14
The first one just returns the original frame as-is, with no attempt to actually group the results the way the second output does. Furthermore, neither output should still contain the id column, which is now ambiguous between the index and the columns (e.g. if one wants to continue with another groupby after some further transformations).
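To illustrate the ambiguity, here is a minimal sketch that builds a frame shaped like the second output by hand. The fixed ids (instead of random ones) and the use of set_index instead of groupby.apply are my assumptions, chosen so the sketch does not depend on the apply behaviour in question. On recent pandas versions I would expect the follow-up groupby to raise a ValueError about 'id' being both an index level and a column label.

```python
import numpy as np
import pandas as pd

# Fixed ids (an assumption; the original example draws them at random).
df = pd.DataFrame(np.arange(15).reshape(5, 3), columns=['x', 'y', 'z'])
df.insert(0, 'id', [11, 10, 10, 12, 12])

# Build a frame shaped like the second output by hand: 'id' becomes the
# outer index level while also staying a column (drop=False).
both = df.set_index('id', append=True, drop=False).swaplevel(0, 1).sort_index()

# Grouping by the name 'id' is now ambiguous between index level and column.
try:
    both.groupby('id').size()
    message = None
except ValueError as err:
    message = str(err)

print(message)
```

If message comes back as None, the installed pandas version resolves the name silently, which would arguably be worse for anyone chaining another groupby.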
Desired output of both:
#         x   y   z
# id
# 10 1    3   4   5
#    2    6   7   8
# 11 0    0   1   2
# 12 3    9  10  11
#    4   12  13  14
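For reference, the desired layout can be produced without groupby.apply at all. This is only a sketch of the target shape, not a proposed fix, and it again assumes fixed ids matching the sample output above:

```python
import numpy as np
import pandas as pd

# Fixed ids matching the sample output (the original draws them at random).
df = pd.DataFrame(np.arange(15).reshape(5, 3), columns=['x', 'y', 'z'])
df.insert(0, 'id', [11, 10, 10, 12, 12])

# Move 'id' out of the columns and into the index as the outer level,
# keeping the original row label as the inner level.
desired = df.set_index('id', append=True).swaplevel(0, 1).sort_index()
print(desired)
```

Here 'id' exists only as an index level, so a later groupby('id') has a single unambiguous meaning.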
Comment From: gfyoung
That looks very weird indeed! Investigation and PR are welcome!
Comment From: smithto1
take
Comment From: smithto1
I think this is another issue that falls under #34998
Comment From: grsjar
I think I've encountered this bug: I had groupby().apply() and groupby().transform() statements, both "with no attempt to actually group the results like the second output." Does anyone know what is happening here?
Comment From: rhshadrach
Thanks @smithto1 - agreed. Closing.