Pandas API: inconsistent return format of groupby apply

related issues:
- [x] #13121
- [x] #5839
- [x] #9867
- [x] #13255
- [x] #20066
- [ ] #14927 (about detecting mutating functions)

Thank you very much for providing us with pandas, which is a really good package!

Several times when using pandas I have gotten an inconsistent return format, which eventually broke my code.

For example:

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10,1), columns = ['x'])
df['y'] = 1
df.groupby('y').apply(lambda x: x.x)

#x        0         1         2         3         4         5         6  \
#y                                                                        
#1 -1.12114  0.679616 -1.392863  0.032637 -0.051134  0.594201 -0.238833   

#x        7        8         9  
#y                              
#1  0.95173  1.07469 -0.062198  

df2 = df.copy()
df2['y'] = np.random.randint(1,3,10)
# do something similar to the above
df2.groupby('y').apply(lambda x: x.x)
#y   
#1  3    0.032637
#   8    1.074690
#2  0   -1.121140
#   1    0.679616
#   2   -1.392863
#   4   -0.051134
#   5    0.594201
#   6   -0.238833
#   7    0.951730
#   9   -0.062198
#Name: x, dtype: float64

Even though I know how to avoid this issue, I still believe it would be better to return the same format every time. It's really annoying when the return format is inconsistent, because it forces me to go back and use awkward workarounds.

So, if possible, I suggest that the call above return only one kind of format.

Comment From: jreback

You are trying to do too much inside an apply, which is completely non-performant. .apply has to guess your return output. I suppose this is slightly confusing; in any event, you should simply be doing this:

In [14]: df.groupby('y').x.apply(lambda x: x)
Out[14]: 
0    2.481148
1   -1.324170
2    0.783518
3    0.869827
4   -0.080157
5    0.071685
6    0.987246
7    0.099149
8   -0.159449
9   -0.383200
Name: x, dtype: float64

In [15]: df2.groupby('y').x.apply(lambda x: x)
Out[15]: 
0    2.481148
1   -1.324170
2    0.783518
3    0.869827
4   -0.080157
5    0.071685
6    0.987246
7    0.099149
8   -0.159449
9   -0.383200
Name: x, dtype: float64

I'll mark it as an issue if you'd like to investigate whether this could be made consistent w/o breaking anything else (might not be possible).

Comment From: jreback

This was changed by https://github.com/pydata/pandas/pull/3239

A sort of special case was added, I think.

Comment From: edfall

Yes, I think the approach you (@jreback) use does solve the problem of the inconsistent return format. However, sometimes it's important to keep the groupby variable as an index label, which your method fails to do.

What's more, the function passed to .apply can become more complicated as the goal and the data change.

Comment From: jreback

@edfall and so how would you reconcile this issue?

I don't think you can infer what format the user wants and meet all users expectations.

Comment From: edfall

@jreback

Of course, it's really difficult to satisfy all users, but the return format of any specific operation should always stay the same. I think the result should at least be something like:

 df.groupby('y').apply(lambda x: x.x)
#y
#1 0    2.481148
#  1   -1.324170
#  2    0.783518
#  3    0.869827
#  4   -0.080157
#  5    0.071685
#  6    0.987246
#  7    0.099149
#  8   -0.159449
#  9   -0.383200

Maybe it is because I have used the R package data.table, which can do similar things to '.groupby().apply', but the results of all its operations are predictable.

Comment From: xappppp

It seems the trick is not the groupby-apply itself, but what the lambda function returns. If you pass a well-defined function into apply() that returns its result in the format you want, that will solve the problem.

data.table in R does not seem to have this issue, though.

Comment From: vss888

@xappppp : What should the function return to avoid the problem? For me, it normally returns either a Series or a DataFrame and I want all the outputs to be concatenated in the end (for any number of groups, including one).
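One way to get the concatenation described here without depending on how apply infers the output shape is to iterate over the groups and call pd.concat directly. The following is only a minimal sketch, not something proposed in this thread; the helper name `apply_and_concat` is hypothetical and not part of pandas.

```python
import numpy as np
import pandas as pd


def apply_and_concat(df, key, func):
    """Apply func to each group and stack the results vertically.

    Hypothetical helper, not part of pandas. The group key always
    becomes the outer index level, whether there is one group or many.
    """
    pieces = {name: func(group) for name, group in df.groupby(key)}
    return pd.concat(pieces, names=[key])


rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.standard_normal(10), "y": 1})  # a single group
df2 = df.assign(y=[1, 2] * 5)                               # two groups

one_group = apply_and_concat(df, "y", lambda g: g["x"])
two_groups = apply_and_concat(df2, "y", lambda g: g["x"])

# Both results are Series with a two-level index (y, original row label).
print(one_group.index.nlevels, two_groups.index.nlevels)
```

This gives up whatever fast paths apply may have, so it is probably best suited to cases where a predictable shape matters more than speed.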

Comment From: jorisvandenbossche

Also related to this: https://github.com/pandas-dev/pandas/issues/14927 (about detecting mutating functions)

Comment From: jorisvandenbossche

Here is a short notebook with an overview of the behaviours: http://nbviewer.jupyter.org/gist/jorisvandenbossche/16f511fe111a8b9fa0eac74e920c5251

Some things to note:

- We have a different code path depending on whether the original group (mutated or not) is returned (related to #14927).
  - If the original object is returned, a "transform" behaviour is used (keep original index order, don't add groupby key to the index)
  - I would personally deprecate this, and point users to groupby().transform(..)
- Then for the "normal" apply behaviour, we have:
  - In general, the specification is:
    - The func gets the full group (including the key column)
    - The results of func(group) are stacked, the groupby key is added to the Index (creating a MultiIndex)
  - Some specific things to note about how things are stacked:
    - If it is a Series, we make a distinction between:
      - resulting Series that for each group has the same length as the index: stack vertically
      - resulting Series has fixed size (or different to group size): stack horizontally
    - scalars result in a final Series

I think in general the second bullet point above makes sense. Only the Series differentiation is a bit strange. Do we want to keep this?
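The suggestion above to point users to groupby().transform(..) for same-shaped results can be sketched as follows. This is only an illustration on made-up data (the frame and the demeaning function are assumptions, not taken from the notebook): transform returns an object indexed like the input, so the shape does not depend on the number of groups.

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [1, 2, 1, 2]})

# Subtract each group's mean; the result keeps the original row order
# and index, and the group key is not added to the index.
demeaned = df.groupby("y")["x"].transform(lambda s: s - s.mean())

# The same call with a single group yields the same kind of object.
df_single = df.assign(y=1)
demeaned_single = df_single.groupby("y")["x"].transform(lambda s: s - s.mean())

print(demeaned.index.equals(df.index), demeaned_single.index.equals(df.index))
```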

Additional note:

Why do we have both as_index=True/False and group_keys=True/False ?
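For readers unfamiliar with the two options, here is a small sketch of how they differ as far as I understand the documented behaviour: as_index affects aggregation output, while group_keys affects whether apply prepends the key to the result index. The data is made up, and details may vary between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [1, 2, 1, 2]})

# as_index=False: the aggregation keeps the key as a regular column.
agg_cols = df.groupby("y", as_index=False)["x"].sum()

# group_keys: whether apply adds the key as an outer index level when
# the function returns a piece of each group (here, its first row).
first_with_key = df.groupby("y", group_keys=True)["x"].apply(lambda s: s.head(1))
first_no_key = df.groupby("y", group_keys=False)["x"].apply(lambda s: s.head(1))

print(list(agg_cols.columns))
print(first_with_key.index.nlevels, first_no_key.index.nlevels)
```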

cc @WillAyd

(the above is only for functions

Comment From: fred-xue

I ran into the same issue. Need a consistent behavior for this.

Comment From: ratulbhadury

Also hitting a problem because of the inconsistent shape of what apply returns. +1 for fix please!

Comment From: liushapku

I get the same problem. It is very annoying that every time I call groupby(...).apply(), I need to check how many groups there are in order to further process the result. Please fix.

Comment From: TomAugspurger

@liushapku are you interested in working on it?

Comment From: TomAugspurger

This is being reverted in #35306 and will be re-fixed in #34998 in pandas 1.2.

Comment From: MoritzLaurer

I'm still encountering this issue of inconsistent return formats as mentioned in https://github.com/pandas-dev/pandas/issues/5839 with pandas 1.4.4 and 1.5

Comment From: Polaris000

Same here