Related issues:
- [x] #13121
- [x] #5839
- [x] #9867
- [x] #13255
- [x] #20066
- [ ] #14927 (about detecting mutating functions)
Thank you very much for providing pandas, which is one of the best packages I have used!
Several times when using pandas I have gotten an inconsistent return format, which eventually broke my code.
- for example:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10,1), columns = ['x'])
df['y'] = 1
df.groupby('y').apply(lambda x: x.x)
#x 0 1 2 3 4 5 6 \
#y
#1 -1.12114 0.679616 -1.392863 0.032637 -0.051134 0.594201 -0.238833
#x 7 8 9
#y
#1 0.95173 1.07469 -0.062198
df2 = df.copy()
df2['y'] = np.random.randint(1,3,10)
# same operation as above
df2.groupby('y').apply(lambda x: x.x)
#y
#1 3 0.032637
# 8 1.074690
#2 0 -1.121140
# 1 0.679616
# 2 -1.392863
# 4 -0.051134
# 5 0.594201
# 6 -0.238833
# 7 0.951730
# 9 -0.062198
#Name: x, dtype: float64
Even though I know how to avoid this issue, I still believe it is better to return the same format. It's really annoying when an inconsistent return format forces me to go back and use awkward workarounds.
So, if possible, I suggest that the above function return only one kind of format.
Comment From: jreback
You are trying to do too much inside an apply, which is completely non-performant. .apply has to guess your return output. I suppose this is slightly confusing; in any event, you should simply be doing this:
In [14]: df.groupby('y').x.apply(lambda x: x)
Out[14]:
0 2.481148
1 -1.324170
2 0.783518
3 0.869827
4 -0.080157
5 0.071685
6 0.987246
7 0.099149
8 -0.159449
9 -0.383200
Name: x, dtype: float64
In [15]: df2.groupby('y').x.apply(lambda x: x)
Out[15]:
0 2.481148
1 -1.324170
2 0.783518
3 0.869827
4 -0.080157
5 0.071685
6 0.987246
7 0.099149
8 -0.159449
9 -0.383200
Name: x, dtype: float64
I'll mark it as an issue if you'd like to investigate whether this could be made consistent w/o breaking anything else (might not be possible).
Comment From: jreback
this was changed by this: https://github.com/pydata/pandas/pull/3239
sort of a special case was added I think.
Comment From: edfall
Yep, I think your approach (@jreback) does solve the problem of the inconsistent return format. However, sometimes it's important to keep the groupby variables as a label, which your method fails to do.
What's more, the function passed to .apply can become more complicated as the aim and the data change.
Comment From: jreback
@edfall and so how would you reconcile this issue?
I don't think you can infer what format the user wants and meet all users' expectations.
Comment From: edfall
@jreback
Of course, it's really difficult to satisfy all users, but the return format of any specific operation should always stay the same. I think the result should at least be something like:
df.groupby('y').apply(lambda x: x.x)
#y
#1 0 2.481148
# 1 -1.324170
# 2 0.783518
# 3 0.869827
# 4 -0.080157
# 5 0.071685
# 6 0.987246
# 7 0.099149
# 8 -0.159449
# 9 -0.383200
Maybe this is because I have used the R package data.table, which can do similar things to '.groupby().apply', but the results of all its operations are predictable.
Comment From: xappppp
It seems the trick lies not in the groupby-apply itself but in what the lambda function returns. If you pass a well-defined function to apply() that returns its result in the format you want, that will solve the problem.
data.table in R does not seem to have this issue, though.
Comment From: vss888
@xappppp : What should the function return to avoid the problem? For me, it normally returns either a Series or a DataFrame and I want all the outputs to be concatenated in the end (for any number of groups, including one).
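One way to get the "always concatenate" behaviour asked about here is to skip apply and stack the per-group results by hand. The following is only a minimal sketch: `apply_and_concat` is a hypothetical helper name, not a pandas function, and it assumes the grouping key is a single column and that `func` returns a Series or DataFrame for every group.

```python
import numpy as np
import pandas as pd


def apply_and_concat(df, key, func):
    """Hypothetical helper: call func on each group and stack the results.

    The group key always becomes the outer index level, regardless of how
    many groups there are.
    """
    # Build {group key: result of func(group)} by iterating over the groups.
    pieces = {k: func(group) for k, group in df.groupby(key)}
    # Concatenating a dict uses its keys as the outer index level.
    return pd.concat(pieces, names=[key])


rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.standard_normal(10), "y": 1})  # a single group
df2 = df.assign(y=[1, 2] * 5)                               # two groups

one_group = apply_and_concat(df, "y", lambda g: g["x"])
two_groups = apply_and_concat(df2, "y", lambda g: g["x"])

# Both results are Series with a two-level index (y, original row label).
print(one_group.index.nlevels, two_groups.index.nlevels)
```

Because no shape inference is involved, the one-group and two-group cases come back in the same layout. An empty frame would need separate handling, since pd.concat raises when it is given nothing to concatenate.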
Comment From: jorisvandenbossche
Also related to this: https://github.com/pandas-dev/pandas/issues/14927 (about detecting mutating functions)
Comment From: jorisvandenbossche
Here is a short notebook with an overview of the behaviours: http://nbviewer.jupyter.org/gist/jorisvandenbossche/16f511fe111a8b9fa0eac74e920c5251
Some things to note:
- We have the different code path depending on whether the original group (mutated or not) is returned (related to #14927 ).
  - If the original object is returned, a "transform" behaviour is used (keep original index order, don't add groupby key to the index)
  - I would personally deprecate this, and point users to groupby().transform(..)
- Then for the "normal" apply behaviour, we have:
  - In general, specification is:
    - The func gets the full group (including the key column)
    - The results of func(group) are stacked, the groupby key is added to the Index (creating a MultiIndex)
  - Some specific things to note how things are stacked:
    - If it is a Series, we make a distinction between:
      - resulting Series that for each group has the same length as the index: stack vertically
      - resulting Series has fixed size (or different to group size): stack horizontally
    - scalars result in a final Series
I think in general the second bullet point above makes sense. Only the Series differentiation is a bit strange. Do we want to keep this?
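As a rough illustration of the groupby().transform(..) route mentioned in the first bullet, the sketch below uses made-up data and an arbitrary demeaning function. With transform, the result keeps the index of the original frame whether there is one group or several.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.standard_normal(10), "y": 1})  # a single group
df2 = df.assign(y=[1, 2] * 5)                               # two groups

# Subtract the group mean from every value; the output has one entry per
# input row and no group key is added to the index.
demeaned_one = df.groupby("y")["x"].transform(lambda s: s - s.mean())
demeaned_two = df2.groupby("y")["x"].transform(lambda s: s - s.mean())

print(demeaned_one.index.equals(df.index))
print(demeaned_two.index.equals(df2.index))
```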
Additional note:
- Why do we have both as_index=True/False and group_keys=True/False?
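For readers unfamiliar with the two options, here is a small sketch of what each one affects, using toy data that is not from this thread. It assumes a reasonably recent pandas; the exact behaviour of group_keys for functions that return the whole group unchanged has differed between versions, so the example uses a function that returns only part of each group.

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [1, 1, 2, 2]})

# as_index affects aggregations: is the key the index, or a regular column?
agg_indexed = df.groupby("y").sum()               # 'y' becomes the index
agg_flat = df.groupby("y", as_index=False).sum()  # 'y' stays a column

# group_keys affects apply: is the key prepended to the result index?
first_keyed = df.groupby("y", group_keys=True)["x"].apply(lambda s: s.head(1))
first_plain = df.groupby("y", group_keys=False)["x"].apply(lambda s: s.head(1))

print(agg_flat.columns.tolist())
print(first_keyed.index.nlevels, first_plain.index.nlevels)
```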
cc @WillAyd
(the above is only for functions
Comment From: fred-xue
I ran into the same issue. Need a consistent behavior for this.
Comment From: ratulbhadury
Also hitting a problem because of the inconsistent shape of what apply returns. +1 for fix please!
Comment From: liushapku
I get the same problem. It is very annoying that every time I call groupby(...).apply(), I need to check how many groups there are in order to further process the result. Please fix.
Comment From: TomAugspurger
@liushapku are you interested in working on it?
Comment From: TomAugspurger
This is being reverted in #35306 and will be re-fixed in #34998 in pandas 1.2.
Comment From: MoritzLaurer
I'm still encountering this issue of inconsistent return formats as mentioned in https://github.com/pandas-dev/pandas/issues/5839 with pandas 1.4.4 and 1.5
Comment From: Polaris000
Same here