Should we deprecate using as_index in groupby? As the docs say, using as_index=False should only impact reductions. Here, my expectation is that a line such as
df.groupby(keys, as_index=False).op(...)
should always be equivalent to
df.groupby(keys, as_index=True).op(...).reset_index()
It seems quite easy for the user to do so if they desire, and this would simplify the groupby internals.
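To make the expected equivalence concrete, here is a minimal sketch for a single reduction. The frame, the column names key and val, and the choice of sum as the op are made up for illustration; it only shows the case I expect to hold, not every op.

```python
import pandas as pd

# Hypothetical data: one grouping column and one value column.
df = pd.DataFrame({"key": ["a", "b", "a", "b"], "val": [1, 2, 3, 4]})

# Reduction with the groups returned as a regular column.
with_flag = df.groupby("key", as_index=False).sum()

# Same reduction with the groups as the index, then moved back to a column.
with_reset = df.groupby("key", as_index=True).sum().reset_index()

# Expectation from this issue: both spellings give the same frame.
equivalent = with_flag.equals(with_reset)
print(equivalent)
```

If the two ever differ for a reduction, that would be the kind of inconsistency this issue is about.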
Comment From: rhshadrach
cc @jbrockmendel, @WillAyd, @jreback Any thoughts here?
Comment From: jbrockmendel
this would simplify the groupby internals
Do you have a qualitative estimate of how much simplification we'd get? I'm definitely open to the idea in principle.
Comment From: jreback
In principle this makes groupby simpler, but then again it makes for slightly more verbose syntax, and to be honest I wish we wouldn't actually set the index and would just make it a column.
So either way this is a rather big change.
So -0 on deprecating.
Comment From: rhshadrach
@jbrockmendel Not sure what kind of qualitative estimate one can give here; here are some that I think are on the more quantitative side. I count 13 instances of branching within core.groupby that depend solely on as_index; not much code itself (27 lines). Also, the method _insert_inaxis_grouper_inplace could be removed. Of course, as_index doubles the testing when it comes to reduction methods, plus the additional testing that it does not impact non-reductions. There is also the ambiguity on what the behavior should be when it comes to a UDF and a Series/DataFrame with unique groups (is it a reduction or a transform?), but this will continue to impact other behaviors of groupby and so I don't think it should really be considered.
@jreback If we were to leave the groups as columns, I think that would necessitate the index being a RangeIndex, is that right? I personally utilize meaningful indexes in my workflows and lean heavily on pandas's alignment when combining results from various computations, which is why I find as_index=True behavior most useful. But of course I understand if others' workflows/uses are different.
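As a rough illustration of the alignment workflow I mean, here is a small sketch. The two frames and the names region and amount are made up; the point is only that results indexed by group key line up by label when combined.

```python
import pandas as pd

# Hypothetical inputs whose groups appear in different orders.
sales = pd.DataFrame({"region": ["east", "west", "east"], "amount": [10.0, 20.0, 30.0]})
costs = pd.DataFrame({"region": ["west", "east"], "amount": [5.0, 8.0]})

# Default as_index=True: each result is a Series indexed by region.
total_sales = sales.groupby("region")["amount"].sum()
total_costs = costs.groupby("region")["amount"].sum()

# Subtraction aligns on the region labels, not on row position.
profit = total_sales - total_costs
print(profit)
```

With the groups kept as a column on a RangeIndex, the same subtraction would pair rows by position, so an explicit merge on region would be needed first.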
Comment From: jorisvandenbossche
I agree that as_index=False and .reset_index() are basically the same, and thus it's not really needed to provide both ways.
But on the other hand, I think that as_index=False is also quite massively used. So I am personally not sure it is worth disrupting their workflow and having all those users need to update their code for something relatively minor.
I mean, from a user point of view, it's not that we consider it "harmful" or bad practice code, I think, so for me that gives less reason to actually deprecate.
Comment From: rhshadrach
But on the other hand, I think that as_index=False is also quite massively used.
Thanks @jorisvandenbossche. Working on some of the as_index=False bugs, I got the sense that perhaps it isn't used all that often, and this was especially the case when I saw some others express that it should be removed in an issue. @jreback's and your responses, along with some searching on SO, have me convinced otherwise.
Going to close this issue in a week if there is no further discussion.
Comment From: WillAyd
Reopening as I think this is worthwhile to do.
Comment From: jbrockmendel
+1. The groupby tests are made really cumbersome by the combination of options.