Is your feature request related to a problem?
The default behaviour of pandas.DataFrame.groupby currently depends on the type of the groupers: when one of the groupers is categorical, unobserved categories are added to the result by default. This behaviour can be overridden by setting the observed argument to True.
I feel like making the groupby API consistent by default and regardless of the underlying data type would provide a much better user experience.
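A minimal sketch of the inconsistency described above, assuming a toy frame with one categorical grouper (the column names and values are made up for illustration). The observed argument is passed explicitly each time so the result does not depend on which default the installed pandas version uses:

```python
import pandas as pd

# Toy data: category "c" is declared but never appears in the column.
df = pd.DataFrame(
    {
        "key": pd.Categorical(["a", "a", "b"], categories=["a", "b", "c"]),
        "value": [1, 2, 3],
    }
)

# observed=False keeps the unobserved category "c" as an empty group.
all_categories = df.groupby("key", observed=False)["value"].sum()

# observed=True only returns groups that actually occur in the data.
only_observed = df.groupby("key", observed=True)["value"].sum()

# The same data with a plain string grouper never produces a "c" group.
as_strings = df.assign(key=df["key"].astype(str)).groupby("key")["value"].sum()

print(all_categories)
print(only_observed)
print(as_strings)
```

With observed=True the categorical grouper yields the same groups as the string grouper, which is the consistency this request is asking for.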
Describe the solution you'd like
Default to observed=True in pandas.DataFrame.groupby.
API breaking implications
Would break backwards-compatibility.
Describe alternatives you've considered
So far the only option I can think of is to add observed=True to every groupby I write to make sure it will behave correctly no matter what kind of data gets passed to it.
Comment From: jreback
pls search the tracker this is a duplicate request
Comment From: Seon82
Sorry, I completely missed it! And I still seem unable to find it no matter what synonyms I try, would you mind sending a link if you have one handy?
Comment From: jreback
see #35967 and linked issues
i guess we don't have an actual issue for this (or maybe one of the linked ones)
cc @jseabold made a really good effort here
Comment From: PMLP-novo
An alternative suggestion could be to determine observed at runtime by default. If building the groups in the Cartesian way would create more than, let's say, 100,000,000 groups, then we automatically switch to observed=True.
In the code this would mean having the default observed=None. This solution will be backwards compatible if users have set observed.
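A rough sketch of how such a runtime default might look. This is not pandas code: resolve_observed, its max_groups parameter, and the way the group count is estimated are all hypothetical, and only the 100,000,000 threshold comes from the comment above.

```python
import math

import pandas as pd

# Threshold taken from the suggestion above; everything else is an assumption.
MAX_CARTESIAN_GROUPS = 100_000_000


def resolve_observed(df, keys, observed=None, max_groups=MAX_CARTESIAN_GROUPS):
    """Hypothetical helper: pick a value for `observed` when the user left it as None."""
    if observed is not None:
        # An explicit user choice is always respected (backwards compatible).
        return observed
    sizes = []
    for key in keys:
        col = df[key]
        if isinstance(col.dtype, pd.CategoricalDtype):
            # Every declared category could become a group level.
            sizes.append(len(col.cat.categories))
        else:
            # Rough estimate for non-categorical groupers.
            sizes.append(col.nunique())
    # Switch to observed=True only when the Cartesian product gets too large.
    return math.prod(sizes) > max_groups


cats = ["x", "y", "z"]
df = pd.DataFrame(
    {
        "a": pd.Categorical(["x", "y"], categories=cats),
        "b": pd.Categorical(["x", "y"], categories=cats),
        "value": [1, 2],
    }
)

# 3 * 3 = 9 potential groups; a tiny threshold is used here to trigger the switch.
observed = resolve_observed(df, ["a", "b"], max_groups=5)
result = df.groupby(["a", "b"], observed=observed)["value"].sum()
print(observed, len(result))
```

One drawback of this design is that the shape of the result would depend on the size of the data, which may be harder to reason about than a fixed default.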
Comment From: rhshadrach
I think we should pursue this deprecation. By defaulting to observed=True, categorical dtypes will default to behaving the same as all other dtypes. This would allow users to take advantage of the performance benefits of categorical (in particular, memory usage if string values are frequently repeated). The default of observed=True is also safer than observed=False with regard to memory and runtime, especially when there are multiple groupings.
cc @jbrockmendel @jorisvandenbossche @mroeschke @topper-123
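To illustrate the point about multiple groupings, here is a small sketch with made-up data: two categorical groupers with 50 declared categories each, of which only three combinations occur. The sizes are arbitrary and chosen only to make the difference visible.

```python
import pandas as pd

# 50 declared categories per grouper, but only three values are used.
categories = [f"c{i}" for i in range(50)]
df = pd.DataFrame(
    {
        "k1": pd.Categorical(["c0", "c1", "c2"], categories=categories),
        "k2": pd.Categorical(["c0", "c1", "c2"], categories=categories),
        "value": [1, 2, 3],
    }
)

# observed=False materialises the full Cartesian product of categories.
rows_all = len(df.groupby(["k1", "k2"], observed=False)["value"].sum())

# observed=True keeps only the combinations present in the data.
rows_observed = len(df.groupby(["k1", "k2"], observed=True)["value"].sum())

print(rows_all, rows_observed)
```

The result for observed=False grows with the product of the category counts, so each additional categorical grouper multiplies the output size even when the input has only a handful of rows.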
Comment From: topper-123
+1 😄 . In addition, I had some arguments in #43999 on this.
I think this is quite a big ergonomic problem, e.g. beginners who don't know about observed=True will often see their memory use explode when doing groupbys, giving Pandas an unjustly negative reputation performance-wise and/or discouraging them from pursuing Pandas further. Experienced users may also forget to set observed=True (I forget this a lot myself), getting a little annoyed at this API each time.
Comment From: jbrockmendel
+1 on deprecating the default, see also #30552