Pandas DEPR: Change default to observed=True in DataFrame.groupby

Is your feature request related to a problem?

The default behaviour of pandas.DataFrame.groupby currently depends on the type of the groupers: when one of the groupers is categorical, unobserved categories are added to the result by default. This behaviour can be overridden by setting the observed argument to True.
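A minimal sketch of the inconsistency, assuming a toy frame whose column names (`key`, `value`) are made up for illustration. `observed` is passed explicitly so the result does not depend on which default the installed pandas version uses:

```python
import pandas as pd

# "c" is a declared category that never occurs in the data.
df = pd.DataFrame(
    {
        "key": pd.Categorical(["a", "a", "b"], categories=["a", "b", "c"]),
        "value": [1, 2, 3],
    }
)

# observed=False: the unobserved category "c" appears as an (empty) group.
with_unobserved = df.groupby("key", observed=False)["value"].sum()

# observed=True: only categories present in the data form groups.
only_observed = df.groupby("key", observed=True)["value"].sum()

# Same data with a plain string key: only the values present form groups.
as_strings = df.astype({"key": str}).groupby("key")["value"].sum()

print(list(with_unobserved.index))
print(list(only_observed.index))
print(list(as_strings.index))
```

With observed=True the categorical result has the same groups as the string-keyed one, which is the consistency this request is after.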

I feel like making the groupby API consistent by default and regardless of the underlying data type would provide a much better user experience.

Describe the solution you'd like

Default to observed=True in pandas.DataFrame.groupby.

API breaking implications

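This would break backwards compatibility: code that relies on unobserved categories appearing in the result by default would need to pass observed=False explicitly.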

Describe alternatives you've considered

So far the only option I can think of is to add observed=True to every groupby I write to make sure it will behave correctly no matter what kind of data gets passed to it.

Comment From: jreback

pls search the tracker this is a duplicate request

Comment From: Seon82

Sorry, I completely missed it! And I still seem unable to find it no matter what synonyms I try, would you mind sending a link if you have one handy?

Comment From: jreback

see #35967 and linked issues

i guess we don't have an actual issue for this (or maybe one of the linked ones)

cc @jseabold made a really good effort here

Comment From: PMLP-novo

An alternative suggestion could be to determine observed at runtime by default. If creating the groups in the Cartesian way would produce more than, let's say, 100,000,000 groups, then we automatically switch to observed=True. In the code this would mean having the default observed=None. This solution would be backwards compatible for users who have set observed explicitly.
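A rough sketch of how such a runtime rule might look. `resolve_observed` and `max_groups` are hypothetical names, not part of pandas, and counting unique values for non-categorical keys is an assumption made only to keep the example short:

```python
import math

import pandas as pd


def resolve_observed(df, keys, observed=None, max_groups=100_000_000):
    """Hypothetical helper: choose a value for `observed` when it is None."""
    if observed is not None:
        # An explicit user choice is always respected.
        return observed
    sizes = []
    for key in keys:
        col = df[key]
        if isinstance(col.dtype, pd.CategoricalDtype):
            # Every declared category counts, whether or not it occurs.
            sizes.append(len(col.cat.categories))
        else:
            # Assumption: other dtypes contribute their unique values.
            sizes.append(col.nunique())
    # Switch to observed=True once the Cartesian product gets too large.
    return math.prod(sizes) > max_groups


df = pd.DataFrame(
    {
        "g1": pd.Categorical(["x", "y"], categories=["x", "y", "z"]),
        "g2": pd.Categorical(["p", "q"], categories=["p", "q", "r"]),
        "value": [1, 2],
    }
)

# 3 * 3 = 9 potential groups, so a threshold of 8 triggers observed=True.
observed = resolve_observed(df, ["g1", "g2"], max_groups=8)
result = df.groupby(["g1", "g2"], observed=observed)["value"].sum()
```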

Comment From: rhshadrach

I think we should pursue this deprecation. By defaulting to observed=True, categorical dtypes will default to behaving the same as all other dtypes. This would allow users to take advantage of the performance benefits of categorical (in particular, lower memory usage when string values are frequently repeated). The default of observed=True is also safer than observed=False with regard to memory and runtime, especially when there are multiple groupings.
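To make the point about multiple groupings concrete, here is a small sketch with made-up data: two categorical groupers with 50 declared categories each, of which only two combinations actually occur.

```python
import pandas as pd

cats = [f"c{i}" for i in range(50)]
df = pd.DataFrame(
    {
        "g1": pd.Categorical(["c0", "c1"], categories=cats),
        "g2": pd.Categorical(["c0", "c1"], categories=cats),
        "value": [1.0, 2.0],
    }
)

# observed=False: one row per combination of declared categories (50 * 50).
n_all = len(df.groupby(["g1", "g2"], observed=False)["value"].sum())

# observed=True: one row per combination that actually occurs in the data.
n_observed = len(df.groupby(["g1", "g2"], observed=True)["value"].sum())

print(n_all, n_observed)
```

The result size with observed=False grows with the product of the category counts, regardless of how many rows the frame has.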

cc @jbrockmendel @jorisvandenbossche @mroeschke @topper-123

Comment From: topper-123

+1 😄 . In addition, I had some arguments in #43999 on this.

I think this is quite a big ergonomic problem, e.g. beginners who don't know about observed=True will often see their memory use explode when doing groupbys, giving Pandas an unjustly negative reputation performance-wise and/or discouraging them from pursuing Pandas further. Experienced users may also forget to set observed=True (I forget this a lot myself), getting a little annoyed at this API each time.

Comment From: jbrockmendel

+1 on deprecating the default, see also #30552