Code Sample
import pandas as pd

df = pd.DataFrame([{'i': i, 's': str(i)} for i in range(5)])
# This gives two rows with counts of one, as expected:
df.iloc[:2].groupby('s').size()
df['s'] = df['s'].astype('category')
# This gives five rows: two with counts of one, the rest with counts of zero:
df.iloc[:2].groupby('s').size()
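For context, pandas later added an observed keyword to groupby that controls exactly this. A minimal sketch, assuming a pandas version that has the keyword (0.23+):

```python
import pandas as pd

df = pd.DataFrame([{'i': i, 's': str(i)} for i in range(5)])
df['s'] = df['s'].astype('category')

# observed=True keeps only the categories actually present in the
# sliced data, matching the non-categorical result:
observed = df.iloc[:2].groupby('s', observed=True).size()

# observed=False keeps every category, including the empty ones:
unobserved = df.iloc[:2].groupby('s', observed=False).size()
```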
Problem description
I'm using categorical columns to save memory and possibly improve performance when dealing with lots of strings, and I would expect grouping by them to behave the same whether the string column is left as-is or converted to category. If keeping empty groups is the expected behavior for a categorical groupby, would it be possible to at least provide a boolean parameter to groupby, such as empty_groups? Or maybe an even simpler solution exists, but I couldn't find it.
Output of pd.show_versions()
Comment From: gfyoung
@aplavin : We've had this issue before with crosstab (xref #16367), and in that discussion we seemed to be going with keeping all categories in the count. Thus, I would consider the output from your example expected in light of that discussion.
I'm a little wary about adding a parameter to .size() because it doesn't make much sense outside the context of categoricals. We want the interface to be as uniform as possible.
That being said, this method is a little different from crosstab, so perhaps workarounds (I can think of some, but none feel appealing) might be feasible.
@jreback @jorisvandenbossche @TomAugspurger
Comment From: aplavin
@gfyoung I'm not talking about adding a parameter to .size(), but to .groupby(), so that it applies to all further operations. size was just an example.
As of now, the only way I see to keep only the nonempty groups (as in a non-categorical groupby) is something like df.groupby(...).apply(f).pipe(lambda d: d[~d.isnull().any(1)]), and this would still call f for all the empty groups.
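A lighter-weight variant of the same idea is to aggregate first and filter zero-count groups afterwards, which at least avoids the lambda pipeline. A minimal sketch, assuming a pandas version with the observed keyword (passed explicitly here to pin down the all-categories behavior discussed in this thread):

```python
import pandas as pd

df = pd.DataFrame([{'i': i, 's': str(i)} for i in range(5)])
df['s'] = df['s'].astype('category')

# Aggregate first (observed=False reproduces the all-categories
# behavior under discussion), then drop the zero-count groups:
sizes = df.iloc[:2].groupby('s', observed=False).size()
nonempty = sizes[sizes > 0]
```

This only helps for counting-style aggregations, though; it does not stop a function passed to apply from being called on the empty groups.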
Comment From: gfyoung
I'm not talking about adding a parameter to .size(), but to .groupby()
I see, though outside of categoricals, what would this parameter mean? I would still be wary of adding another parameter and more functionality, given how special-cased it is in the context of groupby.
Comment From: toobaz
xref https://github.com/pandas-dev/pandas/issues/8559#issuecomment-59442278
Rather than changing some general-purpose method(s) such as size() and groupby(), maybe we could decide to solve this with a handy drop_empty() method for categoricals?
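Note that drop_empty() is hypothetical; for a single categorical column, pandas' existing Series.cat.remove_unused_categories() already behaves like this suggestion. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame([{'i': i, 's': str(i)} for i in range(5)])
df['s'] = df['s'].astype('category')

sub = df.iloc[:2].copy()
# Dropping the unused categories before grouping restores the
# non-categorical result, even when all categories are kept:
sub['s'] = sub['s'].cat.remove_unused_categories()
result = sub.groupby('s', observed=False).size()
```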
Comment From: aplavin
@toobaz if I understand you correctly, then it probably won't work in general. Suppose there is a column a with values 1 and 2, and a column b with values 3 and 4, but the only occurring combinations are 1,3 and 2,4. Then if we group by both of them (as categoricals) I would expect to get two groups (1,3 and 2,4), not all four. Actually, it was very surprising to see that simply changing the data type for performance reasons (in my case) significantly changed the behavior.
Comment From: gfyoung
In light of @toobaz's reference, I think this brings up a larger question: how do we want to handle counting of Categorical groups that have no presence in the data?
It might be worthwhile to create one umbrella issue that addresses this API question and close the duplicates, as this issue seems to have cropped up on several occasions.
@jreback : what do you think?
Comment From: toobaz
@toobaz if I understand you correctly, then it probably won't work in general
Yes, you understood me correctly... and yes, you're right...
Comment From: jreback
See the extensive discussion in https://github.com/pandas-dev/pandas/issues/8559.
Closing as a duplicate.