Code Sample
import pandas as pd

df = pd.DataFrame([{'i': i, 's': str(i)} for i in range(5)])
# This gives two rows with counts of one, as expected:
df.iloc[:2].groupby('s').size()
df['s'] = df['s'].astype('category')
# This gives five rows: two with counts of one, the rest with counts of zero:
df.iloc[:2].groupby('s').size()
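For context, pandas later added an observed keyword to groupby that controls exactly this. A minimal sketch, assuming a pandas version that has the keyword (0.23+):

```python
import pandas as pd

df = pd.DataFrame([{'i': i, 's': str(i)} for i in range(5)])
df['s'] = df['s'].astype('category')

# observed=True keeps only the categories actually present in the
# sliced data, matching the non-categorical result:
observed = df.iloc[:2].groupby('s', observed=True).size()

# observed=False keeps every category, including the empty ones:
unobserved = df.iloc[:2].groupby('s', observed=False).size()
```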
Problem description
I'm using categorical columns to save memory and possibly improve performance when dealing with lots of strings, and I would expect grouping by them to behave the same whether the string column is left as-is or converted to category. If keeping empty groups is the expected behavior for a categorical groupby, would it be possible to at least provide a boolean parameter to groupby, such as empty_groups? Or maybe an even simpler solution exists, but I couldn't find it.
Output of pd.show_versions()
Comment From: gfyoung
@aplavin : We've had this issue before with crosstab (xref #16367), and in that discussion we seemed to be going with keeping all categories in the count. Thus, I would consider the output from your example expected in light of that discussion.
I'm a little wary about adding a parameter to .size() because it doesn't make much sense outside the context of categoricals. We want the interface to be as uniform as possible.
That being said, this method is a little different from crosstab, so perhaps workarounds (I can think of some, but none feel appealing) might be feasible.
@jreback @jorisvandenbossche @TomAugspurger
Comment From: aplavin
@gfyoung I'm not talking about adding a parameter to .size(), but to .groupby(), so that it applies to all further operations. size was just an example.
As of now, the only way I see to keep only the nonempty groups (as in a non-categorical groupby) is something like df.groupby(...).apply(f).pipe(lambda d: d[~d.isnull().any(1)]), and this would still call f for all the empty groups.
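A lighter-weight variant of the same idea is to aggregate first and filter zero-count groups afterwards, which at least avoids the lambda pipeline. A minimal sketch, assuming a pandas version with the observed keyword (passed explicitly here to pin down the all-categories behavior discussed in this thread):

```python
import pandas as pd

df = pd.DataFrame([{'i': i, 's': str(i)} for i in range(5)])
df['s'] = df['s'].astype('category')

# Aggregate first (observed=False reproduces the all-categories
# behavior under discussion), then drop the zero-count groups:
sizes = df.iloc[:2].groupby('s', observed=False).size()
nonempty = sizes[sizes > 0]
```

This only helps for counting-style aggregations, though; it does not stop a function passed to apply from being called on the empty groups.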
Comment From: gfyoung
I'm not talking about adding a parameter to .size(), but to .groupby()
I see, though outside of categoricals, what would this parameter mean? I would still be wary of adding another parameter and more functionality, given how special-cased it is in the context of groupby.
Comment From: toobaz
xref https://github.com/pandas-dev/pandas/issues/8559#issuecomment-59442278
Rather than changing some general-purpose method(s) such as size() and groupby(), maybe we could decide to solve this with a handy drop_empty() method for categoricals?
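Note that drop_empty() is hypothetical; for a single categorical column, pandas' existing Series.cat.remove_unused_categories() already behaves like this suggestion. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame([{'i': i, 's': str(i)} for i in range(5)])
df['s'] = df['s'].astype('category')

sub = df.iloc[:2].copy()
# Dropping the unused categories before grouping restores the
# non-categorical result, even when all categories are kept:
sub['s'] = sub['s'].cat.remove_unused_categories()
result = sub.groupby('s', observed=False).size()
```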
Comment From: aplavin
@toobaz if I understand you correctly, then it probably won't work in general. Suppose there is a column a with values 1 and 2, and a column b with values 3 and 4, but the only occurring combinations are 1,3 and 2,4. Then if we group by both of them (as categoricals) I would expect to get two groups (1,3 and 2,4), not all four. Actually, it was very surprising to see that simply changing the data type for performance reasons (in my case) significantly changed the behavior.
Comment From: gfyoung
In light of @toobaz's reference, I think this brings up a larger question: how do we want to handle counting of Categorical groups that have no presence in the data?
It might be worthwhile to create one umbrella issue that addresses this API question and close the duplicates, as this issue seems to have cropped up on several occasions.
@jreback : what do you think?
Comment From: toobaz
@toobaz if I understand you correctly, then it probably won't work in general
Yes, you understood me correctly... and yes, you're right...
Comment From: jreback
See the extensive discussion in https://github.com/pandas-dev/pandas/issues/8559.
Closing as a duplicate.