Found in a Stack Overflow post here.
import pandas as pd

df = pd.DataFrame([[0, 0], [1, 100], [0, 100], [2, 0], [3, 100], [4, 100], [4, 100], [4, 100], [1, 100], [3, 100]], columns=['key', 'score'])
df.groupby('key')['score'].value_counts(bins=[0, 20, 40, 60, 80, 100])
Problem description
Outputs the counts with the wrong index labels. Note: the data only falls in the 0-20 and 80-100 bins.
key  score
0    (-0.001, 20.0]    1
     (20.0, 40.0]      1
     (40.0, 60.0]      0
     (60.0, 80.0]      0
     (80.0, 100.0]     0
1    (20.0, 40.0]      2
     (-0.001, 20.0]    0
     (40.0, 60.0]      0
     (60.0, 80.0]      0
     (80.0, 100.0]     0
2    (-0.001, 20.0]    1
     (20.0, 40.0]      0
     (40.0, 60.0]      0
     (60.0, 80.0]      0
     (80.0, 100.0]     0
3    (20.0, 40.0]      2
     (-0.001, 20.0]    0
     (40.0, 60.0]      0
     (60.0, 80.0]      0
     (80.0, 100.0]     0
4    (20.0, 40.0]      3
     (-0.001, 20.0]    0
     (40.0, 60.0]      0
     (60.0, 80.0]      0
     (80.0, 100.0]     0
Name: score, dtype: int64
Expected Output
df.groupby('key')['score'].apply(pd.Series.value_counts, bins=[0,20,40,60,80,100])
key
0    (80.0, 100.0]     1
     (-0.001, 20.0]    1
     (60.0, 80.0]      0
     (40.0, 60.0]      0
     (20.0, 40.0]      0
1    (80.0, 100.0]     2
     (60.0, 80.0]      0
     (40.0, 60.0]      0
     (20.0, 40.0]      0
     (-0.001, 20.0]    0
2    (-0.001, 20.0]    1
     (80.0, 100.0]     0
     (60.0, 80.0]      0
     (40.0, 60.0]      0
     (20.0, 40.0]      0
3    (80.0, 100.0]     2
     (60.0, 80.0]      0
     (40.0, 60.0]      0
     (20.0, 40.0]      0
     (-0.001, 20.0]    0
4    (80.0, 100.0]     3
     (60.0, 80.0]      0
     (40.0, 60.0]      0
     (20.0, 40.0]      0
     (-0.001, 20.0]    0
Name: score, dtype: int64
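For reference, a pd.cut-based sketch (my own addition, not part of the original report) that should produce the same per-key counts, ordered by bin rather than by count:

import pandas as pd

df = pd.DataFrame(
    [[0, 0], [1, 100], [0, 100], [2, 0], [3, 100],
     [4, 100], [4, 100], [4, 100], [1, 100], [3, 100]],
    columns=["key", "score"],
)

# Bin the whole column once; include_lowest=True widens the first bin so the
# minimum value (0) is counted, mirroring value_counts' (-0.001, 20.0] label.
binned = pd.cut(df["score"], bins=[0, 20, 40, 60, 80, 100], include_lowest=True)

# observed=False keeps empty bins, so every interval appears under every key.
df.groupby(["key", binned], observed=False).size()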
Output of pd.show_versions()
Comment From: DataInformer
I think this is related to #25970. SeriesGroupBy.value_counts has a weird section that notes 'scalar bins cannot be done at top level in a backward compatible way' and then performs further awkward manual operations that seem to contain an error. I'm looking at whether the tests will pass and the behavior will remain correct without this special section.
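As a quick local sanity check (just my own sketch using the example from this report, not something from the pandas test suite), the grouped path can be compared against the apply-based path:

import pandas as pd

df = pd.DataFrame(
    [[0, 0], [1, 100], [0, 100], [2, 0], [3, 100],
     [4, 100], [4, 100], [4, 100], [1, 100], [3, 100]],
    columns=["key", "score"],
)
bins = [0, 20, 40, 60, 80, 100]

grouped = df.groupby("key")["score"].value_counts(bins=bins)
applied = df.groupby("key")["score"].apply(pd.Series.value_counts, bins=bins)

def to_dict(s):
    # Key on (group key, interval label as string) so ordering and index
    # dtype differences between the two paths don't matter.
    return {(key, str(interval)): count for (key, interval), count in s.items()}

# Should pass; on affected versions it fails because the grouped path
# attaches the counts to the wrong interval labels.
assert to_dict(grouped) == to_dict(applied)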
Comment From: DataInformer
take
Comment From: swierh
This bug seems to occur when there are bins that are always empty. I've tested versions 1.3.4 and 1.5.1, and the bug is present in both.
Given the following dataframes, the groupby operation works as expected for df1, but not for df2.
import pandas as pd

df1 = pd.DataFrame(
    [["a", 1.5], ["b", 0.5], ["b", 1.5]],
    columns=["group", "value"],
)
df2 = pd.DataFrame(
    [["a", 1.5], ["b", 1.5], ["b", 1.5]],
    columns=["group", "value"],
)
And the following two groupby methods (where df is either df1 or df2):
# groupby A:
df.groupby("group")["value"].value_counts(bins=[0, 1, 2]).unstack(level=0)

# groupby B:
pd.concat(
    {
        group: df.loc[df["group"] == group, "value"].value_counts(bins=[0, 1, 2])
        for group in df["group"].unique()
    }
).unstack(level=0)
I would expect both groupby operations to give identical results. For df1 they do give identical results, but for df2, which has an empty first bin, they do not. Method B gives the expected outcome, but A gives:
group          a  b
value
(-0.001, 1.0]  1  2
(1.0, 2.0]     0  0
I haven't looked into it properly yet, but the following line is currently my prime suspect: https://github.com/pandas-dev/pandas/blob/main/pandas/core/groupby/generic.py#L673
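For convenience, here is the above assembled into one self-contained script (a sketch; on affected versions such as 1.3.4/1.5.1, method A is expected to misplace the counts as shown above):

import pandas as pd

df2 = pd.DataFrame(
    [["a", 1.5], ["b", 1.5], ["b", 1.5]],
    columns=["group", "value"],
)
bins = [0, 1, 2]

# Method A: SeriesGroupBy.value_counts with explicit bins.
a = df2.groupby("group")["value"].value_counts(bins=bins).unstack(level=0)

# Method B: plain Series.value_counts per group, then concatenated.
b = pd.concat(
    {
        group: df2.loc[df2["group"] == group, "value"].value_counts(bins=bins)
        for group in df2["group"].unique()
    }
).unstack(level=0)

print(a)  # affected versions put the counts under (-0.001, 1.0]
print(b)  # counts correctly appear under (1.0, 2.0]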