Found in a Stack Overflow post here.
import pandas as pd

df = pd.DataFrame([[0, 0], [1, 100], [0, 100], [2, 0], [3, 100], [4, 100], [4, 100], [4, 100], [1, 100], [3, 100]], columns=['key', 'score'])
df.groupby('key')['score'].value_counts(bins=[0, 20, 40, 60, 80, 100])
Problem description
Outputs the counts with the wrong index labels. Note: the data only falls in the 0-20 and 80-100 bins.
key  score
0    (-0.001, 20.0]    1
     (20.0, 40.0]      1
     (40.0, 60.0]      0
     (60.0, 80.0]      0
     (80.0, 100.0]     0
1    (20.0, 40.0]      2
     (-0.001, 20.0]    0
     (40.0, 60.0]      0
     (60.0, 80.0]      0
     (80.0, 100.0]     0
2    (-0.001, 20.0]    1
     (20.0, 40.0]      0
     (40.0, 60.0]      0
     (60.0, 80.0]      0
     (80.0, 100.0]     0
3    (20.0, 40.0]      2
     (-0.001, 20.0]    0
     (40.0, 60.0]      0
     (60.0, 80.0]      0
     (80.0, 100.0]     0
4    (20.0, 40.0]      3
     (-0.001, 20.0]    0
     (40.0, 60.0]      0
     (60.0, 80.0]      0
     (80.0, 100.0]     0
Name: score, dtype: int64
Expected Output
df.groupby('key')['score'].apply(pd.Series.value_counts, bins=[0,20,40,60,80,100])
key
0    (80.0, 100.0]     1
     (-0.001, 20.0]    1
     (60.0, 80.0]      0
     (40.0, 60.0]      0
     (20.0, 40.0]      0
1    (80.0, 100.0]     2
     (60.0, 80.0]      0
     (40.0, 60.0]      0
     (20.0, 40.0]      0
     (-0.001, 20.0]    0
2    (-0.001, 20.0]    1
     (80.0, 100.0]     0
     (60.0, 80.0]      0
     (40.0, 60.0]      0
     (20.0, 40.0]      0
3    (80.0, 100.0]     2
     (60.0, 80.0]      0
     (40.0, 60.0]      0
     (20.0, 40.0]      0
     (-0.001, 20.0]    0
4    (80.0, 100.0]     3
     (60.0, 80.0]      0
     (40.0, 60.0]      0
     (20.0, 40.0]      0
     (-0.001, 20.0]    0
Name: score, dtype: int64
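For reference, a pd.cut-based sketch (my own addition, not part of the original report) that should produce the same per-key counts, ordered by bin rather than by count:

import pandas as pd

df = pd.DataFrame(
    [[0, 0], [1, 100], [0, 100], [2, 0], [3, 100],
     [4, 100], [4, 100], [4, 100], [1, 100], [3, 100]],
    columns=["key", "score"],
)

# Bin the whole column once; include_lowest=True widens the first bin so the
# minimum value (0) is counted, mirroring value_counts' (-0.001, 20.0] label.
binned = pd.cut(df["score"], bins=[0, 20, 40, 60, 80, 100], include_lowest=True)

# observed=False keeps empty bins, so every interval appears under every key.
df.groupby(["key", binned], observed=False).size()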
Output of pd.show_versions()
Comment From: DataInformer
I think this is related to #25970. SeriesGroupBy.value_counts has a weird section that notes 'scalar bins cannot be done at top level in a backward compatible way' and then performs further awkward manual operations that seem to contain an error. I'm looking at whether the tests will pass and the behavior will remain correct without this special section.
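As a quick local sanity check (just my own sketch using the example from this report, not something from the pandas test suite), the grouped path can be compared against the apply-based path:

import pandas as pd

df = pd.DataFrame(
    [[0, 0], [1, 100], [0, 100], [2, 0], [3, 100],
     [4, 100], [4, 100], [4, 100], [1, 100], [3, 100]],
    columns=["key", "score"],
)
bins = [0, 20, 40, 60, 80, 100]

grouped = df.groupby("key")["score"].value_counts(bins=bins)
applied = df.groupby("key")["score"].apply(pd.Series.value_counts, bins=bins)

def to_dict(s):
    # Key on (group key, interval label as string) so ordering and index
    # dtype differences between the two paths don't matter.
    return {(key, str(interval)): count for (key, interval), count in s.items()}

# Should pass; on affected versions it fails because the grouped path
# attaches the counts to the wrong interval labels.
assert to_dict(grouped) == to_dict(applied)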
Comment From: DataInformer
take
Comment From: swierh
This bug seems to occur when there are bins that are always empty. I've tested versions 1.3.4 and 1.5.1, and the bug is present in both.
Given the following dataframes, the groupby operation works as expected for df1, but not for df2.
import pandas as pd

df1 = pd.DataFrame(
    [["a", 1.5], ["b", 0.5], ["b", 1.5]],
    columns=["group", "value"],
)
df2 = pd.DataFrame(
    [["a", 1.5], ["b", 1.5], ["b", 1.5]],
    columns=["group", "value"],
)
And the following two groupby methods (where df is either df1 or df2):
# groupby A:
df.groupby("group")["value"].value_counts(bins=[0, 1, 2]).unstack(level=0)

# groupby B:
pd.concat(
    {
        group: df.loc[df["group"] == group, "value"].value_counts(bins=[0, 1, 2])
        for group in df["group"].unique()
    }
).unstack(level=0)
I would expect both groupby operations to give identical results. For df1 they do give identical results, but for df2, which has an empty first bin, they do not. Method B gives the expected outcome, but A gives:
group          a  b
value
(-0.001, 1.0]  1  2
(1.0, 2.0]     0  0
I haven't looked into it properly yet, but the following line is currently my prime suspect: https://github.com/pandas-dev/pandas/blob/main/pandas/core/groupby/generic.py#L673
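For convenience, here is the above assembled into one self-contained script (a sketch; on affected versions such as 1.3.4/1.5.1, method A is expected to misplace the counts as shown above):

import pandas as pd

df2 = pd.DataFrame(
    [["a", 1.5], ["b", 1.5], ["b", 1.5]],
    columns=["group", "value"],
)
bins = [0, 1, 2]

# Method A: SeriesGroupBy.value_counts with explicit bins.
a = df2.groupby("group")["value"].value_counts(bins=bins).unstack(level=0)

# Method B: plain Series.value_counts per group, then concatenated.
b = pd.concat(
    {
        group: df2.loc[df2["group"] == group, "value"].value_counts(bins=bins)
        for group in df2["group"].unique()
    }
).unstack(level=0)

print(a)  # affected versions put the counts under (-0.001, 1.0]
print(b)  # counts correctly appear under (1.0, 2.0]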