Code Sample, a copy-pastable example if possible
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': 'a b c'.split(), 'B': [1, 2, 3], 'C': [4, 6, 5]})
df['B'] = np.nan
df.groupby(lambda x: x, axis=1).sum(min_count=1)
Problem description
With min_count=1, the code above should return NaN for column 'B', as described in the DataFrame.sum documentation, but it returns 0 instead.
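For reference, the documented min_count behaviour can be checked on a plain Series, with no groupby involved. This is a minimal sketch; the variable names are illustrative only and are not part of the original report.

```python
import numpy as np
import pandas as pd

# A Series containing only missing values.
all_nan = pd.Series([np.nan, np.nan])

# With the default min_count=0, the sum of no valid values is 0.0.
default_total = all_nan.sum()

# With min_count=1, at least one valid value is required, so the result is NaN.
guarded_total = all_nan.sum(min_count=1)

print(default_total)  # 0.0
print(guarded_total)  # nan
```

This is the result the report expects for column 'B' in the grouped case as well.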
Actual Output
A B C
0 a 0.0 4
1 b 0.0 6
2 c 0.0 5
Expected Output
A B C
0 a NaN 4
1 b NaN 6
2 c NaN 5
I then removed the column containing strings, and the result for column 'B' is NaN as expected. However, the columns of a DataFrame should be independent of each other, so this looks like a bug.
del df['A']
df.groupby(lambda x:x, axis=1).sum(min_count=1)
Output
B C
0 NaN 4.0
1 NaN 6.0
2 NaN 5.0
Output of pd.show_versions()
Comment From: WillAyd
Hmm OK that is strange. Definitely some unwanted casting going on with the presence of column A.
Investigation and PRs are always welcome if you care to take a look.
Comment From: Koustav-Samaddar
This is the most stripped-down case in which I can still reproduce the bug.
import numpy as np
import pandas as pd
# bug_var = 1
bug_var = 'a'
df = pd.DataFrame({
    'A': [bug_var, np.nan]
})
gresult = df.groupby(lambda x: x)
result = gresult.sum(min_count=1)
This yields,
result
A
0 a
1 0
but,
expected
A
0 a
1 NaN
Due to the non-numeric type, self._cython_agg_general at groupby.py:1256 raises an error (DataError if numeric_only=True, otherwise AttributeError), which causes the program to fall back on self.aggregate (groupby.py:1262), which is where the bug stems from.
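The failure mode described above can be illustrated with a toy model in plain Python. This is only a sketch of the pattern (a typed fast path that honours min_count, and a generic fallback that never receives it). It is not the pandas implementation, and fast_sum, generic_sum and grouped_sum are hypothetical names invented for this illustration.

```python
import math


def _drop_nan(values):
    """Return the values that are not float NaN."""
    return [v for v in values if not (isinstance(v, float) and math.isnan(v))]


def fast_sum(values, min_count=0):
    """Hypothetical typed fast path: rejects non-numeric data, honours min_count."""
    if any(isinstance(v, str) for v in values):
        raise TypeError("non-numeric data")
    valid = _drop_nan(values)
    return sum(valid) if len(valid) >= min_count else math.nan


def generic_sum(values):
    """Hypothetical generic fallback: has no notion of min_count."""
    valid = _drop_nan(values)
    if not valid:
        return 0
    total = valid[0]
    for v in valid[1:]:
        total = total + v
    return total


def grouped_sum(groups, min_count=0):
    """Try the fast path; on failure, fall back and silently drop min_count."""
    try:
        return {key: fast_sum(vals, min_count=min_count) for key, vals in groups.items()}
    except TypeError:
        return {key: generic_sum(vals) for key, vals in groups.items()}


# Mirrors the stripped-down case: one string value, one missing value.
result = grouped_sum({0: ["a"], 1: [math.nan]}, min_count=1)
print(result)  # {0: 'a', 1: 0}

# With purely numeric data the fast path is used and min_count is respected.
numeric_result = grouped_sum({0: [1.0], 1: [math.nan]}, min_count=1)
print(numeric_result)  # {0: 1.0, 1: nan}
```

In this model, a single string anywhere in the data switches every group to the fallback, which is consistent with the observation that removing column 'A' restores the NaN result.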
I'll continue looking into this tomorrow, but in the meantime I'm posting this here in case anyone finds it useful.
Comment From: ikramersh
This issue no longer occurs when tested with pandas version 1.3.4.
Comment From: gmaiwald
take