Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
df = pd.DataFrame(data={'id': [1,1,1,1], 'my_bool': [True, False, False, True]})
df.groupby('id').diff() # incorrectly returns [NaN, -1, 0, 1]; pandas 1.4 returned [NaN, True, False, True]
df['my_bool'].diff() # returns the desired output [NaN, True, False, True]
# other example
df2 = pd.DataFrame(data={'id': pd.Series([], dtype='int64'), 'my_bool': pd.Series([], dtype='bool')})
df2['my_bool'].diff() # correct result: empty Series
df2.groupby('id')['my_bool'].diff() # raises TypeError: numpy boolean subtract, the `-` operator, is not supported
Issue Description
In pandas 1.5 and above (tested 1.5.2 via pip, and main 2.0.0 via source install), DataFrameGroupBy.diff:
- on boolean-typed columns, returns an incorrectly typed Series;
- on empty boolean-typed columns, raises a numpy exception because it applies the `-` operator, which numpy does not support for booleans.
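To illustrate the second point, here is a small numpy-only sketch of why subtraction fails on boolean arrays while xor works. The array names `current` and `previous` are made up for illustration; this is not the pandas internals, just the operator behavior the traceback points to.

```python
import numpy as np

# Illustrative arrays (hypothetical names): a value and its predecessor.
current = np.array([False, False, True])
previous = np.array([True, False, False])

# numpy refuses `-` on boolean arrays and raises TypeError.
try:
    current - previous
    subtraction_error = None
except TypeError as exc:
    subtraction_error = str(exc)

# xor is the boolean analogue of "did the value change?"
changed = np.logical_xor(current, previous)

print(subtraction_error)
print(changed)
```

The xor result marks the positions where the value differs from its predecessor, which is what a boolean diff is meant to report.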
Expected Behavior
First example
>>> import pandas as pd
>>> df = pd.DataFrame(data={'id': [1,1,1,1], 'my_bool': [True, False, False, True]})
>>> df.groupby('id').diff()
my_bool
0 NaN
1 True
2 False
3 True
Second example
Should not raise an exception
>>> df2 = pd.DataFrame(data={'id': pd.Series([], dtype='int64'), 'my_bool': pd.Series([], dtype="bool")})
>>> df2.groupby('id')['my_bool'].diff() # should return Series([], Name: my_bool, dtype: object)
Series([], Name: my_bool, dtype: bool)
Installed Versions
Comment From: rhshadrach
From the docstring of Series.diff:
For boolean dtypes, this uses operator.xor() rather than operator.sub(). The result is calculated according to current dtype in Series, however dtype of the result is always float64.
It seems this has not been implemented correctly. I'm in favor of how the docstring says this method should be implemented, so the result for df['my_bool'].diff() in example 1 would be:
0 NaN
1 1.0
2 0.0
3 1.0
Name: my_bool, dtype: float64
Agreed that groupby should switch to xor with boolean inputs as well (or, alternatively, neither should), and that empty inputs should work here (Example 2).
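For reference, a minimal sketch of what xor-based semantics could look like per group, following the docstring (xor for the computation, float64 for the result). `grouped_bool_diff` is a hypothetical helper, not a pandas API and not a proposed patch; it assumes the column has plain `bool` dtype with no missing values.

```python
import pandas as pd


def grouped_bool_diff(df, key, col):
    """Hypothetical helper: xor each value with the previous value in its group."""
    grouped = df.groupby(key)[col]
    # Previous value within each group; False is a placeholder at group starts.
    prev = grouped.shift(fill_value=False).astype(bool)
    # The first row of each group has no predecessor.
    first = grouped.cumcount() == 0
    # xor instead of subtraction, so numpy never sees `-` on booleans.
    changed = df[col].astype(bool) ^ prev
    # float64 result with NaN where there is nothing to compare against.
    return changed.astype("float64").mask(first)


df = pd.DataFrame({"id": [1, 1, 1, 1], "my_bool": [True, False, False, True]})
result = grouped_bool_diff(df, "id", "my_bool")
print(result)
```

On the frame from example 1 this gives NaN, 1.0, 0.0, 1.0 with dtype float64, matching the Series output shown above.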
One other thought is that perhaps we should preserve e.g. Int64 dtypes since they can hold NA values (unlike int64). But I think that's best for a separate issue.