In a groupby/transform, when some of the group keys are missing, should the transformed values be set to missing (my preference), left unchanged, or should this be an error? Currently the behavior is inconsistent between Series and DataFrames, and between cythonized and non-cythonized transformations.
For a Series with a non-cythonized transformation, the values are left unchanged:
>>> import pandas as pd
>>> import numpy as np
>>> s = pd.Series([100, 200, 300, 400])
>>> s.groupby([1, 1, np.nan, np.nan]).transform(pd.Series.mean)
0 200
1 200
2 300
3 400
For a Series with cythonized functions, it's an error (this changed between 0.14.1 and 0.15.0):
>>> s.groupby([1, 1, np.nan, np.nan]).transform(np.mean)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pandas/pandas/core/groupby.py", line 2425, in transform
return self._transform_fast(cyfunc)
File "pandas/pandas/core/groupby.py", line 2466, in _transform_fast
return self._set_result_index_ordered(Series(values))
File "pandas/pandas/core/groupby.py", line 494, in _set_result_index_ordered
result.index = self.obj.index
File "pandas/pandas/core/generic.py", line 1948, in __setattr__
object.__setattr__(self, name, value)
File "pandas/src/properties.pyx", line 65, in pandas.lib.AxisProperty.__set__ (pandas/lib.c:41020)
File "pandas/pandas/core/series.py", line 262, in _set_axis
self._data.set_axis(axis, labels)
File "pandas/pandas/core/internals.py", line 2217, in set_axis
'new values have %d elements' % (old_len, new_len))
ValueError: Length mismatch: Expected axis has 2 elements, new values have 4 elements
For DataFrames, the results are opposite:
>>> f = pd.DataFrame({'a': s, 'b': s * 2})
>>> f
a b
0 100 200
1 200 400
2 300 600
3 400 800
>>> f.groupby([1, 1, np.nan, np.nan]).transform(np.sum)
a b
0 300 600
1 300 600
2 300 600
3 400 800
>>> f.groupby([1, 1, np.nan, np.nan]).transform(pd.DataFrame.sum)
Traceback (most recent call last):
File "pandas/pandas/core/groupby.py", line 3002, in transform
return self._transform_general(func, *args, **kwargs)
File "pandas/pandas/core/groupby.py", line 2968, in _transform_general
return self._set_result_index_ordered(concatenated)
File "pandas/pandas/core/groupby.py", line 494, in _set_result_index_ordered
result.index = self.obj.index
File "pandas/pandas/core/generic.py", line 1948, in __setattr__
object.__setattr__(self, name, value)
File "pandas/src/properties.pyx", line 65, in pandas.lib.AxisProperty.__set__ (pandas/lib.c:41020)
File "pandas/pandas/core/generic.py", line 406, in _set_axis
self._data.set_axis(axis, labels)
File "pandas/pandas/core/internals.py", line 2217, in set_axis
'new values have %d elements' % (old_len, new_len))
ValueError: Length mismatch: Expected axis has 2 elements, new values have 4 elements
>>> print(pd.__version__)
0.15.1-125-ge463818
Comment From: jreback
Agree on the consistency.
I think a pass-through grouping is probably the most intuitive; in other words, the values should be unchanged by a transform.
Care to do a PR to fix?
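The pass-through semantics suggested above never became the default. As a rough sketch of what it would mean: on a recent pandas (where dropna=True yields NaN for missing-key rows in a transform), one could emulate "leave those rows unchanged" by filling the NaNs back with the original values. The `fillna` step here is my own illustration, not a pandas option for this:

```python
import numpy as np
import pandas as pd

s = pd.Series([100, 200, 300, 400])
keys = pd.Series([1, 1, np.nan, np.nan])

# With the default dropna=True, rows whose key is NaN come back as NaN.
means = s.groupby(keys).transform("mean")

# Hypothetical "pass-through": restore the original values for those rows.
passthrough = means.fillna(s)
print(passthrough.tolist())  # [150.0, 150.0, 300.0, 400.0]
```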
Comment From: kcarnold
This is still a bug (v0.20.3), though interestingly the Series case now works correctly for np.mean (which crashed for the OP) and crashes for pd.Series.mean.
#9941 was a similar issue that was fixed recently.
Comment From: gfyoung
@kcarnold : Interesting...is the stack-trace the same as before?
BTW, given what you said, we should probably add tests to confirm your first statement, though a PR to patch the still crashing behavior is still welcome!
Comment From: gfyoung
agree on the consistency
@jreback : Well, you have your consistency now, it seems. They both work with numpy, but they both fail with the pandas-defined aggregate functions. :smile:
Comment From: rhshadrach
Now all four ops do not raise, but give inconsistent results. Using np.mean and np.sum you get four rows with NaNs filled in for the missing keys. Using pd...mean and pd...sum there are two rows. From #35751 (cc @arw2019), I think the output here (where dropna=True) should have two rows.
I think the issue is here:
https://github.com/pandas-dev/pandas/blob/53810fdadd7e0aa848ae4f37ca9421cf7ee29961/pandas/core/groupby/generic.py#L559-L567
where it is incorrect to use self.obj.index.
The result when dropna=False is consistent and correct.
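For reference, a minimal example of the dropna=False case on a recent pandas (the `dropna` argument to groupby exists since 1.1): the NaN keys form their own group, so that group's mean (350) is broadcast back to those rows rather than producing NaN:

```python
import numpy as np
import pandas as pd

s = pd.Series([100, 200, 300, 400])

# dropna=False keeps the NaN keys as a group of their own: (300, 400) -> 350.
out = s.groupby([1, 1, np.nan, np.nan], dropna=False).transform("mean")
print(out.tolist())  # [150.0, 150.0, 350.0, 350.0]
```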
Comment From: rhshadrach
All four results are now consistent and correct, according to
https://pandas.pydata.org/pandas-docs/dev/whatsnew/v1.5.0.html#using-dropna-true-with-groupby-transforms
import numpy as np
import pandas as pd

s = pd.Series([100, 200, 300, 400])
print(s.groupby([1, 1, np.nan, np.nan]).transform(pd.Series.mean))
# 0 150.0
# 1 150.0
# 2 NaN
# 3 NaN
# dtype: float64
print(s.groupby([1, 1, np.nan, np.nan]).transform(np.mean))
# 0 150.0
# 1 150.0
# 2 NaN
# 3 NaN
# dtype: float64
f = pd.DataFrame({'a': s, 'b': s * 2})
print(f.groupby([1, 1, np.nan, np.nan]).transform(np.sum))
# a b
# 0 300.0 600.0
# 1 300.0 600.0
# 2 NaN NaN
# 3 NaN NaN
print(f.groupby([1, 1, np.nan, np.nan]).transform(pd.DataFrame.sum))
# a b
# 0 300.0 600.0
# 1 300.0 600.0
# 2 NaN NaN
# 3 NaN NaN
Tests were added to pandas.tests.groupby.dropna as part of the fixes linked above.
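A minimal check in the spirit of those tests (hypothetical, written here for illustration, not the actual test from pandas.tests.groupby.dropna) might look like this, using the pandas testing helpers:

```python
import numpy as np
import pandas as pd
import pandas._testing as tm

# With the default dropna=True, rows whose group key is missing
# should come back as NaN from a transform.
s = pd.Series([100, 200, 300, 400])
result = s.groupby([1, 1, np.nan, np.nan]).transform("mean")
expected = pd.Series([150.0, 150.0, np.nan, np.nan])
tm.assert_series_equal(result, expected)
```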