- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [ ] (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
import pandas as pd

def calcscore(df):
    return pd.Series([[','.join(df.item), df.score.sum()]], index=pd.Index([df.index[0]]))

df = pd.DataFrame(columns=['time', 'item', 'param', 'score'],
                  data=[[1, 'N1', 10, 100],
                        [1, 'N2', 20, 150],
                        [2, 'N3', 70, 300]])

for p in [5, 50]:
    s = df[df.param > p].groupby('time', group_keys=False).apply(calcscore)
    print(type(s))
    print(s.to_list())
Problem description
I'm looping through a DataFrame, filtering it by varying thresholds on the param column and returning aggregated data for each filtered subset (the actual code is more complex). I'm intentionally returning a Series from the function passed to apply, since I've benchmarked it to be faster than returning a DataFrame. The problem is that groupby(...).apply(...) sometimes returns a DataFrame instead of a Series, even though the applied function always returns a Series. This seems to happen when the resulting Series would have only one row. Here is the unexpected output that I'm seeing:
<class 'pandas.core.series.Series'>
[['N1,N2', 250], ['N3', 300]]
<class 'pandas.core.frame.DataFrame'>
Traceback (most recent call last):
  File "/tmp/applytest.py", line 22, in <module>
    print(s.to_list())
  File "/usr/lib64/python3.8/site-packages/pandas/core/generic.py", line 5130, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'to_list'
Expected Output
<class 'pandas.core.series.Series'>
[['N1,N2', 250], ['N3', 300]]
<class 'pandas.core.series.Series'>
[['N3', 300]]
Output of pd.show_versions()
Comment From: andrewkoo
I took a look at the source and found that the error may be related to the call to unstack in _wrap_applied_output (pandas > core > groupby > generic.py > DataFrameGroupBy > _wrap_applied_output). concat does indeed return a Series, but unstack then converts that Series to a DataFrame. Removing the call to unstack seems to produce the expected output for this case, but I'm not sure whether that would break other cases...
def _wrap_applied_output(self, keys, values, not_indexed_same=False):
    ...
    if self.axis == 0 and isinstance(v, ABCSeries):
        if (
            isinstance(v.index, MultiIndex)
            or key_index is None
            or isinstance(key_index, MultiIndex)
        ):
            ...
        else:
            # GH5788 instead of stacking; concat gets the
            # dtypes correct
            from pandas.core.reshape.concat import concat

            result = concat(
                values,
                keys=key_index,
                names=key_index.names,
                axis=self.axis,
            ).unstack()
            result.columns = index
    ...
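The effect of that trailing unstack() can be reproduced outside of groupby. The following is only a sketch using the public pd.concat and Series.unstack calls. The values and key_index variables are hand-built stand-ins for what the internal method receives, not the actual internal state:

```python
import pandas as pd

# One group's result, shaped like what calcscore returns for time == 2.
# (Hand-built stand-in for the internal `values` list; an assumption.)
values = [pd.Series([["N3", 300]], index=pd.Index([2]))]
key_index = pd.Index([2], name="time")

# Same call pattern as the quoted branch, without the trailing unstack().
stacked = pd.concat(values, keys=key_index, names=key_index.names, axis=0)
print(type(stacked).__name__)    # Series, indexed by a two-level MultiIndex

# unstack() pivots the inner index level into columns, producing a frame.
unstacked = stacked.unstack()
print(type(unstacked).__name__)  # DataFrame
```

This only isolates the type change. It does not show which branch pandas takes for a given input, since that depends on the checks elided above.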
Not quite sure how we want to fix the bug (additional checks, etc.), but this is my first time contributing so would love to help any way I can!
Comment From: rhshadrach
I believe this is a duplicate of #31063.
The issue is that apply attempts to transform the result differently depending on whether all the indices are the same or some are different. This leads to different behavior when there is a single row versus multiple rows. Certainly this is something that needs to be improved on; PRs are always welcome!
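Until that is resolved, one possible workaround is to skip apply's result wrapping entirely: iterate over the groups and concatenate the per-group Series by hand. This is a sketch, not something proposed in this thread. It reuses the reporter's data and aggregation, and the results dict is a name introduced only for illustration. It may also give up the speed advantage the reporter measured:

```python
import pandas as pd


def calcscore(group):
    # Same aggregation as in the report, using bracket access for columns.
    return pd.Series(
        [[",".join(group["item"]), group["score"].sum()]],
        index=pd.Index([group.index[0]]),
    )


df = pd.DataFrame(
    columns=["time", "item", "param", "score"],
    data=[[1, "N1", 10, 100], [1, "N2", 20, 150], [2, "N3", 70, 300]],
)

results = {}  # hypothetical container, keyed by the threshold p
for p in [5, 50]:
    filtered = df[df["param"] > p]
    # Call calcscore once per group, then concatenate the pieces along axis 0.
    # Concatenating Series along axis 0 yields a Series, however many groups
    # survive the filter.
    pieces = [calcscore(group) for _, group in filtered.groupby("time")]
    results[p] = pd.concat(pieces)
    print(type(results[p]).__name__, len(results[p]))
```

Because pd.concat is called directly on a list of Series, the return type does not depend on how many groups there are, though an empty filter result would need separate handling.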
Comment From: rhshadrach
Closing as a duplicate