A small, complete example of the issue
import pandas as pd
import numpy as np
dd = dict(a=np.arange(9), b=np.repeat(np.arange(3), 3))
df = pd.DataFrame(dd)
# Problematic line below
df.groupby('b', as_index=False).apply(np.std)
# The line below has the same problem
df.groupby('b', as_index=False).std()
# When as_index=False is not passed in, it works as expected.
# The following two lines work exactly as expected
# (An example of the group-by/apply working for some operations)
# df.groupby('b', as_index=False).apply(np.mean)
# df.groupby('b', as_index=False).mean()
Expected Output
          a    b
0  0.816497  0.0
1  0.816497  1.0
2  0.816497  2.0
Actual Output
          a    b
0  0.816497  0.0
1  0.816497  0.0
2  0.816497  0.0
When as_index=False is passed into groupby, finding the standard deviation doesn't work as expected. When as_index=True, everything works as expected.
Finding the mean works as expected in both cases.
I have been able to reproduce the problem also on Linux with the same version of pandas.
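As a side note (an editor's sketch, not part of the original report): grouping with the default as_index=True, applying np.std only to the 'a' column, and then calling reset_index appears to produce the expected output, with the group labels kept in the 'b' column.

import numpy as np
import pandas as pd

df = pd.DataFrame(dict(a=np.arange(9), b=np.repeat(np.arange(3), 3)))

# Group with the default as_index=True so 'b' becomes the index, apply
# np.std (ddof=0) to column 'a' only, then move 'b' back into a column.
out = df.groupby('b')['a'].apply(np.std).reset_index()
# 'b' holds the group labels 0, 1, 2 and 'a' is 0.816497 for every group.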
Output of pd.show_versions()
Comment From: chris-b1
Yeah, I don't think as_index=False is that well tested; xref #13217, though that seems to be a separate issue. PR to fix welcome!
Comment From: jorisvandenbossche
Note that the two cases (.apply(np.std) and .std()) give a different result:
In [59]: df.groupby('b', as_index=False).apply(np.std)
Out[59]:
          a    b
0  0.816497  0.0
1  0.816497  0.0
2  0.816497  0.0

In [60]: df.groupby('b', as_index=False).std()
Out[60]:
          b    a
0  0.000000  1.0
1  1.000000  1.0
2  1.414214  1.0
I think the .apply(np.std) case is actually correct. It also applies the function to the 'b' column, and since 'b' is constant within each group, those values are all 0's. So it seems it is mainly the 'b' column in the .std() case that is wrong.
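For reference, a quick check of this reading (an editor's sketch, not from the thread): within each group the 'b' column is constant, so applying np.std to it gives 0 for every group.

import numpy as np
import pandas as pd

df = pd.DataFrame(dict(a=np.arange(9), b=np.repeat(np.arange(3), 3)))

# 'b' is constant within each group, so its per-group np.std is 0.0,
# matching the all-zero 'b' column in the .apply(np.std) output above.
print(df.groupby('b')['b'].apply(np.std))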
Comment From: jreback
these have different degrees of freedom
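To spell that out (an editor's note, not from the thread): np.std defaults to ddof=0 while pandas' std defaults to ddof=1, which accounts for the 0.816 vs 1.0 values for column 'a':

import numpy as np
import pandas as pd

vals = [0, 1, 2]              # the 'a' values of one group
print(np.std(vals))           # 0.816497 -> population std, ddof=0
print(pd.Series(vals).std())  # 1.0      -> sample std, ddof=1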
Comment From: jorisvandenbossche
@jreback Yes, that explains the 1 vs 0.816, but not the 0.00, 1.00, 1.41 for the b column.
Comment From: jreback
this is a dupe of #10355
Comment From: xieyuheng
https://github.com/pandas-dev/pandas/pull/25315