Code Sample, a copy-pastable example if possible
import pandas as pd
import numpy as np
d = {'l': ['left', 'right', 'left', 'right', 'left', 'right'],
     'r': ['right', 'left', 'right', 'left', 'right', 'left'],
     'v': [-1, 1, -1, 1, -1, np.nan]}
df = pd.DataFrame(d)
Problem description
When a grouped dataframe contains a value of np.NaN, the output is not aligned with numpy.sum or pandas.Series.sum: the expected result is NaN, as given by the skipna=False flag for pd.Series.sum and also pd.DataFrame.sum
In [235]: df.v.sum(skipna=False)
Out[235]: nan
However, this behavior is not reflected in the pandas.DataFrame.groupby object
In [237]: df.groupby('l')['v'].sum()['right']
Out[237]: 2.0
and cannot be forced by applying the np.sum method directly
In [238]: df.groupby('l')['v'].apply(np.sum)['right']
Out[238]: 2.0
See this StackOverflow post for a workaround.
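One such workaround (a sketch of the apply-based approach, using the example frame from above) is to call Series.sum(skipna=False) inside apply, so the NaN-propagating Series method runs on each group:

```python
import numpy as np
import pandas as pd

d = {'l': ['left', 'right', 'left', 'right', 'left', 'right'],
     'r': ['right', 'left', 'right', 'left', 'right', 'left'],
     'v': [-1, 1, -1, 1, -1, np.nan]}
df = pd.DataFrame(d)

# apply() hands each group to the lambda as a Series, so skipna=False is honored
result = df.groupby('l')['v'].apply(lambda s: s.sum(skipna=False))
print(result['right'])  # nan -- the missing value propagates
print(result['left'])   # -3.0
```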
Expected Output
In [238]: df.groupby('l')['v'].apply(np.sum)['right']
Out[238]: nan
and
In [237]: df.groupby('l')['v'].sum(skipna=False)['right']
Out[237]: nan
Output of pd.show_versions()
Comment From: jreback
np.sum is translated directly to the pandas sum, so this is as expected. This is not useful in any way IMHO, but if you really, really want the behavior of np.sum:
In [15]: df.groupby('l')['v'].apply(lambda x: np.sum(np.array(x)))['right']
Out[15]: nan
or
In [18]: df.groupby('l')['v'].apply(lambda x: x.sum(skipna=False))
Out[18]:
l
left -3.0
right NaN
The pass-through skipna parameter is not implemented ATM on groupby, so I will make an issue for that (thought we had one).
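Until such a keyword exists, one way to get the same effect without a per-group Python call (a sketch, not an official API; the names sums, has_nan, and result are illustrative) is to take the ordinary grouped sum and then blank out any group that contains a NaN:

```python
import numpy as np
import pandas as pd

d = {'l': ['left', 'right', 'left', 'right', 'left', 'right'],
     'v': [-1, 1, -1, 1, -1, np.nan]}
df = pd.DataFrame(d)

sums = df.groupby('l')['v'].sum()                # fast cythonized sum; skips NaN
has_nan = df['v'].isna().groupby(df['l']).any()  # which groups contain any NaN
result = sums.mask(has_nan)                      # NaN out those groups, like skipna=False
print(result)
```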
Comment From: jreback
xref https://github.com/pandas-dev/pandas/issues/15675
Comment From: flipdazed
"This is not useful in any way IMHO."
Consider the following:
- merge two dataframes together from an SQL database
- carry out data manipulation to get an aggregated figure per some index value that is dependent on all values being present

Should a database entry be missing for some aggregated index value, the final figure should be returned as NaN, as missing data is an extremely common occurrence in industry.
Either way, I agree with your assessment that it is inconsistent with the current implementation, as skipna should be a kwarg. Many thanks for creating the new issue.
Comment From: jreback
@flipdazed pandas propagates NaN values on purpose. Generally on aggregations you want to skip them. If you don't, there are many options (e.g. look at .filter on groupby, or simply .fillna), but that is far less common than simply aggregating.
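For instance, .filter can drop the incomplete groups before aggregating (a sketch on the example frame from above; the name complete is illustrative):

```python
import numpy as np
import pandas as pd

d = {'l': ['left', 'right', 'left', 'right', 'left', 'right'],
     'v': [-1, 1, -1, 1, -1, np.nan]}
df = pd.DataFrame(d)

# keep only the groups whose 'v' column is fully observed
complete = df.groupby('l').filter(lambda g: g['v'].notna().all())
print(complete.groupby('l')['v'].sum())  # only 'left' survives, with sum -3.0
```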
"Should a database entry be missing for some aggregated index value, the final figure should be returned as NaN as missing data is an extremely common occurrence in industry"

You really think so? Sounds like you don't handle missing data at all.
Comment From: jorisvandenbossche
"You really think so?"
Jeff, there are certainly cases imaginable where you don't want to ignore missing values. And therefore we have that as a keyword.
Comment From: jreback
"There are certainly cases imaginable where you don't want to ignore missing values. And therefore we have that as a keyword."

Sure, but in the vast, vast majority of cases you want to skipna. It's very uncommon in fact to assume ALL data is valid; that is my point.
Comment From: ajosanchez
It would be nice to have a keyword and get those NAs back in this case: groupby(['x']).resample('D').sum()
When I resample after a groupby I need an aggregating function. It would be nice not to get zeros when I increase the resolution, so that I can use .fillna(method='ffill').
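One way to get NaN rather than 0 for the empty bins today, so that a forward-fill has something to fill, is the min_count parameter of the resampler's sum (a sketch on made-up data; the frame and the names out and filled are illustrative):

```python
import numpy as np
import pandas as pd

# made-up example: group 'a' has a one-day gap on 2021-01-02
idx = pd.to_datetime(['2021-01-01', '2021-01-03', '2021-01-01'])
df = pd.DataFrame({'x': ['a', 'a', 'b'], 'v': [1.0, 2.0, 5.0]}, index=idx)

# min_count=1 turns the sum of an empty bin into NaN instead of 0
out = df.groupby('x')['v'].resample('D').sum(min_count=1)

# now there are gaps for forward-fill to fill
filled = out.ffill()
print(filled)
```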