Code Sample, a copy-pastable example if possible
import pandas as pd
import numpy as np
d = {'l': ['left', 'right', 'left', 'right', 'left', 'right'],
     'r': ['right', 'left', 'right', 'left', 'right', 'left'],
     'v': [-1, 1, -1, 1, -1, np.nan]}
df = pd.DataFrame(d)
Problem description
When a grouped dataframe contains a value of np.NaN, the output is not aligned with numpy.sum or pandas.Series.sum: the expected result is NaN, as given by the skipna=False flag for pd.Series.sum and also pd.DataFrame.sum
In [235]: df.v.sum(skipna=False)
Out[235]: nan
However, this behavior is not reflected in the pandas.DataFrame.groupby object
In [237]: df.groupby('l')['v'].sum()['right']
Out[237]: 2.0
and cannot be forced by applying the np.sum method directly
In [238]: df.groupby('l')['v'].apply(np.sum)['right']
Out[238]: 2.0
See this StackOverflow post for a workaround.
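One such workaround (a sketch of the apply-based approach, using the example frame from above) is to call Series.sum(skipna=False) inside apply, so the NaN-propagating Series method runs on each group:

```python
import numpy as np
import pandas as pd

d = {'l': ['left', 'right', 'left', 'right', 'left', 'right'],
     'r': ['right', 'left', 'right', 'left', 'right', 'left'],
     'v': [-1, 1, -1, 1, -1, np.nan]}
df = pd.DataFrame(d)

# apply() hands each group to the lambda as a Series, so skipna=False is honored
result = df.groupby('l')['v'].apply(lambda s: s.sum(skipna=False))
print(result['right'])  # nan -- the missing value propagates
print(result['left'])   # -3.0
```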
Expected Output
In [238]: df.groupby('l')['v'].apply(np.sum)['right']
Out[238]: nan
and
In [237]: df.groupby('l')['v'].sum(skipna=False)['right']
Out[237]: nan
Output of pd.show_versions()
Comment From: jreback
np.sum is translated directly to the pandas sum, so this is as expected. This is not useful in any way IMHO, but if you really, really want the behavior of np.sum:
In [15]: df.groupby('l')['v'].apply(lambda x: np.sum(np.array(x)))['right']
Out[15]: nan
or
In [18]: df.groupby('l')['v'].apply(lambda x: x.sum(skipna=False))
Out[18]:
l
left -3.0
right NaN
The pass-through skipna parameter is not implemented ATM on groupby, so I will make an issue for that (thought we had one).
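Until such a keyword exists, one way to get the same effect without a per-group Python call (a sketch, not an official API; the names sums, has_nan, and result are illustrative) is to take the ordinary grouped sum and then blank out any group that contains a NaN:

```python
import numpy as np
import pandas as pd

d = {'l': ['left', 'right', 'left', 'right', 'left', 'right'],
     'v': [-1, 1, -1, 1, -1, np.nan]}
df = pd.DataFrame(d)

sums = df.groupby('l')['v'].sum()                # fast cythonized sum; skips NaN
has_nan = df['v'].isna().groupby(df['l']).any()  # which groups contain any NaN
result = sums.mask(has_nan)                      # NaN out those groups, like skipna=False
print(result)
```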
Comment From: jreback
xref https://github.com/pandas-dev/pandas/issues/15675
Comment From: flipdazed
"This is not useful in any way IMHO."
Consider the following:
- merge two dataframes together from an SQL database
- carry out data manipulation to get an aggregated figure per some index value that is dependent on all values being present

Should a database entry be missing for some aggregated index value, the final figure should be returned as NaN, as missing data is an extremely common occurrence in industry.
Either way, I agree with your assessment that it is inconsistent with the current implementation, as skipna should be a kwarg. Many thanks for creating the new issue.
Comment From: jreback
@flipdazed pandas propagates NaN values on purpose. Generally on aggregations you want to skip them. If you don't, there are many options (e.g. look at .filter on groupby, or simply .fillna), but that is far less common than simply aggregating.
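For instance, .filter can drop the incomplete groups before aggregating (a sketch on the example frame from above; the name complete is illustrative):

```python
import numpy as np
import pandas as pd

d = {'l': ['left', 'right', 'left', 'right', 'left', 'right'],
     'v': [-1, 1, -1, 1, -1, np.nan]}
df = pd.DataFrame(d)

# keep only the groups whose 'v' column is fully observed
complete = df.groupby('l').filter(lambda g: g['v'].notna().all())
print(complete.groupby('l')['v'].sum())  # only 'left' survives, with sum -3.0
```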
"Should a database entry be missing for some aggregated index value, the final figure should be returned as NaN as missing data is an extremely common occurrence in industry"

You really think so? Sounds like you don't handle missing data at all.
Comment From: jorisvandenbossche
"You really think so?"
Jeff, there are certainly cases imaginable where you don't want to ignore missing values. And therefore we have that as a keyword.
Comment From: jreback
"There are certainly cases imaginable where you don't want to ignore missing values. And therefore we have that as a keyword."

Sure, but in the vast, vast majority of cases you want to skipna. It's very uncommon in fact to assume ALL data is valid; that is my point.
Comment From: ajosanchez
It would be nice to have a keyword and get those NAs back in this case: groupby(['x']).resample('D').sum()
When I resample after a groupby I need an aggregating function. It would be nice not to get zeros when I increase the resolution, so that I can use .fillna(method='ffill').
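One way to get NaN rather than 0 for the empty bins today, so that a forward-fill has something to fill, is the min_count parameter of the resampler's sum (a sketch on made-up data; the frame and the names out and filled are illustrative):

```python
import numpy as np
import pandas as pd

# made-up example: group 'a' has a one-day gap on 2021-01-02
idx = pd.to_datetime(['2021-01-01', '2021-01-03', '2021-01-01'])
df = pd.DataFrame({'x': ['a', 'a', 'b'], 'v': [1.0, 2.0, 5.0]}, index=idx)

# min_count=1 turns the sum of an empty bin into NaN instead of 0
out = df.groupby('x')['v'].resample('D').sum(min_count=1)

# now there are gaps for forward-fill to fill
filled = out.ffill()
print(filled)
```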