Code Sample, a copy-pastable example if possible

Current (0.21) incorrect behavior:

>>> import pandas as pd
>>> pd.__version__
u'0.21.0'
>>> pd.Series().sum()
nan

Old (0.20.3) correct behavior:

>>> import pandas as pd
>>> pd.__version__
u'0.20.3'
>>> pd.Series().sum()
0

Problem description

Sum of an empty series should be 0, not nan, because otherwise the following invariant is violated:

pd.concat([s1,s2]).sum() == s1.sum() + s2.sum()
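
A concrete check under 0.21.0 (s2 is empty, so its nan sum poisons the right-hand side):

>>> s1, s2 = pd.Series([1.0, 2.0]), pd.Series()
>>> pd.concat([s1, s2]).sum()
3.0
>>> s1.sum() + s2.sum()
nan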

Expected Output

pd.Series().sum() should return 0, as in 0.20.3.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.14.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: C
LOCALE: None.None

pandas: 0.21.0
pytest: None
pip: 9.0.1
setuptools: 36.6.0
Cython: None
numpy: 1.13.3
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 5.5.0
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: 1.0.2
lxml: None
bs4: None
html5lib: 1.0b10
sqlalchemy: 1.1.15
pymysql: None
psycopg2: 2.7.3.2 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Comment From: max-sixty

https://github.com/pandas-dev/pandas/pull/17630

Comment From: jreback

Please read the what's new: http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#sum-prod-of-all-nan-series-dataframes-is-now-consistently-nan

pandas has always had this behavior

Comment From: jreback

Note that your invariant holds with sum-of-empty = NaN; otherwise you lose information.

Comment From: sam-s

The invariant no longer holds:

>>> pd.concat([pd.Series([1]),pd.Series()]).equals(pd.Series([1]))
True

thus the following should be True:

pd.concat([pd.Series([1]),pd.Series()]).sum() == pd.Series([1]).sum() + pd.Series().sum()

which was True in 0.20.3 but is now False.

Comment From: jorisvandenbossche

pandas has always had this behavior

@jreback this is not true; pandas always had the 0 behaviour for the sum of an empty series (we broke that behaviour on purpose, for sure, but it was a breaking change).

pd.concat([pd.Series([1]),pd.Series()]).sum() == pd.Series([1]).sum() + pd.Series().sum()

which was True in 0.20.3 but is now False.

This only evaluates to False because the + between scalars (0 + np.nan) does not skip NaNs the way the sum of a pandas Series does (if you sum a series holding [0, np.nan], you get 0). So I am not sure your comparison really holds.
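
Concretely (with numpy imported as np):

>>> pd.Series([0, np.nan]).sum()   # the Series sum skips the NaN
0.0
>>> 0 + np.nan                     # the scalar + propagates it
nan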

@sam-s we had long discussions about it, and both 0 and NaN behaviours have pros and cons, but in the end we needed to make a decision, which became NaN. You can read this discussion in https://github.com/pandas-dev/pandas/issues/9422

Comment From: sam-s

Okay, I am sure it's too late for me to weep and scream. However, how do I check that there are no NaNs in a DataFrame? I used df.isnull().sum().sum() == 0 before. What do I do now?

Comment From: jorisvandenbossche

I used df.isnull().sum().sum() == 0 before.

Doesn't that still work?

In [30]: pd.Series([1, 2, 3]).isnull().sum()
Out[30]: 0

In [31]: pd.Series([1, 2, np.nan]).isnull().sum()
Out[31]: 1

Or can you give a concrete example?

Comment From: jorisvandenbossche

Okay, I am sure it's too late for me to weep and scream.

Probably yes, but you can still raise your voice in https://github.com/pandas-dev/pandas/issues/9422. However, I think it is mainly interesting to hear how the change affects code (what you need to do to work around it) and how that can be made easier.

Comment From: sam-s

+ vs sum

@jorisvandenbossche: you appear to be saying that + for scalars is somehow a different beast than sum for collections (lists/sets/series &c). This is so, well, unexpected, that I neglected to address it until you wrote in #9422 "... two different kinds of sums ...". I cannot comment there anymore, so I will reply here. Again, I understand that math is not your only rationale, but I beg you to remember that the bulk of your customers are applied mathematicians; what I am about to say is how we think about these issues, and it is why we will be screaming bloody murder about sum([]) == null till the end of the world and back.

When we have an associative binary operation f, such as +, we can define it on lists with at least two elements like this:

f(a,b,c,d,...z) := f(...(f(f(f(a,b),c),d),...),z)

or, in a more familiar infix notation

a+b+c+d+...+z := (((((a+b)+c)+d)+...)+z)

When the operation has a unit u, i.e., f(u,x) = f(x,u) = x for any x (e.g., 0 for + or 1 for *), we can extend the list operation to lists of any length, because

f(a,b,c,...z) = f(...(f(f(f(u,a),b),c),...),z)

and now f(a) = f(u,a) = a and f() = u. (Note, parenthetically, that strictly speaking + is a binary operation, yet you are not arguing that the sum of a one-element list is undefined. Why not? From this point of view, how is that different from the sum of a zero-element list?)

This definition is natural for any associative binary operation OP with a neutral element UNIT, and it works like this (this is a mathematical definition identical to the above):

class Collection(list):
  def apply(self, OP, UNIT):
    # fold OP over the elements, starting from its neutral element UNIT
    result = UNIT
    for x in self:
      result = OP(result, x)
    return result

Clearly [x].apply(+, 0) == x and [].apply(+, 0) == 0.
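
(For reference, Python's functools.reduce is exactly this fold when given an explicit initial value:)

>>> from functools import reduce
>>> from operator import add
>>> reduce(add, [1, 2, 3], 0)
6
>>> reduce(add, [], 0)   # the empty fold returns the unit
0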

This is fully applicable to every associative binary operation with a neutral element, for example:

  • + and 0
  • * and 1
  • max and -inf
  • min and inf
  • and/all and True
  • or/any and False

The bottom line: OP([]) = null is wrong because it breaks this algebra (the unit law, and with it summation by chunks), and your customers expect it to hold the same way you expect your C compiler to compile a += 0 so that a does not change.
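
Python's own builtins follow this convention:

>>> sum([])    # unit of +
0
>>> all([])    # unit of and
True
>>> any([])    # unit of or
False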

Comment From: sam-s

Any vs All

The NaN/null behavior should be defined for collections which contain some bad (NaN/null) data; whether the collection also contains valid data is irrelevant. E.g., in the presence of null, the above definition becomes

import numpy as np
import pandas as pd

class Collection(list):
  def apply(self, OP, UNIT, propagate_null=False):
    result = UNIT
    for x in self:
      if propagate_null and pd.isnull(x):
        return np.nan   # a single null poisons the whole result
      result = OP(result, x)
    return result

Clearly apply will return null if there IS a null, not if there are NO non-nulls.
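
A quick check of this sketch, using operator.add as OP and 0 as UNIT:

>>> from operator import add
>>> Collection([1.0, np.nan]).apply(add, 0, propagate_null=True)
nan
>>> Collection([]).apply(add, 0, propagate_null=True)
0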

@shoyer writes in #9422: "This is reasonable from a mathematical perspective, but again is not the choice made by databases. In databases (and pandas) nulls can be introduced into empty results quite easily from joins, and in general there is no careful distinction."

The choice made by databases is a mistake made by the designers of SQL, as explained by @kenahoo. There is no reason for pandas to repeat the mistake.

Please do not perpetuate a mistake made by others. You are not beholden to them.

Comment From: sam-s

Finally, note that the above argument applies only to associative operations; std and mean are not such operations. You can make an argument for both

  • mean([]) = std([]) = std([x]) = NaN

and

  • mean([]), std([]), std([x]) --> exception
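
For what it's worth, pandas already takes the first option here:

>>> pd.Series().mean()
nan
>>> pd.Series([1.0]).std()   # sample std of a single value is undefined
nan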

Both are legitimate design choices and, while I personally prefer the second one, I defer to you.

However, sum([])=nan is not a legitimate design decision because it breaks the contract of addition (yes, associativity is a part of the contract).
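
To make the broken contract concrete under 0.21.0, split a series into chunks, one of them empty; the chunk sums no longer add up to the total:

>>> s = pd.Series([1.0, 2.0, 3.0])
>>> s.sum()
6.0
>>> s[:0].sum() + s[:2].sum() + s[2:].sum()   # the empty chunk contributes nan
nan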