In [1]: import pandas as pd
In [2]: pd.Series().product()
Out[2]: nan
I expected:
Out[2]: 1
http://en.wikipedia.org/wiki/Empty_product
Comment From: jreback
hmm, that is different from numpy...
Want to do a pull request?
Comment From: ischwabacher
xref #7869
If you want to fix this, the problem is in the bottleneck_switch decorator in nanops.py; otherwise, you can wait for me to get to it. That decorator's constructor takes a zero_value argument for an operation to use when the array it's operating on is empty, but then uses 0 in any case. We should rename the parameter to empty_value.
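For readers, a minimal sketch of the pattern being described (the names empty_value_switch and nanprod here are illustrative, not the actual nanops internals):

import functools
import numpy as np

def empty_value_switch(empty_value):
    # Illustrative decorator: return `empty_value` for empty input
    # instead of hard-coding 0.
    def decorator(func):
        @functools.wraps(func)
        def wrapper(values, *args, **kwargs):
            values = np.asarray(values)
            if values.size == 0:
                return empty_value
            return func(values, *args, **kwargs)
        return wrapper
    return decorator

@empty_value_switch(empty_value=1)
def nanprod(values):
    # Product that skips NaNs; the decorator supplies the empty case.
    return np.nanprod(values)

nanprod([])          # 1, the empty product
nanprod([2.0, 3.0])  # 6.0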
Comment From: anderslundstedt
I do not feel comfortable fixing this myself, so I will wait. Thanks for the prompt responses!
Comment From: anderslundstedt
Oops, accidentally closed. Now open again, I hope. (I am not very experienced with GitHub.)
Comment From: cpcloud
I can take this.
Comment From: cpcloud
@anderslundstedt @jreback Fixing this to be consistent with numpy breaks tested behavior of TimeGrouper aggregations. See tseries/tests/test_resample.py:TestTimeGrouper.test_aggregate_with_nat for the test that breaks. Is this an API that should be broken?
Comment From: jreback
I would just fix that test ('prod') as it's 'wrong' too
Comment From: cpcloud
I did that in the PR.
Comment From: cpcloud
Whoops, forgot to xref this issue.
Comment From: jorisvandenbossche
@cpcloud Can you give an example of a resample that would break?
E.g., would the NaN in the following example become 1?
In [2]: import numpy as np
In [3]: s1 = pd.Series(np.arange(5), index=pd.date_range('2012-01-01', periods=5))
In [4]: s2 = pd.Series(np.arange(5), index=pd.date_range('2012-01-10', periods=5))
In [5]: s = pd.concat([s1, s2])
In [10]: s.resample('2D', how='prod')
Out[10]:
2012-01-01 0
2012-01-03 6
2012-01-05 4
2012-01-07 NaN
2012-01-09 0
2012-01-11 2
2012-01-13 12
Freq: 2D, dtype: float64
Comment From: ischwabacher
Do we care that type(pd.Series([], dtype='float64').product()) is still int?
Comment From: cpcloud
@jorisvandenbossche no, if the group key is nan (or nat) then the value at that key will be 1 instead of nan or nat. Using your example s:
In [44]: s
Out[44]:
2012-01-01 0
2012-01-02 1
2012-01-03 2
2012-01-04 3
2012-01-05 4
2012-01-10 0
2012-01-11 1
2012-01-12 2
2012-01-13 3
2012-01-14 4
dtype: int64
In [45]: df = s.reset_index(name='col').rename(columns={'index': 'date'})
In [46]: df.loc[2, 'date'] = pd.NaT
In [47]: df
Out[47]:
date col
0 2012-01-01 0
1 2012-01-02 1
2 NaT 2
3 2012-01-04 3
4 2012-01-05 4
5 2012-01-10 0
6 2012-01-11 1
7 2012-01-12 2
8 2012-01-13 3
9 2012-01-14 4
In [48]: df.groupby(pd.TimeGrouper(key='date', freq='D')).prod()
Out[48]:
col
date
2012-01-01 0
2012-01-02 1
2012-01-03 1
2012-01-04 3
2012-01-05 4
... ...
2012-01-10 0
2012-01-11 1
2012-01-12 2
2012-01-13 3
2012-01-14 4
[14 rows x 1 columns]
Comment From: jorisvandenbossche
@cpcloud Hmm, I don't fully get that. What is the difference? When the third row is set to NaT in your example, this actually means that the 2012-01-03 row is removed, and so is missing, just like the others (2012-01-06 to 2012-01-09). So if that date gives 1 in the result, shouldn't the others as well?
With 0.14.1, your example gives:
In [8]: df.groupby(pd.TimeGrouper(key='date', freq='D')).prod()
Out[8]:
col
date
2012-01-01 0
2012-01-02 1
2012-01-03 NaN
2012-01-04 3
2012-01-05 4
2012-01-06 NaN
2012-01-07 NaN
2012-01-08 NaN
2012-01-09 NaN
2012-01-10 0
2012-01-11 1
2012-01-12 2
2012-01-13 3
2012-01-14 4
So the NaN in the third row becomes 1 according to your example, but not the other NaN values? That seems inconsistent to me. But if they all become 1, that also seems like a rather large API break, no?
Comment From: cpcloud
Those will be 1 as well; I just have my display max rows set to 10. I'll post the full example in a bit.
Comment From: cpcloud
Yes, it's a breaking change. I'm not sure it's the way to go, as I'm not sure it brings any benefits other than consistency with numpy and Wikipedia. OTOH, I would've expected more breakage from changing the zero-value default in nanops.
Comment From: jorisvandenbossche
In [23]: np.array([]).prod()
Out[23]: 1.0
In [24]: np.array([np.nan]).prod()
Out[24]: nan
So would it in some way be possible to regard the example above as a product of NaN (which stays NaN) instead of an empty product?
(Warning, a very naive view without looking at the actual code follows: you could follow a reasoning like first 'resample' and add the missing dates with NaN values, and only then group by to reduce with the product operator.)
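A sketch of that reading (the function name is invented; this is not pandas code): a truly empty group yields the empty product, while a group that exists but holds only NaN stays NaN:

import numpy as np

def prod_skipna(values):
    values = np.asarray(values, dtype=float)
    if values.size == 0:
        return 1.0                # truly empty: the empty product
    if np.isnan(values).all():
        return np.nan             # group exists but is all-missing
    return np.nanprod(values)     # otherwise skip NaNs and multiply

prod_skipna([])        # 1.0
prod_skipna([np.nan])  # nan
prod_skipna([2, 3])    # 6.0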
Comment From: anderslundstedt
@cpcloud It is more than "consistency with Wikipedia"; it is a matter of consistency with any mathematical identity involving a possibly empty product.
Comment From: cpcloud
It was a tongue-in-cheek comment. I don't like introducing "valid" values where previously there were NaNs. This seems like it would break a ton of existing code. @anderslundstedt what operations are you doing that allowed you to find this inconsistency?
Comment From: anderslundstedt
Well, I want to take the product of a possibly empty series. The context is that I want to compute historical stock prices adjusted for dividends etc.; when there has been no dividend after the date of the price, this involves an empty product.
Comment From: cpcloud
Can you show some code and data in an example? Thanks.
Comment From: anderslundstedt
I do not see the point and I am not able to share the code I work with. My initial post basically sums up the problem.
It is not difficult to work around, so it is not a problem (for me) if you do not change the behaviour. I just thought it would be nice if empty products equaled one, but I understand that you may not want to break existing code.
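For illustration, a hypothetical sketch of the kind of computation described above (invented data and names; not the poster's actual code):

import pandas as pd

# One adjustment factor per ex-dividend date.
factors = pd.Series([0.98, 0.97],
                    index=pd.to_datetime(['2014-03-01', '2014-09-01']))

def adjusted_price(price, price_date):
    # Multiply by every factor dated after the price; when no dividend
    # follows, the product is empty and should be 1, leaving the price
    # unchanged. With the behaviour reported above it is nan instead.
    later = factors[factors.index > price_date]
    return price * later.product()

adjusted_price(100.0, pd.Timestamp('2014-12-01'))  # wants 100.0, gets nan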
Comment From: jorisvandenbossche
What I wanted to say with my comment above: the fact that in a resample operation with product, missing dates end up as 1 or NaN is more an "implementation detail" of resample than an insurmountable consequence of the mathematical identity of an empty product, as you can also see the missing dates in resample as a product of NaN, which is NaN.
So it would maybe be possible to change the result of Series([]).product() to 1 while keeping the same result in resample/TimeGrouper (giving NaNs)? (Again, without knowing the code.)
Note: in R, an empty product also gives 1 by definition.
Comment From: cpcloud
That's easy to do; we can just special-case the empty product and leave the resample code as is.
Comment From: anderslundstedt
Yes, I meant empty products of series that are empty in the strongest sense (neither values nor missing values). I have no opinion on how one would best treat products of a series with only missing values.
Comment From: jreback
As an aside, you can always do s.fillna(1).product() to guarantee that you can do some sort of multiplication.
Comment From: anderslundstedt
@jreback I do not know if you meant that it would solve the original case, which it does not:
In [1]: import pandas as pd
In [2]: pd.Series().fillna(1).product()
Out[2]: nan
Comment From: jreback
@anderslundstedt hmm, an empty series is a special case then, I guess. OK @cpcloud, why don't we just fix the scalar case? What does that do to the resample case? (Obviously the test case has to change.)
Comment From: cpcloud
This is far less trivial than I thought, mostly because arith methods are added after the class is defined, so I can't override it properly. I guess I'll just have to monkey patch it. I have a big refactor of this arith stuff that defines them on NDFrame abstractly so that a subclass can override them for specific cases if necessary.
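Roughly, monkey patching here would mean replacing the method on the class after the fact; a toy sketch of the pattern (not the actual fix):

import pandas as pd

_original_product = pd.Series.product  # keep the method as defined

def _product(self, *args, **kwargs):
    if len(self) == 0:
        return 1  # special-case the empty product
    return _original_product(self, *args, **kwargs)

# Swap the methods in after the class (and its arith methods) exist.
pd.Series.product = _product
pd.Series.prod = _product

pd.Series([], dtype='float64').product()  # 1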
Comment From: jreback
@cpcloud status?
Comment From: TomAugspurger
This is being handled in https://github.com/pandas-dev/pandas/issues/18678