In [1]: import pandas as pd
In [2]: pd.Series().product()
Out[2]: nan
I expected:
Out[2]: 1
http://en.wikipedia.org/wiki/Empty_product
Comment From: jreback
hmm, that is different from numpy...
Want to do a pull request?
Comment From: ischwabacher
xref #7869
If you want to fix this, the problem is in the bottleneck_switch decorator in nanops.py; otherwise, you can wait for me to get to it. That decorator's constructor takes a zero_value argument for an operation to use when the array it's operating on is empty, but then uses 0 in any case. We should rename the parameter to empty_value.
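For readers, a minimal sketch of the pattern being described (the names empty_value_switch and nanprod here are illustrative, not the actual nanops internals):

import functools
import numpy as np

def empty_value_switch(empty_value):
    # Illustrative decorator: return `empty_value` for empty input
    # instead of hard-coding 0.
    def decorator(func):
        @functools.wraps(func)
        def wrapper(values, *args, **kwargs):
            values = np.asarray(values)
            if values.size == 0:
                return empty_value
            return func(values, *args, **kwargs)
        return wrapper
    return decorator

@empty_value_switch(empty_value=1)
def nanprod(values):
    # Product that skips NaNs; the decorator supplies the empty case.
    return np.nanprod(values)

nanprod([])          # 1, the empty product
nanprod([2.0, 3.0])  # 6.0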
Comment From: anderslundstedt
I do not feel comfortable fixing this myself, so I will wait. Thanks for the prompt responses!
Comment From: anderslundstedt
Oops, accidentally closed. Now open again, I hope. (I am not very experienced with GitHub.)
Comment From: cpcloud
I can take this.
Comment From: cpcloud
@anderslundstedt @jreback Fixing this to be consistent with numpy breaks tested behavior of TimeGrouper aggregations. See tseries/tests/test_resample.py:TestTimeGrouper.test_aggregate_with_nat for the test that breaks. Is this an API that should be broken?
Comment From: jreback
I would just fix that test ('prod') as it's 'wrong' too
Comment From: cpcloud
I did that in the PR.
Comment From: cpcloud
Whoops, forgot to xref this issue.
Comment From: jorisvandenbossche
@cpcloud Can you give an example of a resample that would break?
E.g., would the NaN in the following example become 1?
In [2]: import numpy as np
In [3]: s1 = pd.Series(np.arange(5), index=pd.date_range('2012-01-01', periods=5))
In [4]: s2 = pd.Series(np.arange(5), index=pd.date_range('2012-01-10', periods=5))
In [5]: s = pd.concat([s1, s2])
In [10]: s.resample('2D', how='prod')
Out[10]:
2012-01-01 0
2012-01-03 6
2012-01-05 4
2012-01-07 NaN
2012-01-09 0
2012-01-11 2
2012-01-13 12
Freq: 2D, dtype: float64
Comment From: ischwabacher
Do we care that type(pd.Series([], dtype='float64').product()) is still int?
Comment From: cpcloud
@jorisvandenbossche no, if the group key is nan (or nat) then the value at that key will be 1 instead of nan or nat. Using your example s:
In [44]: s
Out[44]:
2012-01-01 0
2012-01-02 1
2012-01-03 2
2012-01-04 3
2012-01-05 4
2012-01-10 0
2012-01-11 1
2012-01-12 2
2012-01-13 3
2012-01-14 4
dtype: int64
In [45]: df = s.reset_index(name='col').rename(columns={'index': 'date'})
In [46]: df.loc[2, 'date'] = pd.NaT
In [47]: df
Out[47]:
date col
0 2012-01-01 0
1 2012-01-02 1
2 NaT 2
3 2012-01-04 3
4 2012-01-05 4
5 2012-01-10 0
6 2012-01-11 1
7 2012-01-12 2
8 2012-01-13 3
9 2012-01-14 4
In [48]: df.groupby(pd.TimeGrouper(key='date', freq='D')).prod()
Out[48]:
col
date
2012-01-01 0
2012-01-02 1
2012-01-03 1
2012-01-04 3
2012-01-05 4
... ...
2012-01-10 0
2012-01-11 1
2012-01-12 2
2012-01-13 3
2012-01-14 4
[14 rows x 1 columns]
Comment From: jorisvandenbossche
@cpcloud Hmm, I don't fully get that. What is the difference? When the third row is set to NaT in your example, this actually means that the 2012-01-03 row is removed, and so is missing, just like the others (2012-01-06 to 2012-01-09). So if that date gives 1 in the result, shouldn't the others as well?
With 0.14.1, your example gives:
In [8]: df.groupby(pd.TimeGrouper(key='date', freq='D')).prod()
Out[8]:
col
date
2012-01-01 0
2012-01-02 1
2012-01-03 NaN
2012-01-04 3
2012-01-05 4
2012-01-06 NaN
2012-01-07 NaN
2012-01-08 NaN
2012-01-09 NaN
2012-01-10 0
2012-01-11 1
2012-01-12 2
2012-01-13 3
2012-01-14 4
So the NaN in the third row becomes 1 according to your example, but not the other NaN values? That seems inconsistent to me. But if they all become 1, that also seems like a rather large API break, no?
Comment From: cpcloud
Those will be 1 as well; I just have my display max rows set to 10. I'll post the full example in a bit.
Comment From: cpcloud
Yes, it's a breaking change. I'm not sure it's the way to go, as I'm not sure it brings any benefits other than consistency with numpy and Wikipedia. OTOH, I would've expected more breakage from changing the zero-value default in nanops.
Comment From: jorisvandenbossche
In [23]: np.array([]).prod()
Out[23]: 1.0
In [24]: np.array([np.nan]).prod()
Out[24]: nan
So would it in some way be possible to regard the example above as a product of NaN (which stays NaN) instead of an empty product?
(Warning, a very naive view without looking at the actual code follows: you could follow a reasoning like first 'resample' and add the missing dates with NaN values, and only then group by to reduce with the product operator.)
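A sketch of that reading (the function name is invented; this is not pandas code): a truly empty group yields the empty product, while a group that exists but holds only NaN stays NaN:

import numpy as np

def prod_skipna(values):
    values = np.asarray(values, dtype=float)
    if values.size == 0:
        return 1.0                # truly empty: the empty product
    if np.isnan(values).all():
        return np.nan             # group exists but is all-missing
    return np.nanprod(values)     # otherwise skip NaNs and multiply

prod_skipna([])        # 1.0
prod_skipna([np.nan])  # nan
prod_skipna([2, 3])    # 6.0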
Comment From: anderslundstedt
@cpcloud It is more than "consistency with Wikipedia"; it is a matter of consistency with any mathematical identity involving a possibly empty product.
Comment From: cpcloud
It was a tongue-in-cheek comment. I don't like introducing "valid" values where previously there were NaNs. This seems like it would break a ton of existing code. @anderslundstedt what operations are you doing that allowed you to find this inconsistency?
Comment From: anderslundstedt
Well, I want to take the product of a possibly empty series. The context is that I want to compute historical stock prices adjusted for dividends etc.; when there has been no dividend after the date of the price, this involves an empty product.
Comment From: cpcloud
Can you show some code and data in an example? Thanks.
Comment From: anderslundstedt
I do not see the point and I am not able to share the code I work with. My initial post basically sums up the problem.
It is not difficult to work around, so it is not a problem (for me) if you do not change the behaviour. I just thought it would be nice if empty products equaled one, but I understand that you may not want to break existing code.
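For illustration, a hypothetical sketch of the kind of computation described above (invented data and names; not the poster's actual code):

import pandas as pd

# One adjustment factor per ex-dividend date.
factors = pd.Series([0.98, 0.97],
                    index=pd.to_datetime(['2014-03-01', '2014-09-01']))

def adjusted_price(price, price_date):
    # Multiply by every factor dated after the price; when no dividend
    # follows, the product is empty and should be 1, leaving the price
    # unchanged. With the behaviour reported above it is nan instead.
    later = factors[factors.index > price_date]
    return price * later.product()

adjusted_price(100.0, pd.Timestamp('2014-12-01'))  # wants 100.0, gets nan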
Comment From: jorisvandenbossche
What I wanted to say with my comment above: the fact that in a resample operation with product, missing dates end up as 1 or NaN is more an "implementation detail" of resample than an insurmountable consequence of the mathematical identity of an empty product, as you can also see the missing dates in resample as a product of NaN, which is NaN.
So it would maybe be possible to change the result of Series([]).product() to 1 while keeping the same result in resample/TimeGrouper (giving NaNs)? (Again, without knowing the code.)
Note: in R, an empty product also gives 1 by definition.
Comment From: cpcloud
That's easy to do; we can just special-case the empty product and leave the resample code as is.
Comment From: anderslundstedt
Yes, I meant empty products of series that are empty in the strongest sense (neither values nor missing values). I have no opinion on how one would best treat products of a series with only missing values.
Comment From: jreback
As an aside, you can always do s.fillna(1).product() to guarantee that you can do some sort of multiplication.
Comment From: anderslundstedt
@jreback I do not know if you meant that it would solve the original case, which it does not:
In [1]: import pandas as pd
In [2]: pd.Series().fillna(1).product()
Out[2]: nan
Comment From: jreback
@anderslundstedt hmm, an empty series is a special case then, I guess. OK @cpcloud, why don't we just fix the scalar case? What does that do to the resample case? (Obviously the test case has to change.)
Comment From: cpcloud
This is far less trivial than I thought, mostly because arith methods are added after the class is defined, so I can't override it properly. I guess I'll just have to monkey patch it. I have a big refactor of this arith stuff that defines them on NDFrame abstractly so that a subclass can override them for specific cases if necessary.
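Roughly, monkey patching here would mean replacing the method on the class after the fact; a toy sketch of the pattern (not the actual fix):

import pandas as pd

_original_product = pd.Series.product  # keep the method as defined

def _product(self, *args, **kwargs):
    if len(self) == 0:
        return 1  # special-case the empty product
    return _original_product(self, *args, **kwargs)

# Swap the methods in after the class (and its arith methods) exist.
pd.Series.product = _product
pd.Series.prod = _product

pd.Series([], dtype='float64').product()  # 1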
Comment From: jreback
@cpcloud status?
Comment From: TomAugspurger
This is being handled in https://github.com/pandas-dev/pandas/issues/18678