Code
import pandas as pd
s1 = pd.Series([1, 2, 3, 4], dtype='Int64')
s2 = pd.Series([1, 2, float('nan'), 3, 4], dtype='Int64')
cs1 = s1.cumsum()
cs2 = s2.cumsum()
print(str(s1.dtype), str(cs1.dtype))
print(str(s2.dtype), str(cs2.dtype))
Output
Int64 object
Int64 object
Expected Output
Int64 Int64
Int64 Int64
Problem description
After an operation like a cumulative sum on a column/series with an Int64 dtype, the dtype of the result is cast to object. The contents of the result (integers and NaNs) still qualify for an Int64 dtype.
cummax, cummin and cumprod have the same behaviour.
Comment From: TomAugspurger
I don't think that cumulative operations are currently part of the extension array interface.
Likely, we need something similar to ExtensionArray._reduce, but for cumulative methods.
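As a rough illustration of that idea, here is a minimal sketch of a toy masked integer container with a reduction hook and a sibling hook for cumulative operations. MaskedInts, the hook name _accumulate, the operation names, and the choice to skip missing entries are all assumptions made for this example; none of it is the pandas extension array interface.

```python
import operator


class MaskedInts:
    """Toy stand-in for a nullable integer array: values plus a mask."""

    def __init__(self, values, mask):
        self.values = list(values)
        self.mask = list(mask)  # True marks a missing entry

    def _reduce(self, name):
        # Reduction hook: collapse to a scalar, skipping missing entries.
        present = [v for v, m in zip(self.values, self.mask) if not m]
        return {"sum": sum, "min": min, "max": max}[name](present)

    def _accumulate(self, name):
        # Hypothetical cumulative hook: return the same container type,
        # so the result keeps its nullable integer nature.
        op = {
            "cumsum": operator.add,
            "cumprod": operator.mul,
            "cummin": min,
            "cummax": max,
        }[name]
        out, running = [], None
        for v, m in zip(self.values, self.mask):
            if m:
                out.append(0)  # placeholder; the mask marks it as missing
                continue
            running = v if running is None else op(running, v)
            out.append(running)
        return MaskedInts(out, self.mask)

    def to_list(self):
        # Render missing entries as None for display.
        return [None if m else v for v, m in zip(self.values, self.mask)]


arr = MaskedInts([1, 2, 0, 3, 4], [False, False, True, False, False])
result = arr._accumulate("cumsum")
print(type(result).__name__, result.to_list())  # MaskedInts [1, 3, None, 6, 10]
```

The point of the sketch is only that the cumulative hook returns the array's own type, whereas a reduction hook returns a scalar.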
Comment From: datajanko
Python's itertools contains an accumulate function, so I'd suggest ExtensionArray._accumulate and would like to take a shot at this.
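For readers unfamiliar with it, the naming analogy can be seen in a short standard-library example. It shows itertools.accumulate producing the four running aggregates discussed in this issue on a plain list; the sample data is made up for illustration and no pandas types are involved.

```python
from itertools import accumulate
import operator

data = [3, 1, 4, 1, 5]

# With no function argument, accumulate computes running sums.
cumsum = list(accumulate(data))                 # [3, 4, 8, 9, 14]
# Any binary function can be supplied for other running aggregates.
cumprod = list(accumulate(data, operator.mul))  # [3, 3, 12, 12, 60]
cummax = list(accumulate(data, max))            # [3, 3, 4, 4, 5]
cummin = list(accumulate(data, min))            # [3, 1, 1, 1, 1]

print(cumsum, cumprod, cummax, cummin)
```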
Comment From: datajanko
@TomAugspurger would this direction/naming be okay?
Is the assumption correct that I'd have to, e.g., create a nancumsum function in core/nanops.py?
Comment From: jorisvandenbossche
We might need to think a bit more in general about this problem. Adding _accumulate is fine for cumulative methods, but do we want to keep adding such methods for other kinds of functions (e.g. the np.round that came up before)?
Do we want to use, e.g., the __array_function__ protocol for this? Or a similar mechanism specifically for pandas?
Now for this specific case, the numpy protocol is actually not sufficient, as numpy only has cumsum and cumprod, and not cummin/cummax.
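To make the asymmetry concrete, the sketch below shows what NumPy offers for each of the four operations. Running sums and products exist as top-level functions, while a running minimum or maximum is reached through the accumulate method of the minimum and maximum ufuncs. The sample array is invented for illustration, and whether that ufunc route would suit an override mechanism in pandas is not something this thread settles.

```python
import numpy as np

a = np.array([3, 1, 4, 1, 5])

# Top-level functions exist for running sums and products.
running_sum = np.cumsum(a)    # [3, 4, 8, 9, 14]
running_prod = np.cumprod(a)  # [3, 3, 12, 12, 60]

# There is no np.cummin or np.cummax; the ufunc accumulate method
# gives the equivalent result.
running_min = np.minimum.accumulate(a)  # [3, 1, 1, 1, 1]
running_max = np.maximum.accumulate(a)  # [3, 3, 4, 4, 5]

print(running_sum, running_prod, running_min, running_max)
```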
Comment From: mwaskom
This may be obvious to those who know the implementation details, but it happens with .diff() too (on pandas 0.25.2):
pd.Series([1, 2, 3], dtype="Int16").diff()
0 NaN
1 1
2 1
dtype: object
Comment From: jorisvandenbossche
Series.diff is actually different: it has a custom implementation in pandas that could be expanded to handle our nullable dtypes. I think it is worth opening a separate issue for that.