Code

import pandas as pd

s1 = pd.Series([1, 2, 3, 4], dtype='Int64')
s2 = pd.Series([1, 2, float('nan'), 3, 4], dtype='Int64')

cs1 = s1.cumsum()
cs2 = s2.cumsum()

print(str(s1.dtype), str(cs1.dtype))
print(str(s2.dtype), str(cs2.dtype))

Output

Int64 object
Int64 object

Expected Output

Int64 Int64
Int64 Int64

Problem description

After a cumulative operation such as cumsum on a column/Series with the nullable Int64 dtype, the result is cast to object dtype, even though its contents (integers and NaNs) would still fit an Int64 dtype.

cummax, cummin and cumprod have the same behaviour.
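
As a stop-gap until the extension dtypes handle this natively, one possible workaround (a sketch only; note that the round trip through float64 loses exactness for integers beyond 2**53) is to run the cumulative operation in float64 and cast back:

import pandas as pd

s = pd.Series([1, 2, float('nan'), 3, 4], dtype='Int64')

# Do the cumulative sum in float64 (NaN is skipped, the running total continues),
# then cast the result back to the nullable integer dtype.
cs = s.astype('float64').cumsum().astype('Int64')
print(cs.dtype)  # Int64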

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.7.4.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.0.13-arch1-1-ARCH
machine          : x86_64
processor        :
byteorder        : little
LC_ALL           : en_US.utf8
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 0.25.1
numpy            : 1.17.0
pytz             : 2019.2
dateutil         : 2.8.0
pip              : 19.2.3
setuptools       : 41.1.0
Cython           : 0.29.13
pytest           : 5.1.0
hypothesis       : None
sphinx           : 2.2.0
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.4.1
html5lib         : 1.0.1
pymysql          : None
psycopg2         : None
jinja2           : 2.10.1
IPython          : 7.7.0
pandas_datareader: None
bs4              : 4.8.0
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : 4.4.1
matplotlib       : 3.1.1
numexpr          : 2.7.0
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pytables         : None
s3fs             : None
scipy            : 1.3.1
sqlalchemy       : None
tables           : None
xarray           : 0.12.3
xlrd             : 1.2.0
xlwt             : None
xlsxwriter       : None

Comment From: TomAugspurger

I don't think that cumulative operations are currently part of the extension array interface.

Likely, we need something similar to ExtensionArray._reduce, but for cumulative methods.
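
For illustration only, here is a rough sketch of what such a hook could look like, mirroring the name-based dispatch of ExtensionArray._reduce; the class, the method name _accumulate, and the data/mask layout below are assumptions for the sketch, not an existing API:

import numpy as np

class ToyMaskedIntArray:
    # Hypothetical stand-in for a nullable integer array:
    # integer data plus a boolean missing-value mask.
    def __init__(self, data, mask):
        self._data = np.asarray(data, dtype='int64')
        self._mask = np.asarray(mask, dtype=bool)

    def _accumulate(self, name, skipna=True, **kwargs):
        # Name-based dispatch, analogous to ExtensionArray._reduce(name, ...).
        ops = {
            'cumsum': (np.cumsum, 0),
            'cumprod': (np.cumprod, 1),
            'cummax': (np.maximum.accumulate, np.iinfo('int64').min),
            'cummin': (np.minimum.accumulate, np.iinfo('int64').max),
        }
        func, neutral = ops[name]
        # skipna=True: replace missing entries with a neutral element so they do
        # not affect the running result, then keep those positions masked in the
        # output. (Propagating missingness for skipna=False is omitted here.)
        filled = np.where(self._mask, neutral, self._data)
        return type(self)(func(filled), self._mask.copy())

For example, ToyMaskedIntArray([1, 2, 0, 3], [False, False, True, False])._accumulate('cumsum')._data gives array([1, 3, 3, 6]), with the third position still flagged as missing.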

Comment From: datajanko

Python's itertools contains an accumulate function, so I'd suggest ExtensionArray._accumulate and would like to take a shot at this.
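
To illustrate the idea (pure Python, nothing pandas-specific; the na_skipping helper is made up for this sketch): itertools.accumulate generalises all four operations through a binary step function, and a missing-value-aware step can simply skip None. A real IntegerArray implementation would additionally re-mask the missing positions in the result.

from itertools import accumulate

def na_skipping(op):
    # Binary step that ignores missing inputs: a None value leaves the
    # running result unchanged, so it does not poison the accumulation.
    def step(acc, value):
        if value is None:
            return acc
        if acc is None:
            return value
        return op(acc, value)
    return step

values = [1, 2, None, 3, 4]
print(list(accumulate(values, na_skipping(lambda a, b: a + b))))  # [1, 3, 3, 6, 10]
print(list(accumulate(values, na_skipping(max))))                 # [1, 2, 2, 3, 4]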

Comment From: datajanko

@TomAugspurger would this direction/naming be okay?

Is my assumption correct that I'd have to, e.g., create a nancumsum function in core/nanops.py?
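
If so, a minimal sketch of such a helper might look like the following (hypothetical, not an existing nanops function; note that NumPy's own np.nancumsum is not quite what pandas needs, since it treats NaN as zero instead of keeping those positions missing):

import numpy as np

def nancumsum(values, skipna=True):
    # Hypothetical nanops-style helper: values is a float ndarray with NaN for missing.
    mask = np.isnan(values)
    if not skipna:
        return np.cumsum(values)  # NaN propagates through the running sum
    result = np.cumsum(np.where(mask, 0.0, values))
    result[mask] = np.nan  # keep the missing positions missing in the result
    return result

print(nancumsum(np.array([1.0, 2.0, np.nan, 3.0])))  # [ 1.  3. nan  6.]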

Comment From: jorisvandenbossche

We might need to think a bit more generally about this problem. Adding _accumulate is fine for cumulative methods, but do we want to keep adding such methods for other kinds of functions (e.g. the np.round that came up before)? Do we want to, e.g., use the __array_function__ protocol for this? Or a similar mechanism specifically for pandas?

Now, for this specific case, the NumPy protocol is actually not sufficient, as NumPy only has cumsum and cumprod, and not cummin/cummax.
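
For reference, this is roughly how the NEP 18 dispatch would look for the operations NumPy does expose (a toy sketch; the class is made up, and as noted above there is no np.cummin/np.cummax to hook into):

import numpy as np

class DispatchingArray:
    # Toy array-like that intercepts NumPy functions via __array_function__ (NEP 18).
    def __init__(self, data):
        self._data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        # Only np.cumsum is handled in this sketch; everything else defers.
        if func is np.cumsum:
            return DispatchingArray(np.cumsum(self._data))
        return NotImplemented

arr = DispatchingArray([1, 2, 3])
print(np.cumsum(arr)._data)  # [1 3 6] -- dispatched through __array_function__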

Comment From: mwaskom

This may be obvious to those who know the implementation details, but it happens with .diff() too (on pandas 0.25.2):

pd.Series([1, 2, 3], dtype="Int16").diff()
0    NaN
1      1
2      1
dtype: object

Comment From: jorisvandenbossche

Series.diff is actually different; it has a custom implementation in pandas that could be expanded to handle our nullable dtypes. I think it is worth opening a separate issue for that.