Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
di = pd.DataFrame({'A':[1, 2, 4],'B':[3, 4, -5]}, dtype='Int64')
di.sum().dtypes # dtype('int64')
di.sum(axis=1).dtypes # dtype('float64') ??
di.T.sum().T.dtypes # dtype('int64')
di.T.sum().T - di.sum(axis=1)
# 0 0.0
# 1 0.0
# 2 0.0
# dtype: float64
df = pd.DataFrame({'A':[1, 2, pd.NA, 4],'B':[3, pd.NA, 4, -5]}, dtype="Int64")
assert ((df.sum().dtypes == 'int64') and
        (df.sum(axis=1).dtypes == 'float64') and
        (df.T.sum().T.dtypes == 'int64'))
dg = pd.DataFrame({'A':[1, 2, None, 4],'B':[3, None, 4, -5]}, dtype='int') # FutureWarning
assert( (dg.sum().dtypes == 'O') and (dg.sum(axis=1).dtypes == 'float64'))
dh = pd.DataFrame({'A':[1, 2, 4],'B':[3, 4, -5]}, dtype='int')
assert(dh.sum().dtypes == dh.sum(axis=1).dtypes == 'int64')
Issue Description
With normal integers (dh.sum(axis=1); dh.sum()), the obtained sums are integers as well. Once a value is missing, things go odd: with dg.sum() one gets an object dtype, and with dg.sum(axis=1) a float.
The proposed solution for this, the Nullable integer type ('Int64'), only partly works here.
Summing along the index axis (0, the default), di.sum(), returns integer sums, as would be expected (the example above reports dtype 'int64').
But this type is not kept when summing over rows, along the columns-axis.
See the code example: with di.sum(axis=1), the resulting sums have dtype 'float64', not 'Int64'.
Relying on the expected behaviour for axis=0, one can keep an integer dtype after summing rows by means of double transposition (di.T.sum().T).
A pd.NA missing value in a dataframe of dtype 'Int64' also yields a float after summing rows (df.sum(axis=1)).
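As an alternative to the double transposition, the row sums can be cast back to the nullable type. The following is a minimal sketch, not a fix for the underlying behaviour; `row_sum_as_int64` is a hypothetical helper name and not part of pandas.

```python
import pandas as pd


def row_sum_as_int64(frame: pd.DataFrame) -> pd.Series:
    """Sum each row and return the result as nullable 'Int64'.

    Hypothetical helper for illustration, not part of pandas.
    """
    # Where the row sum comes back as float64, the cast restores the
    # nullable integer dtype; where it is already 'Int64', the dtype
    # is left unchanged.
    return frame.sum(axis=1).astype("Int64")


di = pd.DataFrame({"A": [1, 2, 4], "B": [3, 4, -5]}, dtype="Int64")
df = pd.DataFrame({"A": [1, 2, pd.NA, 4], "B": [3, pd.NA, 4, -5]}, dtype="Int64")

row_sums = row_sum_as_int64(di)
row_sums_with_na = row_sum_as_int64(df)
print(row_sums.dtype, row_sums_with_na.dtype)
```

Note that if the intermediate result is float64, integers beyond 2**53 can no longer be represented exactly, so the cast cannot recover precision that was already lost.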
Expected Behavior
Nullability of 'Int64' would mean that integers do not become floats because of other datatypes, or after a normal operation on the dataframe that would not affect integers (such as summing).
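The expectation can be phrased as a check on the result dtype per axis. This is a sketch only; `sums_stay_integer` is a hypothetical helper, and the outcome for axis=1 depends on the installed pandas version, so no expected output is shown.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype


def sums_stay_integer(frame: pd.DataFrame) -> dict:
    """Report, per axis, whether summing the frame yields an integer dtype.

    Hypothetical helper for illustration, not part of pandas.
    """
    return {
        "axis=0": bool(is_integer_dtype(frame.sum(axis=0).dtype)),
        "axis=1": bool(is_integer_dtype(frame.sum(axis=1).dtype)),
    }


di = pd.DataFrame({"A": [1, 2, 4], "B": [3, 4, -5]}, dtype="Int64")
report = sums_stay_integer(di)
print(pd.__version__, report)
```

Under the expected behaviour described here, both entries would be True.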
Installed Versions
INSTALLED VERSIONS
commit : 8dab54d6573f7186ff0c3b6364d5e4dd635ff3e7
python : 3.9.16.final.0
python-bits : 64
OS : Linux
OS-release : 5.19.17
Version : #1 SMP PREEMPT_DYNAMIC Mon Oct 24 13:00:29 CDT 2022
machine : x86_64
processor : Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8
pandas : 1.5.2
numpy : 1.23.5
pytz : 2022.1
dateutil : 2.8.2
setuptools : 65.1.1
pip : 22.2.2
Cython : 0.29.28
pytest : 7.2.0
hypothesis : None
sphinx : 4.5.0
blosc : None
feather : None
xlsxwriter : 3.0.3
lxml.etree : 4.8.0
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 8.7.0
pandas_datareader: None
bs4 : 4.10.0
bottleneck : None
brotli : 1.0.9
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.6.2
numba : None
numexpr : 2.8.3
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.9.1
snappy : None
sqlalchemy : 1.4.45
tables : None
tabulate : None
xarray : 2022.12.0
xlrd : 1.1.0
xlwt : None
zstandard : None
tzdata : None
Comment From: phofl
Hi, thanks for your report. We have a bunch of open issues discussing this for reduction operations. Please search the issue tracker.
Comment From: brobr
Sorry for the bother, but whatever you meant by "discussing this for reduction operations", I did not notice this seemingly quite weird error mentioned among the open issues for 'Int64'.
Could you maybe explain what exactly it was the duplicate of? There is talk of 'reduction operations' in #49603 (which does not mention that summing integer values over one axis should change type; it starts with objects), while #42895, referenced there, concerned the bug that a mean of 'Int64' values did not give a float (which was to be expected). By summing 'Int64' values you would expect to keep the type, or at least to get consistent output.
Possibly all this stuff is programmatically related but I am not too familiar with the inner workings of pandas. In view of the experimental state of "Int64" I hoped this user-feedback would have been helpful.
Please, don't get me wrong, I appreciate pandas enormously (it made an idea possible I had carried around for years before getting any clue how to do it until I learnt a bit of pandas). Keep up the good work.
Comment From: phofl
The underlying reason is 1D eas (extension arrays), as in #42895; the behavior you observed is off for all reduction operations.
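Assuming "eas" is shorthand for pandas extension arrays, the point can be illustrated by inspecting how an 'Int64' frame is stored: each column has its own one-dimensional array, so a reduction along axis=0 stays within one array while a reduction along axis=1 has to combine values across several. The sketch below only inspects the column arrays; the variable names are illustrative.

```python
import pandas as pd

di = pd.DataFrame({"A": [1, 2, 4], "B": [3, 4, -5]}, dtype="Int64")

# Each column is backed by its own one-dimensional extension array.
column_arrays = {name: di[name].array for name in di.columns}
for name, arr in column_arrays.items():
    print(name, type(arr).__name__, arr.ndim)
```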
Comment From: brobr
Thanks, that 1D 'eas' explains it all; sorry I missed that.