In pandas 1.5.x and earlier, the dtype of the result depends on numeric_only when using axis=1:
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, dtype=object)
print(df.sum(axis=1, numeric_only=None))
print(df.sum(axis=1, numeric_only=False))
0 5.0
1 7.0
2 9.0
dtype: float64
0 5
1 7
2 9
dtype: object
When using axis=0, both of the examples above result in object dtype. #49551 changed the numeric_only=False case to return float64 in order to preserve the default behavior. However, it seems better to have the result be object dtype for both examples above instead.
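To check which of these behaviors a given installation exhibits, a small sketch along the following lines may help. The helper name `sum_dtypes` is hypothetical, and only `numeric_only=False` is exercised because `numeric_only=None` is not accepted by every pandas version. No particular result dtype is assumed here.

```python
import pandas as pd


def sum_dtypes(df):
    """Return the dtype of df.sum(...) for each axis (hypothetical helper)."""
    return {axis: df.sum(axis=axis, numeric_only=False).dtype for axis in (0, 1)}


df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, dtype=object)
row_sums = df.sum(axis=1, numeric_only=False)

# The row sums are numerically the same whichever dtype pandas picks.
print(row_sums.tolist())
# The dtypes are what this issue is about; they may differ between versions.
print(sum_dtypes(df))
```

Comparing the two entries of the returned dict shows directly whether axis=0 and axis=1 agree on the installed version.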
Comment From: rhshadrach
I'm now thinking that the result should be int instead.
Here are the results with the BlockManager:
pd.set_option('data_manager', 'block')
df = pd.DataFrame({'a': [1]}, dtype=object)
print(df.sum(axis=0, numeric_only=False))
# a 1
# dtype: object
print(df.sum(axis=0, numeric_only=True))
# Series([], dtype: float64)
print(df.sum(axis=1, numeric_only=False))
# 0 1.0
# dtype: float64
print(df.sum(axis=1, numeric_only=True))
# 0 0.0
# dtype: float64
and the ArrayManager:
pd.set_option('data_manager', 'array')
df = pd.DataFrame({'a': [1]}, dtype=object)
print(df.sum(axis=0, numeric_only=False))
# a 1
# dtype: int64
print(df.sum(axis=0, numeric_only=True))
# Series([], dtype: float64)
print(df.sum(axis=1, numeric_only=False))
# 0 1.0
# dtype: float64
print(df.sum(axis=1, numeric_only=True))
# 0 0.0
# dtype: float64
For df.sum(axis=0, numeric_only=False):
With the ArrayManager, the 1-dim reduction results in a Python object (an integer), which is then inferred to be of integer type when the resulting array is constructed. With the BlockManager, the 2-dim reduction results in a 1-dim NumPy array of object type (which remains object type).
In other words, the ArrayManager naturally needs to infer from Python objects whereas the BlockManager does not. I think these should be consistent, but could see arguments either way. I think I would prefer the BlockManager to attempt to convert from object rather than the ArrayManager casting to object.
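The difference can be imitated outside of pandas internals. The sketch below is only an analogy built on NumPy object arrays and the public Series constructor; it does not call the ArrayManager or BlockManager code paths, and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

values = np.array([[1, 2], [3, 4]], dtype=object)

# 1-dim reductions, one column at a time: each result is a plain Python int.
per_column = [values[:, i].sum() for i in range(values.shape[1])]
print([type(x).__name__ for x in per_column])

# Building a Series from Python ints triggers dtype inference.
inferred = pd.Series(per_column)
print(inferred.dtype)

# 2-dim reduction: NumPy returns a 1-dim object array, and the Series
# constructor keeps that object dtype rather than inferring from the values.
reduced = values.sum(axis=0)
kept = pd.Series(reduced)
print(reduced.dtype, kept.dtype)
```

Calling `kept.infer_objects()` would be one way to make the second path attempt the same conversion as the first.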
For df.sum(axis=1, numeric_only=False):
This result being float looks like a bug to me; we explicitly try to cast to float at the end of DataFrame._reduce, but only when numeric_only=False and axis=1.
cc @jorisvandenbossche @jbrockmendel for any thoughts
Comment From: jbrockmendel
The 1-dim reduction in AM is tricky (also shows up in 1D EAs e.g #42895).
My inclination here is to treat df.sum(axis=1) like df[0] + df[1] + ... + df[N], which we'd expect to retain object dtype in these cases.
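As a rough illustration of that analogy, the column-wise addition can be spelled out with `functools.reduce`. This is a sketch of the comparison being drawn, not how `DataFrame.sum` is implemented; the frame reuses the example from the top of the issue.

```python
import functools
import operator

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, dtype=object)

# Column-wise addition, i.e. df['a'] + df['b'] + ... for every column.
chained = functools.reduce(operator.add, (df[col] for col in df.columns))

print(chained.tolist())
# Arithmetic between object-dtype Series keeps object dtype.
print(chained.dtype)
```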
Comment From: rhshadrach
> The 1-dim reduction in AM is tricky (also shows up in 1D EAs e.g #42895).
In the linked issue, the dtype starts as Int64, so it makes sense that object is wrong. However, here the starting dtype is object; does it make sense for the output to be int as opposed to object?
> My inclination here is to treat df.sum(axis=1) like df[0] + df[1] + ... + df[N], which we'd expect to retain object dtype in these cases.
Thanks, I find this compelling. What about for something like min that can't be expressed in this fashion?
Comment From: rhshadrach
In line with #51205 for groupby, I think object dtypes should remain object across all aggregations.