Pandas BUG: DataFrame reductions with object dtype and axis=1

In pandas 1.5.x and earlier, the result dtype of a reduction over an object-dtype DataFrame with axis=1 depends on the value of numeric_only:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, dtype=object)

print(df.sum(axis=1, numeric_only=None))
# 0    5.0
# 1    7.0
# 2    9.0
# dtype: float64

print(df.sum(axis=1, numeric_only=False))
# 0    5
# 1    7
# 2    9
# dtype: object

With axis=0, both calls above return object dtype. #49551 changed the numeric_only=False case to return float64 in order to preserve the default behavior. However, it seems better for both calls to return object dtype instead.

Comment From: rhshadrach

I'm now thinking that the result should be int instead.

Here are the results with the BlockManager:

pd.set_option('data_manager', 'block')
df = pd.DataFrame({'a': [1]}, dtype=object)

print(df.sum(axis=0, numeric_only=False))
# a    1
# dtype: object

print(df.sum(axis=0, numeric_only=True))
# Series([], dtype: float64)

print(df.sum(axis=1, numeric_only=False))
# 0    1.0
# dtype: float64

print(df.sum(axis=1, numeric_only=True))
# 0    0.0
# dtype: float64

and with the ArrayManager:

pd.set_option('data_manager', 'array')
df = pd.DataFrame({'a': [1]}, dtype=object)

print(df.sum(axis=0, numeric_only=False))
# a    1
# dtype: int64

print(df.sum(axis=0, numeric_only=True))
# Series([], dtype: float64)

print(df.sum(axis=1, numeric_only=False))
# 0    1.0
# dtype: float64

print(df.sum(axis=1, numeric_only=True))
# 0    0.0
# dtype: float64

For df.sum(axis=0, numeric_only=False):

With the ArrayManager, the 1-dim reduction results in a Python object (an integer), which is then inferred to be of integer type when the resulting array is constructed. With the BlockManager, the 2-dim reduction results in a 1-dim NumPy array of object type (which remains object type).

In other words, the ArrayManager naturally needs to infer from Python objects whereas the BlockManager does not. I think these should be consistent, but could see arguments either way. I think I would prefer the BlockManager to attempt to convert from object rather than the ArrayManager casting to object.
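The inference difference described above can be sketched with public constructors alone. This is only an analogy for the two code paths, not the code either manager actually runs, and the variable names are illustrative:

```python
import numpy as np
import pandas as pd

# A 1-dim reduction hands back plain Python scalars; building a Series
# from such scalars lets pandas infer an integer dtype.
from_scalars = pd.Series([1, 2])

# A 2-dim reduction hands back an object-dtype ndarray; pandas keeps
# the dtype it was given and does not re-infer.
from_object_array = pd.Series(np.array([1, 2], dtype=object))

# infer_objects() is an explicit "attempt to convert from object" step.
converted = from_object_array.infer_objects()

print(from_scalars.dtype, from_object_array.dtype, converted.dtype)
```

Making the two managers agree would mean either applying something like the last step on the BlockManager path or skipping inference on the ArrayManager path.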

For df.sum(axis=1, numeric_only=False):

This result being float looks like a bug to me; we explicitly try to cast to float at the end of DataFrame._reduce, but only when numeric_only=False and axis=1.

cc @jorisvandenbossche @jbrockmendel for any thoughts

Comment From: jbrockmendel

The 1-dim reduction in AM is tricky (also shows up in 1D EAs, e.g. #42895).

My inclination here is to treat df.sum(axis=1) like df[0] + df[1] + ... + df[N], which we'd expect to retain object dtype in these cases.
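A small sketch of that analogy, using the frame from the top of the issue and column labels 'a' and 'b' in place of 0..N. The use of functools.reduce is just one way to spell the chained addition, not something pandas does internally:

```python
from functools import reduce
import operator

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, dtype=object)

# Add the columns pairwise: df['a'] + df['b'] + ...
column_sum = reduce(operator.add, (df[col] for col in df.columns))

print(column_sum.dtype)     # object
print(column_sum.tolist())  # [5, 7, 9]
```

Arithmetic between object-dtype Series does not re-infer a numeric dtype, which is why the chained sum stays object.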

Comment From: rhshadrach

> The 1-dim reduction in AM is tricky (also shows up in 1D EAs, e.g. #42895).

In the linked issue, the dtype starts as Int64, so it makes sense that object is wrong. However, here the starting dtype is object; does it make sense for the output to be int as opposed to object?

> My inclination here is to treat df.sum(axis=1) like df[0] + df[1] + ... + df[N], which we'd expect to retain object dtype in these cases.

Thanks, I find this compelling. What about something like min, which can't be expressed in this fashion?

Comment From: rhshadrach

In line with #51205 for groupby, I think object dtypes should remain object across all aggregations.
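To see how far a given installation is from that goal, a quick check along these lines may help. It is a rough sketch: the `observed` dict is a hypothetical helper, the choice of sum, min, and max is arbitrary, and the dtypes printed depend on the pandas version in use:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, dtype=object)

# Result dtype of each reduction along each axis; what gets printed
# depends on the installed pandas version.
observed = {
    (name, axis): str(getattr(df, name)(axis=axis, numeric_only=False).dtype)
    for name in ('sum', 'min', 'max')
    for axis in (0, 1)
}

for (name, axis), dtype in observed.items():
    print(f"df.{name}(axis={axis}, numeric_only=False) -> {dtype}")
```

Under the proposal, every line would report object.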