- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas (1.1.1; latest from conda).
- [ ] (optional) I have confirmed this bug exists on the master branch of pandas.
#### Code Sample, a copy-pastable example

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([{'db': 'J', 'numeric': 123}])
df2 = pd.DataFrame([{'db': 'J'}])
df3 = pd.DataFrame([{'db': 'j'}])
df4 = pd.DataFrame([{'db': 'J', 'numeric': 123},
                    {'db': 'J', 'numeric': 456}])
df5 = pd.DataFrame([{'db': 'J'}], dtype='string')

# Initial columns are the _correct_ types.
df.dtypes

# The mean is complex128, across columns.
df.mean()

# agg('mean') is the same.
df.agg('mean')

# Happens even when the column at issue is alone.
df2.mean()

# Case of the 'J' doesn't matter.
df3.mean()

# Numeric only works as expected.
df.mean(numeric_only=True)

# Two rows work as expected, too.
df4.mean()

# The new StringDtype doesn't appear to matter.
np.mean(df['db'].astype('string').array)
type(np.mean(df['db'].astype('string').array))
df['db'].astype('string').dtype
df5.mean()

# Happens in PandasArray.
np.mean(df['db'].array)
type(np.mean(df['db'].array))

# Not in a numpy array.
np.mean(df['db'].to_numpy())

# Also happens in a Series.
df['db'].mean()
type(df['db'])

# The conversion happens in nanops.nanmean().
pd.core.nanops.nanmean(df['db'])

# Doesn't look like _get_values() is to blame.
pd.core.nanops._get_values(df['db'], True)[0].mean()

# The sum doesn't appear to be it, either.
values = pd.core.nanops._get_values(df['db'], True)[0]
dtype_sum = pd.core.nanops._get_values(df['db'], True)[2]
type(values.sum(None, dtype=dtype_sum))

# The conversion happens in pd.core.nanops._ensure_numeric().
pd.core.nanops._ensure_numeric(values.sum(None, dtype=dtype_sum))

# Interestingly enough, it looks like a sum somewhere is concatenating
# the two 'J's together.
pd.Series(['J', 'J']).astype('complex128').mean()
pd.Series(['J', 'J']).mean()
pd.Series(['J']).mean()

# Here's an example. Then, that sum can't be cast.
pd.core.nanops._get_values(df4['db'], True)[0].sum()
```
With output:
```python
>>> import numpy as np
>>> import pandas as pd
>>>
>>> df = pd.DataFrame([{'db': 'J', 'numeric': 123}])
>>> df2 = pd.DataFrame([{'db': 'J'}])
>>> df3 = pd.DataFrame([{'db': 'j'}])
>>> df4 = pd.DataFrame([{'db': 'J', 'numeric': 123},
...                     {'db': 'J', 'numeric': 456}])
>>> df5 = pd.DataFrame([{'db': 'J'}], dtype='string')
>>>
>>> # Initial columns are the _correct_ types.
>>> df.dtypes
db         object
numeric     int64
dtype: object
>>>
>>> # The mean is complex128, across columns.
>>> df.mean()
db           0.000000+1.000000j
numeric    123.000000+0.000000j
dtype: complex128
>>>
>>> # agg('mean') is the same.
>>> df.agg('mean')
db           0.000000+1.000000j
numeric    123.000000+0.000000j
dtype: complex128
>>>
>>> # Happens even when the column at issue is alone.
>>> df2.mean()
db    0.000000+1.000000j
dtype: complex128
>>>
>>> # Case of the 'J' doesn't matter.
>>> df3.mean()
db    0.000000+1.000000j
dtype: complex128
>>>
>>> # Numeric only works as expected.
>>> df.mean(numeric_only=True)
numeric    123.0
dtype: float64
>>>
>>> # Two rows work as expected, too.
>>> df4.mean()
numeric    289.5
dtype: float64
>>>
>>> # The new StringDtype doesn't appear to matter.
>>> np.mean(df['db'].astype('string').array)
1j
>>> type(np.mean(df['db'].astype('string').array))
<class 'complex'>
>>> df['db'].astype('string').dtype
StringDtype
>>> df5.mean()
Series([], dtype: float64)
>>>
>>> # Happens in PandasArray.
>>> np.mean(df['db'].array)
1j
>>> type(np.mean(df['db'].array))
<class 'complex'>
>>>
>>> # Not in a numpy array.
>>> np.mean(df['db'].to_numpy())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<__array_function__ internals>", line 5, in mean
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 3372, in mean
    return _methods._mean(a, axis=axis, dtype=dtype,
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/numpy/core/_methods.py", line 172, in _mean
    ret = ret / rcount
TypeError: unsupported operand type(s) for /: 'str' and 'int'
>>>
>>> # Also happens in a Series.
>>> df['db'].mean()
1j
>>> type(df['db'])
<class 'pandas.core.series.Series'>
>>>
>>> # The conversion happens in nanops.nanmean().
>>> pd.core.nanops.nanmean(df['db'])
1j
>>>
>>> # Doesn't look like _get_values() is to blame.
>>> pd.core.nanops._get_values(df['db'], True)[0].mean()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/numpy/core/_methods.py", line 172, in _mean
    ret = ret / rcount
TypeError: unsupported operand type(s) for /: 'str' and 'int'
>>>
>>> # The sum doesn't appear to be it, either.
>>> values = pd.core.nanops._get_values(df['db'], True)[0]
>>> dtype_sum = pd.core.nanops._get_values(df['db'], True)[2]
>>> type(values.sum(None, dtype=dtype_sum))
<class 'str'>
>>>
>>> # The conversion happens in pd.core.nanops._ensure_numeric().
>>> pd.core.nanops._ensure_numeric(values.sum(None, dtype=dtype_sum))
1j
>>>
>>> # Interestingly enough, it looks like a sum somewhere is concatenating
>>> # the two 'J's together.
>>> pd.Series(['J', 'J']).astype('complex128').mean()
1j
>>> pd.Series(['J', 'J']).mean()
Traceback (most recent call last):
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/pandas/core/nanops.py", line 1427, in _ensure_numeric
    x = float(x)
ValueError: could not convert string to float: 'JJ'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/pandas/core/nanops.py", line 1431, in _ensure_numeric
    x = complex(x)
ValueError: complex() arg is a malformed string

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/pandas/core/generic.py", line 11459, in stat_func
    return self._reduce(
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/pandas/core/series.py", line 4236, in _reduce
    return op(delegate, skipna=skipna, **kwds)
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/pandas/core/nanops.py", line 71, in _f
    return f(*args, **kwargs)
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/pandas/core/nanops.py", line 129, in f
    result = alt(values, axis=axis, skipna=skipna, **kwds)
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/pandas/core/nanops.py", line 563, in nanmean
    the_sum = _ensure_numeric(values.sum(axis, dtype=dtype_sum))
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/pandas/core/nanops.py", line 1434, in _ensure_numeric
    raise TypeError(f"Could not convert {x} to numeric") from err
TypeError: Could not convert JJ to numeric
>>> pd.Series(['J']).mean()
1j
>>>
>>> # Here's an example. Then, that sum can't be cast.
>>> pd.core.nanops._get_values(df4['db'], True)[0].sum()
'JJ'
```
#### Problem description

It seems like `df.mean()` is being too aggressive about attempting to work with string columns in a one-row dataframe. In my use case, if a query returns a single row, I hit this bug. I have lots of these queries, and even one affected query flows through to the concatenated dataframe downstream.

I first noticed this when I got an error that I couldn't write a parquet file with a complex128 column, and I tracked it back from there.
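Until this is fixed, one workaround sketch (relying only on the `numeric_only` behavior shown above) is to exclude non-numeric columns from the reduction explicitly:

```python
import pandas as pd

# One-row frame with a string column that triggers the bad complex cast.
df = pd.DataFrame([{'db': 'J', 'numeric': 123}])

# Restricting the reduction to numeric columns keeps the result float64,
# so no complex128 values leak into downstream frames or parquet writes.
means = df.mean(numeric_only=True)
print(means)         # only the 'numeric' column survives
print(means.dtype)   # float64
```

This sidesteps the bug rather than fixing it, but it keeps the single-row and multi-row cases consistent.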
#### Expected Output

The expected output is what we get with `numeric_only=True` set, or with more than one row. See above.
#### Output of `pd.show_versions()`
Comment From: rhshadrach
Thanks for the report, PRs to fix are welcome!
Comment From: jtkiley
I just updated the code examples with more exploration of where this actually happens. It's in `pd.core.nanops._ensure_numeric()`:
https://github.com/pandas-dev/pandas/blob/2a7d3326dee660824a8433ffd01065f8ac37f7d6/pandas/core/nanops.py#L1409-L1435
If something passed in is an object dtype, it tries to convert it to `complex128`. That works with `'J'`, but not with `'JJ'` (i.e., what the sum returns in the two-row example; I'm not sure if that's expected, either).
It's not exactly clear to me where the best place to fix it would be. The logic for some of the other functions works differently (often making sure that something can be cast to a float). The standard library `complex()` function does the same conversion of the string `'J'` to complex, though that's an explicit request.
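The stdlib side of this is easy to confirm in isolation. A quick sketch, using a numpy object-dtype sum to stand in for the pandas internals above:

```python
import numpy as np

# Python's complex() parses a bare 'J' as the imaginary unit; this is the
# fallback _ensure_numeric takes for object dtypes.
print(complex('J'))  # 1j

# With two rows, an object-dtype sum concatenates the strings, and the
# result no longer parses as a complex literal.
total = np.array(['J', 'J'], dtype=object).sum()
print(total)  # JJ

try:
    complex(total)
except ValueError as exc:
    print(exc)  # complex() arg is a malformed string
```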
I think the problem is that this behavior is implicit, and I wonder if complex numbers are a common enough use case to justify implicit casting of strings without a warning (cf. the one for datetime means). The use here along with `sum` means that it's often going to get a concatenated string that won't convert, so it'll be inconsistent and less likely to cast as rows increase.
Thoughts on where/how to fix it?
Comment From: biddwan09
take
Comment From: jtkiley
@biddwan09 I poked around a bit more, and here are a couple of thoughts:
- I think the best fix may be to address the issue where `sum` is concatenating strings. That seems like unexpected behavior; instead, I'd expect a reduction on a dataframe not to return a column for that string column. I couldn't quite track down how the `sum` method is implemented to experiment more, but the behavior I describe above doesn't happen with other functions, so this may be the most limited change available.
- An alternative would be to condition that `complex` cast on something less ambiguous. For example, you might require the string to have a digit and a `j`, or, more restrictively, something like `\d+(?:\.\d+)?[+-]\d+(?:\.\d+)?[jJ]`. That moves it out of alignment with the standard library, but this is both implicit and pretty deep in the weeds, so I think there's a fair rationale for doing so.
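A minimal sketch of the second approach, using the stricter pattern above (a hypothetical helper, not the actual pandas implementation):

```python
import re

# Only accept strings that look like a full complex literal, e.g. '1+2j'.
# A bare 'J' or a concatenated 'JJ' no longer qualifies.
_COMPLEX_RE = re.compile(r'\d+(?:\.\d+)?[+-]\d+(?:\.\d+)?[jJ]')

def strict_ensure_numeric(x):
    """Convert x to a number, refusing ambiguous strings like 'J'."""
    if isinstance(x, str):
        try:
            return float(x)
        except ValueError:
            if _COMPLEX_RE.fullmatch(x):
                return complex(x)
            raise TypeError(f"Could not convert {x} to numeric")
    return x

print(strict_ensure_numeric('123'))   # 123.0
print(strict_ensure_numeric('1+2j'))  # (1+2j)
# strict_ensure_numeric('J') now raises TypeError instead of returning 1j
```

This keeps legitimate complex strings working while making the single-row and multi-row string cases fail consistently.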
I hope that helps!
Comment From: biddwan09
@jtkiley thanks for the approaches. I will try to understand how the `sum` method works and put together a fix from there.
Comment From: mgabs
I had a similar issue on an M1 Mac running macOS 12.4:

```
File ".venv/lib/python3.9/site-packages/pandas/core/nanops.py", line 1629, in _ensure_numeric
    raise TypeError(f"Could not convert {x} to numeric") from err
```

The issue was resolved when I installed `snowflake-connector[pandas]`, which pulls in `pyarrow 6.0`.

I don't claim to be sure of the workaround, or why it resolved the issue. I'd like feedback if someone seeing the same issue managed to work around it the same way.