- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas (1.1.1; latest from conda).
- [ ] (optional) I have confirmed this bug exists on the master branch of pandas.
#### Code Sample, a copy-pastable example

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([{'db': 'J', 'numeric': 123}])
df2 = pd.DataFrame([{'db': 'J'}])
df3 = pd.DataFrame([{'db': 'j'}])
df4 = pd.DataFrame([{'db': 'J', 'numeric': 123},
                    {'db': 'J', 'numeric': 456}])
df5 = pd.DataFrame([{'db': 'J'}], dtype='string')

# Initial columns are the _correct_ types.
df.dtypes

# The mean is complex128, across columns.
df.mean()

# agg('mean') is the same.
df.agg('mean')

# Happens even when the column at issue is alone.
df2.mean()

# Case of the 'J' doesn't matter.
df3.mean()

# Numeric only works as expected.
df.mean(numeric_only=True)

# Two rows work as expected, too.
df4.mean()

# The new StringDtype doesn't appear to matter.
np.mean(df['db'].astype('string').array)
type(np.mean(df['db'].astype('string').array))
df['db'].astype('string').dtype
df5.mean()

# Happens in PandasArray.
np.mean(df['db'].array)
type(np.mean(df['db'].array))

# Not in a numpy array.
np.mean(df['db'].to_numpy())

# Also happens in a Series.
df['db'].mean()
type(df['db'])

# The conversion happens in nanops.nanmean().
pd.core.nanops.nanmean(df['db'])

# Doesn't look like _get_values() is to blame.
pd.core.nanops._get_values(df['db'], True)[0].mean()

# The sum doesn't appear to be it, either.
values = pd.core.nanops._get_values(df['db'], True)[0]
dtype_sum = pd.core.nanops._get_values(df['db'], True)[2]
type(values.sum(None, dtype=dtype_sum))

# The conversion happens in pd.core.nanops._ensure_numeric().
pd.core.nanops._ensure_numeric(values.sum(None, dtype=dtype_sum))

# Interestingly enough, it looks like a sum somewhere is concatenating
# the two 'J's together.
pd.Series(['J', 'J']).astype('complex128').mean()
pd.Series(['J', 'J']).mean()
pd.Series(['J']).mean()

# Here's an example. Then, that sum can't be cast.
pd.core.nanops._get_values(df4['db'], True)[0].sum()
```
With output:
```python
>>> import numpy as np
>>> import pandas as pd
>>>
>>> df = pd.DataFrame([{'db': 'J', 'numeric': 123}])
>>> df2 = pd.DataFrame([{'db': 'J'}])
>>> df3 = pd.DataFrame([{'db': 'j'}])
>>> df4 = pd.DataFrame([{'db': 'J', 'numeric': 123},
...                     {'db': 'J', 'numeric': 456}])
>>> df5 = pd.DataFrame([{'db': 'J'}], dtype='string')
>>>
>>> # Initial columns are the _correct_ types.
>>> df.dtypes
db         object
numeric     int64
dtype: object
>>>
>>> # The mean is complex128, across columns.
>>> df.mean()
db           0.000000+1.000000j
numeric    123.000000+0.000000j
dtype: complex128
>>>
>>> # agg('mean') is the same.
>>> df.agg('mean')
db           0.000000+1.000000j
numeric    123.000000+0.000000j
dtype: complex128
>>>
>>> # Happens even when the column at issue is alone.
>>> df2.mean()
db    0.000000+1.000000j
dtype: complex128
>>>
>>> # Case of the 'J' doesn't matter.
>>> df3.mean()
db    0.000000+1.000000j
dtype: complex128
>>>
>>> # Numeric only works as expected.
>>> df.mean(numeric_only=True)
numeric    123.0
dtype: float64
>>>
>>> # Two rows work as expected, too.
>>> df4.mean()
numeric    289.5
dtype: float64
>>>
>>> # The new StringDtype doesn't appear to matter.
>>> np.mean(df['db'].astype('string').array)
1j
>>> type(np.mean(df['db'].astype('string').array))
<class 'complex'>
>>> df['db'].astype('string').dtype
StringDtype
>>> df5.mean()
Series([], dtype: float64)
>>>
>>> # Happens in PandasArray.
>>> np.mean(df['db'].array)
1j
>>> type(np.mean(df['db'].array))
<class 'complex'>
>>>
>>> # Not in a numpy array.
>>> np.mean(df['db'].to_numpy())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<__array_function__ internals>", line 5, in mean
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 3372, in mean
    return _methods._mean(a, axis=axis, dtype=dtype,
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/numpy/core/_methods.py", line 172, in _mean
    ret = ret / rcount
TypeError: unsupported operand type(s) for /: 'str' and 'int'
>>>
>>> # Also happens in a Series.
>>> df['db'].mean()
1j
>>> type(df['db'])
<class 'pandas.core.series.Series'>
>>>
>>> # The conversion happens in nanops.nanmean().
>>> pd.core.nanops.nanmean(df['db'])
1j
>>>
>>> # Doesn't look like _get_values() is to blame.
>>> pd.core.nanops._get_values(df['db'], True)[0].mean()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/numpy/core/_methods.py", line 172, in _mean
    ret = ret / rcount
TypeError: unsupported operand type(s) for /: 'str' and 'int'
>>>
>>> # The sum doesn't appear to be it, either.
>>> values = pd.core.nanops._get_values(df['db'], True)[0]
>>> dtype_sum = pd.core.nanops._get_values(df['db'], True)[2]
>>> type(values.sum(None, dtype=dtype_sum))
<class 'str'>
>>>
>>> # The conversion happens in pd.core.nanops._ensure_numeric().
>>> pd.core.nanops._ensure_numeric(values.sum(None, dtype=dtype_sum))
1j
>>>
>>> # Interestingly enough, it looks like a sum somewhere is concatenating
>>> # the two 'J's together.
>>> pd.Series(['J', 'J']).astype('complex128').mean()
1j
>>> pd.Series(['J', 'J']).mean()
Traceback (most recent call last):
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/pandas/core/nanops.py", line 1427, in _ensure_numeric
    x = float(x)
ValueError: could not convert string to float: 'JJ'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/pandas/core/nanops.py", line 1431, in _ensure_numeric
    x = complex(x)
ValueError: complex() arg is a malformed string

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/pandas/core/generic.py", line 11459, in stat_func
    return self._reduce(
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/pandas/core/series.py", line 4236, in _reduce
    return op(delegate, skipna=skipna, **kwds)
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/pandas/core/nanops.py", line 71, in _f
    return f(*args, **kwargs)
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/pandas/core/nanops.py", line 129, in f
    result = alt(values, axis=axis, skipna=skipna, **kwds)
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/pandas/core/nanops.py", line 563, in nanmean
    the_sum = _ensure_numeric(values.sum(axis, dtype=dtype_sum))
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/pandas/core/nanops.py", line 1434, in _ensure_numeric
    raise TypeError(f"Could not convert {x} to numeric") from err
TypeError: Could not convert JJ to numeric
>>> pd.Series(['J']).mean()
1j
>>>
>>> # Here's an example. Then, that sum can't be cast.
>>> pd.core.nanops._get_values(df4['db'], True)[0].sum()
'JJ'
```
#### Problem description

It seems like `df.mean()` is being too aggressive about attempting to work with string columns in a one-row dataframe. In my use case, if a query returns a single row, I hit this bug. I have lots of these queries, and even one affected query flows through to the concatenated dataframe downstream.

I first noticed this when I got an error that I couldn't write a parquet file with a complex128 column, and I tracked it back from there.
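Until this is fixed, one workaround sketch (relying only on the `numeric_only` behavior shown above) is to exclude non-numeric columns from the reduction explicitly:

```python
import pandas as pd

# One-row frame with a string column that triggers the bad complex cast.
df = pd.DataFrame([{'db': 'J', 'numeric': 123}])

# Restricting the reduction to numeric columns keeps the result float64,
# so no complex128 values leak into downstream frames or parquet writes.
means = df.mean(numeric_only=True)
print(means)         # only the 'numeric' column survives
print(means.dtype)   # float64
```

This sidesteps the bug rather than fixing it, but it keeps the single-row and multi-row cases consistent.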
#### Expected Output

The expected output is what we get with `numeric_only=True` set, or with more than one row. See above.
#### Output of `pd.show_versions()`
Comment From: rhshadrach
Thanks for the report, PRs to fix are welcome!
Comment From: jtkiley
I just updated the code examples with more exploration of where this actually happens. It's in `pd.core.nanops._ensure_numeric()`:
https://github.com/pandas-dev/pandas/blob/2a7d3326dee660824a8433ffd01065f8ac37f7d6/pandas/core/nanops.py#L1409-L1435
If something passed in is an object dtype, it tries to convert it to `complex128`. That works with `'J'`, but not with `'JJ'` (i.e., what the sum returns in the two-row example; I'm not sure if that's expected, either).
It's not exactly clear to me where the best place to fix it would be. The logic for some of the other functions works differently (often making sure that something can be cast to a float). The standard library `complex()` function does the same conversion of the string `'J'` to complex, though that's an explicit request.
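The stdlib side of this is easy to confirm in isolation. A quick sketch, using a numpy object-dtype sum to stand in for the pandas internals above:

```python
import numpy as np

# Python's complex() parses a bare 'J' as the imaginary unit; this is the
# fallback _ensure_numeric takes for object dtypes.
print(complex('J'))  # 1j

# With two rows, an object-dtype sum concatenates the strings, and the
# result no longer parses as a complex literal.
total = np.array(['J', 'J'], dtype=object).sum()
print(total)  # JJ

try:
    complex(total)
except ValueError as exc:
    print(exc)  # complex() arg is a malformed string
```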
I think the problem is that this behavior is implicit, and I wonder if complex numbers are a common enough use case to justify implicit casting of strings without a warning (cf. the one for datetime means). The use here along with `sum` means that it's often going to get a concatenated string that won't convert, so it'll be inconsistent and less likely to cast as rows increase.
Thoughts on where/how to fix it?
Comment From: biddwan09
take
Comment From: jtkiley
@biddwan09 I poked around a bit more, and here are a couple of thoughts:
- I think the best fix may be to address the issue where `sum` is concatenating strings. That seems like unexpected behavior; instead, I'd expect a reduction on a dataframe not to return a column for that string column. I couldn't quite track down how the `sum` method is implemented to experiment more, but the behavior I describe above doesn't happen with other functions, so this may be the most limited change available.
- An alternative would be to condition that `complex` cast on something less ambiguous. For example, you might require the string to have a digit and a `j`, or, more restrictively, something like `\d+(?:\.\d+)?[+-]\d+(?:\.\d+)?[jJ]`. That moves it out of alignment with the standard library, but this is both implicit and pretty deep in the weeds, so I think there's a fair rationale for doing so.
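A minimal sketch of the second approach, using the stricter pattern above (a hypothetical helper, not the actual pandas implementation):

```python
import re

# Only accept strings that look like a full complex literal, e.g. '1+2j'.
# A bare 'J' or a concatenated 'JJ' no longer qualifies.
_COMPLEX_RE = re.compile(r'\d+(?:\.\d+)?[+-]\d+(?:\.\d+)?[jJ]')

def strict_ensure_numeric(x):
    """Convert x to a number, refusing ambiguous strings like 'J'."""
    if isinstance(x, str):
        try:
            return float(x)
        except ValueError:
            if _COMPLEX_RE.fullmatch(x):
                return complex(x)
            raise TypeError(f"Could not convert {x} to numeric")
    return x

print(strict_ensure_numeric('123'))   # 123.0
print(strict_ensure_numeric('1+2j'))  # (1+2j)
# strict_ensure_numeric('J') now raises TypeError instead of returning 1j
```

This keeps legitimate complex strings working while making the single-row and multi-row string cases fail consistently.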
I hope that helps!
Comment From: biddwan09
@jtkiley thanks for the approaches. I will try to understand how the `sum` method works and put together a fix from there.
Comment From: mgabs
I had a similar issue on an M1 Mac running macOS 12.4:

```
File ".venv/lib/python3.9/site-packages/pandas/core/nanops.py", line 1629, in _ensure_numeric
    raise TypeError(f"Could not convert {x} to numeric") from err
```

The issue was resolved when I installed `snowflake-connector[pandas]`, which pulls in `pyarrow 6.0`.

I don't claim to be sure of the workaround, or why it resolved the issue. I'd like feedback if someone seeing the same issue managed to work around it the same way.