- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the master branch of pandas.
Reproducible Example
import pandas as pd
ser = pd.Series(['1', '2'])
ser.sum() # this returns 12 instead of 3
ser.mean() # this returns 6.0 instead of 1.5
Issue Description
The sum() and mean() methods of Series and DataFrame return wrong values when applied to integers stored as strings. This is silently catastrophic, since the user is left with plausible-looking but erroneous results.
Expected Behavior
>>> import pandas as pd
>>> ser = pd.Series(['1', '2'])
>>> ser.sum()
TypeError: ...
OR
>>> ser.sum()
3
Installed Versions
Comment From: CloseChoice
Why is this unexpected? This is exactly the standard behaviour in Python:
'1' + '2'
Out[2]: '12'
Note that the output of ser.sum() is '12', not 12.
When you apply the mean to the series, we first do a sum, then divide it by the number of summed elements. Somewhere in between, a cast from string to number happens. That is potentially unexpected behaviour.
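The sequence described above can be emulated in plain Python. This is only a sketch of the observable steps (concatenate, cast, divide), not pandas' actual internal code path; the variable names are made up for illustration.

```python
values = ['1', '2']

# Step 1: "summing" strings concatenates them.
concatenated = ''.join(values)            # '12'

# Step 2: the concatenated string is cast to a number.
as_number = float(concatenated)           # 12.0

# Step 3: the result is divided by the number of elements.
emulated_mean = as_number / len(values)   # 6.0

print(concatenated, as_number, emulated_mean)
```

This reproduces the reported 6.0 rather than the 1.5 a user would expect from averaging 1 and 2.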
We might need to adapt the documentation, since it states:
This is equivalent to the method ``numpy.sum``.
Which is not correct, see:
np.sum(['1', '2'])
Traceback (most recent call last):
...
TypeError: cannot perform reduce with flexible type
Edit: Do we want to change the behaviour of Series.mean() and Series.sum() to throw an error in this case? I wouldn't vote for that, but the behaviour of Series.mean() is somewhat strange since it casts implicitly while Series.sum() doesn't.
Comment From: phofl
sum is as expected, since it behaves the same for strings like "a" and "b", but mean is off. Concatenating the strings, then casting to int and dividing by the number of occurrences is definitely not what I would have expected since using characters raises
Comment From: ogahide
Thank you for your comments.
The built-in function sum() and numpy.sum() raise an error when the arguments are strings, so I expected the same from pandas. The current behaviour is OK, but it looks counterintuitive to me.
>>> sum(['1', '2'])
TypeError: unsupported operand type(s) for +: 'int' and 'str'
>>> import numpy as np
>>> arr = np.array(['1', '2'])
>>> np.sum(arr) # arr.sum() returns the same error
TypeError: cannot perform reduce with flexible type
Another problem is that pandas displays an integer and its string form identically (you cannot tell from its appearance whether 123 is an integer or a string), which can blind the user to this pitfall.
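One way to guard against this pitfall, sketched here under the assumption that the strings are meant to be numbers, is to inspect an element's type and convert explicitly with pd.to_numeric before aggregating. The variable names are illustrative only.

```python
import pandas as pd

ser = pd.Series(['1', '2'])

# The printed series does not reveal the element type, but an element does.
first_is_str = isinstance(ser.iloc[0], str)

# Convert explicitly, then aggregate on real numbers.
numeric = pd.to_numeric(ser)
total = numeric.sum()
average = numeric.mean()

print(first_is_str, total, average)
```

By default, pd.to_numeric raises an error on values it cannot parse, so non-numeric strings such as "a" are not converted silently.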
Comment From: shubham11941140
To solve the issue, should I change the behaviour of pandas to match numpy's?
Or should sum of strings keep concatenating, while mean of strings raises an error because dividing strings has no meaning?
Comment From: asishm
sum is as expected, since it behaves the same for strings like "a" and "b" but mean is off.
I would say the behaviour of sum should not be expected either. str.cat exists for precisely this; the equivalent np.array(['a', 'b']).sum() fails.
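For reference, a minimal sketch of the explicit concatenation that str.cat provides, which makes the intent visible instead of relying on sum(); the variable names are illustrative.

```python
import pandas as pd

ser = pd.Series(['a', 'b'])

# Explicit string concatenation of all elements.
joined = ser.str.cat()

# An optional separator can be supplied.
joined_with_sep = ser.str.cat(sep='-')

print(joined, joined_with_sep)
```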