I read in my dataframe with
df = pd.read_csv('df.csv')
and then I run:
df['a'] = pd.to_numeric(df['a'], errors='coerce')
but the column does not get converted. When I use errors='raise' it shows me the values that are not convertible, but with errors='coerce' those should be coerced to NaN. This was working perfectly in pandas 0.19 and I updated to 0.20.3. Did the behavior of to_numeric change between the two versions?
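A minimal sketch of the behavior I'd expect from errors='coerce' (placeholder data, since the real CSV can't be shared):

import pandas as pd

# Unparseable entries should become NaN and the column should end up
# as float64.
s = pd.Series(["1.5", "not a number"])
print(pd.to_numeric(s, errors='coerce'))
# 0    1.5
# 1    NaN
# dtype: float64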
Comment From: jreback
this would need a reproducible example
Comment From: mficek
I think it's duplicate with https://github.com/pandas-dev/pandas/issues/17007
Comment From: gfyoung
@mficek : Potentially, but can't confirm yet
Comment From: jorisvandenbossche
It's certainly not an exact duplicate: the example shown in #17007 also does not work correctly in 0.19, while here it is mentioned that it worked in 0.19.
Comment From: Sarickshah
import pandas as pd
df = pd.DataFrame({'number':["00000_81234523499", "81654839"], 'date':['2017-07-28', '2017-07-29']})
pd.to_numeric(df.number, errors='coerce')
And the numbers stay as strings
Comment From: gfyoung
@Sarickshah : Thanks for this! Could you do us a favor and move your example to your issue description above? Also, if you could provide the output that you're seeing as well as the expected output, that would be great for us as well.
Comment From: jorisvandenbossche
On 0.19 the first one is coerced (which seems expected, since it raises an error on parsing and errors='coerce' turns that into NaN):
In [7]: pd.to_numeric(df.number, errors='coerce')
Out[7]:
0 NaN
1 81654839.0
Name: number, dtype: float64
Comment From: jorisvandenbossche
Actually, this seems to work as well on 0.20.3:
In [1]: df = pd.DataFrame({'number':["00000_81234523499", "81654839"], 'date':['2017-07-28', '2017-07-29']})
...: pd.to_numeric(df.number, errors='coerce')
Out[1]:
0 NaN
1 81654839.0
Name: number, dtype: float64
In [2]: pd.__version__
Out[2]: '0.20.3'
@Sarickshah Can you show the exact output of what you get?
Comment From: jreback
looks fixed in 0.20.3.
Comment From: blakebjorn
Doesn't seem to be fixed; could it be something to do with the Python binaries, if it isn't reproducible? (Windows 7 x64 here)
import pandas as pd
df = pd.DataFrame([{"UPC":"12345678901234567890"},{"UPC":"1234567890"},{"UPC":"ITEM"}])
print(pd.to_numeric(df['UPC'],errors='coerce'))
print(pd.__version__)
0 12345678901234567890
1 1234567890
2 ITEM
Name: UPC, dtype: object
0.20.3
I think it has something to do with the long (> 20 character) number strings. This is taken from a sheet of ~6 million rows. If I do something like:
import numpy as np

def fix_number(e):
    try:
        return float(e)
    except (TypeError, ValueError):
        return np.nan

df['UPC'] = df['UPC'].apply(fix_number)
I get 5.2 million duplicate values. It seems like the function works until it encounters a problematic value ~0.8 million rows in, and then assigns the last valid return value to the remaining 5.2 million rows.
Edit - This works:
print(pd.to_numeric(df2['UPC'].apply(lambda x: x[:19] if len(x)>19 else x),errors='coerce'))
but this doesn't:
print(pd.to_numeric(df2['UPC'].apply(lambda x: x[:20] if len(x)>20 else x),errors='coerce'))
So it looks like any string with 20 or more characters will break errors='coerce' in to_numeric.
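In the meantime, a vectorized workaround sketch (placeholder data; the regex only covers plain non-negative numbers): validate the strings first and cast the valid rows straight to float64, so the uint64 path is never taken.

import numpy as np
import pandas as pd

df = pd.DataFrame({"UPC": ["12345678901234567890", "1234567890", "ITEM"]})

# Keep only strings that look like plain numbers, then cast those to
# float64 directly; everything else stays NaN.
mask = df["UPC"].str.match(r"^\d+(?:\.\d+)?$", na=False)
out = pd.Series(np.nan, index=df.index)
out[mask] = df.loc[mask, "UPC"].astype(np.float64)
print(out)
# 0    1.234568e+19
# 1    1.234568e+09
# 2             NaN
# dtype: float64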
Comment From: jorisvandenbossche
Indeed, that example is not working correctly (both on master and in 0.20.3). The other example is working, though, so the difference indeed seems to be the large number.
So it seems that when the value would be converted to uint64 (instead of int64), errors='coerce' is not working.
Comment From: jorisvandenbossche
In [89]: s = pd.Series(["12345678901234567890", "1234567890", "ITEM"])
In [90]: pd.to_numeric(s, errors='coerce')
Out[90]:
0 12345678901234567890
1 1234567890
2 ITEM
dtype: object
In [91]: s = pd.Series(["12345678901234567890", "1234567890"])
In [92]: pd.to_numeric(s, errors='coerce')
Out[92]:
0 12345678901234567890
1 1234567890
dtype: uint64
So you can see that the parsing of the big value (> 20 chars) itself is working, as the return dtype is uint64. When a NaN has to be introduced, the result should just be converted to float64, as happens with int64.
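For comparison, a minimal sketch of the int64 case, where coercion falls back to float64 as expected:

import pandas as pd

# Values in int64 range plus an invalid entry: the NaN forces a
# float64 result, which is what the uint64 case above should also do.
s = pd.Series(["1234567890", "ITEM"])
print(pd.to_numeric(s, errors='coerce'))
# 0    1.234568e+09
# 1             NaN
# dtype: float64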
Comment From: blakebjorn
I think the biggest point of confusion is that no exception is raised when errors="coerce" fails to coerce anything. As this is more a limitation of the underlying numpy dtypes, I don't think there is a real fix here.
Something simple like a warning would solve the point of confusion, and users would be able to figure out how best to handle it from there, whether that means dropping large numbers from the dataframe or leaving them as objects and manually pruning errors.
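For illustration, a minimal sketch of such a check (coerce_or_warn is a hypothetical wrapper, not an existing pandas API):

import warnings
import pandas as pd

def coerce_or_warn(values):
    # Warn when errors='coerce' silently hands back object dtype
    # instead of a numeric result.
    result = pd.to_numeric(values, errors='coerce')
    if result.dtype == object:
        warnings.warn("to_numeric(errors='coerce') returned object dtype; "
                      "nothing was coerced")
    return result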
I don't think coercing uint64 to float64 is the best way to handle it, and I would go so far as to suggest there should be a warning for int64 -> float64 conversion as well, because anything above 2**53 will create unforeseen problems for people unaware of float64's limitations. For example:
import numpy as np

print("%f" % np.float64(9007199254740992))
print("%f" % np.float64(9007199254740993))
>>>9007199254740992.000000
>>>9007199254740992.000000
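A minimal sketch of checking that round-trip directly:

import numpy as np

# Above 2**53, not every integer survives int64 -> float64 -> int64.
values = np.array([2**53, 2**53 + 1], dtype=np.int64)
roundtrip = values.astype(np.float64).astype(np.int64)
print(roundtrip == values)
# [ True False]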