I read in my dataframe with
df = pd.read_csv('df.csv')
and then I run:
df['a'] = pd.to_numeric(df['a'], errors='coerce')
but the column does not get converted. When I use errors='raise' it shows me the values that are not convertible, but with errors='coerce' those should be coerced to NaN. This was working perfectly in pandas 0.19 and I updated to 0.20.3. Did the behavior of to_numeric change between the two versions?
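A minimal sketch of the behavior I'd expect from errors='coerce' (placeholder data, since the real CSV can't be shared):

import pandas as pd

# Unparseable entries should become NaN and the column should end up
# as float64.
s = pd.Series(["1.5", "not a number"])
print(pd.to_numeric(s, errors='coerce'))
# 0    1.5
# 1    NaN
# dtype: float64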
Comment From: jreback
this would need a reproducible example
Comment From: mficek
I think it's duplicate with https://github.com/pandas-dev/pandas/issues/17007
Comment From: gfyoung
@mficek : Potentially, but can't confirm yet
Comment From: jorisvandenbossche
It's certainly not an exact duplicate: the example shown in #17007 also does not work correctly in 0.19, while here it is mentioned that it worked in 0.19.
Comment From: Sarickshah
import pandas as pd
df = pd.DataFrame({'number':["00000_81234523499", "81654839"], 'date':['2017-07-28', '2017-07-29']})
pd.to_numeric(df.number, errors='coerce')
And the numbers stay as strings
Comment From: gfyoung
@Sarickshah : Thanks for this! Could you do us a favor and move your example to your issue description above? Also, if you could provide the output that you're seeing as well as the expected output, that would be great for us as well.
Comment From: jorisvandenbossche
On 0.19 the first one is coerced (which seems expected, since it raises an error on parsing and errors='coerce' turns that into NaN):
In [7]: pd.to_numeric(df.number, errors='coerce')
Out[7]:
0 NaN
1 81654839.0
Name: number, dtype: float64
Comment From: jorisvandenbossche
Actually, this seems to work as well on 0.20.3:
In [1]: df = pd.DataFrame({'number':["00000_81234523499", "81654839"], 'date':['2017-07-28', '2017-07-29']})
...: pd.to_numeric(df.number, errors='coerce')
Out[1]:
0 NaN
1 81654839.0
Name: number, dtype: float64
In [2]: pd.__version__
Out[2]: '0.20.3'
@Sarickshah Can you show the exact output of what you get?
Comment From: jreback
looks fixed in 0.20.3.
Comment From: blakebjorn
Doesn't seem to be fixed; could it be something to do with the Python binaries, if it isn't reproducible? (Windows 7 x64 here)
import pandas as pd
df = pd.DataFrame([{"UPC":"12345678901234567890"},{"UPC":"1234567890"},{"UPC":"ITEM"}])
print(pd.to_numeric(df['UPC'],errors='coerce'))
print(pd.__version__)
0 12345678901234567890
1 1234567890
2 ITEM
Name: UPC, dtype: object
0.20.3
I think it has something to do with the long (> 20 character) number strings. This is taken from a sheet of ~6 million rows. If I do something like:
import numpy as np

def fix_number(e):
    try:
        return float(e)
    except (TypeError, ValueError):
        return np.nan

df['UPC'] = df['UPC'].apply(fix_number)
I get 5.2 million duplicate values. It seems like the function works until it encounters a problematic value ~0.8 million rows in, and then assigns the last valid return value to the remaining 5.2 million rows.
Edit - This works:
print(pd.to_numeric(df2['UPC'].apply(lambda x: x[:19] if len(x)>19 else x),errors='coerce'))
but this doesn't:
print(pd.to_numeric(df2['UPC'].apply(lambda x: x[:20] if len(x)>20 else x),errors='coerce'))
So it looks like any string with 20 or more characters will break errors='coerce' in to_numeric.
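In the meantime, a vectorized workaround sketch (placeholder data; the regex only covers plain non-negative numbers): validate the strings first and cast the valid rows straight to float64, so the uint64 path is never taken.

import numpy as np
import pandas as pd

df = pd.DataFrame({"UPC": ["12345678901234567890", "1234567890", "ITEM"]})

# Keep only strings that look like plain numbers, then cast those to
# float64 directly; everything else stays NaN.
mask = df["UPC"].str.match(r"^\d+(?:\.\d+)?$", na=False)
out = pd.Series(np.nan, index=df.index)
out[mask] = df.loc[mask, "UPC"].astype(np.float64)
print(out)
# 0    1.234568e+19
# 1    1.234568e+09
# 2             NaN
# dtype: float64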
Comment From: jorisvandenbossche
Indeed, that example is not working correctly (both on master and in 0.20.3). The other example is working, though, so the difference indeed seems to be the large number.
So it seems that when the value would be converted to uint64 (instead of int64), errors='coerce' is not working.
Comment From: jorisvandenbossche
In [89]: s = pd.Series(["12345678901234567890", "1234567890", "ITEM"])
In [90]: pd.to_numeric(s, errors='coerce')
Out[90]:
0 12345678901234567890
1 1234567890
2 ITEM
dtype: object
In [91]: s = pd.Series(["12345678901234567890", "1234567890"])
In [92]: pd.to_numeric(s, errors='coerce')
Out[92]:
0 12345678901234567890
1 1234567890
dtype: uint64
So you can see that the parsing of the big value (> 20 chars) itself is working, as the return dtype is uint64. When a NaN has to be introduced, the result should just be converted to float64, as happens with int64.
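For comparison, a minimal sketch of the int64 case, where coercion falls back to float64 as expected:

import pandas as pd

# Values in int64 range plus an invalid entry: the NaN forces a
# float64 result, which is what the uint64 case above should also do.
s = pd.Series(["1234567890", "ITEM"])
print(pd.to_numeric(s, errors='coerce'))
# 0    1.234568e+09
# 1             NaN
# dtype: float64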
Comment From: blakebjorn
I think the biggest point of confusion is that no exception is raised when errors="coerce" fails to coerce anything. As this is more a limitation of the underlying numpy dtypes, I don't think there is a real fix here.
Something simple like a warning would solve the point of confusion, and users would be able to figure out how best to handle it from there, whether that means dropping large numbers from the dataframe or leaving them as objects and manually pruning errors.
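For illustration, a minimal sketch of such a check (coerce_or_warn is a hypothetical wrapper, not an existing pandas API):

import warnings
import pandas as pd

def coerce_or_warn(values):
    # Warn when errors='coerce' silently hands back object dtype
    # instead of a numeric result.
    result = pd.to_numeric(values, errors='coerce')
    if result.dtype == object:
        warnings.warn("to_numeric(errors='coerce') returned object dtype; "
                      "nothing was coerced")
    return result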
I don't think coercing uint64 to float64 is the best way to handle it, and I would go so far as to suggest there should be a warning for int64 -> float64 conversion as well, because anything above 2**53 will create unforeseen problems for people unaware of float64's limitations. For example:
import numpy as np

print("%f" % np.float64(9007199254740992))
print("%f" % np.float64(9007199254740993))
>>>9007199254740992.000000
>>>9007199254740992.000000
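A minimal sketch of checking that round-trip directly:

import numpy as np

# Above 2**53, not every integer survives int64 -> float64 -> int64.
values = np.array([2**53, 2**53 + 1], dtype=np.int64)
roundtrip = values.astype(np.float64).astype(np.int64)
print(roundtrip == values)
# [ True False]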