I encountered what looks like incorrect behavior in pandas replace when a DataFrame mixes strings and integers. If the DataFrame contains both 0 (integer) and '0' (string), then replacing '0' affects both the strings and the integers. Here's how to reproduce it:
In [1]: df = pd.DataFrame({'numbers' : [0, 1, 2, 0], 'strings' : ['0', 1, 2, '0']})
To check that it's indeed the correct setup:
In [2]: df.dtypes
Out[2]:
numbers int64
strings object
dtype: object
And check individual values:
In [3]: type(df['numbers'][0])
Out[3]: numpy.int64
In [4]: type(df['strings'][0])
Out[4]: str
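Since the strings column has object dtype and mixes types, looking at a single element does not show the whole picture. A minimal sketch of checking every element's type, using only Series.map; the variable name element_types is just for illustration:

```python
import pandas as pd

df = pd.DataFrame({'numbers': [0, 1, 2, 0], 'strings': ['0', 1, 2, '0']})

# Map each cell of the object column to its Python type, so the
# str/int mix inside the column is visible at a glance.
element_types = df['strings'].map(type).tolist()
print(element_types)
```

This confirms that the column really holds both str and int values, so a replace on '0' should only touch the str cells.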
Now, do replace:
In [5]: df.replace(to_replace='0', value=np.NaN, inplace=True)
In [6]: df.head()
Out[6]:
numbers strings
0 NaN NaN
1 1 1
2 2 2
3 NaN NaN
As you can see, it replaced both the strings and the integers, although it should have affected only the strings. If we do the same with the integer 0, it works correctly and leaves the '0' strings untouched:
In [7]: df = pd.DataFrame({'numbers' : [0, 1, 2, 0], 'strings' : ['0', 1, 2, '0']})
...: df.replace(to_replace=0, value=np.NaN, inplace=True)
...: print(df.head())
Out [7]:
numbers strings
0 NaN 0
1 1 1
2 2 2
3 NaN 0
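Until this is fixed, one possible workaround is to avoid replace's matching altogether and build a type-aware boolean mask instead. This is only a sketch under my own assumptions: the helper name replace_exact_str is hypothetical (not part of pandas), and it relies only on DataFrame.apply, Series.map and DataFrame.mask.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'numbers': [0, 1, 2, 0], 'strings': ['0', 1, 2, '0']})


def replace_exact_str(frame, target, value):
    """Hypothetical helper: replace only cells that are str and equal to target."""
    # Build a boolean DataFrame that is True only where the cell is a str
    # instance AND equals the target, so the integer 0 never matches '0'.
    mask = frame.apply(
        lambda col: col.map(lambda x: isinstance(x, str) and x == target)
    )
    # DataFrame.mask replaces the cells where the mask is True.
    return frame.mask(mask, value)


result = replace_exact_str(df, '0', np.nan)
print(result)
```

With this, the integers in the numbers column keep their values and only the '0' strings become NaN. It is element-wise, so it will be slower than replace on large frames.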
Output of pd.show_versions()
Comment From: jorisvandenbossche
@ozhogin Thanks for the report. That does indeed look like a bug (I think replace is not tested that well for non-string values, and the docs also mainly talk about strings).
You are always welcome to look into it!
Comment From: jorisvandenbossche
Related issue: https://github.com/pandas-dev/pandas/issues/12747. I am going to close this issue, and add it as an additional case in #12747