I'm recoding multiple columns in a dataframe and have come across a strange result that I can't quite figure out. I'm probably not recoding in the most efficient manner possible, but it's mostly the error that I'm hoping someone can explain.
s1 = pd.DataFrame([np.nan, '1', '2', '3', '4', '5', columns='col1')
s2 = pd.DataFrame([np.nan, 1, 2, 3, 4, 5], columns='col1')
s1_dic = {np.nan: np.nan, '1': 1, '2':2, '3':3, '4':3, '5':3}
s2_dic = {np.nan: np.nan, 1: 1, 2:2, 3:3, 4:3, 5:3}
s1['col1'].apply(lambda x: s1_dic[x])
s2['col1'].apply(lambda x: s2_dic[x])
s1 works fine, but when I try to do the same thing with a list of integers and np.nan, I get KeyError: nan, which is confusing. Any help would be appreciated.
Edit (@hayd): http://stackoverflow.com/q/33286748/1240268
Comment From: max-sixty
This may be due to NaN not being equal to itself. But it works with s1, and you can still use NaN as a Python dictionary key (I'm not sure how that works, actually?), so it must be that the path taken depends on the dtype.
In [164]: pd.np.nan==pd.np.nan
Out[164]: False
In [170]: s1_dic[pd.np.nan]
Out[170]: nan
In [171]: s2_dic[pd.np.nan]
Out[171]: nan
This is a separate point from the above, but there may be a much easier way of doing what you're trying to do. What's the wider goal?
FYI this is a corrected version of your code:
s1 = pd.DataFrame([np.nan, '1', '2', '3', '4', '5'], columns=['col1'])
s2 = pd.DataFrame([np.nan, 1, 2, 3, 4, 5], columns=['col1'])
s1_dic = {np.nan: np.nan, '1': 1, '2':2, '3':3, '4':3, '5':3}
s2_dic = {np.nan: np.nan, 1: 1, 2:2, 3:3, 4:3, 5:3}
s1['col1'].apply(lambda x: s1_dic[x])
s2['col1'].apply(lambda x: s2_dic[x])
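For what it's worth, a minimal sketch of one simpler route, assuming (since the wider goal isn't stated) that you just want to collapse everything above 3 into 3 and leave NaN alone:
import numpy as np
import pandas as pd

s2 = pd.DataFrame([np.nan, 1, 2, 3, 4, 5], columns=['col1'])
s2_dic = {1: 1, 2: 2, 3: 3, 4: 3, 5: 3}

# Guard the lookup so NaN never reaches the dict...
s2['col1'].apply(lambda x: np.nan if pd.isnull(x) else s2_dic[x])

# ...or, since the mapping only caps the codes at 3, skip the dict entirely
s2['col1'].clip(upper=3)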
I'm not sure what principles the maintainers use here, but IMHO this sort of question may be better suited to SO.
Comment From: kawochen
I would caution against what you are doing, since even if that worked, something like this could happen:
In [1]: {1:1, 2:2, 3:3}[1.0000000000000001]
Out[1]: 1
@maximilianr In a plain dict, it is generally OK that nan doesn't equal itself, because the is check has precedence (dict lookup compares by identity before falling back to equality).
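For illustration, a minimal sketch of that lookup behaviour (plain Python, no pandas involved):
import numpy as np

d = {np.nan: 'missing'}

# np.nan is a single module-level float object, so the identity check made
# during dict lookup succeeds even though nan != nan.
d[np.nan]          # 'missing'

# A different NaN object fails both the identity and the equality check.
d[float('nan')]    # raises KeyError: nan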
Comment From: max-sixty
@kawochen Ah great, thanks
Comment From: hayd
I suggested @brianhuey post here. I think this is a bug/edge case in map_infer actually; that seems to be where the error comes from. It works with object dtype:
In [11]: d1 = {1: 1, np.nan: np.nan}
In [12]: pd.lib.map_infer(np.array([np.nan, 1], dtype='object'), lambda x: d1[x], True)
Out[12]: array([ nan, 1.])
In [13]: pd.lib.map_infer(np.array([np.nan, 1]), lambda x: d1[x], True)
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-13-d8d2e1f5f73a> in <module>()
----> 1 pd.lib.map_infer(np.array([np.nan, 1]), lambda x: d1[x], True)
pandas/src/inference.pyx in pandas.lib.map_infer (pandas/lib.c:58435)()
<ipython-input-154-d8d2e1f5f73a> in <lambda>(x)
----> 1 pd.lib.map_infer(np.array([np.nan, 1]), lambda x: d1[x], True)
KeyError: nan
Comment From: kawochen
@hayd Then you have the answer right there already! np.float64(np.nan) is not a singleton: one instance doesn't equal another np.float64(np.nan), it is not np.nan, and it doesn't equal np.nan. In an object array, by contrast, the stored element is the np.nan that is of type float.
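For illustration, a quick sketch of those checks:
import numpy as np

x = np.float64(np.nan)
x is np.nan        # False -- a fresh np.float64 object, not the float singleton
x == np.nan        # False -- NaN never compares equal

# An object array stores the original Python objects, so identity survives;
# a float64 array hands back new np.float64 scalars on access.
np.array([np.nan], dtype='object')[0] is np.nan    # True
np.array([np.nan])[0] is np.nan                    # False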
Comment From: ron819
Is this still an issue?
Comment From: rhshadrach
This is an issue with using np.nan as a key in a dictionary, which is unreliable at best: it does not equal itself, and the value you pull out of a float array is often a distinct copy of np.nan, so an is check returns False as well. Closing.
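For illustration, a short sketch of that at the Series level:
import numpy as np
import pandas as pd

s_float = pd.Series([np.nan, 1, 2])    # float64 dtype
s_float.iloc[0] is np.nan              # False -- access returns a new np.float64 scalar

s_obj = pd.Series([np.nan, '1', '2'])  # object dtype keeps the original objects
s_obj.iloc[0] is np.nan                # True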