Code Sample, a copy-pastable example if possible
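# pandas.compat.StringIO is not available in newer pandas releases;
# io.StringIO is the standard-library equivalent on Python 3.
from io import StringIO

import pandas as pd

t1 = """float
1
"""
t2 = """float
NaN
"""
for t in t1, t2:
    # Request str for the column, then inspect the type of the first element.
    df = pd.read_csv(StringIO(t), dtype={'float': 'str'})
    print(type(df['float'][0]))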
Problem description
Even when explicitly specifying dtype above, read_csv still converts values in the float column to a float when the string is "NaN". This behavior appears to be limited to "NaN" as it doesn't happen for regular numbers. Still, I unexpectedly ran across a "NaN" string in my application so it's blocking me.
Comment From: gfyoung
So at the surface, pandas is actually respecting your wishes, as the dtype of the column is object. However, it just so happens that the element contained in that column is not converted to a str. That being said, I think it should convert to a string (the NaN itself). PR to patch is welcome!
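The distinction drawn here, between the dtype of the column and the type of the individual element, can be checked directly. The following is a minimal sketch: the helper name `describe_column` is hypothetical (it is not part of pandas), and the result for the "NaN" input depends on the pandas version, so no expected output is shown.

```python
from io import StringIO

import pandas as pd


def describe_column(csv_text, column="float"):
    """Return (column dtype, type of first element) after reading with dtype=str.

    Hypothetical helper for illustration only; not part of pandas.
    """
    df = pd.read_csv(StringIO(csv_text), dtype={column: str})
    return df[column].dtype, type(df[column].iloc[0])


# Compare a regular number with the "NaN" string from the report.
for text in ("float\n1\n", "float\nNaN\n"):
    col_dtype, elem_type = describe_column(text)
    print(repr(text), "->", col_dtype, elem_type)
```

If the column dtype looks as requested but the element type for the "NaN" input is `float`, that is the mismatch described in this issue.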
Comment From: giba0
@gfyoung A newbie here! I would like to help with this problem! Any tips on where to start?
Comment From: gfyoung
@gilbertoolimpio : Thanks for volunteering! So read_csv is a somewhat tricky / convoluted function, but it's do-able to understand once you've worked with it for long enough.
Head over to parsers.pyx. That's where the bulk of the processing for the C engine is done. Look at the _read_rows method. There, you should be able to trace when we have finished reading the data and when we begin to transform data types of columns.
Comment From: giba0
Ok @gfyoung! Thanks!
Comment From: giba0
@jamesqo and @gfyoung the expected result would be
Comment From: gfyoung
Yes, that's correct!
Comment From: giba0
For now, you may want to use the workaround below until a proper fix is found.
df = pd.read_csv(StringIO(t), dtype={'float': 'str'}, engine='python')
@gfyoung
I'm having trouble debugging the pyx files. Could you give me a hint?
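Besides the engine='python' workaround above, another option that may help is to switch off the default missing-value strings, so that "NaN" is no longer treated as a marker at all. This sketch is not from the thread. It relies on the `keep_default_na` parameter of read_csv, and it has a side effect: other default markers such as "NA" are also no longer treated as missing.

```python
from io import StringIO

import pandas as pd

text = "float\nNaN\n"

# keep_default_na=False (with no na_values given) means no strings are
# parsed as missing, so the literal text "NaN" is kept as a string.
df = pd.read_csv(StringIO(text), dtype={"float": str}, keep_default_na=False)

value = df["float"].iloc[0]
print(type(value), repr(value))
```

Whether this is acceptable depends on whether the rest of the file relies on the default markers to represent genuinely missing values.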
Comment From: gfyoung
Debugging the files pulls? What do you mean by that?
Comment From: giba0
@gfyoung I'm sorry, I meant pyx. My spell-checker tripped me up!
Comment From: gfyoung
@gilbertoolimpio : Hahaha, got it. Debugging pyx files is kind of annoying, so I generally just add a ton of print statements wherever I can.
Comment From: jreback
this is a duplicate of #15669 which needs discussion.
Comment From: jreback
since NaN is by definition the missing value marker, this is correct. you can comment on #15669 if you would like.
Comment From: jamesqo
@jreback The value isn't missing, it's the string "NaN".
Comment From: gfyoung
If you read the first part of the issue in #15669, I can see why it's considered a duplicate.