Code Sample, a copy-pastable example if possible
from io import StringIO
import pandas as pd
t1 = """float
1
"""
t2 = """float
NaN
"""
for t in t1, t2:
    df = pd.read_csv(StringIO(t), dtype={'float': 'str'})
    print(type(df['float'][0]))
Problem description
Even when explicitly specifying dtype above, read_csv still converts values in the float column to a float when the string is "NaN". This behavior appears to be limited to "NaN", as it doesn't happen for regular numbers. Still, I unexpectedly ran across a "NaN" string in my application, so it's blocking me.
Comment From: gfyoung
So at the surface, pandas is actually respecting your wishes, as the dtype of the column is object. However, it just so happens that the element contained in that column is not converted to a str. That being said, I think it should convert to a string (the NaN itself). PR to patch is welcome!
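The distinction between the column's dtype and the type of the element it holds can be checked directly. The snippet below is only a minimal sketch: describe_first_value is a hypothetical helper name, StringIO is imported from io, and the printed dtype may differ between pandas versions.

```python
from io import StringIO

import pandas as pd


def describe_first_value(csv_text):
    """Return (column dtype, type of first element, first element)."""
    # Hypothetical helper for illustration; not part of pandas.
    df = pd.read_csv(StringIO(csv_text), dtype={"float": "str"})
    value = df["float"].iloc[0]
    return df["float"].dtype, type(value), value


# Compare a regular number with the literal text "NaN".
for text in ("float\n1\n", "float\nNaN\n"):
    col_dtype, value_type, value = describe_first_value(text)
    print(col_dtype, value_type, repr(value))
```

For the "1" input the element comes back as a str, while for the "NaN" input it comes back as a missing value rather than the text "NaN".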
Comment From: giba0
@gfyoung A newbie here! I would like to help with this problem! Any tips on where to start?
Comment From: gfyoung
@gilbertoolimpio : Thanks for volunteering! So read_csv is a somewhat tricky / convoluted function, but it's do-able to understand once you've worked with it for long enough.
Head over to parsers.pyx. That's where the bulk of the processing for the C engine is done. Look at the _read_rows method. There, you should be able to trace when we have finished reading the data and when we begin to transform data types of columns.
Comment From: giba0
Ok @gfyoung! Thanks!
Comment From: giba0
@jamesqo and @gfyoung the expected result would be
Comment From: gfyoung
Yes, that's correct!
Comment From: giba0
For now, you may want to use the workaround below until a proper fix is found.
df = pd.read_csv(StringIO(t), dtype={'float': 'str'}, engine='python')
@gfyoung I'm having trouble debugging the pyx files. Could you give me a hint?
Comment From: gfyoung
Debugging the files pulls? What do you mean by that?
Comment From: giba0
@gfyoung I'm sorry, I meant pyx; my spell-checker made me fail!
Comment From: gfyoung
@gilbertoolimpio : Hahaha, got it. Debugging pyx files is kind of annoying, so I generally just add a ton of print statements wherever I can.
Comment From: jreback
this is a duplicate of #15669 which needs discussion.
Comment From: jreback
Since NaN is by definition the missing value marker, this is correct. You can comment on #15669 if you would like.
Comment From: jamesqo
@jreback The value isn't missing, it's the string "NaN".
Comment From: gfyoung
If you read the first part of the issue in #15669, I can see why it's considered a duplicate.
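For anyone who needs the literal text "NaN" kept as a string in the meantime, one option that may help is read_csv's keep_default_na parameter. This is a sketch under the assumption that no other cells in the file rely on the default missing-value markers, and it is not something proposed in the thread above.

```python
from io import StringIO

import pandas as pd

csv_text = "float\nNaN\n"

# keep_default_na=False turns off the built-in list of NA markers
# (which includes "NaN"), so the cell is kept as literal text.
df = pd.read_csv(
    StringIO(csv_text),
    dtype={"float": "str"},
    keep_default_na=False,
)
value = df["float"].iloc[0]
print(type(value), repr(value))
```

The trade-off is that other default markers, such as empty fields or "NA", also stop being treated as missing, so this only fits data where that is acceptable.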