Code Sample, a copy-pastable example if possible
In [109]:
data = np.array([[ 'c1', 'c2'],
[ '', 0.285],
[ 10.1, 0.285]], dtype=object)
In [110]:
pd.DataFrame(data).to_csv('test.csv',header=False,index=False)
In [111]:
!cat test.csv
Out[111]:
c1,c2
,0.285
10.1,0.285
In [113]:
pd.read_csv('test.csv',converters={'c1':str},engine='c').values
Out[113]:
array([['', 0.285],
['10.1', 0.285]], dtype=object)
In [114]:
pd.read_csv('test.csv',converters={'c1':str},engine='python').values
Out[114]:
array([[nan, 0.285],
['10.1', 0.285]], dtype=object)
Expected Output
Notice that the outputs of the python engine and the c engine differ. I am not sure which one is preferable/expected.
output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 13.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.18.0
nose: 1.3.7
pip: 8.1.1
setuptools: 21.0.0
Cython: 0.23.4
numpy: 1.11.0
scipy: 0.17.0
statsmodels: 0.6.1
xarray: None
IPython: 4.1.2
sphinx: 1.3.5
patsy: 0.4.0
dateutil: 2.5.2
pytz: 2016.3
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.4.6
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.5.0
bs4: 4.4.1
html5lib: 0.9999999
httplib2: None
apiclient: None
sqlalchemy: 1.0.9
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: 2.39.0
Details
The culprit seems to be this function: https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L1299
Before that call, results and values both contain the '' string in the python engine version of the parser. That call changes the '' value in values to nan and thus changes the value of results as well. Specifically, the change for the python engine happens here: https://github.com/pydata/pandas/blob/0c6226cbbc319ec22cf4c957bdcc055eaa7aea99/pandas/src/inference.pyx#L1009
if (convert_empty and val == '') or (val in na_values)
It seems that na_values contains '' by default for certain parsing calls. This means that val == '' triggers the nan conversion whether or not convert_empty is true.
I haven't tracked down the change in the c engine.
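To make the effect of that condition concrete, here is a minimal pure-Python sketch. The function name maybe_to_nan and the default_like set are hypothetical stand-ins chosen for illustration; this is not the pandas implementation, only the quoted boolean check wrapped in a function.

```python
import math

def maybe_to_nan(val, na_values, convert_empty=True):
    """Toy stand-in for the quoted check (hypothetical helper, not pandas code)."""
    if (convert_empty and val == '') or (val in na_values):
        return float('nan')
    return val

# Assumed subset of default NA strings; the point is only that '' is among them.
default_like = {'', 'NaN', 'NA'}

# '' is in na_values, so the second branch fires even with convert_empty=False.
r1 = maybe_to_nan('', default_like, convert_empty=False)

# With '' absent from na_values, convert_empty alone decides the outcome.
r2 = maybe_to_nan('', set(), convert_empty=False)
r3 = maybe_to_nan('', set(), convert_empty=True)

print(r1, repr(r2), r3)
```

Under these assumptions, r1 and r3 come out as nan while r2 stays '', which matches the observation that convert_empty has no effect once '' is in na_values.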
Comment From: jreback
I guess. You are doing something really, really odd here by converting an explicitly float column. To be honest, the converter argument is not very idiomatic and is non-performant. These types of conversions (if you really wanted to do them) are much better performed after reading/parsing.
cc @gfyoung
Comment From: gte620v
Fair enough. I don't think it is a pressing bug, but probably a bug nonetheless.... I came across it when I was trying to add converters to read_html.
Comment From: gfyoung
This is definitely a bug in the Python engine (C engine looks fine AFAICT) and can be easily fixed as @gte620v pointed out. PR should be on the way unless @gte620v you've already started.
Comment From: gte620v
No I haven't. I'm not exactly sure what the best fix would be. @gfyoung, please go ahead.
Comment From: gfyoung
For future reference, here is a (more) minimal example to reproduce this:
>>> from pandas import read_csv
>>> from pandas.compat import StringIO
>>> data = 'a,b\n,1'
>>> read_csv(StringIO(data), converters={0: str}, engine='c').values # correct
array([['', 1]], dtype=object)
>>> read_csv(StringIO(data), converters={0: str}, engine='python').values # incorrect
array([[nan, 1]], dtype=object)
Comment From: gfyoung
Sigh...I thought it was simple. Then I thought about it again (and saw tests fail), and I see now that it is another manifestation of the major discrepancy between the Python and the C engines.
I thought the Python engine was bugged, but now I realise that it isn't. The logic is written that way because we want to fully control which values are NaN when we pass in na_values, which is why we pass in convert_empty=False.
The reason the C engine does not convert to NaN is this issue with converters here. Notice that no NaN processing is done if you have a converter.
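The discrepancy described above can be sketched with a toy model. Everything here (parse_column, the skip_na_after_converter flag, the single-entry na_values default) is a hypothetical illustration of the two behaviours as described in this thread, not code from either engine.

```python
import math

NAN = float('nan')

def parse_column(raw, converter=None, na_values=frozenset({''}),
                 skip_na_after_converter=False):
    """Toy column parser (hypothetical); models only the NaN step after a converter."""
    out = []
    for cell in raw:
        if converter is not None:
            cell = converter(cell)
            if skip_na_after_converter:
                # C-engine-like path as described: converted cells bypass NaN handling.
                out.append(cell)
                continue
        # Python-engine-like path as described: na_values are still applied.
        out.append(NAN if cell in na_values else cell)
    return out

c_like = parse_column(['', '10.1'], converter=str, skip_na_after_converter=True)
py_like = parse_column(['', '10.1'], converter=str)

print(c_like)   # the empty string survives
print(py_like)  # the empty string has become nan
```

In this sketch the only difference between the two calls is whether NaN handling is skipped once a converter has run, which is enough to reproduce the '' versus nan mismatch from the original report.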
Arguably, this is a dupe of my issue here. @jreback, you can be the judge whether this one here can be closed in light of this other issue.
Comment From: jreback
@gfyoung yes I agree, #13302 is a generalization of this. I don't know why that continue is there in the c-engine, but I suspect it will be non-trivial to remove :>