Code Sample, a copy-pastable example if possible

In [109]:
data = np.array([[ 'c1', 'c2'],
                [ '', 0.285],
                [ 10.1, 0.285]], dtype=object)

In [110]:
pd.DataFrame(data).to_csv('test.csv',header=False,index=False)

In [111]:
!cat test.csv
Out[111]:
c1,c2
,0.285
10.1,0.285

In [113]:
pd.read_csv('test.csv',converters={'c1':str},engine='c').values
Out[113]:
array([['', 0.285],
       ['10.1', 0.285]], dtype=object)

In [114]:
pd.read_csv('test.csv',converters={'c1':str},engine='python').values
Out[114]:
array([[nan, 0.285],
       ['10.1', 0.285]], dtype=object)

Expected Output

Notice that the outputs of the python engine and the c engine differ. I am not sure which one is preferable/expected.

output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 13.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.0
nose: 1.3.7
pip: 8.1.1
setuptools: 21.0.0
Cython: 0.23.4
numpy: 1.11.0
scipy: 0.17.0
statsmodels: 0.6.1
xarray: None
IPython: 4.1.2
sphinx: 1.3.5
patsy: 0.4.0
dateutil: 2.5.2
pytz: 2016.3
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.4.6
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.5.0
bs4: 4.4.1
html5lib: 0.9999999
httplib2: None
apiclient: None
sqlalchemy: 1.0.9
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: 2.39.0

Details

The culprit seems to be this function: https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L1299

Before that call, results and values both contain the '' string in the python engine version of the parser. That call changes the '' value in values to nan and thus changes the value of results as well. Specifically, the change for the python engine happens here: https://github.com/pydata/pandas/blob/0c6226cbbc319ec22cf4c957bdcc055eaa7aea99/pandas/src/inference.pyx#L1009

if (convert_empty and val == '') or (val in na_values):

It seems that na_values contains '' by default for certain parsing calls. This means that val == '' triggers the nan conversion whether or not convert_empty is true.
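
The effect of that condition can be modelled in a few lines of plain Python. This is only a sketch: to_nan_mask is a hypothetical helper, not the pandas function, and the na_values sets below are assumed for illustration rather than taken from the parser defaults.

```python
def to_nan_mask(values, na_values, convert_empty):
    """Return True for each value the condition above would turn into NaN.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    return [(convert_empty and val == '') or (val in na_values)
            for val in values]


values = ['', '10.1']

# '' is a member of na_values, so convert_empty=False does not protect it.
with_empty = to_nan_mask(values, na_values={'', 'NaN'}, convert_empty=False)

# With '' removed from na_values, convert_empty alone decides.
without_empty = to_nan_mask(values, na_values={'NaN'}, convert_empty=False)

print(with_empty)     # [True, False]
print(without_empty)  # [False, False]
```

In this model the empty string is only left alone when it is absent from na_values and convert_empty is false.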

I haven't tracked down the change in the c engine.

Comment From: jreback

I guess. You are doing something really, really odd here by converting an explicitly float column. To be honest, the converters argument is not very idiomatic and is non-performant. These types of conversions (if you really wanted to do them) are much better performed after reading/parsing.

cc @gfyoung

Comment From: gte620v

Fair enough. I don't think it is a pressing bug, but it is probably a bug nonetheless. I came across it when I was trying to add converters to read_html.

Comment From: gfyoung

This is definitely a bug in the Python engine (C engine looks fine AFAICT) and can be easily fixed as @gte620v pointed out. PR should be on the way unless @gte620v you've already started.

Comment From: gte620v

No I haven't. I'm not exactly sure what the best fix would be. @gfyoung, please go ahead.

Comment From: gfyoung

For future reference, here is a (more) minimal example to reproduce this:

>>> from pandas import read_csv
>>> from pandas.compat import StringIO
>>> data = 'a,b\n,1'
>>> read_csv(StringIO(data), converters={0: str}, engine='c').values  # correct
array([['', 1]], dtype=object)
>>> read_csv(StringIO(data), converters={0: str}, engine='python').values  # incorrect
array([[nan, 1]], dtype=object)

Comment From: gfyoung

Sigh...I thought it was simple. Then I thought about it again (and saw tests fail), and I see now that it is another manifestation of the major discrepancy between the Python and the C engines.

I thought the Python engine was bugged, but now I realise that it isn't. The reason the logic is written that way is because we want to fully control which values are NaN when we pass in na_values, which is why we pass in convert_empty=False.

The reason the C engine does not convert to NaN is this issue with converters here. Notice that no NaN processing is done if you have a converter.
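
A rough model of that behaviour, as I understand it from the description above, might look like the following. It is a simplified sketch in plain Python, not the actual C engine code: parse_columns and its arguments are hypothetical names, and the only point is that the early continue skips the NaN step for any column that has a converter.

```python
def parse_columns(columns, converters, na_values):
    """Toy model of a per-column parse loop (hypothetical, not pandas code)."""
    results = {}
    for name, raw in columns.items():
        if name in converters:
            # The converter is applied and the loop moves on, so the
            # NaN handling below is never reached for this column.
            results[name] = [converters[name](v) for v in raw]
            continue
        # Columns without a converter get the usual NaN substitution.
        results[name] = [float('nan') if v in na_values else v for v in raw]
    return results


columns = {'c1': ['', '10.1'], 'c2': ['', '0.285']}
parsed = parse_columns(columns, converters={'c1': str}, na_values={''})

print(parsed['c1'])  # ['', '10.1']
print(parsed['c2'])  # [nan, '0.285']
```

Under this model the empty string in c1 survives because of the converter, while the same value in c2 becomes NaN.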

Arguably, this is a dupe of my issue here. @jreback, you can be the judge whether this one here can be closed in light of this other issue.

Comment From: jreback

@gfyoung yes I agree, #13302 is a generalization of this. I don't know why that continue is there in the c-engine, but I suspect it will be non-trivial to remove :>