Duplicate of #13302


When reading via read_fwf and passing both na_values and converters at the same time, na_value detection does not work:

Code Sample, a copy-pastable example if possible

Assume a fwf-style text file like this:

A B C 10 -1 10 10 10 10

In: read_fwf(file_path, converters={'B':lambda x: int(x)/10.}, na_values={'B':-1})
Out:
>     A    B   C
> 0  10 -0.1  10
> 1  10  1.0  10

Expected Output

>     A    B   C
> 0  10  NaN  10
> 1  10  1.0  10

It would be nice if this could be handled behind the scenes. In pandas.io.parsers._clean_na_values, float representations of the na_values are produced by _floatify_na_values. Maybe passing the converters to this function might be a solution?

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 2.7.10.final.0 python-bits: 64 OS: Windows OS-release: 7 machine: AMD64 processor: Intel64 Family 6 Model 23 Stepping 10, GenuineIntel byteorder: little LC_ALL: None LANG: None LOCALE: None.None pandas: 0.19.2 (...)

Comment From: jreback

cc @gfyoung

Comment From: gfyoung

So this is in fact a point of discussion for the API (and not a bug per se):

When should na_values be applied?

Currently, as evidenced by the code (and personal testing) they are applied after the converters are applied. This is why there are no NaN values because post-conversion, there are no more NaN values technically.

I don't see the harm in choosing one or the other, but we just need agreement and consistency (we have the latter currently). Regardless of how this issue goes, documentation updates will be needed to explain when na_values are applied and to give an example in io.rst

Comment From: jorisvandenbossche

I we choose, I would find it more logical that na_values is applied first.

Comment From: gfyoung

@jorisvandenbossche : Very well. So we have two people (@gitporst and @jorisvandenbossche) in favor of applying the na_values before the converters are applied. Is anyone willing to play devil's advocate for applying after the converters? Otherwise, I think we'll go with trying to apply before (I say trying because who knows what tests will break).

Comment From: gfyoung

@jorisvandenbossche : Argh...I forgot. We have this ugly bug that I reported in #13302. We might need to put a hold on resolving this and save it for a longer-term revamp of read_csv in 1.0. What do you think?

Comment From: jreback

@gfyoung as long as we are consistent with the approach I think its fine (IOW apply na_values first).

to be honest, converters is just trying to push more functionaility into read_csv, when you really want to do the opposite. IOW, do conversions after, where the tooling is much more idiomatic, rather than trying to shove everything into passed conversion functions.

not sure what this means

save it for a longer-term revamp of read_csv in 1.0.

what exactly are you referring? (do you have some plans for this)?

Comment From: gfyoung

@jreback : I was just saying that patching converters is not easy in the short-term because of the bug I pointed and the inconsistency I pointed out in the issue.

However, at least we seem to have agreement that converters should come after na_values. For bookkeeping purposes, I would be in favor of closing this and then moving some of this information into my issue (#13302).

Comment From: jorisvandenbossche

Yes, this indeeds seems like a duplicate of #13302. @gfyoung can you move the necessary information? (just the agreement that na_values should be applied before converters I think?)

I am also not sure why solving this or the bug in #13302 would need a 1.0. Do you mean that the logical way to solve it would have a lot of other consequences?

Comment From: gfyoung

@jorisvandenbossche : The consequences of changing it would be too big I think. I gave it a try before a couple of months ago, and things just broke. Though if someone wants to prove me wrong, be my guest! :smile: