Duplicate of #13302
When reading via read_fwf and passing both na_values and converters at the same time, na_value detection does not work:
Code Sample, a copy-pastable example if possible
Assume a fwf-style text file like this:
> A  B  C
> 10 -1 10
> 10 10 10
In: read_fwf(file_path, converters={'B':lambda x: int(x)/10.}, na_values={'B':-1})
Out:
> A B C
> 0 10 -0.1 10
> 1 10 1.0 10
Expected Output
> A B C
> 0 10 NaN 10
> 1 10 1.0 10
It would be nice if this could be handled behind the scenes. In pandas.io.parsers._clean_na_values, float representations of the na_values are produced by _floatify_na_values. Maybe passing the converters to this function might be a solution?
Comment From: jreback
cc @gfyoung
Comment From: gfyoung
So this is in fact a point of discussion for the API (and not a bug per se):
When should na_values be applied?
Currently, as evidenced by the code (and personal testing), they are applied after the converters are applied. This is why there are no NaN values: post-conversion, the -1 has already become -0.1, so technically nothing matches na_values anymore.
I don't see the harm in choosing one or the other, but we just need agreement and consistency (we have the latter currently). Regardless of how this issue goes, documentation updates will be needed to explain when na_values are applied and to give an example in io.rst.
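To make the ordering question concrete, here is a toy model in plain Python. It is not the pandas parser code. The helper names convert_then_mask and mask_then_convert are made up for illustration, and the NA check is simplified to a float comparison:

```python
import math

# Raw cells of column B, as a text parser would first see them.
raw = ["-1", "10"]

# The converter from the report, and the NA marker as a float.
convert = lambda x: int(x) / 10.0
na_values = {-1.0}


def convert_then_mask(cells):
    """Hypothetical model: run the converter first, then look for NA markers."""
    converted = [convert(c) for c in cells]
    return [math.nan if v in na_values else v for v in converted]


def mask_then_convert(cells):
    """Hypothetical model: look for NA markers first, convert only the rest."""
    return [math.nan if float(c) in na_values else convert(c) for c in cells]


print(convert_then_mask(raw))  # [-0.1, 1.0]: -1 became -0.1 and no longer matches
print(mask_then_convert(raw))  # [nan, 1.0]: the marker is caught before conversion
```

The first ordering reproduces the output in the report, the second the expected output.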
Comment From: jorisvandenbossche
If we choose, I would find it more logical that na_values is applied first.
Comment From: gfyoung
@jorisvandenbossche : Very well. So we have two people (@gitporst and @jorisvandenbossche) in favor of applying the na_values before the converters are applied. Is anyone willing to play devil's advocate for applying after the converters? Otherwise, I think we'll go with trying to apply before (I say trying because who knows what tests will break).
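Until the parser itself applies na_values first, the same ordering can be approximated in user code by wrapping the converter. This is only a sketch: na_aware is a hypothetical helper and not part of pandas, and it assumes the converter receives each cell as a string, which is what read_fwf passes for text input:

```python
from io import StringIO

import pandas as pd

# The fixed-width sample from the report, kept in memory instead of on disk.
data = "A  B  C\n10 -1 10\n10 10 10\n"


def na_aware(convert, na_strings):
    """Hypothetical helper: check the raw cell for NA markers before converting."""
    def wrapped(raw):
        if str(raw).strip() in na_strings:
            return float("nan")
        return convert(raw)
    return wrapped


df = pd.read_fwf(
    StringIO(data),
    converters={"B": na_aware(lambda x: int(x) / 10.0, {"-1"})},
)
print(df)
```

The NA markers are compared as strings here, so "-1" has to be spelled the way it appears in the file.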
Comment From: gfyoung
@jorisvandenbossche : Argh...I forgot. We have this ugly bug that I reported in #13302. We might need to put a hold on resolving this and save it for a longer-term revamp of read_csv in 1.0. What do you think?
Comment From: jreback
@gfyoung as long as we are consistent with the approach I think it's fine (IOW apply na_values first).
To be honest, converters is just trying to push more functionality into read_csv, when you really want to do the opposite. IOW, do conversions after, where the tooling is much more idiomatic, rather than trying to shove everything into passed conversion functions.
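A minimal sketch of the "convert afterwards" approach, using the sample from the report. The in-memory StringIO input is an assumption for illustration, and the NA marker is given as the string "-1" to match the raw text:

```python
from io import StringIO

import pandas as pd

# The fixed-width sample from the report, kept in memory instead of on disk.
data = "A  B  C\n10 -1 10\n10 10 10\n"

# Step 1: let the parser handle NA detection only, with no converters.
df = pd.read_fwf(StringIO(data), na_values={"B": ["-1"]})

# Step 2: do the conversion on the parsed column; NaN stays NaN.
df["B"] = df["B"] / 10.0
print(df)
```

Because the division runs on an already-parsed numeric column, the ordering question between na_values and converters never comes up.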
Not sure what this means:
> save it for a longer-term revamp of read_csv in 1.0.
What exactly are you referring to? (Do you have some plans for this?)
Comment From: gfyoung
@jreback : I was just saying that patching converters is not easy in the short term because of the bug and the inconsistency I pointed out in the issue.
However, at least we seem to have agreement that converters should come after na_values. For bookkeeping purposes, I would be in favor of closing this and then moving some of this information into my issue (#13302).
Comment From: jorisvandenbossche
Yes, this indeed seems like a duplicate of #13302. @gfyoung can you move the necessary information? (Just the agreement that na_values should be applied before converters, I think?)
I am also not sure why solving this or the bug in #13302 would need a 1.0. Do you mean that the logical way to solve it would have a lot of other consequences?
Comment From: gfyoung
@jorisvandenbossche : The consequences of changing it would be too big, I think. I gave it a try a couple of months ago, and things just broke. Though if someone wants to prove me wrong, be my guest! :smile: