I was surprised to learn that the default behavior for reading in a CSV like this:

```
% more foo.txt
x,y
"NA",NA
"foo",foo
```

is the following:

```python
In [56]: pd.read_csv('foo.txt')
Out[56]:
     x    y
0  NaN  NaN
1  foo  foo
```
I realize this can be changed with `pd.read_csv('foo.txt', keep_default_na=False)`, but this doesn't seem like ideal default behavior, for the following reasons:
- NA can be a valid value in numerous cases:
  - an abbreviation for North America, sodium, and lots of other things
  - a possible value for any code with two alphabetic characters
- NA might very well be a valid response (distinct from true missing) in survey data, even if it literally means something like "not applicable".
- This behavior is likely unexpected in many cases, and dangerous for that reason. Here is a recent example from Stack Overflow. As a counter-example, I checked the behavior of Excel and Stata: both will read in NA as a text value, not as missing.
- It just seems like better practice to handle interpretation of NA and NaN after reading the data in as text, rather than doing this on the fly. When this behavior is unwanted, it may be very hard to detect after the fact.
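To make the point above concrete, here is a minimal, self-contained sketch of the behavior and the `keep_default_na` workaround (an in-memory buffer stands in for `foo.txt` so the snippet runs on its own):

```python
import io

import pandas as pd

csv_text = 'x,y\n"NA",NA\n"foo",foo\n'

# Default behavior: both the quoted and unquoted NA become missing values.
default = pd.read_csv(io.StringIO(csv_text))
print(default['x'].isna().tolist())  # [True, False]

# keep_default_na=False disables the built-in list of NA strings,
# so 'NA' survives as a literal two-character string.
literal = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
print(literal['x'].tolist())  # ['NA', 'foo']
```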
Edit to add: At the risk of belaboring the point, note that the following round trips with either `read_csv` or `from_csv` (using all default values) will convert 'NA' to NaN:

```python
df = pd.DataFrame({'x': ['NA', 'foo']})
df.to_csv('test.csv')
df1 = pd.read_csv('test.csv')
df2 = pd.DataFrame.from_csv('test.csv')
```
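A self-contained sketch of the same round trip (using an in-memory buffer; `from_csv` is omitted since it has since been deprecated), together with one way to make the trip lossless:

```python
import io

import pandas as pd

df = pd.DataFrame({'x': ['NA', 'foo']})
buf = io.StringIO()
df.to_csv(buf, index=False)

# Default round trip: the string 'NA' comes back as float NaN.
buf.seek(0)
df1 = pd.read_csv(buf)
print(df1['x'].tolist())  # [nan, 'foo']

# Disabling the default NA list (while keeping empty fields as missing)
# preserves the literal string on the way back in.
buf.seek(0)
df2 = pd.read_csv(buf, keep_default_na=False, na_values=[''])
print(df2['x'].tolist())  # ['NA', 'foo']
```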
Comment From: TomAugspurger
R has the same behavior (at least for "NA"), for whatever that's worth. This is a pretty well-ingrained part of the API, and I don't think the upside is worth breaking it.
Comment From: johne13
@TomAugspurger Thanks for the response. I did not know R also did this, but changing this would then be a positive way to differentiate from R IMO. And pandas is not yet to version 1. ;-)
@jreback Err, OK, was hoping maybe for a brief explanation at least, or maybe leave this open longer than 24 hrs to allow some community input. Admittedly I'm not sure how disruptive a change like this would be.
I don't want to be overly dramatic, but to me this is a very serious data integrity issue. Although I doubt this commonly causes problems, when it does, it may go undetected. And if detected, it might be hard to debug. Maybe some of it comes down to whether CSV is considered an important input method; I think it is, because even though it's hardly ever the first choice, it's sometimes the necessary choice.
FWIW, I assumed all programs behaved like Excel and Stata and would never knowingly have the default option set to change potentially valid data to missing. I mostly work with economic and survey data (for many years now) and often depend on CSV to move data between different programs. I honestly have never been in a situation where I would want read_csv to set NA or NaN missing at all, much less as a default behavior.
Comment From: TomAugspurger
Even though we aren't at 1.0, backwards compatibility is important to us. We've got a lot of users already. That said, we will break compatibility, but only if the upshot is big enough.
The fact that there's an option to control the behavior acknowledges that this is an ambiguous case (which comes down to CSV being bad way to store and transmit data, but like you said all too common).
Something like https://github.com/tomaugspurger/engarde could potentially help ensure that you maintain data integrity as your data flows through.
Comment From: jorisvandenbossche
@johne13 I think the desirability of the actual default (interpreting "NA" as a missing value or not) depends a lot on the actual working field / use case, and so it will never be possible to do the 'right' thing for everybody. E.g. for easy transfer of data from R to Python using csv, this is a handy default, as R's default output uses NA in csv to denote NA's.
But as it indeed can lead to hard-to-detect problems, there are maybe other ways to partially deal with this problem rather than changing the default. Just throwing out two ideas (though I'm not sure I would support these myself):

- Interpret NA as NaN by default only for numerical columns, and not for string/object columns. But I don't know whether this is 1) technically possible (I don't know at which point the NA inference is done) and 2) desirable, as it increases the complexity of the default behavior.
- Output a warning when "NA" occurrences are replaced by NaN (so the user at least knows this is done, and is warned when this was not his/her intention), with an option to suppress this message (possibly together with suppressing dtype warnings).
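Pending anything built in, a user-side version of the second idea can be sketched by parsing the file twice and reporting the cells that the default NA list converted (variable names here are illustrative, not an existing API):

```python
import io

import pandas as pd

csv_text = 'code,value\nNA,1\nEU,2\nAF,\n'

parsed = pd.read_csv(io.StringIO(csv_text))
raw = pd.read_csv(io.StringIO(csv_text), keep_default_na=False, dtype=str)

# A cell that is NaN after parsing but held a non-empty token in the raw
# text was converted by read_csv's default NA-string matching.
converted = parsed.isna() & (raw != '')
for col in parsed.columns:
    for idx in parsed.index[converted[col]]:
        print(f"row {idx}, column {col!r}: {raw.at[idx, col]!r} -> NaN")
```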
Comment From: johne13
@TomAugspurger I disagree that this is really an ambiguous case at all (and I haven't heard an argument to the contrary yet, either here or on Stack Exchange), but I realize the cost-benefit decision of such a change that breaks backwards compatibility is up to you, Jeff, et al.
What about a warning message when data values are changed? Would it be reasonable to open an issue for that?
I also found the documentation of this behavior to be very confusing, but I saw existing issues already posted for that and will comment in those rather than here.
Comment From: jreback
@johne13 As discussed above, this is really a choice that one has to make, and it was made a long time ago. 'NA' is much, much more typically used to indicate missing values. It may be that in your field that is not so common, but for the vast majority of the user base reading a csv, it is.
So a choice was made, and it would be very difficult to remove even if we wanted to. You have the option to turn off the default parsing. I suppose one could have a `pd.set_option('csv.na_defaults', ......)` so that one could easily just change the default. If that is what you are proposing, then it would be ok.
Comment From: jreback
There is some work being done on this, mainly to have an option to return the errors/warnings/conversions that the parser is doing as additional data to the user, so you have the option to 'see' what was changed/substituted/etc.

FYI, since csv is generally not a great medium for describing data (though we all have/use it), you ought to consider other formats such as HDF5 and msgpack for this.
Comment From: johne13
@jorisvandenbossche Thanks, I strongly agree with both suggestions if they are feasible. If a column contains only numbers and NA then it seems reasonable that NA=missing.
Regarding R, what will happen if I export from R a text column of, say, continents abbreviated as ['NA', 'EU', 'AF'] and then read that into pandas? I assume that must be read as missing, but I don't use R, so I don't know for sure how R outputs this, or whether R treats NA the same for text and number columns.
@jreback OK, thanks for the explanation. Personally I would be more interested in having pandas give a warning/info message rather than a new option. My concern is with all the people who are using read_csv without understanding default behavior, not all the people who do.
Comment From: jorisvandenbossche
Another change that we would maybe be more likely to consider is the parsing of quoted values of NA: only changing `"NA"` to not be converted automatically, while leaving `NA` and the like converted to NaN as it is now.

Personally, I would even consider it a bug that `"NA"` and `NA` are treated the same, as I would expect quoted values to be left untouched. But I don't know how long this behaviour has been this way.

(To admit, I even thought it worked that way, and wanted to answer that to your question on R, but then I saw that your original post above is clearly an example with quotes.)
Comment From: TomAugspurger
> Another change that we would maybe be more likely to consider is the parsing of quoted values of NA (so only changing `"NA"` to not be converted automatically, but leaving `NA` and the like converted to NaN as it is now). Personally, I would even consider it a bug that `"NA"` and `NA` are treated the same, as I would expect quoted values to be left untouched. But I don't know how long this behaviour has been this way.
I’m potentially +1 on that Joris.
Comment From: johne13
@jreback I'm not arguing that CSV or similar would ever be my first choice, but public economic data is often provided only as text or as text and SAS. The private data I work with is similarly provided in that manner (always text, often SAS, nothing else ever and that's not changing soon). So I think the defaults of read_csv and read_fwf are extremely important.
I have also edited my original post to note that a round trip with `to_csv` and `read_csv` using defaults will convert a data value of 'NA' to NaN.
Comment From: BrenBarn
Just found this when searching for info about NaN reading/writing in Pandas. I agree that quoted NA should be handled differently.
A significant problem is that right now string values of `"NA"` or `"NaN"` are not CSV round-trippable without special-casing the arguments. If you create a DataFrame with a string value like this, save with `.to_csv()`, and then read with `.read_csv()`, you get `NaN` (the float value) instead of your string value. I think this is buggy behavior. If pandas wants to read certain things as `NaN`, fine, but if you have a string value, it needs to have a way to write that value so that it will know later to read it in as the same string value, and not as the float `NaN`. I think the simplest way to do this would be to distinguish quoted and unquoted NA values.
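The loss described above can be demonstrated even with maximal quoting on the write side, at least in the pandas versions current at the time of this thread:

```python
import csv
import io

import pandas as pd

df = pd.DataFrame({'x': ['NA', 'NaN']})
buf = io.StringIO()
# QUOTE_ALL wraps every field in quotes on output...
df.to_csv(buf, index=False, quoting=csv.QUOTE_ALL)

# ...but read_csv ignores the quoting when matching NA strings,
# so both literal strings come back as float NaN.
buf.seek(0)
df_back = pd.read_csv(buf)
print(df_back['x'].isna().tolist())  # [True, True]
```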
Comment From: jorisvandenbossche
Yes, as I said above, I think we should at least consider this (differentiating quotes vs unquoted "NA" strings), so therefore reopening.
Note that by default this will not change much, as the default is to not write quotes. But at least if you have a source file with quotes, it would read in fine without additional keywords.
Comment From: jorisvandenbossche
Actually, opened a new issue: https://github.com/pandas-dev/pandas/issues/15669