From a Stack Overflow question: I'm working on a Mac, trying to read a CSV generated on Windows that ends with a '\x1a' EOF character, and pd.read_csv creates a spurious new row at the end, ['\x1a', NaN, NaN, ...]:
In [1]: import pandas as pd
In [2]: from StringIO import StringIO
In [3]: s = StringIO('a,b\r\n1,2\r\n3,4\r\n\x1a')
In [4]: df = pd.read_csv(s)
In [5]: df
Out[5]:
      a    b
0     1    2
1     3    4
2  \x1a  NaN
Now I'm manually checking for that character plus a run of NaN's in the last row, and dropping the row. Would it be worth adding an option not to create this last row (or to ignore the EOF character automatically)? In Python I'm currently doing:
def strip_eof(df):
    """Drop the last row if it begins with '\x1a' and ends with NaN's."""
    lastrow = df.iloc[-1]
    if lastrow.iloc[0] == '\x1a' and lastrow.iloc[1:].isnull().all():
        return df.drop([lastrow.name], axis=0)
    return df
Comment From: jreback
This looks very similar to #5500, yes?
Comment From: wcbeard
Ah, didn't see that one. Yes, it looks similar, though that one seems to be about EOF characters in the wrong place rather than at the end of the file. I can't tell whether the solution to that issue would fix this problem as well, though.
Comment From: jreback
OK, will mark it as a bug. I wonder if this EOF is the SAME EOF as the one used in 'pandas/src/parser/tokenizer.c'.
This might be a Windows thing. How did you generate the file?
Comment From: hayd
I've seen this too (not on Windows).
Comment From: wcbeard
I assume someone generated it from a Windows machine (I get it from a Samba share), but other than that I couldn't say. I just tried reading the same file from my Windows VM, and it also added the bogus row at the end. The character might be from some old Windows standard...? (0x1A is Ctrl-Z, the end-of-file marker inherited from CP/M and DOS.)
Comment From: jreback
Hmm... maybe give a shot at searching old issues; I seem to recall this somewhere.
Otherwise, would certainly appreciate a PR for this; sounds like a bug to me.
Comment From: wcbeard
I don't know C, so I was trying to fix it in Python after the parser has done its work (in pandas.io.parsers.TextFileReader.read). I'm realizing after some failed tests, however, that the bug messes up the dtype information as well. For example, adding "\x1a" to the end of
data = """A,B,C
1,2,a
4,5,b
"""
changes the dtypes of the Series from [int, int, object] to [object, float, object], because of the string in the first column and the NaN's after it.
I'm not familiar with any way to reclaim the original dtypes after dropping the bad row; from my limited understanding, it seems the proper way to do it would be at the C level in the parser.
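For concreteness, a minimal reproduction of the dtype change just described (a sketch using Python 3's io.StringIO rather than the thread's Python 2 imports; exact behavior may vary by pandas version):

import io
import pandas as pd

data = 'A,B,C\n1,2,a\n4,5,b\n\x1a'
df = pd.read_csv(io.StringIO(data))
# The trailing '\x1a' row drags column A to object and column B to float:
print(df.dtypes)  # per the report above: A object, B float64, C object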
Comment From: jreback
You can convert_objects() to re-infer the dtypes.
Yeah, this needs to be fixed at the C level.
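As an illustration, here's a sketch of that recovery after dropping the bad row. convert_objects() has since been deprecated, so this substitutes pd.to_numeric / astype, which is an assumption on my part rather than the API named above:

import io
import pandas as pd

df = pd.read_csv(io.StringIO('A,B,C\n1,2,a\n4,5,b\n\x1a'))
df = df.iloc[:-1].copy()           # drop the bogus '\x1a' row
df['A'] = pd.to_numeric(df['A'])   # object -> int64
df['B'] = df['B'].astype(int)      # float64 (the NaN is gone now) -> int64
print(df.dtypes)                   # A int64, B int64, C object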
Comment From: jcull
One way to get around this is filtering the input and giving read_csv a StringIO object:
import cStringIO
import itertools
import pandas as pd

with open('csv.csv', 'rb') as myfile:
    myfh = cStringIO.StringIO(''.join(itertools.ifilter(lambda x: x[0] != '\x1a', myfile)))
df = pd.read_csv(myfh)
This will allow read_csv to properly infer the data types of all the columns, so you don't have to convert them later, which is especially helpful if you have converters or datetime columns. It will probably increase your memory usage, though, because of the extra StringIO object.
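For anyone on Python 3, a sketch of the same idea (cStringIO and itertools.ifilter no longer exist there):

import io
import pandas as pd

with open('csv.csv') as myfile:
    # keep only lines that don't start with the '\x1a' EOF marker
    clean = ''.join(line for line in myfile if not line.startswith('\x1a'))
df = pd.read_csv(io.StringIO(clean))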
Comment From: gfyoung
The issue with "skipping" an EOF
on the last row is that you would have to either look ahead to check if you were on the last row OR check the entire last row all over again once you finished reading. In the average case, that's going to hurt performance.
The workaround for this would be to set skipfooter
:
>>> from pandas import read_csv
>>> from pandas.compat import StringIO
>>> data = 'a,b\r\n1,2\r\n3,4\r\n\x1a'
>>> df = read_csv(StringIO(data), engine='python', skipfooter=1)
>>> df
   a  b
0  1  2
1  3  4
>>> df.dtypes
a    int64
b    int64
dtype: object
In light of this, and the fact that "fixing it" is probably more detrimental than beneficial, I would recommend that this issue be closed.
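(A note for later readers: pandas.compat.StringIO went away when pandas dropped Python 2 support, so a sketch of the same workaround on Python 3 would be:)

import io
import pandas as pd

data = 'a,b\r\n1,2\r\n3,4\r\n\x1a'
# skipfooter is only supported by the slower Python engine
df = pd.read_csv(io.StringIO(data), engine='python', skipfooter=1)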
Comment From: gfyoung
No activity on this issue for about a year, and I had recommended it be closed then (because the proper fix is just skipfooter
support). Reading the conversation, I agree with my earlier self.