From a Stack Overflow question: I'm working on a Mac, trying to read a CSV generated on Windows that ends with a '\x1a' EOF character, and pd.read_csv creates a spurious new row at the end, ['\x1a', NaN, NaN, ...]:
In [1]: import pandas as pd
In [2]: from StringIO import StringIO
In [3]: s = StringIO('a,b\r\n1,2\r\n3,4\r\n\x1a')
In [4]: df = pd.read_csv(s)
In [5]: df
Out[5]:
      a    b
0     1    2
1     3    4
2  \x1a  NaN
Now I'm manually checking for that character plus a run of NaN's in the last row, and dropping the row. Would it be worth adding an option not to create this last row (or to ignore the EOF character automatically)? In Python I'm currently doing:
def strip_eof(df):
    """Drop the last row if it begins with '\x1a' and ends with NaN's."""
    lastrow = df.iloc[-1]
    if lastrow.iloc[0] == '\x1a' and lastrow.iloc[1:].isnull().all():
        return df.drop([lastrow.name], axis=0)
    return df
Comment From: jreback
This looks very similar to #5500, yes?
Comment From: wcbeard
Ah, didn't see that one. Yes, it looks similar, though that one seems to be about EOF characters in the wrong place rather than at the end of the file. I can't tell whether the solution to that issue would fix this problem as well, though.
Comment From: jreback
OK, will mark it as a bug. I wonder if this EOF is the SAME EOF as the one used in 'pandas/src/parser/tokenizer.c'.
This might be a Windows thing. How did you generate the file?
Comment From: hayd
I've seen this too (not on Windows).
Comment From: wcbeard
I assume someone generated it from a Windows machine (I get it from a Samba share), but other than that I couldn't say. I just tried reading the same file from my Windows VM, and it also added the bogus row at the end. The character might be from some old Windows standard...? (0x1A is Ctrl-Z, the end-of-file marker inherited from CP/M and DOS.)
Comment From: jreback
Hmm... maybe give a shot at searching old issues; I seem to recall this somewhere.
Otherwise, would certainly appreciate a PR for this; sounds like a bug to me.
Comment From: wcbeard
I don't know C, so I was trying to fix it in Python after the parser has done its work (in pandas.io.parsers.TextFileReader.read). I'm realizing after some failed tests, however, that the bug messes up the dtype information as well. For example, adding "\x1a" to the end of
data = """A,B,C
1,2,a
4,5,b
"""
changes the dtypes of the Series from [int, int, object] to [object, float, object], because of the string in the first column and the NaN's after it.
I'm not familiar with any way to reclaim the original dtypes after dropping the bad row; from my limited understanding, it seems the proper way to do it would be at the C level in the parser.
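For concreteness, a minimal reproduction of the dtype change just described (a sketch using Python 3's io.StringIO rather than the thread's Python 2 imports; exact behavior may vary by pandas version):

import io
import pandas as pd

data = 'A,B,C\n1,2,a\n4,5,b\n\x1a'
df = pd.read_csv(io.StringIO(data))
# The trailing '\x1a' row drags column A to object and column B to float:
print(df.dtypes)  # per the report above: A object, B float64, C object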
Comment From: jreback
You can convert_objects() to re-infer the dtypes.
Yeah, this needs to be fixed at the C level.
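As an illustration, here's a sketch of that recovery after dropping the bad row. convert_objects() has since been deprecated, so this substitutes pd.to_numeric / astype, which is an assumption on my part rather than the API named above:

import io
import pandas as pd

df = pd.read_csv(io.StringIO('A,B,C\n1,2,a\n4,5,b\n\x1a'))
df = df.iloc[:-1].copy()           # drop the bogus '\x1a' row
df['A'] = pd.to_numeric(df['A'])   # object -> int64
df['B'] = df['B'].astype(int)      # float64 (the NaN is gone now) -> int64
print(df.dtypes)                   # A int64, B int64, C object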
Comment From: jcull
One way to get around this is filtering the input and giving read_csv a StringIO object:
import cStringIO
import itertools
import pandas as pd

with open('csv.csv', 'rb') as myfile:
    myfh = cStringIO.StringIO(''.join(itertools.ifilter(lambda x: x[0] != '\x1a', myfile)))
df = pd.read_csv(myfh)
This will allow read_csv to properly infer the data types of all the columns, so you don't have to convert them later, which is especially helpful if you have converters or datetime columns. It will probably increase your memory usage, though, because of the extra StringIO object.
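For anyone on Python 3, a sketch of the same idea (cStringIO and itertools.ifilter no longer exist there):

import io
import pandas as pd

with open('csv.csv') as myfile:
    # keep only lines that don't start with the '\x1a' EOF marker
    clean = ''.join(line for line in myfile if not line.startswith('\x1a'))
df = pd.read_csv(io.StringIO(clean))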
Comment From: gfyoung
The issue with "skipping" an EOF
on the last row is that you would have to either look ahead to check if you were on the last row OR check the entire last row all over again once you finished reading. In the average case, that's going to hurt performance.
The workaround for this would be to set skipfooter
:
>>> from pandas import read_csv
>>> from pandas.compat import StringIO
>>> data = 'a,b\r\n1,2\r\n3,4\r\n\x1a'
>>> df = read_csv(StringIO(data), engine='python', skipfooter=1)
>>> df
   a  b
0  1  2
1  3  4
>>> df.dtypes
a    int64
b    int64
dtype: object
In light of this, and the fact that "fixing it" is probably more detrimental than beneficial, I would recommend that this issue be closed.
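(A note for later readers: pandas.compat.StringIO went away when pandas dropped Python 2 support, so a sketch of the same workaround on Python 3 would be:)

import io
import pandas as pd

data = 'a,b\r\n1,2\r\n3,4\r\n\x1a'
# skipfooter is only supported by the slower Python engine
df = pd.read_csv(io.StringIO(data), engine='python', skipfooter=1)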
Comment From: gfyoung
No activity on this issue for about a year, and I had recommended it be closed then (because the proper fix is just skipfooter
support). Reading the conversation, I agree with my earlier self.