Pandas BUG: read_csv() does strange things if data rows have trailing delimiter

Given a file t.txt like this:

A,B,C
1,4,7,
2,5,8,
3,6,9,

pandas.read_csv('t.txt') produces:

   A  B   C
1  4  7 NaN
2  5  8 NaN
3  6  9 NaN

In other words, it implies a nonexistent header column for what it believes is the index. I would expect it to do the same as this code:

pd.read_csv('t.txt', index_col=False).set_index('A')

Which is:

The first example has the same problem even if you specify index_col=0, which is confusing because saying the first column should be used as the index is pretty much what's being done in the second example.

Files like this are commonly produced by vendors like Bloomberg whose formats we cannot expect to change.

Pandas 0.18.1.

Comment From: jorisvandenbossche

@jzwinck From the docstring on the index_col keyword

If you have a malformed file with delimiters at the end of each line, you might consider index_col=False to force pandas to not use the first column as the index (row names)

So doing

In [1]: s = """A,B,C
   ...: 1,4,7,
   ...: 2,5,8,
   ...: 3,6,9,"""

In [4]: pd.read_csv(StringIO(s))
Out[4]: 
   A  B   C
1  4  7 NaN
2  5  8 NaN
3  6  9 NaN

In [5]: pd.read_csv(StringIO(s), index_col=False)
Out[5]: 
   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9

gives you the desired result.

I agree it is a bit confusing / solution hard to find, but I think this was originally added as a feature to deal with such formatted files (where the index was included but did not get a name, so where the header row has one element less that the data rows).

Comment From: jzwinck

@jorisvandenbossche I think if the user explicitly requests index_col=0 the current behavior is totally bizarre. The user asked for the first column to be the index but instead all the headers are shifted. This is a very poor user experience.

Comment From: jorisvandenbossche

I think if the user explicitly requests index_col=0 the current behavior is totally bizarre.

Well, in any case the first column is used as the index values, only not with the index name that you expect.

I agree that, depending on what type of formatted csv files you mostly use, this can be confusing user experience. But I just want to point out that this behaviour is intentional and documented (see also http://pandas.pydata.org/pandas-docs/stable/io.html#index-columns-and-trailing-delimiters), so in any case not a bug. And given that fact, I think changing this default behaviour would not be a good idea IMO.

And I agree that there is currently not a good way to both ignore the trailing comma's (using index_col=False and setting the first column as the index (as index_col is already used for the first part ..). Apart from doing it in two steps of course with pd.read_csv(..., index_col=False).set_index('A')

Comment From: jorisvandenbossche

Closing this as a usage question. @jzwinck as I said before, I agree this is not ideal behaviour for many cases, but it is the documented behaviour. If you have suggestions to improve the documentation or ways to signal this, very welcome!

Comment From: smcinerney

@jorisvandenbossche : but the read_csv docpage doesn't mention it anywhere. So it's undocumented behavior!

Can you reopen this as DOCBUG and simply add something to the read_csv doc page:

"Trailing delimiters on data rows (i.e. non-header rows) confuse how the parser handles header and data rows - check you don't have them".

This sort of troubleshooting info really needs to be directly on the read_csv page itself, and not incorporated-by-reference over on the io docpage. Especially since it doesn't fit nicely under any one single parameter of read_csv