xref #4749 xref #8985
This is a bit like issue #4749, but the conditions are different.
I have been able to reproduce it by specifying usecols:
In [98]: mydata='1,2,3\n1,2\n1,2\n'
In [99]: pd.read_csv(StringIO(mydata), names=['a', 'b', 'c'], usecols=['a', 'c'])
Out[99]:
   a    c
0  1    3
1  1  NaN
2  1  NaN
[3 rows x 2 columns]
In [100]: mydata='1,2\n1,2,3\n4,5,6\n'
In [101]: pd.read_csv(StringIO(mydata), names=['a', 'b', 'c'], usecols=['a', 'c'])
---------------------------------------------------------------------------
CParserError Traceback (most recent call last)
<ipython-input-101-f60127771eeb> in <module>()
----> 1 pd.read_csv(StringIO(mydata), names=['a', 'b', 'c'], usecols=['a', 'c'])
/home/altaurog/venv/p27/local/lib/python2.7/site-packages/pandas/io/parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, na_fvalues, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format)
418 infer_datetime_format=infer_datetime_format)
419
--> 420 return _read(filepath_or_buffer, kwds)
421
422 parser_f.__name__ = name
/home/altaurog/venv/p27/local/lib/python2.7/site-packages/pandas/io/parsers.pyc in _read(filepath_or_buffer, kwds)
223 return parser
224
--> 225 return parser.read()
226
227 _parser_defaults = {
/home/altaurog/venv/p27/local/lib/python2.7/site-packages/pandas/io/parsers.pyc in read(self, nrows)
624 raise ValueError('skip_footer not supported for iteration')
625
--> 626 ret = self._engine.read(nrows)
627
628 if self.options.get('as_recarray'):
/home/altaurog/venv/p27/local/lib/python2.7/site-packages/pandas/io/parsers.pyc in read(self, nrows)
1068
1069 try:
-> 1070 data = self._reader.read(nrows)
1071 except StopIteration:
1072 if nrows is None:
/home/altaurog/venv/p27/local/lib/python2.7/site-packages/pandas/parser.so in pandas.parser.TextReader.read (pandas/parser.c:6866)()
/home/altaurog/venv/p27/local/lib/python2.7/site-packages/pandas/parser.so in pandas.parser.TextReader._read_low_memory (pandas/parser.c:7086)()
/home/altaurog/venv/p27/local/lib/python2.7/site-packages/pandas/parser.so in pandas.parser.TextReader._read_rows (pandas/parser.c:7691)()
/home/altaurog/venv/p27/local/lib/python2.7/site-packages/pandas/parser.so in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:7575)()
/home/altaurog/venv/p27/local/lib/python2.7/site-packages/pandas/parser.so in pandas.parser.raise_parser_error (pandas/parser.c:19038)()
CParserError: Error tokenizing data. C error: Expected 2 fields in line 2, saw 3
In [102]: pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.2.0-4-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.13.1
Cython: 0.20
numpy: 1.8.0
scipy: None
statsmodels: None
IPython: 1.2.1
sphinx: 1.1.3
patsy: None
scikits.timeseries: None
dateutil: 1.5
pytz: 2012c
bottleneck: 0.8.0
tables: 3.1.0
numexpr: 2.3.1
matplotlib: 1.3.1
openpyxl: None
xlrd: None
xlwt: 0.7.4
xlsxwriter: None
sqlalchemy: None
lxml: None
bs4: None
html5lib: 0.999
bq: None
apiclient: None
Comment From: jreback
Hmm... I think the error is useful here, rather than providing the short-row fill behavior. You said the header was 3 fields (e.g. by specifying names and then using those columns).
What would you have the results be?
Comment From: altaurog
I don't see why the behavior when the first row is short should differ from when a later row is short. If I specify three columns in the header, then I want three columns. Any row which has fewer values than that should be filled with NaN, assuming the dtype supports it, thus:
In [100]: mydata='1,2\n1,2,3\n4,5,6\n'
In [101]: pd.read_csv(StringIO(mydata), names=['a', 'b', 'c'], usecols=['a', 'c'])
Out[101]:
   a    c
0  1  NaN
1  1    3
2  4    6
[3 rows x 2 columns]
The current behavior is inconsistent and frankly not very useful. My understanding from the docs is that the intention is to handle real-world data files intelligently and gracefully so we can go about getting our work done. Raising an exception in this case is neither intelligent nor graceful.
In any case, I might as well mention that the obvious workaround is to omit usecols and slice the DataFrame afterwards:
In [100]: mydata='1,2\n1,2,3\n4,5,6\n'
In [101]: df = pd.read_csv(StringIO(mydata), names=['a', 'b', 'c'])[['a', 'c']]
The result is the same, but this approach sacrifices the "much faster parsing time and lower memory usage" which we were promised with usecols.
Comment From: jreback
The difference is that the first row is a header, and this is a misspecification of the names, so you are asking pandas to decide which parameter is valid here. I think that is a good reason to raise an exception.
That said, I would consider a pull request to modify this to just take the names (and maybe just issue a warning). I think as a user you would want to know that your file is malformed?
Otherwise you would have set header=None and just used skiprows, no?
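For illustration, that alternative call might look something like the line below; skiprows=1 and the positional usecols=[0, 2] are assumptions sketching the suggestion, not taken from the thread:
pd.read_csv(StringIO(mydata), header=None, skiprows=1, usecols=[0, 2])  # skip the short first line, select columns by position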
Comment From: altaurog
Thanks for your explanation.
In this case, the first row is not a header and the file is not malformed any more than it would be if subsequent lines were short. Perhaps I misunderstood, but I was under the impression that header is set to None implicitly when I specify names in the call to read_csv. In any case, the exception is raised even with an explicit header=None.
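For reference, a sketch of that explicit call (same mydata as In [100] above; per the report, on 0.13.1 it raises the same CParserError shown earlier):
pd.read_csv(StringIO(mydata), header=None, names=['a', 'b', 'c'], usecols=['a', 'c'])
# CParserError: Error tokenizing data. C error: Expected 2 fields in line 2, saw 3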
Comment From: jreback
Hmm... that could be a bug then if header=None. Want to try to fix this?
There is a complex interaction between these parameters; this is most probably an untested case.
Comment From: altaurog
I'm new to pandas. (I am really impressed and very much enjoying it, I should add.) I would love to fix this, but I had a look at the code and wasn't quite able to follow it. Unfortunately, I don't have the time right now to track this down and fix it. The performance benefits of usecols aren't a big loss for me, so I'm just going to press ahead with the workaround I mentioned above. The fact that it worked without usecols seemed to me to indicate a bug, though, so I thought I should at least create an issue for it even if I can't fix it. Now you have one fewer untested case.
Comment From: jreback
@altaurog great...thanks for reporting. marked as a bug.
Comment From: evanpw
The case where usecols is set actually has explicitly different behavior (in parser.pyx):
# Enforce this unless usecols
if not self.has_usecols:
    self.parser.expected_fields = len(self.names)
Do we want to get rid of this check? I don't see a good reason for it (and all tests pass without it), but maybe I'm missing something.
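For intuition, here is a rough pure-Python model of what that enforcement changes; it is a sketch only, not the actual C tokenizer. With expected_fields pinned to len(names), a short first row is simply padded; with it left to be inferred from the first row (the usecols path), a longer later row triggers the same "Expected N fields" error seen above:

def tokenize(lines, expected_fields=None):
    # Sketch of the tokenizer's field-count handling.
    rows = []
    for lineno, line in enumerate(lines, start=1):
        fields = line.split(',')
        if expected_fields is None:
            # No enforcement: infer the width from the first row.
            expected_fields = len(fields)
        if len(fields) > expected_fields:
            raise ValueError('Expected %d fields in line %d, saw %d'
                             % (expected_fields, lineno, len(fields)))
        # Pad short rows out to the expected width.
        fields += [None] * (expected_fields - len(fields))
        rows.append(fields)
    return rows

tokenize(['1,2', '1,2,3', '4,5,6'], expected_fields=3)  # first row padded, no error
tokenize(['1,2', '1,2,3', '4,5,6'])  # raises: Expected 2 fields in line 2, saw 3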
Comment From: gfyoung
An update here: seems like the C engine is happy, but the Python engine is a little sad:
>>> from pandas import read_csv
>>> from pandas.compat import StringIO
>>>
>>> data = '1,2\n1,2,3'
>>> usecols = ['a', 'c']
>>> names = ['a', 'b', 'c']
>>>
>>> read_csv(StringIO(data), usecols=usecols, names=names, engine='c')
   a    c
0  1  NaN
1  1  3.0
>>> read_csv(StringIO(data), usecols=usecols, names=names, engine='python')
...
ValueError: Number of passed names did not match number of header fields in the file
I suspect the error in the Python engine is being raised because we only check the first row and compare it to the header.
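If so, the failing check would reduce to something like this hypothetical simplification (not the Python engine's actual code; data and names are as in the snippet above):

first_row = data.split('\n')[0].split(',')
if len(names) != len(first_row):
    # Only the first row is consulted, so a short first row trips this
    # even though a later row matches the passed names.
    raise ValueError('Number of passed names did not match number of '
                     'header fields in the file')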
Comment From: dmitriyshashkin
Strangely enough, I'm having the same problem in 1.3.5.