xref #4749 xref #8985
This is a bit like issue #4749, but the conditions are different.
I have been able to reproduce it by specifying usecols:
In [98]: mydata='1,2,3\n1,2\n1,2\n'
In [99]: pd.read_csv(StringIO(mydata), names=['a', 'b', 'c'], usecols=['a', 'c'])
Out[99]:
   a    c
0  1    3
1  1  NaN
2  1  NaN
[3 rows x 2 columns]
In [100]: mydata='1,2\n1,2,3\n4,5,6\n'
In [101]: pd.read_csv(StringIO(mydata), names=['a', 'b', 'c'], usecols=['a', 'c'])
---------------------------------------------------------------------------
CParserError Traceback (most recent call last)
<ipython-input-101-f60127771eeb> in <module>()
----> 1 pd.read_csv(StringIO(mydata), names=['a', 'b', 'c'], usecols=['a', 'c'])
/home/altaurog/venv/p27/local/lib/python2.7/site-packages/pandas/io/parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, na_fvalues, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format)
418 infer_datetime_format=infer_datetime_format)
419
--> 420 return _read(filepath_or_buffer, kwds)
421
422 parser_f.__name__ = name
/home/altaurog/venv/p27/local/lib/python2.7/site-packages/pandas/io/parsers.pyc in _read(filepath_or_buffer, kwds)
223 return parser
224
--> 225 return parser.read()
226
227 _parser_defaults = {
/home/altaurog/venv/p27/local/lib/python2.7/site-packages/pandas/io/parsers.pyc in read(self, nrows)
624 raise ValueError('skip_footer not supported for iteration')
625
--> 626 ret = self._engine.read(nrows)
627
628 if self.options.get('as_recarray'):
/home/altaurog/venv/p27/local/lib/python2.7/site-packages/pandas/io/parsers.pyc in read(self, nrows)
1068
1069 try:
-> 1070 data = self._reader.read(nrows)
1071 except StopIteration:
1072 if nrows is None:
/home/altaurog/venv/p27/local/lib/python2.7/site-packages/pandas/parser.so in pandas.parser.TextReader.read (pandas/parser.c:6866)()
/home/altaurog/venv/p27/local/lib/python2.7/site-packages/pandas/parser.so in pandas.parser.TextReader._read_low_memory (pandas/parser.c:7086)()
/home/altaurog/venv/p27/local/lib/python2.7/site-packages/pandas/parser.so in pandas.parser.TextReader._read_rows (pandas/parser.c:7691)()
/home/altaurog/venv/p27/local/lib/python2.7/site-packages/pandas/parser.so in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:7575)()
/home/altaurog/venv/p27/local/lib/python2.7/site-packages/pandas/parser.so in pandas.parser.raise_parser_error (pandas/parser.c:19038)()
CParserError: Error tokenizing data. C error: Expected 2 fields in line 2, saw 3
In [102]: pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.2.0-4-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.13.1
Cython: 0.20
numpy: 1.8.0
scipy: None
statsmodels: None
IPython: 1.2.1
sphinx: 1.1.3
patsy: None
scikits.timeseries: None
dateutil: 1.5
pytz: 2012c
bottleneck: 0.8.0
tables: 3.1.0
numexpr: 2.3.1
matplotlib: 1.3.1
openpyxl: None
xlrd: None
xlwt: 0.7.4
xlsxwriter: None
sqlalchemy: None
lxml: None
bs4: None
html5lib: 0.999
bq: None
apiclient: None
Comment From: jreback
Hmm... I think the error is useful here, rather than providing the short-row fill behavior. You said the header was 3 fields (e.g. by specifying names and then using those columns).
What would you have the results be?
Comment From: altaurog
I don't see why the behavior when the first row is short should differ from when a later row is short. If I specify three columns in the header, then I want three columns. Any row which has fewer values than that should be filled with NaN, assuming the dtype supports it, thus:
In [100]: mydata='1,2\n1,2,3\n4,5,6\n'
In [101]: pd.read_csv(StringIO(mydata), names=['a', 'b', 'c'], usecols=['a', 'c'])
Out[101]:
   a    c
0  1  NaN
1  1    3
2  4    6
[3 rows x 2 columns]
The current behavior is inconsistent and frankly not very useful. My understanding from the docs is that the intention is to handle real-world data files intelligently and gracefully so we can go about getting our work done. Raising an exception in this case is neither intelligent nor graceful.
In any case, I might as well mention that the obvious workaround is to omit usecols and slice the DataFrame afterwards:
In [100]: mydata='1,2\n1,2,3\n4,5,6\n'
In [101]: df = pd.read_csv(StringIO(mydata), names=['a', 'b', 'c'])[['a', 'c']]
The result is the same, but this approach sacrifices the "much faster parsing time and lower memory usage" which we were promised with usecols.
Comment From: jreback
The difference is that the first row is a header, and this is a misspecification of the names, so you are asking pandas to decide which parameter is valid here. I think that is a good reason to raise an exception.
That said, I would consider a pull request to modify this to just take the names (and maybe just issue a warning). I think as a user you would want to know that your file is malformed?
Otherwise you would have set header=None and just used skiprows, no?
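For illustration, that alternative call might look something like the line below; skiprows=1 and the positional usecols=[0, 2] are assumptions sketching the suggestion, not taken from the thread:
pd.read_csv(StringIO(mydata), header=None, skiprows=1, usecols=[0, 2])  # skip the short first line, select columns by position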
Comment From: altaurog
Thanks for your explanation.
In this case, the first row is not a header and the file is not malformed any more than it would be if subsequent lines were short. Perhaps I misunderstood, but I was under the impression that header is set to None implicitly when I specify names in the call to read_csv. In any case, the exception is raised even with an explicit header=None.
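For reference, a sketch of that explicit call (same mydata as In [100] above; per the report, on 0.13.1 it raises the same CParserError shown earlier):
pd.read_csv(StringIO(mydata), header=None, names=['a', 'b', 'c'], usecols=['a', 'c'])
# CParserError: Error tokenizing data. C error: Expected 2 fields in line 2, saw 3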
Comment From: jreback
Hmm... that could be a bug then if header=None. Want to try to fix this?
There is a complex interaction between these parameters; this is most probably an untested case.
Comment From: altaurog
I'm new to pandas. (I am really impressed and very much enjoying it, I should add.) I would love to fix this, but I had a look at the code and wasn't quite able to follow it. Unfortunately, I don't have the time right now to track this down and fix it. The performance benefits of usecols aren't a big loss for me, so I'm just going to press ahead with the workaround I mentioned above. The fact that it worked without usecols seemed to me to indicate a bug, though, so I thought I should at least create an issue for it even if I can't fix it. Now you have one fewer untested case.
Comment From: jreback
@altaurog great...thanks for reporting. marked as a bug.
Comment From: evanpw
The case where usecols is set actually has explicitly different behavior (in parser.pyx):
# Enforce this unless usecols
if not self.has_usecols:
    self.parser.expected_fields = len(self.names)
Do we want to get rid of this check? I don't see a good reason for it (and all tests pass without it), but maybe I'm missing something.
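For intuition, here is a rough pure-Python model of what that enforcement changes; it is a sketch only, not the actual C tokenizer. With expected_fields pinned to len(names), a short first row is simply padded; with it left to be inferred from the first row (the usecols path), a longer later row triggers the same "Expected N fields" error seen above:

def tokenize(lines, expected_fields=None):
    # Sketch of the tokenizer's field-count handling.
    rows = []
    for lineno, line in enumerate(lines, start=1):
        fields = line.split(',')
        if expected_fields is None:
            # No enforcement: infer the width from the first row.
            expected_fields = len(fields)
        if len(fields) > expected_fields:
            raise ValueError('Expected %d fields in line %d, saw %d'
                             % (expected_fields, lineno, len(fields)))
        # Pad short rows out to the expected width.
        fields += [None] * (expected_fields - len(fields))
        rows.append(fields)
    return rows

tokenize(['1,2', '1,2,3', '4,5,6'], expected_fields=3)  # first row padded, no error
tokenize(['1,2', '1,2,3', '4,5,6'])  # raises: Expected 2 fields in line 2, saw 3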
Comment From: gfyoung
An update here: seems like the C engine is happy, but the Python engine is a little sad:
>>> from pandas import read_csv
>>> from pandas.compat import StringIO
>>>
>>> data = '1,2\n1,2,3'
>>> usecols = ['a', 'c']
>>> names = ['a', 'b', 'c']
>>>
>>> read_csv(StringIO(data), usecols=usecols, names=names, engine='c')
   a    c
0  1  NaN
1  1  3.0
>>> read_csv(StringIO(data), usecols=usecols, names=names, engine='python')
...
ValueError: Number of passed names did not match number of header fields in the file
I suspect the error in the Python engine is being raised because we only check the first row and compare it to the header.
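If so, the failing check would reduce to something like this hypothetical simplification (not the Python engine's actual code; data and names are as in the snippet above):

first_row = data.split('\n')[0].split(',')
if len(names) != len(first_row):
    # Only the first row is consulted, so a short first row trips this
    # even though a later row matches the passed names.
    raise ValueError('Number of passed names did not match number of '
                     'header fields in the file')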
Comment From: dmitriyshashkin
Strangely enough, I'm having the same problem in 1.3.5.