A small, complete example of the issue
When loading a file like the following, where the header names have trailing whitespace:
a ,b ,c
1,2,3
4,5,6
I would expect the following code to work and give the result shown under Expected Output:
pandas.read_table('test_data.csv', sep=',', usecols=['a', 'b'])
Expected Output
   a  b
0  1  2
1  4  5
Actual Output
Neither the c nor the python engine produces the expected result. The tracebacks have been concatenated for brevity.
/Users/rahulporuri/Github/pandas/pandas/io/parsers.pyc in __init__(self, src, **kwds)
1431
1432 if len(self.names) < len(self.usecols):
-> 1433 raise ValueError("Usecols do not match names.")
1434
1435 self._set_noconvert_columns()
ValueError: Usecols do not match names.
/Users/rahulporuri/Github/pandas/pandas/io/parsers.pyc in _handle_usecols(self, columns, usecols_key)
2199 for u in self.usecols:
2200 if isinstance(u, string_types):
-> 2201 col_indices.append(usecols_key.index(u))
2202 else:
2203 col_indices.append(u)
ValueError: 'a' is not in list
This is related to an issue reported earlier, https://github.com/pandas-dev/pandas/issues/14460, on stripping whitespace from columns/column names.
On a side note, if the file has column names with leading whitespace instead of trailing whitespace, adding the skipinitialspace=True kwarg to pandas.read_table produces the expected result.
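For illustration, here is a minimal sketch of the leading-whitespace case. It assumes an in-memory string in place of test_data.csv and uses pandas.read_csv rather than pandas.read_table with sep=','; the sample data is made up to mirror the file above.

```python
import io

import pandas as pd

# Hypothetical stand-in for a file whose header names have LEADING spaces.
data = "a, b, c\n1,2,3\n4,5,6\n"

# skipinitialspace=True drops whitespace that follows a delimiter,
# so the header names are read as 'a', 'b', 'c'.
df = pd.read_csv(io.StringIO(data), skipinitialspace=True, usecols=["a", "b"])
print(df)
```

Note that skipinitialspace only affects whitespace after the delimiter, so it does not help with the trailing-whitespace headers described in this issue.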
Output of pd.show_versions()
Comment From: jorisvandenbossche
This just boils down to the issue you already reported. Pandas does not strip whitespace from the columns, so your actual column names are 'a ' and 'b ' (so including the space), and consequently your usecols=['a', 'b'] does not work as it doesn't match the column names (using usecols=['a ', 'b '] works as expected).
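A small sketch of what this looks like in practice, assuming the file contents from the report are held in an in-memory string and read with pandas.read_csv instead of pandas.read_table:

```python
import io

import pandas as pd

# Same contents as the file in the report: header names carry a trailing space.
data = "a ,b ,c\n1,2,3\n4,5,6\n"

# Without usecols, the parsed names keep their trailing whitespace.
df = pd.read_csv(io.StringIO(data))
print(list(df.columns))

# Passing the names exactly as they appear in the file selects the columns.
subset = pd.read_csv(io.StringIO(data), usecols=["a ", "b "])
print(list(subset.columns))
```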
Comment From: rahulporuri
Yes. Do we want pandas to strip whitespace from column names/columns? If not by default, maybe using another kwarg to pandas.read_table?
Comment From: jreback
You can already do https://github.com/pandas-dev/pandas/pull/14234, or post-strip with .str.strip(), so not sure this is compelling.
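The post-strip route could look roughly like this. It is only a sketch: the data is an in-memory copy of the example file, and pandas.read_csv is used in place of pandas.read_table.

```python
import io

import pandas as pd

# In-memory copy of the example file with trailing spaces in the header.
data = "a ,b ,c\n1,2,3\n4,5,6\n"

# Load every column first, then clean up the names.
df = pd.read_csv(io.StringIO(data))
df.columns = df.columns.str.strip()

# Only now can the columns be selected by their intended names.
df = df[["a", "b"]]
print(df)
```

The cost of this approach is that all columns are parsed before the unwanted ones are dropped.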
Comment From: jorisvandenbossche
I think it is possibly useful, but that's already discussed in the other issue (#14460). So let's close this one.
Comment From: rahulporuri
Ohh, my bad. I didn't know that we could pass a callable to usecols. And agreed, we can strip after loading the dataframe and then drop unnecessary columns, but that is a workaround IMO.
Comment From: jorisvandenbossche
@rahulporuri the callables are at the moment only an enhancement PR, not in master or a released version.
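For reference, a sketch of how the callable form would be used, assuming a pandas version in which usecols accepts a callable that is evaluated against each column name. The variable name wanted and the in-memory data are illustrative only.

```python
import io

import pandas as pd

# In-memory copy of the example file with trailing spaces in the header.
data = "a ,b ,c\n1,2,3\n4,5,6\n"

# Hypothetical set of intended column names.
wanted = {"a", "b"}

# The callable receives each raw header name; stripping inside it lets
# 'a ' and 'b ' match without parsing the unwanted column 'c'.
df = pd.read_csv(io.StringIO(data), usecols=lambda name: name.strip() in wanted)

# The selected columns still carry the raw names, so strip them afterwards.
df.columns = df.columns.str.strip()
print(df)
```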
Comment From: maxima120
Pretty much all software I use, including Excel, writes column names in CSV with spaces: "a, b, c, ...". Pandas should be able to read these correctly, as the intended names of the columns are a b c. This is obvious behaviour and not a bug. Hence if pandas thinks they are "b " "c " etc., then that is a bug.
People often forget that a bug is not only when something blows up in your face. A bug is also when something behaves differently from what is reasonable to expect. And this is exactly the case.