Pandas Strip whitespace from column names when usecols in read_csv

A small, complete example of the issue

when loading a file of the type, where headers have a trailing whitespace,

a ,b ,c 
1,2,3
4,5,6

I would expect the following code to work and give the result

pandas.read_table('test_data.csv', sep=',', usecols=['a', 'b'])

Expected Output

   a  b
0  1  2
1  4  5

Actual Output

Neither the c nor the python engine produce the expected result. the tracebacks have been concatenated for brevity.

/Users/rahulporuri/Github/pandas/pandas/io/parsers.pyc in __init__(self, src, **kwds)
   1431 
   1432             if len(self.names) < len(self.usecols):
-> 1433                 raise ValueError("Usecols do not match names.")
   1434 
   1435         self._set_noconvert_columns()

ValueError: Usecols do not match names.

/Users/rahulporuri/Github/pandas/pandas/io/parsers.pyc in _handle_usecols(self, columns, usecols_key)
   2199                 for u in self.usecols:
   2200                     if isinstance(u, string_types):
-> 2201                         col_indices.append(usecols_key.index(u))
   2202                     else:
   2203                         col_indices.append(u)

ValueError: 'a' is not in list

This is related to an issue reported earlier https://github.com/pandas-dev/pandas/issues/14460 on stripping columns/column names of whitespaces.

On a side note, if the file has column names with leading whitespaces instead of trailing whitespaces, adding the skipinitialspace=True kwarg to pandas.read_table produces the expected result.

Output of `pd.show_versions()`

## INSTALLED VERSIONS commit: 794f79295484298525dbc8dd3b8ab251ad065e61 python: 2.7.11.final.0 python-bits: 64 OS: Darwin OS-release: 16.0.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: en_US.UTF-8 LANG: en_US.UTF-8 LOCALE: None.None pandas: 0.19.0+27.g794f792 nose: None pip: 8.1.2 setuptools: 23.1.0 Cython: 0.24 numpy: 1.10.4 scipy: 0.17.1 statsmodels: None xarray: 0.7.2 IPython: 4.1.2 sphinx: 1.4.1 patsy: None dateutil: 2.5.2 pytz: 2016.7 blosc: None bottleneck: None tables: 3.2.2 numexpr: 2.6.0 matplotlib: 1.5.1 openpyxl: None xlrd: None xlwt: None xlsxwriter: None lxml: None bs4: None html5lib: None httplib2: None apiclient: None sqlalchemy: None pymysql: None psycopg2: None jinja2: 2.8 boto: None pandas_datareader: None

Comment From: jorisvandenbossche

This just boils down to the issue you already reported. Pandas does not strip whitespace from the columns, so your actual column names are 'a ' and 'b ' (so including the space), and consequently your usecols=['a', 'b'] do not work as they don't match the column names (using usecols=['a ', 'b '] works as expected).

Comment From: rahulporuri

yes. Do we want pandas to strip whitespace from column names/columns? if not by default, maybe using another kwarg to pandas.read_table

Comment From: jreback

you can already do https://github.com/pandas-dev/pandas/pull/14234, or post-strip with .str.strip(). so not sure this is compelling.

Comment From: jorisvandenbossche

I think it is possibly useful, but that's already discussed in the other issue (#14460). So let's close this one.

Comment From: rahulporuri

ohh. my bad. i didn't know that we could pass a callable to usecols. and agreed, we can strip post loading the dataframe and then drop unnecessary columns but that is a workaround IMO.

Comment From: jorisvandenbossche

@rahulporuri the callables are at the moment only a PR for enhancement, not in master or released version.

Comment From: maxima120

pretty much all software i use including excel write column names in csv with spaces: "a, b, c, ...". Pandas should be able to read these correctly as intended names of columns are a b c. This is obvious behaviour and not a bug. Hence if pandas think its "b " "c " etc then that is a bug.

People often forget that bug is not only when something blows up in your face. Bug is also when something behaves differently from what is reasonable to expect. And this is exactly the case.

Pandas Strip whitespace from column names when usecols in read_csv

A small, complete example of the issue

Expected Output

Actual Output

Output of pd.show_versions()

Output of `pd.show_versions()`