pd.read_table can make use of usecols to reduce the memory footprint. usecols accepts a list of labels, but this appears to work only with the "python" engine. Using usecols with the "c" engine results in the following error:
>>> pd.read_table('/nonsense', sep=':', usecols=['test'])
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 646, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 389, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 730, in __init__
    self._make_engine(self.engine)
  File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 923, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 1434, in __init__
    raise ValueError("Usecols do not match names.")
ValueError: Usecols do not match names.
usecols does work with labels if you specify the python engine manually.
We should implement usecols with labels for the "c" engine as well.
The python engine is much slower than the "c" engine and, because it allocates its buffers in Python, it actually uses far more memory, even though it supports usecols.
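As a rough illustration of the footprint argument, the sketch below compares the size of the resulting DataFrame with and without usecols. The sample data is made up, and columns are selected by position so the check does not depend on label support. It measures only the final frame, not the parser's peak allocation, so it says nothing about the difference between engines.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample data; any ':'-separated table with a header would do.
DATA = "a:b:c\n1:2:3\n4:5:6\n"

# Read every column, then only the second one (selected by position).
full = pd.read_table(StringIO(DATA), sep=":")
subset = pd.read_table(StringIO(DATA), sep=":", usecols=[1])

# memory_usage(deep=True) reports bytes per column, plus the index.
full_bytes = int(full.memory_usage(deep=True).sum())
subset_bytes = int(subset.memory_usage(deep=True).sum())
print(list(subset.columns), subset_bytes, "<", full_bytes)
```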
Comment From: chris-b1
I think you might have an error in your data? The c engine does support usecols:
from io import StringIO
import pandas as pd

pd.read_table(StringIO("""a:b:c
1:2:3
4:5:6"""), sep=':', usecols=['b'], engine='c')
Out[125]:
   b
0  2
1  5
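To check this on another installation, a minimal sketch along the following lines runs the same read under both engines and then asks for a label that is not in the header. The helper read_with_labels and the sample data are made up for illustration. On recent pandas versions I would expect the missing label to raise a ValueError, though the exact message differs between versions.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample data mirroring the example above.
DATA = "a:b:c\n1:2:3\n4:5:6\n"


def read_with_labels(engine, labels):
    """Hypothetical helper: read DATA, selecting columns by label."""
    return pd.read_table(StringIO(DATA), sep=":", usecols=labels, engine=engine)


# Select column "b" by label under each engine.
results = {engine: read_with_labels(engine, ["b"]) for engine in ("c", "python")}
for engine, frame in results.items():
    print(engine, list(frame.columns), frame["b"].tolist())

# A label that is absent from the header is rejected with a ValueError.
try:
    read_with_labels("c", ["test"])
    missing_label_error = None
except ValueError as exc:
    missing_label_error = exc
print(type(missing_label_error).__name__)
```

If the two engines disagree, or the first call already fails, that would point at the environment or the input file rather than at usecols itself.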
Comment From: wavexx
On Thu, Jun 08 2017, chris-b1 wrote:
> I think you might have an error in your data? The c engine does support usecols:
Hm, after closing my Python session and restarting from scratch, I really cannot reproduce this myself.
Weird, as I debugged this for quite a while, but I guess something was off in my session. Sorry for the noise.