data = pd.read_hdf('large.hdf5', 'key', columns=['some', 'columns'])

Problem description

I saved a large dataframe with about 36 columns and ~20M rows to an HDF5 file using something like:

for chunk in pd.read_csv('foo.csv', chunksize=1000000):  # convert the CSV in chunks rather than all at once
    chunk.to_hdf('foo.h5', 'foo', format='t', append=True)

Reading a ~4 GB .h5 file with only 3 of its columns specified causes memory usage to balloon to over 10 GB. Calling data.info() I get:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20065041 entries, 0 to 20065040
Data columns (total 3 columns):
mp_no    int64
dt       datetime64[ns]
pap_r    float32
dtypes: datetime64[ns](1), float32(1), int64(1)
memory usage: 535.8 MB

Furthermore, the allocated memory seems to persist even after the object is deleted.
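
A minimal sketch of how the peak usage and the post-deletion residue can be measured, assuming psutil is available (psutil, the rss_mb helper, and the printed labels are illustrative additions, not part of the original report; the file name and column names are taken from the report above):

import gc
import os

import pandas as pd
import psutil  # assumed available; used only to read the process RSS

def rss_mb():
    # Resident set size of the current process, in MB
    return psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2

print('before read:', rss_mb())
data = pd.read_hdf('foo.h5', 'foo', columns=['mp_no', 'dt', 'pap_r'])
print('after read:', rss_mb())    # balloons well past the ~536 MB data.info() reports

del data
gc.collect()
print('after delete:', rss_mb())  # memory is not fully returned to the OS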

Expected Output

Memory usage shouldn't be much more than the memory needed for the data structure.

Output of pd.show_versions()

I'm running this in Jupyter: jupyter==1.0.0, jupyter-client==5.0.1, jupyter-console==5.1.0, jupyter-core==4.3.0

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-79-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 36.0.1
Cython: None
numpy: 1.12.1
scipy: 0.19.0
statsmodels: None
xarray: None
IPython: 6.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.2
matplotlib: 2.0.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.5.3
html5lib: 0.999999999
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
boto: None
pandas_datareader: None

Comment From: jreback

xref https://github.com/pandas-dev/pandas/issues/5329

pulling a sub-set of columns pulls ALL of the data into memory, then does a reindex. It is possible, though not easy, to select particular columns from a PyTables Table (which is what backs the store).

see docs here.

this is essentially an implementation detail.
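
One way to keep peak memory bounded given that constraint (a sketch of a workaround, not a fix in pandas itself; the chunksize value is illustrative): read the table in chunks via the chunksize argument of read_hdf and keep only the wanted columns from each chunk, so the full 36-column table is never materialised at once.

import pandas as pd

cols = ['mp_no', 'dt', 'pap_r']
pieces = []
# Each chunk still decodes all 36 columns, but only one chunk at a time,
# so peak memory stays near chunksize * row width instead of the whole table.
for chunk in pd.read_hdf('foo.h5', 'foo', chunksize=1000000):
    pieces.append(chunk[cols])
data = pd.concat(pieces, ignore_index=True)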