data = pd.read_hdf('large.hdf5', 'key', columns=['some', 'columns'])
Problem description
I saved a large DataFrame with about 36 columns and ~20M rows as an HDF5 file using something like:
for chunk in pd.read_csv('foo.csv', chunksize=1000000):  # some chunksize; the exact value isn't important
    chunk.to_hdf('foo.h5', 'foo', format='t', append=True)
Reading a ~4 GB .h5 file with only 3 of the columns specified causes memory usage to balloon to over 10 GB. Calling df.info(), I get:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 20065041 entries, 0 to 20065040
Data columns (total 3 columns):
mp_no int64
dt datetime64[ns]
pap_r float32
dtypes: datetime64[ns](1), float32(1), int64(1)
memory usage: 535.8 MB
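(That 535.8 MB figure is consistent with the dtypes: 20,065,041 rows × 28 bytes per row — 8 for int64, 8 for datetime64[ns], 4 for float32, and 8 for the Int64Index — is about 562 million bytes, i.e. ~536 MiB.)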
Furthermore, the allocated memory seems to persist even after the object is deleted.
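A rough way to observe this (a sketch only; psutil is just what I'd use to read the process RSS, it isn't part of the original code):
import gc
import psutil
import pandas as pd

proc = psutil.Process()
print(proc.memory_info().rss / 1024**2)   # MB resident before the read

data = pd.read_hdf('foo.h5', 'foo', columns=['mp_no', 'dt', 'pap_r'])
print(proc.memory_info().rss / 1024**2)   # jumps far past the ~536 MB the frame itself needs

del data
gc.collect()
print(proc.memory_info().rss / 1024**2)   # RSS stays elevated even after deleting the frame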
Expected Output
Memory usage shouldn't be much more than the memory needed for the data structure.
Output of pd.show_versions()
I'm running this in Jupyter.
jupyter==1.0.0
jupyter-client==5.0.1
jupyter-console==5.1.0
jupyter-core==4.3.0
INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-79-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 36.0.1
Cython: None
numpy: 1.12.1
scipy: 0.19.0
statsmodels: None
xarray: None
IPython: 6.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.2
matplotlib: 2.0.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.5.3
html5lib: 0.999999999
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
boto: None
pandas_datareader: None
Comment From: jreback
xref https://github.com/pandas-dev/pandas/issues/5329
Pulling a subset of columns pulls ALL of the data into memory, then does a reindex. It is possible, though not easy, to select particular columns from a PyTables table (which is what backs the store).
see docs here.
This is essentially an implementation detail.
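A sketch of one possible mitigation (assuming the store is in table format, as in the report above): iterate with chunksize so that only one chunk of full-width rows is materialized at a time, rather than all ~20M rows at once.
import pandas as pd

pieces = []
# each chunk reads a row range of the table, then reindexes down to the
# requested columns before the next range is read
for chunk in pd.read_hdf('foo.h5', 'foo', columns=['mp_no', 'dt', 'pap_r'],
                         chunksize=1000000):
    pieces.append(chunk)
data = pd.concat(pieces)
Peak memory is then roughly one chunk's worth of all 36 columns plus the accumulated 3-column result, instead of the entire 36-column frame.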