data = pd.read_hdf('large.hdf5', 'key', columns=['some', 'columns'])
Problem description
I saved a large DataFrame with about 36 columns and ~20M rows as an HDF5 file using something like:
for chunk in pd.read_csv('foo.csv', chunksize=1000000):  # some chunksize; the exact value isn't important
    chunk.to_hdf('foo.h5', 'foo', format='t', append=True)
Reading a ~4 GB .h5 file with only 3 of the columns specified causes memory usage to balloon to over 10 GB. Calling df.info(), I get:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 20065041 entries, 0 to 20065040
Data columns (total 3 columns):
mp_no int64
dt datetime64[ns]
pap_r float32
dtypes: datetime64[ns](1), float32(1), int64(1)
memory usage: 535.8 MB
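(That 535.8 MB figure is consistent with the dtypes: 20,065,041 rows × 28 bytes per row — 8 for int64, 8 for datetime64[ns], 4 for float32, and 8 for the Int64Index — is about 562 million bytes, i.e. ~536 MiB.)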
Furthermore, the allocated memory seems to persist even after the object is deleted.
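A rough way to observe this (a sketch only; psutil is just what I'd use to read the process RSS, it isn't part of the original code):
import gc
import psutil
import pandas as pd

proc = psutil.Process()
print(proc.memory_info().rss / 1024**2)   # MB resident before the read

data = pd.read_hdf('foo.h5', 'foo', columns=['mp_no', 'dt', 'pap_r'])
print(proc.memory_info().rss / 1024**2)   # jumps far past the ~536 MB the frame itself needs

del data
gc.collect()
print(proc.memory_info().rss / 1024**2)   # RSS stays elevated even after deleting the frame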
Expected Output
Memory usage shouldn't be much more than the memory needed for the data structure.
Output of pd.show_versions()
I'm running this in Jupyter.
jupyter==1.0.0
jupyter-client==5.0.1
jupyter-console==5.1.0
jupyter-core==4.3.0
INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-79-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 36.0.1
Cython: None
numpy: 1.12.1
scipy: 0.19.0
statsmodels: None
xarray: None
IPython: 6.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.2
matplotlib: 2.0.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.5.3
html5lib: 0.999999999
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
boto: None
pandas_datareader: None
Comment From: jreback
xref https://github.com/pandas-dev/pandas/issues/5329
Pulling a subset of columns pulls ALL of the data into memory, then does a reindex. It is possible, though not easy, to select particular columns from a PyTables table (which is what backs the store).
see docs here.
This is essentially an implementation detail.
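A sketch of one possible mitigation (assuming the store is in table format, as in the report above): iterate with chunksize so that only one chunk of full-width rows is materialized at a time, rather than all ~20M rows at once.
import pandas as pd

pieces = []
# each chunk reads a row range of the table, then reindexes down to the
# requested columns before the next range is read
for chunk in pd.read_hdf('foo.h5', 'foo', columns=['mp_no', 'dt', 'pap_r'],
                         chunksize=1000000):
    pieces.append(chunk)
data = pd.concat(pieces)
Peak memory is then roughly one chunk's worth of all 36 columns plus the accumulated 3-column result, instead of the entire 36-column frame.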