Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10000, 4), columns=list('ABCD'))

df.to_hdf('no.string.save.once.h5', 'w')

df.to_hdf('no.string.save.twice.h5', 'w')
df.to_hdf('no.string.save.twice.h5', 'w')

# add string columns

df2 = df.copy()
df2['E'] = 'this_is_a_string'
df2['F'] = 'this is another string'

df2.to_hdf('with.string.save.once.h5', 'w')

df2.to_hdf('with.string.save.twice.h5', 'w')
df2.to_hdf('with.string.save.twice.h5', 'w')

df2.to_hdf('with.string.save.three-times.h5', 'w')
df2.to_hdf('with.string.save.three-times.h5', 'w')
df2.to_hdf('with.string.save.three-times.h5', 'w')

Problem description

In the above example, df is a DataFrame with no string columns. It is saved once to the path 'no.string.save.once.h5' and twice to the path 'no.string.save.twice.h5'. The file sizes are the same, which is what we expect:

407208 no.string.save.once.h5
407208 no.string.save.twice.h5

However, if the DataFrame has string columns (e.g. df2), then each additional save to the same path increases the file size:

1979728 with.string.save.once.h5
2501240 with.string.save.twice.h5
3022752 with.string.save.three-times.h5

This surprises me because it seems to be saying that before I save an h5 file in "write" mode, I should really make sure that file is deleted from disk. Otherwise, if I keep writing to the same path multiple times, the file size will keep blowing up.
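The delete-before-write workaround described above can be sketched as follows. This is only an illustration: `write_fresh` and its `writer` argument are hypothetical names, not part of pandas, and the sketch assumes nothing else has the file open.

```python
import os


def write_fresh(path, writer):
    """Remove `path` if it already exists, then let `writer` create it anew.

    `writer` is any callable that takes a path and writes a file there
    (hypothetical helper, not a pandas API).
    """
    if os.path.exists(path):
        os.remove(path)
    writer(path)
```

For the example in this report, the call would look like `write_fresh('with.string.save.twice.h5', lambda p: df2.to_hdf(p, key='w'))`.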

pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.8.final.0
python-bits: 64
OS: Linux
OS-release: 3.0.38-0.5-default
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C

pandas: 0.17.1
nose: 1.3.6
pip: 1.5.6
setuptools: 0.6
Cython: 0.21.1
numpy: 1.9.3
scipy: 0.16.1
statsmodels: 0.6.1
IPython: 4.0.1
sphinx: 1.2.2
patsy: 0.3.0
dateutil: 2.3
pytz: 2015.4
blosc: None
bottleneck: 0.8.0
tables: 3.2.2
numexpr: 2.4
matplotlib: 1.4.3
openpyxl: 2.0.4
xlrd: 0.9.0
xlwt: 0.7.4
xlsxwriter: 0.7.3
lxml: 3.4.4
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 0.9.7
pymysql: 0.6.2.None
psycopg2: None
Jinja2: None

Comment From: chris-b1

There is something a little strange going on here (the first case might actually be the bug), but you are not passing mode='w'. The second positional parameter to to_hdf is the key that the object will be stored under.
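One way to check which key a call actually used is to list the keys in the resulting file. The sketch below assumes PyTables is installed; `stored_keys` and `demo_key_vs_mode` are hypothetical helper names, and `key="w"` is passed by keyword to mirror the original positional `'w'`.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def stored_keys(path):
    """Return the keys (node paths) stored in an HDF5 file written by pandas."""
    with pd.HDFStore(path, mode="r") as store:
        return store.keys()


def demo_key_vs_mode():
    df = pd.DataFrame(np.random.randn(100, 4), columns=list("ABCD"))
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.h5")
        # Same effect as the original df.to_hdf(path, 'w'): 'w' is the key,
        # and mode keeps its default of 'a' (append).
        df.to_hdf(path, key="w")
        keys_original_style = stored_keys(path)
        # Explicit key plus mode='w' truncates the file and stores under 'df'.
        df.to_hdf(path, key="df", mode="w")
        keys_explicit = stored_keys(path)
    return keys_original_style, keys_explicit


if __name__ == "__main__":
    print(demo_key_vs_mode())
```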

If you use mode='w', this does work (i.e. the file size doesn't increase).


df2.to_hdf('with.string.save.three-times.h5', 'df', mode='w')
df2.to_hdf('with.string.save.three-times.h5', 'df', mode='w')
df2.to_hdf('with.string.save.three-times.h5', 'df', mode='w')

In [66]: !dir *.h5
 Volume in drive C is OS
 Volume Serial Number is 4874-6764

05/12/2017  03:21 PM           407,192 no.string.save.once.h5
05/12/2017  03:21 PM           407,192 no.string.save.twice.h5
05/12/2017  03:22 PM         1,900,536 with.string.save.once.h5
05/12/2017  03:26 PM         1,900,536 with.string.save.three-times.h5
               6 File(s)      8,020,112 bytes
               0 Dir(s)  358,787,080,192 bytes free

Comment From: jreback

read the big red warning here:

http://pandas.pydata.org/pandas-docs/stable/io.html#delete-from-a-table

you are using 'w' as the key to save the frame under, as @chris-b1 notes. writing it again (the default mode is 'a', for append) causes the original key to be deleted and a new one written. since hdf5 does not reclaim space, the file size will increase.

you can specify mode='w' to rewrite the file if you would like.
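A rough way to compare the two modes is to record the file size after each write. This is a sketch under assumptions: PyTables is installed, `sizes_after_repeated_writes` and `compare_modes` are hypothetical helpers, and the exact sizes (and how much append mode grows) will depend on the pandas and PyTables versions in use.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def sizes_after_repeated_writes(df, path, mode, repeats=3):
    """Write df to the same path and key `repeats` times; return the file size after each write."""
    sizes = []
    for _ in range(repeats):
        df.to_hdf(path, key="df", mode=mode)
        sizes.append(os.path.getsize(path))
    return sizes


def compare_modes():
    df = pd.DataFrame(np.random.randn(1000, 4), columns=list("ABCD"))
    df["E"] = "this_is_a_string"  # string column, as in the report
    with tempfile.TemporaryDirectory() as tmp:
        # mode='w' recreates the file on every call.
        w_sizes = sizes_after_repeated_writes(df, os.path.join(tmp, "w.h5"), mode="w")
        # mode='a' (the default) replaces the key inside the existing file.
        a_sizes = sizes_after_repeated_writes(df, os.path.join(tmp, "a.h5"), mode="a")
    return w_sizes, a_sizes


if __name__ == "__main__":
    w_sizes, a_sizes = compare_modes()
    print("mode='w':", w_sizes)
    print("mode='a':", a_sizes)
```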

Pandas pandas to_hdf increases file size when saving twice to the same path ('write' mode, with string columns)
