Code Sample, a copy-pastable example if possible
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10000, 4), columns=list('ABCD'))
df.to_hdf('no.string.save.once.h5', 'w')
df.to_hdf('no.string.save.twice.h5', 'w')
df.to_hdf('no.string.save.twice.h5', 'w')
# add string columns
df2 = df.copy()
df2['E'] = 'this_is_a_string'
df2['F'] = 'this is another string'
df2.to_hdf('with.string.save.once.h5', 'w')
df2.to_hdf('with.string.save.twice.h5', 'w')
df2.to_hdf('with.string.save.twice.h5', 'w')
df2.to_hdf('with.string.save.three-times.h5', 'w')
df2.to_hdf('with.string.save.three-times.h5', 'w')
df2.to_hdf('with.string.save.three-times.h5', 'w')
Problem description
In the above example, df is a data frame with no string columns. It is saved once to the path 'no.string.save.once.h5' and twice to the path 'no.string.save.twice.h5'. The file sizes are the same, which is what we expect:
407208 no.string.save.once.h5
407208 no.string.save.twice.h5
However, if the dataframe has string columns (e.g. df2), then saving it twice to the same path will increase the file size:
1979728 with.string.save.once.h5
2501240 with.string.save.twice.h5
3022752 with.string.save.three-times.h5
This surprises me because it seems to be saying that before I save an h5 file in "write" mode, I should really make sure that file is deleted from disk. Otherwise, if I keep writing to the same path multiple times, the file size will keep blowing up.
pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 2.7.8.final.0
python-bits: 64
OS: Linux
OS-release: 3.0.38-0.5-default
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C
pandas: 0.17.1
nose: 1.3.6
pip: 1.5.6
setuptools: 0.6
Cython: 0.21.1
numpy: 1.9.3
scipy: 0.16.1
statsmodels: 0.6.1
IPython: 4.0.1
sphinx: 1.2.2
patsy: 0.3.0
dateutil: 2.3
pytz: 2015.4
blosc: None
bottleneck: 0.8.0
tables: 3.2.2
numexpr: 2.4
matplotlib: 1.4.3
openpyxl: 2.0.4
xlrd: 0.9.0
xlwt: 0.7.4
xlsxwriter: 0.7.3
lxml: 3.4.4
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 0.9.7
pymysql: 0.6.2.None
psycopg2: None
Jinja2: None
Comment From: chris-b1
There is something a little strange going on here (actually the first case might be the bug), but you are not passing mode='w'; the second positional parameter to to_hdf is the key that the object will be stored under.
If you use mode='w' this does work (i.e. file size doesn't increase).
df2.to_hdf('with.string.save.three-times.h5', 'df', mode='w')
df2.to_hdf('with.string.save.three-times.h5', 'df', mode='w')
df2.to_hdf('with.string.save.three-times.h5', 'df', mode='w')
In [66]: !dir *.h5
Volume in drive C is OS
Volume Serial Number is 4874-6764
05/12/2017 03:21 PM 407,192 no.string.save.once.h5
05/12/2017 03:21 PM 407,192 no.string.save.twice.h5
05/12/2017 03:22 PM 1,900,536 with.string.save.once.h5
05/12/2017 03:26 PM 1,900,536 with.string.save.three-times.h5
6 File(s) 8,020,112 bytes
0 Dir(s) 358,787,080,192 bytes free
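One way to confirm that the second argument is treated as the key is to open the file and list what it contains. The following is a minimal sketch, assuming pandas with PyTables installed; the temporary path and the small frame are illustrative only, and the key is passed by keyword to make its role explicit:

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 4), columns=list("ABCD"))

# Illustrative scratch location, not a path from the report above.
path = os.path.join(tempfile.mkdtemp(), "demo.h5")

# Same effect as the original positional call: 'w' is the key,
# and the file is opened in the default append mode.
df.to_hdf(path, key="w")

# List the keys stored in the file.
with pd.HDFStore(path, mode="r") as store:
    keys = store.keys()

print(keys)  # ['/w']
```

The frame ends up stored under '/w', so the original calls never requested write mode at all.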
Comment From: jreback
read the big red warning here:
http://pandas.pydata.org/pandas-docs/stable/io.html#delete-from-a-table
You are using 'w' as the key to save the frame, as @chris-b1 notes. Writing it again (the default mode is 'a', for append) causes the original key to be deleted and a new one written. Since HDF5 does not reclaim space, the file size will increase.
You can specify mode='w' to rewrite the file if you would like.
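To see the difference between the two modes side by side, a rough sketch along these lines could be used. It assumes pandas with PyTables installed; the scratch paths, the key name 'df', and the smaller frame are illustrative choices, and the exact byte counts will vary by version and platform:

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 4), columns=list("ABCD"))
df["E"] = "this_is_a_string"  # string column, as in the report

tmpdir = tempfile.mkdtemp()  # illustrative scratch directory
append_path = os.path.join(tmpdir, "append.h5")
write_path = os.path.join(tmpdir, "write.h5")

append_sizes = []
write_sizes = []
for _ in range(3):
    # Default mode='a': the existing key is removed and rewritten,
    # but the file itself is kept, so freed space may not be reused.
    df.to_hdf(append_path, key="df")
    append_sizes.append(os.path.getsize(append_path))

    # mode='w': the file is recreated from scratch on every call.
    df.to_hdf(write_path, key="df", mode="w")
    write_sizes.append(os.path.getsize(write_path))

print("mode='a' sizes:", append_sizes)
print("mode='w' sizes:", write_sizes)
```

With mode='w' the size should stay constant across repeated saves, while the append-mode file may grow on each rewrite, as in the sizes reported above.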