Pandas crash with TypeError when repeatedly appending to HDFStore

When merging HDF5 files, and thus appending to a Series stored in a node, I get the following stack trace from time to time, with no really clear reproducer. By that I mean it always happens at the same place after looping over several 'to merge' files, but it does not appear if I run a separate process for each file.
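
For reference, the one-process-per-file variant that does not show the problem can be sketched roughly as below. This is only an illustration: the helper name merge_in_separate_processes and the idea that the merge script takes the input file as its last argument are assumptions, not the real command line of mergehdf5.py.

```python
import subprocess
import sys


def merge_in_separate_processes(script_args, in_store_files, python=sys.executable):
    """Run the merge script once per input file, each time in a fresh interpreter.

    script_args is the script path plus any fixed arguments (hypothetical
    layout); the input file is appended as the last argument of each run.
    """
    for in_store_file in in_store_files:
        cmd = [python] + list(script_args) + [in_store_file]
        # check_call raises CalledProcessError if a run exits non-zero,
        # so a failing file stops the loop instead of being skipped silently.
        subprocess.check_call(cmd)
```

Each run starts with no PyTables state left over from the previous file, which is the only difference from the in-process loop.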

My pseudocode is as follows:

def merge_stores(out_store, in_store_files):
  open(out_store, 'a')
  for in_store_file in in_store_files:
    in_store = pd.HDFStore(in_store_file, 'r')
    for key in in_store.keys():
      out_store.put(key, in_store.get(key), format='t', append=True, min_itemsize=48)

And here is the stack trace I have at the put:

/opt/anaconda/lib/python2.7/site-packages/tables/group.py:1156: UserWarning: problems loading leaf /PC40102D/table::

  Attribute chunksize exists in node PC40102D, but can't get it.

The leaf will become an ``UnImplemented`` node.
  % (self._g_join(childname), exc))
!!!!!!!!!!!!!!!!!!
Uncaught exception
!!!!!!!!!!!!!!!!!!
Traceback (most recent call last):
  File "mergehdf5.py", line 100, in <module>
    main()
  File "mergehdf5.py", line 95, in main
    merge_files(options.outputfile, files, delete=False, append=options.append)
  File "mergehdf5.py", line 81, in merge_files
    out_store_keys = merge_file(out_store, in_store, out_store_keys, append=append)
  File "mergehdf5.py", line 52, in merge_file
    out_store.put(key, to_merge_in, format='t', append=True, min_itemsize=STR_SIZE)
  File "/opt/anaconda/lib/python2.7/site-packages/pandas/io/pytables.py", line 818, in put
    self._write_to_group(key, value, append=append, **kwargs)
  File "/opt/anaconda/lib/python2.7/site-packages/pandas/io/pytables.py", line 1270, in _write_to_group
    s.write(obj=value, append=append, complib=complib, **kwargs)
  File "/opt/anaconda/lib/python2.7/site-packages/pandas/io/pytables.py", line 3891, in write
    obj=obj, data_columns=obj.columns, **kwargs)
  File "/opt/anaconda/lib/python2.7/site-packages/pandas/io/pytables.py", line 3605, in write
    **kwargs)
  File "/opt/anaconda/lib/python2.7/site-packages/pandas/io/pytables.py", line 3136, in create_axes
    if self.infer_axes():
  File "/opt/anaconda/lib/python2.7/site-packages/pandas/io/pytables.py", line 2137, in infer_axes
    s = self.storable
  File "/opt/anaconda/lib/python2.7/site-packages/pandas/io/pytables.py", line 2832, in storable
    return getattr(self.group, 'table', None)
  File "/opt/anaconda/lib/python2.7/site-packages/tables/group.py", line 811, in __getattr__
    return self._f_get_child(name)
  File "/opt/anaconda/lib/python2.7/site-packages/tables/group.py", line 684, in _f_get_child
    return self._v_file._get_node(childpath)
  File "/opt/anaconda/lib/python2.7/site-packages/tables/file.py", line 1562, in _get_node
    node = self._node_manager.get_node(nodepath)
  File "/opt/anaconda/lib/python2.7/site-packages/tables/file.py", line 436, in get_node
    node = self.node_factory(key)
  File "/opt/anaconda/lib/python2.7/site-packages/tables/group.py", line 1158, in _g_load_child
    return UnImplemented(self, childname)
  File "/opt/anaconda/lib/python2.7/site-packages/tables/unimplemented.py", line 69, in __init__
    super(UnImplemented, self).__init__(parentnode, name)
  File "/opt/anaconda/lib/python2.7/site-packages/tables/leaf.py", line 262, in __init__
    super(Leaf, self).__init__(parentnode, name, _log)
  File "/opt/anaconda/lib/python2.7/site-packages/tables/node.py", line 271, in __init__
    self._v_objectid = self._g_open()
  File "/opt/anaconda/lib/python2.7/site-packages/tables/unimplemented.py", line 72, in _g_open
    (self.shape, self.byteorder, object_id) = self._open_unimplemented()
  File "hdf5extension.pyx", line 2095, in tables.hdf5extension.UnImplemented._open_unimplemented (tables/hdf5extension.c:19140)
TypeError: argument 2 to map() must support iteration
DEBUG MODE
Closing remaining open files:/var/engdata/hdf5output/hdf5/M002-20150104-comp.hdf5...done/tmp/M002-201501-comp.hdf5...done

Comment From: jreback

pls post pd.show_versions()

show a sample of one of the in_stores, e.g. just print it

it looks like you are opening the out_store as a regular file. you need to open this as a store.
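
A minimal sketch of the merge loop with the output opened as a store rather than with the builtin open(). It assumes a pandas version in which HDFStore can be used as a context manager; the helper names merge_store and merge_stores and the STR_SIZE value of 48 are taken from, or modelled on, the pseudocode and traceback above and are not the actual script.

```python
import pandas as pd

STR_SIZE = 48  # min_itemsize used in the pseudocode above


def merge_store(out_store, in_store):
    """Append every node of in_store to the same key in out_store."""
    for key in in_store.keys():
        out_store.put(key, in_store.get(key), format='t',
                      append=True, min_itemsize=STR_SIZE)


def merge_stores(out_store_file, in_store_files):
    """Merge all input files into out_store_file, closing each store when done."""
    # Open the output as an HDFStore in append mode, not as a regular file.
    with pd.HDFStore(out_store_file, mode='a') as out_store:
        for in_store_file in in_store_files:
            with pd.HDFStore(in_store_file, mode='r') as in_store:
                merge_store(out_store, in_store)
```

Closing every input store before opening the next one keeps the number of simultaneously open HDF5 files small. Whether that has any effect on the error reported here is not known.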

Comment From: migdard

I will post this data on Monday because I'm not at work. But from memory I have pytables 3.1.1 and pandas 0.14.2. This is the latest Anaconda distribution for Linux x86_64. Of course I open the store correctly; the pseudocode is just wrong in that respect.

I deeply hope we can sort this out quickly, because this issue came up as part of an evaluation for a 10-year big data project, and it has been a big setback for my assessment of how mature the code base is for this usage.

Thank you

Comment From: migdard

Here are the script used, one printed in_store, and the output of pd.show_versions():

[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://binstar.org
>>> import pandas as pd
>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.8.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-123.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.14.1
nose: 1.3.4
Cython: 0.21
numpy: 1.9.0
scipy: 0.14.0
statsmodels: 0.5.0
IPython: 2.2.0
sphinx: 1.2.3
patsy: 0.3.0
scikits.timeseries: None
dateutil: 1.5
pytz: 2014.7
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.0
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.5.7
lxml: 3.4.0
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.7
pymysql: None
psycopg2: None
>>>

Comment From: migdard

Interestingly, I cannot reproduce the issue when running the same script on Windows with pandas 0.15.2.

Comment From: migdard

Unfortunately, updating pandas to 0.15.2 on the Linux box does not fix the issue. Here is the output from pd.show_versions() on the Windows box where the issue is not present:

Python 2.7.7 |Anaconda 2.0.1 (64-bit)| (default, Jun 11 2014, 10:40:02) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://binstar.org
>>> import pandas as pd
>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.7.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 42 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.15.2
nose: 1.3.3
Cython: 0.20.1
numpy: 1.9.0
scipy: 0.14.0
statsmodels: 0.5.0
IPython: 2.3.0
sphinx: 1.2.2
patsy: 0.3.0
dateutil: 1.5
pytz: 2014.3
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.1
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.5.5
lxml: 3.3.5
bs4: 4.3.1
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.4
pymysql: None
psycopg2: None
>>>

Comment From: jreback

you might have an issue with an older HDF5 library on your linux box. you can try conda update hdf5.

Comment From: migdard

My Linux box is not connected to the internet and I can't seem to find an hdf5 package in the Continuum repositories, though I am still looking. conda list | grep hdf5 shows hdf5 1.8.13 on the Linux box and nothing on the Windows one.
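
Since conda list only shows packages that conda itself tracks, it may help to ask PyTables directly which HDF5 version it reports on each machine. A minimal sketch, assuming the installed PyTables exposes the module-level tables.hdf5_version attribute; the helper name linked_hdf5_version is made up for illustration.

```python
def linked_hdf5_version():
    """Return the HDF5 version string PyTables reports, or None if unavailable."""
    try:
        import tables
    except ImportError:
        # PyTables is not installed in this environment.
        return None
    # Fall back to None if this PyTables build lacks the attribute.
    return getattr(tables, 'hdf5_version', None)


print(linked_hdf5_version())
```

Running this on both boxes would show whether they really use different HDF5 versions, even where conda lists no separate hdf5 package.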

Comment From: migdard

All right, I just found hdf5 1.8.14 for Anaconda. I will upgrade and report back with the results.

Comment From: migdard

This did not do it. I get the exact same stack trace after upgrading the HDF5 library.

Comment From: jreback

you can try posting on pytables. No idea what your problem is.

Comment From: jorisvandenbossche

@migdard Closing this. If you still have the problem, feel free to reopen or comment further here.