When merging HDF5 files, and thus appending to a Series stored in a node, I get the following stack trace from time to time, with no clear reproducer. By that I mean it always happens at the same place after looping over several 'to merge' files, but it does not appear if I run a sequence of separate processes, one for each file.
My pseudocode is as follows:
def merge_stores(out_store, in_store_files):
    open(out_store, 'a')
    for in_store_file in in_store_files:
        in_store = pd.HDFStore(in_store_file, 'r')
        for key in in_store.keys():
            out_store.put(key, in_store.get(key), format='t', append=True, min_itemsize=48)
And here is the stack trace I have at the put:
/opt/anaconda/lib/python2.7/site-packages/tables/group.py:1156: UserWarning: problems loading leaf /PC40102D/table
::
Attribute chunksize exists in node PC40102D, but can't get it.
The leaf will become an ``UnImplemented`` node.
% (self._g_join(childname), exc))
!!!!!!!!!!!!!!!!!!
Uncaught exception
!!!!!!!!!!!!!!!!!!
Traceback (most recent call last):
  File "mergehdf5.py", line 100, in <module>
    main()
  File "mergehdf5.py", line 95, in main
    merge_files(options.outputfile, files, delete=False, append=options.append)
  File "mergehdf5.py", line 81, in merge_files
    out_store_keys = merge_file(out_store, in_store, out_store_keys, append=append)
  File "mergehdf5.py", line 52, in merge_file
    out_store.put(key, to_merge_in, format='t', append=True, min_itemsize=STR_SIZE)
  File "/opt/anaconda/lib/python2.7/site-packages/pandas/io/pytables.py", line 818, in put
    self._write_to_group(key, value, append=append, **kwargs)
  File "/opt/anaconda/lib/python2.7/site-packages/pandas/io/pytables.py", line 1270, in _write_to_group
    s.write(obj=value, append=append, complib=complib, **kwargs)
  File "/opt/anaconda/lib/python2.7/site-packages/pandas/io/pytables.py", line 3891, in write
    obj=obj, data_columns=obj.columns, **kwargs)
  File "/opt/anaconda/lib/python2.7/site-packages/pandas/io/pytables.py", line 3605, in write
    **kwargs)
  File "/opt/anaconda/lib/python2.7/site-packages/pandas/io/pytables.py", line 3136, in create_axes
    if self.infer_axes():
  File "/opt/anaconda/lib/python2.7/site-packages/pandas/io/pytables.py", line 2137, in infer_axes
    s = self.storable
  File "/opt/anaconda/lib/python2.7/site-packages/pandas/io/pytables.py", line 2832, in storable
    return getattr(self.group, 'table', None)
  File "/opt/anaconda/lib/python2.7/site-packages/tables/group.py", line 811, in __getattr__
    return self._f_get_child(name)
  File "/opt/anaconda/lib/python2.7/site-packages/tables/group.py", line 684, in _f_get_child
    return self._v_file._get_node(childpath)
  File "/opt/anaconda/lib/python2.7/site-packages/tables/file.py", line 1562, in _get_node
    node = self._node_manager.get_node(nodepath)
  File "/opt/anaconda/lib/python2.7/site-packages/tables/file.py", line 436, in get_node
    node = self.node_factory(key)
  File "/opt/anaconda/lib/python2.7/site-packages/tables/group.py", line 1158, in _g_load_child
    return UnImplemented(self, childname)
  File "/opt/anaconda/lib/python2.7/site-packages/tables/unimplemented.py", line 69, in __init__
    super(UnImplemented, self).__init__(parentnode, name)
  File "/opt/anaconda/lib/python2.7/site-packages/tables/leaf.py", line 262, in __init__
    super(Leaf, self).__init__(parentnode, name, _log)
  File "/opt/anaconda/lib/python2.7/site-packages/tables/node.py", line 271, in __init__
    self._v_objectid = self._g_open()
  File "/opt/anaconda/lib/python2.7/site-packages/tables/unimplemented.py", line 72, in _g_open
    (self.shape, self.byteorder, object_id) = self._open_unimplemented()
  File "hdf5extension.pyx", line 2095, in tables.hdf5extension.UnImplemented._open_unimplemented (tables/hdf5extension.c:19140)
TypeError: argument 2 to map() must support iteration
DEBUG MODE
Closing remaining open files:/var/engdata/hdf5output/hdf5/M002-20150104-comp.hdf5...done/tmp/M002-201501-comp.hdf5...done
Comment From: jreback
pls post pd.show_versions() and show a sample of one of the in_stores, e.g. just print it.
it looks like you are opening the out_store as a regular file. you need to open this as a store.
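For reference, a minimal sketch of the merge loop from the pseudocode above with every file opened as a pandas HDFStore rather than with the built-in open. The function names, the duck-typed `merge_stores` helper, and the default `min_itemsize=48` (taken from the pseudocode) are illustrative assumptions, not the reporter's actual script, and nothing here is claimed to fix the reported error.

```python
def merge_stores(out_store, in_stores, min_itemsize=48):
    """Append every key of each input store to out_store; return the keys written."""
    written = []
    for in_store in in_stores:
        for key in in_store.keys():
            # The table format is what makes a node appendable.
            out_store.append(key, in_store.get(key), format="table",
                             min_itemsize=min_itemsize)
            written.append(key)
    return written


def merge_files(out_path, in_paths):
    """Open the output and each input as an HDFStore, then merge them."""
    # Imported here so merge_stores can be exercised without pandas/PyTables.
    import pandas as pd

    with pd.HDFStore(out_path, mode="a") as out_store:
        for in_path in in_paths:
            # The context manager closes each input store before the next opens.
            with pd.HDFStore(in_path, mode="r") as in_store:
                merge_stores(out_store, [in_store])
```

Closing each input store as soon as it has been copied keeps the number of simultaneously open HDF5 files at two.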
Comment From: migdard
I will post this data on Monday because I'm not at work. But from memory I have pytables 3.1.1 and pandas 0.14.2. This is the latest anaconda distribution for Linux x86_64. Of course I open the store correctly; the pseudocode is not right in that respect.
I deeply hope we can sort this out quickly, because this issue came up as part of an evaluation for a 10-year big data project and it gave me a big setback with respect to the maturity of the code base for this usage.
Thank you
Comment From: migdard
Here are the script used, one printed in_store, and the output of pd.show_versions():
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://binstar.org
>>> import pandas as pd
>>> pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.8.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-123.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.14.1
nose: 1.3.4
Cython: 0.21
numpy: 1.9.0
scipy: 0.14.0
statsmodels: 0.5.0
IPython: 2.2.0
sphinx: 1.2.3
patsy: 0.3.0
scikits.timeseries: None
dateutil: 1.5
pytz: 2014.7
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.0
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.5.7
lxml: 3.4.0
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.7
pymysql: None
psycopg2: None
>>>
Comment From: migdard
Interestingly, I cannot reproduce the issue when running the same script on Windows with pandas 0.15.2.
Comment From: migdard
Unfortunately, updating pandas to 0.15.2 on the Linux box does not fix the issue.
Here is the output from pd.show_versions() on the Windows box where the issue is not present:
Python 2.7.7 |Anaconda 2.0.1 (64-bit)| (default, Jun 11 2014, 10:40:02) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://binstar.org
>>> import pandas as pd
>>> pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.7.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 42 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
pandas: 0.15.2
nose: 1.3.3
Cython: 0.20.1
numpy: 1.9.0
scipy: 0.14.0
statsmodels: 0.5.0
IPython: 2.3.0
sphinx: 1.2.2
patsy: 0.3.0
dateutil: 1.5
pytz: 2014.3
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.1
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.5.5
lxml: 3.3.5
bs4: 4.3.1
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.4
pymysql: None
psycopg2: None
>>>
Comment From: jreback
you might have an issue with older HDF5 libraries on your linux box. you can try conda update hdf5.
Comment From: migdard
My Linux box is not connected to the internet and I can't seem to find an hdf5 package on the Continuum repositories. Still looking further, though.
conda list |grep hdf5 shows hdf5 1.8.13 on the Linux box and nothing on the Windows one.
Comment From: migdard
All right, just found hdf5 1.8.14 for anaconda. Upgrading now; I will report back with the results.
Comment From: migdard
This did not do it. I have the exact same stack trace after upgrading the HDF5 library.
Comment From: jreback
you can try posting on pytables. No idea what your problem is.
Comment From: jorisvandenbossche
@migdard Closing this. If you still have the problem, feel free to reopen or comment further here.
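The original report notes that the error did not appear when each input file was merged in its own process. A minimal sketch of that workaround follows, assuming a caller-supplied `build_command` hook (hypothetical) that returns the command line for merging a single file; the real command-line interface of mergehdf5.py is not shown in this thread, so none is assumed here.

```python
import subprocess
import sys


def merge_in_separate_processes(out_file, in_files, build_command):
    """Run one child process per input file; return the inputs that failed."""
    failed = []
    for in_file in in_files:
        # build_command(out_file, in_file) is a hypothetical hook returning
        # the argv list that merges this one input into out_file.
        returncode = subprocess.call(build_command(out_file, in_file))
        if returncode != 0:
            failed.append(in_file)
    return failed
```

Each child starts with a fresh interpreter and no previously opened HDF5 files, which matches the condition under which the report says the error was not observed.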