Pandas Rapid Concurrent Access Breaks HDFStore

Opening the same HDFStore from many processes in quick succession produces several errors. In one, the HDFStore silently fails to load some of the tables inside the store file. In another, HDF5 raises a loud error with a full traceback. Both should be visible when running the following code.

import pandas as pd
import multiprocessing as mp

def f(x):
    # Every worker opens the same store read-only; the handle is never closed.
    tt = pd.HDFStore('storage.h5', mode='r')
    print(tt)

p = mp.Pool()
p.map(f, range(50))

Comment From: jreback

show pd.show_versions()

Comment From: jreback

and ptdump -av storage.h5

Comment From: jreback

http://www.hdfgroup.org/hdf5-quest.html#gconc

I would not recommend what you are doing at all

it might work but I'll bet it's fragile

it requires a thread safe build of the hdf libraries

no idea how to figure that out (as the default is not thread safe)

this is true even for reading

see the PyTables documentation as well

Comment From: yikelu

Note: I redacted the column names. If you need them, let me know. There are 4 columns in each table, all the same types. The problem goes away if I do a time.sleep(5) between the async calls.

pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.8.final.0
python-bits: 64
OS: Linux
OS-release: 3.11.0-15-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.13.1
Cython: 0.20.1
numpy: 1.8.0
scipy: 0.13.3
statsmodels: 0.5.0
IPython: 1.1.0
sphinx: 1.2.1
patsy: 0.2.1
scikits.timeseries: None
dateutil: 1.5
pytz: 2013b
bottleneck: None
tables: 3.1.0
numexpr: 2.3.1
matplotlib: 1.3.1
openpyxl: 1.8.2
xlrd: 0.9.2
xlwt: 0.7.5
xlsxwriter: 0.5.2
sqlalchemy: 0.9.2
lxml: 3.3.1
bs4: 4.3.1
html5lib: None
bq: None
apiclient: None

ptdump -av ul_quotes.h5

/ (RootGroup) ''
  /._v_attrs (AttributeSet), 4 attributes:
   [CLASS := 'GROUP', PYTABLES_FORMAT_VERSION := '2.1', TITLE := '', VERSION := '1.0']
/10194182 (Group) ''
  /10194182._v_attrs (AttributeSet), 14 attributes:
   [CLASS := 'GROUP', TITLE := '', VERSION := '1.0', data_columns := ['iProductID'], encoding := None, index_cols := [(0, 'index')], info := {1: {'type': 'Index', 'names': [None]}, 'index': {'index_name': 'quote_time'}}, levels := 1, nan_rep := 'nan', pandas_type := 'frame_table', pandas_version := '0.10.1', table_type := 'appendable_frame', values_cols := ['values_block_0', 'iProductID']]
/10194182/table (Table(1495094,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Int64Col(shape=(4,), dflt=0, pos=1),
  "iProductID": Int64Col(shape=(), dflt=0, pos=2)}
  byteorder := 'little'
  chunkshape := (1365,)
  autoindex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "iProductID": Index(6, medium, shuffle, zlib(1)).is_csi=False}
  /10194182/table._v_attrs (AttributeSet), 15 attributes:
   [CLASS := 'TABLE', FIELD_0_FILL := 0, FIELD_0_NAME := 'index', FIELD_1_FILL := 0, FIELD_1_NAME := 'values_block_0', FIELD_2_FILL := 0, FIELD_2_NAME := 'iProductID', NROWS := 1495094, TITLE := '', VERSION := '2.7', iProductID_dtype := 'int64', iProductID_kind := ['iProductID'], index_kind := 'integer', values_block_0_dtype := 'int64',
/7626980 (Group) ''
  /7626980._v_attrs (AttributeSet), 14 attributes:
   [CLASS := 'GROUP', TITLE := '', VERSION := '1.0', data_columns := ['iProductID'], encoding := None, index_cols := [(0, 'index')], info := {1: {'type': 'Index', 'names': [None]}, 'index': {'index_name': 'quote_time'}}, levels := 1, nan_rep := 'nan', pandas_type := 'frame_table', pandas_version := '0.10.1', table_type := 'appendable_frame', values_cols := ['values_block_0', 'iProductID']]
/7626980/table (Table(149306,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Int64Col(shape=(4,), dflt=0, pos=1),
  "iProductID": Int64Col(shape=(), dflt=0, pos=2)}
  byteorder := 'little'
  chunkshape := (1365,)
  autoindex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "iProductID": Index(6, medium, shuffle, zlib(1)).is_csi=False}
  /7626980/table._v_attrs (AttributeSet), 15 attributes:
   [CLASS := 'TABLE', FIELD_0_FILL := 0, FIELD_0_NAME := 'index', FIELD_1_FILL := 0, FIELD_1_NAME := 'values_block_0', FIELD_2_FILL := 0, FIELD_2_NAME := 'iProductID', NROWS := 149306, TITLE := '', VERSION := '2.7', iProductID_dtype := 'int64', iProductID_kind := ['iProductID'], index_kind := 'integer', values_block_0_dtype := 'int64',
/7998575 (Group) ''
  /7998575._v_attrs (AttributeSet), 14 attributes:
   [CLASS := 'GROUP', TITLE := '', VERSION := '1.0', data_columns := ['iProductID'], encoding := None, index_cols := [(0, 'index')], info := {1: {'type': 'Index', 'names': [None]}, 'index': {'index_name': 'quote_time'}}, levels := 1, nan_rep := 'nan', pandas_type := 'frame_table', pandas_version := '0.10.1', table_type := 'appendable_frame', values_cols := ['values_block_0', 'iProductID']]
/7998575/table (Table(1527729,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Int64Col(shape=(4,), dflt=0, pos=1),
  "iProductID": Int64Col(shape=(), dflt=0, pos=2)}
  byteorder := 'little'
  chunkshape := (1365,)
  autoindex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "iProductID": Index(6, medium, shuffle, zlib(1)).is_csi=False}
  /7998575/table._v_attrs (AttributeSet), 15 attributes:
   [CLASS := 'TABLE', FIELD_0_FILL := 0, FIELD_0_NAME := 'index', FIELD_1_FILL := 0, FIELD_1_NAME := 'values_block_0', FIELD_2_FILL := 0, FIELD_2_NAME := 'iProductID', NROWS := 1527729, TITLE := '', VERSION := '2.7', iProductID_dtype := 'int64', iProductID_kind := ['iProductID'], index_kind := 'integer', values_block_0_dtype := 'int64',
/8335305 (Group) ''
  /8335305._v_attrs (AttributeSet), 14 attributes:
   [CLASS := 'GROUP', TITLE := '', VERSION := '1.0', data_columns := ['iProductID'], encoding := None, index_cols := [(0, 'index')], info := {1: {'type': 'Index', 'names': [None]}, 'index': {'index_name': 'quote_time'}}, levels := 1, nan_rep := 'nan', pandas_type := 'frame_table', pandas_version := '0.10.1', table_type := 'appendable_frame', values_cols := ['values_block_0', 'iProductID']]
/8335305/table (Table(2153256,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Int64Col(shape=(4,), dflt=0, pos=1),
  "iProductID": Int64Col(shape=(), dflt=0, pos=2)}
  byteorder := 'little'
  chunkshape := (1365,)
  autoindex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "iProductID": Index(6, medium, shuffle, zlib(1)).is_csi=False}
  /8335305/table._v_attrs (AttributeSet), 15 attributes:
   [CLASS := 'TABLE', FIELD_0_FILL := 0, FIELD_0_NAME := 'index', FIELD_1_FILL := 0, FIELD_1_NAME := 'values_block_0', FIELD_2_FILL := 0, FIELD_2_NAME := 'iProductID', NROWS := 2153256, TITLE := '', VERSION := '2.7', iProductID_dtype := 'int64', iProductID_kind := ['iProductID'], index_kind := 'integer', values_block_0_dtype := 'int64',
/8677657 (Group) ''
  /8677657._v_attrs (AttributeSet), 14 attributes:
   [CLASS := 'GROUP', TITLE := '', VERSION := '1.0', data_columns := ['iProductID'], encoding := None, index_cols := [(0, 'index')], info := {1: {'type': 'Index', 'names': [None]}, 'index': {'index_name': 'quote_time'}}, levels := 1, nan_rep := 'nan', pandas_type := 'frame_table', pandas_version := '0.10.1', table_type := 'appendable_frame', values_cols := ['values_block_0', 'iProductID']]
/8677657/table (Table(1938334,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Int64Col(shape=(4,), dflt=0, pos=1),
  "iProductID": Int64Col(shape=(), dflt=0, pos=2)}
  byteorder := 'little'
  chunkshape := (1365,)
  autoindex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "iProductID": Index(6, medium, shuffle, zlib(1)).is_csi=False}
  /8677657/table._v_attrs (AttributeSet), 15 attributes:
   [CLASS := 'TABLE', FIELD_0_FILL := 0, FIELD_0_NAME := 'index', FIELD_1_FILL := 0, FIELD_1_NAME := 'values_block_0', FIELD_2_FILL := 0, FIELD_2_NAME := 'iProductID', NROWS := 1938334, TITLE := '', VERSION := '2.7', iProductID_dtype := 'int64', iProductID_kind := ['iProductID'], index_kind := 'integer', values_block_0_dtype := 'int64',
/8992091 (Group) ''
  /8992091._v_attrs (AttributeSet), 14 attributes:
   [CLASS := 'GROUP', TITLE := '', VERSION := '1.0', data_columns := ['iProductID'], encoding := None, index_cols := [(0, 'index')], info := {1: {'type': 'Index', 'names': [None]}, 'index': {'index_name': 'quote_time'}}, levels := 1, nan_rep := 'nan', pandas_type := 'frame_table', pandas_version := '0.10.1', table_type := 'appendable_frame', values_cols := ['values_block_0', 'iProductID']]
/8992091/table (Table(1846241,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Int64Col(shape=(4,), dflt=0, pos=1),
  "iProductID": Int64Col(shape=(), dflt=0, pos=2)}
  byteorder := 'little'
  chunkshape := (1365,)
  autoindex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "iProductID": Index(6, medium, shuffle, zlib(1)).is_csi=False}
  /8992091/table._v_attrs (AttributeSet), 15 attributes:
   [CLASS := 'TABLE', FIELD_0_FILL := 0, FIELD_0_NAME := 'index', FIELD_1_FILL := 0, FIELD_1_NAME := 'values_block_0', FIELD_2_FILL := 0, FIELD_2_NAME := 'iProductID', NROWS := 1846241, TITLE := '', VERSION := '2.7', iProductID_dtype := 'int64', iProductID_kind := ['iProductID'], index_kind := 'integer', values_block_0_dtype := 'int64',

Comment From: jreback

show tables.getHDF5Version()

there is a note on the PyTables site explaining this somewhere

you need a fairly new version for this to support concurrent access

you will have to ask / look there

Comment From: yikelu

jreback -- thanks for the information. So it seems this is not a problem with pandas, but with HDF5 itself. However, I am specifically using multiprocessing, which is apparently safe according to http://www.hdfgroup.org/hdf5-quest.html#gconc1

I will investigate this further and perhaps try the new version or re-architect my code.

In [17]: tables.getHDF5Version()
Out[17]: '1.8.9'

Comment From: jreback

in any event why would you want to access a store like this?

generally you read in data and then compute with it, potentially writing to a new store (I assume you know that you CANNOT UNDER ANY CIRCUMSTANCES do concurrent writes at all - even across processes)
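One way to respect the no-concurrent-writes rule is to let workers only compute and have the parent process do all the writing. The following is a minimal sketch under that assumption, not an official pandas recipe: `compute`, `write_results`, `run`, and the `value` column are hypothetical names chosen for illustration, and the doubling step stands in for real work.

```python
import multiprocessing as mp


def compute(item):
    """Hypothetical stand-in for real work: returns (key, rows) to store."""
    key, values = item
    return key, [v * 2 for v in values]


def write_results(results, out_path):
    """Write every result from the parent process, the only writer."""
    import pandas as pd  # needs pandas with PyTables installed

    store = pd.HDFStore(out_path, mode='w')
    try:
        for key, rows in results:
            store.put(key, pd.DataFrame({'value': rows}))
    finally:
        store.close()  # always release the file handle


def run(items, out_path, n_workers=2):
    """Compute in worker processes, then write serially in the parent."""
    pool = mp.Pool(n_workers)
    try:
        results = pool.map(compute, items)  # workers never touch the store
    finally:
        pool.close()
        pool.join()
    write_results(results, out_path)
```

A call such as `run([('table_a', [1, 2]), ('table_b', [3])], 'results.h5')` would go under an `if __name__ == '__main__':` guard so that worker processes do not re-run it on import.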

Comment From: yikelu

Yes, I understand the concurrent write problem, I have done true multi-threaded computing before. Also, I have run into the write mode restriction on HDFStore, so I'm aware you cannot even open the same store multiple times in write mode.

The re-architecture of my code is straightforward, but I'm puzzled that you can't see the benefit of multiple concurrent reads. It's SIMD, with the multiple data being different chunks inside the HDFStore (retrieved in chunks by index).

Of course doing it this way is quite naive (as was my original implementation) -- better would be to do a single read then fan out the data after the read step, which is more or less what I am doing now.
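The single-read-then-fan-out approach could look roughly like the sketch below. It is an illustration only, not the code actually used here: `load_frame`, `split_rows`, `summarize`, and `fan_out` are hypothetical helpers, and `summarize` stands in for whatever per-chunk work is needed.

```python
import multiprocessing as mp


def load_frame(path, key):
    """Read one table from the store, in the parent process only."""
    import pandas as pd  # needs pandas with PyTables installed

    store = pd.HDFStore(path, mode='r')
    try:
        return store.select(key)
    finally:
        store.close()  # no worker ever sees this handle


def split_rows(n_rows, n_chunks):
    """Return (start, stop) row ranges that together cover range(n_rows)."""
    size, extra = divmod(n_rows, n_chunks)
    bounds, start = [], 0
    for i in range(n_chunks):
        stop = start + size + (1 if i < extra else 0)
        if stop > start:  # skip empty ranges when n_rows < n_chunks
            bounds.append((start, stop))
        start = stop
    return bounds


def summarize(chunk):
    """Hypothetical per-chunk work; here it only counts rows."""
    return len(chunk)


def fan_out(frame, n_workers=4):
    """Slice an in-memory frame and hand the slices to worker processes."""
    chunks = [frame.iloc[a:b] for a, b in split_rows(len(frame), n_workers)]
    pool = mp.Pool(n_workers)
    try:
        return pool.map(summarize, chunks)
    finally:
        pool.close()
        pool.join()
```

Usage would be along the lines of `fan_out(load_frame('storage.h5', key))` under an `if __name__ == '__main__':` guard. The trade-off is that each slice is pickled and sent to a worker, so this only pays off when the per-chunk work outweighs that transfer cost.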

Comment From: jreback

simd helps in some circumstances of course, but it's usually more complicated - and you have to account for the time spent spawning processes as well as disk contention

I generally find that it is best to have a smaller number of processes doing more work

but to each his own

Comment From: yikelu

I see what you mean, mainly with disk contention. It's not a matter of more or fewer processes here.

Actually my use case didn't really need the concurrent reads in the first place, I had just coded it that way without thinking because it was the most straightforward processing pipeline. Would have saved me a lot of trouble actually if I had just avoided that in the first place.

Comment From: jreback

closing as not a pandas issue

Comment From: yikelu

Can this at least be documented in the HDF5 Section? Thanks.

Comment From: jreback

If you want to do a pull request to add a point about this, I think that would be ok: http://pandas.pydata.org/pandas-docs/stable/io.html#notes-caveats

(or possibly a new sub-section, say multi-processing/threading).

I agree documentation is nice (though people rarely read it :)

Comment From: jreback

docs already look pretty good. closing.