xref: https://github.com/pydata/pandas/issues/2397; xref #14263 (example)

I think that we should protect the file open/closes with a lock to avoid this problem.

Comment From: toobaz

I agree! Although my code assumes only write access, and so would require some fixing and integration, do you think the approach makes sense?

Comment From: jreback

@toobaz your solution 'works', but that is not a recommended way to do it at all. HDF5 by definition should only ever have a single writer.

This helps with the SWMR case: https://www.hdfgroup.org/HDF5/docNewFeatures/NewFeaturesSwmrDocs.html

All that said, read_hdf is not thread-safe for reading, I think, because the file handle needs to be opened in the master thread.

Here's an example using a thread pool from multiprocessing (multiprocessing.pool.ThreadPool); it needs testing with a regular thread pool as well. This currently segfaults:

import numpy as np
import pandas as pd
from pandas import testing as tm  # pandas.util.testing is deprecated/removed in newer pandas
from multiprocessing.pool import ThreadPool

path = 'test.hdf'
num_rows = 100000
num_tasks = 4

def make_df(num_rows=10000):

    df = pd.DataFrame(np.random.rand(num_rows, 5), columns=list('abcde'))
    df['foo'] = 'foo'
    df['bar'] = 'bar'
    df['baz'] = 'baz'
    df['date'] = pd.date_range('20000101 09:00:00',
                               periods=num_rows,
                               freq='s')
    df['int'] = np.arange(num_rows, dtype='int64')
    return df

print("writing df")
df = make_df(num_rows=num_rows)
df.to_hdf(path, 'df', format='table')

# single threaded
print("reading df - single threaded")
result = pd.read_hdf(path, 'df')
tm.assert_frame_equal(result, df)

# multi-threaded read via multiprocessing.pool.ThreadPool
def reader(arg):
    start, nrows = arg

    return pd.read_hdf(path,
                       'df',
                       mode='r',
                       start=start,
                       stop=start+nrows)

tasks = [
    (num_rows * i // num_tasks,   # integer division so start/stop are ints
     num_rows // num_tasks) for i in range(num_tasks)
    ]

pool = ThreadPool(processes=num_tasks)

print("reading df - multi threaded")
results = pool.map(reader, tasks)
result = pd.concat(results)

tm.assert_frame_equal(result, df)

Comment From: dragonator4

Yes this is a duplicate of #14692. I have a small contribution to this discussion:

def reader(arg):
    start, nrows = arg
    with pd.HDFStore(path, 'r') as store:
        df = pd.read_hdf(store,
                         'df',
                         mode='r',
                         start=int(start),
                         stop=int(start+nrows))
    return df

This works without any problems in the code example above. It seems that as long as each task opens its own connection to the store and places only one query through that connection, there is no error. Looking again at the example I provided in #14692, and at the original version of reader given by @jreback above, it is multiple queries through the same connection that cause the problems.

The fix may not involve placing locks on the store. Perhaps the fix is to allow pd.read_hdf, or other variants of accessing data from a store in read mode, to make as many connections as required...

Comment From: jtf621

I found another example of the issue. See #15274.

Note that the segfault failure rate depends on the presence or absence of compression on the HDF file.

Comment From: rbiswas4

I am running into this issue on both OSX and Linux (CentOS) while trying to parallelize reads using joblib, even though I am not attempting any writes (so not the SWMR case). Is there a different approach for a read-only solution? Is this a pandas issue or an HDF5 C-library issue? Thanks.

In my application I am trying to read different groups on different processors, and running into this error. I am using

pandas                    0.19.1
pytables                  3.4.2
joblib                    0.11

The code, which works when n_jobs=1 and fails when n_jobs is 2 or more, is:

import numpy as np
import pandas as pd
from joblib import Parallel, delayed

store = pd.HDFStore('ddf_lcs.hdf', mode='r')
params = pd.read_hdf('ddf_params.hdf')

tileIDs = params.tileID.dropna().unique().astype(int)

def runGroup(i, store=store, minion_params=params):
    print('group ', i)
    if i is np.nan:
        return
    low = i - 0.5
    high = i + 0.5
    print('i, low and high are ', i, low, high)
    key = str(int(i))
    lcdf = store.select(key)
    print('this group has {0} rows and {1} unique values of var'.format(len(lcdf), lcdf.snid.unique().size))

# works with n_jobs=1, fails with n_jobs >= 2
Parallel(n_jobs=1)(delayed(runGroup)(t) for t in tileIDs[:10])
store.close()

The error message I get is:

/usr/local/miniconda/lib/python2.7/site-packages/tables/group.py:1213: UserWarning: problems loading leaf ``/282748/table``::

  HDF5 error back trace

  File "H5Dio.c", line 173, in H5Dread
    can't read data
  File "H5Dio.c", line 554, in H5D__read
    can't read data
  File "H5Dchunk.c", line 1856, in H5D__chunk_read
    error looking up chunk address
  File "H5Dchunk.c", line 2441, in H5D__chunk_lookup
    can't query chunk address
  File "H5Dbtree.c", line 998, in H5D__btree_idx_get_addr
    can't get chunk info
  File "H5B.c", line 340, in H5B_find
    unable to load B-tree node
  File "H5AC.c", line 1262, in H5AC_protect
    H5C_protect() failed.
  File "H5C.c", line 3574, in H5C_protect
    can't load entry
  File "H5C.c", line 7954, in H5C_load_entry
    unable to load entry
 File "H5Bcache.c", line 143, in H5B__load
    wrong B-tree signature

End of HDF5 error back trace
...

Comment From: jreback

@rbiswas4 you can't pass the handles in like this. You can try opening the file inside the passed function (as read-only); that might work. This is a non-trivial problem (reading against HDF5 in a thread-safe manner via multiple processes). You can have a look at how dask solved this (and is still dealing with it).

Comment From: rbiswas4

you can try to open inside the passed function (as read-only)

@jreback Interesting. I thought I had trouble with that and had hence moved to HDFStore, but it looks like I should be able to pass the filename to the function and read it with pd.read_hdf(filename, key=key) inside the function. Some initial tests suggest that this is indeed true. That would solve the problem of reading. So if I am guessing correctly, the thread-safe mechanism is essentially a lock on the file-open process, and not on retrieval operations like get or select, which is what I thought HDFStore was for.

Also thank you for the pointer to the dask issues which discuss parallel write/save mechanisms.
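
For anyone following along, this is roughly the pattern I mean (an untested sketch; the filename and group keys are placeholders from my example above, and with joblib's default process-based backend each worker opens its own handle):

from joblib import Parallel, delayed
import pandas as pd

HDF_PATH = 'ddf_lcs.hdf'  # placeholder filename

def read_group(key):
    # open the file inside the worker (read-only) instead of sharing a handle
    return pd.read_hdf(HDF_PATH, key=key, mode='r')

keys = ['282748', '282749']  # placeholder group keys
results = Parallel(n_jobs=2)(delayed(read_group)(k) for k in keys)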

Comment From: grantstephens

So I think I have just run into this issue, but in a slightly different use case. I have multiple threads that each do things with different HDF files using with pd.HDFStore. It seems to read the files fine, but when it comes to writing to them it fails with what looks like a compression-related error from PyTables: if complib is not None and complib not in tables.filters.all_complibs: AttributeError: module 'tables' has no attribute 'filters'. I must stress that these are completely different files, but I am writing to them with compression from different threads. I think it may be that I am trying to use the compression library at the same time, but any hints or advice would be appreciated.

Comment From: lafrech

For the record, this is what I do. It is probably obvious, but just in case it could help anyone, here it is.

import threading
from contextlib import contextmanager

import pandas as pd

HDF_LOCK = threading.Lock()
HDF_FILEPATH = '/my_file_path'

@contextmanager
def locked_store():
    with HDF_LOCK:
        with pd.HDFStore(HDF_FILEPATH) as store:
            yield store

Then

    def get(self, ts_id, t_start, t_end):
        with locked_store() as store:
            where = 'index>=t_start and index<t_end'
            return store.select(ts_id, where=where)

If several files are used, a dict of locks can be used to avoid locking the whole application. It can be a defaultdict with a new Lock as default value. The key should be the absolute filepath with no symlink, to be sure a file can't be accessed with two different locks.
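
Something along these lines (an untested sketch of that idea; I guard the dict itself with a small lock instead of using a bare defaultdict, to avoid two threads creating different Lock objects for the same path):

import os
import threading
from contextlib import contextmanager

import pandas as pd

_LOCKS_GUARD = threading.Lock()  # protects the dict of per-file locks
_FILE_LOCKS = {}

def _lock_for(filepath):
    # resolve symlinks so the same file always maps to the same lock
    key = os.path.realpath(filepath)
    with _LOCKS_GUARD:
        return _FILE_LOCKS.setdefault(key, threading.Lock())

@contextmanager
def locked_store(filepath, **kwargs):
    with _lock_for(filepath):
        with pd.HDFStore(filepath, **kwargs) as store:
            yield store

Usage is the same as above, just passing the file path: with locked_store(HDF_FILEPATH) as store: ...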

Comment From: makmanalp

@lafrech you could even make HDF_FILEPATH just an argument to the context manager and make it more generic:

@contextmanager
def locked_store(*args, lock=threading.Lock, **kwargs):
    with lock():
        with pd.HDFStore(*args, **kwargs) as store:
            yield store

Comment From: lafrech

@makmanalp, won't this create a new lock for each call (thus defeating the purpose of the lock)?

We need to be sure that calls to the same file share the same lock and calls to different files don't.

I don't see how this is achieved in your example.

I assume I have to map the name, path or whatever identifier of the file to a lock instance.

Or maybe I am missing something.

Comment From: makmanalp

Er, my mistake entirely - I was just trying to point out that one can make it more reusable. Should be something like:

def make_locked_store(lock, filepath):
    @contextmanager
    def locked_store(*args, **kwargs):
        with lock:
            with pd.HDFStore(filepath, *args, **kwargs) as store:
                yield store
    return locked_store

Comment From: lafrech

If several files are used, a dict of locks can be used to avoid locking the whole application.

:warning: Warning :warning:

I've been trying this and had issues. It looks like the whole HDF5 library is not thread-safe, and it fails even when accessing two different files concurrently. I didn't take the time to verify that, so I may be totally wrong, but beware if you try it.

I reverted my change, and I keep a single lock in the application rather than a lock per hdf5 file.

Since HDF5 is hierarchical, I guess a lot of users put everything in a single file anyway. We ended up using lots of files mainly because HDF5 files tend to get corrupted, so when restoring daily backups we only lose data from a single file rather than from the whole base.

Comment From: schneiderfelipe

I just got the very same problem with PyTables 3.4.4 and pandas 0.24.1. Downgrading to the versions provided by Ubuntu (3.4.2-4 and 0.22.0-4, respectively) solved the issue.

Comment From: joaoe

Hi.

In my project we have hundreds of HDF files; each file has many stores, and each store is about 300000x400 cells. These are loaded selectively by our application, both when developing locally and by our dev/test/prod instances. They are all stored in a Windows shared folder. Reading from a network share is a slow process, so it was a natural fix to read different stores in parallel.

My code

store = pd.HDFStore(fname, mode="r")
data = store[key]
store.close()

Someone suggested protecting the calls to HDFStore() and close() with a lock but the problem persists.

EOFError: Traceback (most recent call last):
  File "C:\...\test.py", line 75, in main
    data = store[key]
  File "c:\python27_64bit\lib\site-packages\pandas\io\pytables.py", line 483, in __getitem__
    return self.get(key)
  File "c:\python27_64bit\lib\site-packages\pandas\io\pytables.py", line 671, in get
    return self._read_group(group)
  File "c:\python27_64bit\lib\site-packages\pandas\io\pytables.py", line 1349, in _read_group
    return s.read(**kwargs)
  File "c:\python27_64bit\lib\site-packages\pandas\io\pytables.py", line 2895, in read
    ax = self.read_index('axis%d' % i, start=_start, stop=_stop)
  File "c:\python27_64bit\lib\site-packages\pandas\io\pytables.py", line 2493, in read_index
    _, index = self.read_index_node(getattr(self.group, key), **kwargs)
  File "c:\python27_64bit\lib\site-packages\pandas\io\pytables.py", line 2591, in read_index_node
    data = node[start:stop]
  File "c:\python27_64bit\lib\site-packages\tables\vlarray.py", line 675, in __getitem__
    return self.read(start, stop, step)
  File "c:\python27_64bit\lib\site-packages\tables\vlarray.py", line 815, in read
    outlistarr = [atom.fromarray(arr) for arr in listarr]
  File "c:\python27_64bit\lib\site-packages\tables\atom.py", line 1228, in fromarray
    return six.moves.cPickle.loads(array.tostring())

Without the lock, the code crashes deep into the C code.

It is quite obvious that tables.open_file() is the problem here, as that function uses a global file registry (FileRegistry).

Suggestions for how to fix this:

  1. flock the file: use a shared lock for read-only access and an exclusive lock for write access. This will protect access across processes, but it can well fail (the feature may not be available in the OS, or the file may be on a remote share).
  2. Do not share file handles. Each open call should produce a new, independent file handle; that's how every other file I/O API works. FileRegistry can still keep track of file handles, but it should not in any way share them, or close previous file handles while opening new ones.
  3. Protect the file handles internally with a read-write lock (which works the same as flock, but in memory: https://www.oreilly.com/library/view/python-cookbook/0596001673/ch06s04.html) so that write operations don't clobber each other or disrupt read operations. A rough sketch of this idea follows below.
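
To illustrate item 3, here is a minimal, untested sketch of an in-process readers-writer lock (it does not coordinate across processes, readers can starve a waiting writer in this simple version, and the read_key helper is just a hypothetical wrapper):

import threading

import pandas as pd

class ReadWriteLock(object):
    """Allow many concurrent readers or one exclusive writer (in-process only)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0

    def acquire_read(self):
        with self._cond:
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_write(self):
        # hold the condition's lock for the whole write
        self._cond.acquire()
        while self._readers > 0:
            self._cond.wait()

    def release_write(self):
        self._cond.release()

HDF_RW_LOCK = ReadWriteLock()

def read_key(fname, key):
    HDF_RW_LOCK.acquire_read()
    try:
        with pd.HDFStore(fname, mode='r') as store:
            return store[key]
    finally:
        HDF_RW_LOCK.release_read()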

Comment From: ZanSara

Is anyone working on this issue at the moment? I've also hit it, and I am willing to do some work on it if no one is on it already.

Comment From: ZanSara

At PyTables we're now working on a simple fix that might be released soon, see issue #776.

Comment From: Han-aorweb

I had the same issue, where multithreaded reading/writing of either the same or different HDF5 files threw random crashes. I was able to avoid it by adding a global lock. Hope this can help you if you have trouble.
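
Roughly what I mean (an untested sketch; the wrapper names are just placeholders):

import threading

import pandas as pd

HDF5_LOCK = threading.Lock()  # one global lock for all HDF5 access in the process

def safe_read_hdf(path, key):
    with HDF5_LOCK:
        return pd.read_hdf(path, key=key)

def safe_to_hdf(df, path, key, **kwargs):
    with HDF5_LOCK:
        df.to_hdf(path, key=key, **kwargs)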

Comment From: toobaz

I had the same issue, where multithreaded reading/writing of either the same or different HDF5 files threw random crashes. I was able to avoid it by adding a global lock. Hope this can help you if you have trouble.

I agree, see https://stackoverflow.com/a/29014295/2858145 and https://github.com/pandas-dev/pandas/issues/9641#event-251904150