xref: https://github.com/pydata/pandas/issues/2397 xref #14263 (example)
I think that we should protect the file open/closes with a lock to avoid this problem.
Comment From: toobaz
I agree! Although my code assumes only write access, and so would require some fixing and integration, do you think the approach makes sense?
Comment From: jreback
@toobaz your solution 'works', but that is not a recommended way to do it at all. HDF5 by definition should only ever have a single writer.
This helps with the SWMR case: https://www.hdfgroup.org/HDF5/docNewFeatures/NewFeaturesSwmrDocs.html
All that said, I think read_hdf is not thread-safe for reading, because the file handle needs to be opened in the master thread.
Here's an example using a multiprocessing-based thread pool; it needs testing with a regular thread pool as well. This currently segfaults:
import numpy as np
import pandas as pd
from pandas.util import testing as tm
from multiprocessing.pool import ThreadPool

path = 'test.hdf'
num_rows = 100000
num_tasks = 4

def make_df(num_rows=10000):
    df = pd.DataFrame(np.random.rand(num_rows, 5), columns=list('abcde'))
    df['foo'] = 'foo'
    df['bar'] = 'bar'
    df['baz'] = 'baz'
    df['date'] = pd.date_range('20000101 09:00:00',
                               periods=num_rows,
                               freq='s')
    df['int'] = np.arange(num_rows, dtype='int64')
    return df

print("writing df")
df = make_df(num_rows=num_rows)
df.to_hdf(path, 'df', format='table')

# single threaded
print("reading df - single threaded")
result = pd.read_hdf(path, 'df')
tm.assert_frame_equal(result, df)

# multi-threaded
def reader(arg):
    start, nrows = arg
    return pd.read_hdf(path,
                       'df',
                       mode='r',
                       start=start,
                       stop=start + nrows)

# integer division (//) so start/stop are ints under Python 3
tasks = [
    (num_rows * i // num_tasks,
     num_rows // num_tasks) for i in range(num_tasks)
]

pool = ThreadPool(processes=num_tasks)
print("reading df - multi threaded")
results = pool.map(reader, tasks)
result = pd.concat(results)
tm.assert_frame_equal(result, df)
Comment From: dragonator4
Yes this is a duplicate of #14692. I have a small contribution to this discussion:
def reader(arg):
    start, nrows = arg
    with pd.HDFStore(path, 'r') as store:
        df = pd.read_hdf(store,
                         'df',
                         mode='r',
                         start=int(start),
                         stop=int(start + nrows))
    return df
This works without any problems in the code example above. It seems that as long as each query opens its own connection to the store, and only one query is placed through each connection, there is no error. Looking again at the example I provided in #14692, and also at the original version of reader given by @jreback above, multiple queries through the same connection cause all the problems.
The fix may not involve placing locks on the store. Perhaps the fix is to allow pd.read_hdf, or other variants of accessing data from a store in read mode, to open as many connections as required...
Comment From: jtf621
I found another example of the issue. See #15274.
Note the segfault failure rate depends on the presence or absence of compression on the hdf file.
Comment From: rbiswas4
I am running into this issue on both OSX and Linux (CentOS), trying to parallelize using joblib, even though I am not attempting any writes as part of SWMR. Is there a different approach for a read-only use case? Is this a pandas issue or an h5.c issue? Thanks.
In my application I am trying to read different groups on different processors, and running into this error. I am using
pandas 0.19.1
pytables 3.4.2
joblib 0.11
The code below works when n_jobs=1 and fails when n_jobs is 2 or more.
store = pd.HDFStore('ddf_lcs.hdf', mode='r')
params = pd.read_hdf('ddf_params.hdf')
tileIDs = params.tileID.dropna().unique().astype(np.int)

def runGroup(i, store=store, minion_params=params):
    print('group ', i)
    if i is np.nan:
        return
    low = i - 0.5
    high = i + 0.5
    print('i, low and high are ', i, low, high)
    key = str(int(i))
    lcdf = store.select(key)
    print('this group has {0} rows and {1} unique values of var'.format(len(lcdf), lcdf.snid.unique().size))

Parallel(n_jobs=1)(delayed(runGroup)(t) for t in tileIDs[:10])
store.close()
The error message I get is:
/usr/local/miniconda/lib/python2.7/site-packages/tables/group.py:1213: UserWarning: problems loading leaf ``/282748/table``::
HDF5 error back trace
File "H5Dio.c", line 173, in H5Dread
can't read data
File "H5Dio.c", line 554, in H5D__read
can't read data
File "H5Dchunk.c", line 1856, in H5D__chunk_read
error looking up chunk address
File "H5Dchunk.c", line 2441, in H5D__chunk_lookup
can't query chunk address
File "H5Dbtree.c", line 998, in H5D__btree_idx_get_addr
can't get chunk info
File "H5B.c", line 340, in H5B_find
unable to load B-tree node
File "H5AC.c", line 1262, in H5AC_protect
H5C_protect() failed.
File "H5C.c", line 3574, in H5C_protect
can't load entry
File "H5C.c", line 7954, in H5C_load_entry
unable to load entry
File "H5Bcache.c", line 143, in H5B__load
wrong B-tree signature
End of HDF5 error back trace
...
Comment From: jreback
@rbiswas4 you can't pass the handles in like this. You can try to open inside the passed function (as read-only); that might work. This is a non-trivial problem (reading from HDF5 in a thread-safe manner via multiple processes). You can have a look at how dask solved this (and is still dealing with it).
Comment From: rbiswas4
you can try to open inside the passed function (as read-only)
@jreback Interesting. I thought I had trouble with that and had hence moved to HDFStore, but it looks like I should be able to pass the filename to the function and read it with pd.read_hdf(filename, key=key) inside the function. Some initial tests suggest that this is indeed true, which would solve the problem of reading. So if I am guessing correctly, the thread-safe mechanism is essentially a lock on the file-open process, and not on retrieval operations like get or select, which is what I thought HDFStore was for.
Also thank you for the pointer to the dask issues which discuss parallel write/save mechanisms.
Comment From: grantstephens
I think I have just run into this issue, but in a slightly different use case.
I have multiple threads that each work with different HDF files using with pd.HDFStore(...). Reading the files works fine, but writing to them fails with what looks like a compression-related error from PyTables:
if complib is not None and complib not in tables.filters.all_complibs:
AttributeError: module 'tables' has no attribute 'filters'
I must stress this: they are completely different files, but I am writing to them with compression from different threads. I think the issue may be concurrent use of the compression library, but any hints or advice would be appreciated.
Comment From: lafrech
For the record, this is what I do. It is probably obvious, but just in case it could help anyone, here it is.
HDF_LOCK = threading.Lock()
HDF_FILEPATH = '/my_file_path'

@contextmanager
def locked_store():
    with HDF_LOCK:
        with pd.HDFStore(HDF_FILEPATH) as store:
            yield store
Then
def get(self, ts_id, t_start, t_end):
    with locked_store() as store:
        where = 'index>=t_start and index<t_end'
        return store.select(ts_id, where=where)
If several files are used, a dict of locks can be used to avoid locking the whole application. It can be a defaultdict with a new Lock as default value. The key should be the absolute filepath with no symlinks, to be sure a file can't be accessed through two different locks.
Comment From: makmanalp
@lafrech you could even make HDF_FILEPATH just an argument to the context manager and make it more generic:
@contextmanager
def locked_store(*args, lock=threading.Lock, **kwargs):
    with lock():
        with pd.HDFStore(*args, **kwargs) as store:
            yield store
Comment From: lafrech
@makmanalp, won't this create a new lock for each call (thus defeating the purpose of the lock)?
We need to be sure that calls to the same file share the same lock and calls to different files don't.
I don't see how this is achieved in your example.
I assume I have to map the name, path or whatever identifier of the file to a lock instance.
Or maybe I am missing something.
Comment From: makmanalp
Er, my mistake entirely - I was just trying to point out that one can make it more reusable. Should be something like:
def make_locked_store(lock, filepath):
    @contextmanager
    def locked_store(*args, **kwargs):
        with lock:
            with pd.HDFStore(filepath, *args, **kwargs) as store:
                yield store
    return locked_store
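Usage might look like this (repeating the factory above so the snippet is self-contained; the path and key are hypothetical). Every caller of the returned context manager shares the same Lock instance, which is what fixes the earlier version:

```python
import threading
from contextlib import contextmanager

import pandas as pd

def make_locked_store(lock, filepath):
    @contextmanager
    def locked_store(*args, **kwargs):
        with lock:
            with pd.HDFStore(filepath, *args, **kwargs) as store:
                yield store
    return locked_store

# One lock and one wrapper per file.
store_ctx = make_locked_store(threading.Lock(), 'my_data.hdf')  # hypothetical path

with store_ctx(mode='a') as store:
    store.put('df', pd.DataFrame({'x': [1, 2, 3]}), format='table')

with store_ctx(mode='r') as store:
    df = store.select('df')
```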
Comment From: lafrech
If several files are used, a dict of locks can be used to avoid locking the whole application.
:warning: Warning :warning:
I've been trying this and had issues. It looks like the whole HDF5 library is not thread-safe, and it fails even when accessing two different files concurrently. I didn't take the time to verify that, so I may be totally wrong, but beware if you try it.
I reverted my change, and I keep a single lock in the application rather than a lock per hdf5 file.
Since HDF5 is hierarchical, I guess a lot of users put everything in a single file anyway. We ended up using lots of files mainly because HDF5 files tend to get corrupted, so when restoring daily backups we only lose data from a single file rather than from the whole base.
Comment From: schneiderfelipe
I just got the very same problem with PyTables 3.4.4 and Pandas 0.24.1. Downgrading to the ones provided by Ubuntu (3.4.2-4 and 0.22.0-4, respectively) solved the issue.
Comment From: joaoe
Hi.
In my project we have hundreds of HDF files; each file has many stores, and each store holds 300000x400 cells. These are loaded selectively by our application, both when developing locally and by our dev/test/prod instances. They are all kept on a Windows shared folder, and reading from a network share is slow, so it was a natural fix to read different stores in parallel.
My code:
store = pd.HDFStore(fname, mode="r")
data = store[key]
store.close()
Someone suggested protecting the calls to HDFStore() and close() with a lock, but the problem persists.
EOFError: Traceback (most recent call last):
File "C:\...\test.py", line 75, in main
data = store[key]
File "c:\python27_64bit\lib\site-packages\pandas\io\pytables.py", line 483, in __getitem__
return self.get(key)
File "c:\python27_64bit\lib\site-packages\pandas\io\pytables.py", line 671, in get
return self._read_group(group)
File "c:\python27_64bit\lib\site-packages\pandas\io\pytables.py", line 1349, in _read_group
return s.read(**kwargs)
File "c:\python27_64bit\lib\site-packages\pandas\io\pytables.py", line 2895, in read
ax = self.read_index('axis%d' % i, start=_start, stop=_stop)
File "c:\python27_64bit\lib\site-packages\pandas\io\pytables.py", line 2493, in read_index
_, index = self.read_index_node(getattr(self.group, key), **kwargs)
File "c:\python27_64bit\lib\site-packages\pandas\io\pytables.py", line 2591, in read_index_node
data = node[start:stop]
File "c:\python27_64bit\lib\site-packages\tables\vlarray.py", line 675, in __getitem__
return self.read(start, stop, step)
File "c:\python27_64bit\lib\site-packages\tables\vlarray.py", line 815, in read
outlistarr = [atom.fromarray(arr) for arr in listarr]
File "c:\python27_64bit\lib\site-packages\tables\atom.py", line 1228, in fromarray
return six.moves.cPickle.loads(array.tostring())
Without the lock, the code crashes deep into the C code.
It is quite obvious that tables.file.open() is the problem here, as that function uses a global FileRegistry.
Suggestions how to fix:
- flock the file: use a shared lock for read-only access and an exclusive lock for write access. This protects access across processes, though it can well fail (the feature may not be available in the OS, or the file may live on a remote share).
- Do not share file handles: each open call should produce a new, independent file handle. That's how every other file I/O API works. FileRegistry can still keep track of file handles, but should not in any way share them, or close previous handles when opening new ones.
- Protect the file handles internally with a ReadWriteLock (which works the same as flock but in memory, see https://www.oreilly.com/library/view/python-cookbook/0596001673/ch06s04.html), so write operations don't clobber each other nor disrupt read operations.
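The third suggestion could be sketched roughly like this (a minimal reader-preference read/write lock along the lines of the cookbook recipe linked above; it is not production-hardened, and writers can starve under a steady stream of readers):

```python
import threading

class ReadWriteLock:
    """Many concurrent readers, at most one exclusive writer."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0

    def acquire_read(self):
        with self._cond:
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()  # wake a waiting writer

    def acquire_write(self):
        self._cond.acquire()             # blocks new readers from entering
        while self._readers > 0:
            self._cond.wait()            # wait for in-flight readers to drain
        # the condition's lock stays held until release_write

    def release_write(self):
        self._cond.release()
```

A file registry could take the read lock around select/get calls and the write lock around put/append, so reads proceed in parallel while writes get exclusive access.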
Comment From: ZanSara
Is anyone working on this issue at the moment? I've also hit it, and I am willing to do some work on it if no one is on it already.
Comment From: ZanSara
At PyTables we're now working on a simple fix that might be released soon, see issue #776.
Comment From: Han-aorweb
I had the same issue: multithreaded reading/writing of the same or different HDF5 files throws random crashes. I was able to avoid it by adding a global lock. Hope this helps if you run into trouble.
Comment From: toobaz
I had the same issue: multithreaded reading/writing of the same or different HDF5 files throws random crashes. I was able to avoid it by adding a global lock. Hope this helps if you run into trouble.
I agree, see https://stackoverflow.com/a/29014295/2858145 and https://github.com/pandas-dev/pandas/issues/9641#event-251904150