Problem description
pd.read_hdf() segfaults when called from multiple threads in read-only mode.
From the documentation I expected that reading an HDF5 file in another thread would work. In the example, each file is mapped out to its own thread, and that thread is responsible for opening and closing the file. Note that when I used HDF5 files with compression, the segfault happened every time. While building the minimal example I used no compression and found that it does not segfault about 80% of the time. The test code shows that when complib is set but complevel is 0 (i.e. no compression), the failure rate is 100%.
This problem is also reported in #12236
Test Code
Note that this fails reliably when complib is not None. When complib=None, the failure rate is ~20% for me in limited testing.
This is another example of # but shows the effect of compression settings on the issue.
import numpy as np
import pandas as pd
import concurrent.futures as cf
# Failure rate (every failure is a segmentation fault).
# Keep exactly one of the assignments below; as written, the last one wins.
COMPLIB, COMPLEVEL = None, 0 # 2 out of 10 tests
COMPLIB, COMPLEVEL = 'zlib', 9 # 3 out of 3 tests
COMPLIB, COMPLEVEL = 'zlib', 0 # 3 out of 3 tests
COMPLIB, COMPLEVEL = 'blosc', 0 # 3 out of 3 tests
COMPLIB, COMPLEVEL = 'blosc', 1 # 3 out of 3 tests
COMPLIB, COMPLEVEL = 'lzo', 0 # 3 out of 3 tests
COMPLIB, COMPLEVEL = 'bzip2', 0 # 3 out of 3 tests
# create test data
a = np.random.rand(1000000)
b = np.random.rand(1000000)
d1 = pd.DataFrame(dict(a=a*1, b=b*1))
d2 = pd.DataFrame(dict(a=a*2, b=b*2))
d3 = pd.DataFrame(dict(a=a*3, b=b*3))
d4 = pd.DataFrame(dict(a=a*4, b=b*4))
d1.to_hdf('d1.h5', '/data', complib=COMPLIB, complevel=COMPLEVEL)
d2.to_hdf('d2.h5', '/data', complib=COMPLIB, complevel=COMPLEVEL)
d3.to_hdf('d3.h5', '/data', complib=COMPLIB, complevel=COMPLEVEL)
d4.to_hdf('d4.h5', '/data', complib=COMPLIB, complevel=COMPLEVEL)
files = ['d{}.h5'.format(i) for i in range(1, 5)]
# map reads out to threads
e = cf.ThreadPoolExecutor()
futures = [e.submit(pd.read_hdf, f, '/data/', mode='r') for f in files]
Expected Output
a list of 4 futures that can be
Output of pd.show_versions()
Crash Report
Comment From: jreback
yep this is a duplicate. welcome to put locks around things to fix this.
Comment From: jtf621
@jreback Are you looking for a patch to pandas, or are you suggesting I put locks in my code as a workaround?
Comment From: jreback
a patch to pandas would be most welcome; it would be somewhat similar to what you would do anyhow.
need to keep a file lock.
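For anyone who needs a workaround in user code in the meantime, here is a minimal sketch of the locking idea discussed above. It is not a patch to pandas. The names `serialized`, `read_all`, and `_HDF_LOCK` are hypothetical helpers made up for this illustration. The sketch assumes a single process-wide lock is needed rather than one lock per file, because the reproduction crashes even though each thread opens a different file. The reader is passed in as an argument so the helper itself depends only on the standard library.

```python
import concurrent.futures as cf
import functools
import threading

# Hypothetical process-wide lock. Assumption: the crash comes from state
# shared across threads, so a per-file lock would not be sufficient.
_HDF_LOCK = threading.Lock()


def serialized(func):
    """Wrap func so that only one thread at a time can execute it."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        with _HDF_LOCK:
            return func(*args, **kwargs)
    return wrapper


def read_all(reader, files, key="/data"):
    """Submit one read per file to a thread pool; return results in file order."""
    safe_reader = serialized(reader)
    with cf.ThreadPoolExecutor() as executor:
        futures = [executor.submit(safe_reader, f, key, mode="r") for f in files]
        # result() re-raises any exception raised inside the worker thread.
        return [future.result() for future in futures]
```

With the test code above, this would be called as `frames = read_all(pd.read_hdf, files)`. The reads are then serialized, so the threads give no speedup for the HDF5 part itself; the benefit is only that other threaded work in the same program can stay as it is.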