Pandas read_hdf is not read-threadsafe on MAC and Windows

Problem description

pd.read_hdf() segfaults when called from multiple threads in read only mode.

From the documentation I expected that reading a HDF5 file in another thread would work. In the example each file is mapped out to its own thread and the thread is responsible for opening and closing the file. Note that when I used HDF5 files with compression the segfault happened everytime. When I was making the minimal example, I used no compression and fround it does not segfault about 80% of the time. The test code shows that when complib is set, but complevel is 0 (i.e no compression) the failure rate is 100%.

This problem is also reported in #12236

Test Code

Note this fails reliably when complib is not None. When complib=None the failure rate is ~ 20% for me in limited testing.

This is another example of # but shows effect of compression settings on the issue

import numpy as np
import pandas as pd
import concurrent.futures as cf
                                # Failure rate
                                # - Fails with Segmentation fault
COMPLIB, COMPLEVEL = None, 0     # 2 out of 10 tests
COMPLIB, COMPLEVEL = 'zlib', 9   # 3 out of 3 tests
COMPLIB, COMPLEVEL = 'zlib', 0   # 3 out of 3 tests
COMPLIB, COMPLEVEL = 'blosc', 0  # 3 out of 3 tests
COMPLIB, COMPLEVEL = 'blosc', 1  # 3 out of 3 tests
COMPLIB, COMPLEVEL = 'lzo', 0    # 3 out of 3 tests
COMPLIB, COMPLEVEL = 'bzip2', 0  # 3 out of 3 tests

# create test data
a = np.random.rand(1000000)
b = np.random.rand(1000000)

d1 = pd.DataFrame(dict(a=a*1, b=b*1))
d2 = pd.DataFrame(dict(a=a*2, b=b*2))
d3 = pd.DataFrame(dict(a=a*3, b=b*3))
d4 = pd.DataFrame(dict(a=a*4, b=b*4))

d1.to_hdf('d1.h5', '/data', complib=COMPLIB, complevel=COMPLEVEL)
d2.to_hdf('d2.h5', '/data', complib=COMPLIB, complevel=COMPLEVEL)
d3.to_hdf('d3.h5', '/data', complib=COMPLIB, complevel=COMPLEVEL)
d4.to_hdf('d4.h5', '/data', complib=COMPLIB, complevel=COMPLEVEL)

files = ['d{}.h5'.format(i) for i in range(1, 5)]

# map reads out to threads
e = cf.ThreadPoolExecutor()
futures = [e.submit(pd.read_hdf, f, '/data/', mode='r') for f in files]

Expected Output

a list of 4 futures that can be

Output of `pd.show_versions()`

INSTALLED VERSIONS ------------------ commit: None python: 3.6.0.final.0 python-bits: 64 OS: Darwin OS-release: 15.6.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 pandas: 0.19.2 nose: None pip: 9.0.1 setuptools: 28.8.0 Cython: None numpy: 1.12.0 scipy: None statsmodels: None xarray: None IPython: 5.1.0 sphinx: None patsy: None dateutil: 2.6.0 pytz: 2016.10 blosc: None bottleneck: None tables: 3.3.0 numexpr: 2.6.1 matplotlib: 2.0.0 openpyxl: None xlrd: None xlwt: None xlsxwriter: None lxml: None bs4: None html5lib: 0.9999999 httplib2: None apiclient: None sqlalchemy: None pymysql: None psycopg2: None jinja2: 2.9.4 boto: None pandas_datareader: None

Crash Report

python(7164,0x700000c10000) malloc: *** error for object 0x7f8ba3655e60: pointer being freed was not allocated *** set a breakpoint in malloc_error_break to debug Abort trap: 6

Comment From: jreback

yep this is a duplicate. welcome to put locks around things to fix this.

Comment From: jtf621

@jreback Are you looking for a patch to pandas, or are you suggesting I put locks in my code as a workaround?

Comment From: jreback

a patch to pandas would be most welcome it would be somewhat similar to what you would do anyhow

need to keep a file lock

Pandas read_hdf is not read-threadsafe on MAC and Windows

Problem description

Test Code

Expected Output

Output of pd.show_versions()

Output of `pd.show_versions()`