One of my folders contains multiple .h5 files, and I tried to load them into DataFrames and then concat these DataFrames into one.

The Python process crashes when `num_tasks > 1`. If I debug thread by thread, it works; in other words, it crashes simply because two threads run at the same time, even though they read different files.
```python
from multiprocessing.pool import ThreadPool
import pandas as pd

num_tasks = 2

def readjob(x):
    path = x
    return pd.read_hdf(path, "df", mode='r')

pool = ThreadPool(num_tasks)
results = pool.map(readjob, files)
```
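The concat step itself is straightforward; a sketch, assuming the per-file frames share columns:

```python
# combine the per-file DataFrames into one, as described above
combined = pd.concat(results, ignore_index=True)
```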
Comment From: TomAugspurger
Could you make a reproducible example? `files` is undefined. Also, please include the output of `pd.show_versions()`.
Comment From: xmseraph
`files` is a list of strings containing the absolute paths of the .h5 files; you will need code like this:
```python
from os import listdir
from os.path import isfile, join

dir = 'where i store the h5 files'
files = [join(dir, f) for f in listdir(dir) if isfile(join(dir, f))]
```
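For what it's worth, `glob` can restrict the list to just the .h5 files, in case the folder holds anything else; a sketch, assuming the same directory:

```python
import glob
from os.path import join

dir = 'where i store the h5 files'
files = sorted(glob.glob(join(dir, '*.h5')))  # keep only the .h5 files
```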
```
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 26.1.1
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.0
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.4.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.2
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.40.0
pandas_datareader: None
```
Comment From: TomAugspurger
Unfortunately, that still won't work for me, since the directory `'where i store the h5 files'` isn't on my computer. Can you make a small script to generate the HDF files needed? They shouldn't need to be large or that numerous.
Comment From: xmseraph
@TomAugspurger thank you for your reply. Actually, if I hadn't written code to generate small files for you, I wouldn't have noticed this problem in how I create the H5 files:
```python
import numpy as np
import pandas as pd
from pandas.util import testing as tm
from multiprocessing.pool import ThreadPool

path = 'test.hdf'
path1 = 'test1.hdf'
files = [path, path1]
num_rows = 100000
num_tasks = 2

def make_df(num_rows=10000):
    df = pd.DataFrame(np.random.rand(num_rows, 5), columns=list('abcde'))
    df['foo'] = 'foo'
    df['bar'] = 'bar'
    df['baz'] = 'baz'
    df['date'] = pd.date_range('20000101 09:00:00',
                               periods=num_rows,
                               freq='s')
    df['int'] = np.arange(num_rows, dtype='int64')
    return df

print("writing df")
df = make_df(num_rows=num_rows)
df.to_hdf(path, 'df', complib='zlib', complevel=9, append=False, mode='w', format='t')
df.to_hdf(path1, 'df', complib='zlib', complevel=9, append=False, mode='a', format='t')  # note: mode='a' here

def readjob(x):
    path = x
    return pd.read_hdf(path, "df", mode='r')

pool = ThreadPool(num_tasks)
results = pool.map(readjob, files)
print(results)
```
When I write to `path1`, I set the mode to append, and the code crashes when the pool kicks in; but if I write to `path1` with `mode='w'`, the code works. Isn't that weird?
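For reference, a sketch of the writer variant that ran for me, with both files created in `mode='w'`:

```python
# both files created fresh with mode='w' -- with this, the ThreadPool read ran
df.to_hdf(path, 'df', complib='zlib', complevel=9, append=False, mode='w', format='t')
df.to_hdf(path1, 'df', complib='zlib', complevel=9, append=False, mode='w', format='t')
```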
Comment From: jreback
duplicate of https://github.com/pydata/pandas/issues/12236
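For anyone landing here before that issue is resolved: a minimal sketch of a workaround, assuming the root cause is that HDF5/PyTables access is not thread-safe, is to guard every read with a single lock. This gives up read parallelism, but it keeps two threads from entering the HDF5 library at once:

```python
import threading
from multiprocessing.pool import ThreadPool
import pandas as pd

files = ['test.hdf', 'test1.hdf']  # the files from the repro above
hdf_lock = threading.Lock()

def readjob(path):
    with hdf_lock:  # only one thread touches HDF5 at a time
        return pd.read_hdf(path, "df", mode='r')

pool = ThreadPool(2)
results = pool.map(readjob, files)
```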
Comment From: xmseraph
The `mode` parameter doesn't actually fix the problem; after testing the code more times, I found that whether it runs through is just random.
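Since the failures look like a race between threads, a sketch of a process-based alternative, assuming the goal is just parallel loading (note that on Windows the pool must be created under the `__main__` guard, and the worker must be defined at module level):

```python
from multiprocessing import Pool
import pandas as pd

def readjob(path):
    # each worker process gets its own copy of the HDF5 library state
    return pd.read_hdf(path, "df", mode='r')

if __name__ == '__main__':
    files = ['test.hdf', 'test1.hdf']  # the files from the repro above
    pool = Pool(2)
    results = pool.map(readjob, files)
    combined = pd.concat(results, ignore_index=True)
    print(combined.shape)
```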