One of my folders contains multiple .h5 files, and I tried to load them into DataFrames and then concat these DataFrames into one.

The Python process crashes when `num_tasks > 1`. If I debug thread by thread, it works; in other words, it crashes simply because two threads run at the same time, even though they read different files.
```python
from multiprocessing.pool import ThreadPool
import pandas as pd

num_tasks = 2

def readjob(x):
    path = x
    return pd.read_hdf(path, "df", mode='r')

pool = ThreadPool(num_tasks)
results = pool.map(readjob, files)
```
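The concat step itself is straightforward; a sketch, assuming the per-file frames share columns:

```python
# combine the per-file DataFrames into one, as described above
combined = pd.concat(results, ignore_index=True)
```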
Comment From: TomAugspurger
Could you make a reproducible example? `files` is undefined. Also, please include the output of `pd.show_versions()`.
Comment From: xmseraph
`files` is a list of strings containing the absolute paths of the .h5 files; you will need code like this:
```python
from os import listdir
from os.path import isfile, join

dir = 'where i store the h5 files'
files = [join(dir, f) for f in listdir(dir) if isfile(join(dir, f))]
```
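For what it's worth, `glob` can restrict the list to just the .h5 files, in case the folder holds anything else; a sketch, assuming the same directory:

```python
import glob
from os.path import join

dir = 'where i store the h5 files'
files = sorted(glob.glob(join(dir, '*.h5')))  # keep only the .h5 files
```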
```
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 26.1.1
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.0
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.4.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.2
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.40.0
pandas_datareader: None
```
Comment From: TomAugspurger
Unfortunately, that still won't work for me, since the directory `'where i store the h5 files'` isn't on my computer. Can you make a small script to generate the HDF files needed? They shouldn't need to be large or that numerous.
Comment From: xmseraph
@TomAugspurger thank you for your reply. Actually, if I hadn't written code to generate small files for you, I wouldn't have noticed this problem in how I create the H5 files:
```python
import numpy as np
import pandas as pd
from pandas.util import testing as tm
from multiprocessing.pool import ThreadPool

path = 'test.hdf'
path1 = 'test1.hdf'
files = [path, path1]
num_rows = 100000
num_tasks = 2

def make_df(num_rows=10000):
    df = pd.DataFrame(np.random.rand(num_rows, 5), columns=list('abcde'))
    df['foo'] = 'foo'
    df['bar'] = 'bar'
    df['baz'] = 'baz'
    df['date'] = pd.date_range('20000101 09:00:00',
                               periods=num_rows,
                               freq='s')
    df['int'] = np.arange(num_rows, dtype='int64')
    return df

print("writing df")
df = make_df(num_rows=num_rows)
df.to_hdf(path, 'df', complib='zlib', complevel=9, append=False, mode='w', format='t')
df.to_hdf(path1, 'df', complib='zlib', complevel=9, append=False, mode='a', format='t')  # note: mode='a' here

def readjob(x):
    path = x
    return pd.read_hdf(path, "df", mode='r')

pool = ThreadPool(num_tasks)
results = pool.map(readjob, files)
print(results)
```
When I write to `path1`, I set the mode to append, and the code crashes when the pool kicks in; but if I write to `path1` with `mode='w'`, the code works. Isn't that weird?
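For reference, a sketch of the writer variant that ran for me, with both files created in `mode='w'`:

```python
# both files created fresh with mode='w' -- with this, the ThreadPool read ran
df.to_hdf(path, 'df', complib='zlib', complevel=9, append=False, mode='w', format='t')
df.to_hdf(path1, 'df', complib='zlib', complevel=9, append=False, mode='w', format='t')
```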
Comment From: jreback
duplicate of https://github.com/pydata/pandas/issues/12236
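For anyone landing here before that issue is resolved: a minimal sketch of a workaround, assuming the root cause is that HDF5/PyTables access is not thread-safe, is to guard every read with a single lock. This gives up read parallelism, but it keeps two threads from entering the HDF5 library at once:

```python
import threading
from multiprocessing.pool import ThreadPool
import pandas as pd

files = ['test.hdf', 'test1.hdf']  # the files from the repro above
hdf_lock = threading.Lock()

def readjob(path):
    with hdf_lock:  # only one thread touches HDF5 at a time
        return pd.read_hdf(path, "df", mode='r')

pool = ThreadPool(2)
results = pool.map(readjob, files)
```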
Comment From: xmseraph
The `mode` parameter doesn't actually fix the problem; after testing the code more times, I found that whether it runs through is just random.
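Since the failures look like a race between threads, a sketch of a process-based alternative, assuming the goal is just parallel loading (note that on Windows the pool must be created under the `__main__` guard, and the worker must be defined at module level):

```python
from multiprocessing import Pool
import pandas as pd

def readjob(path):
    # each worker process gets its own copy of the HDF5 library state
    return pd.read_hdf(path, "df", mode='r')

if __name__ == '__main__':
    files = ['test.hdf', 'test1.hdf']  # the files from the repro above
    pool = Pool(2)
    results = pool.map(readjob, files)
    combined = pd.concat(results, ignore_index=True)
    print(combined.shape)
```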