Code Sample
import os
import sys
import tempfile
import pandas as pd
from urllib.request import urlretrieve
# Settings
p_ext = 'https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz'
p_down = os.path.join(tempfile.gettempdir(), 'tmp_gene_info.gz')
p_out = os.path.join(tempfile.gettempdir(), 'tmp_gene_info.h5')
# Notify if the same Python and pandas versions as in this report are in use
if sys.version.startswith('3.6.0 '):
    print('You are using the same Python version in which the problem occurred.')
if pd.__version__ == '0.19.2':
    print('You are using the same pandas version in which the problem occurred.')
# Download dataset where error occurred
urlretrieve(p_ext, p_down)
# Get dataframe
df = pd.read_table(p_down, sep='\t', header=0)
df = df.drop_duplicates(['#tax_id', 'GeneID'])
# Attempt to save as HDF5
# The kernel dies approx. 2 seconds after memory usage approaches
# the limit of my machine (62 GB out of 64 GB)
df.to_hdf(
    p_out,
    'table',
    mode='w',
    append=True,
    data_columns=['#tax_id', 'GeneID'])
Problem description
The kernel dies while exporting the DataFrame to HDF5. This happens on two different machines of mine, both in IPython notebooks and when running the script from the command line.
I have not had problems with df.to_hdf on any other (smaller) DataFrames, or on subsets of this DataFrame, which suggests that the problem is specific to this data set.
Before the kernel dies, RAM usage shoots up to around 62 GB of the 64 GB available. I am therefore not sure whether this is a bug report or rather a feature request for a low-memory fallback.
Expected Output
The DataFrame is saved as an HDF5 file.
Output of pd.show_versions()
Comment From: jreback
pls show a reproducible example that does not rely on external data like this
see if u can create a frame programmatically that has the same dtypes and structure
Comment From: tstoeger
Unfortunately, I have not been able to programmatically create a DataFrame that shows the same behavior, nor have I encountered a similar problem with any other DataFrame.
Comment From: jreback
ok, pls show df.info() and df.head() before the .to_hdf call.
Comment From: tstoeger
Output of df.info()
#tax_id int64
GeneID int64
Symbol object
LocusTag object
Synonyms object
dbXrefs object
chromosome object
map_location object
description object
type_of_gene object
Symbol_from_nomenclature_authority object
Full_name_from_nomenclature_authority object
Nomenclature_status object
Other_designations object
Modification_date int64
dtypes: int64(3), object(12)
memory usage: 2.0+ GB
Output of df.head()
#tax_id GeneID Symbol LocusTag Synonyms dbXrefs chromosome map_location description type_of_gene Symbol_from_nomenclature_authority Full_name_from_nomenclature_authority Nomenclature_status Other_designations Modification_date
0 7 5692769 NEWENTRY - - - - - Record to support submission of GeneRIFs for a... other - - - - 20160818
1 9 1246500 repA1 pLeuDn_01 - - - - putative replication-associated protein protein-coding - - - - 20160813
2 9 1246501 repA2 pLeuDn_03 - - - - putative replication-associated protein protein-coding - - - - 20160716
3 9 1246502 leuA pLeuDn_04 - - - - 2-isopropylmalate synthase protein-coding - - - - 20160903
4 9 1246503 leuB pLeuDn_05 - - - - 3-isopropylmalate dehydrogenase protein-coding - - - - 20150520
Comment From: jreback
@tstoeger so this is going to be very inefficiently stored. you can try categorizing columns, since it seems you have lots of duplicate data (this applies per column). object dtype is not very efficient; it will get expanded to fixed-width string types (which, with compression, is generally ok).
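A minimal sketch of the suggested categorization, assuming the df from the code sample above; the column names below are only illustrative, and any highly repetitive object column is a candidate:

# Convert repetitive object columns to category dtype to shrink
# the in-memory footprint before writing (column names are illustrative).
for col in ['chromosome', 'type_of_gene', 'Nomenclature_status']:
    df[col] = df[col].astype('category')

# Compare the footprint before and after conversion.
df.info(memory_usage='deep')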
I would guess that you are running out of memory. Best to chunk-write.
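A minimal sketch of such chunked writing, assuming the df, p_out, and key from the code sample above; the chunk size is an arbitrary placeholder:

chunk_size = 500000  # placeholder; tune to available memory
with pd.HDFStore(p_out, mode='w') as store:
    for start in range(0, len(df), chunk_size):
        # Append each slice so only one slice is serialized at a time.
        store.append(
            'table',
            df.iloc[start:start + chunk_size],
            data_columns=['#tax_id', 'GeneID'])
        # If a later chunk holds longer strings than the first chunk wrote,
        # min_itemsize may need to be passed on the first append.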
closing as not-reproducible.