Code Sample


import os
import sys
import tempfile

import pandas as pd
from urllib.request import urlretrieve

# Settings
p_ext = 'https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz'
p_down = os.path.join(tempfile.gettempdir(), 'tmp_gene_info.gz')
p_out = os.path.join(tempfile.gettempdir(), 'tmp_gene_info.h5')

# Report whether the Python and pandas versions match those where the error occurred
if sys.version.startswith('3.6.0 '):
    print('You are using the same Python version with which the problem occurred.')

if pd.__version__ == '0.19.2':
    print('You are using the same pandas version with which the problem occurred.')

# Download dataset where error occurred
urlretrieve(p_ext, p_down)

# Get dataframe
df = pd.read_table(p_down, sep='\t', header=0)
df = df.drop_duplicates(['#tax_id', 'GeneID'])

# Attempt to save as HDF5.
# The kernel dies approx. 2 seconds after memory usage
# approaches the limit of my machine (62GB out of 64GB).
df.to_hdf(
    p_out,
    'table',
    mode='w',
    append=True,
    data_columns=['#tax_id', 'GeneID'])

Problem description

The kernel dies while exporting a DataFrame to HDF5. This happens on two different machines of mine, both in IPython notebooks and when running the script from the command line.

I have not had problems with df.to_hdf on any other (smaller) DataFrames, or on subsets of this DataFrame, which suggests that the problem may be specific to this data set.

Before the kernel dies, RAM usage shoots up to around 62GB of 64GB. I am therefore not sure whether this is a bug or whether it should be a request for a low-memory fallback.

Expected Output

The DataFrame is saved as an HDF5 file.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Darwin
OS-release: 16.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 2.0.0
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.5
pymysql: None
psycopg2: None
jinja2: 2.9.4
boto: 2.45.0
pandas_datareader: None

Comment From: jreback

Please show a reproducible example that does not rely on external data like this.

See if you can create a frame programmatically that has the same dtypes and structure.

Comment From: tstoeger

Unfortunately, I have not been able to programmatically create a DataFrame that shows the same behavior, nor have I faced a similar problem with any other DataFrame.

Comment From: jreback

OK, please show df.info() and df.head() before the .to_hdf call.

Comment From: tstoeger

Output of df.info()

Int64Index: 16808216 entries, 0 to 16808215
Data columns (total 15 columns):
#tax_id                                  int64
GeneID                                   int64
Symbol                                   object
LocusTag                                 object
Synonyms                                 object
dbXrefs                                  object
chromosome                               object
map_location                             object
description                              object
type_of_gene                             object
Symbol_from_nomenclature_authority       object
Full_name_from_nomenclature_authority    object
Nomenclature_status                      object
Other_designations                       object
Modification_date                        int64
dtypes: int64(3), object(12)
memory usage: 2.0+ GB

Output of df.head()

     #tax_id    GeneID    Symbol    LocusTag    Synonyms    dbXrefs    chromosome    map_location    description    type_of_gene    Symbol_from_nomenclature_authority    Full_name_from_nomenclature_authority    Nomenclature_status    Other_designations    Modification_date
0    7    5692769    NEWENTRY    -    -    -    -    -    Record to support submission of GeneRIFs for a...    other    -    -    -    -    20160818
1    9    1246500    repA1    pLeuDn_01    -    -    -    -    putative replication-associated protein    protein-coding    -    -    -    -    20160813
2    9    1246501    repA2    pLeuDn_03    -    -    -    -    putative replication-associated protein    protein-coding    -    -    -    -    20160716
3    9    1246502    leuA    pLeuDn_04    -    -    -    -    2-isopropylmalate synthase    protein-coding    -    -    -    -    20160903
4    9    1246503    leuB    pLeuDn_05    -    -    -    -    3-isopropylmalate dehydrogenase    protein-coding    -    -    -    -    20150520
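
For anyone trying to reproduce this without the external download, the df.info() and df.head() output above could be turned into a synthetic frame along the following lines. This is only a rough sketch: the function name `make_synthetic_gene_info`, the share of "-" placeholders, and the generated values are assumptions for illustration, and it is not known whether such a frame triggers the same memory behavior.

```python
import numpy as np
import pandas as pd

# Column names and dtypes taken from the df.info() output in this thread.
OBJECT_COLUMNS = [
    "Symbol", "LocusTag", "Synonyms", "dbXrefs", "chromosome",
    "map_location", "description", "type_of_gene",
    "Symbol_from_nomenclature_authority",
    "Full_name_from_nomenclature_authority",
    "Nomenclature_status", "Other_designations",
]


def make_synthetic_gene_info(n_rows, seed=0):
    """Build a frame with 3 int64 and 12 object columns (hypothetical helper)."""
    rng = np.random.RandomState(seed)
    data = {
        "#tax_id": rng.randint(1, 10000, size=n_rows).astype("int64"),
        "GeneID": np.arange(n_rows, dtype="int64"),
    }
    for col in OBJECT_COLUMNS:
        # Assumption: most cells hold the "-" placeholder, as in df.head().
        is_placeholder = rng.rand(n_rows) < 0.8
        values = [
            "-" if flag else "{}_{}".format(col, i)
            for i, flag in enumerate(is_placeholder)
        ]
        # Force object dtype so the structure matches the reported frame.
        data[col] = pd.Series(values, dtype=object)
    data["Modification_date"] = np.full(n_rows, 20160818, dtype="int64")
    columns = ["#tax_id", "GeneID"] + OBJECT_COLUMNS + ["Modification_date"]
    return pd.DataFrame(data, columns=columns)
```

Calling `make_synthetic_gene_info(16808216)` would match the reported row count, although building the strings in a Python loop at that size is slow.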

Comment From: jreback

@tstoeger this is going to be stored very inefficiently. You can try converting columns to categoricals, since you seem to have a lot of duplicate data within each column. object dtype is not very efficient; it gets expanded to fixed-width string types (which is generally OK with compression).
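
A minimal sketch of the categorizing suggestion, assuming a simple cutoff on the share of distinct values per column. The helper name `categorize_low_cardinality` and the `max_unique_ratio` threshold are hypothetical and would need tuning for this data set.

```python
import pandas as pd


def categorize_low_cardinality(df, max_unique_ratio=0.5):
    """Return a copy of df with repetitive object columns as category dtype.

    A column is converted when its number of distinct values is at most
    max_unique_ratio * len(df). Note that df.copy() temporarily doubles
    memory use; on a very large frame, converting columns one at a time
    in place may be preferable.
    """
    out = df.copy()
    for col in out.columns:
        if out[col].dtype == object:
            n_unique = out[col].nunique(dropna=False)
            if n_unique <= max_unique_ratio * len(out):
                out[col] = out[col].astype("category")
    return out
```

As far as I know, category columns can only be written to HDF5 with the table format (which `append=True` in the original call already implies), so this is worth checking against the pandas version in use.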

I would guess that you are running out of memory. Best to chunk-write.
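
A rough sketch of what chunk-writing could look like with `HDFStore.append`. The helper names (`iter_row_chunks`, `max_string_width`, `write_hdf_in_chunks`) and the default chunk size are assumptions, not something taken from this thread. The string width is computed up front and passed as `min_itemsize`, because the fixed-width string size of a table is set by the first append and a later chunk with longer strings would otherwise fail.

```python
import pandas as pd


def iter_row_chunks(df, chunksize):
    """Yield consecutive row slices of at most `chunksize` rows."""
    for start in range(0, len(df), chunksize):
        yield df.iloc[start:start + chunksize]


def max_string_width(df):
    """Longest UTF-8 encoded value over all object columns (0 if none)."""
    widths = [
        int(df[col].astype(str).str.encode("utf-8").str.len().max())
        for col in df.columns
        if df[col].dtype == object
    ]
    return max(widths, default=0)


def write_hdf_in_chunks(df, path, key="table", chunksize=500000,
                        data_columns=None):
    """Append df to a new HDF5 table piece by piece (requires PyTables)."""
    width = max_string_width(df)
    with pd.HDFStore(path, mode="w") as store:
        for chunk in iter_row_chunks(df, chunksize):
            store.append(
                key,
                chunk,
                data_columns=data_columns,
                # One width for all string columns, fixed on the first append.
                min_itemsize=width or None,
            )
```

For the frame in this issue the call would be something like `write_hdf_in_chunks(df, p_out, data_columns=['#tax_id', 'GeneID'])`. The full DataFrame still has to fit in memory; chunking only limits how much is converted and written at one time, so whether it avoids the crash here is untested.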

closing as not-reproducible.