Code Sample

Environment setup:

conda create -n bug_test python=3.8 pandas pytables numpy psutil
conda activate bug_test

Test code:

import psutil
import numpy as np
import pandas as pd
import os
import gc

# Build a ~1 GB DataFrame (25M rows, mixed int/float dtypes), index it by the
# first integer column, and write it to HDF5 in "table" format.
random_np = np.random.randint(0, 1e16, size=(25000000,4))
random_df = pd.DataFrame(random_np)
random_df['Test'] = np.random.rand(25000000,1)
random_df.set_index(0, inplace=True)
random_df.sort_index(inplace=True)
random_df.to_hdf('test.h5', key='random_df', mode='w', format='table')

del random_np
del random_df
gc.collect()

# Baseline RSS before reading the file back.
initial_memory_usage = psutil.Process(os.getpid()).memory_info().rss

random_df = pd.read_hdf('test.h5')
print(f'Memory Usage According to Pandas: {random_df.__sizeof__()/1000000000:.2f}GB')
print(f'Real Memory Usage: {(psutil.Process(os.getpid()).memory_info().rss - initial_memory_usage)/1000000000:.2f}GB')

# Temporary workaround: deep-copy the index so it no longer references the
# buffer returned by PyTables.
random_df.index = random_df.index.copy(deep=True)
print(f'Memory Usage After Temp Fix: {(psutil.Process(os.getpid()).memory_info().rss - initial_memory_usage)/1000000000:.2f}GB')

del random_df
gc.collect()
print(f'Memory Usage After Deleting Table: {(psutil.Process(os.getpid()).memory_info().rss - initial_memory_usage)/1000000000:.2f}GB')

Problem description

The above code generates a ~1GB DataFrame with mixed data types and saves it to an HDF5 file in "table" format.

Loading the HDF5 file back, we expect it to use about 1GB of memory, but it actually uses about 1.8GB. I have found that the issue is with the index of the DataFrame: if I make a deep copy of the index and replace the original with the copy, the excess memory usage goes away and memory usage is 1GB as expected (a quick check is sketched below).
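A minimal way to confirm this, assuming the default NumPy-backed integer index (the name vals is mine, and exactly which object .base points to may vary, but on the affected versions it should not be None):

vals = random_df.index.values                # underlying ndarray of the index, no copy
print(vals.base is not None)                 # expected True: the index data is a view into another array
print(getattr(vals.base, 'nbytes', 0))       # rough size of the buffer being kept alive

random_df.index = random_df.index.copy(deep=True)
print(random_df.index.values.base is None)   # expected True after the workaround: nothing else is kept alive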

I initially encountered this issue with Pandas 1.0.5, but I have tested it on Pandas 1.1.3 and the issue still exists.

While investigating the bug in the Pandas 1.0.5 code, I noticed that PyTables is used to read the HDF5 file and returns a NumPy structured array. DataFrames are created from this array, and pd.concat is used to combine them into a single DataFrame. pd.concat copies the original data instead of just pointing at the NumPy array, but the index of the combined DataFrame is still a view into the original structured array. I think this explains the excess memory usage: the garbage collector cannot free the structured array because the index still holds a reference to it.
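The mechanism itself can be reproduced with plain NumPy, independent of pandas; this is only an illustration, with made-up names and sizes:

import gc
import numpy as np

# A small structured array standing in for what PyTables returns (~320 MB here).
big = np.zeros(20000000, dtype=[('index', 'i8'), ('values', 'f8')])
view = big['index']                 # field access returns a view, not a copy
print(view.base is big)             # True: the view references the whole buffer

del big
gc.collect()
print(view.base.nbytes / 1e6)       # ~320 MB still reachable through the view

view = view.copy()                  # the equivalent of the index deep-copy workaround
gc.collect()                        # now the structured array can actually be freed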

Due to significant changes to the read_hdf code in Pandas 1.1.3, I did not have time to find out whether this is still the same problem or a different one.

Expected Output

A 1GB DataFrame should use about 1GB of memory after loading, not 1.8GB.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : db08276bc116c438d3fdee492026f8223584c477
python           : 3.8.5.final.0
python-bits      : 64
OS               : Linux
OS-release       : 4.19.104-microsoft-standard
Version          : #1 SMP Wed Feb 19 06:37:35 UTC 2020
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : C.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.1.3
numpy            : 1.19.2
pytz             : 2020.1
dateutil         : 2.8.1
pip              : 20.2.4
setuptools       : 50.3.0.post20201006
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : None
numexpr          : 2.7.1
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pytables         : None
pyxlsb           : None
s3fs             : None
scipy            : None
sqlalchemy       : None
tables           : 3.6.1
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : None

Comment From: trevorkask

take

Comment From: lithomas1

Looks like the index is a view of a structured array (which is what PyTables returns to us).

It looks like that view is created here, since adding .copy() to values at this spot seems to fix it:

https://github.com/pandas-dev/pandas/blob/32b33085bf6eb278ac385cf7009140bd6012a51e/pandas/io/pytables.py#L2059-L2060

I'll need to dig deeper to figure out why the view is persisting through the concat.
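For what it's worth, the pass-through can be probed with a toy example outside the HDF5 path; the structured array, frames, and column names below are assumptions, not the actual read code, but on the affected 1.x versions the last check is expected to print True:

import numpy as np
import pandas as pd

struct = np.zeros(1000000, dtype=[('index', 'i8'), ('b0', 'f8'), ('b1', 'i8')])
idx = pd.Index(struct['index'])                    # wraps the field view without copying
f0 = pd.DataFrame({'b0': struct['b0']}, index=idx)
f1 = pd.DataFrame({'b1': struct['b1']}, index=idx)

out = pd.concat([f0, f1], axis=1)
print(np.shares_memory(out['b0'].to_numpy(), struct))  # False: the column blocks were copied
print(np.shares_memory(out.index.values, struct))      # expected True: the result index still views struct

That is, concat copies the block values, but when all inputs share the same row index it keeps that index (or a shallow copy of it), so the view into the structured array survives.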

Comment From: lithomas1

50673