Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [X] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

Let's say I have an HDF5 file and a CSV file that contain a single column/dataset of equivalent string data. When I read it in via HDF5:

import h5py
import pandas as pd

foo = pd.DataFrame()
dataset = h5py.File(file)[column][:]  # dtype = S10, length = 10 million
foo['a'] = dataset  # dtype is still S10, described in issue #52617
foo['a'] = foo['a'].astype(object)

At this point, the memory explodes to something colossal. By contrast, when I do

foo = pd.read_csv(large_file)

The memory stays really low, as though it is interning the strings in the read_csv codepath, or more likely doing some internal optimization such as column.astype('category').astype(object) for all the string columns.
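One way to check (a minimal sketch; the file path and column name below are placeholders, not my real data) is to count how many distinct Python string objects actually back the column after read_csv:

import pandas as pd

bar = pd.read_csv('data.csv')  # placeholder for the large csv
col = bar['a']

n_rows = len(col)
n_distinct_objects = len({id(v) for v in col})  # distinct PyObject pointers backing the column
n_distinct_values = col.nunique()

# If read_csv de-duplicates, n_distinct_objects stays close to n_distinct_values;
# the hdf5 + astype(object) path appears to give one object per row instead.
print(n_rows, n_distinct_objects, n_distinct_values)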

Can you confirm whether this is true? If not, can you describe why there is a discrepancy?

Lastly, in the former case, can you recommend a similar optimization? I'm nervous about casting everything to categories, because on the off chance there is a tremendous number of unique strings, it could blow out the memory even worse.

Issue Description

See above

Expected Behavior

See above

Installed Versions

INSTALLED VERSIONS
------------------
commit : 66e3805b8cabe977f40c05259cc3fcf7ead5687d
python : 3.8.13.final.0
python-bits : 64
OS : Linux
OS-release : 4.18.0-348.23.1.el8_5.x86_64
Version : #1 SMP Tue Apr 12 11:20:32 EDT 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.5
numpy : 1.23.3
pytz : 2022.4
dateutil : 2.8.2
pip : 22.2.2
setuptools : 65.4.1
Cython : 0.29.32
pytest : 7.0.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.8.0
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.5.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : 1.3.5
fsspec : 2022.8.2
fastparquet : None
gcsfs : None
matplotlib : 3.6.0
numexpr : 2.8.3
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : 9.0.0
pyxlsb : None
s3fs : None
scipy : 1.9.1
sqlalchemy : 1.4.41
tables : 3.6.1
tabulate : 0.9.0
xarray : 2022.9.0
xlrd : 2.0.1
xlwt : None
numba : 0.56.2

Comment From: phofl

Please use a newer pandas, e.g. 2.0

Comment From: zbs

Unfortunately, my organization has constraints. Do you know if there is any type of string caching in the read_csv codepath?

Comment From: lithomas1

I believe that we do intern strings in the read_csv code (at least for the C engine), though not by using Categoricals.

I would recommend reading this blog post by Wes (the original author of pandas) for more context: https://arrow.apache.org/blog/2019/02/05/python-string-memory-0.12/ (it is about Apache Arrow, but the section "Reducing pandas memory use when converting from Arrow" is relevant).
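To illustrate the general idea from that post (this is just a Python-level sketch of the de-duplication trick, not the parser's actual C implementation), repeated values reuse a single str object instead of allocating a new Python string for every row:

import numpy as np

def dedup_to_object_array(byte_values, encoding='utf-8'):
    # byte_values: e.g. the S10 array read via h5py in your example
    cache = {}
    out = np.empty(len(byte_values), dtype=object)
    for i, raw in enumerate(byte_values):
        s = cache.get(raw)
        if s is None:
            s = raw.decode(encoding)
            cache[raw] = s
        out[i] = s  # repeated values share one Python string object
    return out

# usage with your example above: foo['a'] = dedup_to_object_array(dataset)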

Comment From: zbs

Thank you so much! It’s nice to confirm my suspicions:

For many years, the pandas.read_csv function has relied on a trick to limit the amount of string memory allocated. Because pandas uses arrays of PyObject* pointers to refer to objects in the Python heap, we can avoid creating multiple strings with the same value, instead reusing existing objects and incrementing their reference counts

Shot in the dark, but is that type of object de-duplication available for a numpy array?

Comment From: lithomas1

Shot in the dark, but is that type of object de-duplication available for a numpy array?

Categorical is probably your best bet. In fact, there is an example for string data in the user guide; this link might be helpful: https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html#

In the event that you don't have duplicates, though, you would pay the extra price of storing the codes array (which identifies the category each element belongs to).
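Something along these lines (just a sketch; the file path and dataset name are placeholders) would let you compare the two for your data:

import h5py
import pandas as pd

with h5py.File('data.h5') as f:      # placeholder path
    raw = f['a'][:]                  # fixed-width bytes, e.g. S10

decoded = raw.astype(str)                    # numpy '<U10' array
as_object = pd.Series(decoded)               # object dtype, one Python str per row
as_category = as_object.astype('category')   # codes array + one str per distinct value

# deep=True counts the Python objects behind each column, not just the pointers;
# if nearly every value is unique, the categorical version can come out larger.
print(as_object.memory_usage(deep=True))
print(as_category.memory_usage(deep=True))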