Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Let's say I have an HDF5 file and a CSV file that contain a single column/dataset of equivalent string data. When I read it in via HDF5:

```python
foo = pd.DataFrame()
dataset = h5py.File(file)[column][:]  # dtype = S10, length = 10 million
foo['a'] = dataset  # dtype is still S10, described in issue #52617
foo['a'] = foo['a'].astype(object)
```

At this point, the memory explodes to something colossal. By contrast, when I do

```python
foo = pd.read_csv(large_file)
```

the memory stays really low, as though it is interning the strings in the read_csv codepath, or, more likely, doing some internal optimization such as `column.astype('category').astype(object)` for all of the string columns.

Can you confirm whether this is true? If not, can you describe why there is a discrepancy?

Lastly, in the former case, can you recommend a similar optimization? I'm nervous about casting everything to categories, because in the off chance there is a tremendous number of unique strings, it could blow out the memory even worse.
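One way to quantify the discrepancy described above (a sketch only; `file`, `column`, and `large_file` are the same placeholders as in the example):

```python
import h5py
import pandas as pd

# HDF5 path: S10 bytes cast to object, as in the example above
foo = pd.DataFrame()
foo['a'] = h5py.File(file)[column][:]
foo['a'] = foo['a'].astype(object)
print(foo.memory_usage(deep=True))   # deep=True counts the per-element Python objects

# CSV path: read_csv builds the object column itself
bar = pd.read_csv(large_file)
print(bar.memory_usage(deep=True))
```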
Issue Description
See above
Expected Behavior
See above
Installed Versions
Comment From: phofl
Please use a newer pandas, e.g. 2.0
Comment From: zbs
Unfortunately, my organization has constraints. Do you know if there is any type of string caching in the read_csv codepath?
Comment From: lithomas1
I believe that we do intern strings in the read_csv code (at least for the C engine), not by using Categoricals though.
I would recommend reading this blog post by Wes (the original author of pandas) for more context https://arrow.apache.org/blog/2019/02/05/python-string-memory-0.12/ (It is about Apache Arrow, but the section "Reducing pandas memory use when converting from Arrow" is relevant)
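Roughly, the trick described in that post looks like the sketch below; `deduplicate` is a hypothetical helper, not pandas' actual C parser code. The idea is to keep a cache of values already seen and reuse the same Python object for repeats, so only distinct strings are allocated.

```python
import numpy as np

def deduplicate(byte_values):
    """Build an object array that reuses a single Python str per distinct value."""
    cache = {}
    out = np.empty(len(byte_values), dtype=object)
    for i, raw in enumerate(byte_values):
        s = raw.decode()                  # the S10 values from h5py are bytes
        out[i] = cache.setdefault(s, s)   # reuse the already-seen object if present
    return out

# e.g. foo['a'] = deduplicate(h5py.File(file)[column][:])
```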
Comment From: zbs
Thank you so much! It’s nice to confirm my suspicions:
> For many years, the pandas.read_csv function has relied on a trick to limit the amount of string memory allocated. Because pandas uses arrays of PyObject* pointers to refer to objects in the Python heap, we can avoid creating multiple strings with the same value, instead reusing existing objects and incrementing their reference counts.
Shot in the dark, but is that type of object de-duplication available for a numpy array?
Comment From: lithomas1
> Shot in the dark, but is that type of object de-duplication available for a numpy array?
Categorical is probably your best bet. In fact, there is an example in the user guide there for string data; this link might be helpful for you: https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html#
In the event that you don't have duplicates, though, you would pay the extra price of storing the codes array (which identifies the category each element belongs to).
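For what it's worth, a rough sketch of that trade-off (toy data, not from the issue):

```python
import pandas as pd

# Few unique values: Categorical stores each distinct string once plus a small codes array
repeated = pd.Series(['foo', 'bar', 'baz'] * 100_000, dtype=object)
print(repeated.memory_usage(deep=True))
print(repeated.astype('category').memory_usage(deep=True))

# Mostly unique values: the codes array is pure overhead on top of the strings themselves
unique = pd.Series([f'id-{i}' for i in range(100_000)], dtype=object)
print(unique.memory_usage(deep=True))
print(unique.astype('category').memory_usage(deep=True))
```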