Many operations in pandas trigger the population of a hash table from the underlying index values. This can be very costly in memory, especially for very long Series / DataFrames.

Example:

import numpy as np
import pandas as pd

s = pd.Series(np.random.randn(10000000))
s[100]  # label lookup; populates the index engine's hash table

Note that the first time you index into s, a hash table in s.index._engine is populated. Since the index holds 10 million 8-byte values, the keys array alone accounts for roughly 80MB, so the table likely consumes even more than that once the values array and load-factor overhead are included.
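
For reference, the populated table can already be dropped by hand via the same private hook the patch below uses (private API, shown for illustration only):

    s.index._cleanup()  # clears the engine's populated hash table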

We can have a whole discussion about the hash table infrastructure and how it could be improved (I may spend some time on this myself soon). For the time being, one solution would be to have a weakref dictionary somewhere (or something of that nature) that lets us globally destroy all hash tables.
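
A minimal sketch of that idea, assuming engines are weak-referenceable and expose something like the internal clear_mapping() method that Index._cleanup() invokes (both hooks here are assumptions for illustration):

    import weakref

    # Hypothetical module-level weak registry of index engines. A WeakSet
    # holds only weak references, so it never keeps an engine alive by
    # itself; entries disappear automatically when engines are collected.
    _live_engines = weakref.WeakSet()

    def register_engine(engine):
        # Assumed to be called from the engine's constructor.
        _live_engines.add(engine)

    def destroy_all_hashtables():
        # Globally drop every populated hash table without destroying the
        # engines themselves; clear_mapping() is assumed to clear the
        # engine's populated mapping, as in Index._cleanup().
        for engine in list(_live_engines):
            engine.clear_mapping()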

Separately, the hash table scheme can be modified to have a much smaller memory footprint than it does now -- keys can be integers stored as uint32_t, resulting in roughly 4 bytes per index value. Right now there are two arrays: one for hash keys, another for hash values.
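
A back-of-envelope sketch of the compact scheme (not pandas' actual implementation): a single uint32 array stores positions into the original index values, so the table adds roughly 4 bytes per slot, and the index values themselves serve as the keys. Assumes fewer than 2**32 - 1 rows.

    import numpy as np

    EMPTY = np.uint32(0xFFFFFFFF)  # sentinel: slot unused

    class CompactPositionTable(object):
        def __init__(self, values):
            self.values = np.asarray(values)
            n = len(self.values)
            size = 1
            while size < 2 * n:          # keep load factor below 0.5
                size *= 2
            self.mask = size - 1
            # One uint32 array (~4 bytes/slot) instead of two int64 arrays.
            self.slots = np.full(size, EMPTY, dtype=np.uint32)
            for pos, v in enumerate(self.values):
                i = hash(v) & self.mask
                while self.slots[i] != EMPTY:   # linear probing
                    i = (i + 1) & self.mask
                self.slots[i] = pos

        def get_loc(self, key):
            i = hash(key) & self.mask
            while self.slots[i] != EMPTY:
                pos = int(self.slots[i])
                if self.values[pos] == key:     # keys live in self.values
                    return pos
                i = (i + 1) & self.mask
            raise KeyError(key)

A lookup re-hashes the key and compares against the original values array, so the only per-entry storage the table itself adds is the 4-byte position.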

Motivated by: http://stackoverflow.com/questions/18070520/pandas-memory-usage-when-reindexing

Comment From: wesm

Separately, the memory usage of MultiIndex in reindexing operations is totally unacceptable: the current path materializes an array of Python tuples and then builds a hash table from it, which is super inefficient. More work to do there too. cc @njsmith
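
To make the cost concrete, here is a small illustration using real pandas API (the one-million-row size is arbitrary):

    import sys
    import pandas as pd

    mi = pd.MultiIndex.from_product([range(1000), range(1000)])

    # The levels are stored compactly as integer codes per level, but
    # .values boxes every row into a Python tuple on an object ndarray.
    tuples = mi.values
    print(sys.getsizeof(tuples[0]))  # dozens of bytes per 2-tuple, before
                                     # counting the boxed scalars inside it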

Comment From: jreback

So the garbage collector DOES eventually reclaim the memory, but if we add a __del__ it is reclaimed 'faster'.

Current master

In [1]: import gc

In [2]: def f():
   ...:     s = Series(randn(100000))
   ...:     s[100]
   ...:     

In [3]: %memit -r 10 f()
maximum of 10: 76.613281 MB per loop

In [4]: %memit -r 10 f()
maximum of 10: 105.117188 MB per loop

In [5]: %memit -r 10 f()
maximum of 10: 132.832031 MB per loop

In [6]: gc.collect()
Out[6]: 260

In [7]: %memit -r 10 f()
maximum of 10: 76.640625 MB per loop

In [8]: %memit -r 10 f()
maximum of 10: 104.425781 MB per loop

In [9]: %memit -r 10 f()
maximum of 10: 132.210938 MB per loop

Adding to core/series.py

    def __del__(self):
        # Eagerly break the Series -> Index reference and clear the
        # engine's hash table instead of waiting for the cyclic gc.
        if self._index is not None:
            self._index._cleanup()
            self._index = None

In [1]: import gc

In [2]: def f():
   ...:     s = Series(randn(100000))
   ...:     s[100]
   ...:     

In [3]: %memit -r 10 f()
maximum of 10: 58.546875 MB per loop

In [4]: %memit -r 10 f()
maximum of 10: 66.179688 MB per loop

In [5]: %memit -r 10 f()
maximum of 10: 74.593750 MB per loop

In [6]: gc.collect()
Out[6]: 260

In [7]: %memit -r 10 f()
maximum of 10: 59.378906 MB per loop

In [8]: %memit -r 10 f()
maximum of 10: 67.007812 MB per loop

In [9]: %memit -r 10 f()
maximum of 10: 74.648438 MB per loop

Comment From: njsmith

If you want memory to be released more promptly, then your best bet is to get rid of the reference cycle, not to add a __del__ method. You clearly have a reference cycle somewhere, since that's the only situation in which memory is *not* freed immediately, and also the only situation in which the gc gets involved at all. But if you have a __del__ method on an object in a cycle, then the cycle can't be freed at all (!), so __del__ is a risky tool to be using to manage cycles.
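
A minimal demonstration of the hazard as it stood at the time (under Python 2's collector; since Python 3.4 / PEP 442 such cycles are collectable again):

    import gc

    class Node(object):
        def __del__(self):
            pass

    a = Node()
    b = Node()
    a.other = b
    b.other = a        # a <-> b form a reference cycle
    del a, b

    gc.collect()
    print(gc.garbage)  # on Python 2, both Nodes are stranded here for good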

Comment From: wesm

Tabled for pandas 2.0