I have code which ran fine about ~10 months ago and now fails due to running out of memory. The code basically does this:
groups = [[1, 2, 3], [222, 333, 444, 555], ...]  # from a 1.4MB json file
# about 30k such groups, results in ~70k replacements
replace_dict = {}
for replace_list in groups:
    replacement = sorted(replace_list)[0]
    for replace_id in replace_list:
        if replace_id == replacement:
            continue
        replace_dict[replace_id] = replacement

# len(_col) == 974757
_col = df[column].astype("int64")
_col.replace(replace_dict, inplace=True)
I've now split the replace_dict into 2k chunks and this works (takes about 20 seconds for each chunk).
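For reference, a self-contained, scaled-down sketch of the pattern above (toy group/frame contents; the real inputs are much larger):

import pandas as pd

groups = [[3, 1, 2], [555, 222, 444, 333]]                 # toy stand-in for the json file
df = pd.DataFrame({"author_id": [1, 2, 3, 222, 444, 7]})   # toy stand-in for the real frame

# map every ID in a group to the smallest ID of that group
replace_dict = {}
for replace_list in groups:
    replacement = sorted(replace_list)[0]
    for replace_id in replace_list:
        if replace_id != replacement:
            replace_dict[replace_id] = replacement

col = df["author_id"].astype("int64")
print(col.replace(replace_dict).tolist())   # [1, 1, 1, 222, 222, 7]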
Comment From: jreback
can you provide the file, plus the pandas/numpy versions it was better on and what you are trying now?
also, try w/o the in-place operation
Comment From: jankatins
The replacement file is no problem, but the dataframe is internal and about 400MB of csv unzipped :-(
Now I do it similarly to this:
groups = [[1, 2, 3], [222, 333, 444, 555], ...]  # from a 1.4MB json file
# about 30k such groups, results in ~70k replacements
count = 0
_col = df[column].astype("int64")
replace_dict = {}
for replace_list in groups:
    count += 1
    replacement = sorted(replace_list)[0]
    for replace_id in replace_list:
        if replace_id == replacement:
            continue
        replace_dict[replace_id] = replacement
    if count > 2000:
        _col = _col.replace(replace_dict)
        replace_dict = {}
        count = 0
_col = _col.replace(replace_dict)  # for the last chunk
This works (takes about 11min) and 4.7GB of RAM are in use.
pandas._np_version
Out[3]: '1.8.0'
pandas.__version__
Out[4]: '0.13.1-489-g7ffa655'
I don't remember the numpy/pandas versions from 10 months ago (the replacement file was generated on 17.05.2013, so it worked during that week) :-( It was probably whatever was on pandas master and the latest numpy release (I compile pandas from head and install numpy from Christoph Gohlke's installers).
Comment From: jreback
@cpcloud ?
Comment From: dsm054
Something seems strange here.
@JanSchulz: how long does _col.apply(lambda x: replace_dict.get(x, x)) take?
Comment From: jankatins
WTF:
replacements_by_XXX.json (16:35:42): groups: 31612 replacements: 77846 => 1.36s
replacements_by_XXXY.json (16:35:43): groups: 19576 replacements: 33845 => 1.03s
replacements_manual.json (16:35:44): groups: 16 replacements: 17 => 864ms
Wall time: 3.26 s
This took 20min with the above "2k chunks" code.
Comment From: jreback
@JanSchulz I don't understand your problem exactly, but instead of replacing, why wouldn't you use these as indexers?
e.g. say you are replacing strings with numbers; just construct a list (or series) of the strings, then do s.take(numbers) (or s.iloc)?
It will be orders of magnitude faster.
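One possible reading of the indexer suggestion (a sketch under my own assumptions, not necessarily what was meant): if the IDs are non-negative integers in a modest range, build a dense lookup array once and take from it, so each row costs a single array index instead of a dict-based replace:

import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 222, 444, 7], dtype="int64")
replace_dict = {2: 1, 3: 1, 444: 222, 555: 222}

# dense lookup table: position i holds the replacement for ID i
lookup = np.arange(s.max() + 1, dtype="int64")   # identity mapping by default
for old, new in replace_dict.items():
    if old < lookup.size:                        # ignore keys outside the ID range
        lookup[old] = new

result = pd.Series(lookup.take(s.values), index=s.index)
print(result.tolist())   # [1, 1, 1, 222, 222, 7]

This only pays off when the ID range is small; with 10-digit author IDs like the ones later in this thread, the lookup table itself would be far too large.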
Comment From: dsm054
@JanSchulz: are those the numbers you get when you use the apply/get combination?
@jreback: I think the issue is that the following doesn't feel right at all:
>>> to_replace = {i: 10**7+i for i in range(10**6)}
>>> series_length = 10**3
>>> col = pd.Series(np.random.randint(0, 10**6, series_length))
>>>
>>> %time slow = col.replace(to_replace)
CPU times: user 30 s, sys: 0 ns, total: 30 s
Wall time: 30 s
>>> %time fast = col.apply(lambda x: to_replace.get(x,x))
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 1.53 ms
>>>
>>> (slow == fast).all()
True
We should be getting O(n) values * O(1) lookup, making this very fast, and we're not. [Okay, I haven't varied the parameters enough to pick up the actual scaling -- could just be a really big constant, from what I've shown.]
Comment From: jreback
hmm.....seems doing lots of work inside....
if you do this:
df = DataFrame({ 0 : col })
df.replace({ 0 : to_replace })
much faster....
prob a bug
Comment From: dsm054
I don't actually trust {0: to_replace} at the moment; see the discussion in #6689.
Comment From: cpcloud
this is very weird ... looking now
Comment From: cpcloud
prun of @dsm054's example shows that BlockManager.replace_list is to blame ... figuring out the line ... lprun 1.0b3 breaks with newest ipython :cry:
Comment From: jankatins
@jreback each row in my data set has an "owner", but sometimes (actually quite a lot of times) one owner has received several IDs, and in that step I want to merge them into one. So in the final dataframe, the ID has to be set to the right (or at least to only one) value, so that I can df.groupby("id").
The shorter numbers are from the apply/get (2 seconds vs 20 mins from "chunked replace").
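In other words (my reading of the use case), the replacement canonicalizes each owner's several IDs to a single ID before grouping; a minimal sketch with made-up IDs:

import pandas as pd

df = pd.DataFrame({"id": [1111, 2222, 7777, 1111, 9999],
                   "value": [10, 20, 30, 40, 50]})
replace_dict = {2222: 1111}   # 2222 turned out to be the same owner as 1111

df["id"] = df["id"].apply(lambda x: replace_dict.get(x, x))
print(df.groupby("id")["value"].sum())   # 1111 -> 70, 7777 -> 30, 9999 -> 50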
Comment From: jankatins
Anyway: the initial problem is not so much the time it takes but that it now takes much more memory than last year with ~0.11.
Comment From: dsm054
I'd bet a shiny penny (now a collector's item in Canada) that the time and memory issues aren't independent.
Comment From: cpcloud
Time seems to be linear in the number of keys
if __name__ == '__main__':
    import time
    import numpy as np
    import matplotlib.pyplot as plt
    from pandas import Series

    start = 1
    stop = 10 ** 6 + 1
    step = 10 ** 5
    sizes = np.arange(start, stop, step)
    to_rep = {n: {i: 10 ** 5 + i for i in range(n)} for n in sizes}
    col = Series(np.random.randint(0, 10 ** 6, size=10 ** 3))
    times = Series(dtype=float, index=to_rep.keys())

    for k, v in to_rep.iteritems():  # Python 2 at the time; .items() on Python 3
        tic = time.time()
        col.replace(v)
        toc = time.time() - tic
        times[k] = toc

    fig, ax = plt.subplots()  # ax was undefined in the original snippet
    times.plot(ax=ax)
Comment From: dsm054
I think you can see the linearity in the code. Suppressing irrelevant detail, the structure looks like
for blk in self.blocks:
    for i, (s, d) in enumerate(zip(src_list, dest_list)):
        result = blk.replace(args)
and so it calls a replace (or a putmask on the non-object branch) for every key/value pair. It looks like we're trying to leverage numpy to speed things up. That might be good in the few-keys, long-array regime, but it's going to be beaten by a simple listcomp + dict lookup pretty quickly in others.
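For reference, the "listcomp + dict lookup" approach mentioned above might look like this sketch (assuming plain exact-match replacement on hashable values, no regex or NaN handling):

import numpy as np
import pandas as pd

col = pd.Series(np.random.randint(0, 10 ** 6, size=10 ** 3))
to_replace = {i: 10 ** 7 + i for i in range(10 ** 6)}

# one pass over the values with an O(1) dict lookup each; no per-key masks
replaced = pd.Series([to_replace.get(x, x) for x in col], index=col.index)

assert (replaced == col.apply(lambda x: to_replace.get(x, x))).all()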
Comment From: cpcloud
Yep ... I was just looking at that block. Can't really get too deep into it right now; I can look this evening.
Comment From: jreback
@cpcloud any luck with this?
Comment From: cpcloud
I haven't really looked at this. Getting the dtype issue from #6689 fixed is more important, I think ... I'm actually working on that right now.
Comment From: cpcloud
I've just done %memit between v0.11.0 and master and I can't see any kind of memory usage difference between these two commits.
Commits between the date provided by @JanSchulz and the v0.11.0 tag date (2013-04-22):
git log --oneline --no-merges --after={2013-04-22} --before={2013-05-17} --name-only
shows
f9eea30 RLS: Version 0.11
RELEASE.rst
doc/source/v0.11.0.txt
setup.py
f24b923 BUG: respect passed chunksize in read_csv when using get_chunk function. close #3406
RELEASE.rst
pandas/io/parsers.py
pandas/io/tests/test_parsers.py
d385d25 DOC: Adding parameters to frequencies, offsets (issue #2916)
pandas/tseries/frequencies.py
pandas/tseries/offsets.py
so it looks like nothing was touched regarding DataFrame.replace() (e.g., making it faster)
memit numero uno
v0.11.0
In [5]: pd.version.version
Out[5]: '0.11.0'
In [6]: paste
from pandas import Series
from numpy.random import randint
def rep_test():
    s = Series(randint(10 ** 6, size=1000))
    to_rep = {i: 10 ** 5 + i for i in range(10 ** 6)}
    return s.replace(to_rep)
## -- End pasted text --
In [7]: %load_ext memory_profiler
In [8]: %memit rep_test()
peak memory: 1511.13 MiB, increment: 1422.84 MiB
and number two
In [3]: pd.version.version
Out[4]: '0.13.1-594-ge84efe5'
In [4]: %memit rep_test()
peak memory: 1381.62 MiB, increment: 1287.60 MiB
Comment From: cpcloud
Wow, changing rep_test from this
def rep_test():
    s = Series(randint(10 ** 6, size=1e3))
    to_rep = {i: 10 ** 5 + i for i in range(10 ** 6)}
    return s.replace(to_rep)
to
def rep_test():
    s = Series(randint(10 ** 6, size=1e6))
    to_rep = {i: 10 ** 5 + i for i in range(7 * 10 ** 4)}
    return s.replace(to_rep)
crashes the interpreter. Back to the drawing board.
Comment From: cpcloud
ah i see the issue ... replace with dicts is now using replace_list ... before the ndarray refactor, Series had its own replace method that was different from the generic one. There is some nice stuff in the old Series replace method; I may move some of that code into the current replace.
Comment From: cpcloud
@JanSchulz can you provide a sample dataset that is a faithful "reproduction" of the one you're using? Meaning, enough info so that I can create a fake frame to test the issue with. I still get huge memory usage for replace even in v0.11.0, so I'm not sure where your issue lies. The issue is that when there's an overlap between keys and values, a mask is built up for each replacement value and stored in a dict, which is where the bulk of the memory usage is coming from. This is how it worked in v0.11.0 as well, so I'm not sure why you weren't seeing this before.
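Schematically, the per-key mask build-up described here can be sketched like this (a simplification, not the actual Block code; the row count matches the original post):

import numpy as np

values = np.random.randint(0, 10 ** 6, size=974_757)
to_replace = {i: 10 ** 5 + i for i in range(100)}   # scaled-down key set

# one boolean mask of len(values) bytes per key; this is what piles up
masks = {key: values == key for key in to_replace}
print(sum(m.nbytes for m in masks.values()) / 2 ** 20, "MiB")
# ~93 MiB for just 100 keys; ~70k keys would be roughly 68 GB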
Comment From: cpcloud
@JanSchulz also are you using the exact same dataset to test this as when you used this with pandas 0.11?
Comment From: jankatins
The sample is basically this:
record, authorid
1, 1111111111
2, 2222222222
3, 3333333333
4, 4444444444
5, 5555555555
6, 6666666666
7, 1111111111 # a second entry of author 1111111111
...[about 1 million such lines]...
The replacements would then replace 1111111111 with 2222222222, or 4444444444 and 5555555555 with 6666666666 ("merge two or more authors into one"), but leave some authorids alone. I have about 70k such replacements. The author IDs were int64.
The data is the same, only the mergelist was slightly different (found about 20 "manual" merges, the automatic merges were 70k).
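If it helps, here is a guess at a synthetic stand-in for that dataset (row count from the original post; the number of distinct authors and the ID range are assumptions):

import numpy as np
import pandas as pd

rng = np.random.RandomState(0)

n_rows = 974_757     # matches len(_col) from the original post
n_authors = 300_000  # assumption: distinct 10-digit author IDs
author_ids = rng.randint(10 ** 9, 10 ** 10, size=n_authors, dtype="int64")

df = pd.DataFrame({
    "record": np.arange(n_rows),
    "authorid": rng.choice(author_ids, size=n_rows),
})

# ~70k "merge this ID into that ID" pairs, as in the replacement file
dupes = rng.choice(author_ids, size=70_000, replace=False)
targets = rng.choice(author_ids, size=70_000)
replace_dict = dict(zip(dupes, targets))

# the problematic call (left commented out; it needs a huge amount of memory):
# df["authorid"] = df["authorid"].replace(replace_dict)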
Comment From: cpcloud
@JanSchulz can you try your replacement on pandas version 0.11.0 and report back to let me know if there's a difference? I'm having trouble isolating the memory issue without the exact dataset.
Also, it would be great if you could come up with a reproducible example that doesn't depend on your dataset. I can't reproduce the changes in memory usage (i.e., this has been a memory hog since at least v0.11.0) with the examples I've shown above. Without such an example, I'm essentially shooting in the dark.
Comment From: cpcloud
@jreback can i push to 0.15? i can't repro this
Comment From: chedatomasz
Has there been any progress on this? I feel like it could be a great improvement, as, to the best of my knowledge, there is currently no fast way to replace a DataFrame column based on a dict, and that is a fairly common use case.
Comment From: mmcdermott
Any update on this? It would be greatly appreciated.
Comment From: jreback
@mmcdermott this is a very old issue - check whether this still exists in master - if not investigations welcome
Comment From: mmcdermott
I found this issue because my code was OOMing when I tried to run df[col].replace(replace_dict), where replace_dict is a very large dictionary. As suggested by this thread, I replaced this code with df[col].apply(lambda x: replace_dict.get(x, x)) and it works fine, so the issue is still present.
Comment From: DCunnama
I can confirm this is still happening; I just ran into this.
Comment From: mmcdermott
The "apply" solution works fine for me to resolve this issue, but someone also recently pointed me to the map
function: https://pandas.pydata.org/docs/reference/api/pandas.Series.map.html which I've been told (but haven't tested myself) doesn't have the memory leak/footprint issue of replace with a large dict, in case it helps others.
Comment From: DCunnama
Thanks, the "apply" solution works for me too, I will give the map function a try too.
Comment From: jbrockmendel
Best guess based on having spent a lot of time in the relevant code recently (but not having done any profiling specific to this issue):
replace_dict.items() ends up getting passed to Block.replace_list. There we pre-compute (I think the pre-compute bit was an optimization more recent than this issue) a mask for each key in replace_dict. So we end up with len(replace_dict) ndarrays of block.values.size bytes each.
In the OP, replace_dict had about 70k entries and len(df) was 974757, so we're looking at about 68GB worth of ndarrays.
Option 1) avoid that pre-allocation, at a potential speed hit.
Option 2) when len(replace_dict) >> len(values), go down a different path that just iterates over values once, doing a dict lookup for each one.
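A quick back-of-the-envelope check of that estimate, using the sizes from the original post (one 1-byte boolean mask per dict key):

n_keys = 70_000        # approximate size of replace_dict
n_rows = 974_757       # len(_col) from the original post

print(n_keys * n_rows / 10 ** 9)   # ~68.2 GB of boolean masks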
Comment From: iKolt
This code works fine:
df['col'] = df['col'].map(replacing_dict).combine_first(df['col'])
or
df['col'] = df['col'].map(replacing_dict).fillna(df['col'])
Usage of fillna is faster than combine_first for small Series (2x speedup in some artificial cases I checked), but combine_first is more self-documenting for me, and the difference in speed fades once the Series is ~10K lines or more.
Comment From: lithomas1
Taking a look at this.