Hi team,
I have been having issues with pandas memory management. Specifically, there is a (at least for me) unavoidable memory peak that occurs when attempting to remove variables from a data set. It should be (almost) free! I am getting rid of part of the data, yet pandas still allocates a large amount of memory, producing MemoryErrors.
Just to give you a little bit of context, I am working with a DataFrame that contains 33M rows and 500 columns (a big one!), almost all of them numeric, on a machine with 360GB of RAM. The whole data set fits in memory and I can successfully apply some transformations to the variables. The problem comes when I need to drop 10% of the columns in the table: it produces a big memory peak that leads to a MemoryError. Before performing this operation, there are more than 80GB of memory available!
I tried to use the following methods for removing the columns, and all of them failed:
- drop(), with or without the inplace parameter
- pop()
- reindex()
- reindex_axis()
- del df[column] in a loop over the columns to be removed
- __delitem__(column) in a loop over the columns to be removed
- pop() and drop() in a loop over the columns to be removed

I also tried to reassign the columns, overwriting the data frame using indexing with loc() and iloc() (a small sketch of this reassignment follows below), but it does not help.
I found that the drop method with inplace is the most efficient one, but it still generates a huge peak.
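For reference, the reassignment attempt looked roughly like this (sizes and column names here are illustrative, not the real 33M x 500 data set):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 20),
                  columns=["VAR_%s" % i for i in range(20)])

cols_to_drop = list(df.columns[:2])
cols_to_keep = [c for c in df.columns if c not in cols_to_drop]

# Selecting the kept columns and rebinding the name still materialises a
# full copy of the retained data, so in my tests it shows the same memory
# peak as drop().
df = df.loc[:, cols_to_keep]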
I would like to discuss if there is there any way of implementing (or is it already implemented by any chance) a method for more efficiently removing variables without generating more memory consumption...
Thank you Iván
Comment From: gfyoung
xref #16529: This touches upon a larger question of whether we want to deprecate / remove the inplace parameter, which has been a point of contention in terms of the future of pandas.
@ivallesp : Do you by any chance have code / data that could be used to replicate this issue?
Comment From: ivallesp
@gfyoung Sure, find it attached. Just to make it clear, using the inplace parameter does not change anything in terms of memory usage. Can I help with something? Is there any idea of how to improve the drop function, or how to design a more efficient one? I would like to collaborate on this :D
I profiled this using the memory_profiler extension for Jupyter notebooks.
import pandas as pd
from sklearn.datasets import make_classification

N_FEATURES = 100
N_SAMPLES = 1000000

# Build a purely numeric DataFrame of shape (N_SAMPLES, N_FEATURES)
x = make_classification(n_samples=N_SAMPLES, n_features=N_FEATURES)[0]
df = pd.DataFrame(x, columns=["VAR_%s" % i for i in range(N_FEATURES)])

# Beginning of code to profile ------------------------------------
df.drop(df.columns[0:50], inplace=True, axis=1)
# End of code to profile -----------------------------------------
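For completeness, roughly the same measurement can be made outside the notebook with memory_profiler's Python API; this is just a sketch that reuses the df defined above (memory_usage samples resident memory while the callable runs):

from memory_profiler import memory_usage

def profile_drop():
    # Wrap the drop from the snippet above so memory_usage can sample
    # memory while it runs.
    df.drop(df.columns[0:50], inplace=True, axis=1)

samples = memory_usage((profile_drop, (), {}), interval=0.1)
print("peak memory during drop: %.1f MiB" % max(samples))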
Comment From: jreback
Just to make it clear, the usage of the inplace parameter does not change anything in terms of memory usage.
Where is it stated that this actually does anything w.r.t. memory usage? Virtually all inplace operations make a copy and then re-assign the data.
It may release the memory, depending on whether the underlying data was a view or a copy.
In [32]: df = pd.DataFrame(np.random.randn(100000, 10))
In [33]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 10 columns):
0 100000 non-null float64
1 100000 non-null float64
2 100000 non-null float64
3 100000 non-null float64
4 100000 non-null float64
5 100000 non-null float64
6 100000 non-null float64
7 100000 non-null float64
8 100000 non-null float64
9 100000 non-null float64
dtypes: float64(10)
memory usage: 7.6 MB
In [34]: df.drop([0, 1], axis=1, inplace=True)
In [35]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 8 columns):
2 100000 non-null float64
3 100000 non-null float64
4 100000 non-null float64
5 100000 non-null float64
6 100000 non-null float64
7 100000 non-null float64
8 100000 non-null float64
9 100000 non-null float64
dtypes: float64(8)
memory usage: 6.1 MB
You are much more likely, though, to release memory if you use the more idiomatic form:
df = df.drop(..., axis=1)
This removes the top-level reference to the original frame. Note that none of this will actually trigger garbage collection (and nothing will release the memory back to the OS).
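A minimal sketch of that pattern, using a small frame like the one above:

import gc
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100000, 10))

# Rebinding the name drops the top-level reference to the original frame.
df = df.drop([0, 1], axis=1)

# The old blocks only become collectable once nothing references them; even
# after an explicit collection the allocator may not hand pages back to the OS.
gc.collect()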
Comment From: ivallesp
I know the inplace parameter does not help avoid the memory increase; I just measured it! Although the inplace name does suggest that no copy is made.
Anyway, that was not the topic of this conversation. Closing the issue does not help solve it; it just sweeps the dirt under the rug... It would be better to read my main message. The problem is that there is no way of deleting variables in a big DataFrame without generating a huge memory peak, and this is a big problem, guys.
In addition, regarding your comment @jreback, I do not have problems releasing memory; I have a highly unexpected memory peak.
Best, Iván
Comment From: jreback
This is not going to be solved in pandas 1. Data of a single dtype is stored in a single block; creating a view on that block does not release the memory (and that is what you are doing). You can do this:
df = ...
df2 = df.drop(..., axis=1)
del df
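A concrete version of that pattern, with illustrative column labels:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100000, 10),
                  columns=["VAR_%s" % i for i in range(10)])

# Build the reduced frame under a new name, then delete the old binding so
# the original blocks can be garbage collected.
df2 = df.drop(["VAR_0", "VAR_1"], axis=1)
del df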
Comment From: alvarouc
Is there any update on this issue? So far, two contradictory solutions have been proposed:
You are much more likely, though, to release memory if you use the more idiomatic form:
df = df.drop(..., axis=1)
This removes the top-level reference to the original frame. Note that none of this will actually trigger garbage collection (and nothing will release the memory back to the OS).
and
You can do this:
df = ...
df2 = df.drop(..., axis=1)
del df
What is the best way to delete a column without running out of memory?
Comment From: giangdaotr
We encountered the same issue, and just to reiterate: the problem is the huge memory peak during the drop, which leads to a MemoryError, NOT a problem with memory release.
Comment From: ianozsvald
@giangdaotr I've made a demo to show the cost of using del df[col] vs df.drop(...); the del solution in my example is indeed very expensive. I wonder if the block manager is duplicating RAM under certain conditions (which @jreback notes above). Demo here: https://github.com/ianozsvald/ipython_memory_usage/blob/master/src/ipython_memory_usage/examples/example_usage_np_pd.ipynb (see In[16] onwards).
Personally I'm keen to know more, because reasoning about memory usage in pandas (and when/if you get a view or a copy) is pretty tricky. I'm using my ipython_memory_usage tool to try to build up some demos. I'm happy to collect use cases here: https://github.com/ianozsvald/ipython_memory_usage/issues/30
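For reference, a stripped-down sketch of the del-vs-drop comparison (much smaller than in the linked notebook, and using memory_profiler rather than ipython_memory_usage):

import numpy as np
import pandas as pd
from memory_profiler import memory_usage

def make_frame():
    return pd.DataFrame(np.random.randn(1000000, 100))

def del_in_loop():
    df = make_frame()
    for col in list(df.columns[:50]):
        del df[col]                        # remove columns one at a time

def drop_once():
    df = make_frame()
    df = df.drop(df.columns[:50], axis=1)  # remove them in a single call

for fn in (del_in_loop, drop_once):
    peak = max(memory_usage((fn, (), {}), interval=0.1))
    print("%s peak: %.1f MiB" % (fn.__name__, peak))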