Similar to an observation on reddit, I noticed that there is a huge performance difference between the pandas default pd.options.mode.chained_assignment = 'warn' and setting it to None.
Code Sample
import time
import pandas as pd
import numpy as np

def gen_data(N=10000):
    df = pd.DataFrame(index=range(N))
    for c in range(10):
        df[str(c)] = np.random.uniform(size=N)
    df["id"] = np.random.choice(range(500), size=len(df))
    return df

def do_something_on_df(df):
    """ Dummy computation that contains inplace mutations """
    for c in range(df.shape[1]):
        df[str(c)] = np.random.uniform(size=df.shape[0])
    return 42

def run_test(mode="warn"):
    pd.options.mode.chained_assignment = mode
    df = gen_data()
    t1 = time.time()
    for key, group_df in df.groupby("id"):
        do_something_on_df(group_df)
    t2 = time.time()
    print("Runtime: {:10.3f} sec".format(t2 - t1))

if __name__ == "__main__":
    run_test(mode="warn")
    run_test(mode=None)
Problem description
The run times vary a lot depending on whether the SettingWithCopyWarning
is enabled or disabled. I tried with a few different Pandas/Python versions; in each case the first runtime is for mode="warn" and the second for mode=None:
Debian VM, Python 3.6.2, pandas 0.21.0
Runtime: 46.693 sec
Runtime: 0.731 sec
Debian VM, Python 2.7.9, pandas 0.20.0
Runtime: 101.204 sec
Runtime: 0.622 sec
Ubuntu (host), Python 2.7.3, pandas 0.21.0
Runtime: 35.363 sec
Runtime: 0.517 sec
Ideally, there should not be such a big penalty for SettingWithCopyWarning.
From profiling results, it looks like the reason might be this call to gc.collect.
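To illustrate why a gc.collect call inside a hot path could account for this kind of slowdown, here is a minimal standard-library sketch. It does not reproduce the pandas code path; the helper name time_loop, the object count, and the iteration count are assumptions chosen only for illustration. It compares a loop that forces a full garbage collection on every iteration with one that does not, while many container objects are alive.

```python
import gc
import time


def time_loop(n_iter, collect):
    """Time n_iter iterations, optionally forcing a full GC pass in each one."""
    start = time.perf_counter()
    for _ in range(n_iter):
        if collect:
            gc.collect()  # full collection: examines every tracked container object
    return time.perf_counter() - start


# Keep many tracked objects alive so each full collection has real work to do
# (hypothetical size, picked so the script finishes quickly).
live_objects = [[i] for i in range(100_000)]

without_gc = time_loop(50, collect=False)
with_gc = time_loop(50, collect=True)

print("without gc.collect: {:10.6f} sec".format(without_gc))
print("with gc.collect:    {:10.6f} sec".format(with_gc))
```

The cost of each forced collection grows with the number of live objects the collector tracks, so a per-assignment collection gets more expensive the larger the surrounding program is.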
Output of pd.show_versions()
Comment From: jreback
Of course, this has to run the garbage collector. You can certainly just disable the warning. This won't be fixed in pandas 2.
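One way to disable the check only around a specific piece of code, rather than globally, might look like the sketch below. It assumes a pandas version that still registers the mode.chained_assignment option; the helper name run_quietly is hypothetical and not part of pandas.

```python
import pandas as pd


def run_quietly(func, *args, **kwargs):
    """Call func with the chained-assignment check off, restoring the option afterwards."""
    # option_context resets the option to its previous value on exit,
    # even if func raises.
    with pd.option_context("mode.chained_assignment", None):
        return func(*args, **kwargs)


before = pd.get_option("mode.chained_assignment")
inside = run_quietly(pd.get_option, "mode.chained_assignment")
after = pd.get_option("mode.chained_assignment")

print("before:", before)
print("inside:", inside)
print("after: ", after)
```

In the script from the original post, this would correspond to wrapping the call as run_quietly(do_something_on_df, group_df), so the setting does not leak into unrelated code.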
Comment From: bluenote10
It would probably be helpful to document the performance impact more clearly. This can have subtle side effects, which are very hard to find. I only noticed it because a Dask/Distributed computation was much slower than expected (use case documented on SO).
Comment From: jreback
And if you want to put up a PR, we would be happy to take it.
Comment From: TomAugspurger
Has this been fixed in the meantime? Running the script from the original post, I see
../foo.py:15: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df[str(c)] = np.random.uniform(size=df.shape[0])
Runtime: 0.749 sec
Runtime: 0.668 sec
I'm using pandas master, Python 3.7 on macOS.
Comment From: TomAugspurger
https://github.com/pandas-dev/pandas/issues/27031 seems to be the most likely fix. Thanks Jeff. (https://github.com/pandas-dev/pandas/issues/27585 possibly helped, but that's less certain).
Comment From: jreback
This should be OK now, @TomAugspurger. Is that not what you are seeing?
Comment From: TomAugspurger
I'm seeing that it's fixed now, just wanted to clarify since we had some Dask users reporting issues (but they're likely on an older pandas).