Similar to an observation on reddit, I noticed that there is a huge performance difference between the default pandas setting pd.options.mode.chained_assignment = 'warn' and setting it to None.

Code Sample

import time
import pandas as pd
import numpy as np

def gen_data(N=10000):
    df = pd.DataFrame(index=range(N))
    for c in range(10):
        df[str(c)] = np.random.uniform(size=N)
    df["id"] = np.random.choice(range(500), size=len(df))
    return df

def do_something_on_df(df):
    """ Dummy computation that contains inplace mutations """
    for c in range(df.shape[1]):
        df[str(c)] = np.random.uniform(size=df.shape[0])
    return 42

def run_test(mode="warn"):
    pd.options.mode.chained_assignment = mode

    df = gen_data()

    t1 = time.time()
    for key, group_df in df.groupby("id"):
        do_something_on_df(group_df)
    t2 = time.time()
    print("Runtime: {:10.3f} sec".format(t2 - t1))

if __name__ == "__main__":
    run_test(mode="warn")
    run_test(mode=None)
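A common way to sidestep the warning (and its cost) in code like the above is to mutate an explicit copy of each group, so pandas no longer has to check for chained assignment at all. A minimal sketch of that pattern (the column names and sizes here are illustrative, not from the benchmark):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": np.random.uniform(size=100),
    "id": np.random.choice(range(5), size=100),
})

for key, group_df in df.groupby("id"):
    # Taking an explicit copy makes the intent clear and avoids
    # the SettingWithCopyWarning machinery entirely.
    group = group_df.copy()
    group["x"] = np.random.uniform(size=len(group))
```

The trade-off is one extra allocation per group, which is usually far cheaper than the warning's bookkeeping.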

Problem description

The run times vary a lot depending on whether the SettingWithCopyWarning is enabled or disabled. I tried with a few different Pandas/Python versions:

Debian VM, Python 3.6.2, pandas 0.21.0
Runtime:     46.693 sec
Runtime:      0.731 sec

Debian VM, Python 2.7.9, pandas 0.20.0
Runtime:    101.204 sec
Runtime:      0.622 sec

Ubuntu (host), Python 2.7.3, pandas 0.21.0
Runtime:     35.363 sec
Runtime:      0.517 sec

Ideally, there should not be such a big penalty for SettingWithCopyWarning.

From profiling results it looks like the reason might be this call to gc.collect.
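For anyone who doesn't want to disable the check globally, the setting can also be scoped to just the hot loop with pd.option_context, which restores the previous value on exit. A hedged sketch (the DataFrame here is a toy, not the benchmark data; the option name is unchanged in current pandas but may be removed once copy-on-write becomes the default):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(10), "id": np.repeat([0, 1], 5)})

# Disable the chained-assignment check only inside this block.
with pd.option_context("mode.chained_assignment", None):
    for key, group_df in df.groupby("id"):
        group_df["a"] = group_df["a"] * 2  # no warning, no gc.collect overhead
```

Outside the with block the option reverts to its previous value, so the rest of the program keeps the safety net.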

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.16.0-4-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.21.0
pytest: None
pip: 9.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.13.3
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Comment From: jreback

of course, this has to run the garbage collector. You can certainly just disable them. This won't be fixed in pandas 2.

Comment From: bluenote10

It would probably be helpful to document the performance impact more clearly. This can have subtle side effects that are very hard to find. I only noticed it because a Dask/Distributed computation was much slower than expected (use case documented on SO).

Comment From: jreback

and if u want to put up a PR would be happy to take it

Comment From: TomAugspurger

Has this been fixed in the meantime? Running the script from the original post, I see

../foo.py:15: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[str(c)] = np.random.uniform(size=df.shape[0])
Runtime:      0.749 sec
Runtime:      0.668 sec

I'm using pandas master, Python 3.7 on MacOS.

Comment From: TomAugspurger

https://github.com/pandas-dev/pandas/issues/27031 seems to be the most likely fix. Thanks Jeff. (https://github.com/pandas-dev/pandas/issues/27585 possibly helped, but that's less certain).

Comment From: jreback

this should be ok now @TomAugspurger is that not what u r seeing?

Comment From: TomAugspurger

I'm seeing that it's fixed now, just wanted to clarify since we had some Dask users reporting issues (but they're likely on an older pandas).