Pandas .loc[...] = value returns SettingWithCopyWarning

Code Sample

# My code
df.loc[0, 'column_name'] = 'foo bar'

Problem description

In pandas 0.20, this code raises a SettingWithCopyWarning and suggests:

"Try using .loc[row_indexer,col_indexer] = value instead".

I am already doing so, so this looks like a small bug. I use Jupyter. Thank you! :)

Output of `pd.show_versions()`

------------------
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Windows
OS-release: 8.1
machine: AMD64
processor: Intel64 Family 6 Model 61 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.1
pytest: 3.0.7
pip: 9.0.1
setuptools: 35.0.2
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

Comment From: TomAugspurger

@NadiaRom Can you provide a full example? It's hard to say for sure, but I suspect that df came from an operation that may be a view or copy. For example:

In [8]: df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [4, 5]})

In [9]: df1 = df[['A', 'B']]

In [10]: df1.loc[0, 'A'] = 5
/Users/taugspurger/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/indexing.py:180: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)
/Users/taugspurger/Envs/pandas-dev/bin/ipython:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  #!/Users/taugspurger/Envs/pandas-dev/bin/python3.6

So we're updating df1 correctly. The ambiguity is whether or not df will be updated as well. I think a similar thing is happening to you, but without a reproducible example it's hard to say for sure.
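
A minimal sketch of how one might check which frame actually changed in the example above. The frame and column names come from that example. The warnings handling is only there to keep the snippet quiet, and it is an assumption that your pandas version emits the warning at all (newer releases may not).

```python
import warnings

import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [4, 5]})
df1 = df[["A", "B"]]  # selecting with a list of columns yields new data

with warnings.catch_warnings():
    # Silence SettingWithCopyWarning, if this pandas version emits it.
    warnings.simplefilter("ignore")
    df1.loc[0, "A"] = 5

print(df1.loc[0, "A"])  # value in the subset
print(df.loc[0, "A"])   # value in the original frame
```

With this data the first print shows 5 and the second shows 1: the assignment reached df1 only.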

Comment From: NadiaRom

@TomAugspurger Here is the code. In general, I never assign values in pandas without .loc.

df = pd.read_csv('df_unicities.tsv', sep='\t')
df.replace({'|': '--'}, inplace=True)

df_c = df.loc[df.encountry == country, : ]

df_c['sort'] = (df_c.encities_ua == 'all').astype(int) # new column
df_c['sort'] += (df_c.encities_foreign == 'all').astype(int)
df_c.sort_values(by='sort', inplace=True)

# ---end of chunk, everything is fine ---

if df_c.encities_foreign.str.contains('all').sum() < len(df_c):
    df_c.loc[df_c.encities_foreign.str.contains('all'), 'encities_foreign'] = 'other'
    df_c.loc[df_c.cities_foreign.str.contains('всі'), 'cities_foreign'] = 'інші'
else:
    df_c.loc[df_c.encities_foreign.str.contains('all'), 'encities_foreign'] = country
    df_c.loc[df_c.cities_foreign.str.contains('всі'), 'cities_foreign'] = df_c.country.iloc[0]

if df_c.encities_ua.str.contains('all').sum() < len(df_c):
    df_c.loc[df_c.encities_ua.str.contains('all'), 'encities_ua'] = 'other'
    df_c.loc[df_c.cities_ua.str.contains('всі'), 'cities_ua'] = 'інші'
else:
    df_c.loc[df_c.encities_ua.str.contains('all'), 'encities_ua'] = 'Ukraine'
    df_c.loc[df_c.cities_ua.str.contains('всі'), 'cities_ua'] = 'Україна'

# Warning after it

Thank you for rapid answer!

Comment From: CRiddler

The issue here is that you're slicing your dataframe first with .loc in line 4, then attempting to assign values to that slice.

df_c = df.loc[df.encountry == country, :]

Pandas isn't 100% sure whether you want to assign values to just your df_c slice or have them propagate all the way back up to the original df. To avoid this, when you first assign df_c, tell pandas that it is its own dataframe (and not a slice) by using:

df_c = df.loc[df.encountry == country, :].copy()

Doing this will silence the warning. I'll tack on a brief example to help explain the above, since I've noticed a lot of users get confused by this aspect of pandas.

Example with made up data

>>> import pandas as pd
>>> df = pd.DataFrame({'A':[1,2,3,4,5], 'B':list('QQQCC')})
>>> df
   A  B
0  1  Q
1  2  Q
2  3  Q
3  4  C
4  5  C
>>> df.loc[df['B'] == 'Q', 'new_col'] = 'hello'
>>> df
   A  B new_col
0  1  Q   hello
1  2  Q   hello
2  3  Q   hello
3  4  C     NaN
4  5  C     NaN

So the above works as we expect! Now let's try an example that mirrors what you attempted to do with your data.

>>> df = pd.DataFrame({'A':[1,2,3,4,5], 'B':list('QQQCC')})
>>> df_q = df.loc[df['B'] == 'Q']
>>> df_q
   A  B
0  1  Q
1  2  Q
2  3  Q
>>> df_q.loc[df['A'] < 3, 'new_col'] = 'hello'
/Users/riddellcd/anaconda/lib/python3.6/site-packages/pandas/core/indexing.py:337: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[key] = _infer_fill_value(value)

>>> df_q
   A  B new_col
0  1  Q   hello
1  2  Q   hello
2  3  Q     NaN

Looks like we hit the same warning! But it changed df_q as we expected! This is because df_q is a slice of df, so even though we're using .loc[] on df_q, pandas is warning us that it won't propagate the changes up to df. To avoid this, we need to be explicit that df_q is its own dataframe, separate from df.

Let's start back from df_q, but use .copy() this time.

>>> df_q = df.loc[df['B'] == 'Q'].copy()
>>> df_q
   A  B
0  1  Q
1  2  Q
2  3  Q

Let's try to reassign our value now!
>>> df_q.loc[df['A'] < 3, 'new_col'] = 'hello'
>>> df_q
   A  B new_col
0  1  Q   hello
1  2  Q   hello
2  3  Q     NaN

This works without a warning because we've told pandas that df_q is separate from df.

If you do in fact want these changes to df_c to propagate up to df, that's another point entirely, and I'll answer it if you want.

Comment From: NadiaRom

@CRiddler Great, thank you! As you mentioned, chained .loc has never returned unexpected results. As I understand it, .copy() assures pandas that we treat the selected df_sliced_once as a separate object and do not intend to change the initial full df. Please correct me if I mixed something up.

Comment From: jreback

The documentation is here: http://pandas.pydata.org/pandas-docs/stable/indexing.html#returning-a-view-versus-a-copy and @CRiddler has a nice explanation. In general, you should NOT use inplace at all.
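
As a rough sketch of what avoiding inplace could look like for the code posted earlier in this thread: each step reassigns the result instead of mutating in place. The data and the `country` value are made up for illustration. Only the column names are borrowed from the earlier snippet.

```python
import pandas as pd

# Made-up data; the column names mirror the earlier snippet.
df = pd.DataFrame({
    "encountry": ["Poland", "Poland", "Italy"],
    "encities_ua": ["all", "Kyiv", "all"],
    "encities_foreign": ["all", "Warsaw", "Rome"],
})
country = "Poland"  # hypothetical filter value

# Reassign instead of df.replace(..., inplace=True)
df = df.replace({"|": "--"})

# Take an explicit copy of the filtered rows, then build the sort key
df_c = df.loc[df["encountry"] == country, :].copy()
df_c["sort"] = (df_c["encities_ua"] == "all").astype(int)
df_c["sort"] += (df_c["encities_foreign"] == "all").astype(int)

# Reassign instead of df_c.sort_values(..., inplace=True)
df_c = df_c.sort_values(by="sort")
print(df_c)
```

Reassigning makes it explicit which object each step produces, and the `.copy()` keeps later `.loc` assignments on df_c from being ambiguous.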

Comment From: persep

"If you do in fact want these changes to df_c to propagate up to df, that's another point entirely, and I'll answer it if you want."

@CRiddler Thanks, your answer is better than the ones on Stack Overflow. Could you add when you would want to propagate to the initial dataframe, or give an indication of how it is done?

Comment From: CRiddler

@persep In general I don't like turning issues into Stack Overflow threads for help, but it seems that this issue has gotten a fair bit of attention since my last post, so I'll go ahead and post my method of tackling this type of problem in pandas. I typically avoid subsetting the dataframe into separate variables. Instead, I turn masks into variables, combine the masks as needed, and set values based on those masks. This ensures the changes happen in the original dataframe, and not in some copy floating around.

Original data:

>>> import pandas as pd
>>> df = pd.DataFrame({'A':[1,2,3,4,5], 'B':list('QQQCC')})
>>> df
   A  B
0  1  Q
1  2  Q
2  3  Q
3  4  C
4  5  C

Remember that creating a temporary dataframe will NOT propagate changes.
As shown in the previous example, this makes changes only to df_q and raises a pandas warning (not copied/pasted here). It does NOT propagate any changes to df.

>>> df_q = df.loc[df["B"] == "Q"]
>>> df_q.loc[df["A"] < 3, "new_column"] = "hello"

# df remains unchanged because we only made changes to `df_q`
>>> df
   A  B
0  1  Q
1  2  Q
2  3  Q
3  4  C
4  5  C

To my knowledge, there is no way to use the same code as above and force changes to propagate back to the original dataframe.

However, if we change our thinking a bit and work with masks instead of full-on subsets we can achieve the desired result. While this isn't necessarily "propagating" changes to the original dataframe from a subset, we are ensuring that any changes we do make happen in the original dataframe df. To do this, we create masks first, then apply them when we want to make a change to that subset of df

>>> q_mask = df["B"] == "Q"
>>> a_mask = df["A"] < 3

# Combine masks (in this case we used "&") to achieve what a nested subset would look like
#  In the same step we add in our item assignment. Instructing pandas to create a new column in `df` and assign
#  the value "hello" to the rows in `df` where `q_mask` & `a_mask` overlap.
>>> df.loc[q_mask & a_mask, "new_col"] = "hello"

# Successful "propagation" of new values to the original dataframe
>>> df
   A  B new_col
0  1  Q   hello
1  2  Q   hello
2  3  Q     NaN
3  4  C     NaN
4  5  C     NaN

Lastly, if we ever wanted to see what df_q would look like we can always subset it from the original dataframe using our q_mask

>>> df.loc[q_mask, :]
   A  B new_col
0  1  Q   hello
1  2  Q   hello
2  3  Q     NaN

While this isn't necessarily "propagating" changes from df_q to df we achieve the same result. Actual propagation would need to be explicitly done and would be less efficient than just working with masks.
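
For completeness, a minimal sketch of what explicit propagation might look like, assuming df_q was created with .copy() and still shares its index labels with df. This is one possible approach, not a recommended pattern, and the edits made to df_q are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": list("QQQCC")})

# Work on an explicit copy of the subset.
df_q = df.loc[df["B"] == "Q"].copy()
df_q.loc[df_q["A"] < 3, "new_col"] = "hello"  # add a new column
df_q["A"] = df_q["A"] * 10                    # modify an existing column

# Write the results back by index label.
# A new column can be assigned directly; rows missing from df_q become NaN.
df["new_col"] = df_q["new_col"]
# An existing column is updated only on the rows that df_q covers.
df.loc[df_q.index, "A"] = df_q["A"]

print(df)
```

This needs an extra copy plus a write-back step, which is the overhead the mask-based approach avoids.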

Comment From: persep

@CRiddler Thanks, you've been very helpful

Comment From: linehammer

The first thing you should understand is that SettingWithCopyWarning is a warning, not an error. You can disable it with the following assignment, though doing so also hides the cases where an assignment really did go to a temporary copy.

pd.options.mode.chained_assignment = None

The real problem behind the warning is that it is generally difficult to predict whether a view or a copy is returned. When filtering pandas DataFrames, slicing or indexing a frame can return either a view or a copy. A "view" is a view of the original data, so modifying the view may modify the original data. A "copy" is a replication of the original data: changes made to the copy will not affect the original, and changes made to the original will not affect the copy.
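
To make the view/copy distinction concrete, here is a small sketch using a plain NumPy array rather than a DataFrame. This is a substitution for illustration only: pandas' own rules for when it returns a view or a copy are more involved and depend on the version, so this shows the concept, not pandas' exact behaviour.

```python
import numpy as np

a = np.arange(5)

# Basic slicing returns a view: it shares memory with `a`.
view = a[1:4]
view[0] = 100
print(a[1])                        # the original changed too
print(np.shares_memory(a, view))   # True

# Boolean-mask indexing returns a copy: it has its own memory.
copy = a[a > 2]
copy[0] = -1
print(a[1])                        # the original is unaffected
print(np.shares_memory(a, copy))   # False
```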

Comment From: ntjess

@CRiddler thanks for the detailed explanation. What happens if the original dataframe is out of scope? I.e.

def update_values(filtered):
  # Filtered is the result of a 'loc' call
  new_value = result_from_function_body()
  set_indexes = some_computation()
  filtered.loc[set_indexes, 'new_col'] = new_value

Does this mean there is no way for update_values to work? In this setup, a mask can't be used since we don't have access to the reference dataframe, right?
