Code Sample
# My code
df.loc[0, 'column_name'] = 'foo bar'
Problem description
This code in pandas 0.20.3 raises SettingWithCopyWarning and suggests:
"Try using .loc[row_indexer,col_indexer] = value instead".
I am already doing so, so this looks like a small bug. I use Jupyter. Thank you! :)
Comment From: TomAugspurger
@NadiaRom Can you provide a full example? It's hard to say for sure, but I suspect that df came from an operation that may return a view or a copy. For example:
In [8]: df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [4, 5]})
In [9]: df1 = df[['A', 'B']]
In [10]: df1.loc[0, 'A'] = 5
/Users/taugspurger/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/indexing.py:180: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._setitem_with_indexer(indexer, value)
/Users/taugspurger/Envs/pandas-dev/bin/ipython:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
#!/Users/taugspurger/Envs/pandas-dev/bin/python3.6
So we're updating df1 correctly. The ambiguity is whether or not df will be updated as well. I think a similar thing is happening to you, but without a reproducible example it's hard to say for sure.
Comment From: NadiaRom
@TomAugspurger Here is the code. In general, I never assign values in pandas without .loc:
df = pd.read_csv('df_unicities.tsv', sep='\t')
df.replace({'|': '--'}, inplace=True)
df_c = df.loc[df.encountry == country, : ]
df_c['sort'] = (df_c.encities_ua == 'all').astype(int) # new column
df_c['sort'] += (df_c.encities_foreign == 'all').astype(int)
df_c.sort_values(by='sort', inplace=True)
# ---end of chunk, everything is fine ---
if df_c.encities_foreign.str.contains('all').sum() < len(df_c):
    df_c.loc[df_c.encities_foreign.str.contains('all'), 'encities_foreign'] = 'other'
    df_c.loc[df_c.cities_foreign.str.contains('всі'), 'cities_foreign'] = 'інші'
else:
    df_c.loc[df_c.encities_foreign.str.contains('all'), 'encities_foreign'] = country
    df_c.loc[df_c.cities_foreign.str.contains('всі'), 'cities_foreign'] = df_c.country.iloc[0]
if df_c.encities_ua.str.contains('all').sum() < len(df_c):
    df_c.loc[df_c.encities_ua.str.contains('all'), 'encities_ua'] = 'other'
    df_c.loc[df_c.cities_ua.str.contains('всі'), 'cities_ua'] = 'інші'
else:
    df_c.loc[df_c.encities_ua.str.contains('all'), 'encities_ua'] = 'Ukraine'
    df_c.loc[df_c.cities_ua.str.contains('всі'), 'cities_ua'] = 'Україна'
# Warning after it
Thank you for the rapid answer!
Comment From: CRiddler
The issue here is that you're slicing your dataframe first with .loc in line 4, then attempting to assign values to that slice.
df_c = df.loc[df.encountry == country, :]
Pandas isn't 100% sure if you want to assign values to just your df_c slice, or have it propagate all the way back up to the original df. To avoid this, when you first assign df_c make sure you tell pandas that it is its own data frame (and not a slice) by using
df_c = df.loc[df.encountry == country, :].copy()
Doing this will get rid of the warning. I'll tack on a brief example to help explain the above, since I've noticed a lot of users get confused by this aspect of pandas.
Example with made up data
>>> import pandas as pd
>>> df = pd.DataFrame({'A':[1,2,3,4,5], 'B':list('QQQCC')})
>>> df
A B
0 1 Q
1 2 Q
2 3 Q
3 4 C
4 5 C
>>> df.loc[df['B'] == 'Q', 'new_col'] = 'hello'
>>> df
A B new_col
0 1 Q hello
1 2 Q hello
2 3 Q hello
3 4 C NaN
4 5 C NaN
So the above works as we expect! Now let's try an example that mirrors what you attempted to do with your data.
>>> df = pd.DataFrame({'A':[1,2,3,4,5], 'B':list('QQQCC')})
>>> df_q = df.loc[df['B'] == 'Q']
>>> df_q
A B
0 1 Q
1 2 Q
2 3 Q
>>> df_q.loc[df['A'] < 3, 'new_col'] = 'hello'
/Users/riddellcd/anaconda/lib/python3.6/site-packages/pandas/core/indexing.py:337: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self.obj[key] = _infer_fill_value(value)
>>> df_q
A B new_col
0 1 Q hello
1 2 Q hello
2 3 Q NaN
Looks like we hit the same warning! But it changed df_q as we expected! This is because df_q is a slice of df, so even though we're using .loc[] on df_q, pandas is warning us that it won't propagate the changes up to df. To avoid this, we need to be more explicit and say that df_q is its own dataframe, separate from df, by explicitly declaring it so.
Let's start back from df_q but use .copy() this time.
>>> df_q = df.loc[df['B'] == 'Q'].copy()
>>> df_q
A B
0 1 Q
1 2 Q
2 3 Q
Let's try to reassign our value now!
>>> df_q.loc[df['A'] < 3, 'new_col'] = 'hello'
>>> df_q
A B new_col
0 1 Q hello
1 2 Q hello
2 3 Q NaN
This works without a warning because we've told pandas that df_q is separate from df.
If you in fact do want these changes to df_c to propagate up to df, that's another point entirely and I will answer if you want.
Comment From: NadiaRom
@CRiddler Great, thank you!
As you mentioned, chained .loc has never returned unexpected results. As I understand it, .copy() assures pandas that we treat the selected df_sliced_once as a separate object and do not intend to change the initial full df. Please correct me if I mixed something up.
Comment From: jreback
The documentation is here: http://pandas.pydata.org/pandas-docs/stable/indexing.html#returning-a-view-versus-a-copy and @CRiddler has a nice explanation. You should in general NOT use inplace at all.
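As a rough sketch of what avoiding inplace could look like for the snippet earlier in this thread: the frame below is made-up data standing in for the TSV file (only the column names encountry and encities_ua are taken from the original code), and the idea is simply to reassign the result instead of passing inplace=True.

```python
import pandas as pd

# Made-up data standing in for 'df_unicities.tsv' (an assumption for illustration).
df = pd.DataFrame({
    "encountry": ["Ukraine", "Poland", "Ukraine"],
    "encities_ua": ["all", "Kyiv", "Lviv"],
})

# Take an explicit copy of the filtered rows so later assignments are unambiguous.
df_c = df.loc[df["encountry"] == "Ukraine"].copy()
df_c["sort"] = (df_c["encities_ua"] == "all").astype(int)

# Reassign the sorted result instead of calling sort_values(..., inplace=True).
df_c = df_c.sort_values(by="sort")

print(df_c)
```

The original df is left untouched here; only df_c gains the sort column.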
Comment From: persep
"If you in fact do want these changes to df_c to propagate up to df thats another point entirely and will answer if you want."
@CRiddler Thanks, your answer is better than the ones on Stack Overflow. Could you add when you would want to propagate to the initial dataframe, or give an indication of how it is done?
Comment From: CRiddler
@persep In general I don't like turning issues into Stack Overflow threads for help, but it seems that this issue has gotten a fair bit of attention since my last post, so I'll go ahead and post my method of tackling this type of problem in pandas. I typically do this by not subsetting the dataframe into separate variables. Instead I turn masks into variables, then combine masks as needed and set values based on those masks to ensure the changes happen in the original dataframe, and not in some copy floating around.
Original data:
>>> import pandas as pd
>>> df = pd.DataFrame({'A':[1,2,3,4,5], 'B':list('QQQCC')})
>>> df
A B
0 1 Q
1 2 Q
2 3 Q
3 4 C
4 5 C
Remember that creating a temporary dataframe will NOT propagate changes
As shown in the previous example, this makes changes only to df_q and raises a pandas warning (not copied/pasted here). It does NOT propagate any changes to df.
>>> df_q = df.loc[df["B"] == "Q"]
>>> df_q.loc[df["A"] < 3, "new_column"] = "hello"
# df remains unchanged because we only made changes to `df_q`
>>> df
A B
0 1 Q
1 2 Q
2 3 Q
3 4 C
4 5 C
To my knowledge, there is no way to use the same code as above and force changes to propagate back to the original dataframe.
However, if we change our thinking a bit and work with masks instead of full-on subsets, we can achieve the desired result. While this isn't necessarily "propagating" changes to the original dataframe from a subset, we are ensuring that any changes we do make happen in the original dataframe df. To do this, we create masks first, then apply them when we want to make a change to that subset of df.
>>> q_mask = df["B"] == "Q"
>>> a_mask = df["A"] < 3
# Combine masks (in this case we used "&") to achieve what a nested subset would look like
# In the same step we add in our item assignment. Instructing pandas to create a new column in `df` and assign
# the value "hello" to the rows in `df` where `q_mask` & `a_mask` overlap.
>>> df.loc[q_mask & a_mask, "new_col"] = "hello"
# Successful "propagation" of new values to the original dataframe
>>> df
A B new_col
0 1 Q hello
1 2 Q hello
2 3 Q NaN
3 4 C NaN
4 5 C NaN
Lastly, if we ever wanted to see what df_q would look like we can always subset it from the original dataframe using our q_mask
>>> df.loc[q_mask, :]
A B new_col
0 1 Q hello
1 2 Q hello
2 3 Q NaN
While this isn't necessarily "propagating" changes from df_q to df, we achieve the same result. Actual propagation would need to be explicitly done and would be less efficient than just working with masks.
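For completeness, here is one way the explicit propagation mentioned above might look. This is only a sketch using the same made-up data, and it assumes the row labels of df are unique, so that the index of the subset identifies the matching rows in the original.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": list("QQQCC")})

# Work on an independent copy of the subset.
df_q = df.loc[df["B"] == "Q"].copy()
df_q.loc[df_q["A"] < 3, "new_col"] = "hello"

# Explicit write-back: the subset kept the original row labels,
# so its index selects the matching rows in df.
df.loc[df_q.index, "new_col"] = df_q["new_col"]

print(df)
```

This takes an extra copy and an extra assignment compared to the mask approach, which is why masks are usually the simpler choice.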
Comment From: persep
@CRiddler Thanks, you've been very helpful
Comment From: linehammer
The first thing you should understand is that SettingWithCopyWarning is a warning, and not an error. You can safely disable this warning with the following assignment.
pd.options.mode.chained_assignment = None
The real problem behind the warning is that it is generally difficult to predict whether a view or a copy is returned. When filtering pandas DataFrames, it is possible to slice/index a frame and get back either a view or a copy. A "view" is a view of the original data, so modifying the view may modify the original data. A "copy" is a replication of data from the original: any changes made to the copy will not affect the original data, and any changes made to the original data will not affect the copy.
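The "copy" half of that description is easy to check with an explicit .copy(). A small sketch with made-up data (the variable names original and independent are just for illustration):

```python
import pandas as pd

original = pd.DataFrame({"A": [1, 2, 3]})
independent = original.copy()  # explicit, independent copy

# Changes on either side do not leak to the other.
independent.loc[0, "A"] = 100
original.loc[1, "A"] = -1

print(original["A"].tolist())
print(independent["A"].tolist())
```

Whether a plain slice behaves like a view or like a copy is the part that is hard to predict, which is why making the copy explicit removes the ambiguity.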
Comment From: ntjess
@CRiddler thanks for the detailed explanation. What happens if the original dataframe is out of scope? I.e.
def update_values(filtered):
    # Filtered is the result of a 'loc' call
    new_value = result_from_function_body()
    set_indexes = some_computation()
    filtered.loc[set_indexes, 'new_col'] = new_value
Does this mean there is no way for update_values to work? In this setup, a mask can't be used since we don't have access to the reference dataframe, right?