In general almost all DataFrame and Series methods return new data and thus make a copy if needed (if there was no calculation / data didn't change). But some methods allow you to avoid making this copy with an explicit copy
keyword, which defaults to copy=True
, but which you can change to copy=False
manually to avoid the copy.
Example:
>>> df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
# by default a method returns a copy
>>> df2 = df.rename(columns=str.upper)
>>> df2.iloc[0, 0] = 100
>>> df
a b
0 1 3
1 2 4
# explicitly ask not to make a copy
>>> df3 = df.rename(columns=str.upper, copy=False)
>>> df3.iloc[0, 0] = 100
>>> df
a b
0 100 3
1 2 4
Now, if Copy-on-Write is enabled, the above behaviour shouldn't happen (because we are updating one dataframe (df
) through changing another dataframe (df3
)).
In this specific case of rename
, it actually already doesn't work anymore like that, and df
is not updated:
>>> pd.options.mode.copy_on_write = True
>>> df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
>>> df3 = df.rename(columns=str.upper, copy=False)
>>> df3.iloc[0, 0] = 100
>>> df
a b
0 1 3
1 2 4
This is because of how it is implemented under the hood in rename
, using result = self.copy(deep=copy)
, and so this always was already taking a shallow copy of the calling dataframe. With CoW enabled, a "shallow" copy doesn't exist anymore in the original meaning, but now essentially is a "lazy copy with CoW".
But for some other methods, this is actually not yet working
There are several issues/questions here:
- Are we OK with
copy=False
now actually meaning a "lazy" copy for all those methods?- I don't think there is any alternative with the current CoW semantics, but just to make this explicit and track this, because we 1) should document this (it's a breaking change) and potentially add future warnings for this at some point, and 2) ensure this behaviour is correctly happening for all methods that have a
copy
keyword.
- I don't think there is any alternative with the current CoW semantics, but just to make this explicit and track this, because we 1) should document this (it's a breaking change) and potentially add future warnings for this at some point, and 2) ensure this behaviour is correctly happening for all methods that have a
- The case of manually passing
copy=True
should still give an actual hard / "eager" copy?- Probably yes (if we keep the keyword, see 3) below), but we should also ensure to test this when CoW is enabled.
- If (in the future with CoW enabled) the default will now be to not return a copy, is it still worth it to keep the
copy
keyword?- Currently the default is
copy=True
, and so people will typically mostly use it explicitly to setcopy=False
. Butcopy=False
will become the default in the future, and so will not be needed anymore to specify explicitly. - People can still use
copy=True
in the future to ensure they get a "eager" copy (and not delay the copy / trigger a copy later on). But is that use case worth it to keep the keyword around? (they can always do.copy()
instead)
- Currently the default is
DataFrame/Series methods that have a copy
keyword (except for the constructors):
- [ ]
align
- [ ]
astype
- [ ]
infer_objects
- [ ]
merge
- [ ]
reindex
- [ ]
reindex_like
- [ ]
rename
- [ ]
rename_axis
- [ ]
set_axis
(only added in 1.5) - [ ]
set_flags
(default False) - [ ]
swapaxes
- [ ]
swaplevel
- [ ]
to_numy
(default False) - [ ]
to_timestamp
- [ ]
transpose
(default False) - [ ]
truncate
- [ ]
tz_convert
- [ ]
tz_localize
- [ ]
pd.concat
xref CoW overview issue: https://github.com/pandas-dev/pandas/issues/48998
Comment From: jbrockmendel
If (in the future with CoW enabled) the default will now be to not return a copy, is it still worth it to keep the copy keyword?
If we keep the no-CoW mode, then we should keep the copy keyword, and add it where relevant xref #48141. Ideally we'll end up with just one mode.
Are we OK with copy=False now actually meaning a "lazy" copy for all those methods?
I'm ambivalent here. The other reasonable alternative would be to for copy=False
to clear the refs (something like (result._mgr.refs = None
) which I can see being nice for advanced users, but confusing for everyone else.
People can still use copy=True in the future to ensure they get a "eager" copy (and not delay the copy / trigger a copy later on). But is that use case worth it to keep the keyword around? (they can always do .copy() instead)
Better to tell them to do .copy() instead.
Comment From: phofl
If we keep the no-CoW mode, then we should keep the copy keyword, and add it where relevant xref https://github.com/pandas-dev/pandas/issues/48141. Ideally we'll end up with just one mode.
Yeah agreed, I'd prefer one mode as well. Should make things significantly easier to test. Also agree with getting rid of the keyword in this case.
If we decide to do that, I would also prefer copy=True meaning a lazy copy as well, makes the transition easier when we remove it altogether
Comment From: jorisvandenbossche
Yes, I am also assuming that eventually we only have a single copy/view mode.
If we decide to do that, I would also prefer copy=True meaning a lazy copy as well, makes the transition easier when we remove it altogether
And so basically already start to ignore the copy
keyword altogether in those methods when CoW is enabled? That seems fine to me, given that the plan now is to remove that keyword later on (and users can do an explicit .copy()
if they really want).
Note: https://github.com/pandas-dev/pandas/pull/51464 is doing exactly that.