Pandas API / CoW: copy/view behaviour when constructing DataFrame/Series from a numpy array

Context: with the Copy-on-Write implementation (see https://github.com/pandas-dev/pandas/issues/36195 / detailed proposal doc, and overview follow up issue https://github.com/pandas-dev/pandas/issues/48998), we generally want to avoid that updating (setting) values in one DataFrame/Series also changes another DataFrame/Series. When constructing a DataFrame/Series from existing DataFrame/Series objects this can be handled correctly (the newly constructed object keeps track of the original ones, ensuring to correctly trigger the Copy-on-Write mechanism when needed). However, when constructing a DataFrame/Series from a numpy array, this doesn't work (the CoW mechanism works on the level of BlockManager and blocks).

Currently the DataFrame(..)/Series(..) constructors work "zero-copy" for a numpy array input (i.e. viewing the original ndarray). While it would be nice to keep behaviour for performance reasons, this also creates some problems for maintaining the copy/view rules under Copy-on-Write.

There are several aspects to this behaviour (I am showing the examples for Series, but the same is true for constructing DataFrame from a 2D array):

1) Updating the Series/DataFrame also updates the numpy array:

>>> arr = np.array([1, 2, 3])
>>> s = pd.Series(arr)
>>> s.iloc[0] = 100
>>> arr
array([100,   2,   3])

2) Updating the numpy array also updates the Series/DataFrame:

>>> arr = np.array([1, 2, 3])
>>> s = pd.Series(arr)
>>> arr[0] = 100
>>> s
0    100
1      2
2      3
dtype: int64

3) When creating multiple pandas objects viewing the same numpy array, updating the one will also update the other:

>>> arr = np.array([1, 2, 3])
>>> s1 = pd.Series(arr)
>>> s2 = pd.Series(arr)
>>> s1.iloc[0] = 100
>>> s2
Out[13]: 
0    100
1      2
2      3
dtype: int64

The first two cases could maybe be seen as corner cases (and documented as such, assuming changing the numpy array after DataFrame/Series creation is an "advanced" use case), although IMO ideally we should avoid that users can accidentally mutate DataFrame/Series in this way. But the third case clearly violates the principles of the copy/view rules under the CoW proposal.

I can think of two ways to fix this:

1) We always copy input arrays in the constructors (i.e. effectively changing the default to copy=True). * We will still need some way to create the Series/DataFrame without a copy (for example internally, if we know the array was just created and isn't passed by the user), and we could maybe use the copy=False option for this? (but we will have to clearly document that you then circumvent CoW and we don't give guarantees about the resulting copy/view behaviour when the array or dataframe/series gets mutated. * Ideally we would avoid this for performance reasons, I think, at least for Series (and DataFrame in case the input array has the correct memory layout). But for typical 2D numpy arrays, making a copy by default could actually improve performance: https://github.com/pandas-dev/pandas/issues/50756 2) We keep using a view by default (like we can do for DataFrame/Series input to the constructors), but use some other mechanism to enforce correct CoW behaviour. * One idea would be to change the input array to read-only (arr.flags.writeable = False). This would avoid the user to accidentally mutate the array and that way also update the pandas object (you can still change the flags, but that should give a hint to the user about what it is doing), and at the same time that would trigger pandas to make a copy of the array when the user tries to mutate the DataFrame/Series (triggering CoW, but using a different mechanism).

Comment From: phofl

I'd be in favour of 1. Changing the default to copy=True but still allowing copy=False (and clearly documenting the caveats). I am also very much in favour of changing the memory layout in the process of making a copy.

Comment From: MarcoGorelli

@jorisvandenbossche any reason to move this back to the 2.0 milestone? Is it a blocker?

Comment From: phofl

this will be finished by #50776