Pandas DOC: improve docs for broadcasting behavior

The following result was quite surprising to me:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'a': np.arange(3),
...                    'b': np.ones(3)})
>>> df + [np.ones(3), np.ones(3)]
                 a                b
0  [1.0, 1.0, 1.0]  [2.0, 2.0, 2.0]
1  [2.0, 2.0, 2.0]  [2.0, 2.0, 2.0]
2  [3.0, 3.0, 3.0]  [2.0, 2.0, 2.0]

It looks like when you add a list to a DataFrame, pandas casts the list to a one-dimensional Series whose entries are array objects. The addition is then broadcast down the columns, so each entry of the result is an array.
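A minimal sketch of the cast described above. It builds the Series by hand rather than through `df + list`, because the implicit conversion may differ between pandas versions; the labels 'a' and 'b' simply mirror the columns in the example. It only shows that a Series made from a list of arrays is a 1D object Series, and that a scalar plus one of its entries is an array.

```python
import numpy as np
import pandas as pd

# A list of two arrays becomes a 1D Series of length 2 whose entries are arrays.
s = pd.Series([np.ones(3), np.ones(3)], index=['a', 'b'])
print(s.dtype)   # object
print(s.shape)   # (2,)

# Adding a scalar cell value to one entry broadcasts inside that entry.
cell = 0 + s['a']
print(cell)      # [1. 1. 1.]
```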

This behavior is quite counter-intuitive to me: is it a bug? If not, should it be documented?

Comment From: gwerbin

The broadcasting behavior is documented, albeit (IMO) very poorly: http://pandas.pydata.org/pandas-docs/stable/basics.html?highlight=broadcast#matching-broadcasting-behavior

Here's a simpler demo:

df = pd.DataFrame({'x': np.arange(3), 'y': np.arange(3) + 1}, index=['a', 'b', 'c'])

# explicit behavior
df.add(pd.Series([1, 10]))
df.add(pd.Series([1, 10], index=df.columns))
df.add(pd.Series([1, 10, 100], index=df.index), axis='index')

# implicit conversion behavior
df.add([1, 10])
df.add([1, 10], 'columns')
df.add([1, 10, 100], axis='index')

# DataFrame.__add__() is equivalent to DataFrame.add(..., axis='columns')
df + pd.Series([1, 10], index=df.columns)
df + [1, 10]

As for the results turning into arrays, that's just a consequence of the fact that the Series elements are arrays. The elements of the Series are broadcast according to the rules above, and from there addition is applied naively, element by element. In this case, a number plus an array is an array, so the resulting elements are arrays.
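To make the alignment rules concrete, here is a small sketch with fixed integers. The column names 'x' and 'y' and the variable names are chosen for illustration only; the printed values follow from label alignment on the chosen axis.

```python
import pandas as pd

df = pd.DataFrame({'x': [0, 1, 2], 'y': [1, 2, 3]}, index=['a', 'b', 'c'])

# Default (axis='columns'): the Series index is matched against df.columns,
# so 1 is added to column x and 10 to column y.
by_col = df.add(pd.Series([1, 10], index=df.columns))
print(by_col['x'].tolist(), by_col['y'].tolist())  # [1, 2, 3] [11, 12, 13]

# axis='index': the Series index is matched against df.index,
# so 1, 10, 100 are added to rows a, b, c.
by_row = df.add(pd.Series([1, 10, 100], index=df.index), axis='index')
print(by_row['x'].tolist(), by_row['y'].tolist())  # [1, 11, 102] [2, 12, 103]

# A Series with a default integer index shares no labels with df.columns,
# so every cell of the aligned result is NaN.
unaligned = df.add(pd.Series([1, 10]))
print(unaligned.isna().all().all())  # True

# A plain list of matching length is converted implicitly and lines up by position.
implicit = df.add([1, 10])
print(implicit.equals(by_col))  # True
```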

Comment From: jreback

Very happy to have a PR to improve that section. Keep in mind that we don't necessarily want to make it too long.

Comment From: gwerbin

@jreback it would help if the font size in the code samples were smaller and the text area were wider. It's physically hard to read right now, especially if you're trying to scan for a particular piece of information. Compare docs.python.org.

Using integers instead of random numbers in the examples will also help comprehension significantly.

Comment From: jakevdp

Thanks! I do understand why it happens after digging through the code, but I would say that in an ecosystem where broadcasting is such a commonly-used pattern, it's quite confusing for a 2D object to be implicitly treated as a 1D object array containing 1D entries.

Comment From: gwerbin

@jakevdp That's not exactly what's happening. The list on the RHS is blindly cast to Series, and a Series of arrays is fundamentally a 1D object. It is being broadcast to form a 2D object by copying the first dimension along the second:

a00 a01       b0 b1
a10 a11   +   b0 b1
a20 a21       b0 b1

This behavior is consistent with the behavior in NumPy, where it generalizes to arbitrary dimensions.

It just so happens that, in this case, each bj is itself an array. Pandas is agnostic about how each aij + bj calculation takes place, so a second broadcasting step then happens between the scalar aij and the array bj.
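The two broadcasting steps can be reproduced in plain NumPy with an object array. This is only an analogy for the description above, not pandas' internal code path; `a` stands in for the DataFrame values and `b` for the 1D Series of arrays.

```python
import numpy as np

# a: a 3x2 grid of numbers, standing in for the DataFrame values a_ij.
a = np.arange(6).reshape(3, 2)

# b: a 1D object array of length 2 whose entries b_j are themselves arrays.
b = np.empty(2, dtype=object)
b[0] = np.ones(3)
b[1] = np.ones(3)

# Outer step: b is repeated down the rows, so the result has shape (3, 2).
# Inner step: each a_ij + b_j is a scalar plus an array, which is an array.
result = a + b
print(result.shape)   # (3, 2)
print(result[2, 1])   # [6. 6. 6.]
```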

Comment From: jakevdp

I disagree that this is consistent with numpy broadcasting. Numpy implicitly treats a list of arrays as a 2D array:

>>> arr = np.random.randint(0, 10, size=(3, 4))
>>> L = [np.random.randint(0, 10, 4) for i in range(3)]
>>> arr + L
array([[ 9,  6,  6, 10],
       [ 7, 13,  5,  5],
       [ 7, 15, 17,  3]])

2D object + 2D object = 2D object
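The same NumPy behavior with fixed integers instead of random draws, so the output is reproducible. The data here is made up for illustration.

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)
# A list of three length-4 arrays, using fixed integers for readability.
L = [np.full(4, 10 * i) for i in range(3)]

# NumPy converts the list to a (3, 4) array before adding.
print(np.asarray(L).shape)   # (3, 4)

total = arr + L
print(total.shape)           # (3, 4)
print(total.tolist())
# [[0, 1, 2, 3], [14, 15, 16, 17], [28, 29, 30, 31]]
```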

Pandas, on the other hand, treats a list of arrays as a 1D series of anonymous objects:

>>> df = pd.DataFrame(arr.T)
>>> df + L
               0              1                 2
0  [9, 8, 8, 11]  [7, 15, 6, 9]      [7, 7, 9, 3]
1   [7, 6, 6, 9]  [5, 13, 4, 7]  [15, 15, 17, 11]
2   [7, 6, 6, 9]  [6, 14, 5, 8]  [15, 15, 17, 11]
3  [8, 7, 7, 10]  [3, 11, 2, 5]      [7, 7, 9, 3]

Effectively, the result is 2D object + 2D object = 3D object. I found that to be quite surprising, due to the inconsistency with NumPy's behavior.

The result is even more silly if you try to add a list of series to a dataframe:

>>> L = [pd.Series(row) for row in L]
>>> df + L

The result is a 2D dataframe, each element of which is a 1D Series, and I can't imagine a scenario in which that is what the user would want.

Since adding a list of numbers to a dataframe is basically equivalent to doing df[i] + L[i] within each column, I had expected that adding a list of Series objects to a dataframe would also be the equivalent of doing df[i] + L[i] for each column, in analogy to the way that NumPy broadcasting works (modulo the standard column-first vs. row-first difference).
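A sketch of the column-wise result described above, written out explicitly instead of relying on `df + L`. The sample data and the name `per_column` are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({0: [0, 1, 2, 3], 1: [4, 5, 6, 7], 2: [8, 9, 10, 11]})
# One Series per column (hypothetical sample data).
L = [pd.Series([10 * i] * 4) for i in range(3)]

# Spell out df[i] + L[i] column by column.
per_column = pd.DataFrame({col: df[col] + s for col, s in zip(df.columns, L)})
print(per_column.values.tolist())
# [[0, 14, 28], [1, 15, 29], [2, 16, 30], [3, 17, 31]]
```

Each cell stays a plain number, which is the 2D + 2D = 2D outcome that the NumPy analogy suggests.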

Does that explain why a user might be surprised by this behavior?

Comment From: jakevdp

Overall, I think the biggest issue here is that this operation is doing so much implicitly, leading the user to make assumptions about what should happen... I think the most consistent option would be to make dataframe + list raise a ValueError in all cases, but that ship has probably sailed.
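Until something like that exists in pandas itself, a user-side guard can approximate it. The helper below, `strict_add`, is hypothetical and not part of pandas; it is only a sketch of rejecting list operands so that the axis has to be stated through an explicit Series or DataFrame.

```python
import pandas as pd

def strict_add(df, other):
    """Hypothetical helper: refuse list or tuple operands, add anything else."""
    if isinstance(other, (list, tuple)):
        raise ValueError("wrap list-like operands in a Series or DataFrame first")
    return df + other

df = pd.DataFrame({'x': [0, 1, 2], 'y': [1, 2, 3]})

try:
    strict_add(df, [1, 10])
except ValueError as exc:
    print(exc)

# An explicit Series states which axis labels the values belong to.
explicit = strict_add(df, pd.Series([1, 10], index=df.columns))
print(explicit['y'].tolist())  # [11, 12, 13]
```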

Comment From: gwerbin

I see, thanks for clarifying. I'm not a pandas dev myself, but I would hope that version 0.21 is not too late to improve an API. I suppose a PR is in order here, but in the meantime the best we can do is document the behavior better.

Unfortunately I managed to break the reStructuredText rendering on my fork, and I haven't had the chance to fix it. I'll have some time to work on it over the weekend, and I'll make sure to highlight this case.

Comment From: VincentLa

I can work on this if no one else has started.

Comment From: gwerbin

@VincentLa I had started a pass at this but never finished. I do intend to work on it again at some point, but it's here in case you'd like to pick it up: https://github.com/gwerbin/pandas/tree/patch-1

Comment From: VincentLa

Thanks @gwerbin, if you're planning on working on it then I'll let you finish!

Comment From: gwerbin

On the contrary, help would be appreciated. I won't have time to work on it for quite a while.

Here is the diff so you can see how far I got: https://github.com/gwerbin/pandas/compare/master...gwerbin:patch-1

It's wordy "first draft" quality stuff. Definitely a long way to go before it's acceptable documentation.

Comment From: techy4shri

is this issue still open for contribution?

Comment From: ghost

I don't really understand these things yet. Could I get a short introduction to start with?

Comment From: that-ar-guy

Hi everyone, is this issue still open for contribution? I'd like to work on improving the documentation for broadcasting behavior.