First of all, if I missed a point, please feel free to comment.
Using arithmetic operations on pd.DataFrames is sometimes a mouthful. Take the following example, where columns a and b should be multiplied by the column c:
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame(np.random.randn(3, 3), columns=list('abc'))
df[['a', 'b']] * df['c']
Apparently this doesn't work as expected. Instead one has to use either pd.DataFrame.mul(), which hurts legibility, or pd.DataFrame.values, which yields long lines and is therefore hard to read as well:
# using pd.DataFrame.mul()
df[['a', 'b']].mul(df['c'], axis='index')
# This is quite short, but does not work...
df[['a', 'b']] * df[['c']].values
# .. you have to use numpy arrays instead
df[['a', 'b']].values * df[['c']].values
Granted, the last call in this example returns a numpy array, but in my case that's the only thing I'm interested in, since I'm rewrapping my data at a later stage.
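To make the problem concrete, here is a small check, assuming the same df as above. It shows what the plain `*` operator returns (the Series index is matched against the DataFrame columns, so nothing lines up) and that the two workarounds agree numerically. The variable names are only for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(3, 3), columns=list('abc'))

# Plain operator: the Series index (0, 1, 2) is aligned against the
# DataFrame columns ('a', 'b'), so no labels match.
naive = df[['a', 'b']] * df['c']
print(naive.columns.tolist())    # union of both label sets
print(naive.isna().all().all())  # every entry is NaN

# Row-wise multiplication that keeps the pandas labels
by_mul = df[['a', 'b']].mul(df['c'], axis='index')

# Same numbers, but as a bare numpy array
by_values = df[['a', 'b']].values * df[['c']].values

print(np.allclose(by_mul.values, by_values))
```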
I'm proposing a new short indexer for operating on values, something like:
df.v[['a', 'b']] * df.v[['c']]
# which returns the same as
df[['a', 'b']].values * df[['c']].values
Or even more sophisticated:
df[['a', 'b']] * df.v[['c']]
# which returns the same as
df[['a', 'b']].mul(df['c'], axis='index')
Btw the same goes for all other arithmetic operators.
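For what it's worth, the first form can be prototyped today without touching pandas itself, using the existing accessor registration hook. This is only a sketch: the class name ValuesIndexer and the accessor name "v" are made up for illustration, and it covers only the `df.v[...] * df.v[...]` case, not the mixed form.

```python
import numpy as np
import pandas as pd


@pd.api.extensions.register_dataframe_accessor("v")
class ValuesIndexer:
    """Hypothetical `.v` indexer: select like df[...], return a numpy array."""

    def __init__(self, pandas_obj):
        self._obj = pandas_obj

    def __getitem__(self, key):
        # Delegate the selection to the DataFrame, then drop the labels
        return self._obj[key].to_numpy()


np.random.seed(0)
df = pd.DataFrame(np.random.randn(3, 3), columns=list('abc'))

# Shapes (3, 2) and (3, 1) broadcast under the usual numpy rules
product = df.v[['a', 'b']] * df.v[['c']]
print(type(product), product.shape)
```

Since the result is a plain numpy array, all index and column labels are lost and would have to be reattached by hand.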
Comment From: jreback
- this would expose internal implementation detail (users would have to understand numpy )
- make code more obscure / unreadable
- make the api more complex (we have another indexer, what is the reason???)
Apparently this doesn't work as expected. Instead one has to use either pd.Dataframe.mul(), which bbroadcasting a multiplication is
Why do you think this should work this way? The point is to align operations on the index by default.
Comment From: TomAugspurger
This is basically the same as @shoyer's point in https://github.com/pandas-dev/pandas/issues/10000#issuecomment-236238297 right?
IIRC the current behavior of dataframe * series is to match the behavior of NumPy to broadcast the last index (columns)?
I think expecting df[['a', 'b']] * df['c'] to return
In [20]: df[['a', 'b']].mul(df['c'], axis=0)
Out[20]:
          a         b
0  1.726545  0.391649
1 -2.189975 -1.825123
2 -0.098067  0.015623
is perfectly reasonable. That said, this would be a big API change, with no clear way of deprecation.
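For reference, the NumPy rule being mirrored can be seen in a few lines. This is a plain NumPy sketch with made-up arrays, not pandas code: shapes are compared starting from the last axis, so a length-3 vector does not broadcast against a (3, 2) array unless it gets a trailing axis.

```python
import numpy as np

arr = np.arange(6.0).reshape(3, 2)   # 3 rows, 2 columns
vec = np.array([10.0, 20.0, 30.0])   # one value per row

# NumPy compares shapes from the last axis: (3, 2) against (3,) does not match
try:
    arr * vec
    broadcast_failed = False
except ValueError:
    broadcast_failed = True

# A trailing axis gives shape (3, 1), which broadcasts across the columns
row_wise = arr * vec[:, np.newaxis]

print(broadcast_failed)
print(row_wise)
```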
Comment From: shoyer
In my experience, the best way to write such arithmetic currently is something like (df[['a', 'b']].T * df['c']).T (which is hardly ideal).
I think this would be reasonable behavior to change for pandas 2.0 but probably not before.
I'm not excited about the proposal here, which feels like a work-around for fundamentally broken broadcasting behavior rather than a fix of the root cause.
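As a quick sanity check of the transpose trick, assuming the df from the original example: after transposing, the row labels 0, 1, 2 become the columns, so the default alignment against the index of df['c'] matches, and transposing back gives the same frame as .mul(..., axis=0).

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(3, 3), columns=list('abc'))

# Transpose so the row labels become columns, multiply, transpose back
via_transpose = (df[['a', 'b']].T * df['c']).T

# Explicit row-wise multiplication for comparison
via_mul = df[['a', 'b']].mul(df['c'], axis=0)

# Raises an AssertionError if the two frames differ
pd.testing.assert_frame_equal(via_transpose, via_mul)
```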
Comment From: jreback
@shoyer if you want to create an issue for pandas 2, that would be great.
Closing this one as no-action in pandas 1.0.
Comment From: shoyer
See https://github.com/pandas-dev/pandas2/issues/30
Comment From: skycaptain
Thanks for the discussion here.
I'm not excited about the proposal here, which feels like a work-around for fundamentally broken broadcasting behavior rather than a fix of the root cause.
My proposal was, afaik, a minor fix for a common problem that people like me have right now. But I've learned that even this addition would cause a lot of trouble and confusion for others. So I agree with @shoyer and @jreback that this issue is reasonable, but also too profound.