related #6360
Based on this SO question
Consider the following two Series:
x = pd.DataFrame({'year':[1,1,1,1,2,2,2,2],
'country':['A','A','B','B','A','A','B','B'],
'prod':[1,2,1,2,1,2,1,2],
'val':[10,20,15,25,20,30,25,35]})
x = x.set_index(['year','country','prod']).squeeze()
y = pd.DataFrame({'year':[1,1,2,2],'prod':[1,2,1,2],
'mul':[10,0.1,20,0.2]})
y = y.set_index(['year','prod']).squeeze()
which look like:
year country prod
1 A 1 10
2 20
B 1 15
2 25
2 A 1 20
2 30
B 1 25
2 35
year prod
1 1 10.0
2 0.1
2 1 20.0
2 0.2
I find it to be an extremely common task, to perform binary operations by distributing the values of y
over a particular level of x
. For example, I'd like to multiply all values of product 1 in year 1 by 10.0, regardless of country
.
The required result is therefore as follows:
year country prod
1 A 1 100.0
2 2.0
B 1 150.0
2 2.5
2 A 1 400.0
2 6.0
B 1 500.0
2 7.0
The binary operation .mul() doesn't work as expected:
>>> x.mul(y, level=['year','prod'])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/pandas/core/series.py", line 334, in f
return self._binop(other, op, level=level, fill_value=fill_value)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/series.py", line 2075, in _binop
this, other = self.align(other, level=level, join='outer')
File "/usr/local/lib/python2.7/dist-packages/pandas/core/series.py", line 2570, in align
return_indexers=True)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/index.py", line 954, in join
return_indexers=return_indexers)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/index.py", line 1058, in _join_level
raise Exception('Join on level between two MultiIndex objects '
Exception: Join on level between two MultiIndex objects is ambiguous
To create the required result, the user currently has to do this:
x = x.reset_index('country').sort_index()
x.val = x.val * y
x = x.reset_index().set_index(['year',
'country',
'prod']).sortlevel()
Comment From: jreback
@roblevy can you put your solution in the top section as well (then its easy to create expected result)
Comment From: roblevy
Done!
Comment From: jreback
gr8! love for you to work on this! otherwise prob won't be tackled for a while. Its not too tricky and would get you a lot of exposure to the code.
Comment From: roblevy
:D Great
On 5 December 2013 23:18, jreback notifications@github.com wrote:
gr8! love for you to work on this! otherwise prob won't be tackled for a while. Its not too tricky and would get you a lot of exposure to the code.
— Reply to this email directly or view it on GitHubhttps://github.com/pydata/pandas/issues/5645#issuecomment-29949006 .
Comment From: ghost
I think the example result has editing errors in it, where did the 400 come from? doesn't match your description of the operation.
A very powerful operation to have added. Is there a sugary way to do it you had in mind? I like the idea anyway.
Comment From: roblevy
Hi @y-p. The example result is correct.
The 400 is x[year 2, country A, product 1] * y[year 2, product 1] = 20 * 20 = 400
Notice how the values of x
are multiplied by different numbers from y
depending on year
and product
.
As far as I'm concerned, this should "just work" when the user does x.mul(y)
Comment From: ghost
Thanks, I see now.
Comment From: 0o-de-lally
@roblevy @jreback This feature is an important component to a project i'm working on. Any chance we can get this implemented on the .14 release as well? I see that a related bug was fixed for the same release. Seems like they belong together. How can I help get this released sooner?
Comment From: jreback
@keyscores well, this is dependent on #6360 so that needs to be fixed first that said, you can simply use the soln above. Its just as efficient, if not syntax friendly, but you can wrap in a function
of course a PR would be welcom
Comment From: 0o-de-lally
@jreback I'll try the solution above. I can donate a small bounty as an incentive to move #6360 up to v0.14 , would that help? :)
Comment From: jreback
@keyscores unfortunately no....:) that IS being worked on though
Comment From: 0o-de-lally
Ok it was worth a try. At least you know you have a cheerleader for that fix. Would love to see it in .14 :)
Comment From: roblevy
This may or may not be relevant, but I've found a comparatively neat workaround for this problem: unstack
the index level(s) which doesn't (don't) appear in the other Series
.
In [81]: x.unstack('country').mul(y, axis=0).stack()
Out[81]:
year prod country
1 1 A 100.0
B 150.0
2 A 2.0
B 2.5
2 1 A 400.0
B 500.0
2 A 6.0
B 7.0
dtype: float64
This is not the best possible solution since it changes the order of the index of x
which is unexpected, but it should be pretty efficient.
Comment From: jreback
yes this would be quite efficient
In [31]: x.unstack('country').mul(y, axis=0).stack().reorder_levels(x.index.names)
Out[31]:
year country prod
1 A 1 100.0
B 1 150.0
A 2 2.0
B 2 2.5
2 A 1 400.0
B 1 500.0
A 2 6.0
B 2 7.0
dtype: float64
Comment From: roblevy
Looks to me like you might (sometimes at least) need a .sortlevel()
on the end of that command.
In [41]: z = pd.DataFrame(dict(country=['A', 'A', 'B', 'B'], prod=[1,2,1,2], val=[1,2,3,4]))
In [42]: z = z.set_index(['country', 'prod']).squeeze()
In [43]: x.unstack('year').mul(z, axis=0).stack().reorder_levels(x.index.names)
Out[43]:
year country prod
1 A 1 10
2 A 1 20
1 A 2 40
2 A 2 60
1 B 1 45
2 B 1 75
1 B 2 100
2 B 2 140
dtype: int64
Comment From: znicholls
This appears to have been fixed somewhere, in the latest master you can do
import pandas as pd
x = pd.DataFrame({'year':[1,1,1,1,2,2,2,2],
'country':['A','A','B','B','A','A','B','B'],
'prod':[1,2,1,2,1,2,1,2],
'val':[10,20,15,25,20,30,25,35]})
x = x.set_index(['year','country','prod']).squeeze()
y = pd.DataFrame({'year':[1,1,2,2],'prod':[1,2,1,2],
'mul':[10,0.1,20,0.2]})
y = y.set_index(['year','prod']).squeeze()
x.mul(y, axis=0)
# result below
year prod country
1 1 A 100.0
B 150.0
2 A 2.0
B 2.5
2 1 A 400.0
B 500.0
2 A 6.0
B 7.0
dtype: float64
Comment From: mroeschke
Closing as this seems addressed