Suppose I have two DataFrames:
left = pd.DataFrame({'key': list("abcd"), 'value': [1, 2, 3, 4]})
right = pd.DataFrame({'key': list("aceb"), 'value': [5, 6, 7, 8]})
Columns with with the same name are given suffixes in a merge()
:
In [8]: pd.merge(left, right, on='key')
Out[8]:
key value_x value_y
0 a 1 5
1 b 2 8
2 c 3 6
I would like to invoke a function instead:
In [9]: pd.merge(left, right, on='key', combine=lambda x, y: x + y)
Out[9]:
key value
0 a 6
1 b 10
2 c 9
In a left/right/outer join, missing keys imply no function call (just take the single value):
In [10]: pd.merge(left, right, on='key', how='outer', combine=lambda x, y: x + y)
Out[10]:
key value
0 a 6
1 b 10
2 c 9
3 d 4
4 e 7
Only one of combine
or suffixes
parameter can be specified. Of course, the user can simply request an override by taking only the right's values if both keys are present:
In [11]: pd.merge(left, right, on='key', how='left', combine=lambda x, y: y)
Out[11]:
key value
0 a 5
1 b 8
2 c 6
3 d 4
Comment From: TomAugspurger
How would this work with multiple shared columns?, e.g.
python
left = pd.DataFrame({'key': list("abcd"), 'value': [1, 2, 3, 4], 'value2': list('aaaa')})
right = pd.DataFrame({'key': list("aceb"), 'value': [5, 6, 7, 8], 'value2': list('aaaa')})
I could see two options:
- The
x
,y
passed are either 1-d arrays or Series (what's the index?key
?), andcombine
can be a single function called on each pair, or a dictionary -> function API like{'value': lambda x, y: x + y, 'value2': lambda x, y: ...}
- The
x
andy
passed is a DataFrame of all the cols fromleft
andright
that have overlap
Comment From: chrisaycock
@TomAugspurger I like the first option. If the user just gives a single function, then that function is applied to all columns; a dict means applying the specified function to the given column.
Comment From: mroeschke
Thanks for the request but it seems there hasn't been much development or interest from the community or core dev team in a while so closing for now. Can reopen when there's more interest