The following occurs when using pandas.merge for an (left) outer join with the left_index=True and right_on="something" options. The index of the resulting DataFrame contains duplicate values and the "something" column contains new (compared to before) values that look suspiciously like what the index should be.

Is this expected behavior, or could this be a bug?

A minimal illustration is:


In [2]: pandas.__version__
Out[2]: '0.14.0'

In [3]: df1 = pandas.DataFrame(zeros(8).reshape(4,2), columns=['A', 'B'])

In [4]: df2 = pandas.DataFrame(arange(4).reshape(2,2), columns=['C', 'D'])

In [5]: df1
Out[5]: 
   A  B
0  0  0
1  0  0
2  0  0
3  0  0

In [6]: df2
Out[6]: 
   C  D
0  0  1
1  2  3

In [7]: pandas.merge(df1, df2, left_index=True, right_on='C', how='left')
Out[7]: 
   A  B  C   D
0  0  0  0   1
1  0  0  1 NaN
1  0  0  2   3
1  0  0  3 NaN

What I would expect as the result is the following and can be constructed by adding an extra column carrying the left index, doing the join with left_on instead of left_index, and dropping the extra column again:


In [8]: df1['i'] = pandas.Series(df1.index)

In [9]: pandas.merge(df1, df2, left_on='i', right_on='C', how='left').drop(['i'],axis=1)
Out[9]: 
   A  B   C   D
0  0  0   0   1
1  0  0 NaN NaN
2  0  0   2   3
3  0  0 NaN NaN

Comment From: jreback

closing as stale. pls reopen if still an issue.

Comment From: cwwalter

I am also still seeing this in 0.20.1. Is this expected behavior?

Comment From: cwwalter

In case anyone finds this in the future like I did, the answer is the following:

pandas.merge(df1, df2, left_index=True, right_on='C', how='left')

is not replacing the index with 'C' (which is what you want to join on). You can accomplish what you want with either:

pd.merge(df1, df2, left_index=True, right_on='C', how='left').set_index('C')

or

df1.join(df2.set_index('C'), how='left')

Note that the 'C' column disappears since you are joining them. However, It is still not obvious to me why the index winds up as [0, 1, 1, 1] in the first case in this issue.

Comment From: gfyoung

However, It is still not obvious to me why the index winds up as [0, 1, 1, 1] in the first case in this issue.

True, I'm a little perplexed by that myself. Re-opening in case anyone wants to tackle.

Comment From: shadiakiki1986

The 1st 1 and the last 1 seem to just be the last index in the dataframe. Extending the example to demonstrate

import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.zeros(8).reshape(4,2), columns=['A', 'B'])
df2 = pd.DataFrame(np.arange(4*4).reshape(2*4,2), columns=['C', 'D'])
pd.merge(df1, df2, left_index=True, right_on='C', how='left')

yields

  A B C D
0 0.0 0.0 0 1.0
7 0.0 0.0 1 NaN
1 0.0 0.0 2 3.0
7 0.0 0.0 3 NaN

where the 7 now is just the last row's index

Comment From: jreback

this result is correct, the index values on the left are being joined with the values in C

Comment From: danieljy

@jreback, while the join is happening, the resultant index is misleading.

As @shadiakiki1986 says, for the rows in left (df1) whose index doesn't correspond to any C in right (df2), the resultant index is being set simply to the last element in right (df2) which looks like a bug.

Seems like given that this is a left join, the original index should be maintained (equal to what C looks like in the final result as is).