Pandas method df.merge is resetting the index

method df.merge will throw away / reset the index of df. I dont think this is desired result.

here's an example:

df = pd.DataFrame({'key' : [1,2,3]}, index=['very','important','index'])
df2 = pd.DataFrame({'key' : [3,2,1], 'data':['c','b','a']}, index=[100,200,300])
df.merge(df2, on='key')

the result will lose the very important index in df and instead will have a generic 0 1 2 index.

   key data
0    1    a
1    2    b
2    3    c

here is a method based on merge that will preseve the original index (which should be the intended result IMHO)

def merge_with_perfect_index(df, df2, on, how):
    original_index_name = df.index.name
    TEMP_INDEX = 'temp_index_aviad'
    return df.rename_axis(TEMP_INDEX).reset_index().merge(df2, on=on, how=how).set_index(TEMP_INDEX).rename_axis(original_index_name)

merge_with_perfect_index(df, df2, on='key', how='left')

which outputs

                    key  data
very           1      a
important      2      b
index          3      c

tested on 1.5.3

Comment From: samukweku

have a look at the docs regarding merge and caveats on index preservation

Comment From: aviadr1

I see, thanks for the quick reply! I usually read the reference API docs and havent read through the user guide. this appears rather as a footnote in the user guide, and I found this behavior to be unexpected.

note, that in the example, the join is in practice a 1-to-1 join as the keys are unique.

IMHO having this documented in the merge reference to having a parameter to control this behavior would be great. for instance a reindexing parameter could be handy. it could have the values: - left keep the index from the left dataframe. if this is impossible due to many-on-many then raise error - auto - current behavior: keep index if merging on unique indexes, or reset otherwise - reset - always reset

having the parameter available in merge would be an opportunity to reference this gotcha

in my ML project, this issue caused training data to leak into the test data because I was relying on indexes to keep them apart.