method df.merge
will throw away / reset the index of df
. I dont think this is desired result.
here's an example:
df = pd.DataFrame({'key' : [1,2,3]}, index=['very','important','index'])
df2 = pd.DataFrame({'key' : [3,2,1], 'data':['c','b','a']}, index=[100,200,300])
df.merge(df2, on='key')
the result will lose the very important index
in df
and instead will have a generic 0 1 2
index.
key data
0 1 a
1 2 b
2 3 c
here is a method based on merge that will preseve the original index (which should be the intended result IMHO)
def merge_with_perfect_index(df, df2, on, how):
original_index_name = df.index.name
TEMP_INDEX = 'temp_index_aviad'
return df.rename_axis(TEMP_INDEX).reset_index().merge(df2, on=on, how=how).set_index(TEMP_INDEX).rename_axis(original_index_name)
merge_with_perfect_index(df, df2, on='key', how='left')
which outputs
key data
very 1 a
important 2 b
index 3 c
tested on 1.5.3
Comment From: samukweku
have a look at the docs regarding merge
and caveats on index preservation
Comment From: aviadr1
I see, thanks for the quick reply! I usually read the reference API docs and havent read through the user guide. this appears rather as a footnote in the user guide, and I found this behavior to be unexpected.
note, that in the example, the join is in practice a 1-to-1 join as the keys are unique.
IMHO having this documented in the merge
reference to having a parameter to control this behavior would be great.
for instance a reindexing
parameter could be handy. it could have the values:
- left
keep the index from the left dataframe. if this is impossible due to many-on-many then raise error
- auto
- current behavior: keep index if merging on unique indexes, or reset otherwise
- reset
- always reset
having the parameter available in merge
would be an opportunity to reference this gotcha
in my ML project, this issue caused training data to leak into the test data because I was relying on indexes to keep them apart.