Pandas Losing index and sort on merge

I'm almost positive I've filed an issue about this before, but I don't find it. This is a pretty common operation when working with panel data. You want to merge in some constant values into a DataFrame that has repeated keys but unique index.

df = pd.util.testing.makeDataFrame() # yes!
df["group"] = 0
df.ix[range(0, 20, 2), "group"] = 1.
df["time"] = np.random.randint(5, size=len(df))
df_x = pd.DataFrame(np.c_[np.r_[np.arange(5), np.arange(5)], 
                        np.repeat([0,1], 5)], 
                    columns=["time", "group"])
df_x["constant_info"] = np.random.randn(len(df_x))

df.merge(df_x, on=["time", "group"])

Where'd my index go? Where'd my sort go, so I could even recover the index?

[~/]
[65]: pd.version.version
[65]: '0.13.1-254-g150f323'

Comment From: jreback

reset the index first.

I am not sure this well defined actually. Because the shape of the resultant could change I am not sure it is that easy to propogate the index, but could be a bug.

In [29]: df.reset_index().merge(df_x, on=["time", "group"])
Out[29]: 
         index         A         B         C         D  group  time  constant_info
0   67U6Qr4tZ3  0.588197 -0.157200 -0.289734 -1.666943      1     0       0.119489
1   A0OncCRC4B -0.372781 -0.916348 -1.045884  1.073101      1     0       0.119489
2   PTn2oHbQwX -1.214873  0.148770 -0.912763 -0.682711      0     3      -0.368596
3   QIXTSYuRt8  1.142137 -1.156337 -0.241504  0.184532      0     3      -0.368596
4   GA2H70ii7r -1.168154  1.288484  0.384694 -1.709554      0     3      -0.368596
5   9zCy65irsl  0.742336 -0.519756  2.286992 -0.992639      0     3      -0.368596
6   JBYulSbxFz -0.305541 -0.291382 -0.056819 -1.060126      1     3      -1.214391
7   zcwADabKd6 -0.139200  1.261100  2.633415  0.430764      1     3      -1.214391
8   LdIgdqggFM -0.443637  0.568010 -1.806676  2.413111      0     0       0.645687
9   pgVKNmKAj2 -0.205737 -0.176864 -0.049452 -0.015245      0     0       0.645687
10  8wepkCgkMj -0.063876  1.157569  0.993645 -0.132804      0     0       0.645687
11  7A8MaDJoI4  0.110686 -0.245378  0.642752 -1.354833      0     0       0.645687
12  nUYhh9XadC  0.329432  1.631661  1.883624  0.299788      0     0       0.645687
13  lOIdNjZcbi  0.255259 -0.894358  1.452045  0.175334      0     0       0.645687
14  GFSNSFrV0M -0.699924 -1.604428 -0.357788  2.047454      1     2       0.120312
15  z25eVDRRNc  0.582153 -1.990730  0.934239 -0.619487      1     2       0.120312
16  VOCrP3g19I -1.402226 -1.409673 -0.094738 -0.064140      0     2       0.200803
17  YnVQ5Id67z -0.392426 -0.004004 -0.044901 -1.242439      0     2       0.200803
18  0QJn4pAnxN -1.613268  2.038299  1.443572 -0.436414      0     2       0.200803
19  39jjtXL37y -0.027252 -0.063240  1.096697  1.559221      0     2       0.200803
20  ezYa3CqIsY -0.973928 -0.902517 -1.763653 -1.711358      0     2       0.200803
21  aDEKCCnLIz  0.176168  0.559791 -2.198903  0.050573      0     4      -0.291175
22  8hzneduK1g -0.340049  0.095846  0.710264  1.660862      0     4      -0.291175
23  4PdmoVHVIT  0.015929  0.423310 -0.211383  1.335053      0     4      -0.291175
24  LrrXeBfYwT -0.128707 -0.118413 -0.036632  0.557884      1     1       1.144252
25  SAQP6I7DUP  2.453141  2.470560  0.100641  0.413540      1     1       1.144252
26  Md780GPvIT  3.401016  0.450771 -0.171816  0.825083      1     1       1.144252
27  LG24ITGSVY -1.013176  0.468748 -0.613433 -0.799757      0     1      -1.193142
28  56yc13pOAb -0.302066 -0.835660 -0.446994  1.586293      0     1      -1.193142
29  3GOkikZGMA  0.060595 -0.991693 -2.306191  0.997766      1     4      -0.637071

[30 rows x 8 columns]

Comment From: jseabold

Thanks. Better than what I came up with

grouped = df.groupby(['time', 'group'])
df_x.set_index(['time', 'group'], inplace=True)
for group, frame in grouped:
    for var in df_x.columns.diff(frame.columns):
        df.ix[frame.index, var] = df_x.ix[group, var]

I should add that I left out a how='left' in the example, which should be well defined unless there are duplicate keys in the RHS. I'm not saying merge should definitely handle this, but there should be something for this. It's similar to the use case I had when Chang and I looked at adding an update method. Maybe it should go there or similar.

Comment From: jseabold

reset_index gets tricky with MultiIndex.

Comment From: jreback

their is also a left_on keyword (defaults to False), which you could set if you want to merge on the index (I don't think it works with multi-index though).

In general the code is much simpler if the indexes are not involved in a merge directly (e.g. they are reset first), then just do on.

merge is a beast!

Comment From: jseabold

I don't want to merge on the index though I don't think bc. the RHS doesn't have one.

I tend to think in terms of many to 1 and 1 to many merges like Stata, while this also targets SQL-like merge.

http://www.stata.com/help.cgi?merge

Maybe a merge_many that handles m:1 and 1:m would be nicer than trying to shoehorn things into merge. I dunno.

Comment From: jseabold

Yeah, this becomes more difficult for named indices and MultiIndex. My method groupby loop should still work here though.

Comment From: jreback

closing as stale. pls reopen if still an issue.

Comment From: marcdelabarrera

This is an old issue, but is it still the case that when I have a panel data where the index is the date, and I want to add a column to it using a 1 to 1 matching on col, the best practice is:

df.reset_index().merge(other_table, on='col').set_index('date')

For example, I have a dataset with date, the subject_code, and how many students were in that semester. date is the index and I have another table with the subject_code and subject_name. I want to add subject_name to the previous table. Is there any reason why it does not exist a keep_index=True option?