I'm almost positive I've filed an issue about this before, but I don't find it. This is a pretty common operation when working with panel data. You want to merge in some constant values into a DataFrame that has repeated keys but unique index.
df = pd.util.testing.makeDataFrame() # yes!
df["group"] = 0
df.ix[range(0, 20, 2), "group"] = 1.
df["time"] = np.random.randint(5, size=len(df))
df_x = pd.DataFrame(np.c_[np.r_[np.arange(5), np.arange(5)],
np.repeat([0,1], 5)],
columns=["time", "group"])
df_x["constant_info"] = np.random.randn(len(df_x))
df.merge(df_x, on=["time", "group"])
Where'd my index go? Where'd my sort go, so I could even recover the index?
[~/]
[65]: pd.version.version
[65]: '0.13.1-254-g150f323'
Comment From: jreback
reset the index first.
I am not sure this well defined actually. Because the shape of the resultant could change I am not sure it is that easy to propogate the index, but could be a bug.
In [29]: df.reset_index().merge(df_x, on=["time", "group"])
Out[29]:
index A B C D group time constant_info
0 67U6Qr4tZ3 0.588197 -0.157200 -0.289734 -1.666943 1 0 0.119489
1 A0OncCRC4B -0.372781 -0.916348 -1.045884 1.073101 1 0 0.119489
2 PTn2oHbQwX -1.214873 0.148770 -0.912763 -0.682711 0 3 -0.368596
3 QIXTSYuRt8 1.142137 -1.156337 -0.241504 0.184532 0 3 -0.368596
4 GA2H70ii7r -1.168154 1.288484 0.384694 -1.709554 0 3 -0.368596
5 9zCy65irsl 0.742336 -0.519756 2.286992 -0.992639 0 3 -0.368596
6 JBYulSbxFz -0.305541 -0.291382 -0.056819 -1.060126 1 3 -1.214391
7 zcwADabKd6 -0.139200 1.261100 2.633415 0.430764 1 3 -1.214391
8 LdIgdqggFM -0.443637 0.568010 -1.806676 2.413111 0 0 0.645687
9 pgVKNmKAj2 -0.205737 -0.176864 -0.049452 -0.015245 0 0 0.645687
10 8wepkCgkMj -0.063876 1.157569 0.993645 -0.132804 0 0 0.645687
11 7A8MaDJoI4 0.110686 -0.245378 0.642752 -1.354833 0 0 0.645687
12 nUYhh9XadC 0.329432 1.631661 1.883624 0.299788 0 0 0.645687
13 lOIdNjZcbi 0.255259 -0.894358 1.452045 0.175334 0 0 0.645687
14 GFSNSFrV0M -0.699924 -1.604428 -0.357788 2.047454 1 2 0.120312
15 z25eVDRRNc 0.582153 -1.990730 0.934239 -0.619487 1 2 0.120312
16 VOCrP3g19I -1.402226 -1.409673 -0.094738 -0.064140 0 2 0.200803
17 YnVQ5Id67z -0.392426 -0.004004 -0.044901 -1.242439 0 2 0.200803
18 0QJn4pAnxN -1.613268 2.038299 1.443572 -0.436414 0 2 0.200803
19 39jjtXL37y -0.027252 -0.063240 1.096697 1.559221 0 2 0.200803
20 ezYa3CqIsY -0.973928 -0.902517 -1.763653 -1.711358 0 2 0.200803
21 aDEKCCnLIz 0.176168 0.559791 -2.198903 0.050573 0 4 -0.291175
22 8hzneduK1g -0.340049 0.095846 0.710264 1.660862 0 4 -0.291175
23 4PdmoVHVIT 0.015929 0.423310 -0.211383 1.335053 0 4 -0.291175
24 LrrXeBfYwT -0.128707 -0.118413 -0.036632 0.557884 1 1 1.144252
25 SAQP6I7DUP 2.453141 2.470560 0.100641 0.413540 1 1 1.144252
26 Md780GPvIT 3.401016 0.450771 -0.171816 0.825083 1 1 1.144252
27 LG24ITGSVY -1.013176 0.468748 -0.613433 -0.799757 0 1 -1.193142
28 56yc13pOAb -0.302066 -0.835660 -0.446994 1.586293 0 1 -1.193142
29 3GOkikZGMA 0.060595 -0.991693 -2.306191 0.997766 1 4 -0.637071
[30 rows x 8 columns]
Comment From: jseabold
Thanks. Better than what I came up with
grouped = df.groupby(['time', 'group'])
df_x.set_index(['time', 'group'], inplace=True)
for group, frame in grouped:
for var in df_x.columns.diff(frame.columns):
df.ix[frame.index, var] = df_x.ix[group, var]
I should add that I left out a how='left'
in the example, which should be well defined unless there are duplicate keys in the RHS. I'm not saying merge should definitely handle this, but there should be something for this. It's similar to the use case I had when Chang and I looked at adding an update method. Maybe it should go there or similar.
Comment From: jseabold
reset_index
gets tricky with MultiIndex.
Comment From: jreback
their is also a left_on
keyword (defaults to False), which you could set if you want to merge on the index (I don't think it works with multi-index though).
In general the code is much simpler if the indexes are not involved in a merge directly (e.g. they are reset first), then just do on
.
merge is a beast!
Comment From: jseabold
I don't want to merge on the index though I don't think bc. the RHS doesn't have one.
I tend to think in terms of many to 1 and 1 to many merges like Stata, while this also targets SQL-like merge.
http://www.stata.com/help.cgi?merge
Maybe a merge_many
that handles m:1 and 1:m would be nicer than trying to shoehorn things into merge. I dunno.
Comment From: jseabold
Yeah, this becomes more difficult for named indices and MultiIndex. My method groupby loop should still work here though.
Comment From: jreback
closing as stale. pls reopen if still an issue.
Comment From: marcdelabarrera
This is an old issue, but is it still the case that when I have a panel data where the index is the date, and I want to add a column to it using a 1 to 1 matching on col
, the best practice is:
df.reset_index().merge(other_table, on='col').set_index('date')
For example, I have a dataset with date
, the subject_code
, and how many students were in that semester. date
is the index and I have another table with the subject_code
and subject_name
. I want to add subject_name
to the previous table. Is there any reason why it does not exist a keep_index=True
option?