Code Sample
pandas.DataFrame.duplicated(subset=None, keep='first')
Problem description
Pandas has a built-in method for flagging duplicated records. However, it is not able to flag duplicated records by some combination of keys. DataFrame a:
index1 index2 val1 val2
a y 1 1
a y 1 2
b z 2 2
c x 3 3
DataFrame b:
index1 index2 val3
a y 1
b z 2
At first glance there are no duplicate keys, but this can cause issues when merging with the validate='1:1' criterion:
pd.merge(a, b, validate='1:1')
We expect both data sets to be unique by index1 and index2, but in DataFrame a the records are not, which causes the merge to fail. In SAS, duplicates within a by group can be found by executing:
proc sort data=sample nouniquekey out=dup;
by var1 var2;
run;
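Back in pandas, a minimal reproduction of the merge failure (a sketch built from the tables above; the names a and b match the frames shown):

import pandas as pd

# DataFrame a: the key pair (index1, index2) is NOT unique
# (rows 0 and 1 both have ('a', 'y'))
a = pd.DataFrame({'index1': ['a', 'a', 'b', 'c'],
                  'index2': ['y', 'y', 'z', 'x'],
                  'val1': [1, 1, 2, 3],
                  'val2': [1, 2, 2, 3]})

# DataFrame b: the key pair (index1, index2) is unique
b = pd.DataFrame({'index1': ['a', 'b'],
                  'index2': ['y', 'z'],
                  'val3': [1, 2]})

# raises pandas.errors.MergeError: the left keys are not unique,
# so the merge is not one-to-one
pd.merge(a, b, on=['index1', 'index2'], validate='1:1')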
Expected Output
Add a parameter such as groupby or index.
Comment From: jreback
You would need to show an example frame which does what you want. From your description, subset does exactly this.
Comment From: 77QingLiu
Updated. @jreback, please take a look.
Comment From: jorisvandenbossche
As @jreback said, it still seems you want the subset keyword:
In [31]: df
Out[31]:
index1 index2 val1 val2
0 a y 1 1
1 a y 1 2
2 b z 2 2
3 c x 3 3
In [32]: df.duplicated(subset=['index1', 'index2'])
Out[32]:
0 False
1 True
2 False
3 False
dtype: bool
Comment From: TomAugspurger
Let us know if subset isn't what you wanted.
Comment From: 77QingLiu
@jorisvandenbossche, thanks for your advice, but this is not what I want. For example:
In [1]: d = {'index1': ['a', 'a', 'b', 'c'], 'index2': ['y', 'y', 'z', 'x'], 'val1': [1, 1, 2, 3], 'val2': [1, 2, 3, 3]}
   ...: df = pd.DataFrame(data=d)
   ...: df
Out[1]:
index1 index2 val1 val2
0 a y 1 1
1 a y 1 2
2 b z 2 3
3 c x 3 3
In [2]: df.duplicated(subset=['index1', 'index2'])
Out[2]:
0 False
1 True
2 False
3 False
**Expected output:**
0 False
1 False
2 False
3 False
**which can be achieved by:**
In [3]: df.groupby(['index1', 'index2']).apply(lambda x: x.duplicated())
Out[3]:
index1 index2
a y 0 False
1 False
b z 2 False
c x 3 False
Target: identify duplicate records within each group combination. @TomAugspurger
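As an aside (a sketch, not from the thread): passing group_keys=False to groupby keeps the result aligned with the original row labels instead of prepending the group keys, which makes the mask directly usable for indexing. The name dup_in_group is only illustrative.

import pandas as pd

df = pd.DataFrame({'index1': ['a', 'a', 'b', 'c'],
                   'index2': ['y', 'y', 'z', 'x'],
                   'val1': [1, 1, 2, 3],
                   'val2': [1, 2, 3, 3]})

# group_keys=False drops the (index1, index2) levels from the result,
# yielding a boolean Series indexed like df
dup_in_group = df.groupby(['index1', 'index2'], group_keys=False).apply(lambda g: g.duplicated())
# 0    False
# 1    False
# 2    False
# 3    False
# dtype: bool

df[~dup_in_group]  # keep only rows that are unique within their group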
Comment From: jorisvandenbossche
**which can be achieved by:** In [3]: df.groupby(['index1', 'index2']).apply(lambda x: x.duplicated())
Well, that solves your problem then? I think directly using groupby makes more sense than starting to add a groupby keyword to a lot of functions.
What we could consider is adding a duplicated method directly on groupby(), so the apply is not needed. But in this case the apply way works OK, so I'm not sure it is needed.
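For illustration, a user-space helper along those lines; group_duplicated is a hypothetical name, not an existing pandas API, and a real GroupBy.duplicated would live on the groupby object itself:

import pandas as pd

def group_duplicated(df, by, subset=None, keep='first'):
    """Flag rows duplicated *within* their group (hypothetical helper).

    Mirrors DataFrame.duplicated, but scoped per group of `by` keys.
    """
    return df.groupby(by, group_keys=False).apply(
        lambda g: g.duplicated(subset=subset, keep=keep)
    )

# e.g. group_duplicated(df, by=['index1', 'index2']) reproduces the
# expected all-False output from the example above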
Comment From: 77QingLiu
Yes, the groupby solution solves my problem, but I think it's not straightforward enough. Adding a duplicated function in groupby would be a reasonable solution.
Comment From: jorisvandenbossche
Adding a duplicated function in groupby would be a reasonable solution.
@77QingLiu PR would certainly be welcome!