Code Sample
pandas.DataFrame.duplicated(subset=None, keep='first')
Problem description
Pandas has a built-in method for flagging duplicated records. However, it is not able to flag duplicated records by some combination of keys. DataFrame a:
index1 index2 val1 val2
a y 1 1
a y 1 2
b z 2 2
c x 3 3
DataFrame b:
index1 index2 val3
a y 1
b z 2
At first glance there are no duplicate keys, but this can cause issues when merging with the validate='1:1' criterion:
pd.merge(a, b, validate='1:1')
We expect both data sets to be unique by index1 and index2, but in DataFrame a the records are not, which causes the merge to fail. In SAS, duplicates within a by group can be found by executing:
proc sort data=sample nouniquekey out=dup;
by var1 var2;
run;
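Back in pandas, a minimal reproduction of the merge failure (a sketch built from the tables above; the names a and b match the frames shown):

import pandas as pd

# DataFrame a: the key pair (index1, index2) is NOT unique
# (rows 0 and 1 both have ('a', 'y'))
a = pd.DataFrame({'index1': ['a', 'a', 'b', 'c'],
                  'index2': ['y', 'y', 'z', 'x'],
                  'val1': [1, 1, 2, 3],
                  'val2': [1, 2, 2, 3]})

# DataFrame b: the key pair (index1, index2) is unique
b = pd.DataFrame({'index1': ['a', 'b'],
                  'index2': ['y', 'z'],
                  'val3': [1, 2]})

# raises pandas.errors.MergeError: the left keys are not unique,
# so the merge is not one-to-one
pd.merge(a, b, on=['index1', 'index2'], validate='1:1')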
Expected Output
Add a parameter such as groupby or index.
Comment From: jreback
You would need to show an example frame which does what you want. From your description, subset does exactly this.
Comment From: 77QingLiu
Updated. @jreback, please take a look.
Comment From: jorisvandenbossche
As @jreback said, it still seems you want the subset keyword:
In [31]: df
Out[31]:
index1 index2 val1 val2
0 a y 1 1
1 a y 1 2
2 b z 2 2
3 c x 3 3
In [32]: df.duplicated(subset=['index1', 'index2'])
Out[32]:
0 False
1 True
2 False
3 False
dtype: bool
Comment From: TomAugspurger
Let us know if subset isn't what you wanted.
Comment From: 77QingLiu
@jorisvandenbossche, thanks for your advice, but this is not what I want. For example:
In [1]: d = {'index1': ['a', 'a', 'b', 'c'], 'index2': ['y', 'y', 'z', 'x'], 'val1': [1, 1, 2, 3], 'val2': [1, 2, 3, 3]}
   ...: df = pd.DataFrame(data=d)
   ...: df
Out[1]:
index1 index2 val1 val2
0 a y 1 1
1 a y 1 2
2 b z 2 3
3 c x 3 3
In [2]: df.duplicated(subset=['index1', 'index2'])
Out[2]:
0 False
1 True
2 False
3 False
**Expected output:**
0 False
1 False
2 False
3 False
**which can be achieved by:**
In [3]: df.groupby(['index1', 'index2']).apply(lambda x: x.duplicated())
Out[3]:
index1 index2
a y 0 False
1 False
b z 2 False
c x 3 False
Target: identify duplicate records within each group combination. @TomAugspurger
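As an aside (a sketch, not from the thread): passing group_keys=False to groupby keeps the result aligned with the original row labels instead of prepending the group keys, which makes the mask directly usable for indexing. The name dup_in_group is only illustrative.

import pandas as pd

df = pd.DataFrame({'index1': ['a', 'a', 'b', 'c'],
                   'index2': ['y', 'y', 'z', 'x'],
                   'val1': [1, 1, 2, 3],
                   'val2': [1, 2, 3, 3]})

# group_keys=False drops the (index1, index2) levels from the result,
# yielding a boolean Series indexed like df
dup_in_group = df.groupby(['index1', 'index2'], group_keys=False).apply(lambda g: g.duplicated())
# 0    False
# 1    False
# 2    False
# 3    False
# dtype: bool

df[~dup_in_group]  # keep only rows that are unique within their group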
Comment From: jorisvandenbossche
**which can be achieved by:** In [3]: df.groupby(['index1', 'index2']).apply(lambda x: x.duplicated())
Well, that solves your problem then? I think directly using groupby makes more sense than starting to add a groupby keyword to a lot of functions.
What we could consider is adding a duplicated method directly on groupby(), so the apply is not needed. But in this case the apply way works OK, so I'm not sure it is needed.
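For illustration, a user-space helper along those lines; group_duplicated is a hypothetical name, not an existing pandas API, and a real GroupBy.duplicated would live on the groupby object itself:

import pandas as pd

def group_duplicated(df, by, subset=None, keep='first'):
    """Flag rows duplicated *within* their group (hypothetical helper).

    Mirrors DataFrame.duplicated, but scoped per group of `by` keys.
    """
    return df.groupby(by, group_keys=False).apply(
        lambda g: g.duplicated(subset=subset, keep=keep)
    )

# e.g. group_duplicated(df, by=['index1', 'index2']) reproduces the
# expected all-False output from the example above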
Comment From: 77QingLiu
Yes, the groupby solution solves my problem, but I think it's not straightforward enough. Adding a duplicated function in groupby would be a reasonable solution.
Comment From: jorisvandenbossche
Adding a duplicated function in groupby would be a reasonable solution.
@77QingLiu PR would certainly be welcome!