The dataset named 'changes' was obtained from a merge by RID:
changes.head()
| | DX_bl | RID | DX_m12 |
|---|---|---|---|
| 0 | 1 | 3 | 1 |
| 1 | 0 | 4 | 0 |
| 2 | 0 | 6 | 0 |
| 3 | 0 | 8 | 0 |
| 4 | 1 | 10 | 1 |
I used the following two commands to identify the rows in which DX_bl and DX_m12 differ:
print(changes[~changes.duplicated(subset = ['DX_bl','DX_m12'], keep=False)])
DX_bl | RID | DX_m12 | |
---|---|---|---|
64 | 64 | 1 | 167 |
print(changes.drop_duplicates(subset = ['DX_bl','DX_m12']))
DX_bl | RID | DX_m12 | |
---|---|---|---|
0 | 0 | 1 | 3 |
1 | 1 | 0 | 4 |
10 | 10 | 0 | 30 |
64 | 64 | 1 | 167 |
With keep=False, I expected it to return rows 10 (RID 30) and 64 (RID 64). But, as shown above, row 10 (RID 30) is missing from the result.
On the other hand, if keep is left at its default (keep='first'), it wrongly returns rows 0 (RID 3) and 1 (RID 4).
Is there a bug in pandas' duplicated/drop_duplicates?
Comment From: jreback
pls provide a copy-pastable (runnable) example, pics are not useful.
Comment From: jreback
if you have an example pls post.
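
A minimal copy-pastable sketch of the kind requested above. The values are made up for illustration and are not the original data; only the column names DX_bl, RID and DX_m12 come from the report. It suggests that `duplicated` and `drop_duplicates` compare each row against the other rows on the `subset` columns, rather than comparing the two columns within a row, while a boolean mask selects the rows where the two columns differ.

```python
import pandas as pd

# Hypothetical data (assumption): NOT the dataset from the report.
changes = pd.DataFrame({
    "DX_bl":  [1, 0, 0, 0, 1],
    "RID":    [3, 4, 6, 8, 10],
    "DX_m12": [1, 0, 1, 0, 0],
})

# Rows whose (DX_bl, DX_m12) pair occurs exactly once in the whole frame.
# keep=False marks every member of a repeated pair as a duplicate.
unique_pairs = changes[~changes.duplicated(subset=["DX_bl", "DX_m12"], keep=False)]

# First occurrence of each distinct (DX_bl, DX_m12) pair (keep='first' is the default).
first_of_each = changes.drop_duplicates(subset=["DX_bl", "DX_m12"])

# Rows in which the two columns differ: an element-wise comparison within each row.
differing = changes[changes["DX_bl"] != changes["DX_m12"]]

print(unique_pairs)
print(first_of_each)
print(differing)
```

If the goal is to find rows where DX_bl and DX_m12 disagree, the element-wise comparison in the last step is likely the operation wanted; the first two calls answer a different question about repeated value pairs across rows.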