Enhancement description
The present implementation of df.duplicated() (and hence df.drop_duplicates()) has only two options for users who wish to keep exactly one row from each set of duplicates ('first' and 'last'). In some use cases, if the data is already ordered in some way, these options can introduce bias. It would be useful to have an option that allows the kept duplicate to be chosen randomly (but also deterministically).
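The desired behavior can be approximated with today's API via groupby().sample(); a minimal sketch (the 'eventNumber' column and the seed are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "eventNumber": [1, 2, 2, 3, 3, 3],
    "value": list("abcdef"),
})

# Keep one randomly chosen row per 'eventNumber' group; the seed makes
# the choice reproducible. This approximates a hypothetical keep='random'.
deduped = df.groupby("eventNumber").sample(n=1, random_state=42).sort_index()
```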
Details
In our use case, we are able to produce the desired result by other means: we wish to remove all but one of the events sharing a value in the 'eventNumber' column, so we introduce an additional 'random number' column using pd.util.hash_pandas_object(df, index=False). We then sort the dataframe by this column, apply df.drop_duplicates() using either keep='first' or keep='last', and then sort by index again (thanks to @chrisburr for this solution).
By using a hash instead of a standard RNG, the numbers used in the sorting are deterministic and repeatable. The ordering also remains stable when entries are added or removed while the rest are unmodified, which is desirable but not necessarily required. But this method requires the hash to have knowledge of another subset of the columns in which there are no duplicates, and so would require the underlying functions (the duplicated_{{dtype}} functions in pandas/_libs/hashtable_func_helper.pxi.in) to receive an additional argument.
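A minimal sketch of this workaround (the 'eventNumber' column comes from the description above; the helper column name is arbitrary):

```python
import pandas as pd

df = pd.DataFrame({
    "eventNumber": [1, 2, 2, 3, 3, 3],
    "value": list("abcdef"),
})

# Deterministic pseudo-random key: a hash over the full row contents,
# so rows that differ anywhere get different keys.
df["_rank"] = pd.util.hash_pandas_object(df, index=False)

# Sort by the hash, keep one row per eventNumber, then restore the
# original order and drop the helper column.
deduped = (
    df.sort_values("_rank")
      .drop_duplicates(subset="eventNumber", keep="first")
      .sort_index()
      .drop(columns="_rank")
)
```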
Comment From: WillAyd
Not sure this is the right place for it. Why wouldn't you just generate a random sorting before the call to duplicated?
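A sketch of that suggestion (the seed and column name are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"eventNumber": [1, 2, 2, 3, 3, 3], "value": list("abcdef")})

# Shuffle reproducibly and mark duplicates in the shuffled order; the
# boolean mask aligns back to df's index, so no re-sort is needed.
mask = df.sample(frac=1, random_state=0).duplicated(subset="eventNumber")
deduped = df[~mask]
```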
Comment From: mroeschke
Agreed with Will. Since there doesn't seem to be much traction or appetite for this feature from the community, going to close for now. Happy to revisit if there's larger interest from the community.
Comment From: ophirSarusi
Is this the place to show my interest? I would be happy to have the option to keep a random row when I use df.drop_duplicates(), instead of the first or last. This is mostly useful when performing Monte Carlo-type operations on the dataset, where I want the sampling to not include duplicates of certain columns.
Currently the only way for me to achieve what I need is to shuffle the rows --> drop duplicates --> re-sort the indexes. If my indexes are not sorted integers to begin with, then that is a bigger mess for me...
Anyway, wanted to show my support for this feature and demonstrate a scenario in which it would be used.
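For indexes that are not sorted integers, one sketch is to record the original row positions explicitly instead of relying on sort_index() (the '_order' helper column is made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"group": [1, 1, 2, 2], "value": list("abcd")},
    index=["x", "k", "b", "a"],  # arbitrary, unsorted index
)

# Remember each row's original position, shuffle, dedupe, restore order.
deduped = (
    df.assign(_order=np.arange(len(df)))
      .sample(frac=1, random_state=0)
      .drop_duplicates(subset="group")
      .sort_values("_order")
      .drop(columns="_order")
)
```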
Comment From: orlandombaa
Hello
I am trying to solve a similar problem, where it would be very useful to keep a random or deterministic position, not just the first or last position.
Comment From: kruhtemazo
I also need a solution for this. I use the non-optimal

idx = np.random.permutation(np.arange(len(df)))
df = df.iloc[idx].drop_duplicates(subset=['RANDOM.ID'])

In our dataset, the random ID of a person is a repeated measure. We would like drop_duplicates to be random and to have a random_state, so our code would be reproducible and give the same sklearn metrics after each run.
(In ML, if there is a repeated measurement we should use mixed-effects linear regression. Testing the assumptions for linear regression is cumbersome, and sometimes we do not meet all the assumptions we need to, so we have to use another approach. But if we always use the first or last occurrence of a repeated measure, that is not good, because some time has passed between measurements, so randomising that is a nice option.)
I am new to programming, so this would help. Thank you.
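That workaround can be made reproducible today by seeding the permutation; a sketch (the seed value is arbitrary):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"RANDOM.ID": [1, 1, 2, 2, 3], "y": [0.1, 0.2, 0.3, 0.4, 0.5]})

rng = np.random.default_rng(42)  # fixed seed -> same rows kept each run
idx = rng.permutation(len(df))   # seeded random row order
deduped = df.iloc[idx].drop_duplicates(subset=["RANDOM.ID"]).sort_index()
```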