Code Sample, a copy-pastable example if possible
>>> s = pd.Series([10,9,8,7,7,7,6])
>>> s.nlargest(4)
0 10
1 9
2 8
3 7
dtype: int64
Problem description
The docstrings list False
as one of the possible argument values for keep
. pandas raises a ValueError
when attempting to use this parameter.
Expected Output
It would be nice to have nlargest work like this.
>>> s.nlargest(4, keep=False)
0 10
1 9
2 8
3 7
4 7
5 7
Output of pd.show_versions()
Comment From: gfyoung
@tdpetrou : Thanks for reporting this! Unfortunately, we have already decided to go and remove this option from the docs. (#18559)
If you have strong feelings about including this as an option, please respond, and we can discuss.
Comment From: tdpetrou
@gfyoung This doesn't make sense to me. The keep
option is only important whenever there are ties for the last value. Arguably, the most useful thing to do is to keep all of the ties which would be keep=False
. This option still exists in the docstrings in 0.21.
Also, the duplicated
method has same three options so it matches that. Additionally, it was just recently removed without discussion.
edit: A better value for the parameter would be keep='all'
Comment From: gfyoung
@tdpetrou : keep=False
in my mind would mean to drop them (the opposite of what you want), but yes, keep='all'
would make more sense. The reason for dropping the option in the docs was because the implementation has only supported the first two options (the docs look like copy + paste at some point).
True that there was no discussion to this, but a simple doc-change like #18559 went in without objections from anyone. Everyone (including yourself) could have objected if need be, though to be fair, unless you're tracking pandas
changes regularly, this would have been hard to catch.
The good news is that we haven't released this yet, so we do have time to implement a "third" option for keep
if need be. @jreback : what are your thoughts?
Comment From: tdpetrou
@gfyoung Thanks for the response. I proposed a simple solution to this in #18656. It was very easy and basically already done. I think the original reason for using False
was to match duplicated
(and possibly others that I can't remember right now)?
Regardless, it just makes a lot of sense (to me) that you would want to keep the nlargest/nsmallest plus ties and the fix is easy and already complete.
Comment From: gfyoung
Your case for using False
is a little shaky IMO because it doesn't much sense (as you yourself even have acknowledged) in the context of this function. Semantics trump this sort of API consistency.
Your implementation is indeed simple but the surfacing of it will require discussion, should other maintainers believe that we should actually have such an option.
Comment From: tdpetrou
I'm not in favor of False
and would prefer 'all'
. The only thing I am pushing for is to make the functionality for keeping all ties possible.
Comment From: summerela
The option to keep all ties returned from nlargest would be wonderful. Right now I have to use this function in a groupby to find the top two counts for each group, then go back and find all rows that match the top two scores. It would be faster and more efficient to just have the function return all rows that match the top n largest/smallest values and I'm not sure why this feature was removed. Pretty please can you put it back? :)
Comment From: WillAyd
@summerela the feature as described here was implemented in 0.24. If that's not what you are looking for better to open a new issue at this point
Comment From: summerela
I'm running Dask v 1.1.1 and the only allowed arguments for keep are "first" and "last".
Comment From: WillAyd
OK if you can open up a new issue with code sample including output of pd.show_versions someone can take a look
Comment From: summerela
Can do. Thank you!
Comment From: artinthetrees
it would be useful also to have a keep option of "none"!
s = pd.Series([10,9,8,7,7,7,6]) s.nlargest(4) 0 10 1 9 2 8 dtype: int64