Feature Type
-
[ ] Adding new functionality to pandas
-
[X] Changing existing functionality in pandas
-
[ ] Removing existing functionality in pandas
Problem Description
Reductions like sum
and prod
accept a min_count
parameter to indicate a minimum size for a Series/DataFrame before the reduction can produce a non-NA value. The current implementation checks if the value is positive and filters accordingly. However, this check allows the user to pass negative values, which are treated the same as having passed 0. This allows user error to pass through unquestioned, but the most likely reason for a negative value here is that the user made an error and warning them would be helpful.
Feature Description
It would be nice to raise a ValueError if min_count < 0. I am happy to make a pull request if the maintainers agree.
Alternative Solutions
N/A
Additional Context
No response
Comment From: mroeschke
+1 to raise if min_count < 0
is passed.
Comment From: rhshadrach
If min_count
is specified by a hard coded value, then I agree we can infer that a negative value is likely erroneous. However, if it is determined algorithmically, there may be cases where the determination is a negative number. E.g.
necessary_observations: int = some_expression(df) - some_other_expression(df)
df.sum(min_count=necessary_observations)
If we forbid min_count
from being negative, then this needs to be
necessary_observations: int = some_expression(df) - some_other_expression(df)
if necessary_observations < 0
necessary_observations = 0
df.sum(min_count=necessary_observations)
I do think saying df.sum(min_count=-1)
logically makes sense - you need at least -1 values and so any number of values (0 or above) satisfies.
I have to imagine this would be quite a rare use case, and the adjustment of it isn't difficult. So while I lean toward leaving the logic as-is, I'm not -1 here.
Comment From: vyasr
I do think saying df.sum(min_count=-1) logically makes sense - you need at least -1 values and so any number of values (0 or above) satisfies.
While I agree that under some circumstances this usage could be logical, I think it's more likely that allowing this usage ends up allowing more insidious bugs to persist where the calling code does not intend to allow necessary_observations
to be negative and would benefit from seeing an early failure from pandas rather than attempting to use that value later in a way that produce invalid results without an error. Obviously it isn't pandas responsibility to fix the calling code, but forcing users to validate the input in this way would IMO catch enough errors to be worth the minor additional hassle of having to specify min_count = max(0, expr)
.