Code Sample, a copy-pastable example if possible
import pandas as pd
foo = pd.Series([1, 2, 3,])
bar = pd.Series([1, 2, 0])
baz = foo / bar  # 3 / 0 yields inf, so baz is [1.0, 1.0, inf]
cut = pd.cut(baz, 8, duplicates='drop')
*** ValueError: missing values must be missing in the same location both left and right sides
Problem description
Having an 'inf' value in a Series seems to cause pandas.cut to fail with this error: *** ValueError: missing values must be missing in the same location both left and right sides
I saw bug #19768 already, but that was fixed by PR 19833 in Feb, and I'm using 0.23.4 which was released on August 3, 2018. Also there's #5483, which was fixed a long time ago.
Expected Output
'inf' should probably be similar to the current NA-handling behavior: it should at least not raise an exception, and just drop that as a usable value.
Output of pd.show_versions()
Comment From: jschendel
This should raise, as bins should not be specified as an integer when the input data contains infinity. The error message could certainly be improved, though.
A more concise example of the error in question:
In [2]: pd.cut([0, 1, 2, 3, 4, np.inf], bins=3)
---------------------------------------------------------------------------
ValueError: Bin edges must be unique: array([nan, inf, inf, inf]).
You can drop duplicate edges by setting the 'duplicates' kwarg
In [3]: pd.cut([0, 1, 2, 3, 4, np.inf], bins=3, duplicates='drop')
---------------------------------------------------------------------------
ValueError: missing values must be missing in the same location both left and right sides
Note that duplicates='drop' is just delaying the error from [2].
As to why this is invalid, from the documentation for cut we have the following description for bins:
bins : int, sequence of scalars, or pandas.IntervalIndex
    The criteria to bin by.
    - int : Defines the number of equal-width bins in the range of x. The range of x is extended by .1% on each side to include the minimum and maximum values of x.
Specifying bins as an integer means getting that number of equal-width bins that span the range of the input data. When the input data contains infinity, the range is infinite, so each bin would also need to be infinitely wide, which doesn't make sense.
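To make the reasoning above concrete, here is a minimal sketch of what happens when equal-width edges are computed over an infinite range. It is not the actual pandas implementation (which, per the documentation, also extends the range by .1%); the helper name equal_width_edges is made up for illustration, and it only assumes that the edges are spaced evenly between the minimum and maximum, as np.linspace does.

```python
import numpy as np


def equal_width_edges(values, n_bins):
    """Hypothetical helper: evenly spaced bin edges between min and max."""
    lo, hi = np.nanmin(values), np.nanmax(values)
    # 0 * inf is undefined, so NumPy would emit an "invalid value" warning
    # for an infinite range; silence it and inspect the result instead.
    with np.errstate(invalid="ignore"):
        return np.linspace(lo, hi, n_bins + 1)


finite_edges = equal_width_edges([0, 1, 2, 3, 4], 3)
infinite_edges = equal_width_edges([0, 1, 2, 3, 4, np.inf], 3)

print(finite_edges)    # four finite, evenly spaced edges
print(infinite_edges)  # contains nan/inf, so no usable bins can be built
```

With a finite range every edge is a usable number; with an infinite range the step between edges is itself infinite, which is consistent with the nan/inf edges reported in the error message from [2].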
The way to handle this would be to specify bins using one of the alternative options, or to use qcut:
In [4]: pd.cut([0, 1, 2, 3, 4, np.inf], bins=[-1, 2, np.inf])
Out[4]:
[(-1.0, 2.0], (-1.0, 2.0], (-1.0, 2.0], (2.0, inf], (2.0, inf], (2.0, inf]]
Categories (2, interval[float64]): [(-1.0, 2.0] < (2.0, inf]]
In [5]: pd.qcut([0, 1, 2, 3, 4, np.inf], q=3)
Out[5]:
[(-0.001, 1.667], (-0.001, 1.667], (1.667, 3.333], (1.667, 3.333], (3.333, inf], (3.333, inf]]
Categories (3, interval[float64]): [(-0.001, 1.667] < (1.667, 3.333] < (3.333, inf]]
Comment From: vaughnkoch
That makes sense, thanks for sharing.
What do you think of having it just ignore inf values (and if there are no binnable values, then raise?) If you have inf values in the source data, the caller has to do extra work to remove them anyway, which pandas could do as well. The docs could then state that inf is converted to NA.
Comment From: jschendel
What do you think of having it just ignore inf values (and if there are no binnable values, then raise?)
I'd be a bit hesitant to silently ignore infinite values. This would make the handling of infinity a bit inconsistent within cut, since infinity is valid for other ways of using cut. It would also make it harder for users to diagnose whether they're using the wrong combination of parameters with cut, or whether something is wrong with their assumptions about their data (infinity being present when it shouldn't be). Additionally, I'm not aware of any other methods that handle infinity in a similar way (I could be forgetting something), so it would set a new precedent within the codebase with regard to infinity.
All that being said, I wouldn't necessarily be opposed to changing the behavior in that regard if a consensus is reached to support it. I think it would warrant a larger discussion though. Feel free to open a new issue regarding the proposed behavior if it's something you feel strongly about, so it has more visibility to the other devs.
Comment From: vaughnkoch
What about having the default be to raise (with an informative error message), but to add an additional optional parameter to pd.cut which would then ignore infs? That way, if someone uses the option, they're explicitly saying infs can be ignored, and if they get the error otherwise, it'll be an easy change instead of having to put an extra line to drop infs.
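As a rough illustration of this proposal, the behavior could be approximated today with a user-side wrapper. This is only a sketch: cut_with_inf_option and its ignore_inf flag are hypothetical names, not part of pandas, and the error messages are invented for the example.

```python
import numpy as np
import pandas as pd


def cut_with_inf_option(values, bins, ignore_inf=False, **kwargs):
    """Hypothetical wrapper around pd.cut; `ignore_inf` is NOT a pandas parameter."""
    s = pd.Series(values, dtype="float64")
    integer_bins = isinstance(bins, (int, np.integer))
    if integer_bins and np.isinf(s).any():
        if not ignore_inf:
            # Default: fail early with a message that names the real problem.
            raise ValueError(
                "bins cannot be an integer when the data contains infinity; "
                "pass explicit bin edges or set ignore_inf=True"
            )
        # Opt-in: treat +/-inf as missing before binning.
        s = s.replace([np.inf, -np.inf], np.nan)
        if s.notna().sum() == 0:
            raise ValueError("no finite values left to bin")
    return pd.cut(s, bins, **kwargs)


binned = cut_with_inf_option([0, 1, 2, 3, 4, np.inf], bins=3, ignore_inf=True)
print(binned)  # the inf entry comes back as NaN; the rest fall into 3 bins
```

Explicit bin edges are passed through untouched, so infinity stays valid for the other ways of calling cut.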
Comment From: jschendel
Yes, that would be a more viable alternative. Would still like a new issue opened regarding it for more discussion. To be honest, I don't think this would be a high priority item to be filled, unless it's something you'd be willing to contribute, given that it's a rather specific request that I don't think would be frequently used. Also keep in mind that there's a bit of maintenance burden associated with adding keyword arguments like this; the user workaround is a single line of code, but implementing this in the codebase requires quite a bit more work, and would need to be maintained going forward.
Comment From: HansBambel
@jschendel Sadly using qcut also results in an error:
pd.qcut([1,2,3,4,5,-np.inf, np.inf], q=3, duplicates="drop")
results in ValueError: missing values must be missing in the same location both left and right sides
I was expecting the first and last bin to contain np.inf. This was working in pandas 1.1.5.
Similar to https://github.com/pandas-dev/pandas/issues/11113
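One possible workaround for the qcut case, sketched under the assumption that it is acceptable to compute the quantiles from the finite values only (which is not what qcut itself does), is to build the inner edges with np.quantile and then hand open-ended edges to cut. The helper name qcut_open_ended is made up for this example.

```python
import numpy as np
import pandas as pd


def qcut_open_ended(values, q):
    """Hypothetical helper: quantile bins whose outer edges are -inf and +inf."""
    arr = np.asarray(values, dtype="float64")
    finite = arr[np.isfinite(arr)]
    # Inner edges come from the finite values only (an assumption of this sketch).
    inner = np.quantile(finite, np.linspace(0, 1, q + 1)[1:-1])
    edges = np.concatenate(([-np.inf], inner, [np.inf]))
    # include_lowest=True so that a value equal to the first edge (-inf)
    # is placed in the first bin instead of being left out.
    return pd.cut(arr, bins=edges, include_lowest=True)


binned = qcut_open_ended([1, 2, 3, 4, 5, -np.inf, np.inf], q=3)
print(binned)
```

Here -inf lands in the first bin and inf in the last, matching the expectation described above. If the finite values produce duplicate inner edges, cut will still raise unless duplicates="drop" is passed.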