Pandas Numpy 'inf' values cause pandas.cut to fail

Code Sample, a copy-pastable example if possible

import pandas as pd
foo = pd.Series([1, 2, 3])
bar = pd.Series([1, 2, 0])
baz = foo / bar  # 3 / 0 yields inf, so baz is [1.0, 1.0, inf]
cut = pd.cut(baz, 8, duplicates='drop')

*** ValueError: missing values must be missing in the same location both left and right sides

Problem description

Having an 'inf' value in a Series seems to cause pandas.cut to fail with this error: *** ValueError: missing values must be missing in the same location both left and right sides

I saw bug #19768 already, but that was fixed by PR #19833 in February, and I'm using 0.23.4, which was released on August 3, 2018. There's also #5483, which was fixed a long time ago.

Expected Output

'inf' should probably be similar to the current NA-handling behavior: it should at least not raise an exception, and just drop that as a usable value.

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Darwin
OS-release: 18.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: 3.10.0
pip: 18.1
setuptools: 40.2.0
Cython: 0.29.2
numpy: 1.15.1
scipy: 1.1.0
pyarrow: 0.11.0
xarray: None
IPython: 6.5.0
sphinx: 1.8.2
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.3
openpyxl: 2.5.10
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: 4.2.4
bs4: None
html5lib: 1.0.1
sqlalchemy: 1.2.11
pymysql: None
psycopg2: 2.7.5 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Comment From: jschendel

This should raise, as bins should not be specified as an integer when the input data contains infinity. The error message could certainly be improved though.

A more concise example of the error in question:

In [2]: pd.cut([0, 1, 2, 3, 4, np.inf], bins=3)
---------------------------------------------------------------------------
ValueError: Bin edges must be unique: array([nan, inf, inf, inf]).
You can drop duplicate edges by setting the 'duplicates' kwarg

In [3]: pd.cut([0, 1, 2, 3, 4, np.inf], bins=3, duplicates='drop')
---------------------------------------------------------------------------
ValueError: missing values must be missing in the same location both left and right sides

Note that duplicates='drop' is just delaying the error from [2].
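To illustrate why dropping duplicates only moves the failure, here is a minimal sketch that starts from the edge array shown in the error from [2]. It assumes that the binning code builds its interval labels from the deduplicated edges in roughly the way `pd.IntervalIndex.from_breaks` does; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Edge array reported in the "Bin edges must be unique" error above.
edges = np.array([np.nan, np.inf, np.inf, np.inf])

# Dropping duplicates leaves a single pair of edges: nan and inf.
deduped = pd.unique(edges)

# Building an interval from those edges gives a missing left side and a
# non-missing right side, which interval construction rejects.
try:
    pd.IntervalIndex.from_breaks(deduped)
    message = None
except ValueError as exc:
    message = str(exc)

print(deduped)
print(message)
```

The nan edge survives deduplication, so the only change is which check trips over it.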

As to why this is invalid, from the documentation for cut we have the following description for bins:

bins : int, sequence of scalars, or pandas.IntervalIndex
    The criteria to bin by.

    - int : Defines the number of equal-width bins in the range of x. The range of x is extended by .1% on each side to include the minimum and maximum values of x.

Specifying bins as an integer means getting that number of equal-width bins that span the range of the input data. When the input data contains infinity, the range is infinite, so each bin would also need to be infinitely wide, which doesn't make sense.
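The arithmetic behind this can be sketched in a few lines. The helper `equal_width_step` below is hypothetical and is not how pandas is implemented; it only computes the bin width implied by "equal-width bins over the range of x".

```python
import numpy as np

def equal_width_step(values, n_bins):
    """Width of each bin when the data range is split into n_bins equal parts.

    Hypothetical helper for illustration; not a pandas function.
    """
    arr = np.asarray(values, dtype=float)
    return (arr.max() - arr.min()) / n_bins

finite_step = equal_width_step([0, 1, 2, 3, 4], 3)
infinite_step = equal_width_step([0, 1, 2, 3, 4, np.inf], 3)

print(finite_step)    # a finite width, 4 / 3
print(infinite_step)  # inf: no finite edges can be placed between min and max
```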

The way to handle this would be to specify bins using one of the alternative options, or to use qcut:

In [4]: pd.cut([0, 1, 2, 3, 4, np.inf], bins=[-1, 2, np.inf])
Out[4]:
[(-1.0, 2.0], (-1.0, 2.0], (-1.0, 2.0], (2.0, inf], (2.0, inf], (2.0, inf]]
Categories (2, interval[float64]): [(-1.0, 2.0] < (2.0, inf]]

In [5]: pd.qcut([0, 1, 2, 3, 4, np.inf], q=3)
Out[5]:
[(-0.001, 1.667], (-0.001, 1.667], (1.667, 3.333], (1.667, 3.333], (3.333, inf], (3.333, inf]]
Categories (3, interval[float64]): [(-0.001, 1.667] < (1.667, 3.333] < (3.333, inf]]

Comment From: vaughnkoch

That makes sense, thanks for sharing.

What do you think of having it just ignore inf values (and if there are no binnable values, then raise?) If the source data has inf values, the caller has to do extra work to remove them anyway, which pandas could do as well. The docs could then simply state that inf is converted to NA.

Comment From: jschendel

> What do you think of having it just ignore inf values (and if there are no binnable values, then raise?)

I'd be a bit hesitant to silently ignore infinite values. This would make the handling of infinity a bit inconsistent within cut, since infinity is valid for other ways of using cut. It would also make it harder for users to diagnose whether they're using the wrong combination of parameters with cut, or whether something is wrong with their assumptions about their data (infinity being present when it shouldn't be). Additionally, I'm not aware of any other methods that handle infinity in a similar way (could be forgetting something), so it'd be setting a new precedent within the codebase with regard to infinity.

All that being said, I wouldn't necessarily be opposed to changing the behavior in that regard if a consensus is reached to support it. I think it would warrant a larger discussion though. Feel free to open a new issue regarding the proposed behavior if it's something you feel strongly about, so it has more visibility to the other devs.

Comment From: vaughnkoch

What about having the default be to raise (with an informative error message), but adding an optional parameter to pd.cut that would ignore infs? That way, anyone who uses the option is explicitly saying infs can be ignored, and anyone who hits the error otherwise has an easy change to make instead of having to add an extra line to drop infs.
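A rough sketch of what that proposal could look like as a user-side wrapper follows. The function name `cut_with_inf_option`, the `ignore_inf` flag, and the error text are all hypothetical and are not part of pandas; the sketch only treats infinities as missing when integer bins are requested and the caller opts in.

```python
import numpy as np
import pandas as pd

def cut_with_inf_option(values, bins, ignore_inf=False, **kwargs):
    """Hypothetical wrapper around pd.cut; not part of the pandas API."""
    series = pd.Series(values, dtype="float64")
    has_inf = bool(np.isinf(series).any())
    if has_inf and isinstance(bins, (int, np.integer)):
        if not ignore_inf:
            # Default: fail early with a message that names the cause.
            raise ValueError(
                "integer bins cannot span data containing infinity; "
                "pass explicit bin edges or set ignore_inf=True"
            )
        # Opt-in: treat infinities as missing so they end up unbinned.
        series = series.replace([np.inf, -np.inf], np.nan)
    return pd.cut(series, bins, **kwargs)

data = [0, 1, 2, 3, 4, np.inf]

# Opting in bins the finite values and leaves the inf entry as NaN.
binned = cut_with_inf_option(data, 3, ignore_inf=True)

# Explicit edges that include inf are passed through unchanged.
explicit = cut_with_inf_option(data, [-1, 2, np.inf])
```

Without `ignore_inf=True`, the same call with `bins=3` raises, so the caller still has to make an explicit choice about infinite values.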

Comment From: jschendel

Yes, that would be a more viable alternative. I would still like a new issue opened for it so there can be more discussion. To be honest, I don't think this would be a high-priority item unless it's something you'd be willing to contribute, given that it's a rather specific request that I don't think would be used frequently. Also keep in mind that there's a bit of maintenance burden associated with adding keyword arguments like this: the user workaround is a single line of code, but implementing this in the codebase requires quite a bit more work, and it would need to be maintained going forward.

Comment From: HansBambel

@jschendel Sadly using qcut also results in an error: pd.qcut([1,2,3,4,5,-np.inf, np.inf], q=3, duplicates="drop") results in ValueError: missing values must be missing in the same location both left and right sides

I was expecting the first and last bin to contain np.inf. This was working in pandas 1.1.5.

Similar to https://github.com/pandas-dev/pandas/issues/11113
