Pandas qcut does not handle infinite values correctly

Calling qcut on a pandas Series that contains infinite values should be a well-defined operation, but it tends to produce wrong results or raise non-obvious exceptions. I'm using the following snippet to test:

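import numpy as np
import pandas as pd
n = 1  # number of np.inf values appended to the data
data = list(range(10)) + [np.inf] * n  # list() so this also runs on Python 3
s = pd.Series(data, index=data)
pd.qcut(s, [0.1, 0.9])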

When called with n=1, it produces the following result:

0.000000       NaN
1.000000    [1, 9]
2.000000    [1, 9]
3.000000    [1, 9]
4.000000    [1, 9]
5.000000    [1, 9]
6.000000    [1, 9]
7.000000    [1, 9]
8.000000    [1, 9]
9.000000    [1, 9]
inf            NaN
dtype: category
Categories (1, object): [[1, 9]]

I don't think that the 0 value and the inf should get assigned to NaN bins. When called with n=2, it now produces:

0.000000           NaN
1.000000           NaN
2.000000    [1.1, inf]
3.000000    [1.1, inf]
4.000000    [1.1, inf]
5.000000    [1.1, inf]
6.000000    [1.1, inf]
7.000000    [1.1, inf]
8.000000    [1.1, inf]
9.000000    [1.1, inf]
inf         [1.1, inf]
inf         [1.1, inf]
dtype: category
Categories (1, object): [[1.1, inf]]

Again, the binning looks suspicious to me... And when called with n >= 3, I get the following exception:

TypeError                                 Traceback (most recent call last)
<ipython-input-29-db4904bb94b0> in <module>()
      1 data = range(10) + [np.inf] * 3
      2 s = pd.Series(data, index=data)
----> 3 pd.qcut(s, [0.1, 0.9])

C:\Anaconda\lib\site-packages\pandas\tools\tile.pyc in qcut(x, q, labels, retbins, precision)
    167     bins = algos.quantile(x, quantiles)
    168     return _bins_to_cuts(x, bins, labels=labels, retbins=retbins,precision=precision,
--> 169                          include_lowest=True)
    170 
    171 

C:\Anaconda\lib\site-packages\pandas\tools\tile.pyc in _bins_to_cuts(x, bins, right, labels, retbins, precision, name, include_lowest)
    201                 try:
    202                     levels = _format_levels(bins, precision, right=right,
--> 203                                             include_lowest=include_lowest)
    204                 except ValueError:
    205                     increases += 1

C:\Anaconda\lib\site-packages\pandas\tools\tile.pyc in _format_levels(bins, prec, right, include_lowest)
    240         levels = []
    241         for a, b in zip(bins, bins[1:]):
--> 242             fa, fb = fmt(a), fmt(b)
    243 
    244             if a != b and fa == fb:

C:\Anaconda\lib\site-packages\pandas\tools\tile.pyc in <lambda>(v)
    236 def _format_levels(bins, prec, right=True,
    237                    include_lowest=False):
--> 238     fmt = lambda v: _format_label(v, precision=prec)
    239     if right:
    240         levels = []

C:\Anaconda\lib\site-packages\pandas\tools\tile.pyc in _format_label(x, precision)
    274                     return '%d' % (-whole - 1)
    275                 else:
--> 276                     return '%d' % (whole + 1)
    277 
    278             if 'e' in val:

TypeError: %d format: a number is required, not numpy.float64

... which doesn't look very related to the cause at first glance. What is happening here is that the value passed to _format_label() and then to the % operator is a NaN, which the %d format doesn't support.
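
The formatting failure can be reproduced without pandas. This is a minimal sketch, assuming Python 3, where the exception raised for a plain float NaN is a ValueError rather than the TypeError shown in the Python 2 traceback above. format_whole is a hypothetical helper for illustration, not part of pandas:

```python
nan = float("nan")


def format_whole(value):
    """Format a number with '%d'; report the exception type if that fails."""
    try:
        return "%d" % value
    except (TypeError, ValueError) as exc:
        # NaN has no integer representation, so '%d' cannot format it.
        return "error: %s" % type(exc).__name__


print(format_whole(9.0))  # an ordinary float is truncated and formatted
print(format_whole(nan))  # NaN cannot be converted to an integer
```

So once a NaN ends up among the bin edges, any '%d'-based label formatting is bound to fail, whatever the input data looked like.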

Comment From: ron819

@chrish42 Are you still facing this issue?

Comment From: fangchenli

n = 1
data = list(range(10)) + [np.inf] * n
s = pd.Series(data, index=data)
result = pd.qcut(s, [0.1, 0.9])

result:

/opt/homebrew/Caskroom/miniforge/base/envs/pandas-dev/lib/python3.8/site-packages/numpy/lib/function_base.py:4011: RuntimeWarning: invalid value encountered in multiply
  lerp_interpolation = asanyarray(add(a, diff_b_a*t, out=out))
Traceback (most recent call last):
  File "test.py", line 10, in <module>
    result = pd.qcut(s, [0.1, 0.9])
  File "/Users/fangchenli/Workspace/pandas-fangchenli/pandas/core/reshape/tile.py", line 376, in qcut
    fac, bins = _bins_to_cuts(
  File "/Users/fangchenli/Workspace/pandas-fangchenli/pandas/core/reshape/tile.py", line 441, in _bins_to_cuts
    labels = _format_labels(
  File "/Users/fangchenli/Workspace/pandas-fangchenli/pandas/core/reshape/tile.py", line 583, in _format_labels
    return IntervalIndex.from_breaks(breaks, closed=closed)
  File "/Users/fangchenli/Workspace/pandas-fangchenli/pandas/core/indexes/interval.py", line 255, in from_breaks
    array = IntervalArray.from_breaks(
  File "/Users/fangchenli/Workspace/pandas-fangchenli/pandas/core/arrays/interval.py", line 415, in from_breaks
    return cls.from_arrays(breaks[:-1], breaks[1:], closed, copy=copy, dtype=dtype)
  File "/Users/fangchenli/Workspace/pandas-fangchenli/pandas/core/arrays/interval.py", line 492, in from_arrays
    return cls._simple_new(
  File "/Users/fangchenli/Workspace/pandas-fangchenli/pandas/core/arrays/interval.py", line 335, in _simple_new
    result._validate()
  File "/Users/fangchenli/Workspace/pandas-fangchenli/pandas/core/arrays/interval.py", line 601, in _validate
    raise ValueError(msg)
ValueError: missing values must be missing in the same location both left and right sides
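
The RuntimeWarning hints at where the NaN bin edge comes from. Below is a minimal pure-Python sketch, assuming the quantile is interpolated as a + (b - a) * t between neighbouring sorted values, which is the shape of the lerp_interpolation line in the warning. lerp_quantile is a hypothetical helper, not NumPy's or pandas' actual implementation:

```python
import math

INF = float("inf")


def lerp_quantile(sorted_values, q):
    """Quantile of pre-sorted data by linear interpolation: a + (b - a) * t."""
    position = q * (len(sorted_values) - 1)
    lower = int(math.floor(position))
    upper = min(lower + 1, len(sorted_values) - 1)
    t = position - lower  # fractional distance between the two neighbours
    a, b = sorted_values[lower], sorted_values[upper]
    return a + (b - a) * t


for n in (1, 2, 3):
    data = list(range(10)) + [INF] * n
    print(n, lerp_quantile(data, 0.1), lerp_quantile(data, 0.9))
```

Under this formula the 0.9 edge is NaN for n=1 (the neighbours are 9 and inf with t = 0, and inf * 0 is NaN), inf for n=2, and NaN for n >= 3 (both neighbours are inf, and inf - inf is NaN). The old pandas output in the original report shows 9 as the upper edge for n=1, so that version evidently computed the quantile differently.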

Comment From: HansBambel

I am facing the same issue: pd.qcut([1,2,3,4,5,-np.inf, np.inf], q=3, duplicates="drop") results in ValueError: missing values must be missing in the same location both left and right sides

I was expecting the first and last bins to contain the infinite values. This worked in pandas 1.1.5.