Calling qcut on a pandas Series that contains infinite values should be a well-defined operation, but it tends to produce wrong results or raise non-obvious exceptions. I'm using the following snippet to test:
import numpy as np
import pandas as pd
data = list(range(10)) + [np.inf] * n  # n: number of infinite values appended
s = pd.Series(data, index=data)
pd.qcut(s, [0.1, 0.9])
When called with n=1, it produces the following result:
0.000000 NaN
1.000000 [1, 9]
2.000000 [1, 9]
3.000000 [1, 9]
4.000000 [1, 9]
5.000000 [1, 9]
6.000000 [1, 9]
7.000000 [1, 9]
8.000000 [1, 9]
9.000000 [1, 9]
inf NaN
dtype: category
Categories (1, object): [[1, 9]]
I don't think that the 0 value and the inf should get assigned to NaN bins. When called with n=2, it now produces:
0.000000 NaN
1.000000 NaN
2.000000 [1.1, inf]
3.000000 [1.1, inf]
4.000000 [1.1, inf]
5.000000 [1.1, inf]
6.000000 [1.1, inf]
7.000000 [1.1, inf]
8.000000 [1.1, inf]
9.000000 [1.1, inf]
inf [1.1, inf]
inf [1.1, inf]
dtype: category
Categories (1, object): [[1.1, inf]]
Again, the binning looks suspicious to me... And when called with n >= 3, I get the following exception:
TypeError Traceback (most recent call last)
<ipython-input-29-db4904bb94b0> in <module>()
1 data = range(10) + [np.inf] * 3
2 s = pd.Series(data, index=data)
----> 3 pd.qcut(s, [0.1, 0.9])
C:\Anaconda\lib\site-packages\pandas\tools\tile.pyc in qcut(x, q, labels, retbins, precision)
167 bins = algos.quantile(x, quantiles)
168 return _bins_to_cuts(x, bins, labels=labels, retbins=retbins,precision=precision,
--> 169 include_lowest=True)
170
171
C:\Anaconda\lib\site-packages\pandas\tools\tile.pyc in _bins_to_cuts(x, bins, right, labels, retbins, precision, name, include_lowest)
201 try:
202 levels = _format_levels(bins, precision, right=right,
--> 203 include_lowest=include_lowest)
204 except ValueError:
205 increases += 1
C:\Anaconda\lib\site-packages\pandas\tools\tile.pyc in _format_levels(bins, prec, right, include_lowest)
240 levels = []
241 for a, b in zip(bins, bins[1:]):
--> 242 fa, fb = fmt(a), fmt(b)
243
244 if a != b and fa == fb:
C:\Anaconda\lib\site-packages\pandas\tools\tile.pyc in <lambda>(v)
236 def _format_levels(bins, prec, right=True,
237 include_lowest=False):
--> 238 fmt = lambda v: _format_label(v, precision=prec)
239 if right:
240 levels = []
C:\Anaconda\lib\site-packages\pandas\tools\tile.pyc in _format_label(x, precision)
274 return '%d' % (-whole - 1)
275 else:
--> 276 return '%d' % (whole + 1)
277
278 if 'e' in val:
TypeError: %d format: a number is required, not numpy.float64
... which doesn't look related to the cause at first glance. What is happening here is that the value passed to _format_label() and then to the % operator is a NaN, which the %d format doesn't support.
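The failure at the formatting step can be reproduced without pandas. This is a minimal sketch: format_whole is a hypothetical stand-in for the '%d' call in _format_label(), not the pandas code itself. On Python 3 the exception for a plain float NaN is a ValueError rather than the TypeError shown in the traceback above, so both are caught here.

```python
def format_whole(value):
    """Hypothetical stand-in for the '%d' formatting step in the traceback."""
    try:
        return '%d' % value
    except (TypeError, ValueError) as exc:
        # NaN cannot be converted to an integer, so '%d' rejects it
        return 'failed: %s' % type(exc).__name__


print(format_whole(9.0))           # a finite bin edge formats fine
print(format_whole(float('nan')))  # a NaN bin edge does not
```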
Comment From: ron819
@chrish42 Are you still facing this issue?
Comment From: fangchenli
n = 1
data = list(range(10)) + [np.inf] * n
s = pd.Series(data, index=data)
result = pd.qcut(s, [0.1, 0.9])
result:
/opt/homebrew/Caskroom/miniforge/base/envs/pandas-dev/lib/python3.8/site-packages/numpy/lib/function_base.py:4011: RuntimeWarning: invalid value encountered in multiply
lerp_interpolation = asanyarray(add(a, diff_b_a*t, out=out))
Traceback (most recent call last):
File "test.py", line 10, in <module>
result = pd.qcut(s, [0.1, 0.9])
File "/Users/fangchenli/Workspace/pandas-fangchenli/pandas/core/reshape/tile.py", line 376, in qcut
fac, bins = _bins_to_cuts(
File "/Users/fangchenli/Workspace/pandas-fangchenli/pandas/core/reshape/tile.py", line 441, in _bins_to_cuts
labels = _format_labels(
File "/Users/fangchenli/Workspace/pandas-fangchenli/pandas/core/reshape/tile.py", line 583, in _format_labels
return IntervalIndex.from_breaks(breaks, closed=closed)
File "/Users/fangchenli/Workspace/pandas-fangchenli/pandas/core/indexes/interval.py", line 255, in from_breaks
array = IntervalArray.from_breaks(
File "/Users/fangchenli/Workspace/pandas-fangchenli/pandas/core/arrays/interval.py", line 415, in from_breaks
return cls.from_arrays(breaks[:-1], breaks[1:], closed, copy=copy, dtype=dtype)
File "/Users/fangchenli/Workspace/pandas-fangchenli/pandas/core/arrays/interval.py", line 492, in from_arrays
return cls._simple_new(
File "/Users/fangchenli/Workspace/pandas-fangchenli/pandas/core/arrays/interval.py", line 335, in _simple_new
result._validate()
File "/Users/fangchenli/Workspace/pandas-fangchenli/pandas/core/arrays/interval.py", line 601, in _validate
raise ValueError(msg)
ValueError: missing values must be missing in the same location both left and right sides
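The RuntimeWarning in this traceback points at a linear-interpolation step. A minimal sketch, assuming the quantile is interpolated as a + (b - a) * t (lerp below is a hypothetical stand-in, not NumPy's implementation), shows how an infinite neighbour can turn a bin edge into NaN. A NaN edge would then plausibly explain the "missing values" complaint from the interval validation.

```python
import math


def lerp(a, b, t):
    # Assumed interpolation form: a + (b - a) * t
    return a + (b - a) * t


# Finite neighbours interpolate as expected.
print(lerp(8.0, 9.0, 0.5))

# One infinite neighbour with weight 0: inf * 0 is NaN under IEEE 754.
print(lerp(9.0, math.inf, 0.0))

# Two infinite neighbours: inf - inf is NaN as well.
print(lerp(math.inf, math.inf, 0.5))
```

Plain Python floats produce the NaN silently; NumPy reports the same arithmetic as "invalid value encountered".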
Comment From: HansBambel
I am facing the same issue:
pd.qcut([1,2,3,4,5,-np.inf, np.inf], q=3, duplicates="drop")
results in ValueError: missing values must be missing in the same location both left and right sides
I was expecting the first and last bin to contain np.inf. This was working in pandas 1.1.5
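One possible workaround, sketched under assumptions rather than offered as a fix: compute the interior quantile edges from the finite values only, then pass explicit edges bounded by -inf and inf to pd.cut. The helper name qcut_with_inf is hypothetical. Because the quantiles ignore the infinite entries, the bin boundaries can differ from what qcut would compute over the full data.

```python
import numpy as np
import pandas as pd


def qcut_with_inf(values, q):
    """Hypothetical helper: quantile-bin values, routing +/-inf to the outer bins."""
    arr = np.asarray(values, dtype=float)
    finite = arr[np.isfinite(arr)]
    # Assumption: interior edges come from the finite values only
    inner = np.quantile(finite, np.linspace(0, 1, q + 1)[1:-1])
    edges = np.concatenate(([-np.inf], inner, [np.inf]))
    # include_lowest=True so that a value equal to the lowest edge (-inf) is binned
    return pd.cut(arr, bins=edges, include_lowest=True)


binned = qcut_with_inf([1, 2, 3, 4, 5, -np.inf, np.inf], q=3)
print(binned)
```

With this input, -inf lands in the first bin and inf in the last. Duplicate interior edges would still raise in pd.cut unless duplicates="drop" is passed through.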