Code Sample
import pandas as pd
import numpy as np
pd.cut(np.array([1, 7, 5, 4, 6, 3]), bins=[0, 3, 6, 8], include_lowest=True)
Problem description
Just by setting include_lowest to True, the data type of the interval changes from int64 to float64, and the first interval isn't left-inclusive. Here is the wrong output that you'll get:
[(-0.001, 3.0], (6.0, 8.0], (3.0, 6.0], (3.0, 6.0], (3.0, 6.0], (-0.001, 3.0]]
Categories (3, interval[float64]): [(-0.001, 3.0] < (3.0, 6.0] < (6.0, 8.0]]
Expected Output
[(0, 3], (6, 8], (3, 6], (3, 6], (3, 6], (0, 3]]
Categories (3, interval[int64]): [[0, 3] < (3, 6] < (6, 8]]
Output of pd.show_versions()
Comment From: WWH98932
I hit the same problem. I worked around it by converting the intervals to strings and fixing the first bound manually.
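One possible reading of that workaround, as a minimal sketch: instead of post-processing the interval labels, pass hand-built string labels to pd.cut so the first bin is displayed as closed on both sides. The names values, edges, labels and binned are illustrative, and this is not necessarily the code the commenter used.

```python
import numpy as np
import pandas as pd

values = np.array([1, 7, 5, 4, 6, 3])
edges = [0, 3, 6, 8]

# Build one string label per bin from the original integer edges.
labels = [f"({lo}, {hi}]" for lo, hi in zip(edges[:-1], edges[1:])]
# Fix the first bound by hand so the label shows the bin as left-inclusive.
labels[0] = f"[{edges[0]}, {edges[1]}]"

# With explicit labels, pd.cut returns a Categorical of strings
# rather than Interval objects, so no float edges appear in the output.
binned = pd.cut(values, bins=edges, include_lowest=True, labels=labels)
print(list(binned))
# ['[0, 3]', '(6, 8]', '(3, 6]', '(3, 6]', '(3, 6]', '[0, 3]']
```

The trade-off is that the result no longer carries Interval objects, so interval arithmetic on the categories is lost.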
Comment From: mroeschke
I think this is the documented behavior as described in the bins section of the docstring:
The criteria to bin by. int : Defines the number of equal-width bins in the range of x. The range of x is extended by .1% on each side to include the minimum and maximum values of x.
This situation could be better documented though
Comment From: evarga
Thanks for the comment! Please also see the issue with the altered data type of the intervals (see my original comment above). As far as I know, this behavior is not documented anywhere, and it could be a consequence of the .1% extension you cited.
Comment From: Nakai-Naoto
The criteria to bin by.
* int : Defines the number of equal-width bins in the range of x. The range of x is extended by .1% on each side to include the minimum and maximum values of x.
* sequence of scalars : Defines the bin edges allowing for non-uniform width. No extension of the range of x is done.
The behavior here does not seem to be documented: bins is a sequence of scalars in the code sample, and for that case the docstring says no extension of the range is done.
Comment From: gdex1
I've looked more into how to resolve this and it seems this is more a restriction of IntervalIndex than a documentation error. IntervalIndex requires that all bins are closed on the same side so
(3, interval[int64]): [[0, 3] < (3, 6] < (6, 8]]
is not valid.
This is solved by slightly reducing the lower bound of the first interval and making it lower exclusive. Unfortunately, this converts the interval from int64 to float64 and creates the following unexpected behavior:
In:
pd.cut(np.array([-0.0001, 0, 1, 7, 5, 4, 6, 3, 8]), bins=[0, 3, 6, 8], include_lowest=True)
Out:
[NaN, (-0.001, 3.0], (-0.001, 3.0], (6.0, 8.0], (3.0, 6.0], (3.0, 6.0], (3.0, 6.0], (-0.001, 3.0], (6.0, 8.0]]
Categories (3, interval[float64]): [(-0.001, 3.0] < (3.0, 6.0] < (6.0, 8.0]]
You can see that even though -0.0001 lies inside the displayed first interval (-0.001, 3.0], it is not assigned to the first bin.
I am new to this so I was wondering what the best way to handle this is?
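The single-closure restriction described in this comment can be seen directly on IntervalIndex, which carries one closed attribute for the whole index. A minimal sketch, using the bin edges from the code sample; the variable names are illustrative.

```python
import pandas as pd

edges = [0, 3, 6, 8]

# Every interval in an IntervalIndex shares the same closed side.
right_closed = pd.IntervalIndex.from_breaks(edges, closed="right")
both_closed = pd.IntervalIndex.from_breaks(edges, closed="both")

print(right_closed.closed)    # right
print(0 in right_closed[0])   # False: (0, 3] excludes the lowest edge
print(0 in both_closed[0])    # True: [0, 3] includes it, but then every bin is closed on both sides
```

So a mix such as [0, 3] followed by (3, 6] cannot be expressed as a single IntervalIndex, which matches the explanation given above.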
Comment From: jotasi
As suggested in #42212 (@mroeschke suggested to include it into the discussion here instead), I would love to also have the option to include not only the left-most boundary for left-open intervals but also the right-most boundary for right-open intervals (i.e. for right=False).
My preferred solution would be to change include_lowest to something like final_interval_closed, which is also functional for right-open intervals (i.e. if right=False is specified).
Maybe, if the solution to this issue is to adapt the function or the underlying IntervalIndex instead of updating the documentation, this could be kept in mind and implemented as well.
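For context, a small sketch of the right=False case this comment refers to. With left-closed bins the top edge falls outside every bin, and, as far as I can tell from the current implementation, include_lowest=True does not change that. The variable names are illustrative.

```python
import pandas as pd

edges = [0, 3, 6, 8]

# right=False gives the bins [0, 3), [3, 6), [6, 8): the top edge 8 is excluded.
left_closed = pd.cut([0, 3, 8], bins=edges, right=False)
# A code of -1 marks a value that was not assigned to any bin.
print(list(left_closed.codes))  # [0, 1, -1]

# include_lowest only concerns the lowest edge, so 8 is still unassigned.
with_lowest = pd.cut([0, 3, 8], bins=edges, right=False, include_lowest=True)
print(list(with_lowest.codes))
```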
Comment From: edwhuang23
take
Comment From: bluenote10
Defines the number of equal-width bins in the range of x. The range of x is extended by .1% on each side to include the minimum and maximum values of x.
This sounds like it depends on the minimum of x, but this doesn't seem to be the case, no?
The current behavior raises a number of questions:
- Why does it not simply use a left-closed interval as the first bin instead of the awkward extension? Isn't that what they are for?
- How can I inclusively bin -np.inf? As far as I can see this cannot be counted correctly with extending.
- What if I want to recover the original bin edges from the result? Using include_lowest currently messes this up, because one would have to implement the inverse logic of how it got extended, which is super fragile.
For these reasons it would be great if include_lowest did not rely on extending, no?
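To make the questions above concrete, here is a rough sketch of binning with a genuinely left-closed first bin and no edge extension, written with NumPy only. The helper bin_codes_include_lowest is hypothetical and not part of pandas; it returns integer bin codes, with -1 for values outside all bins.

```python
import numpy as np

def bin_codes_include_lowest(values, edges):
    """Hypothetical helper: right-closed bins, first bin also closed on the left."""
    values = np.asarray(values, dtype=float)
    edges = np.asarray(edges, dtype=float)
    # side="left" finds i with edges[i-1] < v <= edges[i]; subtract 1 for the bin code.
    codes = np.searchsorted(edges, values, side="left") - 1
    # Values equal to the lowest edge would get -1; put them in the first bin instead.
    codes[values == edges[0]] = 0
    # Values above the top edge, and NaN, belong to no bin.
    codes[(values > edges[-1]) | np.isnan(values)] = -1
    return codes

print(bin_codes_include_lowest([-0.0001, 0, 1, 7, 5, 4, 6, 3, 8], [0, 3, 6, 8]).tolist())
# [-1, 0, 0, 2, 1, 1, 1, 0, 2]

print(bin_codes_include_lowest([-np.inf, -1, np.inf], [-np.inf, 0, np.inf]).tolist())
# [0, 0, 1]
```

Because the edges are never modified, the original bin edges stay available as passed in, and -np.inf as the lowest edge is handled by the equality check rather than by arithmetic on the edge.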
Comment From: AshaHolla
take
Comment From: AshaHolla
In which file and folder can I find the code for this issue? It's my first time contributing.
Comment From: bluenote10
cut is in tile.py.
But as stated above, the issue should probably rather be classified as "bug" instead of "documentation", i.e., it would be better to fix the code rather than adapting the documentation to the problematic semantics of the code.
Comment From: hualiu01
take
Comment From: palbha
@rhshadrach - Could you please advise whether this needs a documentation fix or a bug fix, so I can take it accordingly?
Comment From: rhshadrach
@palbha - from reading over the issue, it's not clear to me what the best path forward is. I currently don't have the bandwidth to invest more time in it.