Pandas pandas.cut: the 'include_lowest' argument isn't behaving as documented

Code Sample

import pandas as pd
import numpy as np

pd.cut(np.array([1, 7, 5, 4, 6, 3]), bins=[0, 3, 6, 8], include_lowest=True)

Problem description

Setting include_lowest=True changes the dtype of the intervals from int64 to float64, and the first interval is still not left-inclusive. Instead, its lower bound is shifted slightly below 0. This is the output you get:

[(-0.001, 3.0], (6.0, 8.0], (3.0, 6.0], (3.0, 6.0], (3.0, 6.0], (-0.001, 3.0]]
Categories (3, interval[float64]): [(-0.001, 3.0] < (3.0, 6.0] < (6.0, 8.0]]

Expected Output

[(0, 3], (6, 8], (3, 6], (3, 6], (3, 6], (0, 3]]
Categories (3, interval[int64]): [[0, 3] < (3, 6] < (6, 8]]

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.23.4
pytest: 3.8.2
pip: 10.0.1
setuptools: 40.4.3
Cython: 0.28.5
numpy: 1.15.2
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.0.1
sphinx: 1.8.1
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.0
openpyxl: 2.5.8
xlrd: 1.1.0
xlwt: 1.2.0
xlsxwriter: 1.1.1
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.12
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Comment From: WWH98932

I ran into the same problem. As a workaround, I converted the intervals to strings and fixed the first bound manually.
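One way to get string labels with a corrected first bound, without post-processing the intervals, is to pass hand-written labels to pd.cut. This is only a sketch of that idea and not necessarily what was done above; the label strings are made up for illustration.

```python
import numpy as np
import pandas as pd

values = np.array([1, 7, 5, 4, 6, 3])
edges = [0, 3, 6, 8]

# Hand-written labels (an assumption for illustration): the first bin is
# spelled as closed on both sides, the others as right-closed.
labels = ["[0, 3]", "(3, 6]", "(6, 8]"]

# include_lowest=True still decides which bin the value 0 falls into;
# the labels only control how the bins are displayed.
binned = pd.cut(values, bins=edges, include_lowest=True, labels=labels)
print(list(binned))
```

The result is a categorical of plain strings, so the Interval objects (and their methods) are no longer available.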

Comment From: mroeschke

I think this is the documented behavior as described in the bins section of the docstring:

The criteria to bin by. int : Defines the number of equal-width bins in the range of x. The range of x is extended by .1% on each side to include the minimum and maximum values of x.

This situation could be better documented, though.

Comment From: evarga

Thanks for the comment! Please also see the issue with the altered data type of the intervals (see my original comment above). As far as I know, this behavior is not documented anywhere, and it could be a consequence of the .1% extension you cited.

Comment From: Nakai-Naoto

The criteria to bin by.
* int : Defines the number of equal-width bins in the range of x. The range of x is extended by .1% on each side to include the minimum and maximum values of x.
* sequence of scalars : Defines the bin edges allowing for non-uniform width. No extension of the range of x is done.

So this behavior does not seem to be documented: in the code sample, bins is a sequence of scalars, and for that case the docstring says no extension is done.

Comment From: gdex1

I've looked more into how to resolve this and it seems this is more a restriction of IntervalIndex than a documentation error. IntervalIndex requires that all bins are closed on the same side so (3, interval[int64]): [[0, 3] < (3, 6] < (6, 8]] is not valid.
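The restriction can be seen directly on IntervalIndex. The snippet below is a small sketch: it builds an index from the edges used in this issue and then tries to mix a both-closed interval with a right-closed one, which I expect to be rejected with a ValueError.

```python
import pandas as pd

# All intervals of an IntervalIndex share a single `closed` setting.
idx = pd.IntervalIndex.from_breaks([0, 3, 6, 8], closed="right")
print(idx.closed)

# Mixing intervals that are closed on different sides is rejected.
first = pd.Interval(0, 3, closed="both")
second = pd.Interval(3, 6, closed="right")
try:
    pd.IntervalIndex([first, second])
    mixed_rejected = False
except ValueError as err:
    mixed_rejected = True
    print(err)
```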

This is solved by slightly reducing the lower bound of the first interval and making it lower exclusive. Unfortunately, this converts the interval from int64 to float64 and creates the following unexpected behavior:

In: 
pd.cut(np.array([-0.0001, 0, 1, 7, 5, 4, 6, 3, 8]), bins=[0, 3, 6, 8], include_lowest=True)

Out: 
[NaN, (-0.001, 3.0], (-0.001, 3.0], (6.0, 8.0], (3.0, 6.0], (3.0, 6.0], (3.0, 6.0], (-0.001, 3.0], (6.0, 8.0]]
Categories (3, interval[float64]): [(-0.001, 3.0] < (3.0, 6.0] < (6.0, 8.0]]

You can see that even though -0.0001 lies inside the displayed first interval (-0.001, 3.0], it is assigned NaN instead of the first bin.

I am new to this so I was wondering what the best way to handle this is?
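The mismatch between the displayed label and the actual assignment can also be checked programmatically. A minimal sketch, assuming the default precision of 3 so that the first label is (-0.001, 3.0]:

```python
import numpy as np
import pandas as pd

values = np.array([-0.0001, 0.0])
result = pd.cut(values, bins=[0, 3, 6, 8], include_lowest=True)

# The first category is the interval shown in the output.
first_label = result.categories[0]

# The label claims to contain -0.0001 ...
label_contains_value = -0.0001 in first_label
# ... but the value itself was not assigned to any bin.
value_is_missing = pd.isna(result[0])

print(first_label, label_contains_value, value_is_missing)
```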

Comment From: jotasi

As suggested in #42212 (@mroeschke suggested including it in the discussion here instead), I would love to also have the option to include not only the left-most boundary for left-open intervals but also the right-most boundary for right-open intervals (i.e. for right=False).

My preferred solution would be to change include_lowest to something like final_interval_closed, which is also functional for right-open intervals (i.e. if right=False is specified).

Maybe, if the solution to this issue is to adapt the function or the underlying IntervalIndex instead of updating the documentation, this could be kept in mind and implemented as well.
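To make the requested semantics concrete, here is a rough NumPy sketch of left-closed binning where the final bin also includes the right-most edge. It is not pandas code; the helper name bin_codes_last_closed is hypothetical, and it returns integer bin codes with -1 for values outside the edges.

```python
import numpy as np

def bin_codes_last_closed(values, edges):
    """Hypothetical helper: bins are [e0, e1), [e1, e2), ..., [e_{n-1}, e_n]."""
    values = np.asarray(values, dtype=float)
    edges = np.asarray(edges, dtype=float)

    # side="right" puts a value equal to an edge into the bin starting there.
    codes = np.searchsorted(edges, values, side="right") - 1

    # Close the final interval: the right-most edge belongs to the last bin.
    codes[values == edges[-1]] = len(edges) - 2

    # Values outside the edges, and NaN, get no bin.
    outside = (values < edges[0]) | (values > edges[-1]) | np.isnan(values)
    codes[outside] = -1
    return codes

codes = bin_codes_last_closed([0, 3, 6, 8, 8.5, -1, np.nan], [0, 3, 6, 8])
print(codes)
```

Here 8 lands in the last bin (code 2) instead of being dropped, which is what pd.cut with right=False does not currently offer.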

Comment From: edwhuang23

take

Comment From: bluenote10

Defines the number of equal-width bins in the range of x. The range of x is extended by .1% on each side to include the minimum and maximum values of x.

This sounds like it depends on the minimum of x, but this doesn't seem to be the case, no?

The current behavior raises a number of questions:
- Why does it not simply use a left-closed interval as the first bin instead of the awkward extension? Isn't that what they are for?
- How can I inclusively bin -np.inf? As far as I can see, this cannot be counted correctly by extending the bound.
- What if I want to recover the original bin edges from the result? Using include_lowest currently messes this up, because one would have to implement the inverse of the extension logic, which is super fragile.

For these reasons it would be great if include_lowest did not rely on extending the lower bound at all, no?
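Until that changes, one way to sidestep the adjusted labels is to ask pd.cut for integer bin codes (labels=False) together with the edges (retbins=True). This is a sketch, on the assumption that the returned edges are the ones passed in when bins is a sequence; that is worth verifying against the installed pandas version.

```python
import numpy as np
import pandas as pd

values = np.array([0, 1, 7, 5, 4, 6, 3])
edges = [0, 3, 6, 8]

# labels=False returns integer bin codes instead of Interval labels,
# retbins=True additionally returns the bin edges that were used.
codes, returned_edges = pd.cut(
    values, bins=edges, include_lowest=True, labels=False, retbins=True
)

print(codes)           # bin index per value; 0 is counted in the first bin
print(returned_edges)  # edges to compare against the ones passed in
```

With the codes and the original edges in hand, labels can be formatted in whatever way is needed without reverse-engineering the extension.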

Comment From: AshaHolla

take

Comment From: AshaHolla

In which file and folder can I find the code for this issue? It's my first time contributing.

Comment From: bluenote10

cut is in tile.py.

But as stated above, the issue should probably rather be classified as "bug" instead of "documentation", i.e., it would be better to fix the code rather than adapting the documentation to the problematic semantics of the code.

Comment From: hualiu01

take

Comment From: palbha

@rhshadrach - Could you please advise whether this needs a documentation fix or a bug fix, so I can take it accordingly?

Comment From: rhshadrach

@palbha - from reading over the issue, it's not clear to me what the best path forward is. I currently don't have the bandwidth to invest more time in it.
