Code Sample

import numpy as np
import pandas as pd
series = pd.Series([0, 1, 2, 3, 4, np.nan, 6, 7], dtype='Int64')
breaks = [0, 2, 4, 6, 8]

breaks_cut = pd.cut(series, breaks)
breaks_cut
0           NaN
1    (0.0, 2.0]
2    (0.0, 2.0]
3    (2.0, 4.0]
4    (2.0, 4.0]
5           NaN
6    (0.0, 2.0]
7    (6.0, 8.0]
dtype: category
Categories (4, interval[int64]): [(0, 2] < (2, 4] < (4, 6] < (6, 8]]

Problem Description

When the series uses the nullable integer dtype 'Int64', pd.cut() unexpectedly bins the first non-missing value after an np.nan into the lowest interval. In the example above, the number 6 is binned into (0.0, 2.0] instead of (4.0, 6.0].
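A possible workaround, offered only as a sketch: cast the series to float64 before binning so that pd.cut() sees an ordinary float column with NaN. This assumes the float code path is unaffected, which the report does not state explicitly, and the variable name float_cut is introduced here for illustration.

```python
import numpy as np
import pandas as pd

series = pd.Series([0, 1, 2, 3, 4, np.nan, 6, 7], dtype="Int64")
breaks = [0, 2, 4, 6, 8]

# Assumption: converting to float64 turns the missing entry into NaN,
# so pd.cut() no longer has to handle the nullable integer array.
float_cut = pd.cut(series.astype("float64"), breaks)
```

The trade-off is that the binned input is no longer an integer column, which only matters if the original dtype is needed downstream.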

Expected Output

0           NaN
1    (0.0, 2.0]
2    (0.0, 2.0]
3    (2.0, 4.0]
4    (2.0, 4.0]
5           NaN
6    (4.0, 6.0]
7    (6.0, 8.0]
dtype: category
Categories (4, interval[int64]): [(0, 2] < (2, 4] < (4, 6] < (6, 8]]

Note that passing an IntervalIndex as the bins, instead of a list of break points, produces the expected output.

import numpy as np
import pandas as pd
series = pd.Series([0, 1, 2, 3, 4, np.nan, 6, 7], dtype='Int64')
breaks = [0, 2, 4, 6, 8]
intervals = [pd.Interval(x, y) for x, y in zip(breaks[:-1], breaks[1:])]
interval_index = pd.IntervalIndex(intervals)

interval_cut = pd.cut(series, interval_index)
interval_cut
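For reference, the same IntervalIndex can probably be built more compactly with pd.IntervalIndex.from_breaks, which creates right-closed intervals by default. This is a sketch of an equivalent construction, not part of the original report.

```python
import numpy as np
import pandas as pd

series = pd.Series([0, 1, 2, 3, 4, np.nan, 6, 7], dtype="Int64")
breaks = [0, 2, 4, 6, 8]

# from_breaks pairs consecutive break points into (left, right] intervals,
# matching the list comprehension over pd.Interval used above.
interval_index = pd.IntervalIndex.from_breaks(breaks)

interval_cut = pd.cut(series, interval_index)
```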

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.7.6.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.0.0-37-generic
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 0.25.3
numpy            : 1.17.3
pytz             : 2019.3
dateutil         : 2.8.1
pip              : 19.3.1
setuptools       : 44.0.0.post20200102
Cython           : None
pytest           : 5.3.2
hypothesis       : None
sphinx           : 2.3.1
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.4.2
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 2.10.3
IPython          : 7.11.1
pandas_datareader: None
bs4              : 4.8.2
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : 4.4.2
matplotlib       : 3.1.2
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pytables         : None
s3fs             : None
scipy            : 1.4.1
sqlalchemy       : None
tables           : None
xarray           : None
xlrd             : None
xlwt             : None
xlsxwriter       : None

Comment From: jschendel

Looks like there's an additional complication on master with the introduction of pd.NA and its use as the default NA for nullable integer dtypes:

In [1]: import numpy as np; import pandas as pd; pd.__version__
Out[1]: '0.26.0.dev0+1674.gbe1556c46'

In [2]: series = pd.Series([0, 1, 2, 3, 4, np.nan, 6, 7], dtype='Int64')

In [3]: breaks = [0, 2, 4, 6, 8]

In [4]: pd.cut(series, breaks)
---------------------------------------------------------------------------
TypeError: boolean value of NA is ambiguous

Comment From: parkerdgabel

take

Comment From: parkerdgabel

What should the expected behavior be?

Comment From: jschendel

We probably need a more general fix for #30944 before attempting to resolve the incorrect output here. It's possible that fixing #30944 will also fix this, in which case we'd just want to add some tests here.

Comment From: phofl

This works on main, may need tests

0           NaN
1    (0.0, 2.0]
2    (0.0, 2.0]
3    (2.0, 4.0]
4    (2.0, 4.0]
5           NaN
6    (4.0, 6.0]
7    (6.0, 8.0]
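A regression test for this could look roughly like the sketch below. The test name is hypothetical, the values come from the original report, and the checks cover only the positions discussed in this thread rather than the full range of cut() options.

```python
import numpy as np
import pandas as pd


def test_cut_nullable_int_after_missing():
    # Hypothetical test name; input taken from the report above.
    series = pd.Series([0, 1, 2, 3, 4, np.nan, 6, 7], dtype="Int64")
    breaks = [0, 2, 4, 6, 8]

    result = pd.cut(series, breaks)

    # 0 falls outside the right-closed bins and position 5 is missing.
    assert result.isna().tolist() == [
        True, False, False, False, False, True, False, False
    ]
    # The value after the missing entry must land in (4, 6], not (0, 2].
    assert result[6] == pd.Interval(4, 6)
    assert result[7] == pd.Interval(6, 8)
```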

Comment From: kkangs0226

take