Pandas BUG: Index containing NA returns wrong positions or raises when length exceeds 128

Pandas version checks

[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[ ] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

# OK:
n, val = 127, pd.NA
idx = pd.Index(range(n), dtype="Int64").union(pd.Index([val], dtype="Int64"))
s = pd.Series(index=idx, data=range(n+1), dtype="Int64")
s.drop(0)

# Still OK:
n, val = 128, 128
idx = pd.Index(range(n), dtype="Int64").union(pd.Index([val], dtype="Int64"))
s = pd.Series(index=idx, data=range(n+1), dtype="Int64")
s.drop(0)

# But this FAILS:
n, val = 128, pd.NA
idx = pd.Index(range(n), dtype="Int64").union(pd.Index([val], dtype="Int64"))
s = pd.Series(index=idx, data=range(n+1), dtype="Int64")
s.drop(0)  # ValueError: 'indices' contains values less than allowed (-128 < -1)
# Expected no error

Workaround: to filter out elements, use boolean-mask indexing instead of s.drop():

s[~s.index.isin([0])]

Issue Description

When an Index contains NA and its length exceeds 128, looking up NA returns a wrong (negative) position, so methods that rely on that lookup, such as Series.drop(), raise an error.

This bug can be narrowed down to IndexEngine.get_indexer() or MaskedIndexEngine.get_indexer(), as these examples suggest:

axis = pd.Index(range(250), dtype='Int64').union(pd.Index([pd.NA], dtype='Int64'))
new_axis = axis.drop(0)
axis.get_indexer(new_axis)[-5:] # array([246, 247, 248, 249,  -6])
# Expected array([246, 247, 248, 249,  250])

axis = pd.Index(range(254), dtype='Int64').union(pd.Index([pd.NA], dtype='Int64'))
new_axis = axis.drop(0)
axis.get_indexer(new_axis)[-5:] # array([250, 251, 252, 253,  -2])
# Expected  array([250, 251, 252, 253,  254])

These examples further suggest that the root cause lies in how NA is represented in, and interacts with, the hash tables that Index uses for its _engine.
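
The wrong values in the examples above are consistent with the position of NA being squeezed into a signed 8-bit integer somewhere along the way. This is an assumption read off the numbers in this report, not something confirmed against the pandas source here. A minimal pure-Python sketch of that wraparound (wrap_int8 is a hypothetical helper written for illustration, not a pandas function):

```python
def wrap_int8(position: int) -> int:
    """Reinterpret a non-negative position as a signed 8-bit integer.

    Hypothetical helper: mimics two's-complement wraparound into [-128, 127].
    """
    return (position + 128) % 256 - 128


# Position of NA in each example from this report (NA ends up last in the Index).
for na_position in (127, 128, 250, 254):
    print(na_position, "->", wrap_int8(na_position))

# 127 -> 127    (fits, so the first example works)
# 128 -> -128   (matches "-128 < -1" in the ValueError)
# 250 -> -6     (matches the first get_indexer example)
# 254 -> -2     (matches the second get_indexer example)
```

If the assumption holds, any Index in which NA sits at position 128 or later would be affected, which would explain why the length threshold is 128.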

Expected Behavior

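s.drop(0) should not raise, and get_indexer() should return the actual position of NA (250 and 254 in the examples above).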

Installed Versions

commit : 76c7274985215c487248fa5640e12a9b32a06e8c
python : 3.11.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.216-204.855.amzn2.x86_64
Version : #1 SMP Sat May 4 16:53:27 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 3.0.0.dev0+1067.g76c7274985
numpy : 1.26.4
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 65.5.0
pip : 23.2.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pyarrow : 15.0.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.4
qtpy : None
pyqt5 : None

Comment From: avm19

take

Comment From: avm19

The bug is likely due to this line: https://github.com/pandas-dev/pandas/blob/3979e954a339db9fc5e99b72ccb5ceda081c33e5/pandas/_libs/hashtable_class_helper.pxi.in#L538

Should be {{c_type}} na_position = self.na_position. Will confirm soon.