Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [ ] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import numpy as np
import pandas as pd

# 4,261,028,590 rows: more than the 2**31 - 1 maximum of a signed 32-bit int
test = pd.DataFrame(
    {'count': np.random.randint(0, 100, size=4261028590)},
    index=pd.DatetimeIndex(np.empty(4261028590)),
)

stripped = test[test['count'] > 0]
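A rough memory estimate for this reproducer, assuming the default int64 values and a datetime64[ns] index (an approximation; intermediate copies during construction will need more):

n = 4_261_028_590
bytes_per_value = 8  # int64 values and datetime64[ns] index entries
print(2 * n * bytes_per_value / 1e9)  # ~68 GB for the data alone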

Issue Description

When working with a very large DataFrame (4,261,028,590 rows), boolean filtering fails in pandas._libs.lib.maybe_indices_to_slice with "OverflowError: value too large to convert to int":

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/anaconda3/lib/python3.12/site-packages/pandas/core/frame.py", line 4093, in __getitem__
    return self._getitem_bool_array(key)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.12/site-packages/pandas/core/frame.py", line 4155, in _getitem_bool_array
    return self._take_with_is_copy(indexer, axis=0)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.12/site-packages/pandas/core/generic.py", line 4153, in _take_with_is_copy
    result = self.take(indices=indices, axis=axis)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.12/site-packages/pandas/core/generic.py", line 4133, in take
    new_data = self._mgr.take(
               ^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.12/site-packages/pandas/core/internals/managers.py", line 893, in take
    new_labels = self.axes[axis].take(indexer)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.12/site-packages/pandas/core/indexes/datetimelike.py", line 839, in take
    maybe_slice = lib.maybe_indices_to_slice(indices, len(self))
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "lib.pyx", line 522, in pandas._libs.lib.maybe_indices_to_slice
OverflowError: value too large to convert to int

While similar to other reports, this one fails specifically in pandas._libs.lib.maybe_indices_to_slice.

Other manipulations on the DataFrame also fail. Perhaps row counts are represented internally as 32-bit ints?
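For reference, this row count does not fit in a signed 32-bit integer (a quick check using only numpy):

import numpy as np

n_rows = 4_261_028_590
print(np.iinfo(np.int32).max)           # 2147483647 == 2**31 - 1
print(n_rows > np.iinfo(np.int32).max)  # True: the row count overflows a C int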

Expected Behavior

Return a DataFrame containing only the rows where 'count' is greater than zero, so that it can be filtered down further.

Installed Versions

pd.show_versions()

INSTALLED VERSIONS

commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python : 3.12.4.final.0
python-bits : 64
OS : Darwin
OS-release : 23.5.0
Version : Darwin Kernel Version 23.5.0: Wed May 1 20:09:52 PDT 2024; root:xnu-10063.121.3~5/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 72.1.0
pip : 24.2
Cython : 3.0.11
pytest : 7.4.4
hypothesis : None
sphinx : 7.3.7
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 5.2.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.4
IPython : 8.25.0
pandas_datareader : None
adbc-driver-postgresql : None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : 1.3.7
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.3.1
gcsfs : None
matplotlib : 3.8.4
numba : 0.60.0
numexpr : 2.8.7
odfpy : None
openpyxl : 3.1.5
pandas_gbq : None
pyarrow : 14.0.2
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : 2024.3.1
scipy : 1.13.1
sqlalchemy : 2.0.30
tables : 3.9.2
tabulate : 0.9.0
xarray : 2023.6.0
xlrd : 2.0.1
zstandard : 0.22.0
tzdata : 2023.3
qtpy : 2.4.1
pyqt5 : None

Comment From: benjamindonnachie

Definitely some C ints in https://github.com/pandas-dev/pandas/blob/795cce2a12b6ff77b998d16fcd3ffd22add0711f/pandas/_libs/lib.pyx#L522C1-L558C1:

def maybe_indices_to_slice(ndarray[intp_t, ndim=1] indices, int max_len):  # note: max_len is a C int
    cdef:
        Py_ssize_t i, n = len(indices)
        intp_t k, vstart, vlast, v

    if n == 0:
        return slice(0, 0)

    vstart = indices[0]
    if vstart < 0 or max_len <= vstart:
        return indices

    if n == 1:
        return slice(vstart, <intp_t>(vstart + 1))

    vlast = indices[n - 1]
    if vlast < 0 or max_len <= vlast:
        return indices

    k = indices[1] - indices[0]
    if k == 0:
        return indices
    else:
        for i in range(2, n):
            v = indices[i]
            if v - indices[i - 1] != k:
                return indices

        if k > 0:
            return slice(vstart, <intp_t>(vlast + 1), k)
        else:
            if vlast == 0:
                return slice(vstart, None, k)
            else:
                return slice(vstart, <intp_t>(vlast - 1), k)
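Because max_len is declared as a C int, the Python-to-C argument conversion itself overflows once the frame length exceeds 2**31 - 1. A minimal sketch that triggers the same error directly, without allocating a huge frame (this calls the private pandas._libs.lib module, which is internal API and may change):

import numpy as np
from pandas._libs import lib

indices = np.array([0], dtype=np.intp)
# Raises OverflowError: value too large to convert to int,
# because 2**31 does not fit in a 32-bit C int.
lib.maybe_indices_to_slice(indices, 2**31)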

Comment From: rhshadrach

Confirmed on main. Looks like switching the argument to Py_ssize_t resolves it. @benjamindonnachie, would you be interested in submitting a PR to fix this?
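In other words, a one-line signature change in pandas/_libs/lib.pyx (a sketch; the actual PR may differ):

def maybe_indices_to_slice(ndarray[intp_t, ndim=1] indices, Py_ssize_t max_len):
    ...  # body unchanged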

Comment From: benjamindonnachie

Sure, happy to take a look. I think I have something that works, but it is proving awkward to test due to the lack of RAM on my dev environment. Leave it with me.

Comment From: rhshadrach

Yea - I was afraid of that. I think we may have to fix this without a test unless there is a creative way that I'm missing.

Comment From: benjamindonnachie

Progress! I used a node on our HPC cluster (512 GB of RAM) to pickle the test dataset, and despite the pickle being 102 GB it loads on my machine with 64 GB. I can now manipulate it!

>>> import pandas as pd
>>> print(pd.__version__)
3.0.0.dev0+1351.g523afa840a.dirty
>>> test = pd.read_pickle("summary.pkl")
>>> print(len(test))
4261028590
>>> stripped = test[test['count'] > 0]
>>> print(len(stripped))
1621743

I'll just run some more tests to make sure it doesn't throw any more exceptions elsewhere.

Comment From: benjamindonnachie

The rest of my analysis code now works, so I'm creating a PR for review. Thanks! :)