Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
```python
import numpy as np
import pandas as pd

test = pd.DataFrame(
    {'count': np.random.randint(0, 100, size=(4261028590))},
    index=pd.DatetimeIndex(np.empty(4261028590)),
)
stripped = test[test['count'] > 0]
```
Issue Description
When working with a very large DataFrame (4,261,028,590 rows), I get `OverflowError: value too large to convert to int` from `pandas._libs.lib.maybe_indices_to_slice`:
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/anaconda3/lib/python3.12/site-packages/pandas/core/frame.py", line 4093, in __getitem__
    return self._getitem_bool_array(key)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.12/site-packages/pandas/core/frame.py", line 4155, in _getitem_bool_array
    return self._take_with_is_copy(indexer, axis=0)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.12/site-packages/pandas/core/generic.py", line 4153, in _take_with_is_copy
    result = self.take(indices=indices, axis=axis)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.12/site-packages/pandas/core/generic.py", line 4133, in take
    new_data = self._mgr.take(
               ^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.12/site-packages/pandas/core/internals/managers.py", line 893, in take
    new_labels = self.axes[axis].take(indexer)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.12/site-packages/pandas/core/indexes/datetimelike.py", line 839, in take
    maybe_slice = lib.maybe_indices_to_slice(indices, len(self))
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "lib.pyx", line 522, in pandas._libs.lib.maybe_indices_to_slice
OverflowError: value too large to convert to int
```
While similar to other reports, this one occurs in `pandas._libs.lib.maybe_indices_to_slice`. Other manipulations on the DataFrame also fail. Perhaps row counts are represented internally as 32-bit ints?
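As a quick sanity check (my own illustration, not pandas internals), the row count here does overflow a signed 32-bit C int, which is consistent with the error above:

```python
import ctypes

n_rows = 4_261_028_590     # rows in the failing DataFrame
int32_max = 2**31 - 1      # 2147483647, the largest signed 32-bit C int

print(n_rows > int32_max)  # True: the length cannot fit in a C int
# ctypes shows what a 32-bit truncation of that length would look like
# (ctypes does no overflow checking, so the value simply wraps):
print(ctypes.c_int32(n_rows).value)  # -33938706
```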
Expected Behavior
Return the rows where 'count' is greater than zero, allowing the DataFrame to be filtered down further.
Installed Versions
pd.show_versions()
```
INSTALLED VERSIONS
------------------
commit                : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python                : 3.12.4.final.0
python-bits           : 64
OS                    : Darwin
OS-release            : 23.5.0
Version               : Darwin Kernel Version 23.5.0: Wed May 1 20:09:52 PDT 2024; root:xnu-10063.121.3~5/RELEASE_X86_64
machine               : x86_64
processor             : i386
byteorder             : little
LC_ALL                : None
LANG                  : en_GB.UTF-8
LOCALE                : en_GB.UTF-8

pandas                : 2.2.2
numpy                 : 1.26.4
pytz                  : 2024.1
dateutil              : 2.9.0.post0
setuptools            : 72.1.0
pip                   : 24.2
Cython                : 3.0.11
pytest                : 7.4.4
hypothesis            : None
sphinx                : 7.3.7
blosc                 : None
feather               : None
xlsxwriter            : None
lxml.etree            : 5.2.1
html5lib              : None
pymysql               : None
psycopg2              : None
jinja2                : 3.1.4
IPython               : 8.25.0
pandas_datareader     : None
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : 4.12.3
bottleneck            : 1.3.7
dataframe-api-compat  : None
fastparquet           : None
fsspec                : 2024.3.1
gcsfs                 : None
matplotlib            : 3.8.4
numba                 : 0.60.0
numexpr               : 2.8.7
odfpy                 : None
openpyxl              : 3.1.5
pandas_gbq            : None
pyarrow               : 14.0.2
pyreadstat            : None
python-calamine       : None
pyxlsb                : None
s3fs                  : 2024.3.1
scipy                 : 1.13.1
sqlalchemy            : 2.0.30
tables                : 3.9.2
tabulate              : 0.9.0
xarray                : 2023.6.0
xlrd                  : 2.0.1
zstandard             : 0.22.0
tzdata                : 2023.3
qtpy                  : 2.4.1
pyqt5                 : None
```
Comment From: benjamindonnachie
Definitely some ints in https://github.com/pandas-dev/pandas/blob/795cce2a12b6ff77b998d16fcd3ffd22add0711f/pandas/_libs/lib.pyx#L522C1-L558C1:
```cython
def maybe_indices_to_slice(ndarray[intp_t, ndim=1] indices, int max_len):
    cdef:
        Py_ssize_t i, n = len(indices)
        intp_t k, vstart, vlast, v

    if n == 0:
        return slice(0, 0)

    vstart = indices[0]
    if vstart < 0 or max_len <= vstart:
        return indices

    if n == 1:
        return slice(vstart, <intp_t>(vstart + 1))

    vlast = indices[n - 1]
    if vlast < 0 or max_len <= vlast:
        return indices

    k = indices[1] - indices[0]
    if k == 0:
        return indices
    else:
        for i in range(2, n):
            v = indices[i]
            if v - indices[i - 1] != k:
                return indices

        if k > 0:
            return slice(vstart, <intp_t>(vlast + 1), k)
        else:
            if vlast == 0:
                return slice(vstart, None, k)
            else:
                return slice(vstart, <intp_t>(vlast - 1), k)
```
Comment From: rhshadrach
Confirmed on main. Looks like switching the argument to `Py_ssize_t` resolves it. @benjamindonnachie would you be interested in submitting a PR to fix?
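For reference, the suggestion amounts to roughly this one-line signature change in `lib.pyx` (a sketch of the idea, not the final patch). `Py_ssize_t` is 64-bit on 64-bit platforms, so lengths above `INT_MAX` no longer overflow when passed as `max_len`:

```cython
def maybe_indices_to_slice(ndarray[intp_t, ndim=1] indices, Py_ssize_t max_len):
    ...  # body unchanged
```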
Comment From: benjamindonnachie
Sure, happy to take a look. I think I have something that works, but it is proving awkward to test due to lack of RAM on my dev environment. Leave it with me.
Comment From: rhshadrach
Yea - I was afraid of that. I think we may have to fix this without a test unless there is a creative way that I'm missing.
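One possible low-memory check (a sketch, assuming `pandas._libs.lib` is importable directly, as it appears in the traceback above): the helper only receives an index array and a length, so the overflow can be triggered without materialising a 4-billion-row frame:

```python
import numpy as np
from pandas._libs import lib

indices = np.arange(3, dtype=np.intp)

# With a small max_len, consecutive indices collapse into a slice as usual:
print(lib.maybe_indices_to_slice(indices, 10))  # slice(0, 3, 1)

# On affected versions, a max_len above 2**31 - 1 (e.g. the 4261028590 from the
# report) raises "OverflowError: value too large to convert to int" here,
# with no large allocation needed.
```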
Comment From: benjamindonnachie
Progress! I used a node on our HPC (512GB) to pickle the test dataset and despite being 102GB it loads on my machine with 64GB. I can now manipulate it!
```python
>>> import pandas as pd
>>> print(pd.__version__)
3.0.0.dev0+1351.g523afa840a.dirty
>>> test = pd.read_pickle("summary.pkl")
>>> print(len(test))
4261028590
>>> stripped = test[test['count'] > 0]
>>> print(len(stripped))
1621743
```
I'll just run some more tests to make sure it doesn't throw any more exceptions elsewhere.
Comment From: benjamindonnachie
The rest of my analysis code now works so creating a PR for review. Thanks! :)