Pandas pd.where OverflowError with large numbers

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np

df = pd.DataFrame([[1.0, 2e25],[np.nan, 0.1]])

# Works when applied to individual columns
print(df[0].where(pd.notnull(df[0]), None))
print(df[1].where(pd.notnull(df[1]), None))

# Breaks for whole dataframe
print(df.where(pd.notnull(df), None))

Problem description

The above code does not work with 1.0.0, but used to work with at least 0.25.0. Running DataFrame.where on a dataframe that contains large float values, with a replacement value of a different dtype, throws OverflowError: int too big to convert:

    applied = getattr(b, f)(**kwargs)
  File "***\venv\lib\site-packages\pandas\core\internals\blocks.py", line 1426, in where
    return self._maybe_downcast(blocks, "infer")
  File "***\venv\lib\site-packages\pandas\core\internals\blocks.py", line 514, in _maybe_downcast
    return _extend_blocks([b.downcast(downcast) for b in blocks])
  File "***\venv\lib\site-packages\pandas\core\internals\blocks.py", line 514, in <listcomp>
    return _extend_blocks([b.downcast(downcast) for b in blocks])
  File "***\venv\lib\site-packages\pandas\core\internals\blocks.py", line 552, in downcast
    return self.split_and_operate(None, f, False)
  File "***\venv\lib\site-packages\pandas\core\internals\blocks.py", line 496, in split_and_operate
    nv = f(m, v, i)
  File "***\venv\lib\site-packages\pandas\core\internals\blocks.py", line 549, in f
    val = maybe_downcast_to_dtype(val, dtype="infer")
  File "***\venv\lib\site-packages\pandas\core\dtypes\cast.py", line 135, in maybe_downcast_to_dtype
    converted = maybe_downcast_numeric(result, dtype, do_round)
  File "***\venv\lib\site-packages\pandas\core\dtypes\cast.py", line 222, in maybe_downcast_numeric
    new_result = trans(result).astype(dtype)
OverflowError: int too big to convert

Replacing data one column at a time works.

Expected Output

Name: 1, dtype: float64
      0      1
0     1  2e+25
1  None    0.1

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 9, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None
pandas : 1.0.0
numpy : 1.16.2
pytz : 2018.9
dateutil : 2.8.0
pip : 10.0.1
setuptools : 39.1.0
Cython : None
pytest : 4.4.0
hypothesis : None
sphinx : 2.0.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.3.3
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.3.3
matplotlib : None
numexpr : None
odfpy : None
openpyxl : 2.6.2
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 4.4.0
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 1.3.3
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None
numba : None

Comment From: wailashi

Seems to be still broken on 1.0.1.

Comment From: rmsilva1973

print(df.where(pd.notnull(df), 'some_str')) raises the same exception. However print(df.where(pd.notnull(df), 1)) works.

Debugging through the code, it seems that on the line new_result = trans(result).astype(dtype), the astype call is to blame; dtype here is int64.
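To illustrate why a cast to int64 can fail here, the following is a minimal standard-library sketch, not the pandas code path itself. It assumes the failing value is 2e25 from the original report; `INT64_MAX` and `fits_in_int64` are hypothetical names introduced only for this example.

```python
# Sketch only: shows that 2e25 does not fit in a signed 64-bit integer.
# INT64_MAX and fits_in_int64 are illustrative names, not pandas internals.
INT64_MAX = 2**63 - 1  # largest value a signed 64-bit integer can hold

big = int(2e25)  # the float from the report, converted to an exact Python int


def fits_in_int64(value: int) -> bool:
    """Return True if value can be packed into 8 signed bytes."""
    try:
        value.to_bytes(8, "little", signed=True)
        return True
    except OverflowError:
        # int.to_bytes raises OverflowError when the value needs more bytes
        return False


print(big > INT64_MAX)      # the value is far outside the int64 range
print(fits_in_int64(1))     # small values fit
print(fits_in_int64(big))   # the reported value does not
```

This would be consistent with the observation that a replacement value of 1 works while None or a string fails, although the exact branch taken inside pandas is not confirmed by this sketch.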

Comment From: simonjayhawkins

The above code does not work with 1.0.0, but used to work with at least 0.25.0.

looks like #29139 (i.e. 1.0.0)

225cc9284d1abfbcc9d13203419516575a63cc7a is the first bad commit
commit 225cc9284d1abfbcc9d13203419516575a63cc7a
Author: jbrockmendel jbrockmendel@gmail.com
Date: Thu Oct 24 05:10:07 2019 -0700

CLN: remove Block._try_coerce_arg (#29139)

cc @jbrockmendel

Comment From: eloyfelix

also affected by this bug

Comment From: uchoa91

Same here. However got it using df.replace({pd.NaT: None}) instead of df.where(pd.notnull(df), None)

Comment From: simonjayhawkins

@TomAugspurger adding the blocker tag as several users affected by this regression.

Comment From: jreback

@simonjayhawkins i suppose this is ok for 1.1 as a blocker, though generally we should not simply block on things; this would delay releases indefinitely, which is a much worse problem.

Comment From: eloyfelix

still broken in 1.1.3

Comment From: jreback

for all those commenting 'this is still broken' - well it's an open issue

you are welcome to propose a patch

Comment From: eloyfelix

sorry, it was marked as a blocker for 1.1 and I was not sure about its status

Comment From: skvrahul

An added observation: this still seems to break with numbers smaller than sys.maxsize. Trying the same piece of code with a different number:

import pandas as pd
import numpy as np
import sys

df = pd.DataFrame([[1.0, sys.maxsize-5],[np.nan, 0.1]])

# Works when applied to individual columns
print(df[0].where(pd.notnull(df[0]), None))
print(df[1].where(pd.notnull(df[1]), None))

# This line still breaks
print(df.where(pd.notnull(df), None))

This still causes the same error
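One possible explanation, offered as an assumption rather than a confirmed diagnosis: because the column also contains 0.1, the integer is presumably stored as a float64, and a float64 cannot represent sys.maxsize - 5 exactly. The sketch below uses only plain Python floats and writes the 64-bit value of sys.maxsize out explicitly as the hypothetical constant `INT64_MAX`.

```python
# Sketch only: float rounding near the top of the int64 range.
# INT64_MAX is an illustrative constant; it equals sys.maxsize on typical 64-bit builds.
INT64_MAX = 2**63 - 1

# A float64 has a 53-bit significand, so integers this large are rounded.
as_float = float(INT64_MAX - 5)

print(as_float == 2.0**63)        # the value rounds up to exactly 2**63
print(int(as_float) > INT64_MAX)  # which is one past the int64 maximum
```

If that is what happens, the stored value is already outside the int64 range even though the integer that was typed in is below sys.maxsize.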

Comment From: suvayu

take

Comment From: mroeschke

This looks to work on master now (None should coerce to nan). Could use a test.

In [2]: import pandas as pd
   ...: import numpy as np
   ...:
   ...: df = pd.DataFrame([[1.0, 2e25],[np.nan, 0.1]])
   ...:
   ...: # Works when applied to individual columns
   ...: print(df[0].where(pd.notnull(df[0]), None))
   ...: print(df[1].where(pd.notnull(df[1]), None))
   ...:
   ...: # Breaks for whole dataframe
   ...: print(df.where(pd.notnull(df), None))
0    1.0
1    NaN
Name: 0, dtype: float64
0    2.000000e+25
1    1.000000e-01
Name: 1, dtype: float64
     0             1
0  1.0  2.000000e+25
1  NaN  1.000000e-01
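Regarding "Could use a test": a regression check along these lines might look like the sketch below. It reuses the frame from the original report; the function name `check_where_with_large_float` is hypothetical, and the assertions are deliberately loose so that either None or NaN is accepted as the replacement. On the affected versions reported in this thread (1.0.0 through 1.1.3), the call is expected to raise OverflowError instead.

```python
import numpy as np
import pandas as pd


def check_where_with_large_float():
    """Sketch of a regression check; the name is illustrative only."""
    df = pd.DataFrame([[1.0, 2e25], [np.nan, 0.1]])

    # The call from the report: replace missing values with None.
    result = df.where(pd.notnull(df), None)

    # Non-missing values should come through unchanged.
    assert result.iloc[0, 0] == 1.0
    assert result.iloc[0, 1] == 2e25
    assert result.iloc[1, 1] == 0.1

    # The missing cell may be None or NaN depending on the pandas version.
    assert pd.isna(result.iloc[1, 0])
    return result


result = check_where_with_large_float()
```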

Comment From: eloyfelix

Edited: not sure if it is related to this issue, but as this other issue points out:

df = df.where((pd.notnull(df)), None) doesn't seem to work anymore and df = df.replace({np.nan: None}) doesn't always work

Comment From: mliu08

take

Comment From: MarcoGorelli

Looks like this was fixed by https://github.com/pandas-dev/pandas/pull/39761

https://www.kaggle.com/code/marcogorelli/pandas-regression-example/notebook?scriptVersionId=112217500
