Code Sample (a copy-pastable example)
import pandas as pd
import numpy as np
x = 9999999999999999
y = 123123123123123123
z = 10000000000000543
s = pd.Series([1, 2, 3, x, y, z], dtype="Int64")
s[3] == x # True
s
s2 = pd.Series([1, 2, 3, x, y, z, np.nan], dtype="Int64")
s2[3] == x # False
s2
np.iinfo(np.int64).max
With interpreter output:
>>> import pandas as pd
>>> import numpy as np
>>>
>>> x = 9999999999999999
>>> y = 123123123123123123
>>> z = 10000000000000543
>>> s = pd.Series([1, 2, 3, x, y, z], dtype="Int64")
>>> s[3] == x # True
True
>>> s
0 1
1 2
2 3
3 9999999999999999
4 123123123123123123
5 10000000000000543
dtype: Int64
>>> s2 = pd.Series([1, 2, 3, x, y, z, np.nan], dtype="Int64")
>>> s2[3] == x # False
False
>>> s2
0 1
1 2
2 3
3 10000000000000000
4 123123123123123120
5 10000000000000544
6 NaN
dtype: Int64
>>> np.iinfo(np.int64).max
9223372036854775807
Problem description
It seems that the presence of np.nan values in a column typed as Int64 causes some non-null values to be mangled. This happens with large-ish values (still well below the int64 maximum).
Given that Int64 is the "Nullable integer" data type, null values should be allowed, and their presence should certainly not silently change the values of other elements in the series.
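The mangling can be reproduced without pandas at all. Once np.nan is in the list, NumPy infers float64 for the whole array, and float64 can only represent integers exactly up to 2**53 (a minimal sketch):

```python
import numpy as np

x = 9999999999999999
# A single np.nan makes NumPy infer float64 for the whole array,
# and float64 rounds integers larger than 2**53.
arr = np.array([1, 2, 3, x, np.nan])
print(arr.dtype)     # float64
print(int(arr[3]))   # 10000000000000000, not 9999999999999999
```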
Expected Output
>>> import pandas as pd
>>> import numpy as np
>>>
>>> x = 9999999999999999
>>> y = 123123123123123123
>>> z = 10000000000000543
>>> s = pd.Series([1, 2, 3, x, y, z], dtype="Int64")
>>> s[3] == x # True
True
>>> s
0 1
1 2
2 3
3 9999999999999999
4 123123123123123123
5 10000000000000543
dtype: Int64
>>> s2 = pd.Series([1, 2, 3, x, y, z, np.nan], dtype="Int64")
>>> s2[3] == x # True (was False above)
True
>>> s2
0 1
1 2
2 3
3 9999999999999999
4 123123123123123123
5 10000000000000543
6 NaN
dtype: Int64
>>> np.iinfo(np.int64).max
9223372036854775807
Comment From: WillAyd
Thanks for the report. Must be an unwanted float cast in the mix as precision looks to get lost around the same range:
>>> x = 2 ** 53
>>> y = 2 ** 53 + 1
>>> z = 2 ** 53 + 2
>>> s2 = pd.Series([1, 2, 3, x, y, z, np.nan], dtype="Int64")
>>> s2
0 1
1 2
2 3
3 9007199254740992
4 9007199254740992
5 9007199254740994
6 NaN
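The 2**53 cutoff matches float64's 53-bit significand; the boundary can be checked in plain Python:

```python
# float64 has a 53-bit significand, so every integer up to 2**53 is
# exactly representable, but 2**53 + 1 is not.
assert float(2**53) == 2**53
assert float(2**53 + 1) == 2**53      # rounds to the nearest representable value
assert float(2**53 + 2) == 2**53 + 2  # exactly representable again
```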
Investigation and PRs certainly welcome
Comment From: jorisvandenbossche
This is caused by converting the list to a numpy array and letting numpy do type inference. Because there is an np.nan, numpy creates a float array, and that indeed causes this precision loss.
With a quick hack like:
--- a/pandas/core/arrays/integer.py
+++ b/pandas/core/arrays/integer.py
@@ -205,7 +205,10 @@ def coerce_to_array(values, dtype, mask=None, copy=False):
            mask = mask.copy()
        return values, mask

-    values = np.array(values, copy=copy)
+    if isinstance(values, list):
+        values = np.array(values, dtype=object)
+    else:
+        values = np.array(values, copy=copy)
    if is_object_dtype(values):
        inferred_type = lib.infer_dtype(values, skipna=True)
        if inferred_type == "empty":
I get the correct behaviour (by ensuring we convert the list to an object array, so we don't lose precision due to the intermediate float array, and we can do the conversion to an integer array ourselves).
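The effect of that object-dtype detour can be checked directly, outside pandas:

```python
import numpy as np

big = 9999999999999999
# Default inference: np.nan forces float64 and the value is rounded.
inferred = np.array([big, np.nan])
# Going through object dtype keeps the Python int exact; the mask and
# the final integer conversion can then be handled separately.
exact = np.array([big, np.nan], dtype=object)
print(int(inferred[0]) == big)  # False
print(exact[0] == big)          # True
```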
For a proper fix, the if/else check will need to be a bit more advanced. I think we should basically check whether the input values already are an ndarray or have a dtype, and if that is the case keep that dtype; otherwise convert to object dtype.
Contributions certainly welcome!
Comment From: rushabh-v
take
Comment From: fcollman
Is this behavior caused by the same bug? Here there are no NaNs involved, so casting was not even needed:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({"id":[864691135341199281, 864691135341199281]}, dtype='Int64')
>>> df['cid']=[49912377,49912377]
>>> print(df)
id cid
0 864691135341199281 49912377
1 864691135341199281 49912377
>>> print(df.groupby('cid').first())
id
cid
49912377 864691135341199232
>>>
>>>
>>> print(df.groupby('cid').first().id.dtype, df.id.dtype)
Int64 Int64
>>>
>>> import numpy as np
>>> print(np.int64(np.float64(864691135341199281)))
864691135341199232
Comment From: rushabh-v
https://github.com/pandas-dev/pandas/issues/31108 is the reason!
Comment From: phofl
Fixed in #50757