Code Sample (a copy-pastable example)
import pandas as pd
import numpy as np
x = 9999999999999999
y = 123123123123123123
z = 10000000000000543
s = pd.Series([1, 2, 3, x, y, z], dtype="Int64")
s[3] == x # True
s
s2 = pd.Series([1, 2, 3, x, y, z, np.nan], dtype="Int64")
s2[3] == x # False
s2
np.iinfo(np.int64).max
With interpreter output:
>>> import pandas as pd
>>> import numpy as np
>>>
>>> x = 9999999999999999
>>> y = 123123123123123123
>>> z = 10000000000000543
>>> s = pd.Series([1, 2, 3, x, y, z], dtype="Int64")
>>> s[3] == x # True
True
>>> s
0 1
1 2
2 3
3 9999999999999999
4 123123123123123123
5 10000000000000543
dtype: Int64
>>> s2 = pd.Series([1, 2, 3, x, y, z, np.nan], dtype="Int64")
>>> s2[3] == x # False
False
>>> s2
0 1
1 2
2 3
3 10000000000000000
4 123123123123123120
5 10000000000000544
6 NaN
dtype: Int64
>>> np.iinfo(np.int64).max
9223372036854775807
Problem description
It seems that the presence of np.nan values in a column typed as Int64 causes some non-null values to be mangled. This happens with large-ish values (still well below the int64 maximum).
Given that Int64 is the "Nullable integer" data type, null values should be allowed, and their presence should certainly not silently change the values of other elements in the series.
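The mangling can be reproduced without pandas at all. Once np.nan is in the list, NumPy infers float64 for the whole array, and float64 can only represent integers exactly up to 2**53 (a minimal sketch):

```python
import numpy as np

x = 9999999999999999
# A single np.nan makes NumPy infer float64 for the whole array,
# and float64 rounds integers larger than 2**53.
arr = np.array([1, 2, 3, x, np.nan])
print(arr.dtype)     # float64
print(int(arr[3]))   # 10000000000000000, not 9999999999999999
```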
Expected Output
>>> import pandas as pd
>>> import numpy as np
>>>
>>> x = 9999999999999999
>>> y = 123123123123123123
>>> z = 10000000000000543
>>> s = pd.Series([1, 2, 3, x, y, z], dtype="Int64")
>>> s[3] == x # True
True
>>> s
0 1
1 2
2 3
3 9999999999999999
4 123123123123123123
5 10000000000000543
dtype: Int64
>>> s2 = pd.Series([1, 2, 3, x, y, z, np.nan], dtype="Int64")
>>> s2[3] == x # True (was False above)
True
>>> s2
0 1
1 2
2 3
3 9999999999999999
4 123123123123123123
5 10000000000000543
6 NaN
dtype: Int64
>>> np.iinfo(np.int64).max
9223372036854775807
Comment From: WillAyd
Thanks for the report. Must be an unwanted float cast in the mix as precision looks to get lost around the same range:
>>> x = 2 ** 53
>>> y = 2 ** 53 + 1
>>> z = 2 ** 53 + 2
>>> s2 = pd.Series([1, 2, 3, x, y, z, np.nan], dtype="Int64")
>>> s2
0 1
1 2
2 3
3 9007199254740992
4 9007199254740992
5 9007199254740994
6 NaN
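The 2**53 cutoff matches float64's 53-bit significand; the boundary can be checked in plain Python:

```python
# float64 has a 53-bit significand, so every integer up to 2**53 is
# exactly representable, but 2**53 + 1 is not.
assert float(2**53) == 2**53
assert float(2**53 + 1) == 2**53      # rounds to the nearest representable value
assert float(2**53 + 2) == 2**53 + 2  # exactly representable again
```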
Investigation and PRs certainly welcome
Comment From: jorisvandenbossche
This is caused by converting the list to a numpy array and letting numpy do type inference. Because there is an np.nan, numpy creates a float array, and that indeed causes this precision loss.
With a quick hack like:
--- a/pandas/core/arrays/integer.py
+++ b/pandas/core/arrays/integer.py
@@ -205,7 +205,10 @@ def coerce_to_array(values, dtype, mask=None, copy=False):
            mask = mask.copy()
        return values, mask

-    values = np.array(values, copy=copy)
+    if isinstance(values, list):
+        values = np.array(values, dtype=object)
+    else:
+        values = np.array(values, copy=copy)
    if is_object_dtype(values):
        inferred_type = lib.infer_dtype(values, skipna=True)
        if inferred_type == "empty":
I get the correct behaviour (by ensuring we convert the list to an object array, so we don't lose precision due to the intermediate float array, and we can do the conversion to an integer array ourselves).
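The effect of that object-dtype detour can be checked directly, outside pandas:

```python
import numpy as np

big = 9999999999999999
# Default inference: np.nan forces float64 and the value is rounded.
inferred = np.array([big, np.nan])
# Going through object dtype keeps the Python int exact; the mask and
# the final integer conversion can then be handled separately.
exact = np.array([big, np.nan], dtype=object)
print(int(inferred[0]) == big)  # False
print(exact[0] == big)          # True
```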
For a proper fix, the if/else check will need to be a bit more advanced. I think we should basically check whether the input values already are an ndarray or have a dtype, and if that is the case keep that dtype; otherwise convert to object dtype.
Contributions certainly welcome!
Comment From: rushabh-v
take
Comment From: fcollman
Is this behavior caused by the same bug? Here there are no NaNs involved, so casting was not even needed:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({"id":[864691135341199281, 864691135341199281]}, dtype='Int64')
>>> df['cid']=[49912377,49912377]
>>> print(df)
id cid
0 864691135341199281 49912377
1 864691135341199281 49912377
>>> print(df.groupby('cid').first())
id
cid
49912377 864691135341199232
>>>
>>>
>>> print(df.groupby('cid').first().id.dtype, df.id.dtype)
Int64 Int64
>>>
>>> import numpy as np
>>> print(np.int64(np.float64(864691135341199281)))
864691135341199232
Comment From: rushabh-v
https://github.com/pandas-dev/pandas/issues/31108 is the reason!
Comment From: phofl
Fixed in #50757