Code Sample
import pandas as pd
import numpy as np
from io import StringIO
# col2 with no missing values
TESTDATA1 = StringIO("""col1;col2
1;123
2;456
3;1582218195625938945
""")
# same data, but with a missing col2 value in the second row
TESTDATA2 = StringIO("""col1;col2
1;123
2;
3;1582218195625938945
""")
# parse col2 as the nullable Int64 extension dtype
D1 = pd.read_csv(TESTDATA1, sep=";", dtype={'col2':'Int64'})
D2 = pd.read_csv(TESTDATA2, sep=";", dtype={'col2':'Int64'})
D1['col2']
D2['col2']
D1.loc[2,'col2'] == D2.loc[2,'col2']
Yields:
>>> D1['col2']
0 123
1 456
2 1582218195625938945
Name: col2, dtype: Int64
>>> D2['col2']
0 123
1 <NA>
2 1582218195625938944
Name: col2, dtype: Int64
>>> D1.loc[2,'col2'] == D2.loc[2,'col2']
False
Problem description
When pd.read_csv reads a column with Nullable Int64 data, and the column contains missing values (N/A), the precision on other values in that column is not maintained.
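The wrong value in D2 is exactly what a float64 round trip of the original integer produces, which suggests the column passes through a float intermediate when a missing value is present. A minimal sketch of that effect (plain Python, no pandas):
```
n = 1582218195625938945      # needs 61 bits; float64 only has a 53-bit mantissa
print(n > 2**53)             # True, so float64 cannot represent it exactly
print(int(float(n)))         # 1582218195625938944 -- the value D2 ends up with
```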
Expected Output
>>> D1['col2']
0 123
1 456
2 1582218195625938945
Name: col2, dtype: Int64
>>> D2['col2']
0 123
1 <NA>
2 1582218195625938945
Name: col2, dtype: Int64
>>> D1.loc[2,'col2'] == D2.loc[2,'col2']
True
Output of pd.show_versions()
**Comment From: WillAyd**
I think this is the same root cause as #30268
**Comment From: p1lgr1m**
I just re-ran the code above against the 1.5.0-dev branch (commit adec4febedbd7c987639ca57563ad2e190405e92, numpy 1.22.1, python 3.9.2.final.0). Output below:
```
>>> D1['col2']
0 123
1 456
2 1582218195625938945
Name: col2, dtype: Int64
>>> D2['col2']
0 123
1 <NA>
2 1582218195625938688
Name: col2, dtype: Int64
>>> D1.loc[2,'col2'] == D2.loc[2,'col2']
False
```
It hasn't been solved yet; if anything, it has gotten "worse" (the value in the column with the NaN is farther off from what it should be).
This bug was reported almost 2 years ago. I know that everyone is doing their best, but... this is a very clear, very reproducible bug. Could someone comment on what the holdup is, and what a realistic timeframe for getting it fixed would be?
**Comment From: jreback**
@p1lgr1m pandas is an open source project; there are 3000 issues open and we have thousands of pull requests. Patches are provided by the *community*; core devs can review them. Feel free to open a pull request.
**Comment From: CFretter**
take
**Comment From: CFretter**
So I dug down and found a simpler reproducing example:
```
from pandas.core.tools.numeric import to_numeric
import numpy as np

# with a NaN in the input, the result comes back as float64 and the large value is rounded
print(to_numeric([np.nan, '1582218195625938945']))
# with only parseable integers, the result stays int64 and exact
print(to_numeric(['1', '1582218195625938945']))
```
to_numeric has a comment that says:
Please note that precision loss may occur if really large numbers are passed in. Due to the internal limitations of `ndarray`, if numbers smaller than `-9223372036854775808` (np.iinfo(np.int64).min) or larger than `18446744073709551615` (np.iinfo(np.uint64).max) are passed in, it is very likely they will be converted to float so that they can be stored in an `ndarray`. These warnings apply similarly to `Series` since it internally leverages `ndarray`.
But the number used here (1582218195625938945) is well within those limits, as the quick check below shows, so that documented caveat should not apply.
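A small verification sketch of those bounds:
```
import numpy as np

n = 1582218195625938945
print(np.iinfo(np.int64).min <= n <= np.iinfo(np.uint64).max)  # True: inside the documented limits
print(n <= np.iinfo(np.int64).max)                             # True: it even fits in a plain int64
```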
The actual difference seems to lie in `maybe_convert_numeric`, where there is a fast path for ints that cannot be taken when a NaN value is present.
Now I am sure this function affects many other things I do not understand, so I am a bit reluctant to dig deeper.
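To make the cost of losing that fast path concrete, here is a rough sketch with plain NumPy (an illustration of the suspected mechanism, not the actual `maybe_convert_numeric` code): parsing the string straight to int64 keeps every digit, while going through float64 first rounds to the nearest representable double.
```
import numpy as np

s = "1582218195625938945"

# Direct string -> int64 parse (analogous to the int fast path): exact.
print(np.array([s], dtype=np.int64)[0])                     # 1582218195625938945

# Through a float64 intermediate (roughly what a NaN in the column forces): low bits are lost.
print(np.array([s], dtype=np.float64).astype(np.int64)[0])  # 1582218195625938944
```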
It looks like this pull request attempted to solve the same problem: https://github.com/pandas-dev/pandas/pull/30282, which is blocked by this issue: https://github.com/pandas-dev/pandas/issues/31108
What is the proper procedure from here?