Code Sample
import pandas as pd
import numpy as np
from io import StringIO

# col2 of TESTDATA1 contains only valid integers, including one 19-digit value
TESTDATA1 = StringIO("""col1;col2
1;123
2;456
3;1582218195625938945
""")

# TESTDATA2 is identical except that the col2 value in the second row is missing
TESTDATA2 = StringIO("""col1;col2
1;123
2;
3;1582218195625938945
""")

D1 = pd.read_csv(TESTDATA1, sep=";", dtype={'col2': 'Int64'})
D2 = pd.read_csv(TESTDATA2, sep=";", dtype={'col2': 'Int64'})

D1['col2']
D2['col2']
D1.loc[2, 'col2'] == D2.loc[2, 'col2']
Yields:
>>> D1['col2']
0 123
1 456
2 1582218195625938945
Name: col2, dtype: Int64
>>> D2['col2']
0 123
1 <NA>
2 1582218195625938944
Name: col2, dtype: Int64
>>> D1.loc[2,'col2'] == D2.loc[2,'col2']
False
Problem description
When pd.read_csv reads a column as nullable Int64 and that column contains missing values (NA), precision is not maintained on the other (large) values in the column.
Expected Output
>>> D1['col2']
0 123
1 456
2 1582218195625938945
Name: col2, dtype: Int64
>>> D2['col2']
0 123
1 <NA>
2 1582218195625938945
Name: col2, dtype: Int64
>>> D1.loc[2,'col2'] == D2.loc[2,'col2']
True
Output of pd.show_versions()
**Comment From: WillAyd**
I think this is the same root cause as #30268
**Comment From: p1lgr1m**
I just re-ran the code above against the 1.5.0-dev branch (commit adec4febedbd7c987639ca57563ad2e190405e92, numpy 1.22.1, python 3.9.2.final.0). Output below:
```
>>> D1['col2']
0 123
1 456
2 1582218195625938945
Name: col2, dtype: Int64
>>> D2['col2']
0 123
1 <NA>
2 1582218195625938688
Name: col2, dtype: Int64
>>> D1.loc[2,'col2'] == D2.loc[2,'col2']
False
```
It hasn't been solved yet; if anything, it has gotten "worse" (the value in the column with the NaN is farther off from what it should be).
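For concreteness, the two runs are off from the true value by different amounts (a quick arithmetic check I added, not part of the original comment):
```
>>> 1582218195625938945 - 1582218195625938944  # value from the original report
1
>>> 1582218195625938945 - 1582218195625938688  # value from the 1.5.0-dev run
257
```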
This bug was reported almost 2 years ago. I know that everyone is doing their best, but... this is a very clear, very reproducible bug. Could someone comment on what the holdup is, and what's a realistic timeframe to getting it fixed?
**Comment From: jreback**
@p1lgr1m pandas is an open source project, there are 3000 issues open, and we have thousands of pull requests. Patches are provided by the *community*, core-dev can review them. Feel free to open a pull request.
**Comment From: CFretter**
take
**Comment From: CFretter**
So I dug down and found a simpler reproducing example:
```
from pandas.core.tools.numeric import to_numeric
import numpy as np

print(to_numeric([np.nan, '1582218195625938945']))
print(to_numeric(['1', '1582218195625938945']))
```
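Inspecting the dtypes of the two results makes the difference visible (a small check I added, not part of the original comment; the dtypes noted in the comments are my inference from the explanation below):
```
# The presence of np.nan forces the result to float64, whereas the
# all-integer input is returned as int64.
print(to_numeric([np.nan, '1582218195625938945']).dtype)  # float64 (assumed)
print(to_numeric(['1', '1582218195625938945']).dtype)     # int64 (assumed)
```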
`to_numeric` has a comment that says:

> Please note that precision loss may occur if really large numbers are passed in. Due to the internal limitations of `ndarray`, if numbers smaller than `-9223372036854775808` (np.iinfo(np.int64).min) or larger than `18446744073709551615` (np.iinfo(np.uint64).max) are passed in, it is very likely they will be converted to float so that they can be stored in an `ndarray`. These warnings apply similarly to `Series` since it internally leverages `ndarray`.
But the number used here is well within those limits.
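A quick check of that claim (my addition, not part of the original comment):
```
import numpy as np

value = 1582218195625938945
# The value fits comfortably within int64, so the documented overflow-to-float
# caveat above does not explain the precision loss.
print(np.iinfo(np.int64).min <= value <= np.iinfo(np.int64).max)  # True
```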
The actual difference seems to lie in `maybe_convert_numeric`, where there is a fast path for ints that cannot be taken because of the NaN value.
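To illustrate what falling off that fast path costs (a sketch I added, not part of the original comment): float64 has a 53-bit significand, so integers of this size cannot all be represented exactly, and a round trip through float changes the value.
```
import numpy as np

value = 1582218195625938945
# Converting to float64 and back loses the low bits; the result matches the
# value seen in the original report for the column containing the missing entry.
roundtripped = int(np.float64(value))
print(roundtripped)           # 1582218195625938944
print(roundtripped == value)  # False
```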
Now I am sure this function affects many other things I do not understand, so I am a bit reluctant to dig deeper.
It looks like this pull request attempted to solve the same problem: https://github.com/pandas-dev/pandas/pull/30282, which is blocked by this issue: https://github.com/pandas-dev/pandas/issues/31108
What is the proper procedure from here?