Code Sample
import pandas as pd
import numpy as np
from io import StringIO

# col2 of TESTDATA1 contains only valid integers, including one 19-digit value
TESTDATA1 = StringIO("""col1;col2
1;123
2;456
3;1582218195625938945
""")

# TESTDATA2 is identical except that the col2 value in the second row is missing
TESTDATA2 = StringIO("""col1;col2
1;123
2;
3;1582218195625938945
""")

D1 = pd.read_csv(TESTDATA1, sep=";", dtype={'col2': 'Int64'})
D2 = pd.read_csv(TESTDATA2, sep=";", dtype={'col2': 'Int64'})

D1['col2']
D2['col2']
D1.loc[2, 'col2'] == D2.loc[2, 'col2']
Yields:
>>> D1['col2']
0 123
1 456
2 1582218195625938945
Name: col2, dtype: Int64
>>> D2['col2']
0 123
1 <NA>
2 1582218195625938944
Name: col2, dtype: Int64
>>> D1.loc[2,'col2'] == D2.loc[2,'col2']
False
Problem description
When pd.read_csv reads a column as nullable Int64 and that column contains missing values (NA), precision is not maintained on the other (large) values in the column.
Expected Output
>>> D1['col2']
0 123
1 456
2 1582218195625938945
Name: col2, dtype: Int64
>>> D2['col2']
0 123
1 <NA>
2 1582218195625938945
Name: col2, dtype: Int64
>>> D1.loc[2,'col2'] == D2.loc[2,'col2']
True
Output of pd.show_versions()
**Comment From: WillAyd**
I think this is the same root cause as #30268
**Comment From: p1lgr1m**
I just re-ran the code above against the 1.5.0-dev branch (commit adec4febedbd7c987639ca57563ad2e190405e92, numpy 1.22.1, python 3.9.2.final.0). Output below:
```
>>> D1['col2']
0 123
1 456
2 1582218195625938945
Name: col2, dtype: Int64
>>> D2['col2']
0 123
1 <NA>
2 1582218195625938688
Name: col2, dtype: Int64
>>> D1.loc[2,'col2'] == D2.loc[2,'col2']
False
```
It hasn't been solved yet; if anything, it has gotten "worse" (the value in the column with the NaN is farther off from what it should be).
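For concreteness, the two runs are off from the true value by different amounts (a quick arithmetic check I added, not part of the original comment):
```
>>> 1582218195625938945 - 1582218195625938944  # value from the original report
1
>>> 1582218195625938945 - 1582218195625938688  # value from the 1.5.0-dev run
257
```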
This bug was reported almost 2 years ago. I know that everyone is doing their best, but... this is a very clear, very reproducible bug. Could someone comment on what the holdup is, and what's a realistic timeframe to getting it fixed?
**Comment From: jreback**
@p1lgr1m pandas is an open source project, there are 3000 issues open, and we have thousands of pull requests. Patches are provided by the *community*, core-dev can review them. Feel free to open a pull request.
**Comment From: CFretter**
take
**Comment From: CFretter**
So I dug down and found a simpler reproducing example:
```
from pandas.core.tools.numeric import to_numeric
import numpy as np

print(to_numeric([np.nan, '1582218195625938945']))
print(to_numeric(['1', '1582218195625938945']))
```
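Inspecting the dtypes of the two results makes the difference visible (a small check I added, not part of the original comment; the dtypes noted in the comments are my inference from the explanation below):
```
# The presence of np.nan forces the result to float64, whereas the
# all-integer input is returned as int64.
print(to_numeric([np.nan, '1582218195625938945']).dtype)  # float64 (assumed)
print(to_numeric(['1', '1582218195625938945']).dtype)     # int64 (assumed)
```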
`to_numeric` has a comment that says:

> Please note that precision loss may occur if really large numbers are passed in. Due to the internal limitations of `ndarray`, if numbers smaller than `-9223372036854775808` (np.iinfo(np.int64).min) or larger than `18446744073709551615` (np.iinfo(np.uint64).max) are passed in, it is very likely they will be converted to float so that they can be stored in an `ndarray`. These warnings apply similarly to `Series` since it internally leverages `ndarray`.
But the number used here is well within those limits.
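A quick check of that claim (my addition, not part of the original comment):
```
import numpy as np

value = 1582218195625938945
# The value fits comfortably within int64, so the documented overflow-to-float
# caveat above does not explain the precision loss.
print(np.iinfo(np.int64).min <= value <= np.iinfo(np.int64).max)  # True
```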
The actual difference seems to lie in `maybe_convert_numeric`, where there is a fast path for ints that cannot be taken because of the NaN value.
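To illustrate what falling off that fast path costs (a sketch I added, not part of the original comment): float64 has a 53-bit significand, so integers of this size cannot all be represented exactly, and a round trip through float changes the value.
```
import numpy as np

value = 1582218195625938945
# Converting to float64 and back loses the low bits; the result matches the
# value seen in the original report for the column containing the missing entry.
roundtripped = int(np.float64(value))
print(roundtripped)           # 1582218195625938944
print(roundtripped == value)  # False
```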
Now I am sure this function affects many other things I do not understand, so I am a bit reluctant to dig deeper.
It looks like this pull request attempted to solve the same problem: https://github.com/pandas-dev/pandas/pull/30282, which is blocked by this issue: https://github.com/pandas-dev/pandas/issues/31108
What is the proper procedure from here?