Pandas Floating point precision in DataFrame.to_csv

http://stackoverflow.com/questions/12877189/float64-with-pandas-to-csv

What does R (or others) do?

Comment From: pmorissette

Hey all,

I just started using Pandas a few days ago and ran into a related issue.

Basically I am reading in data from a .csv file. I have been writing some unit tests and was getting some errors because my expected values were different from the ones I calculated in Excel. At first, I assumed it was due to rounding but when I inspected my data frame, I realized that I was getting errors because of floating point issues. Basically, an input price of 7.34 was now 7.3399999999999999 (I am working with stock prices).

I was just wondering what the recommended way of dealing with this is, if any? Should I be converting my data frame to another type once imported?

Thanks in advance for your help and great job on this solid library.

Comment From: wesm

It seems that CPython does a better job of float formatting than NumPy. I'll see what I can do

Comment From: wesm

I can't manage to find a standalone reproduction of this. The csv module uses str (via PyObject_Str) to format the numbers, and that appears to work fine on numbers like 0.085 or 7.34. If someone can post an example illustrating this breaking down, I'll see what I can do

Comment From: adamobeng

I think I've been able to reproduce this:

    df = pa.DataFrame({'float' : [9.728141, 4.810295]})
    df.to_csv('floats.csv')

floats.csv looks like:

,float                                                                                                                                                                 
0,9.7281410000000008
1,4.810295

Comment From: wesm

What OS/Python/NumPy combination are you using?

Comment From: adamobeng

uname -a

 Darwin boron 12.2.0 Darwin Kernel Version 12.2.0: Sat Aug 25 00:48:52 PDT 2012; root:xnu-2050.18.24~1/RELEASE_X86_64 x86_64

sys.version

'2.7.3 (default, Nov  3 2012, 17:31:26) \n[GCC 4.2.1 Compatible Apple Clang 4.0 ((tags/Apple/clang-421.0.57))]'

np.version

1.6.2

Edit: This does not happen (i.e. the output is as expected) on an EC2 node running starcluster with:

uname -a

Linux master 3.0.0-14-virtual #23-Ubuntu SMP Mon Nov 21 21:09:11 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

sys.version

'2.7.2+ (default, Oct  4 2011, 20:06:09) \n[GCC 4.6.1]'

np.version

'1.6.2'

Comment From: wesm

Urgh I've dug down into the belly of the Python interpreter and believe that the formatting is eventually happening in the C stdlib, which means that Linux and OS X (BSD) have slightly different implementations. This is annoying is crap.

Comment From: adamobeng

If I understand correctly, the problem comes from trying to write the underlying ndarray directly.

Is there a philosophical reason why there could not be a DataFrameFormatter for the CSV format, given that FloatArrayFormatter already takes care of this problem when outputting to LaTeX, HTML and plain text?

Comment From: wesm

I guess the concern would be loss of precision

Comment From: adamobeng

It depends whether you're using the CSV file for display or storage (i.e. as a faithful reproduction of the DataFrame). You might argue that using CSVs for storage is a bad idea anyway, because if the DataFrame contains arbitrary objects, you'll only end up with their string representations. Especially when you can serialize the same data very easily.

Comment From: antonywu

So the current workaround is to use Linux, instead of Mac to get the results we wanted in csv file? Honestly, for display purpose, I would prefer the option to intentionally drop trailing digits (yes, I mean rounding)... I wonder if there is a way to make it happen with .to_csv()..or would I have to write my own .to_csv() with dataframe iteration + round()

Comment From: frgomes

I detected that read_csv has this bug too.

It's not a Python format issue. It's not a general floating point issue, despite it's true that floating point arithmetic is a subject which demands some care from the programmer. This article below clarifies a bit this subject:

http://docs.python.org/2/tutorial/floatingpoint.html

The problem is that it's necessary to employ fixed point arithmetic and only convert to floating point in the end, applying a convenient divisor.

A classic one-liner which shows the "problem" is ...

0.1 + 0.1 + 0.1 0.30000000000000004

... which does not display 0.3 as one would expect. On the other hand, if you handle the calculation using fixed point arithmetic and only in the last step you employ floating point arithmetic, it will work as you expect. See this:

(1 + 1 + 1) * 1.0 / 10 0.3

So, it's necessary to account to the position of the decimal point, ignore it initially and go ahead with the algorithm which converts text to integers (not floats!). The last step consists on converting an integer to a float by dividing by an adequate power of 10.

If you desperately need to circumvent this problem quickly, I recommend you create another CSV file which contains all figures as integers, for example multiplying by 100, 1000 or other factor which turns out to be convenient. Inside your application, read the CSV file as usual and you will get those integer values back. Then convert those values to floating point, dividing by the same factor you multiplied before.

Comment From: jreback

closing in favor of #4668

Comment From: jiahe224

@pmorissette Hi, Have you found a solution? I found this problem whenever read decimals to dataframe and save as other file, I don't want to use solutions like round or format