Assigning a new column as the sum of two existing columns gives incorrect results for large DataFrames: X['C']=X['A']+X['B'] intermittently produces wrong values on some runs when the frame has more than 10000 rows. X['A'] and X['B'] may contain any float values. This issue is absent in pandas 0.16.2. I am running 64-bit Python on Windows 10. See attached notebook for details:
To reproduce:
+++++++++++++++++++++++++++++++++++++
import pandas as pd

DataFrameSize=10001 ## will work with 10000 or less
XR = pd.DataFrame({'A' : pd.Series(1,index=list(range(DataFrameSize)),dtype='float32'),
                   'B' : pd.Series(2,index=list(range(DataFrameSize)),dtype='float32')})

def CleanDebug(X):
    X.loc[:,'C']=X.loc[:,'A']+X.loc[:,'B']
    #X['C']=X['A']+X['B']
    return X

for i in xrange(1000):
    print 'iteration ',i
    ry = CleanDebug(XR)
    assert abs(ry.C.sum()-30003)<1
Comment From: jorisvandenbossche
Can you show the output of pd.show_versions()?
Maybe also put the notebook in a gist on GitHub; that makes it easier to see the content.
Comment From: jreback
This is almost certainly the same as https://github.com/pydata/pandas/issues/12023
You probably have an older numexpr; upgrade to 2.4.6 (latest) and reconfirm.
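A quick way to see which numexpr is installed before upgrading might look like the sketch below. The helper names (version_tuple, is_older, check_numexpr) are made up for illustration, and the comparison only looks at the leading numeric parts of the version string, so treat it as a rough check rather than a full version parser.

```python
import re

MIN_NUMEXPR = "2.4.6"  # the version suggested in this thread


def version_tuple(version):
    """Turn '2.4.6' into (2, 4, 6); stop at the first non-numeric component."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def is_older(installed, required=MIN_NUMEXPR):
    """Return True if the installed version sorts before the required one."""
    return version_tuple(installed) < version_tuple(required)


def check_numexpr():
    """Describe the installed numexpr relative to MIN_NUMEXPR."""
    try:
        import numexpr
    except ImportError:
        return "numexpr is not installed"
    installed = numexpr.__version__
    if is_older(installed):
        return "numexpr %s is older than %s, consider upgrading" % (installed, MIN_NUMEXPR)
    return "numexpr %s is at least %s" % (installed, MIN_NUMEXPR)


print(check_numexpr())
```

If it reports an older version, upgrade numexpr with whatever package manager you used to install it, then rerun the reproduction.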
Comment From: kirickt
Yes, it was an older numexpr, 2.4.4. After upgrading to 2.4.6 the bug went away. Thanks a lot!
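For anyone who cannot upgrade right away, a cross-check along the following lines may help confirm that numexpr is involved. It is only a sketch: it assumes the wrong sums come from numexpr, as this thread concludes, and it uses the pandas option compute.use_numexpr to switch numexpr off for a single addition. The function name sum_without_numexpr is hypothetical, and the frame mirrors the one in the reproduction above.

```python
def sum_without_numexpr(size=10001):
    """Add two float32 columns with numexpr disabled and return the sum of the result.

    Returns None if pandas is not installed.
    """
    try:
        import pandas as pd
    except ImportError:
        return None

    frame = pd.DataFrame({
        'A': pd.Series(1, index=list(range(size)), dtype='float32'),
        'B': pd.Series(2, index=list(range(size)), dtype='float32'),
    })

    # Temporarily turn off numexpr so the addition is evaluated by plain NumPy;
    # the previous setting is restored when the block exits.
    with pd.option_context('compute.use_numexpr', False):
        frame['C'] = frame['A'] + frame['B']

    return float(frame['C'].sum())


result = sum_without_numexpr()
print(result)
```

If the sum is stable across repeated calls with numexpr disabled but not with it enabled, that points at numexpr rather than pandas itself.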
Comment From: kirickt
Should I delete the post?
Comment From: jreback
nope it's good