Code Sample, a copy-pastable example if possible
import pandas, numpy
import time
df1 = pandas.DataFrame(numpy.random.random((4000,100)))
df = df1.copy()
x = 'ab'  # Note: Issue does not occur if x = 0.2
for c in df.columns.values:
    df[c] = x  # Note: This line is required to produce the issue.
    start = time.time()
    df.loc[1,1] = 0.1
    end = time.time()
    if (c+1) % 10 == 0:
        print(end-start)
Problem description
I have copied another DataFrame and am iterating over its columns, setting each column to a default string (x). However, a subsequent assignment using df.loc takes longer and longer with each iteration. The output I get from the above code is:
0.004847049713134766
0.012796163558959961
0.016080856323242188
0.021880626678466797
0.026784181594848633
0.03387594223022461
0.04109668731689453
0.04407644271850586
0.049736976623535156
0.05252695083618164
As can be seen from the output, the second assignment requires ~5ms more on each iteration. This behavior leads to long wait times when using large dataframes.
Expected Output
I expected each assignment to require the same amount of time. For example, setting x = 0.2 in the above code gives:
0.0003058910369873047
0.0003154277801513672
0.00029277801513671875
0.0002968311309814453
0.0003590583801269531
0.00033211708068847656
0.00032329559326171875
0.0003428459167480469
0.00043463706970214844
0.0002982616424560547
Output of pd.show_versions()
Comment From: TomAugspurger
Each time you do df[c] = 'ab', you force a dtype coercion from floats to object. This causes a lot of internal work in the blocks underlying the DataFrame (probably related to block consolidation, if you want to search GitHub issues for that).
What's your actual use-case? Do you actually need to iterate over the columns? If so, can you pre-allocate as object dtype to avoid the coercion?
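One way to read the pre-allocation suggestion is sketched below: cast the frame to object dtype before the loop so the per-column string assignment no longer changes the column dtype. The shapes and names are reused from the report above; whether this fully flattens the timings on a given pandas version is worth verifying.

import pandas, numpy
import time

# Pre-allocate as object dtype so df[c] = 'ab' is no longer a float -> object coercion.
df = pandas.DataFrame(numpy.random.random((4000, 100))).astype(object)
x = 'ab'
for c in df.columns.values:
    df[c] = x                # same dtype as the column already has
    start = time.time()
    df.loc[1, 1] = 0.1
    end = time.time()
    if (c + 1) % 10 == 0:
        print(end - start)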
Comment From: toobaz
@TomAugspurger is right... ~~still, I'm surprised by the fact that the problem disappears if we remove~~ the line df.loc[1,1]=0.1, ~~which would seem to be irrelevant~~ is presumably where consolidation happens (since .loc can also access entire rows, it doesn't like to work on non-consolidated blocks, unlike df[.]).
Comment From: jreback
This is an anti-pattern. Setting object columns will force a re-consolidation and a copy each time you iterate in this loop. If you actually want to do this, append each column to a list, then use pd.concat once at the end.
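A minimal sketch of that concat-once pattern, reusing the shapes from the report: build each column as a Series outside the DataFrame, collect them in a list, and concatenate a single time instead of mutating the DataFrame column by column. The exact construction below (one Series per column, concatenated along axis=1) is an illustration, not the only way to apply the advice.

import pandas, numpy

df1 = pandas.DataFrame(numpy.random.random((4000, 100)))
x = 'ab'

# Build every column up front; nothing touches the DataFrame's blocks yet.
columns = [pandas.Series(x, index=df1.index, name=c) for c in df1.columns]

# Single concat at the end, so there is no repeated re-consolidation or copying.
df = pandas.concat(columns, axis=1)
df.loc[1, 1] = 0.1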