Code Sample, a copy-pastable example if possible

import time

import numpy
import pandas

df1 = pandas.DataFrame(numpy.random.random((4000, 100)))
df = df1.copy()
x = 'ab'  # Note: the issue does not occur if x = 0.2
for c in df.columns.values:
    df[c] = x  # Note: this line is required to reproduce the issue
    start = time.time()
    df.loc[1, 1] = 0.1  # time only this second assignment
    end = time.time()
    if (c + 1) % 10 == 0:  # column labels are the ints 0..99, so print every 10th timing
        print(end - start)

Problem description

I copy a dataframe and iterate over its columns, setting each entire column to a default string (x). However, a second assignment using df.loc then takes longer and longer on each iteration. The output I get from the code above is:

0.004847049713134766
0.012796163558959961
0.016080856323242188
0.021880626678466797
0.026784181594848633
0.03387594223022461
0.04109668731689453
0.04407644271850586
0.049736976623535156
0.05252695083618164

As the output shows, each printed timing of the second assignment is roughly 5 ms larger than the previous one (a timing is printed every 10 iterations). This behavior leads to long wait times when working with large dataframes.

Expected Output

I expected each assignment to take roughly the same amount of time. For example, setting x = 0.2 in the code above gives:

0.0003058910369873047
0.0003154277801513672
0.00029277801513671875
0.0002968311309814453
0.0003590583801269531
0.00033211708068847656
0.00032329559326171875
0.0003428459167480469
0.00043463706970214844
0.0002982616424560547

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.8.0-44-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 20.1.1
Cython: None
numpy: 1.12.1
scipy: 0.16.1
statsmodels: None
xarray: None
IPython: 4.0.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.1
openpyxl: None
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

Comment From: TomAugspurger

Each time you do df[c] = 'ab', you force a dtype coercion of that column from float to object. This causes a lot of internal work in the blocks underlying the DataFrame (probably related to block consolidation, if you want to search the GitHub issues for that).
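
For illustration, a minimal sketch of the coercion (the frame here is just a small dummy):

import pandas, numpy

demo = pandas.DataFrame(numpy.random.random((4, 3)))
print(demo.dtypes.unique())  # [dtype('float64')]
demo[0] = 'ab'               # assigning a string coerces the whole column
print(demo[0].dtype)         # object -- the underlying blocks must be rebuilt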

What's your actual use-case? Do you actually need to iterate over the columns? If so, can you pre-allocate as object dtype to avoid the coercion?
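
For example, a minimal sketch of the pre-allocation idea (assuming object dtype is acceptable for the whole frame):

import pandas, numpy

df1 = pandas.DataFrame(numpy.random.random((4000, 100)))
df = df1.astype(object)  # pre-allocate as object dtype once, up front
for c in df.columns.values:
    df[c] = 'ab'         # same dtype now, so no float -> object coercion
df.loc[1, 1] = 0.1       # should no longer slow down as columns are assigned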

Comment From: toobaz

@TomAugspurger is right... ~~still, I'm surprised by the fact that the problem disappears if we remove~~ the line df.loc[1,1] = 0.1, ~~which would seem to be irrelevant,~~ is presumably where consolidation happens (since .loc can also access entire rows, it doesn't like to work on non-consolidated blocks, unlike df[...]).
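
A rough way to watch this happen, using pandas internals (note: _data and its BlockManager are private and version-dependent; this sketch is against 0.19-era internals):

import pandas, numpy

df = pandas.DataFrame(numpy.random.random((4000, 100)))
for c in df.columns.values[:5]:
    df[c] = 'ab'             # each coercing assignment adds a new object block
print(len(df._data.blocks))  # 6: the float block plus five object blocks
df.loc[1, 1] = 0.1           # .loc consolidates the blocks before setting
print(len(df._data.blocks))  # 2: one float block, one merged object block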

Comment From: jreback

This is an anti-pattern: setting object columns forces a re-consolidation and a copy on every iteration of this loop. If you actually want to do this, append each column to a list, then use pd.concat once at the end.
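
A minimal sketch of that pattern (building the columns first, then a single concat):

import pandas, numpy

df1 = pandas.DataFrame(numpy.random.random((4000, 100)))
cols = [pandas.Series('ab', index=df1.index, name=c) for c in df1.columns]
df = pandas.concat(cols, axis=1)  # one concatenation, one consolidation
df.loc[1, 1] = 0.1                # fast: blocks are already consolidated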