I found a performance issue involving modifying DataFrame objects. Adding columns where a value is present for every element of the index (i.e., len(column) == len(index)) is really, really slow.

import numpy as np
import pandas
import time

values = np.random.randn(100)

# Version A: build all 10,000 columns up front and pass them to the constructor
t0 = time.time()
columns = [values for i in range(10000)]
df = pandas.DataFrame(np.column_stack(columns),
                      columns=range(len(columns)),
                      index=values)
print('A took %.3f seconds' % (time.time() - t0))

# Version B: insert the 10,000 columns one at a time
t0 = time.time()
df = pandas.DataFrame(index=values)
for i in range(10000):
    df[i] = values
print('B took %.3f seconds' % (time.time() - t0))

Running it on my laptop shows that the first version (doing all the work in __init__) is much faster:

A took 0.070 seconds
B took 10.717 seconds

The cProfile results show hotspots involving the index:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    13410    2.716    0.000    2.716    0.000 {method 'astype' of 'numpy.ndarray' objects}
    12246    2.024    0.000    2.024    0.000 {numpy.core.multiarray.concatenate}
    22274    1.443    0.000    1.501    0.000 index.py:277(__contains__)
    10555    1.103    0.000    1.103    0.000 {method 'copy' of 'numpy.ndarray' objects}
      555    0.639    0.001    3.711    0.007 internals.py:823(_consolidate_inplace)
      555    0.429    0.001    3.072    0.006 internals.py:1370(_consolidate)
    11135    0.299    0.000    8.921    0.001 internals.py:892(insert)
    11141    0.217    0.000    0.217    0.000 {pandas.lib.is_integer_array}
   121164    0.163    0.000    0.163    0.000 {numpy.core.multiarray.array}
    25105    0.140    0.000    3.493    0.000 index.py:75(__new__)
   342808    0.107    0.000    0.107    0.000 {isinstance}
     1691    0.106    0.000    0.115    0.000 index.py:211(is_unique)
    11138    0.096    0.000    0.096    0.000 {pandas.lib.list_to_object_array}

Comment From: wesm

Unfortunately we have to table any improvements here until a later release, when we can make deeper infrastructural changes to DataFrame's internals. Each time you insert a column, pandas has to modify the column index, and unfortunately this is not a cheap operation (it takes about 100 microseconds when you already have 10,000 columns). I don't recommend writing a lot of code like this that inserts tons of columns; instead, populate a dict and then turn that into a DataFrame.
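To make that concrete, here is a minimal sketch of the dict-then-construct pattern, reusing the values array from the report above (the loop bound and column labels are just illustrative):

import numpy as np
import pandas

values = np.random.randn(100)

# Accumulate the columns in a plain dict; inserting a key is cheap and does not
# touch any pandas column index.
data = {i: values for i in range(10000)}

# Build the DataFrame once, so the column index is constructed a single time.
df = pandas.DataFrame(data, index=values)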

Comment From: wesm

Closing as Won't Fix. Applications that add a lot of new columns to a DataFrame can avoid this issue (as long as we are using NumPy arrays as the container for column names) by preallocating empty columns and then inserting the data.
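For reference, a rough sketch of the preallocation idea described here, assuming the number and names of the columns are known up front; whether assigning into existing columns is actually faster than inserting new ones will depend on the pandas version:

import numpy as np
import pandas

values = np.random.randn(100)
n_cols = 10000

# Allocate all columns in a single constructor call, so the column index is
# built only once.
df = pandas.DataFrame(np.empty((len(values), n_cols)),
                      columns=range(n_cols),
                      index=values)

# Assign into the existing columns; this replaces column data without growing
# the column index.
for i in range(n_cols):
    df[i] = values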

Comment From: danielhrisca

Using a dict with many arrays blows up the RAM. Is there a time- and memory-efficient way of adding columns or creating a DataFrame?

Comment From: vanschelven

Sad to see this old issue closed, as I'm also suffering from this so many years later.

The particular scenario where this bothers me is when using pandas with DataFrames that have very few rows (e.g. a single row). In that context, the per-column slowness is not nicely amortised over a large number of rows, and the per-row cost balloons.

I understand that data with few rows is not exactly Pandas' sweet spot. On the other hand, the fact that Pandas is not performant in the small-data scenario implies that it cannot be employed as a single solution across the big-data / small-data domains, which in turn may take Pandas off the table as a potential tool for the big-data scenario too.

Also, just for reference, to show how bad this is even when there is no data at all, on a modern version of Pandas:

>>> import pandas as pd
>>> print(pd.__version__)
0.24.2
>>> 
>>> from contextlib import contextmanager
>>> from time import time
>>> 
>>> 
>>> @contextmanager
... def timed():
...     t0 = time()
...     yield
...     print(time() - t0)
... 
>>> 
>>> empty_series = pd.Series([])
>>> 
>>> with timed():
...     df = pd.DataFrame()
...     for i in range(1000):
...         df[str(i)] = empty_series
... 
1.2648789882659912
>>> with timed():
...     d = {}
...     for i in range(1000):
...         d[str(i)] = empty_series
... 
0.0007038116455078125
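
For completeness, a small sketch (not from the original transcript) of how the fast dict path above can still end in a single DataFrame, following the dict-then-construct suggestion from earlier in this issue; the timing will of course vary by machine and pandas version:

import pandas as pd
from time import time

# dtype given explicitly to avoid the empty-Series default-dtype warning in
# newer pandas versions.
empty_series = pd.Series([], dtype=float)

t0 = time()
d = {str(i): empty_series for i in range(1000)}
df = pd.DataFrame(d)  # one construction instead of 1000 per-column inserts
print(time() - t0)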