For example, I want to concatenate/append two DataFrames into one and only use the final DataFrame later on:
```python
import numpy as np
import pandas as pd

num_rows = 100000

def make_df(num_rows=10000):
    df = pd.DataFrame(np.random.rand(num_rows, 5), columns=list('abcde'))
    df['foo'] = 'foo'
    df['bar'] = 'bar'
    df['baz'] = 'baz'
    df['date'] = pd.date_range('20000101 09:00:00',
                               periods=num_rows,
                               freq='s')
    df['int'] = np.arange(num_rows, dtype='int64')
    return df

df1 = make_df(num_rows=num_rows)
df2 = make_df(num_rows=num_rows)
df = pd.concat([df1, df2])
```
During the execution of the last statement, memory usage doubles, because pd.concat copies both inputs into a newly allocated result.
Is it possible to have some method like df1.append_but_modify_lvalue(df2, ignore_index=True), so that we don't need to copy-assign the return value?
Comment From: chris-b1
No, this is essentially impossible, for the same reason you can't concatenate numpy arrays without a copy - see http://stackoverflow.com/a/7869472/3657742.
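A quick way to see this: NumPy arrays are contiguous buffers, so joining two of them has to allocate a new one. A minimal illustrative check (not from the linked answer, just a demo):

```python
import numpy as np

a = np.random.rand(3)
b = np.random.rand(3)
c = np.concatenate([a, b])

# The result lives in a freshly allocated buffer; it shares
# no memory with either input.
print(np.shares_memory(c, a))  # False
print(np.shares_memory(c, b))  # False
```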
If you know how big the result is going to be, your best bet is probably to pre-allocate NumPy array(s) of the full size, place the values into them, and construct the DataFrame from the single large array.
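A minimal sketch of that pre-allocation idea, shown only for the float columns (the make_df frames above mix dtypes, so real code would need one pre-allocated array per dtype block; the names part1/part2/out are just for this example):

```python
import numpy as np
import pandas as pd

num_rows = 100000

# Two float blocks standing in for df1/df2's numeric columns.
part1 = np.random.rand(num_rows, 5)
part2 = np.random.rand(num_rows, 5)

# Pre-allocate one buffer of the final size and fill it in place,
# so no concatenation copy is ever made.
out = np.empty((2 * num_rows, 5))
out[:num_rows] = part1
out[num_rows:] = part2

# Build the DataFrame from the single pre-filled array; for a
# homogeneous 2-D float array, pandas has historically been able
# to adopt the buffer rather than copy it (version-dependent).
df = pd.DataFrame(out, columns=list('abcde'))
```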