Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this issue exists on the latest version of pandas.
- [ ] I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
import time
import pandas as pd
import os
import psutil
import tqdm
columns = {
    'a': 'category',
    'b': 'category',
    'c': 'category',
    'd': 'category',
    'e': 'float64'
}
df = pd.DataFrame()
total_size = 198_748_533
chunk_size = 1_000_000
chunks = []
with pd.read_csv('test.csv', engine='c', low_memory=True, dtype=columns, chunksize=chunk_size) as reader:
    iterator = tqdm.tqdm(reader, total=int(total_size / chunk_size + 1))
    for chunk in iterator:
        chunks.append(chunk)
df = pd.concat(chunks, copy=False, ignore_index=True)
I have a large 14 GB CSV file and 16 GB of RAM, so I read it in chunks and concatenate them afterwards. Reading all the chunks consumes ~6.6 GB RSS, and when the concatenation starts, memory usage doubles or triples.
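For reference, a minimal sketch of how the RSS figures can be checked, using the psutil and os imports already in the example above (the helper name is just for illustration):
def rss_gb() -> float:
    # Resident set size of the current process, in GiB (via psutil).
    return psutil.Process(os.getpid()).memory_info().rss / 2**30

print(f"RSS after reading chunks: {rss_gb():.1f} GiB")
df = pd.concat(chunks, copy=False, ignore_index=True)
print(f"RSS after concat: {rss_gb():.1f} GiB")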
Expected behaviour
At most, each individual DataFrame passed to pd.concat should need its memory doubled, not the whole concatenated result at once.
Installed Versions
Prior Performance
No response
Comment From: AlexKirko
Interesting. This looks like intended behavior to me, and the only way to avoid it would be a special shortcut for the case where we pass a bunch of DataFrame objects that align perfectly and nothing needs to be modified before concatenation. I'm not sure it's worth it, though, because you can reduce the memory usage by reading a chunk, stripping the columns you don't need, putting each chunk's DataFrame.values into a list of ndarray, and then casting that directly to a DataFrame, roughly as sketched below.
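Something along the lines of this rough, untested sketch, where 'e' is assumed to be the only column actually needed and columns / chunk_size are the ones from the original example:
import numpy as np
import pandas as pd

arrays = []
with pd.read_csv('test.csv', engine='c', dtype=columns, chunksize=chunk_size) as reader:
    for chunk in reader:
        # strip down to the needed column(s) and keep only the underlying ndarray
        arrays.append(chunk[['e']].to_numpy())
        del chunk  # let the full chunk be collected right away

values = np.concatenate(arrays)
del arrays
df = pd.DataFrame(values, columns=['e'])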
This is what's clear from the code:
After you run the loop, you end up with a list[pandas.DataFrame] plus the local variables of pd.read_csv that likely haven't been garbage collected yet. Then you run pd.concat, which copies the list of DataFrames to work with them locally:
if keys is None:
    objs = list(com.not_none(*objs))
else:
    # #1649
    clean_keys = []
    clean_objs = []
    for k, v in zip(keys, objs):
        if v is None:
            continue
        clean_keys.append(k)
        clean_objs.append(v)
    objs = clean_objs
It looks to me like this copying is done because pd.concat generally needs to modify the objects before joining them; consider the case where the objects don't line up exactly, for example. Then the resulting DataFrame is returned. So yes, in the process of concatenation, before the local variables are cleaned up, we get ~3x the memory usage. If you set copy=True, you get additional copying of the values instead of viewing them.
This behavior doesn't look unintended to me, so I'm marking the issue as a performance discussion.
Comment From: AlexKirko
Unless I misunderstood something, my opinion is that adding a shortcut that avoids copying the objects completely isn't worth it. It would be a kludge, and when someone runs into a memory error in such cases, it can be circumvented via a couple of options:
* don't concat at all, just iterate over the chunks directly (see the sketch after this list)
* concat the ndarray data instead while overwriting each chunk, then cast everything to pd.DataFrame (as sketched earlier)
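For the first option, a minimal sketch; the groupby-sum is just a stand-in for whatever per-chunk work is actually needed, and columns / chunk_size are the ones from the original example:
import pandas as pd

totals = None
with pd.read_csv('test.csv', engine='c', dtype=columns, chunksize=chunk_size) as reader:
    for chunk in reader:
        # fold each chunk into a running aggregate instead of keeping the chunk around
        part = chunk.groupby('a', observed=True)['e'].sum()
        totals = part if totals is None else totals.add(part, fill_value=0)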
Comment From: AlexKirko
I'll close this next Saturday if there is no additional input from anyone.
Comment From: AlexKirko
Since there are no comments, I'm closing the issue.
Comment From: wscdrpj
In my situation, I try to read a 30 GB HDF5 file from disk using hdfFile.select, like this:
with pd.HDFStore('test2.h5') as hdfFile:
    nrows = hdfFile.get_storer('ftable').nrows
    dataIter = hdfFile.select('ftable')
Before reading, I have 55 GB of memory free. After several minutes, it throws "MemoryError: Unable to allocate 30 GiB for an array with shape (1500, 5703231) and data type float32". So I read in chunks instead, like this:
with pd.HDFStore('test2.h5') as hdfFile:
    nrows = hdfFile.get_storer('ftable').nrows
    dataIter = hdfFile.select('ftable', chunksize=100000)
    chunks = [data for data in dataIter]
From this code I get a 30 GB list[pandas.DataFrame], but there isn't enough memory to concat it into a DataFrame: only 55 - 30 = 25 GB is left, and if I call df = pd.concat(chunks), I get the same "MemoryError: Unable to allocate 30 GiB for an array with shape (1500, 5703231) and data type float32". I think what you mentioned above would be helpful, but I don't understand it well (e.g. how to cast a list of ndarray directly to a DataFrame), so a code example would be helpful, or any other advice for my situation is welcome!
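For reference, one way to apply the ndarray suggestion to this case is to pre-allocate the final float32 array (nrows is already known from the storer) and fill it chunk by chunk, so the 30 GB list of chunks never exists alongside the 30 GB result. A rough, untested sketch:
import numpy as np
import pandas as pd

with pd.HDFStore('test2.h5') as hdfFile:
    nrows = hdfFile.get_storer('ftable').nrows
    first = hdfFile.select('ftable', start=0, stop=1)  # only to get the column labels
    # pre-allocate the full result once, then copy each chunk into place
    values = np.empty((nrows, len(first.columns)), dtype='float32')
    pos = 0
    for chunk in hdfFile.select('ftable', chunksize=100000):
        values[pos:pos + len(chunk)] = chunk.to_numpy(dtype='float32')
        pos += len(chunk)
        del chunk
df = pd.DataFrame(values, columns=first.columns)
This still needs the single 30 GB result array to fit in memory, but nothing else of comparable size is held at the same time.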