Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this issue exists on the latest version of pandas.

  • [ ] I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

import time
import pandas as pd
import os
import psutil
import tqdm

columns = {
    'a': 'category',
    'b': 'category',
    'c': 'category',
    'd': 'category',
    'e': 'float64'
}

df = pd.DataFrame()
total_size = 198_748_533
chunk_size = 1_000_000
chunks = []

with pd.read_csv('test.csv', engine='c', low_memory=True, dtype=columns, chunksize=chunk_size) as reader:
    iterator = tqdm.tqdm(reader, total=int(total_size / chunk_size + 1))
    for chunk in iterator:
        chunks.append(chunk)
df = pd.concat(chunks, copy=False, ignore_index=True)

I have a large 14 GB CSV file and 16 GB of RAM, so I read it in chunks and concatenate them afterwards. Reading all the chunks consumes ~6.6 GB RSS:

[screenshot: RSS memory usage after reading the chunks]

When the concatenation starts, memory usage doubles or triples:

[screenshot: RSS memory usage during pd.concat]
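For reference, the RSS figures quoted here can be observed with psutil, which the repro imports but never calls. A minimal sketch, assuming it replaces the final pd.concat line of the repro above:

import os
import psutil

def rss_gb() -> float:
    """Resident set size of the current process, in GiB."""
    return psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3

print(f"RSS after reading chunks: {rss_gb():.1f} GiB")   # ~6.6 GiB reported above
df = pd.concat(chunks, copy=False, ignore_index=True)
print(f"RSS after pd.concat:      {rss_gb():.1f} GiB")   # roughly 2-3x, per the screenshots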

Expected behaviour

Each individual DataFrame passed to pd.concat may need its memory doubled while it is being copied, but the memory of the whole concatenated result should not be doubled.

Installed Versions

INSTALLED VERSIONS
------------------
commit : ca60aab7340d9989d9428e11a51467658190bb6b
python : 3.9.2.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.0-13-amd64
Version : #1 SMP Debian 5.10.106-1 (2022-03-17)
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : None
LOCALE : en_US.UTF-8

pandas : 1.4.4
numpy : 1.22.4
pytz : 2022.2.1
dateutil : 2.8.2
setuptools : 52.0.0
pip : 20.3.4
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.9.0
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None

Prior Performance

No response

Comment From: AlexKirko

Interesting. This looks like intended behavior to me, and the only way to avoid it would be to add a special-case shortcut for when the DataFrames passed in align perfectly and nothing needs to be modified before concatenation. I'm not sure that's worth it, because you can reduce the memory usage, off the top of my head, by reading a chunk, stripping it down to the columns you need, putting each chunk's values into a list of ndarrays, and casting that directly to a DataFrame.
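A rough sketch of that workaround, not taken from the thread: it assumes the data is stripped down to columns that share one numeric dtype (with mixed dtypes, going through ndarrays loses the categorical encoding and most of the benefit), and the file name and column list below are placeholders:

import numpy as np
import pandas as pd

keep = ['e']  # hypothetical: keep only the float column from the example above
arrays = []

with pd.read_csv('test.csv', usecols=keep, chunksize=1_000_000) as reader:
    for chunk in reader:
        # a single-dtype frame converts to a plain 2-D ndarray
        arrays.append(chunk.to_numpy(dtype='float64'))

# build one contiguous ndarray, drop the per-chunk arrays, then
# wrap it in a DataFrame without going through pd.concat
data = np.concatenate(arrays)
del arrays
df = pd.DataFrame(data, columns=keep)

np.concatenate still briefly needs the full copy alongside the chunk arrays, so peak usage is roughly 2x the numeric payload, which should still be well below 2-3x the full DataFrames.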

This is what's clear from the code: after the loop finishes, you end up with a list[pandas.DataFrame], plus the local variables of pd.read_csv that probably haven't been garbage-collected yet. Then you run pd.concat, which copies the list of DataFrames to work with them locally:

        if keys is None:
            objs = list(com.not_none(*objs))
        else:
            # #1649
            clean_keys = []
            clean_objs = []
            for k, v in zip(keys, objs):
                if v is None:
                    continue
                clean_keys.append(k)
                clean_objs.append(v)
            objs = clean_objs

It looks to me like this copying is done because, in general, pd.concat needs to modify the objects before joining them; consider the case where the objects don't line up exactly, for example. The resulting DataFrame is then returned. So yes, in the process of concatenation, before the local variables are cleaned up, we get ~3x the memory usage. If you set copy=True, you get additional copying of the values instead of views.

This behavior doesn't look unintended to me, so I'm marking the issue as a performance discussion.

Comment From: AlexKirko

Unless I've misunderstood something, my opinion is that adding a shortcut that avoids copying the objects entirely isn't worth it. It would be a kludge, and when someone runs into a memory error in cases like this, it can be worked around in a couple of ways:

  • don't concat at all, just iterate over the chunks directly (see the sketch below)

  • concatenate the ndarray data instead, overwriting each chunk as you go, then cast everything to a pd.DataFrame
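For the first option, a minimal sketch of processing each chunk as it arrives instead of materialising the whole frame; the per-chunk reduction shown (a running group sum) is only a placeholder for whatever the downstream computation actually is:

import pandas as pd

dtypes = {'a': 'category', 'b': 'category', 'c': 'category',
          'd': 'category', 'e': 'float64'}

running = None
with pd.read_csv('test.csv', dtype=dtypes, chunksize=1_000_000) as reader:
    for chunk in reader:
        # hypothetical aggregation: sum of 'e' for each value of 'a'
        partial = chunk.groupby('a', observed=True)['e'].sum()
        running = partial if running is None else running.add(partial, fill_value=0)

print(running)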

Comment From: AlexKirko

I'll close this next Saturday if there is no additional input from anyone.

Comment From: AlexKirko

Since there are no comments, I'm closing the issue.

Comment From: wscdrpj

In my situation, I'm trying to read a 30 GB HDF5 file from disk using hdfFile.select, like this:

with pd.HDFStore('test2.h5') as hdfFile:
    nrows = hdfFile.get_storer('ftable').nrows
    dataIter = hdfFile.select('ftable')

Before reading, I have 55 GB of memory free. After several minutes, it throws "MemoryError: Unable to allocate 30 GiB for an array with shape (1500, 5703231) and data type float32". So I read in chunks instead:

with pd.HDFStore('test2.h5') as hdfFile:
    nrows = hdfFile.get_storer('ftable').nrows
    dataIter = hdfFile.select('ftable', chunksize=100000)
    chunks = [data for data in dataIter]

With this code I can get a 30 GB list[pandas.DataFrame], but there isn't enough memory to concat it into a DataFrame: only 55 - 30 = 25 GB are left, and if I use df = pd.concat(chunks), I get the same "MemoryError: Unable to allocate 30 GiB for an array with shape (1500, 5703231) and data type float32".
I think what you mentioned above would be helpful, but I don't understand it well (e.g. how to cast a list of ndarrays directly to a DataFrame), so a code example would be helpful, or any other advice for my situation is welcome!
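Not the original commenter, but here is one way that could look, sketched under two assumptions: every column of 'ftable' is float32 (as the error message suggests), and the original index can be discarded. Preallocating the final array from nrows means you never hold the chunk list and a concatenated copy at the same time:

import numpy as np
import pandas as pd

with pd.HDFStore('test2.h5') as hdfFile:
    nrows = hdfFile.get_storer('ftable').nrows

    # read one row just to get the column labels
    columns = hdfFile.select('ftable', start=0, stop=1).columns

    # preallocate the final float32 block (~30 GiB here) up front,
    # then fill it chunk by chunk so each chunk can be freed immediately
    data = np.empty((nrows, len(columns)), dtype='float32')
    row = 0
    for chunk in hdfFile.select('ftable', chunksize=100_000):
        data[row:row + len(chunk)] = chunk.to_numpy(dtype='float32')
        row += len(chunk)

# wrapping the single-dtype ndarray in a DataFrame avoids another large copy;
# note the original index is replaced by a default RangeIndex
df = pd.DataFrame(data, columns=columns)

Peak usage should then be roughly the 30 GiB output array plus one 100,000-row chunk, which fits in the 25 GB headroom problem you describe far better than holding the chunk list and the concatenated result simultaneously.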