Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this issue exists on the latest version of pandas.

  • [X] I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

I'm trying to measure pandas DataFrame memory usage with string[pyarrow] vs object dtypes. While .memory_usage(deep=True) shows string[pyarrow] occupying less memory than object, the picture is very different when I measure the process's memory with psutil. Repro script:

# proc_mem.py
import psutil
import os
import random
import string
import pandas as pd
import gc
import sys

def format_mb(n):
    return f"{n / 1024 ** 2:,.2f} MiB"

def random_string():
    return "".join(random.choices(string.ascii_letters + string.digits + " ", k=random.randint(10, 100)))

def random_strings(n, n_unique=None):
    if n_unique is None:
        n_unique = n
    if n == n_unique:
        return (random_string() for _ in range(n_unique))
    choices = [random_string() for _ in range(n_unique)]
    return (random.choice(choices) for _ in range(n))

if __name__ == "__main__":
    if len(sys.argv) == 4:
        string_dtype = sys.argv[1]
        N = int(sys.argv[2])
        N_UNIQUE = int(sys.argv[3])
    else:
        print(f"Usage: {sys.argv[0]} [STRING_DTYPE] [N] [N_UNIQUE]")
        exit(1)
    process = psutil.Process(os.getpid())
    mem1 = process.memory_info().rss  # RSS before the Series is built
    print(f"{string_dtype = }, {N = :,}, {N_UNIQUE = :,}")
    print(f"before: {format_mb(mem1)}")
    s = pd.Series(random_strings(N, N_UNIQUE), dtype=string_dtype, copy=True)
    mem_s = s.memory_usage(deep=True)  # pandas' own deep estimate
    print(f"pandas reported: {format_mb(mem_s)}")
    mem2 = process.memory_info().rss  # RSS after construction
    print(f"psutil reported: {format_mb(mem2 - mem1)}")
    del s
    gc.collect()
    mem3 = process.memory_info().rss  # RSS after the Series is dropped
    print(f"released: {format_mb(mem2 - mem3)}")

This produces the following output:

> python proc_mem.py 'object' 1_000_000 1_000_000
string_dtype = 'object', N = 1,000,000, N_UNIQUE = 1,000,000
before: 82.80 MiB
pandas reported: 106.84 MiB
psutil reported: 129.78 MiB
released: 104.98 MiB

> python proc_mem.py 'string[pyarrow]' 1_000_000 1_000_000
string_dtype = 'string[pyarrow]', N = 1,000,000, N_UNIQUE = 1,000,000
before: 82.39 MiB
pandas reported: 56.25 MiB
psutil reported: 117.33 MiB
released: 52.39 MiB

pandas tells me that the string[pyarrow] Series consumes 56 MiB of memory, while psutil reports an increase of 117 MiB.

With the object dtype, the pandas and psutil numbers are fairly close. But if not all strings in my dataset are unique, the picture is different:

> python proc_mem.py 'object' 1_000_000 100_000
string_dtype = 'object', N = 1,000,000, N_UNIQUE = 100,000
before: 82.81 MiB
pandas reported: 106.81 MiB
psutil reported: 34.14 MiB
released: 7.98 MiB

> python proc_mem.py 'string[pyarrow]' 1_000_000 100_000
string_dtype = 'string[pyarrow]', N = 1,000,000, N_UNIQUE = 100,000
before: 83.05 MiB
pandas reported: 56.38 MiB
psutil reported: 119.30 MiB
released: 52.53 MiB

Nothing changed with string[pyarrow]: pandas still reports about half the memory usage that psutil does.

With object, pandas now tells me the memory usage is 106 MiB, while psutil sees only a 34 MiB increase, and only ~8 MiB is released after gc.collect().

What is going on here, and how can I explain these results?

I saw these as well; they are possibly related:

  • https://github.com/pandas-dev/pandas/issues/29411
  • https://github.com/pandas-dev/pandas/issues/50547

Installed Versions

INSTALLED VERSIONS
------------------
commit : c2a7f1ae753737e589617ebaaff673070036d653
python : 3.11.0.final.0
python-bits : 64
OS : Darwin
OS-release : 22.4.0
Version : Darwin Kernel Version 22.4.0: Mon Mar 6 20:59:58 PST 2023; root:xnu-8796.101.5~3/RELEASE_ARM64_T6020
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.0.0rc1
numpy : 1.24.2
pytz : 2023.2
dateutil : 2.8.2
setuptools : 67.6.0
pip : 23.0.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.11.0
pandas_datareader : None
bs4 : 4.12.0
bottleneck : None
brotli :
fastparquet : None
fsspec : 2023.3.0
gcsfs : None
matplotlib : 3.7.1
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : None
s3fs : 2023.3.0
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : None
qtpy : None
pyqt5 : None

Prior Performance

No response

Comment From: j-bennet

cc @jrbourbeau

Comment From: rhshadrach

Is part of this due to Python's string caching? https://stackoverflow.com/a/16757434

Comment From: j-bennet

Is part of this due to Python's string caching? https://stackoverflow.com/a/16757434

Pretty sure this is the case for the not-all-unique dataset, yes. But then, shouldn't pandas know that the data doesn't really occupy 106 MiB?
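
A minimal sketch of how the over-counting happens (a toy example, not the dataset from the repro): with duplicated values, the object column holds many references to the same str objects, but the deep estimate adds up sys.getsizeof() once per element, i.e. once per reference, so shared strings are billed repeatedly.

import sys
import pandas as pd

unique = ["x" * 50, "y" * 50]                     # 2 distinct str objects
s = pd.Series(unique * 500_000, dtype=object)     # 1,000,000 references

deep = s.memory_usage(deep=True, index=False)
pointers = s.values.nbytes                        # 8 bytes per reference on 64-bit
distinct = sum(sys.getsizeof(v) for v in set(s))  # the str objects actually alive

print(f"deep estimate:        {deep / 1024**2:.1f} MiB")
print(f"pointer array:        {pointers / 1024**2:.1f} MiB")
print(f"distinct str objects: {distinct / 1024:.2f} KiB")

So the 106 MiB figure is what the data would occupy if every value were its own string, which would explain why psutil sees a much smaller increase when only 100k values are unique.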

Comment From: j-bennet

I see more than one weird thing in this experiment:

  1. If Arrow strings should only occupy 56 MiB according to pandas, why do I see a 117 MiB bump in process memory (~2x)?
  2. With only 1 in 10 strings being unique, assuming CPython string caching is a factor, why does pandas still estimate the memory usage as 106 MiB?

Basically, from this experiment, it looks like I don't save any memory by using Arrow strings, and if my data is not completely unique, object strings would actually fare better because of internal Python optimizations. Is that the conclusion I should draw?
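
One way to narrow down question 1 (a rough sketch; it reuses the random_strings helper from the repro script above, and the numbers are only estimates): build the same strings directly as a pyarrow array and compare its buffer sizes with the 56 MiB pandas reports.

import pyarrow as pa

# Materialize the same million strings (random_strings is the helper from the
# repro script above), then look at the size of the final Arrow buffers.
values = list(random_strings(1_000_000, 1_000_000))
arr = pa.array(values, type=pa.string())

# nbytes covers the UTF-8 data buffer (~55 bytes per string on average here),
# the int32 offsets buffer (4 bytes per value) and any validity bitmap.
print(f"arrow buffers: {arr.nbytes / 1024**2:.1f} MiB")

If that lands near pandas' 56 MiB figure, the extra RSS psutil sees would be coming from the temporary Python strings and list created while the Series is built, plus allocator behavior, rather than from the final Arrow buffers.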

Comment From: rhshadrach

But then, shouldn't pandas know that the data doesn't really occupy 106 MiB?

I don't see why/how pandas would know this. Currently we just take the size of each value.
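
Concretely, for an object-dtype Series the deep estimate is essentially the pointer array plus sys.getsizeof() of each element, duplicates included. A tiny sketch (my own illustration, not code from pandas internals):

import sys
import pandas as pd

s = pd.Series(["pandas", "pyarrow", "pandas"], dtype=object)

# Pointer array plus per-element sizes; the duplicated "pandas" is counted twice.
manual = s.values.nbytes + sum(sys.getsizeof(v) for v in s)

print(manual)
print(s.memory_usage(deep=True, index=False))  # should match the manual sum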