Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this issue exists on the latest version of pandas.
- [X] I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
I'm trying to measure pandas DataFrame memory usage with `string[pyarrow]` vs `object`. While `.memory_usage(deep=True)` shows `string[pyarrow]` occupying less memory than `object`, the results are very different when measuring memory usage with `psutil`. Repro script:
```python
# proc_mem.py
import psutil
import os
import random
import string
import pandas as pd
import gc
import sys


def format_mb(n):
    return f"{n / 1024 ** 2:,.2f} MiB"


def random_string():
    return "".join(random.choices(string.ascii_letters + string.digits + " ", k=random.randint(10, 100)))


def random_strings(n, n_unique=None):
    if n_unique is None:
        n_unique = n
    if n == n_unique:
        return (random_string() for _ in range(n_unique))
    choices = [random_string() for _ in range(n_unique)]
    return (random.choice(choices) for _ in range(n))


if __name__ == "__main__":
    if len(sys.argv) == 4:
        string_dtype = sys.argv[1]
        N = int(sys.argv[2])
        N_UNIQUE = int(sys.argv[3])
    else:
        print(f"Usage: {sys.argv[0]} [STRING_DTYPE] [N] [N_UNIQUE]")
        exit(1)

    process = psutil.Process(os.getpid())
    mem1 = process.memory_info().rss
    print(f"{string_dtype = }, {N = :,}, {N_UNIQUE = :,}")
    print(f"before: {format_mb(mem1)}")

    s = pd.Series(random_strings(N, N_UNIQUE), dtype=string_dtype, copy=True)
    mem_s = s.memory_usage(deep=True)
    print(f"pandas reported: {format_mb(mem_s)}")

    mem2 = process.memory_info().rss
    print(f"psutil reported: {format_mb(mem2 - mem1)}")

    del s
    gc.collect()
    mem3 = process.memory_info().rss
    print(f"released: {format_mb(mem2 - mem3)}")
```
This produces the following output:
```
> python proc_mem.py 'object' 1_000_000 1_000_000
string_dtype = 'object', N = 1,000,000, N_UNIQUE = 1,000,000
before: 82.80 MiB
pandas reported: 106.84 MiB
psutil reported: 129.78 MiB
released: 104.98 MiB

> python proc_mem.py 'string[pyarrow]' 1_000_000 1_000_000
string_dtype = 'string[pyarrow]', N = 1,000,000, N_UNIQUE = 1,000,000
before: 82.39 MiB
pandas reported: 56.25 MiB
psutil reported: 117.33 MiB
released: 52.39 MiB
```
`pandas` tells me that the `string[pyarrow]` Series consumes 56 MiB of memory, while `psutil` reports an increase of 117 MiB.
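For reference, one cross-check I'd try for the Arrow-backed case (a sketch; it assumes the Series buffers come from pyarrow's default memory pool, and uses `pa.total_allocated_bytes()`, pyarrow's own allocator counter):

```python
# Sketch: compare pandas' deep memory_usage against pyarrow's allocator counter.
import pyarrow as pa
import pandas as pd

before = pa.total_allocated_bytes()
s = pd.Series(["x" * 50] * 1_000_000, dtype="string[pyarrow]")
after = pa.total_allocated_bytes()

print(f"arrow allocator delta: {(after - before) / 1024**2:,.2f} MiB")
print(f"memory_usage(deep):    {s.memory_usage(deep=True) / 1024**2:,.2f} MiB")
```

Any RSS growth beyond those two numbers would presumably come from the temporary Python string objects created before the Arrow conversion, plus memory the allocator keeps around without returning it to the OS.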
With the `object` dtype, `pandas` and `psutil` memory usage is very close. But if not all strings in my dataset are unique, the picture is different:
```
> python proc_mem.py 'object' 1_000_000 100_000
string_dtype = 'object', N = 1,000,000, N_UNIQUE = 100,000
before: 82.81 MiB
pandas reported: 106.81 MiB
psutil reported: 34.14 MiB
released: 7.98 MiB

> python proc_mem.py 'string[pyarrow]' 1_000_000 100_000
string_dtype = 'string[pyarrow]', N = 1,000,000, N_UNIQUE = 100,000
before: 83.05 MiB
pandas reported: 56.38 MiB
psutil reported: 119.30 MiB
released: 52.53 MiB
```
Nothing changed with `string[pyarrow]`: `pandas` still reports roughly half the memory usage that `psutil` does.
With `object`, `pandas` now tells me that memory usage is 106 MiB, while `psutil` sees only a 34 MiB increase, and only 7 MiB is released after `gc.collect()`.
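A quick way to see part of what's happening with `object` in the non-unique run (a sketch with data shaped like the repro above; the counts in the comments are expectations, not measured output):

```python
# Sketch: the 1,000,000-row object Series is backed by far fewer distinct
# Python string objects, which keeps the real (RSS) footprint low.
import random
import string
import pandas as pd

def random_string():
    return "".join(random.choices(string.ascii_letters + string.digits + " ", k=50))

choices = [random_string() for _ in range(100_000)]
s = pd.Series([random.choice(choices) for _ in range(1_000_000)], dtype="object")

print(len({id(x) for x in s.values}))   # ~100,000 distinct objects for 1,000,000 rows
print(s.memory_usage(deep=True))        # still sizes every one of the 1,000,000 rows
```

If `memory_usage(deep=True)` sizes each row independently, repeated references to the same string get counted once per row, which would explain the 106 MiB figure.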
What is going on here, and how can I explain these results?
I saw these as well; they are possibly related:
- https://github.com/pandas-dev/pandas/issues/29411
- https://github.com/pandas-dev/pandas/issues/50547
Installed Versions
Prior Performance
No response
Comment From: j-bennet
cc @jrbourbeau
Comment From: rhshadrach
Is part of this due to Python's string caching? https://stackoverflow.com/a/16757434
Comment From: j-bennet
> Is part of this due to Python's string caching? https://stackoverflow.com/a/16757434

Pretty sure this is the case for the not-all-unique dataset, yes. But then, shouldn't `pandas` know that the data doesn't really occupy 106 MiB?
Comment From: j-bennet
I see more than one weird thing in this experiment:
- if arrow strings should only occupy 56 MiB according to `pandas`, why do I see a 117 MiB bump in process memory (~2x)?
- with only 1/10 of the strings being unique, assuming CPython string caching is a factor, why does `pandas` still estimate memory usage as 106 MiB?

Basically, from this experiment, it looks like I don't save any memory by using arrow strings, and if my data is not completely unique, `object` would actually fare better because of internal Python optimizations. Is that the conclusion I should draw?
Comment From: rhshadrach
> But then, shouldn't `pandas` know that the data doesn't really occupy 106 MiB?
I don't see why/how pandas would know this. Currently we just take the size for each value.
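For illustration, a de-duplicated accounting for `object` dtype could look something like the sketch below (`dedup_memory_usage` is hypothetical, not a pandas API, and it only de-duplicates repeated references to the same Python object):

```python
# Hypothetical sketch: count each distinct Python object once instead of once
# per row (dedup_memory_usage is not part of pandas).
import sys
import pandas as pd

def dedup_memory_usage(s: pd.Series) -> int:
    seen = set()
    total = s.values.nbytes  # 8 bytes per object pointer on 64-bit builds
    for obj in s.values:
        if id(obj) not in seen:
            seen.add(id(obj))
            total += sys.getsizeof(obj)
    return total
```

Whether that kind of bookkeeping belongs in `memory_usage(deep=True)` is a separate question, since it trades simple per-value sizing for tracking object identity across the whole column.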