Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this issue exists on the latest version of pandas.

  • [X] I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

import pickle
import pandas as pd

data = ["a"] * 1_000_000

# object dtype
s_object = pd.Series(data, dtype=object)
print(f"{len(pickle.dumps(s_object)) / 1e6 = }")  # serialized size
print(f"{s_object.memory_usage(deep=True) / 1e6 = }\n")  # in-memory size

# `string[pyarrow]` dtype
s_pyarrow = pd.Series(data, dtype="string[pyarrow]")
print(f"{len(pickle.dumps(s_pyarrow)) / 1e6 = }")  # larger than object dtype (unexpected) 
print(f"{s_pyarrow.memory_usage(deep=True) / 1e6 = }")  # smaller than object dtype (expected)

outputs

len(pickle.dumps(s_object)) / 1e6 = 2.002884
s_object.memory_usage(deep=True) / 1e6 = 58.000128

len(pickle.dumps(s_pyarrow)) / 1e6 = 5.000804
s_pyarrow.memory_usage(deep=True) / 1e6 = 5.000128

Installed Versions

INSTALLED VERSIONS
------------------
commit           : 7cb7592523380133f552e258f272a5694e37957a
python           : 3.10.8.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 22.2.0
Version          : Darwin Kernel Version 22.2.0: Fri Nov 11 02:08:47 PST 2022; root:xnu-8792.61.2~4/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 2.0.0.dev0+1147.g7cb7592523
numpy            : 1.24.1
pytz             : 2022.7.1
dateutil         : 2.8.2
setuptools       : 66.0.0
pip              : 22.3.1
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 3.1.2
IPython          : 8.8.0
pandas_datareader: None
bs4              : 4.11.1
bottleneck       : None
brotli           :
fastparquet      : None
fsspec           : 2022.11.0
gcsfs            : None
matplotlib       : 3.6.3
numba            : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 10.0.1
pyreadstat       : None
pyxlsb           : None
s3fs             : 2022.11.0
scipy            : None
snappy           : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
zstandard        : None
tzdata           : None
qtpy             : None
pyqt5            : None

Prior Performance

While investigating the performance of pyarrow-backed dtypes for use in dask, we've seen some promising results for handling text data. Specifically, we see performance gains in both memory usage (as reported by memory_usage(deep=True)) and string computation time (e.g. Series.str.lower()).

However, it appears that when pandas objects with string[pyarrow] data are serialized with pickle, the result is actually larger than if string[python] or object dtypes were used. Naively, I would expect string[pyarrow] to be more efficient than object in this case -- similar to the in-memory representation.

cc @mroeschke @phofl @jorisvandenbossche for visibility

Comment From: mroeschke

Just noting that when pickling the backing numpy / pyarrow arrays directly, the results are similar

In [4]: len(pickle.dumps(np.array(data, dtype=object))) / 1e6
Out[4]: 2.002421

In [5]: len(pickle.dumps(pa.array(data, type=pa.string()))) / 1e6
Out[5]: 5.000198

Comment From: jbrockmendel

Does that suggest that this is an upstream issue?

Comment From: jorisvandenbossche

I suppose that pickle is quite optimized for built-in Python types. First, note that your example creates a list of one million identical (in the Python sense, i.e. the same object) strings. I can imagine that Python can pickle this efficiently without actually serializing the string data one million times, serializing it once and then keeping track that it is repeated.
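
A minimal sketch of that memo effect with plain lists (not from the reproducer above; exact sizes depend on the pickle protocol):

import pickle

s = "hello world"
repeated = [s] * 100_000  # 100k references to the *same* string object
suffix = " world"
distinct = ["hello" + suffix for _ in range(100_000)]  # equal values, but distinct objects

# pickle's memo is keyed on object identity: the repeated string is written
# once and then referenced with a couple of bytes per occurrence, while each
# distinct object is serialized in full.
print(len(pickle.dumps(repeated)) / 1e6)  # small
print(len(pickle.dumps(distinct)) / 1e6)  # several times larger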

If we create some random strings instead, the difference becomes smaller:

import pickle
import random
import string
import pandas as pd

# only this line is different
data = ["".join(random.choices(string.ascii_lowercase, k=5)) for _ in range0)]

# object dtype
s_object = pd.Series(data, dtype=object)
print(f"{len(pickle.dumps(s_object)) / 1e6 = }")  # serialized size
print(f"{s_object.memory_usage(deep=True) / 1e6 = }\n")  # in-memory size

# `string[pyarrow]` dtype
s_pyarrow = pd.Series(data, dtype="string[pyarrow]")
print(f"{len(pickle.dumps(s_pyarrow)) / 1e6 = }")  # larger than object dtype ed) 
print(f"{s_pyarrow.memory_usage(deep=True) / 1e6 = }")  # smaller than object dtype (expected)

gives

len(pickle.dumps(s_object)) / 1e6 = 8.003719
s_object.memory_usage(deep=True) / 1e6 = 62.000128

len(pickle.dumps(s_pyarrow)) / 1e6 = 9.000814
s_pyarrow.memory_usage(deep=True) / 1e6 = 9.000128

The difference is smaller here, but the object dtype pickle is still consistently 1MB smaller. Now, I don't know the details of how a Python string and list get serialized, but as mentioned, I can imagine this is quite efficient. Doing a quick experiment, comparing the pickle size of ["a", "b", "c"] and ["a", "b", "c", "d"], it seems that each extra string needs 3 bytes of overhead (in addition to the bytes of the actual string data). pyarrow, on the other hand, uses int32 offsets, so every extra string requires an extra 4-byte offset value. That gives a difference of 1 byte per string element, and for 1 million elements that matches the 1MB difference seen above.
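
That comparison as a runnable sketch (exact byte counts depend on the pickle protocol):

import pickle

# Each extra one-character string grows the pickle by about 4 bytes:
# roughly 3 bytes of framing (opcode, length byte, memo) plus the character itself.
print(len(pickle.dumps(["a", "b", "c"])))
print(len(pickle.dumps(["a", "b", "c", "d"])))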

A pyarrow string array is quite "dumb" or simple, in comparison. It's just all separate strings (regardless of repetition), using an offset for each of them. And pickling will just pickle the buffers of which the array is composed, without any custom optimization. If you want to optimize the storage of that in pyarrow, you would use an encoding like dictionary encoding (or the new run-end encoding).
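
A rough check of both points (a sketch assuming pyarrow is installed; sizes are approximate and the index type is whatever dictionary_encode picks, typically int32):

import pickle
import pyarrow as pa

data = ["a"] * 1_000_000
arr = pa.array(data, type=pa.string())

# The pickle is essentially the raw buffers: ~4 MB of int32 offsets
# plus ~1 MB of character data, with no deduplication.
print(arr.nbytes / 1e6)
print(len(pickle.dumps(arr)) / 1e6)

# Dictionary encoding stores each distinct value once plus integer indices.
# With int32 indices and one-character values the saving here is modest
# (~4 MB vs ~5 MB), but it grows quickly for longer repeated strings.
dict_arr = arr.dictionary_encode()
print(len(pickle.dumps(dict_arr)) / 1e6)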

Now, the above is for small strings, where the string data itself is small compared to the overhead of storing each string (the offset for pyarrow, the framing bytes for pickle). If the strings become larger, the actual string data becomes relatively more important, and the difference we saw above becomes insignificant:

# data = ["".join(random.choices(string.ascii_lowercase, k=10)) for _ in range(100_000)]
len(pickle.dumps(s_object)) / 1e6 = 1.300992
s_object.memory_usage(deep=True) / 1e6 = 6.700128

len(pickle.dumps(s_pyarrow)) / 1e6 = 1.400814
s_pyarrow.memory_usage(deep=True) / 1e6 = 1.400128

# data = ["".join(random.choices(string.ascii_lowercase, k=100)) for _ in range(10_000)]
len(pickle.dumps(s_object)) / 1e6 = 1.030768
s_object.memory_usage(deep=True) / 1e6 = 1.570128

len(pickle.dumps(s_pyarrow)) / 1e6 = 1.040797
s_pyarrow.memory_usage(deep=True) / 1e6 = 1.040128