Pandas version checks

- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this issue exists on the latest version of pandas.
- [ ] I have confirmed this issue exists on the main branch of pandas.
Reproducible Example

When investigating Dask's shuffle performance on `string[pyarrow]` data, I observed that `pd.util.hash_pandas_object` was slightly less performant on `string[pyarrow]` data than on regular `object`-dtype data. This surprised me, as I would have expected hashing pyarrow-backed data to be faster than hashing Python objects.
```python
In [1]: import pandas as pd

In [2]: s = pd.Series(range(2_000))

In [3]: s_object = s.astype(object)

In [4]: s_pyarrow = s.astype("string[pyarrow]")

In [5]: %timeit pd.util.hash_pandas_object(s_object)
859 µs ± 11.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [6]: %timeit pd.util.hash_pandas_object(s_pyarrow)
1.01 ms ± 29.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```
Installed Versions
INSTALLED VERSIONS
------------------
commit : 87cfe4e38bafe7300a6003a1d18bd80f3f77c763
python : 3.10.6.final.0
python-bits : 64
OS : Darwin
OS-release : 21.6.0
Version : Darwin Kernel Version 21.6.0: Wed Aug 10 14:25:27 PDT 2022; root:xnu-8020.141.5~2/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.5.0
numpy : 1.23.3
pytz : 2022.4
dateutil : 2.8.2
setuptools : 65.4.1
pip : 22.2.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.5.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 9.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None
Prior Performance
No response
Comment From: rhshadrach
I'm seeing the same behavior on 1.4.x:

```python
503 µs ± 2.47 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
594 µs ± 496 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```

Faster on my machine, but the same ratio (~1.18).
Comment From: mroeschke
Just noting that the `string[python]` dtype is comparable to `string[pyarrow]`, but pyarrow is still slower. Hashing is still done in terms of numpy arrays, so I imagine that is adding overhead.
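That numpy round-trip can be seen in a small sketch (plain `object`-dtype strings are used here so the example only requires pandas, not pyarrow): `hash_pandas_object` ultimately funnels values through `pd.util.hash_array` on a numpy array, so an Arrow-backed column must first be materialized as a numpy object array before hashing.

```python
import pandas as pd

# A small object-dtype Series of strings (stand-in for the arrays above).
s = pd.Series([str(i) for i in range(5)], dtype=object)

# hash_pandas_object hashes the values via hash_array on a numpy array;
# for Arrow-backed data this implies a numpy conversion first.
vals = pd.util.hash_array(s.to_numpy(), categorize=False)

# Same per-value hashes as hash_pandas_object without the index contribution.
via_series = pd.util.hash_pandas_object(s, index=False, categorize=False)
```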
```python
In [6]: s_str = s.astype("string[python]")

In [7]: %timeit pd.util.hash_pandas_object(s_object)
856 µs ± 3.88 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [8]: %timeit pd.util.hash_pandas_object(s_str)
952 µs ± 3.28 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [9]: %timeit pd.util.hash_pandas_object(s_pyarrow)
977 µs ± 6.16 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```
EDIT: One more interesting observation: `categorize=True`, the default here, goes through a categorical routine that is supposed to provide a speedup when there are duplicates, of which there are none in this example.
```python
In [8]: %timeit pd.util.hash_pandas_object(s_object, categorize=True)
843 µs ± 3.27 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [9]: %timeit pd.util.hash_pandas_object(s_object, categorize=False)
986 µs ± 2.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [10]: %timeit pd.util.hash_pandas_object(s_pyarrow, categorize=True)
986 µs ± 15.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [11]: %timeit pd.util.hash_pandas_object(s_pyarrow, categorize=False)
515 µs ± 6.39 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```
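To illustrate the categorize route, here is a toy model in plain Python (not pandas' actual implementation): with `categorize=True`, each unique value is hashed once and the result is broadcast back through the integer codes, which only pays off when there are duplicates.

```python
import hashlib

def hash_plain(values):
    # Hash every element individually (analogous to categorize=False).
    return [hashlib.sha1(str(v).encode()).hexdigest() for v in values]

def hash_categorized(values):
    # Factorize first: hash each unique value once, then broadcast the
    # hashes back through the integer codes (analogous to categorize=True).
    uniques, codes = {}, []
    for v in values:
        codes.append(uniques.setdefault(v, len(uniques)))
    unique_hashes = [hashlib.sha1(str(u).encode()).hexdigest() for u in uniques]
    return [unique_hashes[c] for c in codes]
```

With all-unique data, as in the benchmarks above, the factorize step is pure overhead, which is consistent with `categorize=False` being faster here.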
Comment From: mrocklin
@jorisvandenbossche (or anyone else in arrow) do you know if arrow has nice elementwise hashing operations? I did a quick search of the API docs and couldn't find anything.
Comment From: jorisvandenbossche
If I understand correctly, what would be needed here is https://issues.apache.org/jira/browse/ARROW-8991? (and there is an open PR for it: https://github.com/apache/arrow/pull/13487)
Comment From: drin
Just coming back around to working on that PR. It is almost ready; I'm just trying to add some extra test coverage. I am not sure it will make it into 10.0.0, but if I can finish today it has a chance.
Comment From: jrbourbeau
Thanks @drin!
Comment From: jbrockmendel
We now have `EA._hash_pandas_object`, so we can do an arrow-specific implementation in `ArrowExtensionArray`.
Comment From: drin
I had hit a snag for ARROW-8991 and have been kicking that can down the road for a bit too long. I'm hoping to get back to it soon.

For clarification, @jbrockmendel: if Arrow implements a way of doing hashing, would `EA._hash_pandas_object` delegate to that functionality for any pandas element (in a Series, DataFrame, etc.) that is an arrow object (an element in an Arrow Array, for example)?
Comment From: jbrockmendel
If arrow/pyarrow implements something, we'd update `ArrowExtensionArray` to use it, which should handle any case where you have a Series/Index or DataFrame column backed by a pyarrow array.
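If such a kernel lands in pyarrow, the override described above might look roughly like the following sketch (pseudocode; `pc.hash64` is a hypothetical compute-kernel name, not an existing pyarrow function):

```
class ArrowExtensionArray(...):
    def _hash_pandas_object(self, *, encoding, hash_key, categorize):
        # Hypothetical: call an element-wise pyarrow hash kernel directly,
        # skipping the conversion to a numpy object array that the current
        # path goes through.
        hashed = pc.hash64(self._pa_array)  # pc.hash64 does not exist yet
        return hashed.to_numpy()
```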