Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this issue exists on the latest version of pandas.

  • [ ] I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

When investigating Dask's shuffle performance on string[pyarrow] data, I observed that pd.util.hash_pandas_object was slightly slower on string[pyarrow] data than on regular object-dtype data. This surprised me, as I would have expected hashing pyarrow-backed data to be faster than hashing Python objects. (A small profiling sketch follows the timings below.)

In [1]: import pandas as pd

In [2]: s = pd.Series(range(2_000))

In [3]: s_object = s.astype(object)

In [4]: s_pyarrow = s.astype("string[pyarrow]")

In [5]: %timeit pd.util.hash_pandas_object(s_object)
859 µs ± 11.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [6]: %timeit pd.util.hash_pandas_object(s_pyarrow)
1.01 ms ± 29.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
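
As a quick diagnostic (my addition, not part of the original report), profiling a single call should show whether the time is dominated by materializing the pyarrow-backed values as a numpy object array before the actual hashing:

import cProfile

# Profile one call; look for the conversion of the pyarrow-backed values to a
# numpy object array in the cumulative-time output.
cProfile.run("pd.util.hash_pandas_object(s_pyarrow)", sort="cumtime")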

Installed Versions

INSTALLED VERSIONS
------------------
commit           : 87cfe4e38bafe7300a6003a1d18bd80f3f77c763
python           : 3.10.6.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 21.6.0
Version          : Darwin Kernel Version 21.6.0: Wed Aug 10 14:25:27 PDT 2022; root:xnu-8020.141.5~2/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.5.0
numpy            : 1.23.3
pytz             : 2022.4
dateutil         : 2.8.2
setuptools       : 65.4.1
pip              : 22.2.2
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : 8.5.0
pandas_datareader: None
bs4              : None
bottleneck       : None
brotli           : None
fastparquet      : None
fsspec           : None
gcsfs            : None
matplotlib       : None
numba            : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 9.0.0
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : None
snappy           : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
zstandard        : None
tzdata           : None

Prior Performance

No response

Comment From: rhshadrach

I'm seeing the same behavior on 1.4.x:

503 µs ± 2.47 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
594 µs ± 496 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Faster on my machine, but same ratio (~1.18).

Comment From: mroeschke

Just noting that the string[python] dtype is comparable to string[pyarrow], though pyarrow is still slightly slower. Hashing is still done in terms of numpy arrays, so I imagine the conversion is adding overhead (a rough check of that cost is sketched after the timings below).

In [6]: s_str = s.astype("string[python]")

In [7]: %timeit pd.util.hash_pandas_object(s_object)
856 µs ± 3.88 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [8]: %timeit pd.util.hash_pandas_object(s_str)
952 µs ± 3.28 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [9]: %timeit pd.util.hash_pandas_object(s_pyarrow)
977 µs ± 6.16 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
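
One way to test that guess (my addition, hedged): time only the conversion of the pyarrow-backed values to a numpy object array, a cost hash_pandas_object likely pays before hashing.

import timeit

import numpy as np

# Conversion cost alone: materialize the string[pyarrow] values as a numpy
# object array, as the numpy-based hashing path has to do.
n = 1_000
secs = timeit.timeit(lambda: np.asarray(s_pyarrow, dtype=object), number=n)
print(f"{secs / n * 1e6:.1f} µs per conversion")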

EDIT: One more interesting observation: categorize=True, the default here, goes through a categorical routine that is supposed to provide a speedup when there are duplicates, of which there are none in this example (a duplicate-heavy variant is sketched after the timings below).

In [8]: %timeit pd.util.hash_pandas_object(s_object, categorize=True)
843 µs ± 3.27 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [9]: %timeit pd.util.hash_pandas_object(s_object, categorize=False)
986 µs ± 2.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [10]: %timeit pd.util.hash_pandas_object(s_pyarrow, categorize=True)
986 µs ± 15.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [11]: %timeit pd.util.hash_pandas_object(s_pyarrow, categorize=False)
515 µs ± 6.39 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
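
To make the trade-off concrete (my addition; the data and expected outcome are illustrative, not measured), a duplicate-heavy series should favor categorize=True, since the categorical path hashes only the unique values and then broadcasts the hashes through the codes:

import timeit

import pandas as pd

# 2,001 values but only 3 uniques, so categorize=True should win here.
s_dup = pd.Series(["a", "b", "c"] * 667, dtype="string[pyarrow]")

for categorize in (True, False):
    secs = timeit.timeit(
        lambda c=categorize: pd.util.hash_pandas_object(s_dup, categorize=c),
        number=1_000,
    )
    print(f"categorize={categorize}: {secs:.3f} s per 1,000 calls")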

Comment From: mrocklin

@jorisvandenbossche (or anyone else in arrow) do you know if arrow has nice elementwise hashing operations? I did a quick search of the API docs and couldn't find anything.

Comment From: jorisvandenbossche

If I understand correctly, what would be needed here is https://issues.apache.org/jira/browse/ARROW-8991? (and there is an open PR for it: https://github.com/apache/arrow/pull/13487)

Comment From: drin

Just coming back around to working on that PR. It is almost ready; I'm just trying to add some extra test coverage. I'm not sure it will make it into 10.0.0, but if I can finish today it has a chance.

Comment From: jrbourbeau

Thanks @drin!

Comment From: jbrockmendel

We now have EA._hash_pandas_object, so we can do an arrow-specific implementation in ArrowExtensionArray.
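
A minimal sketch of what that override might look like (my addition; pc.hash_64 is hypothetical, since pyarrow does not yet ship an elementwise hash kernel, which is what ARROW-8991 and the PR above would add):

import numpy as np
import pyarrow.compute as pc

# Sketch of an ArrowExtensionArray method. "self._pa_array" is the backing
# pyarrow ChunkedArray (the attribute name has varied across pandas versions),
# and pc.hash_64 is HYPOTHETICAL: no such kernel exists in pyarrow today.
def _hash_pandas_object(self, *, encoding, hash_key, categorize):
    hashed = pc.hash_64(self._pa_array)  # hypothetical elementwise hash
    return np.asarray(hashed, dtype=np.uint64)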

Comment From: drin

I had hit a snag on ARROW-8991 and have been kicking that can down the road for a bit too long. I'm hoping to get back to it again soon.

For clarification, @jbrockmendel: if Arrow implements a way of doing hashing, would EA._hash_pandas_object delegate to that functionality for any pandas element (in a Series, DataFrame, etc.) that is an arrow object (an element of an Arrow Array, for example)?

Comment From: jbrockmendel

If arrow/pyarrow implement something, we'd update ArrowExtensionArray to use that, which should handle any case where you have a Series/Index or DataFrame column backed by a pyarrow array.