import numpy as np
import pandas as pd
import pandas._libs.join as libjoin

left = np.random.randint(0, 10**8, size=10**8)
right = np.random.randint(0, 10**8, size=10**8)

left.sort()
right.sort()

# The current usage
In [28]: %timeit result = libjoin.inner_join_indexer(left, right)
2.89 s ± 60 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# An alternative that _may_ be parallelizable
lloc = len(left) // 2
rloc = right.searchsorted(left[lloc])  # 734 ns ± 16.6 ns

chunk1 = libjoin.inner_join_indexer(left[:lloc], right[:rloc])
chunk2 = libjoin.inner_join_indexer(left[lloc:], right[rloc:])
# chunk2's indexers are relative to the sliced arrays, so offset them
# (the joined values at i=0 need no offset) before concatenating.
offsets = (0, lloc, rloc)
result = tuple(np.r_[chunk1[i], chunk2[i] + offsets[i]] for i in range(3))
# totals 3.9 s ± 209 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The chunk1/chunk2 calls are each 1.47 s ± 40 ms and the concat is 1.05 s ± 179 ms. In a pyarrow or otherwise-chunked context the concat might not be needed.
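The split-and-merge pattern can be sketched with threads. This is only an illustration, not the pandas implementation: `np.intersect1d(..., return_indices=True)` stands in for `libjoin.inner_join_indexer` (it assumes sorted, unique values), and small arrays are used so it runs quickly.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def inner_join(left, right):
    # Stand-in for libjoin.inner_join_indexer: returns
    # (joined_values, left_indexer, right_indexer) for sorted arrays
    # of unique values.
    return np.intersect1d(left, right, assume_unique=True, return_indices=True)

rng = np.random.default_rng(42)
left = np.sort(rng.choice(10_000, size=1_000, replace=False))
right = np.sort(rng.choice(10_000, size=1_000, replace=False))

# Split both arrays at the same *value* boundary so no match
# straddles the cut.
lloc = len(left) // 2
rloc = right.searchsorted(left[lloc])

# If inner_join_indexer released the GIL, these two calls could
# genuinely overlap; with this NumPy stand-in they mostly serialize.
with ThreadPoolExecutor(max_workers=2) as pool:
    fut1 = pool.submit(inner_join, left[:lloc], right[:rloc])
    fut2 = pool.submit(inner_join, left[lloc:], right[rloc:])
chunk1, chunk2 = fut1.result(), fut2.result()

# chunk2's indexers are relative to the slices, so offset them
# before concatenating.
offsets = (0, lloc, rloc)
result = tuple(np.r_[chunk1[i], chunk2[i] + offsets[i]] for i in range(3))
```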

For the non-object-dtype case, I'm pretty sure we could release the GIL in inner_join_indexer. Could we then do the chunk1/chunk2 calls in parallel? If so, could we split it further? cc @WillAyd?

Comment From: WillAyd

Do we know that our current GIL-releasing usage actually improves performance? AFAIK releasing the GIL allows for the use of multiple threads, but I'm not sure that helps most of our calculations, which are single-threaded and CPU-bound.

Comment From: jbrockmendel

I've wondered this myself

Comment From: WillAyd

Without opening a can of worms: if you wanted concurrency here, I think you could map the two chunks to different processes and communicate the results back to the parent process. Of course that makes the code more complicated, and there's a break-even point below which it may not be worth it, but that is my best guess on how to tackle it.

Comment From: TomAugspurger

https://github.com/jcrist/ptime is handy for checking if stuff is being hampered by the GIL.

> but I'm not sure that helps most of our calculations, which are single-threaded and CPU-bound.

I'm not aware of anywhere in pandas that's currently multi-threaded, but I haven't followed things closely in a bit. GIL-releasing is crucial for libraries that use pandas in parallel, like Dask.

> if you wanted concurrency here I think you could map the two chunks to different processes and communicate the result back to the parent process.

Just to confirm: we're probably talking about threads here, not OS-level processes.

Is there a larger discussion around parallelizing things within pandas itself, rather than deferring to libraries like Dask? It is indeed a can of worms.

Comment From: WillAyd

Nice, thanks for the insight @TomAugspurger. Yeah, sorry, I meant to say parallelism, not concurrency.

Comment From: jbrockmendel

> Is there a larger discussion around parallelizing things within pandas itself, rather than deferring to libraries like Dask? It is indeed a can of worms.

Not really. I've been looking into what we can add to EAs to make it more appealing for dask/modin/cudf/etc to implement EAs. That's what got me looking at this chunk of code.

Comment From: WillAyd

Interesting conversation though - so do we think the current GIL-releasing stuff we have in pandas definitely helps Dask downstream? Is that tested / benchmarked there?

There are definitely some cases where we have a strange mix of Tempita / fused types that I think traces back to a lack of support for conditional GIL releasing for object types. If someone took it up as a passion project, we might be able to revisit that and simplify things a good deal.

Comment From: jbrockmendel

IIRC we got rid of most of the Tempita. Most of what's left are cases where there were two separate dtyped arguments that weren't necessarily the same dtype, e.g. take_1d_{{in_dtype}}_{{out_dtype}}. If there's a nice way to move that to fused types, I'd be all for it.

The conditional GIL handling, I think, is mostly already using fused types; there will be a wave of additional cleanup once Cython 3 is released.

Comment From: lithomas1

xref https://github.com/pandas-dev/pandas/issues/43313 for I/O.

We should really start thinking about a keyword to control parallelism in pandas. I believe Arrow operations are parallel by default.

For Cython, I wouldn't mind if we used Cython's OpenMP capabilities, which seems to be as simple as using prange instead of range (though it would make my life a lot harder with packaging pandas). Personally, I think it's about time we added some sort of parallel capability to pandas (not the fancy out-of-core stuff etc. that dask does, though).

Comment From: jbrockmendel

> Personally, I think it's about time we added some sort of parallel capability to pandas (not the fancy out-of-core stuff etc. that dask does, though).

+1. Aside from prange you mentioned, I don't have a clear picture of what is available.

Comment From: jreback

there is an effort to standardize the parallel user-facing APIs across numpy, scikit-learn, and pandas (this is NASA-funded; @MarcoGorelli can hopefully point to the docs that exist)

so certainly experiment - but we should tread carefully here

generally +1 on internal parallelism where we can and where it's appropriate

Comment From: jbrockmendel

> would make my life a lot harder with packaging pandas

@lithomas1 here's your first request for help on this topic: trying to add prange to libalgos.nancorr, I added "-fopenmp" to extra_compile_args and extra_link_args in setup.py, and I'm getting clang: error: unsupported option '-fopenmp' at build time on Mac. Google suggested brew install llvm libomp, but all I managed to do was mess up my dev environment. Suggestions?

Comment From: lithomas1

Install the conda-forge compilers package. Does that help?

I don't think conda-forge/brew play nicely with each other.

Comment From: WillAyd

@jbrockmendel macOS confusingly aliases gcc to clang, so you may think you are running the former when you are actually running the latter. It looks like clang needs a different command-line argument to make that work:

https://stackoverflow.com/a/33400688/621736
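The upshot of that answer can be sketched as a platform-conditional helper for setup.py. This is a hypothetical sketch: the macOS flags assume Apple clang with Homebrew's libomp (brew install libomp), while plain GCC accepts -fopenmp for both compiling and linking.

```python
import sys

def openmp_flags(platform: str = sys.platform):
    # Hypothetical helper: pick OpenMP compile/link flags per platform.
    if platform == "darwin":
        # Apple clang rejects -fopenmp directly; it needs the
        # preprocessor form plus an explicit link against libomp.
        return ["-Xpreprocessor", "-fopenmp"], ["-lomp"]
    # GCC (and non-Apple clang builds with OpenMP support).
    return ["-fopenmp"], ["-fopenmp"]

extra_compile_args, extra_link_args = openmp_flags()
```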

Comment From: jbrockmendel

Is there a chance that with numba this might Just Work?
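For what it's worth, numba does expose a prange and ships its own threading layer, so the OpenMP packaging question largely goes away. A hedged sketch (it assumes numba is installed; if not, it falls back to plain Python so the snippet still runs, just without parallelism):

```python
import numpy as np

# Fall back to no-op decorators if numba isn't available, so the
# sketch degrades to a serial loop instead of failing to import.
try:
    from numba import njit, prange
except ImportError:
    def njit(*args, **kwargs):
        def wrap(func):
            return func
        return wrap
    prange = range

@njit(parallel=True, nogil=True)
def row_sums(arr):
    out = np.empty(arr.shape[0])
    # prange splits the outer loop across threads when compiled
    # with parallel=True; otherwise it behaves like range.
    for i in prange(arr.shape[0]):
        out[i] = arr[i].sum()
    return out

arr = np.arange(12.0).reshape(4, 3)
res = row_sums(arr)
```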

Comment From: MarcoGorelli

> there is an effort to standardize the parallel user-facing APIs across numpy, scikit-learn, and pandas (this is NASA-funded; @MarcoGorelli can hopefully point to the docs that exist)

yup, here you go: https://thomasjpfan.github.io/parallelism-python-libraries-design/