Doing this in 2.0 would open up options in 2.x. Off the top of my head:
1) Poking at the cython pyarrow API
2) Transitioning some of our internal EAs to be pyarrow-backed
Comment From: lithomas1
I am -1 on this for now. pyarrow is quite a large dependency, and I would wait until it's more integrated before doing this.
- Poking at the cython pyarrow API
I think this would be pretty cool. However, it would make build/infra work harder (ABI compatibility is tricky). I'm also not sure how I feel about C++ (I guess we already use the C++ stdlib in the rolling/window Cython code?). Ideally, we should be able to use the Python buffer protocol (Cython memoryviews) to operate on pyarrow.
This also might be something we can do without making pyarrow mandatory. IIUC, we just need to compile against the minimum version of pyarrow (if it has ABI compatibility like numpy), but we can make importing the Cython extension module lazy. (So it would be a build dependency, but not a runtime dependency.)
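As a rough sketch (not actual pandas code), the "build dep but not runtime dep" idea could look like this: check for pyarrow without importing it, and only import the compiled extension on first use. `_arrow_algos` here is a hypothetical Cython module compiled against pyarrow.

```python
import importlib.util


def has_pyarrow() -> bool:
    """Check whether pyarrow is installed, without importing it."""
    return importlib.util.find_spec("pyarrow") is not None


_arrow_algos = None  # the compiled extension, loaded lazily on first use


def get_arrow_algos():
    """Import the pyarrow-backed extension only when pyarrow is available."""
    global _arrow_algos
    if _arrow_algos is None:
        if not has_pyarrow():
            raise ImportError("this operation requires pyarrow")
        # Hypothetical module name; importing it would fail at runtime
        # if the wheel was built without pyarrow support.
        import _arrow_algos as mod
        _arrow_algos = mod
    return _arrow_algos
```

Callers would route through `get_arrow_algos()` so that environments without pyarrow never touch the compiled module.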
- Transitioning some of our internal EAs to be pyarrow-backed
It's probably better to live with two implementations (numpy and arrow) for now.
(Maybe off-topic, but it might be nice to have nullable by default before then, since all of pyarrow's dtypes are nullable, I think.)
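To illustrate the nullability point (a sketch assuming pandas is installed): the default numpy-backed integer dtype cannot represent missing values, while the nullable extension dtypes can.

```python
import pandas as pd

# Default numpy-backed integers cannot hold a missing value without
# silently upcasting to float64.
s_numpy = pd.Series([1, 2, None])

# The nullable masked dtype keeps the integer type and stores a real NA.
s_masked = pd.Series([1, 2, None], dtype="Int64")
# With pyarrow installed, dtype="int64[pyarrow]" behaves similarly,
# since every arrow dtype carries a validity bitmap.
```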
Also cc @jorisvandenbossche
Comment From: jorisvandenbossche
Yes, I also don't think that makes sense at this point. Making pyarrow required would certainly make sense if we were considering using arrow memory by default, but as long as that work is experimental, I think the dependency can also be optional?
(and I am all for using pyarrow more and more in pandas, but I think all that work can happen while pyarrow being optional?)
In addition, I think pyarrow should also first figure out how it can be distributed in separate smaller packages (to counteract the "pyarrow is quite a large dependency") before we take it as a required dependency.
Poking at the cython pyarrow API
There are different aspects here, I think, and that is 1) using arrow data in our cython code (like we interact with numpy arrays) and 2) using the actual pyarrow cython APIs (`cimport` from pyarrow).
The second is quite a complication for packaging (wheels). For example, arrow doesn't guarantee ABI stability the way numpy does, so typically you have to pin to a specific version (I think this is required, but I have never been involved in a package that does this myself). Depending on whether you link against arrow-cpp parts, you might also need to bundle those libs in the wheel as well (e.g. the snowflake python connector currently does this).
However, depending on what the goal would be, one doesn't necessarily need the cython APIs. To be able to work with arrow data in cython, there are other options:
- Use the Python buffer protocol to view those as memoryviews, and thus pass the raw data from Python to our Cython algorithms. Currently pyarrow exposes the buffers as generic byte arrays, so you need to reinterpret them as the correct data type, e.g.:

  ```python
  arr = pa.array([1, 2, 3], type="int32")
  view = np.array(arr.buffers()[1]).view(np.int32)
  ```

  Of course, that means we have to take care of handling missing values ourselves (but I think that is generally expected for our cython algorithms; it's the same for our own masked arrays).
- Use the Arrow C data interface to access the data at the C / Cython level (https://arrow.apache.org/docs/dev/format/CDataInterface.html). There are helpers in development to work with this more easily (https://github.com/apache/arrow-nanoarrow), and this allows you to consume and produce arrow data without depending on the arrow C++ / pyarrow library.
Comment From: jbrockmendel
These seem like good reasons to hold off, particularly the packaging/ABI thing.
Comment From: WillAyd
Couldn't we just use arrow as a build-time dependency, and then do runtime checks that it is installed before calling its Cython functions?
Comment From: jorisvandenbossche
As mentioned above, (py)arrow doesn't guarantee a stable C ABI, so I am not sure that's feasible without pinning to an exact version (which I don't think we want to do). To be clear, I have never tried this, so I am honestly not sure how exactly it would work for the pyarrow cython APIs.
Further, it's not fully clear to me what exactly we would want to do with pyarrow in cython. If it's "just" a matter of accessing the data (the buffers), you can unpack those without requiring pyarrow in cython (either by unpacking them in Python using pyarrow first, or by using the C data interface; see the two bullet points at the end of my message above).
Comment From: WillAyd
Yea, understood on the stability, though that's going to be a bit of a chicken-or-the-egg thing too. We can wait for arrow to decide what should be stable, or we can start building against some things knowing the risks, and just pin versions / provide compat as needed.
Definitely agree with your larger point about needing to refine the things we expect to use; I just don't want that research to be dissuaded by some of the other issues listed, which, while not ideal, also aren't technically impossible.
Comment From: jbrockmendel
In case I was unclear in the OP: I said "poking" at cython usage because I have no idea what is actually there, or whether there is anything we would actually use. I would hope there are cool things available/coming, but I have nothing specific in mind.
Comment From: jbrockmendel
I guess if we do start implementing e.g. an ArrowIndexingEngine, it would be nice to have tight type declarations.
Comment From: jbrockmendel
The main reason I am looking forward to making pyarrow required is that there are a bunch of issues that become much easier to solve with pyarrow dtypes, e.g. #22720.
Comment From: jbrockmendel
Closing in favor of a more targeted issue to be opened by @phofl