Feature Type

  • [X] Adding new functionality to pandas

  • [ ] Changing existing functionality in pandas

  • [ ] Removing existing functionality in pandas

Problem Description

I'm currently using df.str.split(....) in pandas. I tried to convert my df to a pyarrow dtype, but this functionality is missing. I get this error: NotImplementedError: str.split not supported with pd.ArrowDtype(pa.string()).

Feature Description

Add support for str.split to the pyarrow string dtype.

Alternative Solutions

The alternatives would be to use .apply, which is not what I want to do, or to stick with the classic numpy dtype.

Additional Context

No response

Comment From: pstorozenko

I wanted to run some benchmarks and I landed on the same issue.

I see there are a number of functions that are not yet implemented for pd.ArrowDtype(pa.string()) strings, although at least some of them seem to be implementable using pyarrow.compute functions.

Has this been left out for a reason, or does it just require someone to do the coding?

Comment From: jgarba

Take

Comment From: mroeschke

Thanks for the report. The str methods that were not implemented generally fell into 2 groups:

  1. Do not have an efficient, equivalent pyarrow compute function
  2. Have tricky return types that require some gymnastics to integrate cleanly, e.g. split should ideally return pa.list_(pa.string()), but the internals make that tricky.

We most definitely want these implemented eventually, hence the NotImplementedError.