Pandas BUG: .str.contains() regex lookbehind and lookahead fail for data type string[pyarrow]

Pandas version checks

[x] I have checked that this issue has not already been reported.
[x] I have confirmed this bug exists on the latest version of pandas.
[x] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

df = pd.DataFrame({"c": ["a f", "c d", "e f g", "hij"]})

# Raw strings avoid the invalid-escape warning for "\s"; the pattern itself is unchanged.
print(df["c"].astype(str).str.contains(r"(?<=\s)f(?=\s)"))
print(df["c"].astype("string[python]").str.contains(r"(?<=\s)f(?=\s)"))
print(df["c"].astype("string[pyarrow]").str.contains(r"(?<=\s)f(?=\s)"))

Issue Description

.str.contains() regex lookbehind and lookahead fail for data type string[pyarrow].

Expected Behavior

I believe string[pyarrow] should produce the same results as string[python] and str.

Installed Versions

INSTALLED VERSIONS
------------------
commit : 0691c5cf90477d3503834d983f69350f250a6ff7
python : 3.10.16
python-bits : 64
OS : Linux
OS-release : 6.8.0-1020-azure
Version : #23-Ubuntu SMP Mon Dec 9 16:58:58 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.3
numpy : 2.2.2
pytz : 2024.2
dateutil : 2.9.0.post0
pip : 24.2
Cython : None
sphinx : None
IPython : 8.31.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
blosc : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.12.0
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : 3.1.5
lxml.etree : None
matplotlib : 3.10.0
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.5
pandas_gbq : None
psycopg2 : None
pymysql : None
pyarrow : 19.0.0
pyreadstat : None
pytest : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 2.0.1
xlsxwriter : None
zstandard : None
tzdata : 2024.2
qtpy : None
pyqt5 : None

Comment From: rhshadrach

Thanks for the report! This also fails when infer_string=True and you have PyArrow, so tagging as 2.3.

cc @jorisvandenbossche @WillAyd

Comment From: WillAyd

The Arrow library does not use the Python standard library for regular expressions, but rather re2. It looks like re2 does not support lookaheads or lookbehinds, and doesn't really plan on doing so given their time complexity.

See https://github.com/apache/arrow/issues/40220 and the re2 documentation https://github.com/google/re2/wiki/Syntax

Comment From: rhshadrach

Thanks @WillAyd - I think the conclusion is therefore that pandas needs to implement something internally to achieve feature parity, would you agree?

Comment From: WillAyd

This is probably something worth documenting in the conversion guide. Something to the effect of "for X list of regular expressions, users are advised to do ser.astype("object").str.contains(...) to force usage of the Python regular expression engine"
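For illustration, the suggested workaround could look like the sketch below. The helper name `contains_python_regex` is hypothetical and not part of pandas; it only wraps the `astype("object")` cast mentioned above so that Python's `re` engine handles the pattern. The example builds a plain Series so it runs without PyArrow; the assumption is that the same cast applies to a `string[pyarrow]` Series.

```python
import pandas as pd


def contains_python_regex(ser: pd.Series, pattern: str) -> pd.Series:
    """Hypothetical helper: evaluate str.contains with Python's re engine.

    Casting to object dtype first means the pattern is not handed to
    Arrow's re2-based implementation.
    """
    return ser.astype("object").str.contains(pattern, regex=True)


# Same data as the reproducible example in the report.
ser = pd.Series(["a f", "c d", "e f g", "hij"])

# Only "e f g" has an "f" with whitespace on both sides.
result = contains_python_regex(ser, r"(?<=\s)f(?=\s)")
print(result.tolist())
```

The cast copies the data into Python objects, so this trades speed for regex feature coverage.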

Comment From: rhshadrach

Thanks @WillAyd - I am personally of the opinion that documenting differences in behavior here is not sufficient, but do not feel strongly so.

Comment From: WillAyd

Looks like we replied at the same time :-)

A fallback to the object dtype internally is also an option. The downside to that is it could be a rather expensive operation to go from Arrow -> Python -> Arrow to compute a result. It might be faster to work with object storage in this very specific case, and by forcing end users down that path it helps signal performance implications.

Not a strong feeling either way though. There are merits to both approaches

Comment From: WillAyd

This is also a good conversation for PDEP-13 https://github.com/pandas-dev/pandas/pull/58455

Comment From: rhshadrach

> Looks like we replied at the same time :-)

I dunno, it appears to me my comment is above yours 😆

Agreed that relying on conversion to object is not great, but my personal preference would still be to do this internally rather than require users to do it on their own. This would be a case where #55385 could be useful.

Comment From: jorisvandenbossche

It should be possible for us to reliably detect if the user is passing a regex using lookbehind and lookahead? (not very familiar here, quick google says that this is with ?! / ?= and ?<! / ?<=, but just looking for those strings might give false positives?)

Comment From: rhshadrach

> but just looking for those strings might give false positives?

Yea - it appears so. I haven't looked at this in detail yet, but the 2nd answer makes me think this might be not so reliable.

https://stackoverflow.com/questions/79423223/tell-if-a-python-regex-string-has-a-lookaround

I think we'd be okay using re._parser in the first answer; don't know what performance that has yet.
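A rough sketch of what detection via the parser could look like, assuming the private `re._parser` module (Python 3.11+) or its older spelling `sre_parse`. The function name `has_lookaround` is made up for illustration, and since this relies on private CPython internals the parse-tree layout may change between versions. It walks the parsed pattern and reports whether any `ASSERT` / `ASSERT_NOT` node (lookahead or lookbehind) appears.

```python
try:
    from re import _parser as sre_parser  # Python 3.11+
except ImportError:
    import sre_parse as sre_parser  # older Python versions


def _subpatterns(value):
    """Yield every SubPattern nested anywhere inside a node's argument."""
    if isinstance(value, sre_parser.SubPattern):
        yield value
    elif isinstance(value, (tuple, list)):
        for item in value:
            yield from _subpatterns(item)


def _walk(items) -> bool:
    """Return True if any (op, arg) node in the tree is a lookaround."""
    for op, arg in items:
        if op is sre_parser.ASSERT or op is sre_parser.ASSERT_NOT:
            return True
        if any(_walk(sub.data) for sub in _subpatterns(arg)):
            return True
    return False


def has_lookaround(pattern: str) -> bool:
    """Hypothetical check: does the regex use lookahead or lookbehind?"""
    return _walk(sre_parser.parse(pattern).data)


print(has_lookaround(r"(?<=\s)f(?=\s)"))  # pattern from the report
print(has_lookaround(r"[(?=]"))           # "(?=" inside a character class
print("(?=" in r"[(?=]")                  # naive substring check
```

The second pattern is the kind of false positive a plain substring search would hit: the text `(?=` is present, but only as members of a character class.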