Pandas version checks
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [x] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
df = pd.DataFrame({"c": ["a f", "c d", "e f g", "hij"]})
print(df["c"].astype(str).str.contains(r"(?<=\s)f(?=\s)"))
print(df["c"].astype("string[python]").str.contains(r"(?<=\s)f(?=\s)"))
print(df["c"].astype("string[pyarrow]").str.contains(r"(?<=\s)f(?=\s)"))
Issue Description
.str.contains() regex lookbehind and lookahead fail for data type string[pyarrow].
Expected Behavior
I believe string[pyarrow] should produce the same results as string[python] and str.
Installed Versions
Comment From: rhshadrach
Thanks for the report! This also fails when infer_string=True and you have PyArrow, so tagging as 2.3.
cc @jorisvandenbossche @WillAyd
Comment From: WillAyd
The Arrow library does not use the Python standard library for regular expressions, but rather re2. It looks like re2 does not support lookaheads or lookbehinds, and doesn't really plan on doing so given their time complexity.
See https://github.com/apache/arrow/issues/40220 and the re2 documentation https://github.com/google/re2/wiki/Syntax
Comment From: rhshadrach
Thanks @WillAyd - I think the conclusion is therefore that pandas needs to implement something internally to achieve feature parity, would you agree?
Comment From: WillAyd
This is probably something worth documenting in the conversion guide. Something to the effect of "for X list of regular expressions, users are advised to do ser.astype("object").str.contains(...) to force usage of the Python regular expression engine".
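For illustration, the suggested workaround could look roughly like the sketch below. The helper name contains_via_object is hypothetical and not part of pandas; it only wraps the astype("object") cast mentioned above, so the pattern is evaluated by Python's re engine regardless of the original string storage.

```python
import pandas as pd


def contains_via_object(ser: pd.Series, pat: str) -> pd.Series:
    """Hypothetical helper (not a pandas API): cast to object dtype first
    so that str.contains uses Python's regular expression engine."""
    return ser.astype("object").str.contains(pat, regex=True)


# Same data and pattern as the reproducible example in the report.
ser = pd.Series(["a f", "c d", "e f g", "hij"])
result = contains_via_object(ser, r"(?<=\s)f(?=\s)")
print(result.tolist())
```

The cast copies the data out of its original storage, which is the performance cost discussed in this thread.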
Comment From: rhshadrach
Thanks @WillAyd - I am personally of the opinion that documenting differences in behavior here is not sufficient, but do not feel strongly so.
Comment From: WillAyd
Looks like we replied at the same time :-)
A fallback to the object dtype internally is also an option. The downside to that is it could be a rather expensive operation to go from Arrow -> Python -> Arrow to compute a result. It might be faster to work with object storage in this very specific case, and by forcing end users down that path it helps signal performance implications.
Not a strong feeling either way though. There are merits to both approaches
Comment From: WillAyd
This is also a good conversation for PDEP-13 https://github.com/pandas-dev/pandas/pull/58455
Comment From: rhshadrach
Looks like we replied at the same time :-)
I dunno, it appears to me my comment is above yours 😆
Agreed relying on conversion to object is not great, my personal preference would still be to do this rather than require users to do it on their own. This would be a case where #55385 could be useful.
Comment From: jorisvandenbossche
It should be possible for us to reliably detect if the user is passing a regex using lookbehind and lookahead? (not very familiar here, quick google says that this is with ?! / ?= and ?<! / ?<=, but just looking for those strings might give false positives?)
Comment From: rhshadrach
but just looking for those strings might give false positives?
Yea - it appears so. I haven't looked at this in detail yet, but the 2nd answer makes me think this might be not so reliable.
https://stackoverflow.com/questions/79423223/tell-if-a-python-regex-string-has-a-lookaround
I think we'd be okay using re._parser in the first answer; don't know what performance that has yet.
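A minimal sketch of what detection via re._parser might look like, assuming it is acceptable to depend on a private CPython module (re._parser on Python 3.11+, sre_parse before that). The function name has_lookaround and the helpers are hypothetical and are not taken from the linked answer or from pandas. Because it walks the parsed pattern rather than searching the raw string, escaped sequences and character classes should not produce false positives.

```python
try:  # Python 3.11+ ships the parser as a private submodule of re
    from re import _parser as sre_parser
except ImportError:  # older Pythons expose it as the top-level sre_parse
    import sre_parse as sre_parser

# Opcodes the parser emits for (?=...), (?<=...) and (?!...), (?<!...).
_LOOKAROUND_OPS = (sre_parser.ASSERT, sre_parser.ASSERT_NOT)


def _nested_subpatterns(av):
    """Yield every SubPattern found inside an opcode argument."""
    if isinstance(av, sre_parser.SubPattern):
        yield av
    elif isinstance(av, (tuple, list)):
        for item in av:
            yield from _nested_subpatterns(item)


def _walk(subpattern):
    """Return True if any (op, arg) item, at any depth, is a lookaround."""
    for op, av in subpattern:
        if op in _LOOKAROUND_OPS:
            return True
        if any(_walk(child) for child in _nested_subpatterns(av)):
            return True
    return False


def has_lookaround(pattern: str) -> bool:
    """Hypothetical check; an invalid pattern raises re.error from parse()."""
    return _walk(sre_parser.parse(pattern))


print(has_lookaround(r"(?<=\s)f(?=\s)"))  # pattern from the report
print(has_lookaround(r"a\?=b"))           # escaped "?" followed by "="
```

The cost of parsing each pattern is not measured here, so the performance question above remains open.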