Pandas version checks
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [x] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas
test_series = pandas.Series(['asdf', 'as'], dtype='string[pyarrow]')
regex = r'((as)|(as))'
regex2 = r'(as)|(as)'
test_series.str.fullmatch(regex)
# False
# True
test_series.str.fullmatch(regex2)
# True  (unexpected: 'asdf' is not a full match)
# True
test_series2 = pandas.Series(['asdf', 'as'], dtype=str)
test_series2.str.fullmatch(regex)
# False
# True
test_series2.str.fullmatch(regex2)
# False
# True
Issue Description
As the example shows, the same regular expression passed to str.fullmatch can give different results for the str dtype and the string[pyarrow] dtype. This seems to stem from Apache Arrow not having a dedicated fullmatch or match function, so the regular expression has to be wrapped with "^" and "$" before being passed to Arrow's search function. The likely cause is the precedence of the "|" operator: alternation binds more loosely than the anchors, so (as)|(as) becomes ^(as)|(as)$, which matches any string that starts with "as" or ends with "as" rather than only the whole string "as". Long story short, at least some regular expressions passed to PyArrow need additional surrounding parentheses to get the same results as Python's fullmatch.
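The effect can be reproduced without pandas or PyArrow. The sketch below is only an illustration: it uses Python's re.search as a stand-in for Arrow's regex search, and naive_anchor and grouped_anchor are hypothetical helpers written for this comment, not functions from either library. It assumes the pyarrow code path does something equivalent to adding "^" and "$" around the pattern.

```python
import re


def naive_anchor(pattern: str) -> str:
    # Hypothetical helper: add anchors without grouping the pattern.
    return f"^{pattern}$"


def grouped_anchor(pattern: str) -> str:
    # Hypothetical helper: wrap the pattern in a non-capturing group first,
    # so the anchors apply to the whole alternation.
    return f"^(?:{pattern})$"


regex2 = r"(as)|(as)"
values = ["asdf", "as"]

# "^(as)|(as)$" parses as "(^(as))|((as)$)", so 'asdf' matches the first branch.
naive = [re.search(naive_anchor(regex2), v) is not None for v in values]

# "^(?:(as)|(as))$" requires the whole string to match one of the branches.
grouped = [re.search(grouped_anchor(regex2), v) is not None for v in values]

# Python's own fullmatch, for reference.
reference = [re.fullmatch(regex2, v) is not None for v in values]

print(naive)      # [True, True]
print(grouped)    # [False, True]
print(reference)  # [False, True]
```

With the grouped form the search-based result agrees with re.fullmatch for this pattern, which is consistent with regex (the parenthesized version) already behaving correctly in the example.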
I have submitted #61073 to try and address this.
Expected Behavior
The second set of fullmatch results in the example (test_series2, with the str dtype) shows the expected behavior. The str.fullmatch method should return the same results for either dtype.
Installed Versions
Comment From: Pedro-Santos04
take
Comment From: ak47me
take