Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
>>> a = pd.Series(["abc"], dtype="string[pyarrow]")
>>> a.str.contains("B").dtype
BooleanDtype
Issue Description
A pandas extension array (EA) dtype, BooleanDtype, is returned from an Arrow-backed string series.
Expected Behavior
I expect an arrow type to be returned. There is a warning present:
But I thought that once data is in arrow it should stay in arrow. So will this eventually use arrow code or will it leverage the python re module?
Installed Versions
Comment From: mroeschke
Thanks for the report. "string[pyarrow]", which corresponds with StringDtype("pyarrow"), still returns masked numpy dtypes like BooleanDtype for all methods, including contains.
If you use pd.ArrowDtype(pa.string()), these will return the Arrow bool type:
In [6]: a = pd.Series(["abc"], dtype=pd.ArrowDtype(pa.string()))
In [7]: a.str.contains("B")
Out[7]:
0 False
dtype: bool[pyarrow]
Comment From: rohanjain101
@mroeschke What is the difference between "string[pyarrow]" and pd.ArrowDtype(pa.string())? I thought they were two equivalent ways of specifying an arrow datatype. I also wanted to mention that regex flags are not implemented yet with the arrow backend:
>>> a = pd.Series(["abc"], dtype=pd.ArrowDtype(pa.string()))
>>> a.str.contains("B", flags=re.IGNORECASE)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\ps_0310_is\lib\site-packages\pandas\core\strings\accessor.py", line 128, in wrapper
return func(self, *args, **kwargs)
File "C:\ps_0310_is\lib\site-packages\pandas\core\strings\accessor.py", line 1249, in contains
result = self._data.array._str_contains(pat, case, flags, na, regex)
File "C:\ps_0310_is\lib\site-packages\pandas\core\arrays\arrow\array.py", line 1541, in _str_contains
raise NotImplementedError(f"contains not implemented with {flags=}")
NotImplementedError: contains not implemented with flags=re.IGNORECASE
>>>
Since my understanding is that arrow uses a different regex engine than the python re module, will arrow datatype regex support the same regex flags that work with the string pandas ea type?
Comment From: mroeschke
pd.ArrowDtype(pa.string()), which was first supported in 1.5, does not have a string alias since the first implementation, pd.StringDtype("pyarrow"), claimed the "string[pyarrow]" alias in 1.2.
pd.StringDtype("pyarrow") has different method implementations than pd.ArrowDtype(pa.string()). The former will generally fall back to numpy-based operations if the arrow operations fail or are not supported, while the latter will just error.
will arrow datatype regex support the same regex flags that work with the string pandas ea type
This will first depend on whether pyarrow will support this functionality in the future. If not, then there will need to be a custom implementation in pandas.
Comment From: phofl
Should we close this @mroeschke? I guess there is nothing to do now
Comment From: mroeschke
Yeah, I'm not sure if ArrowStringArray methods should change their return dtype at this point, especially if we start encouraging users to use ArrowExtensionArray. Since we improved the documentation, I think we can close.