Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas
pandas.read_csv("", sep=r"\s+", engine="pyarrow")
Issue Description
This fails with the following error:
ValueError: the 'pyarrow' engine does not support regex separators
(separators > 1 char and different from '\s+' are interpreted as regex)
Expected Behavior
I'm not sure if pyarrow is meant to support \s+. If pyarrow supports it, then this should not fail. If pyarrow does not support it, then I believe the error message should be modified to reflect this, since it currently seems to imply that \s+ is not interpreted as a regex and that pyarrow should therefore support it.
Update: I looked in the main branch and it seems that pyarrow does not support \s+, so changing the error message should be enough.
Installed Versions
INSTALLED VERSIONS
------------------
commit : 478d340667831908b5b4bf09a2787a11a14560c9
python : 3.11.0.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-69-generic
Version : #76~20.04.1-Ubuntu SMP Mon Mar 20 15:54:19 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.0.0
numpy : 1.24.2
pytz : 2023.2
dateutil : 2.8.2
setuptools : 67.6.0
pip : 23.0.1
Cython : None
pytest : 7.2.2
hypothesis : None
sphinx : 6.1.3
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.12.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli :
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None
Comment From: lithomas1
Thanks for the report. This is probably due to erroneously sharing that line of code with the C parser.
If you're interested in making a PR, we can probably add a check for pyarrow inside the elif block here: https://github.com/pandas-dev/pandas/blob/e9e034badbd2e36379958b93b611bf666c7ab360/pandas/io/parsers/readers.py#L1514
Otherwise, I'll hopefully take care of this in a week or so.
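For illustration only, here is a rough standalone sketch of what an engine-specific check could look like. It is not the pandas implementation: the helper name `regex_sep_fallback_reason` and the reworded pyarrow message are hypothetical, and it assumes that only the C parser special-cases \s+ as a whitespace delimiter.

```python
from typing import Optional


def regex_sep_fallback_reason(sep: Optional[str], engine: str) -> Optional[str]:
    """Return why `engine` cannot handle `sep`, or None if it can.

    Hypothetical helper for illustration; not part of pandas.
    """
    # Missing or single-character separators are never treated as regex here.
    if sep is None or len(sep) <= 1:
        return None
    # Assumption: the Python engines accept regex separators.
    if engine in ("python", "python-fwf"):
        return None
    if engine == "c":
        # Assumption: the C parser handles \s+ as a whitespace delimiter.
        if sep == r"\s+":
            return None
        return (
            "the 'c' engine does not support regex separators "
            r"(separators > 1 char and different from '\s+' "
            "are interpreted as regex)"
        )
    # Any other engine (e.g. pyarrow): do not mention the \s+ exception,
    # since it does not apply to that engine.
    return (
        f"the '{engine}' engine does not support regex separators "
        "(separators > 1 char are interpreted as regex)"
    )


print(regex_sep_fallback_reason(r"\s+", "pyarrow"))
```

With this split, the pyarrow message no longer suggests that \s+ is exempt, while the C parser message keeps its current wording.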