Pandas BUG: Chained urls (allowed by fsspec) not recognized by is_fsspec_url

Pandas version checks

[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[X] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
from pandas.io.common import is_fsspec_url
import fsspec

# This fsspec trick chains a cache fs with an s3 file system using "::".
# See https://filesystem-spec.readthedocs.io/en/latest/features.html#url-chaining.
fn = "filecache::s3://ncei-wcsd-archive/data/processed/SH1305/18kHz/SaKe2013-D20130523-T080854_to_SaKe2013-D20130523-T085643.csv"

with fsspec.open(fn, storage_options={"s3": {"anon": True}}) as f:
    foo = pd.read_csv(f)

print(foo.shape)

# But this url isn't recognized as a valid fsspec url by pandas...
print(pd.io.common.is_fsspec_url(fn))

# ...and so attempts to use chained url directly fail:
bar = pd.read_csv("filecache::s3://ncei-wcsd-archive/data/processed/SH1305/18kHz/"
                 "SaKe2013-D20130523-T080854_to_SaKe2013-D20130523-T085643.csv",
                 storage_options={"s3": {"anon": True},
                                  "filecache":{"cache_storage":"/tmp/cache"}})

Issue Description

fsspec allows one to pass "chained" urls (useful, e.g., for caching), for example "filecache::https://example.com/my_file". However, since commit eeff2b0c3b, which was meant to address issue #36271, supplying urls of this form to, e.g., pd.read_csv has failed.

Expected Behavior

Pandas should pass the chained url to fsspec. In the example code, the DataFrames foo and bar should be identical.

Installed Versions

INSTALLED VERSIONS
------------------
commit            : 87cfe4e38bafe7300a6003a1d18bd80f3f77c763
python            : 3.9.12.final.0
python-bits       : 64
OS                : Linux
OS-release        : 5.10.131-19115-g2e2fb0ed324d
Version           : #1 SMP PREEMPT Mon Sep 12 18:55:51 PDT 2022
machine           : x86_64
processor         :
byteorder         : little
LC_ALL            : None
LANG              : en_US.UTF-8
LOCALE            : en_US.UTF-8

pandas            : 1.5.0
numpy             : 1.23.3
pytz              : 2022.4
dateutil          : 2.8.2
setuptools        : 58.1.0
pip               : 22.2.2
Cython            : None
pytest            : None
hypothesis        : None
sphinx            : None
blosc             : None
feather           : None
xlsxwriter        : None
lxml.etree        : None
html5lib          : None
pymysql           : None
psycopg2          : None
jinja2            : None
IPython           : 8.5.0
pandas_datareader : None
bs4               : None
bottleneck        : None
brotli            : None
fastparquet       : None
fsspec            : 2022.8.2
gcsfs             : None
matplotlib        : None
numba             : None
numexpr           : None
odfpy             : None
openpyxl          : None
pandas_gbq        : None
pyarrow           : None
pyreadstat        : None
pyxlsb            : None
s3fs              : 2022.8.2
scipy             : None
snappy            : None
sqlalchemy        : None
tables            : None
tabulate          : None
xarray            : None
xlrd              : None
xlwt              : None
zstandard         : None
tzdata            : None

Comment From: JMBurley

Can confirm the issue is as stated. The root cause is the pattern match for is_fsspec_url being _RFC_3986_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9+\-+.]*://").

Therefore the double-colon :: causes the regex to fail and fsspec is not called.

Easy to adjust that to allow a double colon, however I am not sure exactly what syntax we are corresponding to by allowing a double colon at the start of a (pseudo)URL, or if there is any unintended behaviour that could result...

I believe that double-colons are not part of RFC_3986, which wouldn't stop us making the change, but we should ensure that the regex pattern is logically named to reflect its function.
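
To make the failure concrete, here is a minimal sketch that applies the pattern quoted above to a plain URL and to a chained one. The bucket and file names are placeholders, not real paths, and the snippet only exercises the regex; it does not call pandas or fsspec.

```python
import re

# Pattern as quoted in the comment above.
_RFC_3986_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9+\-+.]*://")

# Placeholder URLs, used only for illustration.
plain = "s3://my-bucket/my_file.csv"
chained = "filecache::s3://my-bucket/my_file.csv"

# The scheme must be followed directly by "://". In the chained URL the
# first thing after "filecache" is "::", so the match fails.
plain_matches = bool(_RFC_3986_PATTERN.match(plain))
chained_matches = bool(_RFC_3986_PATTERN.match(chained))

print(plain_matches)    # True
print(chained_matches)  # False
```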

Comment From: JMBurley

In theory can fix with a PR that:

modifies regex pattern in pandas/io/common.py
updates test test_read_json_with_url_value with a double-colon case
updates test test_is_fsspec_url with a double-colon case

which should get the fsspec functionality AND ensure we don't have a collision with the prior JSON read issue.
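
One way such a change might look is sketched below. The names _CHAINED_URL_PATTERN and is_chained_fsspec_url are hypothetical, chosen for illustration; this is not the pattern pandas ships, just one regex that accepts zero or more "protocol::" prefixes before the final "scheme://" while staying anchored at the start of the string.

```python
import re

# Hypothetical pattern (an assumption, not pandas' code): any number of
# "protocol::" prefixes, then a final "scheme://", anchored at the start.
_CHAINED_URL_PATTERN = re.compile(
    r"^(?:[A-Za-z][A-Za-z0-9+\-.]*::)*[A-Za-z][A-Za-z0-9+\-.]*://"
)


def is_chained_fsspec_url(url):
    """Hypothetical variant of is_fsspec_url that also accepts chained URLs."""
    return (
        isinstance(url, str)
        and bool(_CHAINED_URL_PATTERN.match(url))
        # Plain http(s) URLs stay excluded, as in the thread.
        and not url.startswith(("http://", "https://"))
    )


# Placeholder inputs mirroring the two test cases proposed above.
cases = [
    "s3://my-bucket/my_file.csv",
    "filecache::s3://my-bucket/my_file.csv",
    "filecache::https://example.com/my_file",
    "https://example.com/my_file",
    '{"url": "filecache::s3://my-bucket/my_file.csv"}',
]
for case in cases:
    print(case, is_chained_fsspec_url(case))
```

Because the pattern is still anchored with ^, a JSON string that merely contains a chained URL does not match, which is the collision the read_json test is meant to guard against. Note that under this sketch "filecache::https://..." counts as an fsspec URL, since only strings that start with http:// or https:// are excluded.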

If we want to action it, I can take this PR as I recently investigated the fsspec handoff conditions as part of another PR and am therefore pretty familiar with (some of) the relevant code.

Comment From: ligon

Thanks for tackling this!

It may not matter for this case, but the double colons do show up in RFC 3986 for the case of elisions in IPv6 addresses. I suppose one would like to be able to handle the case of "filecache::s3://2607:f8b0:4005:811::200e/foo", though I don't know whether fsspec itself does this correctly or not.

Comment From: martindurant

I am coming very late to this, and believe that chained fsspec URLs would be very handy to permit. The specific reason the URL pattern was tightened up had to do with URLs embedded in JSON strings passed to read_json, where the URL always comes after both a " and a { or [.
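
To illustrate the JSON point, the sketch below compares a loose substring check with the anchored pattern quoted earlier in the thread. The loose check is a hypothetical stand-in for "any string containing ://" and is not claimed to be pandas' earlier code; the JSON payload is a placeholder.

```python
import re

# Anchored pattern as quoted earlier in the thread.
_RFC_3986_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9+\-+.]*://")

# Placeholder JSON text whose value happens to be a URL.
json_text = '{"url": "s3://my-bucket/my_file.json"}'

# Hypothetical loose check: treats any string containing "://" as a URL.
loose_says_url = "://" in json_text

# Anchored check: the string starts with "{", so it cannot match.
anchored_says_url = bool(_RFC_3986_PATTERN.match(json_text))

print(loose_says_url)     # True
print(anchored_says_url)  # False
```

Any pattern that keeps the start-of-string anchor preserves this behaviour, so allowing "::" between protocols need not reopen the JSON problem.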

If fsspec makes a canonical "this is what our URLs should look like" function or regex, could it be used here?

Also, http(s) continues to be excluded from fsspec handling, but with compression="infer", we do guess compressors for gzip etc. It could simplify pandas' code, but that stuff is very stable, so this is hardly urgent.

Comment From: ynouri

Not sure if there's a cleaner workaround, but here's a quick hack I used to load a parquet file using filecache::

import pandas as pd

# Patch Pandas bug https://github.com/pandas-dev/pandas/issues/48978
pd.io.common.is_fsspec_url = lambda x: True

df = pd.read_parquet("filecache::s3://my-bucket/my_file.parquet")

(This works with pandas==1.5.0)

Comment From: schmidt-ai

I'm also hitting this. Another workaround could be:

import pandas as pd
import fsspec

with fsspec.open("filecache::s3://my-bucket/my_file.parquet") as f:
    df = pd.read_parquet(f)