Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
from pandas.io.common import is_fsspec_url
import fsspec
# This fsspec trick chains a cache fs with an s3 file system using "::".
# See https://filesystem-spec.readthedocs.io/en/latest/features.html#url-chaining.
fn = "filecache::s3://ncei-wcsd-archive/data/processed/SH1305/18kHz/SaKe2013-D20130523-T080854_to_SaKe2013-D20130523-T085643.csv"
with fsspec.open(fn, storage_options={"s3": {"anon": True}}) as f:
    foo = pd.read_csv(f)
print(foo.shape)
# But this url isn't recognized as a valid fsspec url by pandas...
print(pd.io.common.is_fsspec_url(fn))
# ...and so attempts to use chained url directly fail:
bar = pd.read_csv(
    "filecache::s3://ncei-wcsd-archive/data/processed/SH1305/18kHz/"
    "SaKe2013-D20130523-T080854_to_SaKe2013-D20130523-T085643.csv",
    storage_options={"s3": {"anon": True},
                     "filecache": {"cache_storage": "/tmp/cache"}},
)
Issue Description
fsspec allows one to pass "chained" URLs (useful, e.g., for caching), for example "filecache::https://example.com/my_file". However, since commit eeff2b0c3b, which was meant to address issue #36271, supplying URLs of this form to, e.g., pd.read_csv has failed.
Expected Behavior
pandas should pass the chained URL to fsspec. In the example code, the DataFrames foo and bar should be identical.
Installed Versions
Comment From: JMBurley
Can confirm the issue is as stated. The root cause is the pattern match for is_fsspec_url, which is _RFC_3986_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9+\-+.]*://").
Therefore the double-colon :: causes the regex to fail and fsspec is not called.
It would be easy to adjust the pattern to allow a double colon; however, I am not sure exactly what syntax we would be conforming to by allowing a double colon at the start of a (pseudo-)URL, or whether any unintended behaviour could result...
I believe that double-colons are not part of RFC_3986, which wouldn't stop us making the change, but we should ensure that the regex pattern is logically named to reflect its function.
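To make the failure concrete, here is a minimal sketch that applies the pattern quoted above to a plain URL and a chained URL. The pattern is copied from this thread; the bucket and file names are made up for illustration.

```python
import re

# Pattern as quoted in this thread.
_RFC_3986_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9+\-+.]*://")

# Hypothetical URLs, used only to illustrate the two cases.
plain = "s3://my-bucket/my_file.csv"
chained = "filecache::s3://my-bucket/my_file.csv"

# The scheme must be followed directly by "://". In the chained URL the
# scheme "filecache" is followed by "::", so the anchored match fails.
plain_matches = _RFC_3986_PATTERN.match(plain) is not None
chained_matches = _RFC_3986_PATTERN.match(chained) is not None

print(plain_matches, chained_matches)
```

The character class does not include ":", so the match cannot step over the "::" separator to reach the "://" that follows "s3".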
Comment From: JMBurley
In theory this can be fixed with a PR that:
- modifies the regex pattern in pandas/io/common.py
- updates the test test_read_json_with_url_value with a double-colon case
- updates the test test_is_fsspec_url with a double-colon case
which should get the fsspec functionality AND ensure we don't have a collision with the prior JSON read issue.
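As a rough sketch of what such a change might look like (not necessarily the pattern pandas would adopt), the regex could accept zero or more "::scheme" links before the final "://". The names CHAINED_URL_PATTERN and looks_like_fsspec_url are hypothetical, as are the example URLs.

```python
import re

# Hypothetical relaxed pattern: one scheme, then any number of
# "::scheme" links, then "://". Still anchored at the start of the string.
CHAINED_URL_PATTERN = re.compile(
    r"^[A-Za-z][A-Za-z0-9+\-.]*(::[A-Za-z][A-Za-z0-9+\-.]*)*://"
)


def looks_like_fsspec_url(url):
    """Hypothetical helper: True if `url` starts with a (possibly chained) scheme."""
    return isinstance(url, str) and CHAINED_URL_PATTERN.match(url) is not None


# Chained and plain URLs are both accepted.
chained_ok = looks_like_fsspec_url("filecache::s3://my-bucket/my_file.csv")
plain_ok = looks_like_fsspec_url("s3://my-bucket/my_file.csv")

# A JSON string containing a URL value starts with "{", so the anchored
# pattern still rejects it, which is the case the earlier fix targeted.
json_ok = looks_like_fsspec_url('{"url": "s3://my-bucket/my_file.csv"}')

print(chained_ok, plain_ok, json_ok)
```

Because the pattern stays anchored with "^" and each link must begin with a letter, a string that opens with a quote, brace, or bracket cannot match, whatever URLs appear later in it.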
If we want to action it, I can take this PR, as I recently investigated the fsspec handoff conditions as part of another PR and am therefore pretty familiar with (some of) the relevant code.
Comment From: ligon
Thanks for tackling this!
It may not matter for this case, but the double colons do show up in RFC 3986 for the case of elisions in IPv6 addresses. I suppose one would like to be able to handle the case of "filecache::s3://2607:f8b0:4005:811::200e/foo", though I don't know whether fsspec itself does this correctly or not.
-- Ethan Ligon, Professor Agricultural & Resource Economics University of California, Berkeley
Comment From: martindurant
I am coming very late to this, and believe that chained fsspec URLs would be very handy to permit. The specific reason the URL pattern was tightened up had to do with URLs embedded in JSON strings passed to read_json, which therefore always appear after both a " and a { or [.
If fsspec makes a canonical "this is what our URLs should look like" function or regex, could it be used here?
Also, http(s) continues to be excluded from fsspec handling, but with compression="infer", we do guess compressors for gzip etc. It could simplify pandas' code, but that stuff is very stable, so this is hardly urgent.
Comment From: ynouri
Not sure if there's a cleaner workaround, but here's a quick hack I used to load a parquet file using filecache::
import pandas as pd
# Patch Pandas bug https://github.com/pandas-dev/pandas/issues/48978
pd.io.common.is_fsspec_url = lambda x: True
df = pd.read_parquet("filecache::s3://my-bucket/my_file.parquet")
(This works with pandas==1.5.0)
Comment From: schmidt-ai
I'm also hitting this. Another workaround could be:
import pandas as pd
import fsspec
with fsspec.open("filecache::s3://my-bucket/my_file.parquet") as f:
    df = pd.read_parquet(f)