Feature Type
- [X] Adding new functionality to pandas
- [X] Changing existing functionality in pandas
- [ ] Removing existing functionality in pandas
Problem Description
I wish there were an option to identify the default protocol when the user accesses the file so that the user can provide "
With a configurable default protocol, the request would still reach the fsspec handler class for that protocol type, and the user experience improves because the user only needs to supply the relative file path.
Feature Description
If possible, add an option to pandas to set the default protocol.
e.g.:
pd.options.protocol.default = "abfss"
pd.read_csv("file.csv")  # should be converted to pd.read_csv("abfss://file.csv")
Alternative Solutions
For relative paths or protocol-free URLs, pandas routes the request to native Python APIs. One alternative is to override this behavior in a fork, but having the functionality in pandas itself would support a wide range of fsspec URL protocols and improve the user experience.
Additional Context
No response
Comment From: TomAugspurger
I spoke with @KeerthiYandaOS about this, and overall I think it makes sense. In some environments, users / teams will want to change the default behavior for how files are discovered. You can require that users manually prepend the protocol to all paths, but this feature request is to enable users to write code that works with remote file systems without having to change the code.
In terms of implementation, I hope it would be relatively simple. We'd need to define the actual configuration option, something like pd.options.io.default_fsspec_protocol. By default that would be None, meaning no default protocol, and so pd.read_csv("test.csv") would be unchanged (use the local filesystem handler). If a user sets pd.options.io.default_fsspec_protocol = "abfss", pandas would just prefix the user-provided path with the value of pd.options.io.default_fsspec_protocol somewhere early in https://github.com/pandas-dev/pandas/blob/409673359972653a2fde437cb2a608a66c5753d1/pandas/io/common.py#L289.
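The prefixing step could look roughly like the sketch below. It is only an illustration: apply_default_protocol is a hypothetical helper, not an existing pandas function, and the option value is passed in as an argument because the option itself does not exist yet. The sketch assumes that paths which already carry a protocol, and non-string inputs such as open file handles, should be left untouched.

```python
import re

# Matches a leading URL scheme such as "s3://" or "abfss://".
_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.\-]*://")


def apply_default_protocol(filepath_or_buffer, default_protocol=None):
    """Hypothetical helper: prefix a protocol-free string path.

    default_protocol stands in for the proposed option value;
    None means "no default protocol" and leaves the input unchanged.
    """
    if default_protocol is None:
        return filepath_or_buffer
    # Buffers and other non-string inputs are passed through as-is.
    if not isinstance(filepath_or_buffer, str):
        return filepath_or_buffer
    # A path that already names a protocol keeps it.
    if _SCHEME.match(filepath_or_buffer):
        return filepath_or_buffer
    return f"{default_protocol}://{filepath_or_buffer}"


print(apply_default_protocol("file.csv", "abfss"))
print(apply_default_protocol("s3://bucket/file.csv", "abfss"))
```

How os.PathLike objects and absolute local paths should be treated is left open here and would need a decision from maintainers.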
Comment From: Rylie-W
I'd like to take it. It might have three subtasks:
- [ ] Define io.default_fsspec_protocol as done for io.sql.engine: https://github.com/pandas-dev/pandas/blob/7ab6f8b6a86e36d72b119ac9b4dc3fe4c4cf813f/pandas/core/config_init.py#L611-L623
- [ ] Modify method _get_filepath_or_buffer to concatenate a prefix like 'abfss' and the file path
- [ ] Write the corresponding tests
Is there anything missing? It could be my first contribution to pandas and I'd appreciate any advice.
Comment From: TomAugspurger
Thanks for volunteering @Rylie-W. We probably need some additional input from maintainers to ensure that this makes sense for the project. cc @datapythonista (since you've worked on filesystem / IO APIs in the past).
In terms of your subtasks, that sounds about right. We'll also want things like documentation. See the contributing guide for more.
For the tests, the memory filesystem should be especially straightforward to test against.
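A test against the memory filesystem might look something like the sketch below. It assumes fsspec is installed alongside pandas. Because the proposed option does not exist, the default_protocol variable stands in for it and the prefix is applied by hand; a real test would set the option and call pd.read_csv("test.csv") directly.

```python
import fsspec
import pandas as pd

# Stand-in for the proposed option value (assumption: not a real pandas option).
default_protocol = "memory"

# Write a small CSV into fsspec's in-memory filesystem.
with fsspec.open("memory://test.csv", "wb") as f:
    f.write(b"a,b\n1,2\n")

# Emulate what pandas would do internally with a default protocol set.
resolved = f"{default_protocol}://test.csv"
df = pd.read_csv(resolved)

print(df)
```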
One complication I failed to consider: what about default storage_options? With things like abfs, you typically need to include an account_name=<storage-account-name>. So simply prepending the protocol won't work. We could perhaps push that complication onto the storage backend, by having them define how to configure default options when not specified.
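One possible shape for backend-defined defaults is sketched below. Everything in it is hypothetical: register_default_options and resolve_storage_options are illustrative names, not part of pandas or fsspec, and the account name is the same placeholder used above. The idea is that a backend registers a callable that supplies default options, and anything the user passes explicitly takes precedence.

```python
from typing import Callable, Dict, Optional

# Hypothetical registry: protocol name -> callable returning default options.
_DEFAULT_OPTION_PROVIDERS: Dict[str, Callable[[], dict]] = {}


def register_default_options(protocol: str, provider: Callable[[], dict]) -> None:
    """Let a storage backend declare how its defaults are produced."""
    _DEFAULT_OPTION_PROVIDERS[protocol] = provider


def resolve_storage_options(protocol: str, storage_options: Optional[dict] = None) -> dict:
    """Merge backend defaults with user-supplied options (user wins)."""
    provider = _DEFAULT_OPTION_PROVIDERS.get(protocol)
    defaults = provider() if provider is not None else {}
    return {**defaults, **(storage_options or {})}


# Placeholder value; a real backend might read this from its own configuration.
register_default_options("abfs", lambda: {"account_name": "<storage-account-name>"})

print(resolve_storage_options("abfs"))
print(resolve_storage_options("abfs", {"account_name": "other"}))
```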