Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [x] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd

# fsspec==2022.8.2
df = pd.read_parquet("hdfs:///path/to/myfile.parquet")  # works
# fsspec==2022.11.0
df = pd.read_parquet("hdfs:///path/to/myfile.parquet")  # errors
# OSError: only valid on seekable files
Issue Description
In 2022.10.0, fsspec changed its HDFS backend to use the new pyarrow filesystem. This seems to break compatibility with pandas: the new backend apparently returns a non-seekable file, whereas pandas expects a seekable one.
One solution could be to have pandas require fsspec<=2022.8.2, the last version that worked.
Another option would be to look upstream to fsspec and have them guarantee a seekable filehandle.
A third would be to modify the pandas reader to detect a non-seekable filehandle and buffer the file.
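To illustrate the third option, here is a minimal sketch of the detect-and-buffer idea. It is not pandas code: `ensure_seekable` is a hypothetical helper name, and reading the whole file into memory is an assumption that only makes sense for files that fit in RAM.

```python
import io


def ensure_seekable(handle):
    """Return a seekable binary handle (hypothetical helper, not a pandas API).

    If the handle already reports itself as seekable it is returned unchanged;
    otherwise its remaining contents are read into an in-memory buffer.
    """
    seekable = getattr(handle, "seekable", None)
    if callable(seekable) and seekable():
        return handle
    # Assumption: the file is small enough to hold in memory.
    return io.BytesIO(handle.read())
```

A reader could call something like this on the opened handle before passing it to the parquet engine, at the cost of an extra in-memory copy for non-seekable sources.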
Expected Behavior
read_parquet should continue to work with HDFS remote files, as it did with earlier versions of the fsspec dependency.
Installed Versions
Comment From: j3doucet
+1, this has a significant impact on us as users of pandas + HDFS.
Comment From: f4hy
Adding more details. The error message given is the following:
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
<ipython-input-10-51b2586c3c25> in <module>
----> 1 pd.read_parquet("hdfs:///path/to/myfile.parquet")
~/.local/lib/python3.8/site-packages/pandas/io/parquet.py in read_parquet(path, engine, columns, storage_options, use_nullable_dtypes, **kwargs)
501 impl = get_engine(engine)
502
--> 503 return impl.read(
504 path,
505 columns=columns,
~/.local/lib/python3.8/site-packages/pandas/io/parquet.py in read(self, path, columns, use_nullable_dtypes, storage_options, **kwargs)
249 )
250 try:
--> 251 result = self.api.parquet.read_table(
252 path_or_handle, columns=columns, **kwargs
253 ).to_pandas(**to_pandas_kwargs)
~/.local/lib/python3.8/site-packages/pyarrow/parquet/core.py in read_table(source, columns, use_threads, metadata, schema, use_pandas_metadata, read_dictionary, memory_map, buffer_size, partitioning, filesystem, filters, use_legacy_dataset, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, decryption_properties, thrift_string_size_limit, thrift_container_size_limit)
2924 )
2925 try:
-> 2926 dataset = _ParquetDatasetV2(
2927 source,
2928 schema=schema,
~/.local/lib/python3.8/site-packages/pyarrow/parquet/core.py in __init__(self, path_or_paths, filesystem, filters, partitioning, read_dictionary, buffer_size, memory_map, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, schema, decryption_properties, thrift_string_size_limit, thrift_container_size_limit, **kwargs)
2464
2465 self._dataset = ds.FileSystemDataset(
-> 2466 [fragment], schema=schema or fragment.physical_schema,
2467 format=parquet_format,
2468 filesystem=fragment.filesystem
~/.local/lib/python3.8/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Fragment.physical_schema.__get__()
~/.local/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
~/.local/lib/python3.8/site-packages/pyarrow/io.pxi in pyarrow.lib.NativeFile.tell()
~/.local/lib/python3.8/site-packages/pyarrow/io.pxi in pyarrow.lib.NativeFile.get_random_access_file()
~/.local/lib/python3.8/site-packages/pyarrow/io.pxi in pyarrow.lib.NativeFile._assert_seekable()
OSError: only valid on seekable files
Comment From: f4hy
It looks like fsspec updated this in 2022.1.0 in PR#1154 so that a read of a pyarrow file can be made seekable, but this requires passing the seekable=True option.
e.g.
import fsspec

fs = fsspec.filesystem("hdfs")
path = "hdfs:///path/to/myfile.parquet"
with fs.open(path, 'rb', seekable=True) as f:
    print(f.seekable())  # True
with fs.open(path, 'rb') as f:
    print(f.seekable())  # False
It is unclear to me how to pass the seekable option to the open of the underlying fsspec.
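One possible workaround, sketched here under assumptions, is to open the file yourself with seekable=True and hand the open handle to the reader instead of the URL. `read_with_seekable_handle` is a hypothetical helper name, and the sketch assumes an fsspec version whose HDFS `open` accepts the `seekable` keyword as in the example above. pd.read_parquet accepts file-like objects, so it can serve as the `reader` argument.

```python
def read_with_seekable_handle(fs, path, reader, **reader_kwargs):
    """Open `path` on `fs` as a seekable binary file and pass it to `reader`.

    Hypothetical helper: `fs` is any fsspec-style filesystem whose open()
    accepts seekable=True, and `reader` is any callable taking a file object.
    """
    with fs.open(path, "rb", seekable=True) as f:
        return reader(f, **reader_kwargs)


# Usage sketch (needs a reachable HDFS cluster, so left as comments):
# import fsspec
# import pandas as pd
# fs = fsspec.filesystem("hdfs")
# df = read_with_seekable_handle(fs, "hdfs:///path/to/myfile.parquet", pd.read_parquet)
```

This sidesteps the question of forwarding options through pandas, but it means the caller manages the filesystem object directly.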
Comment From: martindurant
This is fixed on master since https://github.com/fsspec/filesystem_spec/pull/1186
Comment From: mroeschke
Sounds like this was an upstream issue that has been fixed, so closing.