Pandas version checks

  • [x] I have checked that this issue has not already been reported.

  • [x] I have confirmed this bug exists on the latest version of pandas.

  • [x] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

# fsspec==2022.8.2
df = pd.read_parquet("hdfs:///path/to/myfile.parquet")  # works

# fsspec==2022.11.0
df = pd.read_parquet("hdfs:///path/to/myfile.parquet")  # raises
# OSError: only valid on seekable files

Issue Description

In 2022.10.0, fsspec changed its hdfs backend to use the new filesystem implementation in pyarrow. This appears to break compatibility with pandas: the backend now returns a non-seekable file handle, while pandas expects a seekable one.

One solution could be to have pandas pin fsspec<=2022.8.2, the last version that worked.

Another option would be to ask upstream (fsspec) to guarantee a seekable file handle.

A third would be to modify the pandas reader to detect a non-seekable file handle and buffer the file into memory.
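The third option could be sketched roughly as follows. This is a minimal illustration, not pandas code: `ensure_seekable` is a hypothetical helper name, and `NonSeekable` is a stand-in for the handle the new fsspec backend returns.

```python
import io


def ensure_seekable(handle):
    # Hypothetical helper (not part of pandas): pass a seekable handle
    # through unchanged, otherwise buffer its full contents in memory.
    if hasattr(handle, "seekable") and handle.seekable():
        return handle
    return io.BytesIO(handle.read())


class NonSeekable(io.RawIOBase):
    # Minimal stand-in for a non-seekable read-only file handle.
    def __init__(self, data):
        self._inner = io.BytesIO(data)

    def readable(self):
        return True

    def read(self, size=-1):
        return self._inner.read(size)

    def seekable(self):
        return False


buffered = ensure_seekable(NonSeekable(b"PAR1 payload"))
print(buffered.seekable())  # True
```

For large remote files this trades memory for compatibility, since the whole file is read up front.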

Expected Behavior

read_parquet should continue to work with remote HDFS files as it did with earlier versions of the fsspec dependency.

Installed Versions

INSTALLED VERSIONS
------------------
commit : 8dab54d6573f7186ff0c3b6364d5e4dd635ff3e7
python : 3.8.13.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-77-generic
Version : #86~18.04.1-Ubuntu SMP Fri Jun 18 01:23:22 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 1.5.2
numpy : 1.24.1
pytz : 2022.7
dateutil : 2.8.2
setuptools : 51.3.3
pip : 20.3.4
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 7.26.0
pandas_datareader : None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2022.11.0
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 10.0.1
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None

Comment From: j3doucet

+1. This has a significant impact for us as users of pandas + HDFS.

Comment From: f4hy

Adding more details. The error message given is the following:

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-10-51b2586c3c25> in <module>
----> 1 pd.read_parquet("hdfs:///path/to/myfile.parquet")

~/.local/lib/python3.8/site-packages/pandas/io/parquet.py in read_parquet(path, engine, columns, storage_options, use_nullable_dtypes, **kwargs)
    501     impl = get_engine(engine)
    502 
--> 503     return impl.read(
    504         path,
    505         columns=columns,

~/.local/lib/python3.8/site-packages/pandas/io/parquet.py in read(self, path, columns, use_nullable_dtypes, storage_options, **kwargs)
    249         )
    250         try:
--> 251             result = self.api.parquet.read_table(
    252                 path_or_handle, columns=columns, **kwargs
    253             ).to_pandas(**to_pandas_kwargs)

~/.local/lib/python3.8/site-packages/pyarrow/parquet/core.py in read_table(source, columns, use_threads, metadata, schema, use_pandas_metadata, read_dictionary, memory_map, buffer_size, partitioning, filesystem, filters, use_legacy_dataset, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, decryption_properties, thrift_string_size_limit, thrift_container_size_limit)
   2924             )
   2925         try:
-> 2926             dataset = _ParquetDatasetV2(
   2927                 source,
   2928                 schema=schema,

~/.local/lib/python3.8/site-packages/pyarrow/parquet/core.py in __init__(self, path_or_paths, filesystem, filters, partitioning, read_dictionary, buffer_size, memory_map, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, schema, decryption_properties, thrift_string_size_limit, thrift_container_size_limit, **kwargs)
   2464 
   2465             self._dataset = ds.FileSystemDataset(
-> 2466                 [fragment], schema=schema or fragment.physical_schema,
   2467                 format=parquet_format,
   2468                 filesystem=fragment.filesystem

~/.local/lib/python3.8/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Fragment.physical_schema.__get__()

~/.local/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

~/.local/lib/python3.8/site-packages/pyarrow/io.pxi in pyarrow.lib.NativeFile.tell()

~/.local/lib/python3.8/site-packages/pyarrow/io.pxi in pyarrow.lib.NativeFile.get_random_access_file()

~/.local/lib/python3.8/site-packages/pyarrow/io.pxi in pyarrow.lib.NativeFile._assert_seekable()

OSError: only valid on seekable files

Comment From: f4hy

It looks like fsspec updated this in 2022.1.0, in PR #1154, so that a pyarrow file opened for reading can be made seekable, but it requires passing the seekable=True option, e.g.

fs = fsspec.filesystem("hdfs")
path = "hdfs:///path/to/myfile.parquet"
with fs.open(path, 'rb', seekable=True) as f:
    print(f.seekable())  # True
with fs.open(path, 'rb') as f:
    print(f.seekable())  # False

It is unclear to me how to pass the seekable option through pandas to the underlying fsspec open() call.

Comment From: martindurant

This is fixed on master since https://github.com/fsspec/filesystem_spec/pull/1186

Comment From: mroeschke

Sounds like this was an upstream issue that has since been fixed, so closing.