When a Parquet file in S3 is overwritten from another terminal and then re-read with pd.read_parquet(), we get ArrowIOError: Invalid parquet file. Corrupt footer.
The code to replicate this error is:
import pandas as pd
pd.show_versions()
n = 5
df = pd.DataFrame()
for i in range(n):
    df[str(i)] = [1, 2, 3]
bucket = 'XXXXXX'
file_name = f's3://{bucket}/dataframe.pqt'
df.to_parquet(file_name)
df = pd.read_parquet(file_name)
Then upload a modified DataFrame with the same name from another terminal:
import pandas as pd
n = 6
df = pd.DataFrame()
for i in range(n):
    df[str(i)] = [1, 2, 3]
bucket = 'XXXXXX'
file_name = f's3://{bucket}/dataframe.pqt'
df.to_parquet(file_name)
Then, after trying to read it again from the first terminal:
df = pd.read_parquet(file_name)
We get an error:
----> 1 df = pd.read_parquet(file_name)
~/venv/lib/python3.6/site-packages/pandas/io/parquet.py in read_parquet(path, engine, columns, **kwargs)
313
314 impl = get_engine(engine)
--> 315 return impl.read(path, columns=columns, **kwargs)
~/venv/lib/python3.6/site-packages/pandas/io/parquet.py in read(self, path, columns, **kwargs)
129 def read(self, path, columns=None, **kwargs):
130 parquet_ds = self.api.parquet.ParquetDataset(
--> 131 path, filesystem=get_fs_for_path(path), **kwargs
132 )
133 kwargs["columns"] = columns
~/venv/lib/python3.6/site-packages/pyarrow/parquet.py in __init__(self, path_or_paths, filesystem, schema, metadata, split_row_groups, validate_schema, filters, metadata_nthreads, read_dictionary, memory_map, buffer_size)
1058
1059 if validate_schema:
-> 1060 self.validate_schemas()
1061
1062 def equals(self, other):
~/venv/lib/python3.6/site-packages/pyarrow/parquet.py in validate_schemas(self)
1090 self.schema = self.common_metadata.schema
1091 else:
-> 1092 self.schema = self.pieces[0].get_metadata().schema
1093 elif self.schema is None:
1094 self.schema = self.metadata.schema
~/venv/lib/python3.6/site-packages/pyarrow/parquet.py in get_metadata(self)
558 metadata : FileMetaData
559 """
--> 560 f = self.open()
561 return f.metadata
562
~/venv/lib/python3.6/site-packages/pyarrow/parquet.py in open(self)
565 Returns instance of ParquetFile
566 """
--> 567 reader = self.open_file_func(self.path)
568 if not isinstance(reader, ParquetFile):
569 reader = ParquetFile(reader, **self.file_options)
~/venv/lib/python3.6/site-packages/pyarrow/parquet.py in _open_dataset_file(dataset, path, meta)
940 read_dictionary=dataset.read_dictionary,
941 common_metadata=dataset.common_metadata,
--> 942 buffer_size=dataset.buffer_size
943 )
944
~/venv/lib/python3.6/site-packages/pyarrow/parquet.py in __init__(self, source, metadata, common_metadata, read_dictionary, memory_map, buffer_size)
135 self.reader.open(source, use_memory_map=memory_map,
136 buffer_size=buffer_size,
--> 137 read_dictionary=read_dictionary, metadata=metadata)
138 self.common_metadata = common_metadata
139 self._nested_paths_by_prefix = self._build_nested_paths()
~/venv/lib/python3.6/site-packages/pyarrow/_parquet.pyx in pyarrow._parquet.ParquetReader.open()
~/venv/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
ArrowIOError: Invalid parquet file. Corrupt footer.
The output from pd.show_versions() is:
INSTALLED VERSIONS
------------------
commit : None
python : 3.6.8.final.0
python-bits : 64
OS : Linux
OS-release : 5.3.0-61-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.0.4
numpy : 1.17.2
pytz : 2019.3
dateutil : 2.8.0
pip : 20.1.1
setuptools : 41.4.0
Cython : None
pytest : 5.2.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.8.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.2.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.15.0
pytables : None
pytest : 5.2.1
pyxlsb : None
s3fs : 0.3.5
scipy : 1.4.1
sqlalchemy : 1.3.10
tables : None
tabulate : 0.8.5
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : 0.46.0
In order to reproduce the error, the DataFrame in S3 must be modified from a different terminal; uploading the modified DataFrame and reading it again from the same terminal does not raise the error.
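A single-script variant of the same repro, where the "other terminal" is a child Python process (a sketch; the bucket name is a placeholder as above):
import subprocess
import sys
import textwrap

import pandas as pd

bucket = 'XXXXXX'  # placeholder bucket name
file_name = f's3://{bucket}/dataframe.pqt'

# First process: write and read, which caches the object's size in s3fs.
pd.DataFrame({str(i): [1, 2, 3] for i in range(5)}).to_parquet(file_name)
pd.read_parquet(file_name)

# "Other terminal": overwrite with a wider (larger) DataFrame from a
# separate Python process, so this process's s3fs cache goes stale.
overwrite = textwrap.dedent(f'''
    import pandas as pd
    df = pd.DataFrame({{str(i): [1, 2, 3] for i in range(6)}})
    df.to_parquet("{file_name}")
''')
subprocess.run([sys.executable, '-c', overwrite], check=True)

# With s3fs 0.4.0 this raises ArrowIOError: Invalid parquet file. Corrupt footer.
pd.read_parquet(file_name)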
Comment From: alimcmaster1
Thanks for the report. Just tried with pyarrow 0.17.1 - can't reproduce. Can you try?
Comment From: eeroel
Hi,
This is probably related to a bug in s3fs that appears to be fixed as of s3fs 0.4.2 on PyPI, but present in 0.4.0: https://github.com/dask/s3fs/issues/332. It also affects read_csv, and probably any method that reads from S3. Basically, s3fs was caching the size of the object on S3, and when the object was replaced with a larger one, only the data up to the cached size was read; that would explain the "Corrupt footer" error here, since the Parquet footer sits at the end of the file.
Maybe the pandas optional dependency for s3fs should be bumped to 0.4.2?
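Until then, a possible workaround sketch for affected s3fs versions (this assumes pandas reuses the cached S3FileSystem instance, and I have not verified that invalidating the cache refreshes the stale size on 0.4.0; upgrading to 0.4.2 is the real fix):
import pandas as pd
import s3fs

# Untested workaround sketch for s3fs < 0.4.2. s3fs caches filesystem
# instances, so s3fs.S3FileSystem() should return the same instance that
# pandas uses internally; invalidate_cache() drops the cached object
# metadata (including sizes) so the new size is fetched from S3.
fs = s3fs.S3FileSystem()
fs.invalidate_cache()
df = pd.read_parquet('s3://XXXXXX/dataframe.pqt')  # placeholder path, as above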
Comment From: alimcmaster1
Thanks for the report - is this a different issue to the above? Do you have a repro?
Comment From: miguelfdag
It could very well be the same bug I experienced.
Comment From: eeroel
@alimcmaster1 I suspect the above issue is a special case of a more general bug. I created a unit test here: https://github.com/eeroel/pandas/tree/s3_changing_file_bug. I was able to reproduce the bug in a fresh Pandas Docker container as follows:
conda install s3fs==0.4.0 --no-deps # to install this exact version
pytest pandas/tests/io/test_s3.py # This will fail with s3fs==0.4.0, but pass with 0.4.2
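The test is roughly of this shape (a condensed sketch, not the exact code in the branch; s3_resource is the moto-backed S3 fixture from pandas/tests/io/conftest.py, which provides a pandas-test bucket):
import pandas as pd
import pandas._testing as tm

def test_read_csv_handles_updated_s3_file(s3_resource):
    # Write a small object, read it (s3fs caches its size), overwrite it
    # with a strictly larger one, then read again. With s3fs 0.4.0 the
    # second read is truncated to the stale cached size and fails.
    path = 's3://pandas-test/changing.csv'

    small = pd.DataFrame({'a': [1, 2, 3]})
    small.to_csv(path, index=False)
    pd.read_csv(path)  # this read caches the object's size

    large = pd.DataFrame({'a': range(1000)})
    large.to_csv(path, index=False)

    result = pd.read_csv(path)
    tm.assert_frame_equal(result, large)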
Should I open a new issue for this?
Comment From: mroeschke
This looks like an issue with an unsupported version of s3fs, so closing for now. Can reopen if it reproduces with a newer version.