Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import io
import pandas as pd
import pyarrow
import pyarrow.parquet  # explicit import: pyarrow.parquet is a submodule used below
print(pd.__version__)
print(pyarrow.__version__)
df = pd.read_csv(io.StringIO("""
a\tb
1\t4
2\t5
3\t6
"""), sep="\t")
print(df.shape)
# outputs are:
# 1.5.3
# 11.0.0
# (3, 2)
# save to parquet (pyarrow engine)
df.to_parquet("./not_empty.parquet", engine="pyarrow")
df.query("a<0").to_parquet("./empty.parquet", engine="pyarrow")
# number of columns reported by pyarrow is not consistent (the extra one is "__index_level_0__")
print(pyarrow.parquet.ParquetFile("./not_empty.parquet").metadata.num_columns)
print(pyarrow.parquet.ParquetFile("./empty.parquet").metadata.num_columns)
# read_parquet doesn't reflect such a column
print(pd.read_parquet("./not_empty.parquet").columns)
print(pd.read_parquet("./empty.parquet").columns)
# the problem is that other tools (e.g. polars) expose "__index_level_0__" as a regular column
import polars as pl
print(pl.__version__)
print(pl.scan_parquet("./empty.parquet").collect().columns)
# outputs are:
# 2
# 3
# Index(['a', 'b'], dtype='object')
# Index(['a', 'b'], dtype='object')
# 0.16.9
# ['a', 'b', '__index_level_0__']
Issue Description
An extra column, __index_level_0__, is saved to the parquet file when the DataFrame has 0 rows.
Expected Behavior
The parquet file should not contain the extra __index_level_0__ column.
Installed Versions
INSTALLED VERSIONS
------------------
commit : 2e218d10984e9919f0296931d92ea851c6a6faf5
python : 3.9.12.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 154 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : Chinese (Simplified)_China.936
pandas : 1.5.3
numpy : 1.23.5
pytz : 2021.3
dateutil : 2.8.2
setuptools : 61.2.0
pip : 21.2.4
Cython : 0.29.28
pytest : 7.1.1
hypothesis : None
sphinx : 4.4.0
blosc : None
feather : None
xlsxwriter : 3.0.3
lxml.etree : 4.8.0
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.2.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : 1.3.4
brotli :
fastparquet : 2023.2.0
fsspec : 2023.1.0
gcsfs : None
matplotlib : 3.5.1
numba : None
numexpr : 2.8.1
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.7.3
snappy :
sqlalchemy : 1.4.32
tables : 3.6.1
tabulate : 0.8.9
xarray : 0.20.1
xlrd : 2.0.1
xlwt : None
zstandard : None
tzdata : 2022.2
Comment From: phofl
Hi, thanks for your report. This is a problem in pyarrow when converting your DataFrame to a pyarrow table.
Comment From: phofl
Actually, this was incorrect: arrow behaves as documented. This is a problem in polars, so please report it there.
Simpler reproducer:
import pandas as pd
import polars as pl

df = pd.DataFrame({"A": [1, 2, 3]}, index=[1, 2, 3])
df.to_parquet("./empty.parquet")
print(pl.scan_parquet("./empty.parquet").collect().columns)