Pandas can't handle zipfile.Path objects (ValueError: Invalid file path or buffer object type: <class 'zipfile.Path'>)

This is reproducible with the current latest pandas release, 1.5.2.

In Python, the zipfile.Path class is intended to behave similarly (but not identically!) to pathlib.Path. Pandas accepts the latter but not the former.

Steps to reproduce:
1. Create a zip file named foo.zip containing one CSV file named bar.csv.
2. Create a path object pointing directly to that CSV file inside the zip file: zp = zipfile.Path('foo.zip', 'bar.csv')
3. Pass that path object (zp) to pandas.read_csv() as the path.
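The steps above can be sketched as a small script. The helper names make_zip_path and reproduce are hypothetical and only serve this illustration. The pandas import sits inside reproduce so that the fixture helper also runs where pandas is not installed. Whether the ValueError actually appears depends on the installed pandas version; it is reported here for 1.5.2.

```python
import zipfile
from pathlib import Path


def make_zip_path(directory):
    """Steps 1 and 2: create foo.zip with bar.csv, return a zipfile.Path to the entry."""
    archive = Path(directory) / "foo.zip"
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("bar.csv", "a,b\n1,2\n")
    return zipfile.Path(str(archive), "bar.csv")


def reproduce(zp):
    """Step 3: hand the zipfile.Path to pandas.

    Returns the error message if pandas raises ValueError, otherwise None.
    """
    import pandas  # imported lazily; only this step needs pandas

    try:
        pandas.read_csv(zp)
    except ValueError as exc:
        return str(exc)
    return None
```

Calling reproduce(make_zip_path(some_directory)) on an affected version should return the "Invalid file path or buffer object type" message.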

Because of this part of your code, https://github.com/pandas-dev/pandas/blob/3b09765bc9b141349a2f73edd15aaf4f436928bd/pandas/io/common.py#L446-L452, Python raises "ValueError: Invalid file path or buffer object type: <class 'zipfile.Path'>".

EDIT: I'm aware that pandas.read_csv() offers the compression argument and can read compressed CSV files on its own. But this doesn't help in my case. I'm using pandas as a backend for a higher-level API that reads data files; pandas is just one part of it. One shortcoming of pandas here is that it cannot deal with ZIP files containing multiple CSV files.

pathlib.Path and zipfile.Path are both part of the Python standard library, and IMHO pandas should be able to deal with both.

Comment From: twoertwein

Pandas works with all os.PathLike objects. zipfile.Path doesn't implement the os.PathLike protocol as it doesn't have __fspath__.

Since the python documentation of zipfile.Path claims that it is compatible with pathlib.Path (which has __fspath__), I would recommend opening a bug at cpython.
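A quick way to check the protocol claim with the standard library only. This is a minimal sketch that builds the archive in memory, so the variable names and the sample CSV content are assumptions made for illustration.

```python
import io
import os
import pathlib
import zipfile

# Build a small zip archive in memory so no files are needed on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("bar.csv", "a,b\n1,2\n")

zp = zipfile.Path(zipfile.ZipFile(buffer), "bar.csv")
pp = pathlib.Path("bar.csv")

# os.PathLike is satisfied by any class that defines __fspath__.
pathlib_is_pathlike = isinstance(pp, os.PathLike)
zip_is_pathlike = isinstance(zp, os.PathLike)
zip_has_fspath = hasattr(zp, "__fspath__")

print(pathlib_is_pathlike, zip_is_pathlike, zip_has_fspath)
```

If the analysis above holds, only the pathlib.Path check is true.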

Comment From: buhtz

Thanks for your analysis. Where does pandas need __fspath__?

Comment From: twoertwein

Where does pandas need __fspath__?

Here https://github.com/pandas-dev/pandas/blob/61e0db25f5982063ba7bab062074d55d5e549586/pandas/io/common.py#L254

Comment From: twoertwein

I'm not familiar enough with what zipfile.Path does (or why it even exists - why not use pathlib) - it might be okay to have an elif branch to check for zipfile.Path and then convert it to a str?

Comment From: buhtz

I'm not familiar enough with what zipfile.Path does (or why it even exists - why not use pathlib)

That is a good question. Let me explain. The point is that you can "access" files inside a zip archive without explicitly opening and reading that zip file.

Let's say you have a zip-file containing three "data" files (2x csv, 1x excel). I would like to read them like this:

import zipfile
import pandas

fpa = zipfile.Path('data.zip', 'entryA.csv')
fpb = zipfile.Path('data.zip', 'entryB.csv')
fpc = zipfile.Path('data.zip', 'entryC.xlsx')

# that is possible
with fpa.open('r') as handle:
    # read from the opened handle; zipfile.Path itself has no read() method
    csv_content = handle.read()

# but pandas isn't able to do it like this
df = pandas.read_csv(fpa)

I have a kind of pandas wrapper (buhtzology) for my own daily work. There, in buhtzology.bandas.read_and_validate_excel(), you can see how I work around that shortcoming. Just have a closer look at _file_path_or_zip_path_as_buffer(), where I decide whether pandas.read_excel() receives a pathlib.Path instance or an io object via zipfile.Path.open().

Comment From: twoertwein

Thank you for describing the use case! In that case, it doesn't make sense to convert it to a str (and adding __fspath__ probably makes no sense either), as the file inside the zip file cannot be opened using open().

The reason why df = pandas.read_csv(fpa) doesn't work is that fpa doesn't have read/write. This should work:

with fpa.open('r') as handle:
    df = pandas.read_csv(handle)  # might need to specify mode="rb"

Comment From: buhtz

The reason why df = pandas.read_csv(fpa) doesn't work is that fpa doesn't have read/write. This should work:

with fpa.open('r') as handle:
    df = pandas.read_csv(handle)  # might need to specify mode="rb"

That is exactly what I'm doing here in my workaround.

But other file-reading libraries that I've tested don't have problems with zipfile.Path and don't need such a workaround. E.g. Python's csv.reader() can handle such path objects well, as you can see here.
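For illustration, one way csv.reader() can be fed from a zipfile.Path. This is a sketch of assumed usage, not necessarily the code behind the link above: csv.reader() consumes the handle returned by zipfile.Path.open(), and the encoding and newline keywords in text mode assume Python 3.9 or newer. The in-memory archive and its content are made up for the example.

```python
import csv
import io
import zipfile

# Build a small zip archive in memory with one CSV entry.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("bar.csv", "a,b\n1,2\n")

zp = zipfile.Path(zipfile.ZipFile(buffer), "bar.csv")

# newline="" is what the csv module recommends for file objects.
with zp.open("r", encoding="utf-8", newline="") as handle:
    rows = list(csv.reader(handle))

print(rows)
```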

Comment From: twoertwein

I think an option to accommodate that would be:

first handle fsspec or urllib
convert remaining strings to pathlib.Path
call .open(mode=..., encoding=..., errors=..., newline=...) on objects that do not have read/write - works for both Paths
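A rough sketch of how these steps could look, not actual pandas code. The function name open_path_or_buffer and its defaults are hypothetical, and the first step (fsspec/urllib handling) is left out entirely. It relies on pathlib.Path.open() and zipfile.Path.open() both accepting mode, encoding, errors and newline as keywords in text mode, which for zipfile.Path assumes Python 3.9 or newer.

```python
import pathlib


def open_path_or_buffer(obj, mode="r", encoding="utf-8", errors="strict", newline=None):
    """Return a readable/writable handle for a str, a path object, or a buffer."""
    # Step 1 (fsspec or urllib handling) is intentionally omitted in this sketch.

    # Step 2: remaining strings become pathlib.Path objects.
    if isinstance(obj, str):
        obj = pathlib.Path(obj)

    # Objects that already have read/write are buffers and pass through unchanged.
    if hasattr(obj, "read") or hasattr(obj, "write"):
        return obj

    # Step 3: anything else is expected to provide .open(), as both Path classes do.
    if "b" in mode:
        # Binary mode takes no text-related arguments on either Path class.
        return obj.open(mode=mode)
    return obj.open(mode=mode, encoding=encoding, errors=errors, newline=newline)
```

With such a helper, the reader would only ever see handles, so pathlib.Path and zipfile.Path would follow the same code path.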