This is reproducible in the current latest pandas release, 1.5.2.
In Python, the zipfile.Path class is intended to act similarly (but not absolutely identically!) to pathlib.Path. The latter is accepted by pandas, but the former is not.
Steps to reproduce:
1. Create a zip file named foo.zip containing one CSV file named bar.csv.
2. Create a path object pointing directly to that CSV file inside the zip file: zp = zipfile.Path('foo.zip', 'bar.csv')
3. Pass that path object (zp) to pandas.read_csv() as its first argument.
Because of this part of your code:
https://github.com/pandas-dev/pandas/blob/3b09765bc9b141349a2f73edd15aaf4f436928bd/pandas/io/common.py#L446-L452
Python raises a ValueError: Invalid file path or buffer object type:
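The steps above can be sketched as a small script. This is only a minimal sketch: the helper names make_zip_path and read_with_pandas and the CSV contents are made up for illustration, and the pandas import is kept inside the function so the archive setup runs without pandas installed.

```python
import pathlib
import tempfile
import zipfile


def make_zip_path() -> zipfile.Path:
    """Steps 1 and 2: create foo.zip containing bar.csv, return a path to the CSV."""
    archive = pathlib.Path(tempfile.mkdtemp()) / "foo.zip"
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("bar.csv", "a,b\n1,2\n")  # illustrative content
    return zipfile.Path(archive, "bar.csv")


def read_with_pandas(zp: zipfile.Path):
    """Step 3: hand the zipfile.Path straight to pandas."""
    import pandas  # imported lazily; only needed for this step

    # Reported in this issue to raise ValueError with pandas 1.5.2.
    return pandas.read_csv(zp)
```

Calling read_with_pandas(make_zip_path()) is the call that fails as described in this report.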
EDIT:
I'm aware that pandas.read_csv() offers the compression argument and can read compressed CSV files on its own. But this doesn't help in my case. I'm using pandas as a backend for a higher-level API that reads data files. Pandas is just one part of it. One shortcoming of pandas here is that it cannot deal with ZIP files containing multiple CSV files.
pathlib.Path and zipfile.Path are both standard Python, and IMHO pandas should be able to deal with both.
Comment From: twoertwein
Pandas works with all os.PathLike objects. zipfile.Path doesn't implement the os.PathLike protocol, as it doesn't have __fspath__.
Since the Python documentation of zipfile.Path claims that it is compatible with pathlib.Path (which has __fspath__), I would recommend opening a bug at CPython.
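The difference can be checked with the standard library alone. The following is a minimal sketch that builds a throwaway archive (the file names and contents are illustrative) and compares how the two path types behave against os.PathLike.

```python
import os
import pathlib
import tempfile
import zipfile

# Build a throwaway archive so zipfile.Path has something to point at.
archive = pathlib.Path(tempfile.mkdtemp()) / "foo.zip"
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("bar.csv", "a,b\n1,2\n")

zp = zipfile.Path(archive, "bar.csv")

# pathlib.Path implements __fspath__, so it satisfies os.PathLike.
plain_is_pathlike = isinstance(archive, os.PathLike)

# zipfile.Path has no __fspath__, so the same check fails.
zip_is_pathlike = isinstance(zp, os.PathLike)
zip_has_fspath = hasattr(zp, "__fspath__")

print(plain_is_pathlike, zip_is_pathlike, zip_has_fspath)
```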
Comment From: buhtz
Thanks for your analysis.
Where does pandas need __fspath__?
Comment From: twoertwein
> Where does pandas need __fspath__?
Here https://github.com/pandas-dev/pandas/blob/61e0db25f5982063ba7bab062074d55d5e549586/pandas/io/common.py#L254
Comment From: twoertwein
I'm not familiar enough with what zipfile.Path does (or why it even exists - why not use pathlib?). It might be okay to have an elif branch that checks for zipfile.Path and then converts it to a str?
Comment From: buhtz
> I'm not familiar enough with what zipfile.Path does (or why it even exists - why not use pathlib)
That is a good question. Let me explain. The point is that you can access files inside a zip archive without explicitly opening and reading that zip file.
Let's say you have a zip file containing three data files (two CSV, one Excel). I would like to read them like this:
import zipfile
import pandas

fpa = zipfile.Path('data.zip', 'entryA.csv')
fpb = zipfile.Path('data.zip', 'entryB.csv')
fpc = zipfile.Path('data.zip', 'entryC.xlsx')

# This is possible: read through the handle returned by open()
with fpa.open('r') as handle:
    csv_content = handle.read()

# But pandas isn't able to do it like this
df = pandas.read_csv(fpa)
I have a kind of pandas wrapper (buhtzology) for my own daily work. In buhtzology.bandas.read_and_validate_excel() you can see how I work around that shortcoming. Have a closer look at _file_path_or_zip_path_as_buffer(), where I decide whether pandas.read_excel() receives a pathlib.Path instance or an io object via zipfile.Path.open().
Comment From: twoertwein
Thank you for describing the use case! In that case, it doesn't make sense to convert it to a str (and adding __fspath__ probably makes no sense either), as the file inside the zip file cannot be opened using open().
The reason why df = pandas.read_csv(fpa) doesn't work is that fpa doesn't have read/write. This should work:
with fpa.open('r') as handle:
    df = pandas.read_csv(handle)  # might need to specify mode="rb"
Comment From: buhtz
> The reason why df = pandas.read_csv(fpa) doesn't work is that fpa doesn't have read/write. This should work:
> with fpa.open('r') as handle:
>     df = pandas.read_csv(handle)  # might need to specify mode="rb"
That is exactly what I'm doing here in my workaround.
But other file-reading libraries that I've tested don't have problems with zipfile.Path and don't need such a workaround. E.g. Python's csv.reader() can handle such path objects well, as you can see here.
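One way this can look with csv.reader() is sketched below. It is only an assumption about the usage meant here: the archive and its contents are illustrative, the path is opened first, and Python 3.9 or newer is assumed, where zipfile.Path.open() returns a text handle by default.

```python
import csv
import pathlib
import tempfile
import zipfile

# Illustrative archive with a single CSV entry.
archive = pathlib.Path(tempfile.mkdtemp()) / "data.zip"
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("entryA.csv", "a,b\n1,2\n")

fpa = zipfile.Path(archive, "entryA.csv")

# csv.reader() consumes any iterable of text lines, so the handle returned
# by zipfile.Path.open() is enough; no extraction to disk is needed.
with fpa.open("r") as handle:
    rows = list(csv.reader(handle))

print(rows)
```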
Comment From: twoertwein
I think an option to accommodate that would be:
- first handle fsspec or urllib
- convert remaining strings to pathlib.Path
- call .open(mode=..., encoding=..., errors=..., newline=...) on objects that do not have read/write (works for both Path classes)
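The proposal above could be sketched roughly as follows. This is not pandas code: open_handle is a hypothetical helper, the fsspec/urllib step is left out, and Python 3.9 or newer is assumed so that zipfile.Path.open() accepts the text-mode keyword arguments.

```python
import pathlib
import tempfile
import zipfile


def open_handle(source, mode="r", encoding=None, errors=None, newline=None):
    """Hypothetical helper (not part of pandas): return a file-like object."""
    # Step 1 (fsspec / urllib handling) is intentionally omitted here.

    # Objects that already expose read or write are used as they are.
    if hasattr(source, "read") or hasattr(source, "write"):
        return source

    # Step 2: remaining strings become pathlib.Path objects.
    if isinstance(source, str):
        source = pathlib.Path(source)

    # Step 3: both pathlib.Path and zipfile.Path provide .open().
    if "b" in mode:
        # Text-only arguments are rejected in binary mode, so omit them.
        return source.open(mode)
    return source.open(mode, encoding=encoding, errors=errors, newline=newline)


# Small demonstration with an illustrative archive.
demo_dir = pathlib.Path(tempfile.mkdtemp())
demo_archive = demo_dir / "foo.zip"
with zipfile.ZipFile(demo_archive, "w") as zf:
    zf.writestr("bar.csv", "a,b\n1,2\n")

with open_handle(zipfile.Path(demo_archive, "bar.csv")) as handle:
    demo_text = handle.read()
```

The handle returned this way could then be passed on to pandas.read_csv(), in the same manner as the with fpa.open('r') workaround earlier in this thread.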