Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
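import pandas as pd

# file_, root, and elements are not defined in the report;
# file_ is the path to an XML file.
with open(file_, "r") as f:
    df = pd.read_xml(
        f,
        parser="etree",
        iterparse={root: list(elements)},
    )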
Issue Description
The read_xml method with the iterparse parameter is meant for reading large XML files, but it is restricted to files on local disk. This design is not justified, since iterparse from xml.etree.ElementTree does allow reading from a file-like object. The error comes from this check:
if (
    not isinstance(self.path_or_buffer, str)  # -> condition that raises the error
    or is_url(self.path_or_buffer)
    or is_fsspec_url(self.path_or_buffer)
    or self.path_or_buffer.startswith(("<?xml", "<"))
    or infer_compression(self.path_or_buffer, "infer") is not None
):
    raise ParserError(
        "iterparse is designed for large XML files that are fully extracted on "
        "local disk and not as compressed files or online sources."
    )
Maybe something like this would be better:
if isinstance(self.path_or_buffer, str) or hasattr(self.path_or_buffer, "read"):
    pass  # do something
else:
    raise ParserError(...)
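To make the suggested condition concrete, here is a minimal standalone sketch. The helper name check_iterparse_source and the use of ValueError are assumptions for illustration only and are not pandas code. The sketch also leaves out the URL, compression, and literal-XML-string checks from the original condition.

```python
import io


def check_iterparse_source(path_or_buffer):
    """Hypothetical helper: accept a path string or any object with .read()."""
    if isinstance(path_or_buffer, str) or hasattr(path_or_buffer, "read"):
        return True
    # Anything else (e.g. an int or a list) cannot be parsed incrementally.
    raise ValueError(
        "iterparse expects a local file path or a file-like object, "
        f"got {type(path_or_buffer).__name__}"
    )


# Both a path string and an in-memory buffer pass the check.
check_iterparse_source("data.xml")
check_iterparse_source(io.BytesIO(b"<data/>"))
```

With this condition an open file handle is no longer rejected just because it is not a str.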
Expected Behavior
This should return a DataFrame rather than raise an exception.
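For reference, a small sketch of the standard library reading rows from a file-like object with xml.etree.ElementTree.iterparse. The sample XML, the function name rows_from_buffer, and the row/field names are made up for illustration; this only shows that the underlying parser accepts a buffer, not how pandas implements read_xml.

```python
import io
import xml.etree.ElementTree as ET

# Made-up sample document standing in for a large XML file.
XML = b"""<?xml version="1.0"?>
<data>
  <row><a>1</a><b>2</b></row>
  <row><a>3</a><b>4</b></row>
</data>
"""


def rows_from_buffer(buffer, row_tag, fields):
    """Collect one dict per repeating element, reading incrementally."""
    rows = []
    for _event, elem in ET.iterparse(buffer, events=("end",)):
        if elem.tag == row_tag:
            rows.append({name: elem.findtext(name) for name in fields})
            elem.clear()  # drop children of rows that are already processed
    return rows


rows = rows_from_buffer(io.BytesIO(XML), "row", ["a", "b"])
print(rows)  # [{'a': '1', 'b': '2'}, {'a': '3', 'b': '4'}]
```

The same call works with a handle returned by open(), so nothing in the etree parser itself requires a path string.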
Installed Versions
Comment From: phofl
Hi, thanks for your report. Please provide a reproducible example that is copy-pasteable.
Comment From: ParfaitG
This is a straightforward fix and not a use case originally considered for very large files requiring iterparse. For the lxml parser, the read mode must be rb, as its iterparse method expects only binary input.
Comment From: bama-chi
@phofl I'm not sure what more I can provide beyond the example above; you only have to replace the file_ variable with your XML file.
Let me know if I can help with something.
@ParfaitG thank you for your PR.
Comment From: phofl
That's exactly what copy-and-pasteable means. For example, a simple read_csv reproducer:
from io import StringIO

import pandas as pd

data = """a,b
1,2
"""
pd.read_csv(StringIO(data))
The same goes for XML in future bug reports.
Comment From: bama-chi
Okay, thanks.