Code Sample
import pandas as pd
excel = pd.ExcelFile("test.xlsx")
for sheet in excel.sheet_names:
    reader = excel.parse(sheet, chunksize=1000)
    for chunk in reader:
        # process chunk
Problem description
In version 0.16.1 the chunksize argument was available.
See: http://pandas.pydata.org/pandas-docs/version/0.16.1/generated/pandas.ExcelFile.parse.html
But in the latest version it is no longer available:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.ExcelFile.parse.html
What was the reason that it was removed?
Also, how should I process an Excel file in chunks in the latest version?
Comment From: jorisvandenbossche
This was removed in https://github.com/pandas-dev/pandas/pull/11198. The reason stated there is that the argument didn't do anything (xref https://github.com/pandas-dev/pandas/issues/8011).
Comment From: gfyoung
@jorisvandenbossche : Well, it is a functional parameter in read_csv, so potentially it could (or should) be implemented. That being said, I think your explanation should satisfy the initial inquiry, so closing for now.
Comment From: jorisvandenbossche
Yes, that feature request is covered by #8011
Comment From: EugeneKovalev
And what was the reason for the removal? OK, it hadn't been working before, but where is the fix? Removing a part of the code that should work is not a fix.
Comment From: jorisvandenbossche
@EugeneKovalev No keyword (for now) is better than a confusing keyword that does not have any effect. The feature request is still open (#8011), a code contribution to add it is always welcome.
Comment From: swarits
@EugeneKovalev It was removed because, due to the nature of the XLSX file format, Excel files are read into memory as a whole during parsing. A large file would therefore cause a 'MemoryError', and chunksize wouldn't change this behavior for Excel files. It works fine for CSVs because they can be loaded into memory in parts.
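For anyone who still needs chunked processing, one possible workaround is to bypass the pandas reader and stream rows with openpyxl in read-only mode, grouping them into chunks by hand. This is only a rough sketch and not part of the pandas API: iter_chunks and iter_excel_chunks are hypothetical helper names, the first row of each sheet is assumed to hold the column names, and openpyxl must be installed. How much memory this actually saves depends on the workbook.

```python
from itertools import islice


def iter_chunks(rows, chunksize):
    """Yield lists of up to `chunksize` items from any iterable of rows."""
    if chunksize < 1:
        raise ValueError("chunksize must be a positive integer")
    it = iter(rows)
    while True:
        chunk = list(islice(it, chunksize))
        if not chunk:
            return
        yield chunk


def iter_excel_chunks(path, sheet_name, chunksize=1000):
    """Hypothetical helper: yield one sheet as DataFrames of `chunksize` rows.

    Assumes the first row of the sheet holds the column names.
    """
    # Imported lazily so iter_chunks stays usable without these packages.
    import openpyxl
    import pandas as pd

    # read_only=True asks openpyxl to read worksheet rows lazily.
    workbook = openpyxl.load_workbook(path, read_only=True)
    try:
        rows = workbook[sheet_name].iter_rows(values_only=True)
        header = next(rows, None)
        if header is None:
            return  # empty sheet, nothing to yield
        for chunk in iter_chunks(rows, chunksize):
            yield pd.DataFrame(chunk, columns=list(header))
    finally:
        workbook.close()
```

With this in place, the loop from the original report could become `for chunk in iter_excel_chunks("test.xlsx", sheet, chunksize=1000): ...`, where each chunk is a regular DataFrame.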