Is your feature request related to a problem?
While using the scipy-notebook
image from Jupyter, which installs pandas
and xlrd
but not openpyxl
(full dependency list here), I attempted to open a .xlsx
file using pd.read_excel
. Code example:
import pandas as pd
df = pd.read_excel('cities.xlsx')
The error message I got was:
ImportError: Missing optional dependency 'openpyxl'. Use pip or conda to install openpyxl.
This was confusing since previously all I needed was xlrd
for this code to work.
When I look in the docs for pd.read_excel
and scroll down through the params to engine
(which I had left as the default in my code), I see that this is because my xlrd
version has been upgraded to greater than 2.0, which means that it only supports .xls
files now, not .xlsx
.
I would like the error message to reflect this more clearly.
Describe the solution you'd like
There is already a ValueError
in the code that would be much clearer:
# Pass through if ext is None, otherwise check if ext valid for xlrd
if ext and ext != "xls" and xlrd_version >= Version("2"):
raise ValueError(
f"Your version of xlrd is {xlrd_version}. In xlrd >= 2.0, "
f"only the xls format is supported. Install openpyxl instead."
)
(Link to source here)
However this is only triggered if the user explicitly specifies engine="xlrd"
, e.g.
import pandas as pd
df = pd.read_excel('cities.xlsx', engine='xlrd')
I propose that the logic be rearranged so that if:
- The user passes in a file with a
.xlsx
extension topd.read_excel
, and - No
engine
param is specified, and xlrd
with version > 2.0 is installed, andopenpyxl
is not installed
then the user will see the value error message explaining that xlrd
> 2.0 only supports .xls
, and that openpyxl
is required for .xlsx
.
In other words, this code:
import pandas as pd
df = pd.read_excel('cities.xlsx')
should produce an error message like this:
Your version of xlrd is {xlrd_version}. In xlrd >= 2.0,
only the xls format is supported. Install openpyxl instead.
rather than an error message like this:
ImportError: Missing optional dependency 'openpyxl'. Use pip or conda to install openpyxl.
API breaking implications
I'm not sure if it technically is part of the "API", but this changes what type of error is thrown from an ImportError
to a ValueError
in a specific circumstance.
Describe alternatives you've considered
Another alternative might be to move the documentation about xlrd
and openpxl
out from under the engine
param. If that were moved to the top or bottom of the pd.read_excel
page (rather than in the middle like it is now), that would make it easier to understand what's happening with an error like this.
Specifically this section of the documentation could be moved:
.. versionchanged:: 1.2.0
The engine `xlrd <https://xlrd.readthedocs.io/en/latest/>`_
now only supports old-style ``.xls`` files.
When ``engine=None``, the following logic will be
used to determine the engine:
- If ``path_or_buffer`` is an OpenDocument format (.odf, .ods, .odt),
then `odf <https://pypi.org/project/odfpy/>`_ will be used.
- Otherwise if ``path_or_buffer`` is an xls format,
``xlrd`` will be used.
- Otherwise if ``path_or_buffer`` is in xlsb format,
``pyxlsb`` will be used.
.. versionadded:: 1.3.0
- Otherwise ``openpyxl`` will be used.
.. versionchanged:: 1.3.0
(Link to source here)
I still think changing the error message would be better and clearer.
Additional context
The current error message seems to assume that the user has the ability to install new dependencies on the system, and that the user does not have the ability to convert the file type they are working with. I think that the opposite is often true. In my specific situation, I am not actually using any of the functionality that is specific to a .xlsx
file, so the easiest workaround is to convert the file to .xls
. But the ImportError
implies that the best approach is to install additional dependencies, which would require waiting for updates to be made on the Jupyter project.
If there is support for this feature request I'm happy to put together a PR!
Comment From: lithomas1
Contributions very welcome for this.
Comment From: phofl
This was clarified in the docs since the issue was opened, so closing here. You should not use xlrd for xlsx files at all now.