Is your feature request related to a problem?

While using the scipy-notebook image from Jupyter, which installs pandas and xlrd but not openpyxl (full dependency list here), I attempted to open a .xlsx file using pd.read_excel. Code example:

import pandas as pd
df = pd.read_excel('cities.xlsx')

The error message I got was:

ImportError: Missing optional dependency 'openpyxl'. Use pip or conda to install openpyxl.

This was confusing since previously all I needed was xlrd for this code to work.

When I look in the docs for pd.read_excel and scroll down through the params to engine (which I had left as the default in my code), I see that this is because my xlrd version has been upgraded to greater than 2.0, which means that it only supports .xls files now, not .xlsx.

I would like the error message to reflect this more clearly.

Describe the solution you'd like

There is already a ValueError in the code that would be much clearer:

# Pass through if ext is None, otherwise check if ext valid for xlrd
if ext and ext != "xls" and xlrd_version >= Version("2"):
    raise ValueError(
        f"Your version of xlrd is {xlrd_version}. In xlrd >= 2.0, "
        f"only the xls format is supported. Install openpyxl instead."
    )

(Link to source here)

However this is only triggered if the user explicitly specifies engine="xlrd", e.g.

import pandas as pd
df = pd.read_excel('cities.xlsx', engine='xlrd')

I propose that the logic be rearranged so that if:

  1. The user passes in a file with a .xlsx extension to pd.read_excel, and
  2. No engine param is specified, and
  3. xlrd with version > 2.0 is installed, and
  4. openpyxl is not installed

then the user will see the value error message explaining that xlrd > 2.0 only supports .xls, and that openpyxl is required for .xlsx.

In other words, this code:

import pandas as pd
df = pd.read_excel('cities.xlsx')

should produce an error message like this:

Your version of xlrd is {xlrd_version}. In xlrd >= 2.0,
only the xls format is supported. Install openpyxl instead.

rather than an error message like this:

ImportError: Missing optional dependency 'openpyxl'. Use pip or conda to install openpyxl.

API breaking implications

I'm not sure if it technically is part of the "API", but this changes what type of error is thrown from an ImportError to a ValueError in a specific circumstance.

Describe alternatives you've considered

Another alternative might be to move the documentation about xlrd and openpxl out from under the engine param. If that were moved to the top or bottom of the pd.read_excel page (rather than in the middle like it is now), that would make it easier to understand what's happening with an error like this.

Specifically this section of the documentation could be moved:

    .. versionchanged:: 1.2.0
        The engine `xlrd <https://xlrd.readthedocs.io/en/latest/>`_
        now only supports old-style ``.xls`` files.
        When ``engine=None``, the following logic will be
        used to determine the engine:
       - If ``path_or_buffer`` is an OpenDocument format (.odf, .ods, .odt),
         then `odf <https://pypi.org/project/odfpy/>`_ will be used.
       - Otherwise if ``path_or_buffer`` is an xls format,
         ``xlrd`` will be used.
       - Otherwise if ``path_or_buffer`` is in xlsb format,
         ``pyxlsb`` will be used.
         .. versionadded:: 1.3.0
       - Otherwise ``openpyxl`` will be used.
         .. versionchanged:: 1.3.0

(Link to source here)

I still think changing the error message would be better and clearer.

Additional context

The current error message seems to assume that the user has the ability to install new dependencies on the system, and that the user does not have the ability to convert the file type they are working with. I think that the opposite is often true. In my specific situation, I am not actually using any of the functionality that is specific to a .xlsx file, so the easiest workaround is to convert the file to .xls. But the ImportError implies that the best approach is to install additional dependencies, which would require waiting for updates to be made on the Jupyter project.

If there is support for this feature request I'm happy to put together a PR!

Comment From: lithomas1

Contributions very welcome for this.

Comment From: phofl

This was clarified in the docs since the issue was opened, so closing here. You should not use xlrd for xlsx files at all now.