Pandas read_excel crashes python for certain files

import pandas
pandas.read_excel('pandas_crash.xlsx')

pandas_crash.xlsx crashes the Python process.

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US
LOCALE: None.None
pandas: 0.23.4
pytest: 3.6.3
pip: 18.1
setuptools: 39.1.0
Cython: 0.28.3
numpy: 1.15.4
scipy: 1.0.1
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.7.2
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.9dev0
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.10
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: 4.1.1
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.2.2
pymysql: 0.8.0
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Comment From: TomAugspurger

Can you reproduce the crash using just the underlying engine (openpyxl or xlrd)?

Can you post the traceback?

Comment From: Dobatymo

There is no traceback, as no exception is thrown; the process simply crashes. With plain xlrd or openpyxl, however, I can read the file.

path = "pandas_crash.xlsx"

from openpyxl import load_workbook

sheet = load_workbook(path).active

for i in range(1, sheet.max_row+1):
    for j in range(1, sheet.max_column+1):
        print(repr(sheet.cell(row=i,column=j).value))

import xlrd

sheet = xlrd.open_workbook(path).sheet_by_index(0)

for i in range(0, sheet.nrows):
    for j in range(0, sheet.ncols):
        print(repr(sheet.cell(i, j).value))

Output:

'Column1'
'_xDC88_'
'Column1'
'\udc88'

The output of openpyxl is not correct; it seems it cannot handle the single surrogate. The xlrd output is correct.
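
To make the difference between the two outputs concrete: `'\udc88'` is a lone low surrogate, which a Python `str` can hold but which cannot be encoded as strict UTF-8, and `_xDC88_` looks like the `_xHHHH_` escape that the OOXML format uses for characters that cannot be stored directly in XML. A minimal sketch, assuming that reading of the openpyxl output is right; `decode_ooxml_escapes` and `is_strict_utf8` are hypothetical helpers written for illustration, not part of openpyxl or xlrd:

```python
import re

# Matches OOXML-style escapes such as "_xDC88_" (exactly four hex digits).
_ESCAPE = re.compile(r"_x([0-9A-Fa-f]{4})_")


def decode_ooxml_escapes(text):
    """Hypothetical helper: turn "_xHHHH_" escapes back into code points."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


def is_strict_utf8(text):
    """Return True if the string can be encoded as strict UTF-8."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


raw = "_xDC88_"                      # the value openpyxl printed
decoded = decode_ooxml_escapes(raw)  # the value xlrd printed

print(repr(decoded))
print(is_strict_utf8(decoded))       # a lone surrogate is not encodable
print(decoded.encode("utf-8", "surrogatepass"))
```

The `surrogatepass` error handler is the standard way to round-trip such a string through bytes without losing the surrogate.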

Comment From: TomAugspurger

> There is no traceback as no exception is thrown. The process simply crashes.

Strange.

> xlrd is correct.

Does pd.read_excel with engine='xlrd' work?

Comment From: WillAyd

Not an expert in this domain, but what encoding is supposed to be represented here? I believe modern Excel files use UTF-16 encoding internally, but doesn't that surrogate fall outside of the high surrogate range for that encoding?

Comment From: Dobatymo

The string is a single surrogate, which is not valid Unicode. However, xlrd behaves correctly and passes the string through to Python unmodified. I encountered this type of problem when exporting .xlsx files from SQL Server. It's possible they contain invalid Unicode strings.

read_excel(path, engine="xlrd")

fails as well.
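
Since plain xlrd can read the cells, one possible workaround is to read them directly and replace lone surrogates before the values reach pandas. This is an untested sketch against the crashing file: `replace_lone_surrogates` is a hypothetical helper, and whether cleaned values actually avoid the crash is an assumption.

```python
def replace_lone_surrogates(value, placeholder="?"):
    """Replace surrogate code points in str values; pass other types through.

    In a Python 3 str, any code point in U+D800..U+DFFF is an unpaired
    surrogate, because valid pairs are stored as a single code point.
    """
    if not isinstance(value, str):
        return value
    return "".join(
        placeholder if 0xD800 <= ord(ch) <= 0xDFFF else ch for ch in value
    )


# Cell values as xlrd returned them for the file in this report.
cells = [["Column1"], ["\udc88"]]
cleaned = [[replace_lone_surrogates(c) for c in row] for row in cells]
print(cleaned)
```

The cleaned rows could then be handed to the `pandas.DataFrame` constructor instead of going through `read_excel`.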

I can verify that the source of the crash is from Pandas. Debugging with Visual Studio yields:

lib.cp36-win_amd64.pyd!00007ffd4eef114e()
lib.cp36-win_amd64.pyd!00007ffd4eef1400()
lib.cp36-win_amd64.pyd!00007ffd4eecf2bc()
lib.cp36-win_amd64.pyd!00007ffd4eed1428()
python36.dll!0000000064b2c902()
python36.dll!0000000064b2be83()
python36.dll!0000000064b3ca49()
python36.dll!0000000064b2cbf4()
python36.dll!0000000064b2c2fa()
python36.dll!0000000064b3ca49()
python36.dll!0000000064b2cbf4()
python36.dll!0000000064b2c2fa()
python36.dll!0000000064b3ca49()
python36.dll!0000000064b2cbf4()
python36.dll!0000000064b2c2fa()
python36.dll!0000000064b3ca49()
python36.dll!0000000064b2cbf4()
python36.dll!0000000064b2c2fa()
python36.dll!0000000064b3ca49()
python36.dll!0000000064b2cbf4()
python36.dll!0000000064b2c2fa()
python36.dll!0000000064b3e34e()
python36.dll!0000000064b2cbf4()
python36.dll!0000000064b2b940()
python36.dll!0000000064b2b725()
python36.dll!0000000064b3f29f()
python36.dll!0000000064b2cbf4()
python36.dll!0000000064b2b940()
python36.dll!0000000064b2b725()
python36.dll!0000000064b3f29f()
python36.dll!0000000064b2cbf4()
python36.dll!0000000064b2485d()
python36.dll!0000000064b3f29f()
python36.dll!0000000064b2cbf4()
python36.dll!0000000064b2485d()
python36.dll!0000000064b3f29f()
python36.dll!0000000064b2cbf4()
python36.dll!0000000064b2c2fa()
python36.dll!0000000064b3e34e()
python36.dll!0000000064b2cbf4()
python36.dll!0000000064b512cf()
python36.dll!0000000064b5122d()
python36.dll!0000000064b511d7()
python36.dll!0000000064cba819()
python36.dll!0000000064cbafb9()
python36.dll!0000000064cba6f7()
python36.dll!0000000064c0a9f4()
python36.dll!0000000064b944f2()
python.exe!000000001c70126d()
kernel32.dll!00007ffd9a868102()
ntdll.dll!00007ffd9af9c5b4()

which is not terribly helpful, but at least we can be sure the crash is in pandas\_libs\lib.cp36-win_amd64.pyd
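
Because the native stack has no symbols, Python's standard `faulthandler` module may give a more readable picture: when enabled, it dumps the Python-level traceback to stderr if the interpreter dies from a fatal error. A sketch, assuming the crash is the kind of fault `faulthandler` can intercept; `run_isolated` is a hypothetical helper, and the commented-out call reuses the file name from this report:

```python
import subprocess
import sys


def run_isolated(snippet, timeout=60):
    """Run a Python snippet in a child process with faulthandler enabled.

    Returns (returncode, stderr). A crash in the child cannot take down
    the parent, and stderr carries any traceback faulthandler dumped.
    """
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", snippet],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stderr


# Harmless snippet, just to show the mechanics.
code, err = run_isolated("import faulthandler; assert faulthandler.is_enabled()")
print(code)

# For this report one would run something like:
# code, err = run_isolated(
#     "import pandas; pandas.read_excel('pandas_crash.xlsx')"
# )
# A nonzero code together with a "Fatal Python error" dump in err would
# point at the Python frame that was active when the process died.
```

Running the read in a child process also keeps an interactive session or CI job alive long enough to log the exit code.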

Comment From: WillAyd

OK, thanks. I think we generally have a few issues with handling surrogates in the parsers (you can search the issues for similar ones). Not sure if there's a way to handle this gracefully while supporting Python 2, but we would in any case welcome investigation and PRs.

FYI, we are officially dropping Python 2 support at the start of 2019, so compatibility won't be as much of an issue soon.

Comment From: corimnally

Is this resolved? I get the same issue, where it crashes with no error thrown when calling pd.read_excel() with the path to an .xlsx file. I'm using Python 3.10 and pandas 1.5.3. It works fine locally, but fails when it is run by a GitHub Actions workflow.
