import pandas
pandas.read_excel('pandas_crash.xlsx')
pandas_crash.xlsx crashes the Python process.
Comment From: TomAugspurger
Can you reproduce the crash using just the underlying engine (openpyxl or xlrd)?
Can you post the traceback?
Comment From: Dobatymo
There is no traceback because no exception is thrown; the process simply crashes. With plain xlrd or openpyxl, however, I can read the file.
path = "pandas_crash.xlsx"

# Read every cell with openpyxl (row and column indices are 1-based)
from openpyxl import load_workbook
sheet = load_workbook(path).active
for i in range(1, sheet.max_row + 1):
    for j in range(1, sheet.max_column + 1):
        print(repr(sheet.cell(row=i, column=j).value))

# Read every cell with xlrd (row and column indices are 0-based)
import xlrd
sheet = xlrd.open_workbook(path).sheet_by_index(0)
for i in range(sheet.nrows):
    for j in range(sheet.ncols):
        print(repr(sheet.cell(i, j).value))
Output:
'Column1'
'_xDC88_'
'Column1'
'\udc88'
The output of openpyxl is not correct; it seems it cannot handle the single surrogate. The output of xlrd is correct.
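For context, the openpyxl value looks like the `_xHHHH_` escape that OOXML files use for characters that are not XML-safe, left undecoded. The following is only a sketch of how such an escape maps back to a code point; `unescape_ooxml` is a hypothetical helper written for illustration, not a pandas or openpyxl function, and it ignores edge cases such as escaped underscores.

```python
import re

# Assumption: the cell text uses the OOXML convention _xHHHH_ (four hex digits)
# for characters that cannot be written directly in XML.
_ESCAPE = re.compile(r"_x([0-9A-Fa-f]{4})_")


def unescape_ooxml(text):
    """Replace each _xHHHH_ escape with the code point it names (hypothetical helper)."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


decoded = unescape_ooxml("_xDC88_")
print(repr(decoded))  # '\udc88', the same value xlrd returns
```

Under that assumption, both engines are reading the same lone surrogate; they differ only in whether the escape is decoded.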
Comment From: TomAugspurger
> There is no traceback as no exception is thrown. The process simply crashes.
Strange.
> xlrd is correct.
Does pd.read_excel with engine='xlrd' work?
Comment From: WillAyd
Not an expert in this domain, but what encoding is supposed to be represented here? I believe modern Excel files use UTF-16 encoding internally, but doesn't that surrogate fall outside of the high surrogate range for that encoding?
Comment From: Dobatymo
The string is a single surrogate, which is not valid Unicode. However, xlrd behaves correctly and passes the string through to Python unmodified. I encountered this type of problem when exporting .xlsx files from SQL Server. It's possible they contain invalid Unicode strings.
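To make the surrogate point concrete, here is a small sketch using only the standard library. The helper names `surrogate_kind` and `is_encodable` are made up for this illustration. It shows that U+DC88 sits in the low surrogate range (U+DC00 to U+DFFF) and that Python refuses to encode it as UTF-8 when it appears on its own.

```python
def surrogate_kind(ch):
    """Return 'high', 'low', or None for a single character (hypothetical helper)."""
    cp = ord(ch)
    if 0xD800 <= cp <= 0xDBFF:
        return "high"
    if 0xDC00 <= cp <= 0xDFFF:
        return "low"
    return None


def is_encodable(text, encoding="utf-8"):
    """Report whether text can be encoded with the default strict error handler."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


print(surrogate_kind("\udc88"))  # low
print(is_encodable("\udc88"))    # False
print(is_encodable("Column1"))   # True
```

Any code path that encodes such a string without checking for failure is a plausible place for things to go wrong, although this sketch does not show where pandas actually crashes.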
read_excel(path, engine="xlrd") fails as well.
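Since plain xlrd and openpyxl can read the file, one possible workaround is to read the cells with the engine directly and replace surrogate code points before the values reach pandas. This is an untested sketch, not a confirmed fix: `clean_cell` and `clean_rows` are hypothetical helpers, and replacing with U+FFFD is an arbitrary choice that discards the original value.

```python
import re

# In a Python 3 str, any code point in U+D800..U+DFFF is an unpaired surrogate.
_LONE_SURROGATE = re.compile(r"[\ud800-\udfff]")


def clean_cell(value, replacement="\ufffd"):
    """Replace surrogate code points in string cells; leave other types untouched."""
    if isinstance(value, str):
        return _LONE_SURROGATE.sub(replacement, value)
    return value


def clean_rows(rows):
    """Apply clean_cell to every value in a list of rows."""
    return [[clean_cell(value) for value in row] for row in rows]


# Stand-in for the values read from pandas_crash.xlsx with xlrd.
rows = [["Column1"], ["\udc88"]]
cleaned = clean_rows(rows)
print(ascii(cleaned))
```

The cleaned rows could then be handed to the pandas.DataFrame constructor, on the assumption that the crash is triggered only by the surrogate itself.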
I can verify that the source of the crash is from Pandas. Debugging with Visual Studio yields:
lib.cp36-win_amd64.pyd!00007ffd4eef114e()
lib.cp36-win_amd64.pyd!00007ffd4eef1400()
lib.cp36-win_amd64.pyd!00007ffd4eecf2bc()
lib.cp36-win_amd64.pyd!00007ffd4eed1428()
python36.dll!0000000064b2c902()
python36.dll!0000000064b2be83()
python36.dll!0000000064b3ca49()
python36.dll!0000000064b2cbf4()
python36.dll!0000000064b2c2fa()
python36.dll!0000000064b3ca49()
python36.dll!0000000064b2cbf4()
python36.dll!0000000064b2c2fa()
python36.dll!0000000064b3ca49()
python36.dll!0000000064b2cbf4()
python36.dll!0000000064b2c2fa()
python36.dll!0000000064b3ca49()
python36.dll!0000000064b2cbf4()
python36.dll!0000000064b2c2fa()
python36.dll!0000000064b3ca49()
python36.dll!0000000064b2cbf4()
python36.dll!0000000064b2c2fa()
python36.dll!0000000064b3e34e()
python36.dll!0000000064b2cbf4()
python36.dll!0000000064b2b940()
python36.dll!0000000064b2b725()
python36.dll!0000000064b3f29f()
python36.dll!0000000064b2cbf4()
python36.dll!0000000064b2b940()
python36.dll!0000000064b2b725()
python36.dll!0000000064b3f29f()
python36.dll!0000000064b2cbf4()
python36.dll!0000000064b2485d()
python36.dll!0000000064b3f29f()
python36.dll!0000000064b2cbf4()
python36.dll!0000000064b2485d()
python36.dll!0000000064b3f29f()
python36.dll!0000000064b2cbf4()
python36.dll!0000000064b2c2fa()
python36.dll!0000000064b3e34e()
python36.dll!0000000064b2cbf4()
python36.dll!0000000064b512cf()
python36.dll!0000000064b5122d()
python36.dll!0000000064b511d7()
python36.dll!0000000064cba819()
python36.dll!0000000064cbafb9()
python36.dll!0000000064cba6f7()
python36.dll!0000000064c0a9f4()
python36.dll!0000000064b944f2()
python.exe!000000001c70126d()
kernel32.dll!00007ffd9a868102()
ntdll.dll!00007ffd9af9c5b4()
which is not terribly helpful, but at least we can be sure the crash is in pandas\_libs\lib.cp36-win_amd64.pyd.
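A Python-level traceback may still be obtainable even though no exception is raised. The standard library's faulthandler module can dump the Python stack when the interpreter receives a fatal signal. The sketch below wraps it in a hypothetical helper, `enable_crash_traceback`; whether it yields a useful trace for this particular crash is an assumption.

```python
import faulthandler
import sys


def enable_crash_traceback(stream=None):
    """Ask CPython to dump Python tracebacks for all threads on a fatal error."""
    target = stream if stream is not None else sys.stderr
    faulthandler.enable(file=target, all_threads=True)
    return faulthandler.is_enabled()


# Hypothetical usage, mirroring the original report:
#   enable_crash_traceback()
#   import pandas
#   pandas.read_excel("pandas_crash.xlsx")
```

The same handler can be switched on without code changes by running `python -X faulthandler script.py` or by setting the `PYTHONFAULTHANDLER` environment variable, which may be more convenient in a CI job.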
Comment From: WillAyd
OK, thanks. I think we generally have a few issues with handling surrogates in the parsers (you can search the issues for similar ones). I'm not sure there's a way to handle this gracefully while supporting Python 2, but we would in any case welcome investigation and PRs.
FYI, we are officially dropping Python 2 support at the start of 2019, so compatibility won't be as much of an issue soon.
Comment From: corimnally
Is this resolved? I get the same issue, where it crashes with no error thrown when calling pd.read_excel() with the path to an .xlsx file. I'm using Python 3.10 and pandas 1.5.3. It works fine locally, but fails when it is run by a GitHub Actions workflow.