Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
string = chr(56000)
print(repr(string)) # This is a valid Python string, but can't be printed normally
df = pd.DataFrame({'A': [string]})
df.to_json()
Issue Description
When I run this from the command line, it causes a segmentation fault:
zsh: segmentation fault python main.py
Expected Behavior
It should not segfault. Perhaps it should raise an error saying the value can't be serialized by to_json,
but it shouldn't crash the whole Python runtime (you can't even recover with a try/except).
Installed Versions
Comment From: lithomas1
Hi @naterush,
Thanks for the report. It looks like we're not handling errors correctly within the JSON C code.
For reference, we are calling PyUnicode_AsUTF8AndSize here:
https://github.com/pandas-dev/pandas/blob/1613f26ff0ec75e30828996fd9ec3f9dd5119ca6/pandas/_libs/src/ujson/python/objToJSON.c#L335-L338
and it is throwing
UnicodeEncodeError: 'utf-8' codec can't encode character '\udac0' in position 0: surrogates not allowed
(The error never surfaces because the process segfaults first.)
I'll try to submit a PR for this soon.
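The underlying encoding failure can be observed in pure Python, without going through the JSON C code. This is only a sketch, assuming that str.encode("utf-8") applies the same strict UTF-8 rules as PyUnicode_AsUTF8AndSize; the variable name caught is illustrative.

```python
# A surrogate code point is a valid Python str, but strict UTF-8 cannot encode it.
string = chr(56000)  # same value as in the report, i.e. '\udac0'

caught = None
try:
    string.encode("utf-8")
except UnicodeEncodeError as exc:
    # At the Python level this is an ordinary, catchable exception.
    caught = exc

print(type(caught).__name__, "-", caught.reason)
```

Here the exception is caught normally, which is the kind of recoverable failure the report asks for from to_json.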
Comment From: lithomas1
Looking into this further, it seems like the UnicodeEncodeError is expected.
In your example, string.encode("utf-8") throws the same UnicodeEncodeError, and so does df.squeeze().encode("utf-8"), since a surrogate code point on its own has no valid UTF-8 encoding.
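Until the crash is fixed, one possible precaution is to check values before serializing them. The helpers below are hypothetical (is_utf8_encodable and replace_unencodable are not pandas APIs), and the replacement step is lossy, so treat this as a sketch rather than a recommended fix.

```python
def is_utf8_encodable(value):
    """Return True unless `value` is a str that strict UTF-8 cannot encode."""
    if not isinstance(value, str):
        return True
    try:
        value.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


def replace_unencodable(value):
    """Replace code points that UTF-8 cannot encode with '?' (lossy)."""
    return value.encode("utf-8", errors="replace").decode("utf-8")


bad = chr(56000)  # the string from the report
print(is_utf8_encodable(bad), is_utf8_encodable("plain text"))
print(replace_unencodable("a" + bad + "b"))
```

Something like df["A"].map(is_utf8_encodable) could then flag affected rows before calling to_json.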
Comment From: naterush
@lithomas1 thanks for the quick turnaround, this was epic!