Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [X] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
string = chr(56000)
print(repr(string)) # This is a valid Python string, but can't be printed normally
df = pd.DataFrame({'A': [string]})
df.to_json()

Issue Description

When I run this from the command line, it causes a segmentation fault:

zsh: segmentation fault  python main.py

Expected Behavior

It should not segfault. It could raise an error saying the value can't be serialized by to_json, but it shouldn't crash the whole Python runtime (a crash like this can't even be caught with try/except).

Installed Versions

INSTALLED VERSIONS
------------------
commit : 8dab54d6573f7186ff0c3b6364d5e4dd635ff3e7
python : 3.9.9.final.0
python-bits : 64
OS : Darwin
OS-release : 21.6.0
Version : Darwin Kernel Version 21.6.0: Wed Aug 10 14:28:23 PDT 2022; root:xnu-8020.141.5~2/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.5.2
numpy : 1.23.5
pytz : 2022.6
dateutil : 2.8.2
setuptools : 56.0.0
pip : 21.2.4
Cython : None
pytest : 7.2.0
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 3.0.2
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.7.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None

Comment From: lithomas1

Hi @naterush, thanks for the report. It looks like we're not handling errors correctly within the JSON C code. For reference, we are calling PyUnicode_AsUTF8AndSize here:

https://github.com/pandas-dev/pandas/blob/1613f26ff0ec75e30828996fd9ec3f9dd5119ca6/pandas/_libs/src/ujson/python/objToJSON.c#L335-L338

That call is throwing UnicodeEncodeError: 'utf-8' codec can't encode character '\udac0' in position 0: surrogates not allowed. (The error is never shown because of the segfault.)
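The same encoding failure can be seen without pandas. The snippet below is a minimal sketch in plain Python, not the pandas code path: it assumes that str.encode with the default strict handler applies the same UTF-8 rules as the C call mentioned above. The names strict_ok and raw are illustrative only.

```python
# chr(56000) is U+DAC0, a lone (unpaired) surrogate code point.
string = chr(56000)

try:
    # The default error handler is "strict", which rejects lone surrogates.
    string.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError as exc:
    strict_ok = False
    print(f"strict encode failed: {exc}")

# The "surrogatepass" handler encodes the surrogate anyway. The resulting
# bytes are not valid UTF-8 for a strict decoder.
raw = string.encode("utf-8", errors="surrogatepass")
print(raw)
```

At the Python level the failure is an ordinary exception that try/except can catch, which is the behavior the report asks for from to_json.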

I'll try to submit a PR for this soon.

Comment From: lithomas1

Looking into this further, it seems like the UnicodeEncodeError is expected.

Python's strict UTF-8 codec rejects lone surrogates: in your example, both string.encode("utf-8") and df.squeeze().encode("utf-8") will throw the UnicodeEncodeError.
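Until the C code propagates the error, one possible workaround is to check string values before calling to_json. This is only a sketch under stated assumptions: find_lone_surrogates is a hypothetical helper, not part of pandas, and it treats any str that fails strict UTF-8 encoding as unsafe to serialize.

```python
def find_lone_surrogates(values):
    """Return (index, value) pairs for str values that strict UTF-8 rejects."""
    bad = []
    for i, value in enumerate(values):
        if isinstance(value, str):
            try:
                value.encode("utf-8")
            except UnicodeEncodeError:
                bad.append((i, value))
    return bad


# Example column mixing safe strings, a lone surrogate, and a non-string value.
column = ["ok", chr(56000), 3.5, "caf\u00e9"]
problems = find_lone_surrogates(column)
print(problems)
```

With the DataFrame from the report, the values could be passed as df["A"].tolist(), and to_json skipped or the offending values replaced when the returned list is non-empty.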

Comment From: naterush

@lithomas1 thanks for the quick turnaround, this was epic!