Pandas version checks
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
arr = arr.to_numpy(zero_copy_only=False)
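This is the call inside pandas' StringDtype.__from_arrow__ that raises. A self-contained sketch of that call, with a hypothetical stand-in value (the exact failing bytes are not captured in this report, so this particular input will not actually reproduce the crash):

import pyarrow as pa

# Stand-in value; the real one is only known from the error message
# ("Ruta 204 � Zapote"), whose exact bytes may differ.
arr = pa.array(["Ruta 204 \ufffd Zapote"])
arr = arr.to_numpy(zero_copy_only=False)
print(arr)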
Issue Description
Description: Hi there,
I'm getting this behaviour from the Databricks SQL Connector because of its dependency on pandas, but I thought of posting it here to get ideas for a fix. The exception is raised when I try to convert data containing a UTF-8 encoded string to a NumPy array in pandas and PyArrow. I'm also curious whether there is a specific version of pandas or PyArrow I could use to avoid this issue. Any help would be appreciated!
Versions:
Python: 3.11.9
Databricks SQL Connector: 3.3.0
Code: Below is the code I'm using to fetch data from Databricks; the data contains a UTF-8 encoded character (�).
import os

from databricks import sql

sql_query = """SELECT Column FROM table LIMIT 10"""
host = os.getenv("DATABRICKS_HOST")
http_path = os.getenv("DATABRICKS_HTTP_PATH")
connection = sql.connect(server_hostname=host, http_path=http_path)
with connection.cursor() as cursor:
    cursor.execute(sql_query)
    response = fetchmany(cursor)  # our helper; wraps cursor.fetchmany (see traceback)
Error:
response = fetchmany(cursor)
^^^^^^^^^^^^^^^^^
File "/usr/local/airflow/.local/lib/python3.11/site-packages/my_agent/app_helper/db_helper.py", line 35, in fetchmany
query_results = cursor.fetchmany(fetch_size)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/airflow/.local/lib/python3.11/site-packages/my_agent_connectors/dq_databricks/dependencies/databricks/sql/client.py", line 976, in fetchmany
return self.active_result_set.fetchmany(size)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/airflow/.local/lib/python3.11/site-packages/my_agent_connectors/dq_databricks/dependencies/databricks/sql/client.py", line 1235, in fetchmany
return self._convert_arrow_table(self.fetchmany_arrow(size))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/airflow/.local/lib/python3.11/site-packages/my_agent_connectors/dq_databricks/dependencies/databricks/sql/client.py", line 1161, in _convert_arrow_table
df = table_renamed.to_pandas(
^^^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/array.pxi", line 884, in pyarrow.lib._PandasConvertible.to_pandas
File "pyarrow/table.pxi", line 4192, in pyarrow.lib.Table._to_pandas
File "/usr/local/airflow/.local/lib/python3.11/site-packages/pyarrow/pandas_compat.py", line 776, in table_to_dataframe
blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/airflow/.local/lib/python3.11/site-packages/pyarrow/pandas_compat.py", line 1131, in _table_to_blocks
return [_reconstruct_block(item, columns, extension_columns)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/airflow/.local/lib/python3.11/site-packages/pyarrow/pandas_compat.py", line 1131, in <listcomp>
return [_reconstruct_block(item, columns, extension_columns)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/airflow/.local/lib/python3.11/site-packages/pyarrow/pandas_compat.py", line 736, in _reconstruct_block
pd_ext_arr = pandas_dtype.__from_arrow__(arr)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/airflow/.local/lib/python3.11/site-packages/pandas/core/arrays/string_.py", line 230, in __from_arrow__
arr = arr.to_numpy(zero_copy_only=False)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/array.pxi", line 1581, in pyarrow.lib.Array.to_numpy
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowException: Unknown error: Wrapping Ruta 204 � Zapote failed
Expected Behavior
Converting the result to pandas should parse the UTF-8 encoded characters as well, instead of raising an exception.
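For reference, a minimal sketch of the conversion that is expected to succeed, using a well-formed non-ASCII string as a stand-in; the types_mapper mirrors the StringDtype.__from_arrow__ frame in the traceback, on the assumption that the connector maps Arrow strings to pandas' string dtype:

import pandas as pd
import pyarrow as pa

# Expected: an Arrow string column with non-ASCII UTF-8 characters
# converts to pandas without error.
table = pa.table({"Column": ["ESPAÑA", "Ruta 204 Zapote"]})
df = table.to_pandas(types_mapper={pa.string(): pd.StringDtype()}.get)
print(df["Column"])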
Installed Versions
INSTALLED VERSIONS
commit : 8dab54d6573f7186ff0c3b6364d5e4dd635ff3e7
python : 3.11.9.final.0
python-bits : 64
OS : Darwin
OS-release : 23.6.0
Version : Darwin Kernel Version 23.6.0: Mon Jul 29 21:14:04 PDT 2024; root:xnu-10063.141.2~1/RELEASE_ARM64_T8122
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.5.2
numpy : 1.23.4
pytz : 2022.1
dateutil : 2.9.0.post0
setuptools : 65.5.0
pip : 24.0
Cython : 3.0.10
pytest : 7.4.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.9
jinja2 : 3.1.4
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.3
pandas_gbq : None
pyarrow : 16.1.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : 1.4.52
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
Comment From: rhshadrach
Thanks for the report. Are you able to get the exact string that is giving rise to this? The error message above has Ruta 204 � but I am wondering if that last character is being reported correctly.
You report you have pandas 1.5.2, but the stacktrace contains:
site-packages/pandas/core/arrays/string_.py", line 230, in __from_arrow__
Line 230 is not part of the function __from_arrow__ in that version:
https://github.com/pandas-dev/pandas/blob/v1.5.2/pandas/core/arrays/string_.py
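One way to confirm which pandas is actually being imported, and which lines __from_arrow__ spans in the installed copy (a sketch, not from the original report):

import inspect

import pandas
from pandas.core.arrays.string_ import StringDtype

print(pandas.__version__)
print(inspect.getsourcefile(StringDtype.__from_arrow__))

# Report which lines of string_.py the installed __from_arrow__ spans,
# so a traceback line number can be checked against it.
lines, start = inspect.getsourcelines(StringDtype.__from_arrow__)
print(f"__from_arrow__ spans lines {start}-{start + len(lines) - 1}")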
Comment From: Gobi2511
Hi @rhshadrach, my mistake. We are using the versions below:
databricks-sql-connector==3.3.0
pandas==2.1.4
pyarrow==16.1.0
Comment From: Gobi2511
cc @rhshadrach. Also, with pandas==1.5.2 I'm getting the error at the location below:
Traceback (most recent call last):
File "/usr/local/lib/python3.11/site-packages/my_agent/connections/base_connector.py", line 120, in execute_query
response = fetchmany(cursor, limit)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/my_agent/app_helper/db_helper.py", line 35, in fetchmany
query_results = cursor.fetchmany(fetch_size)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/my_agent_connectors/dq_databricks/dependencies/databricks/sql/client.py", line 976, in fetchmany
return self.active_result_set.fetchmany(size)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/my_agent_connectors/dq_databricks/dependencies/databricks/sql/client.py", line 1235, in fetchmany
return self._convert_arrow_table(self.fetchmany_arrow(size))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/my_agent_connectors/dq_databricks/dependencies/databricks/sql/client.py", line 1161, in _convert_arrow_table
df = table_renamed.to_pandas(
^^^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/array.pxi", line 884, in pyarrow.lib._PandasConvertible.to_pandas
File "pyarrow/table.pxi", line 4192, in pyarrow.lib.Table._to_pandas
File "/usr/local/lib/python3.11/site-packages/pyarrow/pandas_compat.py", line 774, in table_to_blockmanager
blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/pyarrow/pandas_compat.py", line 1124, in _table_to_blocks
return [_reconstruct_block(item, columns, extension_columns)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/pyarrow/pandas_compat.py", line 1124, in <listcomp>
return [_reconstruct_block(item, columns, extension_columns)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/pyarrow/pandas_compat.py", line 736, in _reconstruct_block
pd_ext_arr = pandas_dtype.__from_arrow__(arr)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/pandas/core/arrays/string_.py", line 202, in __from_arrow__
str_arr = StringArray._from_sequence(np.array(arr))
^^^^^^^^^^^^^
File "pyarrow/array.pxi", line 1531, in pyarrow.lib.Array.__array__
File "pyarrow/array.pxi", line 1581, in pyarrow.lib.Array.to_numpy
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowException: Unknown error: Wrapping ESPA�A failed
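For context, this older pandas code path converts via np.array(arr) rather than calling arr.to_numpy(zero_copy_only=False) directly; a standalone sketch with a hypothetical stand-in value (well-formed, so it will not actually raise):

import numpy as np
import pyarrow as pa

# pandas 1.5.2's StringDtype.__from_arrow__ calls np.array(arr), which
# dispatches to Array.__array__ and then Array.to_numpy -- the same
# frames shown in the traceback above.
arr = pa.array(["ESPAÑA"])  # stand-in; the real value prints as ESPA�A
np_arr = np.array(arr)
print(np_arr)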
Comment From: rhshadrach
Can you also check the version of PyArrow you are using? It appears to me that something akin to the following is failing:
pa.array(["ESPAÑA"]).to_numpy(zero_copy_only=False)
But this works on pyarrow==16.1.0 for me. It is also likely not exactly the string that your job is failing on. Can you get the string some other way?
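One way to capture the exact value is to bypass the pandas conversion that raises: a sketch, assuming the cursor from the snippet above, that the connector's cursor exposes fetchmany_arrow (the method visible in the traceback), and Column as the hypothetical column name:

import pyarrow as pa

# Fetch the raw Arrow table directly, skipping to_pandas().
tbl = cursor.fetchmany_arrow(10)

# View the string column as raw bytes, so even malformed UTF-8 can be
# printed and shared verbatim.
raw = tbl.column("Column").cast(pa.binary())
for value in raw:
    print(value.as_py())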