Pandas version checks
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
arr = arr.to_numpy(zero_copy_only=False)
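This is the call inside pandas' StringDtype.__from_arrow__ that raises. A self-contained sketch of that call, with a hypothetical stand-in value (the exact failing bytes are not captured in this report, so this particular input will not actually reproduce the crash):

import pyarrow as pa

# Stand-in value; the real one is only known from the error message
# ("Ruta 204 � Zapote"), whose exact bytes may differ.
arr = pa.array(["Ruta 204 \ufffd Zapote"])
arr = arr.to_numpy(zero_copy_only=False)
print(arr)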
Issue Description
Description: Hi there,
I'm getting this behaviour from the Databricks SQL Connector because of its dependency on pandas, but I thought of posting it here to get ideas for a fix. The exception is raised when I try to convert data containing a UTF-8 encoded string to a NumPy array in pandas and PyArrow. I'm also curious whether there is a specific version of pandas or PyArrow I could use to avoid this issue. Any help would be appreciated!
Versions:
Python: 3.11.9
Databricks SQL Connector: 3.3.0
Code: Below is the code I'm using to fetch data from Databricks; the data contains a UTF-8 encoded character (�).
import os

from databricks import sql

sql_query = """SELECT Column FROM table LIMIT 10"""
host = os.getenv("DATABRICKS_HOST")
http_path = os.getenv("DATABRICKS_HTTP_PATH")
connection = sql.connect(server_hostname=host, http_path=http_path)
with connection.cursor() as cursor:
    cursor.execute(sql_query)
    response = fetchmany(cursor)  # our helper; wraps cursor.fetchmany (see traceback)
Error:
response = fetchmany(cursor)
^^^^^^^^^^^^^^^^^
File "/usr/local/airflow/.local/lib/python3.11/site-packages/my_agent/app_helper/db_helper.py", line 35, in fetchmany
query_results = cursor.fetchmany(fetch_size)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/airflow/.local/lib/python3.11/site-packages/my_agent_connectors/dq_databricks/dependencies/databricks/sql/client.py", line 976, in fetchmany
return self.active_result_set.fetchmany(size)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/airflow/.local/lib/python3.11/site-packages/my_agent_connectors/dq_databricks/dependencies/databricks/sql/client.py", line 1235, in fetchmany
return self._convert_arrow_table(self.fetchmany_arrow(size))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/airflow/.local/lib/python3.11/site-packages/my_agent_connectors/dq_databricks/dependencies/databricks/sql/client.py", line 1161, in _convert_arrow_table
df = table_renamed.to_pandas(
^^^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/array.pxi", line 884, in pyarrow.lib._PandasConvertible.to_pandas
File "pyarrow/table.pxi", line 4192, in pyarrow.lib.Table._to_pandas
File "/usr/local/airflow/.local/lib/python3.11/site-packages/pyarrow/pandas_compat.py", line 776, in table_to_dataframe
blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/airflow/.local/lib/python3.11/site-packages/pyarrow/pandas_compat.py", line 1131, in _table_to_blocks
return [_reconstruct_block(item, columns, extension_columns)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/airflow/.local/lib/python3.11/site-packages/pyarrow/pandas_compat.py", line 1131, in <listcomp>
return [_reconstruct_block(item, columns, extension_columns)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/airflow/.local/lib/python3.11/site-packages/pyarrow/pandas_compat.py", line 736, in _reconstruct_block
pd_ext_arr = pandas_dtype.__from_arrow__(arr)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/airflow/.local/lib/python3.11/site-packages/pandas/core/arrays/string_.py", line 230, in __from_arrow__
arr = arr.to_numpy(zero_copy_only=False)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/array.pxi", line 1581, in pyarrow.lib.Array.to_numpy
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowException: Unknown error: Wrapping Ruta 204 � Zapote failed
Expected Behavior
Converting the result to pandas should parse the UTF-8 encoded characters as well, instead of raising an exception.
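For reference, a minimal sketch of the conversion that is expected to succeed, using a well-formed non-ASCII string as a stand-in; the types_mapper mirrors the StringDtype.__from_arrow__ frame in the traceback, on the assumption that the connector maps Arrow strings to pandas' string dtype:

import pandas as pd
import pyarrow as pa

# Expected: an Arrow string column with non-ASCII UTF-8 characters
# converts to pandas without error.
table = pa.table({"Column": ["ESPAÑA", "Ruta 204 Zapote"]})
df = table.to_pandas(types_mapper={pa.string(): pd.StringDtype()}.get)
print(df["Column"])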
Installed Versions
INSTALLED VERSIONS
commit : 8dab54d6573f7186ff0c3b6364d5e4dd635ff3e7
python : 3.11.9.final.0
python-bits : 64
OS : Darwin
OS-release : 23.6.0
Version : Darwin Kernel Version 23.6.0: Mon Jul 29 21:14:04 PDT 2024; root:xnu-10063.141.2~1/RELEASE_ARM64_T8122
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.5.2
numpy : 1.23.4
pytz : 2022.1
dateutil : 2.9.0.post0
setuptools : 65.5.0
pip : 24.0
Cython : 3.0.10
pytest : 7.4.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.9
jinja2 : 3.1.4
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.3
pandas_gbq : None
pyarrow : 16.1.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : 1.4.52
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
Comment From: rhshadrach
Thanks for the report. Are you able to get the exact string that is giving rise to this? The error message above has Ruta 204 � but I am wondering if that last character is being reported correctly.
You report you have pandas 1.5.2, but the stacktrace contains:
site-packages/pandas/core/arrays/string_.py", line 230, in __from_arrow__
Line 230 is not part of the function __from_arrow__ in that version:
https://github.com/pandas-dev/pandas/blob/v1.5.2/pandas/core/arrays/string_.py
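One way to confirm which pandas is actually being imported, and which lines __from_arrow__ spans in the installed copy (a sketch, not from the original report):

import inspect

import pandas
from pandas.core.arrays.string_ import StringDtype

print(pandas.__version__)
print(inspect.getsourcefile(StringDtype.__from_arrow__))

# Report which lines of string_.py the installed __from_arrow__ spans,
# so a traceback line number can be checked against it.
lines, start = inspect.getsourcelines(StringDtype.__from_arrow__)
print(f"__from_arrow__ spans lines {start}-{start + len(lines) - 1}")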
Comment From: Gobi2511
Hi @rhshadrach, my mistake. We are using the versions below:
databricks-sql-connector==3.3.0
pandas==2.1.4
pyarrow==16.1.0
Comment From: Gobi2511
cc @rhshadrach. Also, with pandas==1.5.2 I'm getting the error at the location below:
Traceback (most recent call last):
File "/usr/local/lib/python3.11/site-packages/my_agent/connections/base_connector.py", line 120, in execute_query
response = fetchmany(cursor, limit)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/my_agent/app_helper/db_helper.py", line 35, in fetchmany
query_results = cursor.fetchmany(fetch_size)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/my_agent_connectors/dq_databricks/dependencies/databricks/sql/client.py", line 976, in fetchmany
return self.active_result_set.fetchmany(size)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/my_agent_connectors/dq_databricks/dependencies/databricks/sql/client.py", line 1235, in fetchmany
return self._convert_arrow_table(self.fetchmany_arrow(size))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/my_agent_connectors/dq_databricks/dependencies/databricks/sql/client.py", line 1161, in _convert_arrow_table
df = table_renamed.to_pandas(
^^^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/array.pxi", line 884, in pyarrow.lib._PandasConvertible.to_pandas
File "pyarrow/table.pxi", line 4192, in pyarrow.lib.Table._to_pandas
File "/usr/local/lib/python3.11/site-packages/pyarrow/pandas_compat.py", line 774, in table_to_blockmanager
blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/pyarrow/pandas_compat.py", line 1124, in _table_to_blocks
return [_reconstruct_block(item, columns, extension_columns)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/pyarrow/pandas_compat.py", line 1124, in <listcomp>
return [_reconstruct_block(item, columns, extension_columns)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/pyarrow/pandas_compat.py", line 736, in _reconstruct_block
pd_ext_arr = pandas_dtype.__from_arrow__(arr)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/pandas/core/arrays/string_.py", line 202, in __from_arrow__
str_arr = StringArray._from_sequence(np.array(arr))
^^^^^^^^^^^^^
File "pyarrow/array.pxi", line 1531, in pyarrow.lib.Array.__array__
File "pyarrow/array.pxi", line 1581, in pyarrow.lib.Array.to_numpy
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowException: Unknown error: Wrapping ESPA�A failed
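For context, this older pandas code path converts via np.array(arr) rather than calling arr.to_numpy(zero_copy_only=False) directly; a standalone sketch with a hypothetical stand-in value (well-formed, so it will not actually raise):

import numpy as np
import pyarrow as pa

# pandas 1.5.2's StringDtype.__from_arrow__ calls np.array(arr), which
# dispatches to Array.__array__ and then Array.to_numpy -- the same
# frames shown in the traceback above.
arr = pa.array(["ESPAÑA"])  # stand-in; the real value prints as ESPA�A
np_arr = np.array(arr)
print(np_arr)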
Comment From: rhshadrach
Can you also check the version of PyArrow you are using? It appears to me that something akin to the following is failing:
pa.array(["ESPAÑA"]).to_numpy(zero_copy_only=False)
But this works on pyarrow==16.1.0 for me. It is also likely not exactly the string that your job is failing on. Can you get the string some other way?
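One way to capture the exact value is to bypass the pandas conversion that raises: a sketch, assuming the cursor from the snippet above, that the connector's cursor exposes fetchmany_arrow (the method visible in the traceback), and Column as the hypothetical column name:

import pyarrow as pa

# Fetch the raw Arrow table directly, skipping to_pandas().
tbl = cursor.fetchmany_arrow(10)

# View the string column as raw bytes, so even malformed UTF-8 can be
# printed and shared verbatim.
raw = tbl.column("Column").cast(pa.binary())
for value in raw:
    print(value.as_py())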