Pandas version checks
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
```python
import numpy as np
import pyarrow as pa
import cudf
import pandas as pd

In [19]: data = [{"text": "hello", "list_col": np.asarray([1, 2], dtype="uint32")}]

In [20]: df = cudf.DataFrame(data)

In [21]: arrow_types_pdf = df.to_pandas(arrow_type=True)

In [22]: arrow_types_pdf
Out[22]:
    text list_col
0  hello   [1 2]

In [23]: arrow_types_pdf.dtypes
Out[23]:
text                    string[pyarrow]
list_col    list<item: uint32>[pyarrow]
dtype: object

In [24]: pa_table = pa.Table.from_pandas(arrow_types_pdf)

In [25]: pa_table
Out[25]:
pyarrow.Table
text: string
list_col: list<item: uint32>
  child 0, item: uint32
----
text: [["hello"]]
list_col: [[[1,2]]]

In [26]: pa_table.to_pandas()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[26], line 1
----> 1 pa_table.to_pandas()

File /datasets/pgali/envs/cudfdev/lib/python3.12/site-packages/pyarrow/array.pxi:887, in pyarrow.lib._PandasConvertible.to_pandas()

File /datasets/pgali/envs/cudfdev/lib/python3.12/site-packages/pyarrow/table.pxi:5132, in pyarrow.lib.Table._to_pandas()

File /datasets/pgali/envs/cudfdev/lib/python3.12/site-packages/pyarrow/pandas_compat.py:783, in table_to_dataframe(options, table, categories, ignore_metadata, types_mapper)
    780     table = _add_any_metadata(table, pandas_metadata)
    781     table, index = _reconstruct_index(table, index_descriptors,
    782                                       all_columns, types_mapper)
--> 783     ext_columns_dtypes = _get_extension_dtypes(
    784         table, all_columns, types_mapper)
    785 else:
    786     index = _pandas_api.pd.RangeIndex(table.num_rows)

File /datasets/pgali/envs/cudfdev/lib/python3.12/site-packages/pyarrow/pandas_compat.py:862, in _get_extension_dtypes(table, columns_metadata, types_mapper)
    857 dtype = col_meta['numpy_type']
    859 if dtype not in _pandas_supported_numpy_types:
    860     # pandas_dtype is expensive, so avoid doing this for types
    861     # that are certainly numpy dtypes
--> 862     pandas_dtype = _pandas_api.pandas_dtype(dtype)
    863     if isinstance(pandas_dtype, _pandas_api.extension_dtype):
    864         if hasattr(pandas_dtype, "__from_arrow__"):

File /datasets/pgali/envs/cudfdev/lib/python3.12/site-packages/pyarrow/pandas-shim.pxi:148, in pyarrow.lib._PandasAPIShim.pandas_dtype()

File /datasets/pgali/envs/cudfdev/lib/python3.12/site-packages/pyarrow/pandas-shim.pxi:151, in pyarrow.lib._PandasAPIShim.pandas_dtype()

File /datasets/pgali/envs/cudfdev/lib/python3.12/site-packages/pandas/core/dtypes/common.py:1645, in pandas_dtype(dtype)
   1640     with warnings.catch_warnings():
   1641         # GH#51523 - Series.astype(np.integer) doesn't show
   1642         # numpy deprecation warning of np.integer
   1643         # Hence enabling DeprecationWarning
   1644         warnings.simplefilter("always", DeprecationWarning)
-> 1645         npdtype = np.dtype(dtype)
   1646 except SyntaxError as err:
   1647     # np.dtype uses `eval` which can raise SyntaxError
   1648     raise TypeError(f"data type '{dtype}' not understood") from err

TypeError: data type 'list<item: uint32>[pyarrow]' not understood

In [27]: pa_table.to_pandas(ignore_metadata=True)
Out[27]:
    text list_col
0  hello   [1, 2]
```
Issue Description
It looks like pa.Table.to_pandas is not yet compatible with the Arrow extension dtypes that pandas supports: Table.from_pandas stores dtype strings such as list<item: uint32>[pyarrow] in the table's pandas metadata, and the conversion back fails when the pandas_dtype API tries to parse them.
Expected Behavior
The DataFrame should round-trip through pyarrow without errors.
Installed Versions
```
In [28]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit : 0691c5cf90477d3503834d983f69350f250a6ff7
python : 3.12.8
python-bits : 64
OS : Linux
OS-release : 5.4.0-182-generic
Version : #202-Ubuntu SMP Fri Apr 26 12:29:36 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.2.3
numpy : 2.0.2
pytz : 2024.1
dateutil : 2.9.0.post0
pip : 25.0.1
Cython : 3.0.12
sphinx : 8.1.3
IPython : 8.32.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.13.3
blosc : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2025.2.0
html5lib : None
hypothesis : 6.125.1
gcsfs : None
jinja2 : 3.1.5
lxml.etree : None
matplotlib : None
numba : 0.60.0
numexpr : None
odfpy : None
openpyxl : 3.1.5
pandas_gbq : None
psycopg2 : None
pymysql : None
pyarrow : 18.1.0
pyreadstat : None
pytest : 7.4.4
python-calamine : None
pyxlsb : None
s3fs : 2025.2.0
scipy : 1.15.1
sqlalchemy : 2.0.38
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
xlsxwriter : None
zstandard : 0.23.0
tzdata : 2025.1
qtpy : None
pyqt5 : None
```
Comment From: mroeschke

The canonical way to do this, if you know the original pandas DataFrame all started with ArrowDtypes, is to specify .to_pandas(types_mapper=pd.ArrowDtype).

And it appears that as of pyarrow 19 (https://github.com/apache/arrow/pull/44720), this now supports roundtripping nested arrow types:

```python
In [4]: df = pd.DataFrame({"a": pd.arrays.ArrowExtensionArray(pa.array([[1]]))})

In [5]: pa.Table.from_pandas(df).to_pandas(types_mapper=pd.ArrowDtype)
Out[5]:
     a
0  [1]

In [6]: pa.__version__
Out[6]: '19.0.0'
```