Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pyarrow as pa
import pandas as pd
arr = ["Mon", "Tue", "Mon", "Wed", "Mon", "Thu", "Fri", "Sat", "Sun"]
table = pa.table(
{"weekday": pa.array(arr).dictionary_encode()}
)
exchange_df = table.__dataframe__()
from pandas.core.interchange.from_dataframe import from_dataframe
from_dataframe(exchange_df)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/alenkafrim/repos/pyarrow-dev-9/lib/python3.9/site-packages/pandas/core/interchange/from_dataframe.py", line 53, in from_dataframe
return _from_dataframe(df.__dataframe__(allow_copy=allow_copy))
File "/Users/alenkafrim/repos/pyarrow-dev-9/lib/python3.9/site-packages/pandas/core/interchange/from_dataframe.py", line 74, in _from_dataframe
pandas_df = protocol_df_chunk_to_pandas(chunk)
File "/Users/alenkafrim/repos/pyarrow-dev-9/lib/python3.9/site-packages/pandas/core/interchange/from_dataframe.py", line 124, in protocol_df_chunk_to_pandas
columns[name], buf = categorical_column_to_series(col)
File "/Users/alenkafrim/repos/pyarrow-dev-9/lib/python3.9/site-packages/pandas/core/interchange/from_dataframe.py", line 185, in categorical_column_to_series
assert isinstance(cat_column, PandasColumn), "categories must be a PandasColumn"
AssertionError: categories must be a PandasColumn
Issue Description
I am currently working on implementing the dataframe interchange protocol for pyarrow.Table in the Apache Arrow project (https://github.com/apache/arrow/pull/14613). I am using the pandas implementation to test that the produced __dataframe__ object can be correctly consumed.
When consuming a pyarrow.Table with a categorical column, I get an error from pandas saying that the categories must be a PandasColumn and not a general __dataframe__ column as defined by the interchange protocol. There is a check on line 185 for a PandasColumn instance:
https://github.com/pandas-dev/pandas/blob/70121c75a0e2a42e31746b6c205c7bb9e4b9b930/pandas/core/interchange/from_dataframe.py#L164-L185
Expected Behavior
The categorical_column_to_series() function should accept a general dataframe_protocol column for the categories in the categorical column.
Installed Versions
Comment From: AlenkaF
cc @jorisvandenbossche @mroeschke
Comment From: ronwho
take
Comment From: ronwho
Hello @AlenkaF ,
I am fairly new to open source contribution, and I am trying to take this task for a school project.
I wanted to reproduce your error; however, I see that to do so I must use your version of pyarrow.
I tried a couple of things to pip install your pyarrow:
1. pip3 install git+https://github.com/AlenkaF/arrow.git@ARROW-18152#subdirectory=python
2. Cloning your version and then running pip install . in the python directory.
I always run into the error:
"Could not find a package configuration file provided by "Arrow" with any of
the following names:
ArrowConfig.cmake
arrow-config.cmake"
I tried looking into the pyarrow documentation for C++ building on OSX; however, I still could not figure out how to get past the error.
I have a feeling it should not be this difficult to do; perhaps I am missing something? Please let me know :)
Thanks, Ron
Comment From: AlenkaF
Hi @ronwho ,
thank you for working on this issue!
You are correct. To be able to reproduce the error you will have to work from the branch in my Apache Arrow fork. I will send you a link to a new branch I will create today (need to change)!
But first you will need to build PyArrow from source, following the Python Development section in our documentation.
You will need to build Arrow C++ and then PyArrow. I suggest doing this on the Apache Arrow master branch first; then you can check out my branch with the dataframe interchange protocol code.
Hope that makes sense. Feel free to ping me with questions in case you run into any difficulties!
Comment From: ronwho
Hey @AlenkaF,
Thanks for your response, it has really helped me!
After carefully following the Python development guide I was successfully able to build and pip install the package in main. After checking out the ARROW-18152 branch I noticed there was a compile error while compiling Arrow C++, but I saw that you created a new branch "ARROW-18152-second." I tried building that and it built and pip installed successfully!
However, for some reason when testing the code in Python, I am getting an import error from pyarrow:
File "/Users/ron/Projects/open-source/testing-pandas/test-case.py", line 1, in <module>
import pyarrow as pa
File "/Users/ron/Projects/open-source/pyarrow-dev/lib/python3.10/site-packages/pyarrow/__init__.py", line 65, in <module>
import pyarrow.lib as _lib
ImportError: dlopen(/Users/ron/Projects/open-source/pyarrow-dev/lib/python3.10/site-packages/pyarrow/lib.cpython-310-darwin.so, 0x0002): Library not loaded: '@rpath/libarrow_dataset.1000.dylib'
Referenced from: '/Users/ron/Projects/open-source/pyarrow-dev/lib/python3.10/site-packages/pyarrow/lib.cpython-310-darwin.so'
Reason: tried: '/Users/ron/Projects/open-source/pyarrow-dev/lib/python3.10/site-packages/pyarrow/libarrow_dataset.1000.dylib' (no such file), '/Users/ron/Projects/open-source/pyarrow-dev/lib/python3.10/site-packages/pyarrow/libarrow_dataset.1000.dylib' (no such file), '/opt/homebrew/lib/libarrow_dataset.1000.dylib' (no such file), '/opt/homebrew/lib/libarrow_dataset.1000.dylib' (no such file), '/usr/local/lib/libarrow_dataset.1000.dylib' (no such file), '/usr/lib/libarrow_dataset.1000.dylib' (no such file)
I also tested by building the main branch, but I run into the same error. I also tried git clean, removing the build folder, and rebuilding. I think I must be doing something wrong. Have you ever encountered an error like this? Please let me know!
Thanks so much again!
Comment From: AlenkaF
Great, you found the new branch! 👍
Can you check if you built Arrow C++ with -DARROW_DATASET=ON? Can you also check if you can find libarrow_dataset.* in the folder where the libraries are installed (/Users/ron/Projects/open-source/pyarrow-dev/lib/python3.10/site-packages/pyarrow/)?
From the error I would think that the Arrow C++ dataset package is not installed.
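The second check can also be done from Python with a short glob over the install directory. This is only a minimal sketch: the helper name find_shared_libs is hypothetical, and the directory to pass is whichever one your build actually installed into.

```python
from pathlib import Path


def find_shared_libs(directory, stem="libarrow_dataset"):
    """Return the sorted file names in `directory` that start with `stem`."""
    root = Path(directory)
    if not root.is_dir():
        # A missing directory simply means nothing was found there.
        return []
    return sorted(p.name for p in root.glob(f"{stem}*"))
```

Calling find_shared_libs() on the site-packages/pyarrow folder from the error message and getting an empty list back would be consistent with the "no such file" entries in the dlopen output.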
Comment From: AlenkaF
On second thought, if it built successfully then there should be some other reason for the error. Somehow it is searching for the Arrow C++ dataset module in the wrong place. There are a few things I would check/try:
- Does the import of PyArrow work in the Python console if opened from arrow/python?
- Run the tests with python -m pytest ...
- If you are on macOS then try setting this before building PyArrow:
export R_LD_LIBRARY_PATH=$(pwd)/dist/lib:$LD_LIBRARY_PATH
or try building Arrow C++ with -DARROW_INSTALL_NAME_RPATH=OFF added
- You could try to build PyArrow with pip install -e . --no-build-isolation or maybe with build_ext:
pushd arrow/python
export PYARROW_WITH_PARQUET=1
export PYARROW_WITH_DATASET=1
export PYARROW_PARALLEL=4
python setup.py build_ext --inplace
popd
Comment From: ronwho
Hello Alenka, sorry for the late response, and thanks for helping me detect the build problem I was facing. I was able to build your version of pyarrow successfully; adding "-DARROW_INSTALL_NAME_RPATH=OFF" solved it.
My project team and I tried to change the assert statement from PandasColumn to Column, like so:
print(cat_column)
print(type(cat_column))
# for mypy/pyright
assert isinstance(cat_column, Column), "categories must abide by __dataframe__ protocol API"
This is because we should be checking Column instead of PandasColumn since we believe Column should be the general dataframe_protocol column type. Please correct us if we are mistaken.
However, after building and running, it still seems that the column passed in the example code to showcase the edge case is a PyArrowColumn type. This still gives us an assertion error. We printed the cat_column and type of cat_column above to check. Here is the output:
<pyarrow.interchange.column._PyArrowColumn object at 0x124663400>
<class 'pyarrow.interchange.column._PyArrowColumn'>
Traceback (most recent call last):
File "/Users/ron/Projects/open-source/testing-pandas/test-case.py", line 13, in <module>
from_dataframe(exchange_df)
File "/Users/ron/Projects/open-source/pandas-ronwho/pandas/core/interchange/from_dataframe.py", line 53, in from_dataframe
return _from_dataframe(df.__dataframe__(allow_copy=allow_copy))
File "/Users/ron/Projects/open-source/pandas-ronwho/pandas/core/interchange/from_dataframe.py", line 74, in _from_dataframe
pandas_df = protocol_df_chunk_to_pandas(chunk)
File "/Users/ron/Projects/open-source/pandas-ronwho/pandas/core/interchange/from_dataframe.py", line 124, in protocol_df_chunk_to_pandas
columns[name], buf = categorical_column_to_series(col)
File "/Users/ron/Projects/open-source/pandas-ronwho/pandas/core/interchange/from_dataframe.py", line 187, in categorical_column_to_series
assert isinstance(cat_column, Column), "categories must abide by __dataframe__ protocol API"
AssertionError: categories must abide by __dataframe__ protocol API
Should the function be able to accept the PyArrowColumn type as well?
Comment From: AlenkaF
> This is because we should be checking Column instead of PandasColumn since we believe Column should be the general dataframe_protocol column type. Please correct us if we are mistaken.
That is how I understand it also.
> However, after building and running, it still seems that the column passed in the example code to showcase the edge case is a PyArrowColumn type. This still gives us an assertion error. We printed the cat_column and type of cat_column above to check.
The issue here is that the parent class Column (or Buffer & DataFrame) is defined by the data-apis consortium and I haven't implemented it in PyArrow. For now I am defining a ColumnObject as Any, see:
https://github.com/apache/arrow/blob/3b27cf2e2546efefac224748de16b44fa326689a/python/pyarrow/interchange/from_dataframe.py#L38-L42
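To illustrate why the relaxed assertion still fails: isinstance against an abstract base class is a nominal check, so an object passes only if its class inherits from (or is registered with) that base, regardless of which methods it implements. A minimal sketch with stand-in classes follows; Column and ForeignColumn here are hypothetical simplifications, not the real pandas or PyArrow classes.

```python
from abc import ABC, abstractmethod


class Column(ABC):
    """Stand-in for the interchange protocol's abstract Column."""

    @abstractmethod
    def size(self):
        ...


class ForeignColumn:
    """Implements the same method but does not inherit from Column."""

    def size(self):
        return 3


col = ForeignColumn()

# Nominal check: fails because there is no inheritance relationship.
nominal_ok = isinstance(col, Column)

# Duck-typing check: passes because the method is present and callable.
has_method = callable(getattr(col, "size", None))

# Registering the class as a virtual subclass makes the nominal check pass.
Column.register(ForeignColumn)
registered_ok = isinstance(col, Column)
```

Here nominal_ok is False while has_method and registered_ok are both True, which mirrors the situation where a column object follows the protocol but was never derived from the shared parent class.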
But in pandas, PandasColumn seems to be checked only for the categories, and I am not sure why. Also, the check for the Column parent class is currently a bit strict, as most of the implementations are not using it; see also:
- CuDF implementation https://github.com/rapidsai/cudf/blob/22087b3c2a78b49140fdf96091b5abd325427b11/python/cudf/cudf/core/df_protocol.py#L642
- Vaex implementation https://github.com/vaexio/vaex/blob/a9ae5e63f6aba5828e4afca1c2f0ab2a7352386f/packages/vaex-core/vaex/dataframe_protocol.py#L34-L37
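One possible alternative to a parent-class check is a structural one. The sketch below uses typing.Protocol with runtime_checkable, which only verifies that the listed members exist (not their signatures or return types). ColumnLike, ThirdPartyColumn and check_categories are hypothetical names for illustration; this is not what pandas does, and ColumnLike lists only a few of the methods the interchange protocol defines on a column.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class ColumnLike(Protocol):
    """Hypothetical structural type covering a few Column methods."""

    def size(self): ...
    def num_chunks(self): ...
    def get_chunks(self, n_chunks=None): ...
    def get_buffers(self): ...


class ThirdPartyColumn:
    """Stand-in for a column from a library that does not subclass Column."""

    def __init__(self, values):
        self._values = list(values)

    def size(self):
        return len(self._values)

    def num_chunks(self):
        return 1

    def get_chunks(self, n_chunks=None):
        yield self

    def get_buffers(self):
        return {"data": self._values, "validity": None, "offsets": None}


def check_categories(cat_column):
    """Raise TypeError unless the object has the methods listed in ColumnLike."""
    if not isinstance(cat_column, ColumnLike):
        raise TypeError("categories must implement the interchange Column API")
    return cat_column
```

With this kind of check, check_categories(ThirdPartyColumn(["Mon", "Tue"])) returns the column unchanged, while passing a plain object() raises TypeError.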
Comment From: ronwho
Hello Alenka, I made a PR for this issue that relaxes the assertion from PandasColumn to Column, so that once you have implemented the parent class you will be able to use this function with no issues. Hope this helps!