Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
In [5]: import cudf
In [6]: df = cudf.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"]})
In [7]: df.to_orc("abc")
In [8]: df.to_parquet("abc.parquet")
In [9]: import pandas as pd
In [10]: pd.read_parquet("abc.parquet", columns=['b', 'a'])
Out[10]:
b a
0 a 1
1 b 2
2 c 3
In [11]: pd.read_orc("abc", columns=['b', 'a'])
Out[11]:
a b
0 1 a
1 2 b
2 3 c
In [12]: pd.__version__
Out[12]: '1.4.3'
Issue Description
The column order requested through the columns argument is respected by pd.read_parquet but not by pd.read_orc: read_orc returns the columns in file order (a, b) even when ['b', 'a'] is requested.
Expected Behavior
The columns in the output DataFrame should be in the same order as the list passed to columns.
Installed Versions
Comment From: mroeschke
I think this may be an issue with PyArrow. I'll raise an issue there to see if this is expected behavior.
In [1]: df = pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"]})
# pandas 1.5 main branch
In [2]: df.to_orc("abc")
In [3]: pd.read_orc("abc", columns=['b', 'a'])
Out[3]:
a b
0 1 a
1 2 b
2 3 c
In [4]: import pyarrow.orc as orc
In [5]: orc_file = orc.ORCFile("abc")
In [6]: orc_file.read(columns=['b', 'a']).to_pandas()
Out[6]:
a b
0 1 a
1 2 b
2 3 c
# reordered to a, b
In [7]: orc_file.read(columns=['b', 'a'])
Out[7]:
pyarrow.Table
a: int64
b: string
----
a: [[1,2,3]]
b: [["a","b","c"]]
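Since the reordering already happens in the pyarrow.Table, one way to restore the requested order at that level is Table.select, which returns the columns in the order given. This is a minimal sketch, not part of PyArrow or pandas: the helper name read_columns_in_order is hypothetical, and it assumes orc_file is an already opened pyarrow.orc.ORCFile.

```python
def read_columns_in_order(orc_file, columns):
    """Read `columns` from an ORC file and return them in the requested order.

    Hypothetical helper for illustration. `orc_file` is assumed to be a
    pyarrow.orc.ORCFile (anything with a compatible .read() works).
    """
    columns = list(columns)
    # ORCFile.read may return the columns in file order, not request order.
    table = orc_file.read(columns=columns)
    # Table.select returns a new table with columns in the order given.
    return table.select(columns)


# Usage, assuming the "abc" file written in the session above:
#   import pyarrow.orc as orc
#   read_columns_in_order(orc.ORCFile("abc"), ["b", "a"]).to_pandas()
```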
Comment From: mroeschke
xref https://issues.apache.org/jira/browse/ARROW-17360
Comment From: mroeschke
Looks like the resolution from the above issue was to add a note in the docs that order is not guaranteed to be preserved when passing columns: https://github.com/apache/arrow/pull/14528
We can mirror this note in pd.read_orc.
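Until then, callers who need a specific order can reindex the result themselves. The following is a rough sketch rather than a proposed pandas API: read_orc_ordered and its reader parameter are hypothetical names, and reader defaults to pd.read_orc only so that the function can be exercised without an ORC file.

```python
def read_orc_ordered(path, columns, reader=None):
    """Read an ORC file and return the columns in the requested order.

    Hypothetical wrapper for illustration. `reader` is an assumption added
    for testability; it defaults to pandas.read_orc.
    """
    if reader is None:
        import pandas as pd  # imported lazily so the sketch loads without pandas
        reader = pd.read_orc
    columns = list(columns)
    # read_orc may return the columns in file order, not request order.
    df = reader(path, columns=columns)
    # Indexing with a list of labels returns the columns in that order.
    return df[columns]


# Usage, assuming the "abc" file written in the session above:
#   read_orc_ordered("abc", ["b", "a"])
```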
Comment From: grtcoder
Hey, I would like to take this up
Comment From: markopacak
@mroeschke is it just about adding a note like
"Output always follows the ordering of the file and not the columns list."
to read_orc?
I see @grtcoder is working on multiple issues at the moment. Perhaps I could take this? It would be my first contribution to pandas.
Comment From: mroeschke
Correct. Yeah anyone is welcome to open up a pull request to tackle this issue
Comment From: markopacak
take