Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [ ] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

In [5]: import cudf

In [6]: df = cudf.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"]})

In [7]: df.to_orc("abc")

In [8]: df.to_parquet("abc.parquet")

In [9]: import pandas as pd

In [10]: pd.read_parquet("abc.parquet", columns=['b', 'a'])
Out[10]: 
   b  a
0  a  1
1  b  2
2  c  3

In [11]: pd.read_orc("abc", columns=['b', 'a'])
Out[11]: 
   a  b
0  1  a
1  2  b
2  3  c

In [12]: pd.__version__
Out[12]: '1.4.3'

Issue Description

The column order requested via the columns argument is respected by pd.read_parquet but not by pd.read_orc, which returns the columns in file order instead.

Expected Behavior

The columns in the output DataFrame should be in the same order as the list passed to columns (b, a in the example above).

Installed Versions

INSTALLED VERSIONS
------------------
commit           : e8093ba372f9adfe79439d90fe74b0b5b6dea9d6
python           : 3.9.13.final.0
python-bits      : 64
OS               : Linux
OS-release       : 4.15.0-76-generic
Version          : #86-Ubuntu SMP Fri Jan 17 17:24:28 UTC 2020
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.4.3
numpy            : 1.22.4
pytz             : 2022.1
dateutil         : 2.8.2
setuptools       : 63.3.0
pip              : 22.2.1
Cython           : 0.29.32
pytest           : 7.1.2
hypothesis       : 6.47.1
sphinx           : 5.1.1
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 3.1.2
IPython          : 8.4.0
pandas_datareader: None
bs4              : 4.11.1
bottleneck       : None
brotli           :
fastparquet      : None
fsspec           : 2022.7.1
gcsfs            : None
markupsafe       : 2.1.1
matplotlib       : None
numba            : 0.55.2
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 8.0.1
pyreadstat       : None
pyxlsb           : None
s3fs             : 2022.7.1
scipy            : 1.9.0
snappy           :
sqlalchemy       : 1.4.39
tables           : None
tabulate         : 0.8.10
xarray           : None
xlrd             : None
xlwt             : None
zstandard        : None

Comment From: mroeschke

I think this may be an issue with PyArrow. I'll raise an issue there to see whether this is the expected behavior.

In [1]: df = pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"]})

# pandas 1.5 main branch
In [2]: df.to_orc("abc")

In [3]: pd.read_orc("abc", columns=['b', 'a'])
Out[3]:
   a  b
0  1  a
1  2  b
2  3  c

In [4]: import pyarrow.orc as orc

In [5]: orc_file = orc.ORCFile("abc")

In [6]: orc_file.read(columns=['b', 'a']).to_pandas()
Out[6]:
   a  b
0  1  a
1  2  b
2  3  c

# reordered to a, b
In [7]: orc_file.read(columns=['b', 'a'])
Out[7]:
pyarrow.Table
a: int64
b: string
----
a: [[1,2,3]]
b: [["a","b","c"]]

Comment From: mroeschke

xref https://issues.apache.org/jira/browse/ARROW-17360

Comment From: mroeschke

Looks like the resolution of the above issue was to add a note to the docs that column order is not guaranteed to be preserved when passing columns: https://github.com/apache/arrow/pull/14528

We can mirror this note in pd.read_orc.
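
Until such a note lands, callers who need a specific order can reindex the result themselves. The following is only a minimal sketch of that workaround, not part of pandas: read_in_requested_order and fake_orc_reader are hypothetical names, and fake_orc_reader is an in-memory stand-in that mimics the file-order behavior shown above so the example runs without writing an ORC file.

```python
import pandas as pd


def read_in_requested_order(reader, path, columns):
    """Call reader(path, columns=columns) and return the frame with its
    columns arranged in the order the caller asked for.

    Hypothetical helper for illustration; `reader` is assumed to accept
    a `columns` keyword the way pd.read_orc does.
    """
    df = reader(path, columns=columns)
    # Selecting with a list of labels returns the columns in list order.
    return df[list(columns)]


def fake_orc_reader(path, columns=None):
    """Hypothetical stand-in for pd.read_orc: ignores `path` and always
    returns the selected columns in file order (a, b)."""
    data = pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"]})
    if columns is None:
        return data
    return data[[name for name in data.columns if name in columns]]


result = read_in_requested_order(fake_orc_reader, "abc", ["b", "a"])
print(list(result.columns))  # ['b', 'a']
```

With a real file, the same reindexing is just pd.read_orc("abc", columns=cols)[cols].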

Comment From: grtcoder

Hey, I would like to take this up

Comment From: markopacak

@mroeschke is it just about adding a note like

Output always follows the ordering of the file and not the columns list.

to read_orc?

I see @grtcoder is working on multiple issues ATM. Perhaps I could take this? It would be my first contribution to Pandas.

Comment From: mroeschke

Correct. Yeah anyone is welcome to open up a pull request to tackle this issue

Comment From: markopacak

take