Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [x] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
pd.DataFrame({}).columns

Issue Description

The above code returns

Index([], dtype='object')

Expected Behavior

According to the pandas 2.0.0 release notes (https://pandas.pydata.org/docs/dev/whatsnew/v2.0.0.html#empty-dataframes-series-will-now-default-to-have-a-rangeindex), it should instead return:

RangeIndex(start=0, stop=0, step=1)

which it does for

pd.DataFrame().columns
pd.DataFrame(None).columns
pd.DataFrame([]).columns
pd.DataFrame(()).columns
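
To compare the constructors side by side, a minimal sketch such as the following may help. The `constructors` and `column_types` names are illustrative only, and the printed class names depend on the installed pandas version, so no fixed output is shown.

```python
import pandas as pd

# Each entry builds an "empty" DataFrame in a different way.
constructors = {
    "DataFrame()": lambda: pd.DataFrame(),
    "DataFrame(None)": lambda: pd.DataFrame(None),
    "DataFrame([])": lambda: pd.DataFrame([]),
    "DataFrame(())": lambda: pd.DataFrame(()),
    "DataFrame({})": lambda: pd.DataFrame({}),
}

# Record the concrete class of .columns for each constructor.
# The result is version dependent (Index vs. RangeIndex).
column_types = {
    name: type(make().columns).__name__ for name, make in constructors.items()
}

for name, type_name in column_types.items():
    print(f"{name:<16} -> {type_name}")
```

On pandas 2.0.0, the report above implies that only the `DataFrame({})` row would differ from the others.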

Installed Versions

INSTALLED VERSIONS
------------------
commit : 478d340667831908b5b4bf09a2787a11a14560c9
python : 3.8.16.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-144-generic
Version : #161~18.04.1-Ubuntu SMP Fri Feb 10 15:55:22 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8
pandas : 2.0.0
numpy : 1.24.2
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.5.1
pip : 23.0.1
Cython : None
pytest : None
hypothesis : None
sphinx : 6.1.3
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.10.1
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

Comment From: hagenw

When I try to reproduce it for main I get:

>>> import pandas as pd
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/audeering.local/hwierstorf/git/pandas/pandas/__init__.py", line 22, in <module>
    from pandas.compat import is_numpy_dev as _is_numpy_dev  # pyright: ignore # noqa:F401
  File "/home/audeering.local/hwierstorf/git/pandas/pandas/compat/__init__.py", line 25, in <module>
    from pandas.compat.numpy import (
  File "/home/audeering.local/hwierstorf/git/pandas/pandas/compat/numpy/__init__.py", line 4, in <module>
    from pandas.util.version import Version
  File "/home/audeering.local/hwierstorf/git/pandas/pandas/util/__init__.py", line 2, in <module>
    from pandas.util._decorators import (  # noqa:F401
  File "/home/audeering.local/hwierstorf/git/pandas/pandas/util/_decorators.py", line 14, in <module>
    from pandas._libs.properties import cache_readonly
  File "/home/audeering.local/hwierstorf/git/pandas/pandas/_libs/__init__.py", line 16, in <module>
    import pandas._libs.pandas_parser  # noqa # isort: skip # type: ignore[reportUnusedImport]
ModuleNotFoundError: No module named 'pandas._libs.pandas_parser'

Comment From: hagenw

I now followed https://pandas.pydata.org/docs/dev/development/contributing_environment.html#option-2-using-pip and was able to reproduce the issue on the main branch as well:

>>> pd.__version__
'2.1.0.dev0+409.g5a1f280647'
>>> pd.DataFrame({}).columns
Index([], dtype='object')

Comment From: zmwaris1

Hi, I would like to work on this issue. Can you assign this to me and share the details?

Comment From: MarcoGorelli

thanks @hagenw for the report! it looks like this works for some initialisations but not others:

In [8]: import pandas as pd
   ...: pd.DataFrame().axes
Out[8]: [RangeIndex(start=0, stop=0, step=1), RangeIndex(start=0, stop=0, step=1)]

In [9]: import pandas as pd
   ...: pd.DataFrame([]).axes
Out[9]: [RangeIndex(start=0, stop=0, step=1), RangeIndex(start=0, stop=0, step=1)]

In [10]: import pandas as pd
    ...: pd.DataFrame({}).axes
Out[10]: [RangeIndex(start=0, stop=0, step=1), Index([], dtype='object')]

cc @topper-123

Comment From: topper-123

This was intentional on my part when I made #49572.

@mroeschke asked in a comment:

Why does an empty dict not produce RangeIndexes?

I argued there that for a dict d, Series(d) is most nearly equivalent to Series(d.values(), index=d.keys()), which for an empty dict is equivalent to Series([], index=[]), i.e. it has an index with dtype object.

Also notice that a non-empty dict can never give a RangeIndex.

But IDK, maybe this just trips people up and it would be better to have empty dicts give a RangeIndex?
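
The dict equivalence argued in this comment can be sketched roughly as follows. The variable names are illustrative, and `dtype=object` is passed to the empty Series only as an assumption to keep the values dtype unambiguous across pandas versions.

```python
import pandas as pd

d = {"a": 1, "b": 2}

# Building a Series from the dict directly ...
s_from_dict = pd.Series(d)
# ... matches building it from the dict's values and keys separately.
s_from_parts = pd.Series(list(d.values()), index=list(d.keys()))
print(s_from_dict.equals(s_from_parts))

# With an empty dict, the second form degenerates to Series([], index=[]).
# dtype=object here only fixes the values dtype; the index is built from [].
empty = pd.Series([], index=[], dtype=object)
print(type(empty.index).__name__, empty.index.dtype)
```

The second print shows what kind of index an explicitly empty `index=[]` produces, which is the case the comment compares an empty dict to.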

Comment From: MarcoGorelli

thanks for explaining! I think your explanation makes sense, personally I think it'd be fine to keep as-is

Comment From: mroeschke

Would special casing be necessary to make DataFrame({}) produce a RangeIndex on both axes? If not, it might be better to forgo semantics and align with user expectation of "empty"

Comment From: topper-123

It's very easy to change, so it's more a question of what we want. I can follow the thought that this can be a bit surprising.

Comment From: mroeschke

I would support returning a RangeIndex on both axes to match user expectation and an opportunity to avoid introducing an object dtype axis

Comment From: phofl

agree with @mroeschke

This is confusing for most users.

Comment From: topper-123

I've made a PR about this.

Comment From: jorisvandenbossche

I would support returning a RangeIndex on both axes to match user expectation and an opportunity to avoid introducing an object dtype axis

On the other hand, I personally found it confusing to get a RangeIndex for columns, and I actually want to avoid introducing an int64 axis for the columns (if you otherwise always use string column names, using object dtype for an empty columns object is closer to what you want than an int64)

Anyway, I don't necessarily object to the change (consistency with other variants of initialization also has its value), but just wanted to point out that "confusing" / "user expectation" depends quite a bit on your use case (as usual ;)). Pyarrow will keep returning object dtype for empty columns (which I think makes most sense for pyarrow since our column names are always strings, and which follows from using pandas' Index([])).
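
For readers who share the dtype concern in this comment, one possible way to get object-dtype empty columns regardless of the constructor default is to pass an explicit empty Index. This is a sketch of a workaround, not something proposed in the thread.

```python
import pandas as pd

# An empty RangeIndex carries an integer dtype.
print(pd.RangeIndex(0).dtype)

# Passing an explicit empty object-dtype Index pins the columns dtype,
# independent of what the constructor would pick by default.
df = pd.DataFrame(columns=pd.Index([], dtype=object))
print(df.columns.dtype)
```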

Comment From: jorisvandenbossche

Pyarrow will keep returning object dtype for empty columns (which I think makes most sense for pyarrow since our column names are always strings,

Small correction: if the pyarrow Table came from a pandas DataFrame roundtrip originally, we actually store in the pandas metadata the dtype of the columns object, and use that information to correctly "restore" the column names. We don't know that it was a RangeIndex though, so if using this information, it comes back as an empty Index[int64]. When there is no pandas metadata, then we will use empty object dtype Index.

Comment From: topper-123

I agreed with you initially, but when I had to explain it, it sounded more complex than I expected. I could personally live with either, since I think they each have their advantages, and I now like the explanation "an empty axis on empty data is always a RangeIndex"...
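
Since the type of an empty columns index has differed between constructors and versions, code that only needs to know whether a frame has no columns may be more robust if it avoids checking the index type at all. A minimal sketch, where `has_no_columns` is a hypothetical helper and not part of pandas:

```python
import pandas as pd


def has_no_columns(df: pd.DataFrame) -> bool:
    """Return True when the frame has zero columns, whatever the index type."""
    return len(df.columns) == 0


# All of these are empty, however their columns index happens to be typed.
frames = [pd.DataFrame(), pd.DataFrame([]), pd.DataFrame({})]
print([has_no_columns(f) for f in frames])
```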