Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [x] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
pd.DataFrame({}).columns
Issue Description
The above code returns
Index([], dtype='object')
Expected Behavior
However, as stated in https://pandas.pydata.org/docs/dev/whatsnew/v2.0.0.html#empty-dataframes-series-will-now-default-to-have-a-rangeindex, it should instead return:
RangeIndex(start=0, stop=0, step=1)
which it does for
pd.DataFrame().columns
pd.DataFrame(None).columns
pd.DataFrame([]).columns
pd.DataFrame(()).columns
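To compare the constructor variants listed above in one pass, a small check along these lines may help. It is only a sketch: the helper name `empty_columns_report` is made up for illustration and is not part of pandas, and the reported index types depend on the installed pandas version, so no output is shown.

```python
import pandas as pd


def empty_columns_report():
    """Map each empty-constructor spelling to the type name of its columns index.

    `empty_columns_report` is a hypothetical helper, not a pandas function.
    """
    constructors = {
        "pd.DataFrame()": lambda: pd.DataFrame(),
        "pd.DataFrame(None)": lambda: pd.DataFrame(None),
        "pd.DataFrame([])": lambda: pd.DataFrame([]),
        "pd.DataFrame(())": lambda: pd.DataFrame(()),
        "pd.DataFrame({})": lambda: pd.DataFrame({}),
    }
    # Build each empty frame and record only the class name of its columns.
    return {
        label: type(make().columns).__name__
        for label, make in constructors.items()
    }


if __name__ == "__main__":
    for label, index_type in empty_columns_report().items():
        print(f"{label:<20} -> {index_type}")
```

If every line prints the same type name, the constructors are consistent on that pandas version; a differing entry for `pd.DataFrame({})` is the behavior reported here.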
Installed Versions
Comment From: hagenw
When I try to reproduce it on main, I get:
>>> import pandas as pd
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/audeering.local/hwierstorf/git/pandas/pandas/__init__.py", line 22, in <module>
from pandas.compat import is_numpy_dev as _is_numpy_dev # pyright: ignore # noqa:F401
File "/home/audeering.local/hwierstorf/git/pandas/pandas/compat/__init__.py", line 25, in <module>
from pandas.compat.numpy import (
File "/home/audeering.local/hwierstorf/git/pandas/pandas/compat/numpy/__init__.py", line 4, in <module>
from pandas.util.version import Version
File "/home/audeering.local/hwierstorf/git/pandas/pandas/util/__init__.py", line 2, in <module>
from pandas.util._decorators import ( # noqa:F401
File "/home/audeering.local/hwierstorf/git/pandas/pandas/util/_decorators.py", line 14, in <module>
from pandas._libs.properties import cache_readonly
File "/home/audeering.local/hwierstorf/git/pandas/pandas/_libs/__init__.py", line 16, in <module>
import pandas._libs.pandas_parser # noqa # isort: skip # type: ignore[reportUnusedImport]
ModuleNotFoundError: No module named 'pandas._libs.pandas_parser'
Comment From: hagenw
I now followed https://pandas.pydata.org/docs/dev/development/contributing_environment.html#option-2-using-pip and was able to reproduce the issue on the main branch as well:
>>> pd.__version__
'2.1.0.dev0+409.g5a1f280647'
>>> pd.DataFrame({}).columns
Index([], dtype='object')
Comment From: zmwaris1
Hi, I would like to work on this issue. Can you assign this to me and share the details?
Comment From: MarcoGorelli
thanks @hagenw for the report! it looks like this works for some initialisations but not others:
In [8]: import pandas as pd
...: pd.DataFrame().axes
Out[8]: [RangeIndex(start=0, stop=0, step=1), RangeIndex(start=0, stop=0, step=1)]
In [9]: import pandas as pd
...: pd.DataFrame([]).axes
Out[9]: [RangeIndex(start=0, stop=0, step=1), RangeIndex(start=0, stop=0, step=1)]
In [10]: import pandas as pd
...: pd.DataFrame({}).axes
Out[10]: [RangeIndex(start=0, stop=0, step=1), Index([], dtype='object')]
cc @topper-123
Comment From: topper-123
This was intentional on my part when I made #49572.
@mroeschke asked in a comment:
Why does an empty dict not produce RangeIndexes?
I argued there that for a dict d, Series(d) is most equivalent to Series(d.values(), index=d.keys()), which is equivalent to Series([], index=[]) for an empty dict, i.e. it has an index with dtype object.
Also notice that non-empty dict can never give a RangeIndex.
But IDK, maybe this just trips people up and it would be better to have empty dicts give a RangeIndex?
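The equivalence described above can be checked on a small non-empty dict. This is a minimal sketch: the dict `d` and the variable names are illustrative, and the index type for the empty dict is deliberately not asserted, because that is the behavior under discussion and it depends on the pandas version.

```python
import pandas as pd

# Illustrative data; any dict with hashable keys would do.
d = {"a": 1, "b": 2}

from_dict = pd.Series(d)
from_parts = pd.Series(list(d.values()), index=list(d.keys()))

# Building from the dict matches building from its values and keys.
assert from_dict.equals(from_parts)
assert from_dict.index.equals(from_parts.index)

# The index is built from the dict keys, so it is not a RangeIndex.
assert not isinstance(from_dict.index, pd.RangeIndex)

# The empty case in question; its index type depends on the pandas version.
empty_index = pd.Series({}, dtype=object).index
print(type(empty_index).__name__, len(empty_index))
```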
Comment From: MarcoGorelli
thanks for explaining! I think your explanation makes sense, personally I think it'd be fine to keep as-is
Comment From: mroeschke
Would special casing be necessary to make DataFrame({}) produce a RangeIndex on both axes? If not, it might be better to forgo semantics and align with user expectation of "empty".
Comment From: topper-123
It's very easy to change, so it's more a question of what we want. I can follow the thought that this can be a bit surprising.
Comment From: mroeschke
I would support returning a RangeIndex on both axes to match user expectation and an opportunity to avoid introducing an object dtype axis
Comment From: phofl
agree with @mroeschke
This is confusing for most users.
Comment From: topper-123
I've made a PR about this.
Comment From: jorisvandenbossche
I would support returning a RangeIndex on both axes to match user expectation and an opportunity to avoid introducing an object dtype axis
On the other hand, I personally found it confusing to get a RangeIndex for columns, and I actually want to avoid introducing an int64 axis for the columns (if you otherwise always use string column names, using object dtype for an empty columns object is closer to what you want than an int64)
Anyway, I don't necessarily object to the change (consistency with other variants of initialization also has its value), but just wanted to point out that "confusing" / "user expectation" depends quite a bit on your use case (as usual ;)).
Pyarrow will keep returning object dtype for empty columns (which I think makes most sense for pyarrow since our column names are always strings, and which follows from using pandas' Index([])).
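For code that relies on string column names, the columns dtype of an empty frame can also be set explicitly rather than left to the constructor default. The following is a sketch under that assumption; the variable names are illustrative and this is not something proposed in the thread itself.

```python
import pandas as pd

# Pass an explicit empty object-dtype Index so the columns axis does not
# depend on which empty-constructor spelling is used.
empty_object_columns = pd.Index([], dtype=object)
df = pd.DataFrame(columns=empty_object_columns)

print(df.columns.dtype)
print(len(df.columns))
```

Being explicit here sidesteps the question of which default is less surprising, at the cost of one extra argument.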
Comment From: jorisvandenbossche
Pyarrow will keep returning object dtype for empty columns (which I think makes most sense for pyarrow since our column names are always strings,
Small correction: if the pyarrow Table came from a pandas DataFrame roundtrip originally, we actually store in the pandas metadata the dtype of the columns object, and use that information to correctly "restore" the column names. We don't know that it was a RangeIndex though, so if using this information, it comes back as an empty Index[int64]. When there is no pandas metadata, then we will use empty object dtype Index.
Comment From: topper-123
I agreed with you initially, but when I had to explain it, it sounded more complex than I expected. I could personally live with both, though; I think they each have their advantages, and I now like the explanation "an empty axis on empty data is always a RangeIndex"...