Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
```python
import pandas as pd

### Example 1: Tail ###
# Note: non-unique values in index
df = pd.DataFrame({"A": ["a", "b", "c"], "B": range(3)}, index=[1, 2, 2])
# Works
pd.concat([df.drop("B", axis=1), df["B"].tail()], axis=1)
# Fails
df2 = df.tail(2)
pd.concat([df.drop("B", axis=1), df2["B"].tail()], axis=1)

### Example 2: Explode ###
# Works
df = pd.DataFrame({"A": ["a", "b"], "B": [[0], [1, 2]]})
pd.concat([df.drop("B", axis=1), df["B"].explode()], axis=1)
# Fails
df = pd.DataFrame({"A": ["a", "b"], "B": [[0], [1, 2]]}, index=[0, 1])
pd.concat([df.drop("B", axis=1), df.explode("B")], axis=1)
```
Which both fail with:
```
raise InvalidIndexError(self._requires_unique_msg)
pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects
```
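For reference, the same error message can be triggered without concat at all. The sketch below assumes the message comes from an indexer lookup on a non-unique index; it does not claim this is the exact code path concat takes, and the variable names are illustrative only.

```python
import pandas as pd
from pandas.errors import InvalidIndexError

# A non-unique index, like the one used in Example 1.
dup_index = pd.Index([1, 2, 2])

try:
    # get_indexer needs an unambiguous position for every label,
    # which a non-unique index cannot provide.
    dup_index.get_indexer([1, 2])
    message = None
except InvalidIndexError as exc:
    message = str(exc)

print(message)
```

If this assumption holds, any step inside concat that aligns a duplicated index through such a lookup would surface the error shown above.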
Issue Description
The concat function seems to behave differently depending on the scenario. While troubleshooting, I found two examples where concat raised unexpected errors. The error returned is "Reindexing only valid with uniquely valued Index objects", but both examples, to me, appear to have counterparts that do work with non-unique indexes.
This may be related to https://github.com/pandas-dev/pandas/issues/51646 which has a similar issue.
Expected Behavior
The behavior I expected was for concat to work in a similar fashion to the "Works" versions, even with a non-unique index, like a SQL left join or full outer join.
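As a rough illustration of the SQL-style semantics meant here, DataFrame.join accepts the inputs from Example 1 despite the duplicated index. This is only a sketch of one possible reading of the expected result (matching duplicate labels are paired as a cartesian product), not a statement of what concat should return.

```python
import pandas as pd

# Same data as Example 1: the index label 2 appears twice.
df = pd.DataFrame({"A": ["a", "b", "c"], "B": range(3)}, index=[1, 2, 2])
df2 = df.tail(2)

left = df.drop("B", axis=1)
right = df2["B"]

# Left join on the index; each left row with label 2 is paired with
# every right row that has label 2, as in a SQL join on a non-unique key.
joined = left.join(right, how="left")
print(joined)
```

Whether the cartesian pairing or a positional pairing is the desired outcome for concat is exactly the open question in this report.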
Installed Versions
INSTALLED VERSIONS
commit : 5dd6efc209f5a47aadc9813def6bb29695a14653
python : 3.11.0.final.0
python-bits : 64
OS : Linux
OS-release : 5.19.0-32-generic
Version : #33~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Jan 30 17:03:34 UTC 2
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.1.0.dev0+82.g5dd6efc209
numpy : 1.24.1
pytz : 2022.7.1
dateutil : 2.8.2
setuptools : 65.5.0
pip : 23.0
Cython : None
pytest : 7.2.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 3.0.8
lxml.etree : 4.9.2
html5lib : None
pymysql : None
psycopg2 : 2.9.5
jinja2 : 3.1.2
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.6.3
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.0
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : 1.4.4
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : None
qtpy : None
pyqt5 : None
Comment From: phofl
Hi, thanks for your report. We stopped supporting concat on duplicate indexes a while back, so I think all of them should raise, I guess?
Comment From: ddxv
I didn't know that, and the documentation seems to me to imply it is allowed (i.e. the default join='outer', which in the SQL world is fine with duplicated index values). But you're right that the current behavior seems to indicate this is not the case.
In that case, perhaps this could also be added to the docs? Looking forward to hearing more input, and happy to help.
Comment From: phofl
Looks like I remembered it wrong, see https://github.com/pandas-dev/pandas/pull/38654
That was what I meant, but yours should work, I guess. Investigations welcome.
Comment From: ddxv
@phofl I did some investigating and found that in core.indexes.api the function union_indexes has three different outcomes based on what are called "kinds". The three kinds are "special", "array" and "other". In the second example above, when I did not specify the index, it was created as a RangeIndex by default. Meanwhile, the other dataframes had Index[int64].
Inside union_indexes, the function _sanitize_and_check outputs kind="array" when both indexes are plain Index[int64] and kind="special" when the indexes are of more than one type.
Then later, the different "kind" value is used to call different functions. The "special" kind ultimately uses index.union to combine each Index, creating the outer join index.
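The index types involved in Example 2 can be inspected directly. This is a small sketch using only public attributes; the variable names are illustrative, and the printed class names depend on the pandas version.

```python
import pandas as pd

data = {"A": ["a", "b"], "B": [[0], [1, 2]]}

# "Works" case: no index given, so pandas creates a RangeIndex.
df_default = pd.DataFrame(data)
# "Fails" case: an explicit list of integers gives a plain integer index.
df_explicit = pd.DataFrame(data, index=[0, 1])

# Exploding repeats index labels, so the result is no longer a RangeIndex.
exploded_default = df_default["B"].explode()
exploded_explicit = df_explicit.explode("B")

for name, idx in [
    ("default frame", df_default.index),
    ("explicit frame", df_explicit.index),
    ("default exploded", exploded_default.index),
    ("explicit exploded", exploded_explicit.index),
]:
    print(name, type(idx).__name__, "unique:", idx.is_unique)
```

In the "Works" case the two inputs to concat therefore have different index classes, while in the "Fails" case both are the same plain integer index, which matches the "special" versus "array" split described above.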
It seemed to me that the incorrect part was that the "array" kind was using _unique_indices, which de-duplicates the index and leads to the behavior above.
I removed _unique_indices and instead called index.union to match the "special" kind. This raises the question of whether the same should be done for the final kind and, if so, whether this can be cleaned up into a less complicated function (it currently has 3 return statements inside if / else / for, but could perhaps have just one).
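The difference between the two approaches can be seen with public Index methods alone. The sketch below does not call the private _unique_indices helper; it uses Index.unique as a stand-in for de-duplication, which is an assumption about what that helper effectively produces.

```python
import pandas as pd

# Indexes from the failing variant of Example 2.
left = pd.Index([0, 1])       # df.drop("B", axis=1).index
right = pd.Index([0, 1, 1])   # df.explode("B").index

# De-duplicating drops the repeated label, so the exploded rows
# can no longer be aligned one-to-one with the target index.
deduplicated = right.unique()

# Index.union keeps the repeated label in the combined index.
unioned = left.union(right)

print(list(deduplicated))
print(list(unioned))
```

With the duplicated label preserved in the combined index, the result has room for every exploded row, which is consistent with Example 2 starting to work after the change.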
I am not sure this is the complete fix, as it resolves Example 2 above but not Example 1.