Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [ ] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

>>> from cyberpandas import IPArray
>>> import pandas as pd
>>> 
>>> df1 = pd.DataFrame({
...     'address': IPArray(['192.168.1.1', '192.168.1.10']),
...     'date': ['2022-01-01', '2022-01-02'],
...     'a': [1, 2]
... })
>>> df1 = df1.set_index(['address', 'date'])
>>> 
>>> 
>>> df2 = pd.DataFrame({
...     'address': IPArray(['192.168.1.1', '192.168.1.10']),
...     'date': pd.to_datetime(['2022-01-01', '2022-01-02']),
...     'a': [1, 2]
... })
>>> df2 = df2.set_index(['address', 'date'])
>>> 
>>> df1.index.dtypes
address        ip
date       object
dtype: object
>>> 
>>> df2.index.dtypes
address                ip
date       datetime64[ns]
dtype: object
>>> 
>>> df1.index.union(df2.index).dtypes
address            object   # <-- should be type "ip", not "object"
date       datetime64[ns]
dtype: object

Issue Description

The extension dtype can get lost when two MultiIndex objects are combined with .union(). This becomes a problem when using df.combine_first(...), which relies on index.union(...).

The problem occurs when both MultiIndexes share the same extension array (EA) level but the other level (assuming a two-level MultiIndex) has a different dtype in each index. In that case, the EA level of the resulting MultiIndex loses its EA dtype.

Expected Behavior

The EA dtype should be preserved after index.union(...).
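
The expectation above can be checked mechanically. The following is a minimal sketch, not part of pandas: the helper name `levels_with_lost_dtype` is hypothetical, and it only assumes the public `MultiIndex.dtypes` attribute and `pandas.api.types.is_dtype_equal`. It reports every level whose dtype is identical in both inputs but different in the combined index.

```python
import pandas as pd
from pandas.api.types import is_dtype_equal


def levels_with_lost_dtype(left, right, result):
    """Hypothetical helper: names of levels whose dtype is shared by
    `left` and `right` but is not preserved in `result`."""
    lost = []
    for name in left.names:
        shared = left.dtypes[name]
        # Only levels with the same dtype on both sides are expected to keep it.
        if not is_dtype_equal(shared, right.dtypes[name]):
            continue
        if not is_dtype_equal(shared, result.dtypes[name]):
            lost.append(name)
    return lost
```

With the indexes from the reproducible example, `levels_with_lost_dtype(df1.index, df2.index, df1.index.union(df2.index))` would be expected to return an empty list once the dtype is preserved, and to contain `'address'` on an affected version.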

Installed Versions

INSTALLED VERSIONS
------------------
commit            : e8093ba372f9adfe79439d90fe74b0b5b6dea9d6
python            : 3.8.13.final.0
python-bits       : 64
OS                : Darwin
OS-release        : 21.5.0
Version           : Darwin Kernel Version 21.5.0: Tue Apr 26 21:08:22 PDT 2022; root:xnu-8020.121.3~4/RELEASE_X86_64
machine           : x86_64
processor         : i386
byteorder         : little
LC_ALL            : None
LANG              : en_AU.UTF-8
LOCALE            : en_AU.UTF-8

pandas            : 1.4.3
numpy             : 1.23.1
pytz              : 2022.1
dateutil          : 2.8.2
setuptools        : 62.3.2
pip               : 22.1.1
Cython            : None
pytest            : None
hypothesis        : None
sphinx            : None
blosc             : None
feather           : None
xlsxwriter        : None
lxml.etree        : None
html5lib          : None
pymysql           : None
psycopg2          : None
jinja2            : None
IPython           : None
pandas_datareader : None
bs4               : None
bottleneck        : None
brotli            : None
fastparquet       : None
fsspec            : None
gcsfs             : None
markupsafe        : None
matplotlib        : None
numba             : None
numexpr           : None
odfpy             : None
openpyxl          : None
pandas_gbq        : None
pyarrow           : None
pyreadstat        : None
pyxlsb            : None
s3fs              : None
scipy             : None
snappy            : None
sqlalchemy        : None
tables            : None
tabulate          : None
xarray            : None
xlrd              : None
xlwt              : None
zstandard         : None

Comment From: phofl

Hi, thanks for your report. As a note: This also happens for our own extension arrays.

Comment From: ssche

> This also happens for our own extension arrays.

Do you have an example?

Comment From: phofl

Simply replace your Arrays with Series([1, 2], dtype="Int64")

Comment From: ssche

Hmm... not quite. I get int64 instead of Int64, but not object.

>>> df1 = pd.DataFrame({
...     'address': pd.Series([1, 2], dtype="Int64"),
...     'date': ['2022-01-01', '2022-01-02'],
...     'a': [1, 2]
... })
>>> df1 = df1.set_index(['address', 'date'])
>>> 
>>> 
>>> df2 = pd.DataFrame({
...     'address': pd.Series([1, 2], dtype="Int64"),
...     'date': pd.to_datetime(['2022-01-01', '2022-01-02']),
...     'a': [1, 2]
... })
>>> df2 = df2.set_index(['address', 'date'])
>>> 
>>> df1.index.dtypes
address     Int64
date       object
dtype: object
>>> 
>>> df2.index.dtypes
address             Int64
date       datetime64[ns]
dtype: object
>>> 
>>> df1.index.union(df2.index).dtypes
address             int64
date       datetime64[ns]
dtype: object

Comment From: phofl

Sorry, maybe my comment was not clear enough. I meant: This is buggy for our own extension arrays too.

Comment From: ssche

>>> import pandas as pd
>>> 
>>> 
>>> df1 = pd.DataFrame({
...     'address': pd.Series([1, 2], dtype="Int64"),
...     'date': ['2022-01-01', '2022-01-02'],
...     'a': [1, 2]
... })
>>> df1 = df1.set_index(['address', 'date'])
>>> 
>>> df2 = pd.DataFrame({
...     'address': pd.Series([1, 2], dtype="Int64"),
...     'date': pd.to_datetime(['2022-01-01', '2022-01-02']),
...     'a': [1, 2]
... })
>>> df2 = df2.set_index(['address', 'date'])
>>> 
>>> df1.index.dtypes
address     Int64
date       object
dtype: object
>>> 
>>> df2.index.dtypes
address             Int64
date       datetime64[ns]
dtype: object
>>> 
>>> df1.index.union(df2.index).dtypes
<stdin>:1: RuntimeWarning: The values in the array are unorderable. Pass `sort=False` to suppress this warning.
address     Int64
date       object
dtype: object
>>> 
>>> pd.__version__
'1.6.0.dev0+328.g7aa391ead'

This seems to be resolved now. I might add a test case to pin down the behaviour.
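
A test along those lines might look like the sketch below. It is only an illustration under stated assumptions: the names `make_index` and `test_union_keeps_extension_dtype` are hypothetical, the layout follows pytest conventions, and the assertion is expected to hold only on pandas versions that include the fix. It uses `sort=False` so that the mixed string and Timestamp values in the date level do not trigger the "unorderable" RuntimeWarning shown above.

```python
import pandas as pd


def make_index(dates):
    """Hypothetical helper: two-level MultiIndex whose first level is Int64."""
    frame = pd.DataFrame(
        {
            "address": pd.Series([1, 2], dtype="Int64"),
            "date": dates,
            "a": [1, 2],
        }
    )
    return frame.set_index(["address", "date"]).index


def test_union_keeps_extension_dtype():
    # Both sides share the Int64 level; only the "date" level differs in dtype.
    left = make_index(["2022-01-01", "2022-01-02"])
    right = make_index(pd.to_datetime(["2022-01-01", "2022-01-02"]))

    # sort=False skips sorting, so unorderable mixed values raise no warning.
    result = left.union(right, sort=False)

    # The extension dtype of the shared level should survive the union.
    assert str(result.dtypes["address"]) == "Int64"
```

The assertion deliberately checks only the shared level, since the dtype of the combined date level depends on how pandas reconciles strings with datetimes.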

Comment From: phofl

Thanks for following up. This is fixed and tested :)

Comment From: ssche

Great work! I didn’t realise this issue was explicitly fixed (as part of another ticket). I thought it might have been fixed as a (pleasant) side effect of another fix (hence my suggestion). Do you recall which ticket/commit this ticket relates to?

Comment From: phofl

We did a couple of PRs in that area. You can check the whatsnew for 2.0; they are documented there.