Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
>>> from cyberpandas import IPArray
>>> import pandas as pd
>>>
>>> df1 = pd.DataFrame({
... 'address': IPArray(['192.168.1.1', '192.168.1.10']),
... 'date': ['2022-01-01', '2022-01-02'],
... 'a': [1, 2]
... })
>>> df1 = df1.set_index(['address', 'date'])
>>>
>>>
>>> df2 = pd.DataFrame({
... 'address': IPArray(['192.168.1.1', '192.168.1.10']),
... 'date': pd.to_datetime(['2022-01-01', '2022-01-02']),
... 'a': [1, 2]
... })
>>> df2 = df2.set_index(['address', 'date'])
>>>
>>> df1.index.dtypes
address ip
date object
dtype: object
>>>
>>> df2.index.dtypes
address ip
date datetime64[ns]
dtype: object
>>>
>>> df1.index.union(df2.index).dtypes
address object # <-- should be type "ip", not "object"
date datetime64[ns]
dtype: object
Issue Description
The extension dtype can be lost when two MultiIndex objects are combined with .union(). This becomes a problem with df.combine_first(...), which relies on index.union(...).
The problem occurs when both MultiIndex objects share the same extension array (EA) level but the other level (assuming a two-level MultiIndex) has a different dtype in each of them. In that case, the EA level of the combined MultiIndex loses its EA dtype.
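The condition described above can be isolated without DataFrames. The following is only a minimal sketch: pandas' own nullable Int64 dtype stands in for IPArray so that cyberpandas is not required, and the names `left`, `right`, `control` and `combined` are illustrative, not part of the original report. It compares a control case (both date levels hold strings) with the mismatched case (strings on one side, datetime64 on the other).

```python
import pandas as pd

# Shared extension-array level. Assumption: nullable Int64 stands in for IPArray.
address = pd.Series([1, 2], dtype="Int64")

dates_as_text = ["2022-01-01", "2022-01-02"]
dates_as_datetime = pd.to_datetime(dates_as_text)

names = ["address", "date"]
left = pd.MultiIndex.from_arrays([address, dates_as_text], names=names)
right = pd.MultiIndex.from_arrays([address, dates_as_datetime], names=names)

# Control: the second level holds strings on both sides.
other_text = pd.MultiIndex.from_arrays(
    [address, ["2022-01-03", "2022-01-04"]], names=names
)
control = left.union(other_text, sort=False)

# Mismatched case from the report: strings on one side, datetime64 on the other.
# sort=False avoids the "unorderable" RuntimeWarning for mixed values.
combined = left.union(right, sort=False)

# On an affected pandas version the "address" dtype of `combined` is expected
# to differ from the input dtype; on a fixed version both should stay Int64.
print(control.dtypes)
print(combined.dtypes)
```

Comparing the two printed results shows whether the dtype of the "address" level depends on what happens in the unrelated "date" level.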
Expected Behavior
The EA dtype should be preserved after index.union(...).
Installed Versions
Comment From: phofl
Hi, thanks for your report. As a note: This also happens for our own extension arrays.
Comment From: ssche
> This also happens for our own extension arrays.
Do you have an example?
Comment From: phofl
Simply replace your Arrays with Series([1, 2], dtype="Int64")
Comment From: ssche
Hmm... not quite. I get int64 instead of Int64, but not object.
>>> df1 = pd.DataFrame({
... 'address': pd.Series([1, 2], dtype="Int64"),
... 'date': ['2022-01-01', '2022-01-02'],
... 'a': [1, 2]
... })
>>> df1 = df1.set_index(['address', 'date'])
>>>
>>>
>>> df2 = pd.DataFrame({
... 'address': pd.Series([1, 2], dtype="Int64"),
... 'date': pd.to_datetime(['2022-01-01', '2022-01-02']),
... 'a': [1, 2]
... })
>>> df2 = df2.set_index(['address', 'date'])
>>>
>>> df1.index.dtypes
address Int64
date object
dtype: object
>>>
>>> df2.index.dtypes
address Int64
date datetime64[ns]
dtype: object
>>>
>>> df1.index.union(df2.index).dtypes
address int64
date datetime64[ns]
dtype: object
Comment From: phofl
Sorry, maybe my comment was not clear enough. I meant: This is buggy for our own extension arrays too.
Comment From: ssche
>>> import pandas as pd
>>>
>>>
>>> df1 = pd.DataFrame({
... 'address': pd.Series([1, 2], dtype="Int64"),
... 'date': ['2022-01-01', '2022-01-02'],
... 'a': [1, 2]
... })
>>> df1 = df1.set_index(['address', 'date'])
>>>
>>> df2 = pd.DataFrame({
... 'address': pd.Series([1, 2], dtype="Int64"),
... 'date': pd.to_datetime(['2022-01-01', '2022-01-02']),
... 'a': [1, 2]
... })
>>> df2 = df2.set_index(['address', 'date'])
>>>
>>> df1.index.dtypes
address Int64
date object
dtype: object
>>>
>>> df2.index.dtypes
address Int64
date datetime64[ns]
dtype: object
>>>
>>> df1.index.union(df2.index).dtypes
<stdin>:1: RuntimeWarning: The values in the array are unorderable. Pass `sort=False` to suppress this warning.
address Int64
date object
dtype: object
>>>
>>> pd.__version__
'1.6.0.dev0+328.g7aa391ead'
This seems to be resolved now. I might throw in a test case to pin down the behaviour.
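A test along those lines might look like the sketch below. It is not the test that exists in the pandas test suite; the helper `make_index` and the test name are hypothetical, and it only checks the nullable Int64 case shown in the session above. It assumes a pandas version that contains the fix.

```python
import warnings

import pandas as pd


def make_index(address_dtype, dates):
    """Build a two-level MultiIndex the same way the sessions above do."""
    frame = pd.DataFrame({
        "address": pd.Series([1, 2], dtype=address_dtype),
        "date": dates,
        "a": [1, 2],
    })
    return frame.set_index(["address", "date"]).index


def test_union_keeps_extension_dtype(address_dtype="Int64"):
    # Hypothetical regression test: same EA level, differing "date" dtypes.
    left = make_index(address_dtype, ["2022-01-01", "2022-01-02"])
    right = make_index(
        address_dtype, pd.to_datetime(["2022-01-01", "2022-01-02"])
    )

    with warnings.catch_warnings():
        # The mixed str/Timestamp level cannot be sorted, so the default
        # sort emits a RuntimeWarning; it is irrelevant to this check.
        warnings.simplefilter("ignore", RuntimeWarning)
        result = left.union(right)

    # The extension dtype of the shared level must survive the union.
    assert str(result.dtypes["address"]) == address_dtype


test_union_keeps_extension_dtype()
```

The assertion deliberately ignores the dtype of the "date" level, since the report is only about the extension dtype of the shared level.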
Comment From: phofl
Thanks for following up. This is fixed and tested :)
Comment From: ssche
Great work! I didn’t realise this issue was explicitly fixed (as part of another ticket). I thought it might have been fixed as a (pleasant) side effect of another fix (hence my suggestion). Do you recall which ticket/commit fixed this issue?
Comment From: phofl
We did a couple of PRs in that area. You can check the whatsnew for 2.0; they are documented there.