Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
>>> from cyberpandas import IPArray
>>> import pandas as pd
>>>
>>> df1 = pd.DataFrame({
... 'address': IPArray(['192.168.1.1', '192.168.1.10']),
... 'date': ['2022-01-01', '2022-01-02'],
... 'a': [1, 2]
... })
>>> df1 = df1.set_index(['address', 'date'])
>>>
>>>
>>> df2 = pd.DataFrame({
... 'address': IPArray(['192.168.1.1', '192.168.1.10']),
... 'date': pd.to_datetime(['2022-01-01', '2022-01-02']),
... 'a': [1, 2]
... })
>>> df2 = df2.set_index(['address', 'date'])
>>>
>>> df1.index.dtypes
address ip
date object
dtype: object
>>>
>>> df2.index.dtypes
address ip
date datetime64[ns]
dtype: object
>>>
>>> df1.index.union(df2.index).dtypes
address object # <-- should be type "ip", not "object"
date datetime64[ns]
dtype: object
Issue Description
The ExtensionType can get lost when two MultiIndex objects are combined by .union() (which becomes a problem when using df.combine_first(...), which relies on index.union(...)).
The problem occurs when both MultiIndexes (MIs) share a level with the same extension array (EA) dtype, but the other level (assuming a two-level MI) has a different dtype in each MI. In that case, the EA level of the joined MI loses its EA dtype.
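To make the condition above easier to check on a given pandas build, here is a minimal sketch of a diagnostic helper. The name `changed_level_dtypes` is hypothetical (it is not a pandas API), and the built-in `Int64` dtype stands in for `IPArray` so that the snippet does not need cyberpandas. It assumes both indexes use the same level names, and it passes `sort=False` because the string and datetime values in the `date` level cannot be ordered against each other.

```python
import pandas as pd


def changed_level_dtypes(left, right, sort=False):
    """Hypothetical helper: report levels whose dtype is the same in both
    inputs but different in left.union(right).

    Returns {level name: (input dtype, result dtype)} as strings.
    Assumes both MultiIndexes carry the same level names.
    """
    result = left.union(right, sort=sort)
    changed = {}
    for name in left.names:
        before = str(left.dtypes[name])
        after = str(result.dtypes[name])
        # Only flag levels that agreed on the way in and differ on the way out.
        if before == str(right.dtypes[name]) and after != before:
            changed[name] = (before, after)
    return changed


# "address" has the same EA dtype on both sides; "date" differs (str vs datetime).
left = pd.MultiIndex.from_arrays(
    [pd.array([1, 2], dtype="Int64"), ["2022-01-01", "2022-01-02"]],
    names=["address", "date"],
)
right = pd.MultiIndex.from_arrays(
    [pd.array([1, 2], dtype="Int64"), pd.to_datetime(["2022-01-01", "2022-01-02"])],
    names=["address", "date"],
)
print(changed_level_dtypes(left, right))
```

A non-empty result that lists `address` would indicate the behaviour described in this report; an empty dict means the shared level kept its dtype through the union.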
Expected Behavior
The EA dtype should be preserved after index.union(...).
Installed Versions
Comment From: phofl
Hi, thanks for your report. As a note: This also happens for our own extension arrays.
Comment From: ssche
This also happens for our own extension arrays.
Do you have an example?
Comment From: phofl
Simply replace your Arrays with Series([1, 2], dtype="Int64")
Comment From: ssche
Hmm... not quite. I get int64 instead of Int64, but not object.
>>> df1 = pd.DataFrame({
... 'address': pd.Series([1, 2], dtype="Int64"),
... 'date': ['2022-01-01', '2022-01-02'],
... 'a': [1, 2]
... })
>>> df1 = df1.set_index(['address', 'date'])
>>>
>>>
>>> df2 = pd.DataFrame({
... 'address': pd.Series([1, 2], dtype="Int64"),
... 'date': pd.to_datetime(['2022-01-01', '2022-01-02']),
... 'a': [1, 2]
... })
>>> df2 = df2.set_index(['address', 'date'])
>>>
>>> df1.index.dtypes
address Int64
date object
dtype: object
>>>
>>> df2.index.dtypes
address Int64
date datetime64[ns]
dtype: object
>>>
>>> df1.index.union(df2.index).dtypes
address int64
date datetime64[ns]
dtype: object
Comment From: phofl
Sorry, maybe my comment was not clear enough. I meant: This is buggy for our own extension arrays too.
Comment From: ssche
>>> import pandas as pd
>>>
>>>
>>> df1 = pd.DataFrame({
... 'address': pd.Series([1, 2], dtype="Int64"),
... 'date': ['2022-01-01', '2022-01-02'],
... 'a': [1, 2]
... })
>>> df1 = df1.set_index(['address', 'date'])
>>>
>>> df2 = pd.DataFrame({
... 'address': pd.Series([1, 2], dtype="Int64"),
... 'date': pd.to_datetime(['2022-01-01', '2022-01-02']),
... 'a': [1, 2]
... })
>>> df2 = df2.set_index(['address', 'date'])
>>>
>>> df1.index.dtypes
address Int64
date object
dtype: object
>>>
>>> df2.index.dtypes
address Int64
date datetime64[ns]
dtype: object
>>>
>>> df1.index.union(df2.index).dtypes
<stdin>:1: RuntimeWarning: The values in the array are unorderable. Pass `sort=False` to suppress this warning.
address Int64
date object
dtype: object
>>>
>>> pd.__version__
'1.6.0.dev0+328.g7aa391ead'
This seems to be resolved now. I might throw in a test case to peg down the behaviour.
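A test along these lines might look like the sketch below. It is not the test that exists in pandas; the function name is made up, and the data simply mirrors the session above. It uses `sort=False`, as the warning message suggests, so the unorderable `date` values do not get in the way, and it only checks the `address` level.

```python
import pandas as pd


def check_union_keeps_int64_level():
    # Hypothetical check; the frames are the same as in the session above.
    df1 = pd.DataFrame({
        "address": pd.Series([1, 2], dtype="Int64"),
        "date": ["2022-01-01", "2022-01-02"],
        "a": [1, 2],
    }).set_index(["address", "date"])
    df2 = pd.DataFrame({
        "address": pd.Series([1, 2], dtype="Int64"),
        "date": pd.to_datetime(["2022-01-01", "2022-01-02"]),
        "a": [1, 2],
    }).set_index(["address", "date"])

    result = df1.index.union(df2.index, sort=False)

    # The shared EA level should keep its nullable dtype,
    # not fall back to int64 or object.
    assert str(result.dtypes["address"]) == "Int64"
    assert str(result.get_level_values("address").dtype) == "Int64"


check_union_keeps_int64_level()
```

The assertions are expected to hold on builds that include the fix, such as the dev version shown above; on the version the issue was filed against, the first assertion would be the one to fail.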
Comment From: phofl
Thanks for following up. This is fixed and tested :)
Comment From: ssche
Great work! I didn’t realise this issue was explicitly fixed (as part of another ticket). I thought it might have been fixed as a (pleasant) side effect of another fix (hence my suggestion). Do you recall which ticket/commit this ticket relates to?
Comment From: phofl
We did a couple of PRs in that area. You can check the whatsnew for 2.0; they are documented there.