Code Sample
In this example, we merge multi-indexed dataframes, where one level contains datetime.date
objects. We illustrate the inconsistent behavior by comparing the output of two merge operations.
- Let us create three dataframes, df1, df2 and df3, such that
* df1 and df2 indexes overlap
* df1 and df3 indexes do not overlap
>>> import datetime as dt
>>> import pandas as pd
>>> index_1 = pd.MultiIndex(levels=[['dummy'], [dt.date(2016,1,1)]],
...                         labels=[[0], [0]], names=['i', 'j'])
>>> index_2 = pd.MultiIndex(levels=[['dummy'], [dt.date(2016,1,1)]],
... labels=[[0], [0]], names=['i', 'j'])
>>> index_3 = pd.MultiIndex(levels=[['dummy'], [dt.date(2016,1,2)]],
... labels=[[0], [0]], names=['i', 'j'])
>>> df1 = pd.DataFrame([1], index=index_1, columns=['A'])
>>> df2 = pd.DataFrame([1], index=index_2, columns=['B'])
>>> df3 = pd.DataFrame([1], index=index_3, columns=['C'])
>>> df1
                  A
i     j
dummy 2016-01-01  1
- Let us merge using the indexes.
>>> merge1 = df1.merge(df2, left_index=True, right_index=True, how='outer')
>>> merge2 = df1.merge(df3, left_index=True, right_index=True, how='outer')
>>> merge1
                  A  B
i     j
dummy 2016-01-01  1  1
>>> merge2
                    A    C
i     j
dummy 2016-01-01  1.0  NaN
      2016-01-02  NaN  1.0
- Let us examine the indexes.
>>> merge1.index[0][1]
datetime.date(2016, 1, 1)
>>> merge2.index[0][1]
Timestamp('2016-01-01 00:00:00')
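Checking only the first entry can hide mixed types further down the index. Here is a minimal sketch for listing every value type present in one level. `level_value_types` is a hypothetical helper name used for illustration, not a pandas API, and the two small indexes are built only to exercise it.

```python
import datetime as dt

import pandas as pd


def level_value_types(index, level):
    """Return the sorted type names of the values found in one index level."""
    # get_level_values expands the level to one value per row of the index
    return sorted({type(v).__name__ for v in index.get_level_values(level)})


# Two illustrative indexes: one holding datetime.date, one holding Timestamp
date_index = pd.MultiIndex.from_tuples(
    [("dummy", dt.date(2016, 1, 1))], names=["i", "j"]
)
stamp_index = pd.MultiIndex.from_tuples(
    [("dummy", pd.Timestamp("2016-01-01"))], names=["i", "j"]
)

date_types = level_value_types(date_index, "j")
stamp_types = level_value_types(stamp_index, "j")
print(date_types, stamp_types)
```

Calling `level_value_types(merge1.index, 'j')` and `level_value_types(merge2.index, 'j')` on the merge results above would show which type each index ended up with.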
Problem description
The index type of the merge output is inconsistent: merge1 keeps datetime.date in level j, while merge2 converts it to Timestamp. In my opinion, the merge operation should not change the types.
Expected Output
>>> merge2.index[0][1]
datetime.date(2016, 1, 1)
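Until the behavior is consistent, one possible workaround is to convert the affected level back to datetime.date after merging. The sketch below assumes the level holds only date-like values. `level_to_dates` is a hypothetical helper name, not part of pandas, and the sample index is made up to exercise it.

```python
import datetime as dt

import pandas as pd


def level_to_dates(index, level):
    """Return a copy of `index` whose `level` holds datetime.date objects."""
    pos = list(index.names).index(level)
    # pd.to_datetime accepts both Timestamps and datetime.date objects;
    # .date then yields an object array of datetime.date values
    as_dates = pd.to_datetime(index.levels[pos]).date
    return index.set_levels(pd.Index(as_dates, dtype=object), level=level)


# Illustrative index resembling merge2.index after the Timestamp coercion
coerced = pd.MultiIndex.from_tuples(
    [("dummy", pd.Timestamp("2016-01-01")), ("dummy", pd.Timestamp("2016-01-02"))],
    names=["i", "j"],
)
restored = level_to_dates(coerced, "j")
print(type(restored[0][1]))
```

With the dataframes above, this would be applied as `merge2.index = level_to_dates(merge2.index, 'j')`.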
Output of pd.show_versions()
Comment
This bug does not occur for single-index dataframes:
>>> df1 = pd.DataFrame([1], index=[dt.date(2016,1,1)], columns=['A'])
>>> df3 = pd.DataFrame([1], index=[dt.date(2016,1,2)], columns=['B'])
>>> df1.merge(df3, left_index=True, right_index=True, how='outer').index[0]
datetime.date(2016, 1, 1)
Comment From: jreback
datetime.date is not a first-class type and you should simply not use it. This works correctly and properly for datetime.datetime input, which is coerced to Timestamp and fully handled.
The inference machinery in all of pandas tries pretty hard to coerce datetimelikes even if they end up as object; this is why the second one is the correct result.
This has a special check for datetime.date, but it's completely non-performant and not really supported.
If you want to submit a fix, we would take it (assuming it doesn't break anything else), but we are not generally supporting datetime.date.
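Following that advice, a minimal sketch of converting the date level to Timestamp before merging, so that both merges start from the same index type. `level_to_timestamps` and `make_frame` are hypothetical helper names introduced here for illustration, not pandas functions.

```python
import datetime as dt

import pandas as pd


def level_to_timestamps(df, level):
    """Return a copy of `df` whose index `level` is converted to Timestamps."""
    idx = df.index
    pos = list(idx.names).index(level)
    out = df.copy()
    # pd.to_datetime turns the object level of datetime.date into a DatetimeIndex
    out.index = idx.set_levels(pd.to_datetime(idx.levels[pos]), level=level)
    return out


def make_frame(day, column):
    """Build a one-row frame indexed by ('dummy', day), as in the report."""
    index = pd.MultiIndex.from_tuples([("dummy", day)], names=["i", "j"])
    return pd.DataFrame([1], index=index, columns=[column])


df1 = level_to_timestamps(make_frame(dt.date(2016, 1, 1), "A"), "j")
df2 = level_to_timestamps(make_frame(dt.date(2016, 1, 1), "B"), "j")
df3 = level_to_timestamps(make_frame(dt.date(2016, 1, 2), "C"), "j")

merge1 = df1.merge(df2, left_index=True, right_index=True, how="outer")
merge2 = df1.merge(df3, left_index=True, right_index=True, how="outer")

print(type(merge1.index[0][1]).__name__, type(merge2.index[0][1]).__name__)
```

The trade-off is that the index no longer holds datetime.date objects, so code that relies on that type needs to call `.date()` on the values it reads back.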