Code Sample, a copy-pastable example if possible
import numpy as np
import pandas as pd
df1 = pd.DataFrame(data={'col1': [1.1, 1.2]},
index=pd.MultiIndex.from_product([['A'], [1.0, 2.0]],
names=['id1', 'id2']))
df2 = pd.DataFrame(data={'col2': [2.1, 2.2]},
index=pd.MultiIndex.from_product([['A'], [np.NaN, 2.0]],
names=['id1', 'id2']))
print(df1.join(df2))
Problem description
The index containing the NaN value of df2 is falsely joined with the index of df1. Therefore, the result contains both values of df2 instead of only one value and a NaN value.
Expected Output
col1 col2
id1 id2
A 1.0 1.1 NaN
2.0 1.2 2.2
The expected result was returned by older pandas versions (successfully tested with 0.18.1 ; 0.20.3 ; 0.21.1 ; 0.22.0).
Actual Output
col1 col2
id1 id2
A 1.0 1.1 2.1
2.0 1.2 2.2
This result is returned by newer pandas versions, beginning with version 0.23.0 (tested with 0.23.0 ; 0.23.4 ; 0.24.2 ; 0.25.2).
I suppose this is a bug, or am I missing something?
Output of pd.show_versions()
Comment From: simonjayhawkins
master produces different output
>>> import numpy as np
>>> import pandas as pd
>>>
>>> pd.__version__
'0.26.0.dev0+684.g953757a3e'
>>>
>>> df1 = pd.DataFrame(
... data={"col1": [1.1, 1.2]},
... index=pd.MultiIndex.from_product([["A"], [1.0, 2.0]], names=["id1", "id2"]),
... )
>>> df2 = pd.DataFrame(
... data={"col2": [2.1, 2.2]},
... index=pd.MultiIndex.from_product([["A"], [np.NaN, 2.0]], names=["id1", "id2"]),
... )
>>> print(df1.join(df2))
col1 col2
id1 id2
A 1.0 1.1 NaN
2.0 1.2 NaN
>>>
Comment From: hanudev
In case someone else has the same problem, the following gives the expected output:
print(df1.join(df2, how='outer').loc[df1.index])
This is of course only a workaround, which is noticeably slower than a direct join.
Comment From: ezwelty
pandas 1.4.3
I just ran into this problem, and I would argue that it is more serious than first meets the eye. The behavior is inconsistent between pd.Index (correct) and pd.MultiIndex (wrong) and for pd.MultiIndex, between pd.DataFrame.join() (wrong) and pd.DataFrame.merge() (correct).
Say we want to join the following dataframes, and one has a missing value in the x column. (I use float64 here, but the behavior is the same with Float64, Int64, ... and pd.NA).
import pandas as pd
dfa = pd.DataFrame({
'x': [1.0, float('nan')],
'y': [1.0, 2.0]
})
dfb = pd.DataFrame({
'x': [1.0, 2.0],
'y': [1.0, 2.0],
'z': [10.0, 20.0]
})
Joining on just x (with pd.Index) works as expected.
key = ['x']
ia = pd.Index(dfa['x'])
ib = pd.Index(dfb['x'])
ia.difference(ib).tolist()
# [nan]
ia.intersection(ib).tolist()
# [1.0]
dfa.set_index(key).join(dfb.set_index(key), how='inner', rsuffix='b')
# y yb z
# x
# 1.0 1.0 1.0 10.0
dfa.merge(dfb, left_on=key, right_on=key, suffixes=('', 'b'))
# x y yb z
# 0 1.0 1.0 1.0 10.0
But joining on both x and y (with pd.MultiIndex) fails for pd.DataFrame.join somehow works correctly for pd.DataFrame.merge:
key = ['x', 'y']
ia = pd.MultiIndex.from_frame(dfa[key])
ib = pd.MultiIndex.from_frame(dfb[key])
ia.difference(ib).tolist()
# []
ia.intersection(ib)
# [(1.0, 1.0), (nan, 2.0)]
dfa.set_index(key).join(dfb.set_index(key), how='inner')
# z
# x y
# 1.0 1.0 10.0
# NaN 2.0 NaN
dfa.merge(dfb, how='inner')
# x y z
# 0 1.0 1.0 10.0