Code Sample, a copy-pastable example if possible
import numpy as np
import pandas as pd
df1 = pd.DataFrame(data={'col1': [1.1, 1.2]},
index=pd.MultiIndex.from_product([['A'], [1.0, 2.0]],
names=['id1', 'id2']))
df2 = pd.DataFrame(data={'col2': [2.1, 2.2]},
index=pd.MultiIndex.from_product([['A'], [np.NaN, 2.0]],
names=['id1', 'id2']))
print(df1.join(df2))
Problem description
The index containing the NaN value of df2
is falsely joined with the index of df1
. Therefore, the result contains both values of df2
instead of only one value and a NaN value.
Expected Output
col1 col2
id1 id2
A 1.0 1.1 NaN
2.0 1.2 2.2
The expected result was returned by older pandas versions (successfully tested with 0.18.1 ; 0.20.3 ; 0.21.1 ; 0.22.0).
Actual Output
col1 col2
id1 id2
A 1.0 1.1 2.1
2.0 1.2 2.2
This result is returned by newer pandas versions, beginning with version 0.23.0 (tested with 0.23.0 ; 0.23.4 ; 0.24.2 ; 0.25.2).
I suppose this is a bug, or am I missing something?
Output of pd.show_versions()
Comment From: simonjayhawkins
master produces different output
>>> import numpy as np
>>> import pandas as pd
>>>
>>> pd.__version__
'0.26.0.dev0+684.g953757a3e'
>>>
>>> df1 = pd.DataFrame(
... data={"col1": [1.1, 1.2]},
... index=pd.MultiIndex.from_product([["A"], [1.0, 2.0]], names=["id1", "id2"]),
... )
>>> df2 = pd.DataFrame(
... data={"col2": [2.1, 2.2]},
... index=pd.MultiIndex.from_product([["A"], [np.NaN, 2.0]], names=["id1", "id2"]),
... )
>>> print(df1.join(df2))
col1 col2
id1 id2
A 1.0 1.1 NaN
2.0 1.2 NaN
>>>
Comment From: hanudev
In case someone else has the same problem, the following gives the expected output:
print(df1.join(df2, how='outer').loc[df1.index])
This is of course only a workaround, which is noticeably slower than a direct join.
Comment From: ezwelty
pandas 1.4.3
I just ran into this problem, and I would argue that it is more serious than first meets the eye. The behavior is inconsistent between pd.Index
(correct) and pd.MultiIndex
(wrong) and for pd.MultiIndex
, between pd.DataFrame.join()
(wrong) and pd.DataFrame.merge()
(correct).
Say we want to join the following dataframes, and one has a missing value in the x
column. (I use float64
here, but the behavior is the same with Float64
, Int64
, ... and pd.NA
).
import pandas as pd
dfa = pd.DataFrame({
'x': [1.0, float('nan')],
'y': [1.0, 2.0]
})
dfb = pd.DataFrame({
'x': [1.0, 2.0],
'y': [1.0, 2.0],
'z': [10.0, 20.0]
})
Joining on just x
(with pd.Index
) works as expected.
key = ['x']
ia = pd.Index(dfa['x'])
ib = pd.Index(dfb['x'])
ia.difference(ib).tolist()
# [nan]
ia.intersection(ib).tolist()
# [1.0]
dfa.set_index(key).join(dfb.set_index(key), how='inner', rsuffix='b')
# y yb z
# x
# 1.0 1.0 1.0 10.0
dfa.merge(dfb, left_on=key, right_on=key, suffixes=('', 'b'))
# x y yb z
# 0 1.0 1.0 1.0 10.0
But joining on both x
and y
(with pd.MultiIndex
) fails for pd.DataFrame.join
somehow works correctly for pd.DataFrame.merge
:
key = ['x', 'y']
ia = pd.MultiIndex.from_frame(dfa[key])
ib = pd.MultiIndex.from_frame(dfb[key])
ia.difference(ib).tolist()
# []
ia.intersection(ib)
# [(1.0, 1.0), (nan, 2.0)]
dfa.set_index(key).join(dfb.set_index(key), how='inner')
# z
# x y
# 1.0 1.0 10.0
# NaN 2.0 NaN
dfa.merge(dfb, how='inner')
# x y z
# 0 1.0 1.0 10.0