I have tried to reduce this problem to the smallest possible example: two DataFrames are attached, each with 291 rows.

pickles.zip

df1 = pd.read_pickle('df1.pkl')
df2 = pd.read_pickle('df2.pkl')

merged = df1.merge(df2, how='outer', on=['DateTime', 'valid_datetime'])
# This produces NaT values in the join keys:
merged.shape[0] == merged[['DateTime', 'valid_datetime']].dropna().shape[0]

Note that if the DateTime and valid_datetime columns are converted to strings first, the merge works fine.
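Since the attached pickles may not be to hand, here is a minimal self-contained sketch of the same check on synthetic tz-aware data (the values are illustrative, not the attached frames), including the string-conversion workaround mentioned above:

```python
import pandas as pd

# Synthetic stand-in for the attached pickles: two frames sharing
# tz-aware join keys.
idx = pd.date_range("2015-03-17", periods=5, freq="h", tz="Europe/Berlin")
df1 = pd.DataFrame({"DateTime": idx, "valid_datetime": idx[0], "a": range(5)})
df2 = pd.DataFrame({"DateTime": idx, "valid_datetime": idx[0], "b": range(5)})

merged = df1.merge(df2, how="outer", on=["DateTime", "valid_datetime"])

# With identical keys on both sides, the outer merge should keep 5 rows
# and put no NaT in the key columns; on the affected version it did not.
keys_intact = (
    merged.shape[0]
    == merged[["DateTime", "valid_datetime"]].dropna().shape[0]
)

# Workaround: convert the keys to strings before merging.
df1_s = df1.astype({"DateTime": str, "valid_datetime": str})
df2_s = df2.astype({"DateTime": str, "valid_datetime": str})
merged_s = df1_s.merge(df2_s, how="outer", on=["DateTime", "valid_datetime"])
```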

Comment From: jreback

Looks fine to me. Please show pd.show_versions().

In [26]: pd.__version__
Out[26]: '0.18.1+374.gc3e24a1'
In [5]: pd.options.display.max_rows=10

In [6]: merged[['DateTime', 'valid_datetime']]
Out[6]: 
                     DateTime            valid_datetime
0   2015-03-31 02:00:00+02:00 2015-03-16 01:00:00+01:00
1   2015-03-17 00:00:00+01:00 2015-03-17 01:00:00+01:00
2   2015-03-17 01:00:00+01:00 2015-03-17 01:00:00+01:00
3   2015-03-17 02:00:00+01:00 2015-03-17 01:00:00+01:00
4   2015-03-17 03:00:00+01:00 2015-03-17 01:00:00+01:00
..                        ...                       ...
577 2015-03-13 07:00:00+01:00 2015-02-27 01:00:00+01:00
578 2015-03-13 08:00:00+01:00 2015-02-27 01:00:00+01:00
579 2015-03-13 09:00:00+01:00 2015-02-27 01:00:00+01:00
580 2015-03-13 10:00:00+01:00 2015-02-27 01:00:00+01:00
581 2015-03-13 11:00:00+01:00 2015-02-27 01:00:00+01:00

[582 rows x 2 columns]

In [7]: merged[['DateTime', 'valid_datetime']].dropna()
Out[7]: 
                     DateTime            valid_datetime
0   2015-03-31 02:00:00+02:00 2015-03-16 01:00:00+01:00
1   2015-03-17 00:00:00+01:00 2015-03-17 01:00:00+01:00
2   2015-03-17 01:00:00+01:00 2015-03-17 01:00:00+01:00
3   2015-03-17 02:00:00+01:00 2015-03-17 01:00:00+01:00
4   2015-03-17 03:00:00+01:00 2015-03-17 01:00:00+01:00
..                        ...                       ...
577 2015-03-13 07:00:00+01:00 2015-02-27 01:00:00+01:00
578 2015-03-13 08:00:00+01:00 2015-02-27 01:00:00+01:00
579 2015-03-13 09:00:00+01:00 2015-02-27 01:00:00+01:00
580 2015-03-13 10:00:00+01:00 2015-02-27 01:00:00+01:00
581 2015-03-13 11:00:00+01:00 2015-02-27 01:00:00+01:00

[582 rows x 2 columns]

Note that you don't have any matches.
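A quick way to see how many rows actually matched (a minimal sketch on synthetic data, not the attached frames) is merge's indicator=True option, which labels each output row by which side it came from:

```python
import pandas as pd

# Two frames with one overlapping tz-aware key (illustrative data).
left = pd.DataFrame({
    "k": pd.to_datetime(["2015-03-17", "2015-03-18"]).tz_localize("Europe/Berlin"),
    "a": [1, 2],
})
right = pd.DataFrame({
    "k": pd.to_datetime(["2015-03-18", "2015-03-19"]).tz_localize("Europe/Berlin"),
    "b": [3, 4],
})

merged = pd.merge(left, right, how="outer", on="k", indicator=True)

# _merge is a categorical with values "left_only", "right_only", "both";
# the counts show how many keys matched across the two frames.
counts = merged["_merge"].value_counts()
```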

Comment From: jreback

Please reopen if you have more information and the problem reproduces.

Comment From: nosniborj

Thanks for the very quick response! I'm still having the same problem:

Python 3.5.2 |Anaconda 4.0.0 (32-bit)| (default, Jul  5 2016, 11:45:57) [MSC v.1900 32 bit (Intel)] on win32
In[2]: import pandas as pd
Backend Qt4Agg is interactive backend. Turning interactive mode on.
In[3]: df1 = pd.read_pickle('../df1.pkl')
In[4]: df2 = pd.read_pickle('../df2.pkl')
In[5]: merged = df1.merge(df2, how='outer', on=['DateTime', 'valid_datetime'])
In[6]: merged.shape[0] == merged[['DateTime', 'valid_datetime']].dropna().shape[0]
Out[6]: False
In[7]: pd.options.display.max_rows=10
In[8]: merged[['DateTime', 'valid_datetime']]
Out[8]: 
                     DateTime            valid_datetime
0   2015-03-31 02:00:00+02:00 2015-03-16 01:00:00+01:00
1   2015-03-17 01:00:00+01:00 2015-03-17 01:00:00+01:00
2   2015-03-17 02:00:00+01:00 2015-03-17 01:00:00+01:00
3   2015-03-17 03:00:00+01:00 2015-03-17 01:00:00+01:00
4   2015-03-17 04:00:00+01:00 2015-03-17 01:00:00+01:00
..                        ...                       ...
577                       NaT 2015-02-27 01:00:00+01:00
578                       NaT 2015-02-27 01:00:00+01:00
579                       NaT 2015-02-27 01:00:00+01:00
580                       NaT 2015-02-27 01:00:00+01:00
581                       NaT 2015-02-27 01:00:00+01:00

[582 rows x 2 columns]
In[9]: merged[['DateTime', 'valid_datetime']].dropna()
Out[9]: 
                     DateTime            valid_datetime
0   2015-03-31 02:00:00+02:00 2015-03-16 01:00:00+01:00
1   2015-03-17 01:00:00+01:00 2015-03-17 01:00:00+01:00
2   2015-03-17 02:00:00+01:00 2015-03-17 01:00:00+01:00
3   2015-03-17 03:00:00+01:00 2015-03-17 01:00:00+01:00
4   2015-03-17 04:00:00+01:00 2015-03-17 01:00:00+01:00
..                        ...                       ...
286 2015-03-28 21:00:00+01:00 2015-03-17 01:00:00+01:00
287 2015-03-28 22:00:00+01:00 2015-03-17 01:00:00+01:00
288 2015-03-28 23:00:00+01:00 2015-03-17 01:00:00+01:00
289 2015-03-29 00:00:00+01:00 2015-03-17 01:00:00+01:00
290 2015-03-29 01:00:00+01:00 2015-03-17 01:00:00+01:00

[291 rows x 2 columns]

pd.show_versions() gives:

In[11]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 32
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 79 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.1
setuptools: 20.3
Cython: 0.23.4
numpy: 1.11.1
scipy: 0.18.0
statsmodels: 0.6.1
xarray: None
IPython: 4.1.2
sphinx: 1.3.5
patsy: 0.4.0
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: 0.6.7.None
psycopg2: None
jinja2: 2.8
boto: 2.39.0
pandas_datareader: None

I'm aware that there are no matches (both DataFrames are small subsets of my data).

Comment From: jreback

My example was run on master; this was fixed by https://github.com/pydata/pandas/pull/13802

Comment From: nosniborj

Ah, OK, thanks!