Pandas 3 way joins column suffix looks wrong

Code Sample

import pandas as pd
import numpy as np


dataframe1 = pd.DataFrame(data={'a': np.random.rand(100), 'b': np.ones(100)})

dataframe2 = pd.DataFrame(data={'a': np.random.rand(100), 'b': np.ones(100)})

dataframe3 = pd.DataFrame(data={'a': np.random.rand(100), 'b': np.ones(100)})

joined_dataframe = dataframe1.join(dataframe2, lsuffix='_first', rsuffix='_second').join(dataframe3, lsuffix='_nothing', rsuffix='_third')

print joined_dataframe.columns

This outputs the columns on the final dataframe. These have the wrong suffixes. The above script prints the following with pandas 0.21.0

Index([u'a_first', u'b_first', u'a_second', u'b_second', u'a', u'b'], dtype='object')

Problem description

In my above example, the first join operation gives the correct column suffixes. But the second operation does not. In my experience, it does not matter how you divide them steps, the second join operation never assigns the correct column suffixes.

The joined_dataframe should of course have columns which reflect the suffixes in my second join, namely "_nothing" on columns from dataframe 1 and 2, and "_third" on columns from dataframe 3.

Expected Output

Output of `pd.show_versions()`

pd.show_versions()

INSTALLED VERSIONS

commit: None python: 2.7.12.final.0 python-bits: 64 OS: Linux OS-release: 4.10.0-40-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: None.None

pandas: 0.21.0 pytest: 3.2.5 pip: 8.1.1 setuptools: 20.7.0 Cython: None numpy: 1.13.3 scipy: 1.0.0 pyarrow: None xarray: None IPython: None sphinx: None patsy: None dateutil: 2.6.1 pytz: 2017.3 blosc: None bottleneck: None tables: 3.4.2 numexpr: 2.6.4 feather: None matplotlib: 2.1.0 openpyxl: None xlrd: None xlwt: None xlsxwriter: None lxml: None bs4: None html5lib: 0.999 sqlalchemy: 1.1.15 pymysql: None psycopg2: None jinja2: None s3fs: None fastparquet: None pandas_gbq: None pandas_datareader: None

Comment From: jschendel

This is the expected behavior. Per the documentation of DataFrame.join, lsuffix and rsuffix are only applied to overlapping columns. There are no overlapping columns at the time you're performing the join against dataframe3, as you're joining against the joined version of dataframe1 and dataframe2, which have had their columns renamed per the lsuffix and rsuffix specifications of the first join.

In other words, what you're doing is essentially being broken down analogous to:

temp = dataframe1.join(dataframe2, lsuffix='_first', rsuffix='_second').
joined_dataframe = temp.join(dataframe3, lsuffix='_nothing', rsuffix='_third')

Where temp and dataframe3 have no overlapping columns, and hence lsuffix and rsuffix aren't being applied.

Comment From: jschendel

xref #17834, which I think is proposing something similar to what you'd need

Comment From: jorisvandenbossche

Something you can do in the meantime to solve this is using concat and massaging the columns afterwards a bit:

In [22]: res = pd.concat([dataframe1, dataframe2, dataframe3], axis=1, keys=['first', 'second', 'third'])

In [23]: res.columns = [i + '_' + j for i, j in res.columns.swaplevel(0, 1)]

In [24]: res.head()
Out[24]: 
    a_first  b_first  a_second  b_second   a_third  b_third
0  0.369871      1.0  0.318002       1.0  0.676463      1.0
1  0.573191      1.0  0.877985       1.0  0.419603      1.0
2  0.810390      1.0  0.216200       1.0  0.198755      1.0
3  0.782245      1.0  0.968843       1.0  0.199693      1.0
4  0.162413      1.0  0.522495       1.0  0.852937      1.0

Comment From: jorisvandenbossche

Let's close this issue, as indeed a possible solution is #17834, so let's continue the discussion whether we want to add something like that or not there.

Comment From: barisser

Makes sense. Thanks.