Code Sample
import pandas as pd
import numpy as np
dataframe1 = pd.DataFrame(data={'a': np.random.rand(100), 'b': np.ones(100)})
dataframe2 = pd.DataFrame(data={'a': np.random.rand(100), 'b': np.ones(100)})
dataframe3 = pd.DataFrame(data={'a': np.random.rand(100), 'b': np.ones(100)})
joined_dataframe = dataframe1.join(dataframe2, lsuffix='_first', rsuffix='_second').join(dataframe3, lsuffix='_nothing', rsuffix='_third')
print joined_dataframe.columns
This outputs the columns on the final dataframe. These have the wrong suffixes. The above script prints the following with pandas 0.21.0
Index([u'a_first', u'b_first', u'a_second', u'b_second', u'a', u'b'], dtype='object')
Problem description
In my above example, the first join operation gives the correct column suffixes. But the second operation does not. In my experience, it does not matter how you divide them steps, the second join operation never assigns the correct column suffixes.
The joined_dataframe should of course have columns which reflect the suffixes in my second join, namely "_nothing" on columns from dataframe 1 and 2, and "_third" on columns from dataframe 3.
Expected Output
Output of pd.show_versions()
pd.show_versions()
INSTALLED VERSIONS
commit: None python: 2.7.12.final.0 python-bits: 64 OS: Linux OS-release: 4.10.0-40-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: None.None
pandas: 0.21.0 pytest: 3.2.5 pip: 8.1.1 setuptools: 20.7.0 Cython: None numpy: 1.13.3 scipy: 1.0.0 pyarrow: None xarray: None IPython: None sphinx: None patsy: None dateutil: 2.6.1 pytz: 2017.3 blosc: None bottleneck: None tables: 3.4.2 numexpr: 2.6.4 feather: None matplotlib: 2.1.0 openpyxl: None xlrd: None xlwt: None xlsxwriter: None lxml: None bs4: None html5lib: 0.999 sqlalchemy: 1.1.15 pymysql: None psycopg2: None jinja2: None s3fs: None fastparquet: None pandas_gbq: None pandas_datareader: None
Comment From: jschendel
This is the expected behavior. Per the documentation of DataFrame.join
, lsuffix
and rsuffix
are only applied to overlapping columns. There are no overlapping columns at the time you're performing the join
against dataframe3
, as you're joining against the joined version of dataframe1
and dataframe2
, which have had their columns renamed per the lsuffix
and rsuffix
specifications of the first join
.
In other words, what you're doing is essentially being broken down analogous to:
temp = dataframe1.join(dataframe2, lsuffix='_first', rsuffix='_second').
joined_dataframe = temp.join(dataframe3, lsuffix='_nothing', rsuffix='_third')
Where temp
and dataframe3
have no overlapping columns, and hence lsuffix
and rsuffix
aren't being applied.
Comment From: jschendel
xref #17834, which I think is proposing something similar to what you'd need
Comment From: jorisvandenbossche
Something you can do in the meantime to solve this is using concat and massaging the columns afterwards a bit:
In [22]: res = pd.concat([dataframe1, dataframe2, dataframe3], axis=1, keys=['first', 'second', 'third'])
In [23]: res.columns = [i + '_' + j for i, j in res.columns.swaplevel(0, 1)]
In [24]: res.head()
Out[24]:
a_first b_first a_second b_second a_third b_third
0 0.369871 1.0 0.318002 1.0 0.676463 1.0
1 0.573191 1.0 0.877985 1.0 0.419603 1.0
2 0.810390 1.0 0.216200 1.0 0.198755 1.0
3 0.782245 1.0 0.968843 1.0 0.199693 1.0
4 0.162413 1.0 0.522495 1.0 0.852937 1.0
Comment From: jorisvandenbossche
Let's close this issue, as indeed a possible solution is #17834, so let's continue the discussion whether we want to add something like that or not there.
Comment From: barisser
Makes sense. Thanks.