Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np
print("Pandas version: ", pd.__version__) # mine: 0.21.0
print("Numpy version: ", np.__version__) # mine: 1.13.3

# generate some data
data = np.random.randn(1,20)
target = np.array( ['Test'] * len(data) ).reshape(-1,1)

data_and_target = np.hstack((data, target))
data_and_target_cols = ["c" + str(i) for i in range(data.shape[1])] + ['target']
data_and_target_df = pd.DataFrame(data_and_target, columns=data_and_target_cols)

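# Index.difference drops 'target' but also returns the remaining labels in lexical order (c0, c1, c10, ...)
print(data_and_target_df[data_and_target_df.columns.difference(['target'])].columns)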

# the column order I would have expected / "workaround"
print(data_and_target_df[data_and_target_df.columns.values[~data_and_target_df.columns.str.contains('target')]].columns )

Problem description

  • I assumed that DataFrame.columns.difference behaves like the longer expression given above under "workaround".

  • It took me a long time to discover that Index.difference not only removes the given labels but also sorts the remaining ones in lexical order.

  • This can cause major issues, e.g. when converting the table to a NumPy array via DataFrame.values without tracking the column order.

  • It took me a long time to figure out that the degraded performance in my system was due to this call. I now know that the sorting is a documented feature, but it may not be ideal behavior.

  • By reporting this, I would like to open the behavior up for discussion.
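To make the points above concrete, here is a minimal sketch on a small frame. The frame, its values, and the variable names are made up for illustration; only the column naming follows the sample above. It contrasts the default result of Index.difference with two ways of dropping a column that keep the original order.

```python
import pandas as pd

# Hypothetical frame: 12 numbered columns plus 'target' (names mirror the report).
cols = ["c" + str(i) for i in range(12)] + ["target"]
df = pd.DataFrame([[0] * len(cols)], columns=cols)

# Index.difference removes 'target' and, by default, sorts what is left.
sorted_cols = list(df.columns.difference(["target"]))

# Order-preserving alternatives that do not go through Index.difference.
masked_cols = list(df.columns[df.columns != "target"])    # boolean mask on the Index
dropped_cols = list(df.drop("target", axis=1).columns)    # drop the column by label

print(sorted_cols[:4])   # lexical order
print(masked_cols[:4])   # original order
```

With the sorted variant, 'c2' ends up after 'c11', so an array obtained through DataFrame.values no longer has its columns in the positions the original frame suggests.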

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.21.0
pytest: None
pip: 9.0.1
setuptools: 33.1.1.post20170320
Cython: None
numpy: 1.13.3
scipy: 0.19.0
pyarrow: None
xarray: None
IPython: 5.3.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.5
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Comment From: jreback

duplicate of https://github.com/pandas-dev/pandas/issues/17839

we are going to add a sort= option. not sure we can change the default though.

this IS documented.
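For readers on a later pandas release: to my knowledge, Index.difference gained the sort parameter mentioned here in pandas 0.24. The sketch below assumes such a version is installed (it will not run on the 0.21.0 used in the report), and the index contents are made up to mirror the sample.

```python
import pandas as pd

# Hypothetical index: 12 numbered labels plus 'target'.
cols = ["c" + str(i) for i in range(12)] + ["target"]
idx = pd.Index(cols)

# Assumption: the installed pandas accepts sort= on Index.difference.
unsorted_diff = list(idx.difference(["target"], sort=False))  # no sorting applied
default_diff = list(idx.difference(["target"]))               # default still sorts

print(unsorted_diff[:4])
print(default_diff[:4])
```

The default is unchanged in this sketch, matching the comment that changing it may not be possible; sort=False has to be requested explicitly.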

Comment From: scholz

Thanks Jeff - sorry I overlooked the dupe.