Code Sample, a copy-pastable example if possible
import pandas as pd
import numpy as np
print("Pandas version: ", pd.__version__) # mine: 0.21.0
print("Numpy version: ", np.__version__) # mine: 1.13.3
# generate some data
data = np.random.randn(1,20)
target = np.array( ['Test'] * len(data) ).reshape(-1,1)
data_and_target = np.hstack((data, target))
data_and_target_cols = ["c" + str(i) for i in range(data.shape[1])] + ['target']
data_and_target_df = pd.DataFrame(data_and_target, columns=data_and_target_cols)
# Index.difference drops 'target' but also returns the remaining columns sorted lexically
print(data_and_target_df[data_and_target_df.columns.difference(['target'])].columns)
# the column order I would have expected / "workaround": original order, 'target' removed
print(data_and_target_df[data_and_target_df.columns.values[~data_and_target_df.columns.str.contains('target')]].columns)
Problem description
I assumed that DataFrame.columns.difference (i.e. Index.difference) behaves the same as the longer expression given above under "workaround".
It took me a long time to discover that Index.difference not only removes the column but also sorts the remaining columns in lexical order.
This can cause major issues, e.g. when converting the table to a NumPy array via DataFrame.values without tracking the column order.
It took me a long time to figure out that the degraded performance in my system was due to this call. I now know that the sorting is documented behavior, but it may not be the ideal default.
I am reporting this to put it up for discussion.
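To make the difference concrete, here is a minimal sketch contrasting Index.difference with two order-preserving alternatives. The frame and its column names are made up for illustration, and DataFrame.drop(columns=...) assumes pandas 0.21 or later.

```python
import pandas as pd

# Hypothetical frame: 12 feature columns plus a 'target' column.
cols = ["c" + str(i) for i in range(12)] + ["target"]
df = pd.DataFrame([list(range(12)) + ["Test"]], columns=cols)

# Index.difference drops 'target' but also sorts the remaining labels lexically.
sorted_cols = df.columns.difference(["target"])

# A boolean mask built with Index.isin drops 'target' and keeps the original order.
ordered_cols = df.columns[~df.columns.isin(["target"])]

# DataFrame.drop also keeps the original column order (columns= needs pandas >= 0.21).
features = df.drop(columns=["target"])

print(list(sorted_cols)[:4])   # ['c0', 'c1', 'c10', 'c11']
print(list(ordered_cols)[:4])  # ['c0', 'c1', 'c2', 'c3']
```

With ten or more numbered columns the lexical sort places c10 before c2, so an array taken from DataFrame.values after Index.difference no longer lines up with the original feature order.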
Output of pd.show_versions()
Comment From: jreback
duplicate of https://github.com/pandas-dev/pandas/issues/17839
we are going to add a sort= option. not sure we can change the default though.
this IS documented.
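For readers landing here later: a rough sketch of how such an option can be used, assuming a pandas version in which Index.difference accepts a sort keyword (to my knowledge 0.24 and later). The index contents are made up for illustration.

```python
import pandas as pd

cols = pd.Index(["c0", "c1", "c2", "c10", "c11", "target"])

# Default behavior: the result is sorted lexically where possible.
default_diff = cols.difference(["target"])

# Assumption: sort=False is available (pandas >= 0.24); the result is left unsorted.
unsorted_diff = cols.difference(["target"], sort=False)

print(list(default_diff))
print(list(unsorted_diff))
```

On older versions the keyword is not available, and a mask such as cols[~cols.isin(["target"])] is one way to keep the original order.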
Comment From: scholz
Thanks Jeff - sorry I overlooked the dupe.