Code Sample, a copy-pastable example if possible
This SO question asks how to recode strings in a data frame as numerical categories: http://stackoverflow.com/questions/39475187/how-to-speed-up-recoding-into-integers .
The pandas solution x = df.apply(lambda x: x.astype('category').cat.codes) is by far the fastest. However, it doesn't give a consistent answer if the data frame has more than one column: the same string can get a different code in each column.
E.g.
g,k
a,h
c,i
j,e
d,i
i,h
b,b
d,d
i,a
d,h
gets recoded to:
   0  1
0  4  6
1  0  4
2  2  5
3  6  3
4  3  5
5  5  4
6  1  1
7  3  2
8  5  0
9  3  4
Notice that 'd' is mapped to 3 in the first column but 2 in the second.
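For anyone who wants to reproduce this, here is a minimal sketch. It assumes the ten rows above are loaded into a two-column data frame with default integer column labels; the names `df` and `codes` are only illustrative.

```python
import pandas as pd

# The ten example rows from above, one string per cell.
df = pd.DataFrame(
    [list("gk"), list("ah"), list("ci"), list("je"), list("di"),
     list("ih"), list("bb"), list("dd"), list("ia"), list("dh")]
)

# Each column is converted to its own categorical, so the codes are
# assigned from that column's own sorted unique values.
codes = df.apply(lambda col: col.astype("category").cat.codes)
print(codes)

# Look up the code given to 'd' in each column separately.
print(codes.loc[df[0] == "d", 0].unique())
print(codes.loc[df[1] == "d", 1].unique())
```

The two lookups at the end show that 'd' ends up with a different code in each column, because each column builds its categories independently.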
It would be great if pandas could do this recoding consistently.
Expected Output
output of pd.show_versions()
Comment From: jorisvandenbossche
@lesshaste The fact that df.apply(lambda x: x.astype('category').cat.codes) does this column by column is expected.
But see https://github.com/pydata/pandas/issues/12860 for some discussion on how to be able to do this on multiple columns at once (using the same categories for all columns).
The workaround listed over there is:
uniques = np.sort(pd.unique(df.values.ravel()))
df.apply(lambda x: x.astype('category', categories=uniques))
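A note for readers on newer pandas: as far as I know, later releases no longer accept the `categories=` keyword in `astype`, and the snippet above returns categorical columns rather than integer codes. The following is a sketch of the same idea, assuming a shared `pd.CategoricalDtype` is an acceptable substitute; the data frame is the example from the issue and the name `shared` is only illustrative.

```python
import numpy as np
import pandas as pd

# Example data from the issue: two columns of single-letter strings.
df = pd.DataFrame(
    [list("gk"), list("ah"), list("ci"), list("je"), list("di"),
     list("ih"), list("bb"), list("dd"), list("ia"), list("dh")]
)

# Collect the sorted unique values across ALL columns, as in the workaround.
uniques = np.sort(pd.unique(df.to_numpy().ravel()))

# One dtype shared by every column, so a given string always gets the same code.
shared = pd.CategoricalDtype(categories=uniques)
codes = df.apply(lambda col: col.astype(shared).cat.codes)
print(codes)
```

With the shared categories, 'd' is coded as 3 in both columns. Note that this sketch assumes there are no missing values; sorting a mixed array of strings and NaN would need extra handling.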
Comment From: lesshaste
That is very nice and I had no idea you could do that. Thank you.