From the #8626 discussion (cc @fkaufer):
Allow pd.factorize to take a DataFrame and process it as tuples of columns.
Further, allow DataFrame.factorize with a subset argument, which just calls pd.factorize.
An implementation is below (simple fast-zipping, then factorizing after dense conversion); it just needs some tests.
In [41]: df = pd.DataFrame({'A':['a1','a1','a2','a2','a1'], 'B':['b1','b2','b1','b2','b1']})
In [42]: df
Out[42]:
    A   B
0  a1  b1
1  a1  b2
2  a2  b1
3  a2  b2
4  a1  b1
In [43]: cols_as_tuples = pd.lib.fast_zip([df[col].get_values() for col in df.columns])
In [44]: cols_as_tuples
Out[44]: array([('a1', 'b1'), ('a1', 'b2'), ('a2', 'b1'), ('a2', 'b2'), ('a1', 'b1')], dtype=object)
In [47]: pd.factorize(cols_as_tuples)
Out[47]:
(array([0, 1, 2, 3, 0]),
array([('a1', 'b1'), ('a1', 'b2'), ('a2', 'b1'), ('a2', 'b2')], dtype=object))
In [48]: pd.Categorical(cols_as_tuples)
Out[48]:
[(a1, b1), (a1, b2), (a2, b1), (a2, b2), (a1, b1)]
Categories (4, object): [(a1, b1) < (a1, b2) < (a2, b1) < (a2, b2)]
In [59]: pd.Categorical(df.to_records(index=False))
Out[59]:
[(a1, b1), (a1, b2), (a2, b1), (a2, b2), (a1, b1)]
Categories (4, object): [(a1, b1) < (a1, b2) < (a2, b1) < (a2, b2)]
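
The session above relies on pd.lib.fast_zip and Series.get_values, which appear to be no longer available in recent pandas releases. As a rough sketch of the same idea using only public API, row-wise codes can be obtained from groupby(...).ngroup(). The helper name factorize_rows and its subset parameter are hypothetical, chosen here to mirror the proposal; they are not part of pandas.

```python
import pandas as pd


def factorize_rows(df, subset=None):
    """Hypothetical helper (not a pandas API): factorize rows across columns.

    Returns (codes, uniques), where codes[i] labels the combination of
    values found in row i and uniques holds one row per distinct combination.
    Assumes the selected columns contain no missing values.
    """
    cols = list(df.columns) if subset is None else list(subset)
    # With sort=False, ngroup() numbers groups in order of first appearance,
    # which matches the ordering pd.factorize uses for 1D input.
    codes = df.groupby(cols, sort=False).ngroup().to_numpy()
    # drop_duplicates keeps the first occurrence of each combination,
    # so row k of uniques corresponds to code k.
    uniques = df[cols].drop_duplicates().reset_index(drop=True)
    return codes, uniques


df = pd.DataFrame({'A': ['a1', 'a1', 'a2', 'a2', 'a1'],
                   'B': ['b1', 'b2', 'b1', 'b2', 'b1']})

codes, uniques = factorize_rows(df)
codes_a, uniques_a = factorize_rows(df, subset=['A'])
```

Returning the uniques as a DataFrame rather than an array of tuples is a design choice of this sketch: it keeps the column names and dtypes, and uniques.iloc[codes] rebuilds the selected columns.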
Comment From: simonjayhawkins
@jreback: just a thought. After reading #12860: if factorize were to be implemented on a DataFrame, it would be necessary to distinguish between sharing a category between columns and creating a category across columns. A subset argument alone may be insufficient.
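
To illustrate the two readings, here is a minimal sketch on a made-up frame. The variable names are illustrative only, and neither behaviour is something DataFrame provides as a factorize method.

```python
import pandas as pd

df = pd.DataFrame({'A': ['x', 'y', 'x'], 'B': ['y', 'x', 'z']})

# Reading 1: a category shared between columns.
# All values are pooled, so 'x' gets the same code in A and in B.
flat_codes, shared_uniques = pd.factorize(df.to_numpy().ravel())
shared_codes = flat_codes.reshape(df.shape)

# Reading 2: a category created across columns.
# Each distinct (A, B) row combination gets one code.
row_codes = df.groupby(list(df.columns), sort=False).ngroup().to_numpy()
```

The first reading yields one code per cell, with the same shape as the frame; the second yields one code per row.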
Comment From: jbrockmendel
Not obvious to me what the use case is here.
Comment From: mroeschke
Yeah, agreed. Especially since factorize has centralized around 1D input, I think this wouldn't fit well. Closing.