Pandas ENH: pd.factorize to accept a DataFrame

From the discussion in #8626. cc @fkaufer

Allow pd.factorize to take a DataFrame and process its rows as tuples of column values. Further, allow DataFrame.factorize with a subset argument, which just calls pd.factorize on the selected columns.

An implementation sketch is below (fast-zip the columns into tuples after dense conversion, then factorize). It just needs some tests.

In [41]: df = pd.DataFrame({'A':['a1','a1','a2','a2','a1'], 'B':['b1','b2','b1','b2','b1']})

In [42]: df
Out[42]: 
    A   B
0  a1  b1
1  a1  b2
2  a2  b1
3  a2  b2
4  a1  b1

In [43]: cols_as_tuples = pd.lib.fast_zip([df[col].get_values() for col in df.columns])

In [44]: cols_as_tuples 
Out[44]: array([('a1', 'b1'), ('a1', 'b2'), ('a2', 'b1'), ('a2', 'b2'), ('a1', 'b1')], dtype=object)

In [47]: pd.factorize(cols_as_tuples)
Out[47]: 
(array([0, 1, 2, 3, 0]),
 array([('a1', 'b1'), ('a1', 'b2'), ('a2', 'b1'), ('a2', 'b2')], dtype=object))

In [48]: pd.Categorical(cols_as_tuples)
Out[48]: 
[(a1, b1), (a1, b2), (a2, b1), (a2, b2), (a1, b1)]
Categories (4, object): [(a1, b1) < (a1, b2) < (a2, b1) < (a2, b2)]

In [59]: pd.Categorical(df.to_records(index=False))
Out[59]: 
[(a1, b1), (a1, b2), (a2, b1), (a2, b2), (a1, b1)]
Categories (4, object): [(a1, b1) < (a1, b2) < (a2, b1) < (a2, b2)]
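
For readers on current pandas, where `pd.lib.fast_zip` and `Series.get_values` are no longer available, the same idea can be sketched with public API. This is only an illustration, not the proposed implementation: `factorize_rows` is a hypothetical helper name, and it swaps the fast-zip approach for `groupby(...).ngroup()` plus `drop_duplicates()`. Handling of missing values is not addressed here.

```python
import pandas as pd


def factorize_rows(df, subset=None):
    """Hypothetical helper: give each distinct row (over `subset`) an integer code."""
    cols = list(df.columns) if subset is None else list(subset)
    # With sort=False, groups are numbered in order of first appearance.
    codes = df.groupby(cols, sort=False).ngroup().to_numpy()
    # drop_duplicates keeps the first occurrence, so row i of `uniques`
    # is the column combination that received code i.
    uniques = df[cols].drop_duplicates().reset_index(drop=True)
    return codes, uniques


df = pd.DataFrame({'A': ['a1', 'a1', 'a2', 'a2', 'a1'],
                   'B': ['b1', 'b2', 'b1', 'b2', 'b1']})

codes, uniques = factorize_rows(df)
codes_a, uniques_a = factorize_rows(df, subset=['A'])
```

Here `codes` matches the `[0, 1, 2, 3, 0]` labels from the session above, and passing `subset=['A']` restricts the factorization to that column only.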

Comment From: simonjayhawkins

@jreback: just a thought. After reading #12860, if factorize were to be implemented on a DataFrame, it would be necessary to distinguish between sharing one set of categories between columns and creating categories across columns (one per distinct row). Using a subset argument alone may be insufficient.
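
To make the distinction concrete, here is a rough sketch of the two interpretations using existing public API. The frame and variable names are made up for illustration, and this is one possible reading of the two cases rather than a proposed design.

```python
import pandas as pd

df = pd.DataFrame({'x': ['a', 'b', 'a'], 'y': ['b', 'c', 'c']})

# (1) Categories shared between columns: flatten all values into one 1-D
# array so that equal values get the same code regardless of column.
flat_codes, shared_uniques = pd.factorize(df.to_numpy().ravel())
shared_codes = pd.DataFrame(flat_codes.reshape(df.shape),
                            index=df.index, columns=df.columns)

# (2) Categories across columns: each distinct row (tuple of column
# values) gets one code, numbered in order of first appearance.
row_codes = df.groupby(list(df.columns), sort=False).ngroup().to_numpy()
```

In case (1) the result keeps the shape of the frame, with `'b'` coded identically in `x` and `y`. In case (2) the result is a single 1-D array with one label per row.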

Comment From: jbrockmendel

Not obvious to me what the use case is here.

Comment From: mroeschke

Yeah, agreed. Especially since factorize has centralized around 1D input, this wouldn't fit well. Closing.