Would it be possible to improve the drop_duplicates() method to handle columns that contain lists? Currently, if I have a column with list elements, drop_duplicates() throws the following exception:
TypeError: type object argument after * must be a sequence, not map
Example:
df = pd.DataFrame({'a':[10,10,10,11,12], 'b':[['a', 'b'], ['a', 'b'], ['x', 'y'], ['a', 'b'], ['x', 'y']]})
df.drop_duplicates()
For now I have to create a new column that contains the original list elements as strings, and then use this new column in drop_duplicates().
Thank you!
Comment From: jreback
Lists are not hashable (nor is list a real datatype in pandas anyhow), but you can use tuples:
In [7]: df['c'] = df['b'].apply(tuple)
In [8]: df
Out[8]:
    a       b       c
0  10  [a, b]  (a, b)
1  10  [a, b]  (a, b)
2  10  [x, y]  (x, y)
3  11  [a, b]  (a, b)
4  12  [x, y]  (x, y)
In [9]: df.drop_duplicates(subset=['c'])
Out[9]:
    a       b       c
0  10  [a, b]  (a, b)
2  10  [x, y]  (x, y)
In [12]: df.drop_duplicates(subset=['a', 'c'])
Out[12]:
    a       b       c
0  10  [a, b]  (a, b)
2  10  [x, y]  (x, y)
3  11  [a, b]  (a, b)
4  12  [x, y]  (x, y)
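If you would rather not keep the helper column around, the same tuple idea can be applied to a temporary copy and used only to build a boolean mask. This is just a sketch of one way to do it, not something pandas does for you; the names key, mask and deduped are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [10, 10, 10, 11, 12],
    "b": [["a", "b"], ["a", "b"], ["x", "y"], ["a", "b"], ["x", "y"]],
})

# Temporary copy in which the list column is replaced by hashable tuples.
# The original df is left unchanged.
key = df.assign(b=df["b"].apply(tuple))

# duplicated() returns True for every row that repeats an earlier row.
mask = key.duplicated(subset=["a", "b"])

# Filter the original frame, so column b still holds lists.
deduped = df[~mask]
print(deduped)
```

The result keeps the rows with index 0, 2, 3 and 4, the same rows as drop_duplicates(subset=['a', 'c']) above, without adding column c to the frame.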