I have a performance question that is a follow-up to this SO question. I apologize for the repeat, but my question is more focused on the performance issue, not how to do it.
So suppose I have a categorical
with many categories. I'm trying to reduce the number of categories efficiently.
For example,
import pandas as pd
import numpy as np
x = pd.Categorical(np.random.choice(['apple', 'banana', 'cheese', 'dates', 'eggs'], 10000000))
I want to group all the fruits together, so I'd like to do something like this...
x.map({'apple': 'fruit', 'banana': 'fruit', 'cheese': 'cheese', 'dates': 'fruit', 'eggs': 'eggs'})
The above line actually throws an error because Categorical.map
requires a function, not a dict.
I can do something like this...
%time x = pd.Series(x).map({'apple': 'fruit', 'banana': 'fruit', 'cheese': 'cheese', 'dates': 'fruit', 'eggs': 'eggs'})
%time pd.Categorical(x)
However, it feels like there is large overhead in converting the categorical to a series, doing the map, and then converting back to a categorical.
I tried some other tricks, like using Categorical.remove_categories
on the old categories, then adding the new category, then using fillna,
and it's a little better, but it feels a little hacky.
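For reference, the remove/add/fillna trick described above might look roughly like this (a sketch, assuming the values to collapse are known up front and the data has no pre-existing NaNs):

```python
import pandas as pd
import numpy as np

x = pd.Categorical(np.random.choice(['apple', 'banana', 'cheese', 'dates', 'eggs'], 1000))

# Remove the fruit categories (their values become NaN in the result),
# add the new 'fruit' category, and fill the NaNs with it.
y = pd.Series(x.remove_categories(['apple', 'banana', 'dates'])
               .add_categories(['fruit'])).fillna('fruit')
```

This never materializes an object-dtype array for the full data, but it only works for collapsing several old categories into a single new one per pass.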
Is there a more efficient way?
Output of pd.show_versions()
Comment From: TomAugspurger
We could maybe let Categorical.rename
accept a dict whose values are not unique. Would have to think about this more.
For now, you'll be best off doing this "manually" with categories and codes.
In [34]: old_to_new = pd.Index(np.array([0, 0, 1, 0, 2]))
In [35]: old_to_new
Out[35]: Int64Index([0, 0, 1, 0, 2], dtype='int64')
In [36]: new_codes = old_to_new.take(x.codes)
In [37]: new_cat = pd.Categorical.from_codes(new_codes, ['fruit', 'cheese', 'eggs'])
In [38]: new_cat
Out[38]:
[fruit, fruit, fruit, cheese, fruit, ..., eggs, eggs, eggs, fruit, cheese]
Length: 10000000
Categories (3, object): [fruit, cheese, eggs]
old_to_new
maps the old categories' codes to the new categories' codes.
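The codes-based approach above can be wrapped in a small helper that builds the old-to-new code translation from a dict. (`remap_categories` is a hypothetical name, not a pandas API; this sketch assumes every old category appears in the mapping and the data has no missing values, since a -1 code would be translated incorrectly.)

```python
import pandas as pd
import numpy as np

def remap_categories(cat, mapping):
    # New categories in first-seen order of the mapped values.
    new_categories = list(dict.fromkeys(mapping[c] for c in cat.categories))
    # old_to_new[i] is the new code for old code i.
    old_to_new = np.array([new_categories.index(mapping[c]) for c in cat.categories])
    # Translate all codes at once; only integer arrays are touched.
    new_codes = old_to_new[cat.codes]
    return pd.Categorical.from_codes(new_codes, new_categories)

x = pd.Categorical(['apple', 'cheese', 'banana', 'eggs', 'dates'])
mapping = {'apple': 'fruit', 'banana': 'fruit', 'cheese': 'cheese',
           'dates': 'fruit', 'eggs': 'eggs'}
result = remap_categories(x, mapping)
```

The per-element work is a single integer take, so the cost is independent of how many distinct categories are being merged.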
Comment From: gwerbin
FWIW @TomAugspurger, non-unique dict values in rename
have precedent in R factors. See here: https://stackoverflow.com/q/19410108
Comment From: mroeschke
Categorical.map
now supports dictionaries, so I don't think this needs a performance workaround anymore. Closing.
Comment From: hottwaj
I don't think Categorical.map
implements what is described in this issue.
The problem is to map values in a Series of dtype category to new, potentially non-unique values, while still returning a Series of dtype category, without the overhead of an object/other-dtype intermediate step.
Categorical.map
returns an Index mapping old categorical values to new ones. It does not provide a way to map values in a Series of dtype category as far as I can see?
Thanks!
Comment From: hottwaj
(Or maybe the resulting Index can be turned straight into a categorical with low overhead? If so, great; it's just not very clear that that would be the case.) Thanks!
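One possible low-overhead route, sketched below: since Categorical.map operates on the categories, you can map just the (small) categories index and then rebuild the full-length result from the existing codes. This is an assumption about how the pieces fit together, not an officially documented recipe:

```python
import pandas as pd

s = pd.Series(pd.Categorical(['apple', 'cheese', 'banana']))
mapping = {'apple': 'fruit', 'banana': 'fruit', 'cheese': 'cheese'}

# Map only the categories index (a handful of values), then translate
# the codes -- the per-element data never becomes object dtype.
mapped = s.cat.categories.map(mapping)       # one entry per old category
new_categories = mapped.unique()
new_codes = new_categories.get_indexer(mapped)[s.cat.codes]
result = pd.Series(pd.Categorical.from_codes(new_codes, new_categories))
```

If this is the intended usage, it would be worth documenting, because it is not obvious from the Categorical.map docstring.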