I have a performance question that is a follow-up to this SO question. I apologize for the repeat, but my question is more focused on the performance issue, not how to do it.
So suppose I have a categorical
with many categories. I'm trying to reduce the number of categories efficiently.
For example,
import pandas as pd
import numpy as np
x = pd.Categorical(np.random.choice(['apple', 'banana', 'cheese', 'dates', 'eggs'], 10000000))
I want to group all the fruits together, so I'd like to do something like this...
x.map({'apple': 'fruit', 'banana': 'fruit', 'cheese': 'cheese', 'dates': 'fruit', 'eggs': 'eggs'})
The above line actually throws an error because Categorical.map
requires a function, not a dict.
I can do something like this...
%time x = pd.Series(x).map({'apple': 'fruit', 'banana': 'fruit', 'cheese': 'cheese', 'dates': 'fruit', 'eggs': 'eggs'})
%time pd.Categorical(x)
However, it feels like there is large overhead in converting the categorical to a series, doing the map, and then converting back to a categorical.
I tried some other tricks, like using Categorical.remove_categories
on the old categories, then adding the new category, then using fillna,
and it's a little better, but it feels a little hacky.
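For reference, the remove/add/fillna trick described above might look roughly like this (a sketch, assuming the values to collapse are known up front and the data has no pre-existing NaNs):

```python
import pandas as pd
import numpy as np

x = pd.Categorical(np.random.choice(['apple', 'banana', 'cheese', 'dates', 'eggs'], 1000))

# Remove the fruit categories (their values become NaN in the result),
# add the new 'fruit' category, and fill the NaNs with it.
y = pd.Series(x.remove_categories(['apple', 'banana', 'dates'])
               .add_categories(['fruit'])).fillna('fruit')
```

This never materializes an object-dtype array for the full data, but it only works for collapsing several old categories into a single new one per pass.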
Is there a more efficient way?
Output of pd.show_versions()
Comment From: TomAugspurger
We could maybe let Categorical.rename
accept a dict whose values are not unique. Would have to think about this more.
For now, you'll be best off doing this "manually" with categories and codes.
In [34]: old_to_new = pd.Index(np.array([0, 0, 1, 0, 2]))
In [35]: old_to_new
Out[35]: Int64Index([0, 0, 1, 0, 2], dtype='int64')
In [36]: new_codes = old_to_new.take(x.codes)
In [37]: new_cat = pd.Categorical.from_codes(new_codes, ['fruit', 'cheese', 'eggs'])
In [38]: new_cat
Out[38]:
[fruit, fruit, fruit, cheese, fruit, ..., eggs, eggs, eggs, fruit, cheese]
Length: 10000000
Categories (3, object): [fruit, cheese, eggs]
old_to_new
maps the old categories' codes to the new categories' codes.
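The codes-based approach above can be wrapped in a small helper that builds the old-to-new code translation from a dict. (`remap_categories` is a hypothetical name, not a pandas API; this sketch assumes every old category appears in the mapping and the data has no missing values, since a -1 code would be translated incorrectly.)

```python
import pandas as pd
import numpy as np

def remap_categories(cat, mapping):
    # New categories in first-seen order of the mapped values.
    new_categories = list(dict.fromkeys(mapping[c] for c in cat.categories))
    # old_to_new[i] is the new code for old code i.
    old_to_new = np.array([new_categories.index(mapping[c]) for c in cat.categories])
    # Translate all codes at once; only integer arrays are touched.
    new_codes = old_to_new[cat.codes]
    return pd.Categorical.from_codes(new_codes, new_categories)

x = pd.Categorical(['apple', 'cheese', 'banana', 'eggs', 'dates'])
mapping = {'apple': 'fruit', 'banana': 'fruit', 'cheese': 'cheese',
           'dates': 'fruit', 'eggs': 'eggs'}
result = remap_categories(x, mapping)
```

The per-element work is a single integer take, so the cost is independent of how many distinct categories are being merged.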
Comment From: gwerbin
FWIW @TomAugspurger, non-unique dict values in rename
have precedent in R factors. See here: https://stackoverflow.com/q/19410108
Comment From: mroeschke
Categorical.map
now supports dictionaries, so I don't think this needs a performance workaround anymore. Closing.
Comment From: hottwaj
I don't think Categorical.map
implements what is described in this issue.
The problem is to map values in a Series of dtype category to new, potentially non-unique values, while still returning a Series of dtype category, without the overhead of an object/other-dtype intermediate step.
Categorical.map
returns an Index mapping old categorical values to new ones. It does not provide a way to map values in a Series of dtype category as far as I can see?
Thanks!
Comment From: hottwaj
(Or maybe the resulting Index can be turned straight into a categorical with low overhead? If so, great; it's just not very clear that that would be the case.) Thanks!
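One possible low-overhead route, sketched below: since Categorical.map operates on the categories, you can map just the (small) categories index and then rebuild the full-length result from the existing codes. This is an assumption about how the pieces fit together, not an officially documented recipe:

```python
import pandas as pd

s = pd.Series(pd.Categorical(['apple', 'cheese', 'banana']))
mapping = {'apple': 'fruit', 'banana': 'fruit', 'cheese': 'cheese'}

# Map only the categories index (a handful of values), then translate
# the codes -- the per-element data never becomes object dtype.
mapped = s.cat.categories.map(mapping)       # one entry per old category
new_categories = mapped.unique()
new_codes = new_categories.get_indexer(mapped)[s.cat.codes]
result = pd.Series(pd.Categorical.from_codes(new_codes, new_categories))
```

If this is the intended usage, it would be worth documenting, because it is not obvious from the Categorical.map docstring.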