So it's a huge win to perform .str operations on the .categories of a Categorical (versus actually doing them on the full object array) when the number of categories is much smaller than the number of values.
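A minimal sketch of the idea (plain pandas/NumPy; variable names are illustrative): run the string op once per category, then broadcast the result back through the integer codes.

```python
import numpy as np
import pandas as pd

# Many values, few uniques: the sweet spot for this optimization.
s = pd.Series(np.random.choice(["apple", "banana", "cherry"], size=1_000_000))
cat = s.astype("category")

# One str call per category (3 here), not per element (1,000,000).
per_category = np.asarray(cat.cat.categories.str.startswith("a"))

# Broadcast back via the codes. (NaN values have code -1 and would
# need special handling; this sketch assumes no missing data.)
result = pd.Series(per_category[cat.cat.codes], index=s.index)
```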
Comment From: shoyer
I agree this is a good idea, but note that categoricals intentionally do not support arithmetic. That seems inconsistent to me. So I would either consider adding support for arithmetic with numeric categories, or create a more specialized "interned string" array type, which would have a slightly different meaning than a categorical.
Comment From: jreback
I was thinking more along the lines of this:
s.str.startswith('a')
is MUCH more performant if the number of factorized categories is small relative to the size of the object array.
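A rough benchmark sketch (exact numbers depend on machine and pandas version; pandas has dispatched category .str ops through the categories since #10661):

```python
import timeit
import numpy as np
import pandas as pd

s_obj = pd.Series(np.random.choice(["apple", "banana", "cherry"], size=1_000_000))
s_cat = s_obj.astype("category")

# The category version does 3 startswith calls plus a take;
# the object version does 1,000,000 startswith calls.
print("object:  ", timeit.timeit(lambda: s_obj.str.startswith("a"), number=10))
print("category:", timeit.timeit(lambda: s_cat.str.startswith("a"), number=10))
```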
FYI, cc @JanSchultz
I think it might be nice to have a .density method on Categoricals? Essentially
100 * len(categories) / float(len(codes))
Is this called something in R land?
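Written out as a free function, the proposed (hypothetical) .density would be something like:

```python
import pandas as pd

def density(cat: pd.Categorical) -> float:
    # Percentage of categories relative to total values; the lower
    # this is, the bigger the win from per-category operations.
    return 100 * len(cat.categories) / float(len(cat.codes))

density(pd.Categorical(["a", "b", "a", "a"]))  # 50.0
```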
Comment From: shoyer
@jreback I totally agree with you, but s + 1
has equivalent performance gains.
Here's what the categorical docs say about numeric operations:
In contrast to statistical categorical variables, categorical data might have an order (e.g. ‘strongly agree’ vs ‘agree’ or ‘first observation’ vs. ‘second observation’), but numerical operations (additions, divisions, ...) are not possible.
Maybe this is more a practical statement than a principled one?
Comment From: jreback
not talking about numeric ops but rather str ops here (like in the decode issue)
Comment From: jreback
Side issue: are you interested in speaking at PyData NYC in November? (I think Nov 22-23.)
Comment From: shoyer
@jreback OK, made a new issue for arithmetic. As for PyData NYC, I am sort of intrigued but already doing a lot of travel these days.
Comment From: jreback
@shoyer ok cool np.
Comment From: jankatins
Again, I think that this is more an argument to implement a 'pandas-string' type. If one wants the above gains today, one can do
np.asarray(cat.cat.set_categories(cat.cat.categories.str.whatever(...)))
On the other hand, if this can essentially be done in 3 lines of additional code in the .str implementation, then why not...
PS: Schulz with 'z' and not 'tz'. :-)
Comment From: jreback
@JanSchulz sorry about that git is annoying with that :)
Comment From: jankatins
IMO, this should be closed in favor of #8640...
Comment From: jreback
@JanSchulz you mean #10661, right? (which mostly solves the problem, though it does blow back to object arrays). I suspect you get almost all of the perf gains. In fact, do you want to add a little benchmark to the perf suite?
Comment From: jankatins
Nope: I understand this issue as wanting all .str functions to work on a shrunk-down dataset of unique values.
You could actually do that now by simply calling codes, categories = factorize(values, sort=True) and using that as a kind of categorical. On the other hand, that would run a factorize on every string method call (and does not cache it), so the penalty is still there.
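A sketch of that factorize-based approach (caching the factorization is left to the caller, which is exactly the penalty mentioned):

```python
import numpy as np
import pandas as pd

values = pd.Series(["apple", "banana", "apple", "cherry"])

# Dedupe once; codes index into the unique values.
codes, categories = pd.factorize(values, sort=True)

# Do the string work on the uniques only, then expand by the codes.
result = pd.Series(np.asarray(categories.str.upper())[codes], index=values.index)
```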
Today (as of #10661) you can do s.astype("category").str.whatever() and get that behaviour (if you cache the cat yourself). But this has the problem that you also get the categorical behaviour: no s + s, no arbitrary new string values, ...
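For example (a quick demonstration of those restrictions; the exact exception type and message vary by pandas version):

```python
import pandas as pd

s = pd.Series(["a", "b"], dtype="category")

try:
    s + s  # concatenation is not defined for category dtype
except TypeError as e:
    print("add:", e)

try:
    s.iloc[0] = "c"  # new values must first be added as categories
except (TypeError, ValueError) as e:
    print("setitem:", e)
```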
IMO the real solution is to build a "PandasString" class in the same spirit as "Categorical" and use that: s.astype("pdstring").str.whatever(). And that's #8640.
Or get a real numpy string type...
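A toy sketch of what such a class could look like (entirely hypothetical; this is not the actual #8640 design):

```python
import numpy as np
import pandas as pd

class InternedStrings:
    """Interned-string array: codes plus unique values, like
    Categorical, but without categorical semantics (any new string
    value would be allowed)."""

    def __init__(self, values):
        # Factorize once and cache, so every string op reuses it.
        self.codes, self.uniques = pd.factorize(np.asarray(values, dtype=object))

    def map_str(self, func):
        # Apply func once per unique value, then expand via the codes.
        mapped = np.array([func(u) for u in self.uniques], dtype=object)
        return mapped[self.codes]

InternedStrings(["apple", "banana", "apple"]).map_str(str.upper)
# array(['APPLE', 'BANANA', 'APPLE'], dtype=object)
```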
Comment From: jreback
closing in favor of #8640