So it's a huge win to perform .str operations on the .categories of a Categorical (versus actually doing them on the full object array) when the number of categories is much smaller than the number of values.
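A minimal sketch of the idea (plain pandas/NumPy; variable names are illustrative): run the string op once per category, then broadcast the result back through the integer codes.

```python
import numpy as np
import pandas as pd

# Many values, few uniques: the sweet spot for this optimization.
s = pd.Series(np.random.choice(["apple", "banana", "cherry"], size=1_000_000))
cat = s.astype("category")

# One str call per category (3 here), not per element (1,000,000).
per_category = np.asarray(cat.cat.categories.str.startswith("a"))

# Broadcast back via the codes. (NaN values have code -1 and would
# need special handling; this sketch assumes no missing data.)
result = pd.Series(per_category[cat.cat.codes], index=s.index)
```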
Comment From: shoyer
I agree this is a good idea, but note that categoricals intentionally do not support arithmetic. That seems inconsistent to me. So I would either consider adding support for arithmetic with numeric categories, or create a more specialized "interned string" array type, which would have a slightly different meaning than a categorical.
Comment From: jreback
I was thinking more along the lines of this:
s.str.startswith('a')
is MUCH more performant if the number of factorized categories is small relative to the size of the object array.
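A rough benchmark sketch (exact numbers depend on machine and pandas version; pandas has dispatched category .str ops through the categories since #10661):

```python
import timeit
import numpy as np
import pandas as pd

s_obj = pd.Series(np.random.choice(["apple", "banana", "cherry"], size=1_000_000))
s_cat = s_obj.astype("category")

# The category version does 3 startswith calls plus a take;
# the object version does 1,000,000 startswith calls.
print("object:  ", timeit.timeit(lambda: s_obj.str.startswith("a"), number=10))
print("category:", timeit.timeit(lambda: s_cat.str.startswith("a"), number=10))
```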
FYI, cc @JanSchultz
I think it might be nice to have a .density method on Categoricals? Essentially
100 * len(categories) / float(len(codes))
Is this called something in R land?
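Written out as a free function, the proposed (hypothetical) .density would be something like:

```python
import pandas as pd

def density(cat: pd.Categorical) -> float:
    # Percentage of categories relative to total values; the lower
    # this is, the bigger the win from per-category operations.
    return 100 * len(cat.categories) / float(len(cat.codes))

density(pd.Categorical(["a", "b", "a", "a"]))  # 50.0
```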
Comment From: shoyer
@jreback I totally agree with you, but s + 1
has equivalent performance gains.
Here's what the categorical docs say about numeric operations:
In contrast to statistical categorical variables, categorical data might have an order (e.g. ‘strongly agree’ vs ‘agree’ or ‘first observation’ vs. ‘second observation’), but numerical operations (additions, divisions, ...) are not possible.
Maybe this is more a practical statement than a principled one?
Comment From: jreback
not talking about numeric ops but rather str ops here (like in the decode issue)
Comment From: jreback
Side issue: are you interested in speaking at PyData NYC in November? (I think Nov 22-23.)
Comment From: shoyer
@jreback OK, made a new issue for arithmetic. As for PyData NYC, I am sort of intrigued but already doing a lot of travel these days.
Comment From: jreback
@shoyer ok cool np.
Comment From: jankatins
Again, I think that this is more an argument to implement a 'pandas-string' type. If one wants the above gains today, one can do
np.asarray(cat.cat.set_categories(cat.cat.categories.str.whatever(...)))
On the other hand, if this can essentially be done in 3 lines of additional code in the .str implementation, then why not...
PS: Schulz with 'z' and not 'tz'. :-)
Comment From: jreback
@JanSchulz sorry about that git is annoying with that :)
Comment From: jankatins
IMO, this should be closed in favor of #8640...
Comment From: jreback
@JanSchulz you mean #10661, right? (which mostly solves the problem, though it does blow back to object arrays). I suspect you get almost all of the perf gains. In fact, do you want to add a little benchmark to the perf suite?
Comment From: jankatins
Nope: I understand this issue as wanting all .str functions to work on a shrunk-down dataset of unique values.
You could actually do that now by simply calling codes, categories = factorize(values, sort=True) and using that as a kind of categorical. On the other hand, that would run a factorize on every string method call (and does not cache it), so the penalty is still there.
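A sketch of that factorize-based approach (caching the factorization is left to the caller, which is exactly the penalty mentioned):

```python
import numpy as np
import pandas as pd

values = pd.Series(["apple", "banana", "apple", "cherry"])

# Dedupe once; codes index into the unique values.
codes, categories = pd.factorize(values, sort=True)

# Do the string work on the uniques only, then expand by the codes.
result = pd.Series(np.asarray(categories.str.upper())[codes], index=values.index)
```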
Today (as of #10661) you can do s.astype("category").str.whatever() and get that behaviour (if you cache the cat yourself). But this has the problem that you also get the categorical behaviour: no s + s, no arbitrary new string values, ...
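For example (a quick demonstration of those restrictions; the exact exception type and message vary by pandas version):

```python
import pandas as pd

s = pd.Series(["a", "b"], dtype="category")

try:
    s + s  # concatenation is not defined for category dtype
except TypeError as e:
    print("add:", e)

try:
    s.iloc[0] = "c"  # new values must first be added as categories
except (TypeError, ValueError) as e:
    print("setitem:", e)
```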
IMO the real solution is to build a "PandasString" class in the same spirit as "Categorical" and use that: s.astype("pdstring").str.whatever(). And that's #8640.
Or get a real numpy string type...
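A toy sketch of what such a class could look like (entirely hypothetical; this is not the actual #8640 design):

```python
import numpy as np
import pandas as pd

class InternedStrings:
    """Interned-string array: codes plus unique values, like
    Categorical, but without categorical semantics (any new string
    value would be allowed)."""

    def __init__(self, values):
        # Factorize once and cache, so every string op reuses it.
        self.codes, self.uniques = pd.factorize(np.asarray(values, dtype=object))

    def map_str(self, func):
        # Apply func once per unique value, then expand via the codes.
        mapped = np.array([func(u) for u in self.uniques], dtype=object)
        return mapped[self.codes]

InternedStrings(["apple", "banana", "apple"]).map_str(str.upper)
# array(['APPLE', 'BANANA', 'APPLE'], dtype=object)
```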
Comment From: jreback
closing in favor of #8640