The parameter for categories is documented as being index-like.
Does that include dictionaries?
If so, I got some unexpected results when working with integers and dictionaries in version 0.17.1.
The following sets out what I found:
sex = [1,2,0,1]
categories = {1:'Male', 2:'Female', 0:'Unknown'}
pd.Categorical.from_codes(sex, categories=categories)
returns the array with the keys instead of the values (though I suspect that was luck, given the outcomes below):
[1, 2, 0, 1]
Categories (3, int64): [0, 1, 2]
Swapping the dictionary's keys and values around shows that it doesn't match values to keys either:
sex = [1,2,0,1]
categories = {'Male':1, 'Female':2, 'Unknown':0}
pd.Categorical.from_codes(sex, categories=categories)
returns an array with incorrectly mapped codes and categories:
[Unknown, Male, Female, Unknown]
Categories (3, object): [Female, Unknown, Male]
Using a non-sequential numerical ordering for the codes fails with a dictionary:
sex = [1,2,9,1]
categories = {'Male':1, 'Female':2, 'Unknown':9}
pd.Categorical.from_codes(sex, categories=categories)
This fails with ValueError: codes need to be between -1 and len(categories)-1
at line 386 in categorical.py, at if len(codes) and (codes.max() >= len(categories) or codes.min() < -1),
presumably because of the second logical element, codes.max() >= len(categories).
I'm guessing it's meant to be a check on whether the number of unique elements in codes is greater than or equal to the number of items in categories.
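For the non-sequential case above, one possible workaround is to translate each raw value into its position in an explicit list of categories before calling from_codes. This is only a sketch under that assumption; the names raw, labels, raw_values, and position are illustrative and are not part of pandas.

```python
import pandas as pd

raw = [1, 2, 9, 1]                                 # raw values, not positional codes
labels = {1: 'Male', 2: 'Female', 9: 'Unknown'}    # raw value -> label

# Give every raw value a position 0..n-1, which is what from_codes expects.
raw_values = sorted(labels)                        # [1, 2, 9]
position = {value: i for i, value in enumerate(raw_values)}

codes = [position[value] for value in raw]         # [0, 1, 2, 0]
cats = [labels[value] for value in raw_values]     # ['Male', 'Female', 'Unknown']

sex = pd.Categorical.from_codes(codes, categories=cats)
print(sex)
```

With the codes rewritten as positions, the largest code is len(cats) - 1, so the range check quoted above passes.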
Comment From: jreback
a dict is not index-like (which is list-like). it will be coerced with list(dict), which yields the keys. not sure why you would pass that.
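To illustrate the coercion in plain Python, without pandas: calling list() on a dict returns only its keys and drops the values. The variable names below are illustrative. On Python 3.7 and later the keys come back in insertion order; older versions did not guarantee any order, which may explain the category order in the report above.

```python
categories = {'Male': 1, 'Female': 2, 'Unknown': 0}

# list() on a dict iterates over its keys; the values 1, 2, 0 are discarded.
as_list = list(categories)
print(as_list)

# Position in the list, not the dict value, is what a code would refer to.
for code, label in enumerate(as_list):
    print(code, label)
```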
Comment From: ChristopherShort
Great, thanks - though the docs say 'index like'.
Perhaps I should try to make a correction to the docs? And put an example in too. (I'll see if I can figure it out.)
Comment From: jreback
index like is correct. but an index is again not a dict, it is very much like a list
Comment From: ChristopherShort
ahh... pandas index - silly me - thanks.
Comment From: jreback
@ChristopherShort as an aside: generally you shouldn't be providing codes. yes, this is a public method, but you should use it only in cases where you already have the codes (e.g. say you are coding incrementally).
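As a sketch of the alternative, assuming the data is already in a Series of raw integers and a dict of labels is available: map the values to labels and let pandas build the codes itself. The names raw and labels are illustrative.

```python
import pandas as pd

raw = pd.Series([1, 2, 9, 1])
labels = {1: 'Male', 2: 'Female', 9: 'Unknown'}

# Series.map looks each value up in the dict; astype('category') then
# derives the categories and the integer codes from the mapped labels.
sex = raw.map(labels).astype('category')

print(sex)
print(sex.cat.codes.tolist())
```

Here the codes are an internal detail chosen by pandas, so the raw values do not need to be sequential or start at zero.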
Comment From: ChristopherShort
Thanks - makes sense.
My use case here was a dataset with 20 million observations - several category variables are already integer coded. One variable in particular takes 358 different values over a range of ints from 10 to 998.
The method struck me as being able to map those integers to those categories, potentially for some display purposes in a notebook (there are other ways to do what I want anyway).
It was my silly error in reading index like and thinking of indexable Python objects instead of a pandas index.
Again - thanks for taking the time here - and also a quick note to mention my appreciation for your tutorials on pandas performance and developments - they have really helped me. (And to all those that make pandas a fantastic tool)
Comment From: jreback
@ChristopherShort great. glad it's working for you.