I've searched for issues, but haven't found a similar one - apologies if there is one.
Current behaviour:
import pandas as pd
df = pd.DataFrame({'col': ['a', None]})
df['col'] = df.col.astype('category')
dict(enumerate(df.col.cat.categories))
# {0: 'a'}
Expected behaviour would be to include the NaN, e.g. `{-1: NaN, 0: 'a'}`.
This would match the behaviour of `df.col.cat.codes` and `df.col.unique()`, both of which include the NaN.
Output of pd.show_versions()
Comment From: jreback
`Categoricals` can take on only a limited, and usually fixed, number
of possible values (`categories`). In contrast to statistical categorical
variables, a `Categorical` might have an order, but numerical operations
(additions, divisions, ...) are not possible.
All values of the `Categorical` are either in `categories` or `np.nan`.
Assigning values outside of `categories` will raise a `ValueError`. Order
is defined by the order of the `categories`, not lexical order of the
values.
This is by definition; NaN is not included in `categories`. `.codes` is an implementation detail and `.unique()` returns the categorical itself.
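A minimal sketch of the behaviour described above, assuming a reasonably recent pandas (the variable names are only illustrative): the missing value is absent from `categories`, is coded as `-1` in `codes`, and still appears in the result of `unique()`.

```python
import pandas as pd

s = pd.Series(['a', None], dtype='category')

categories = list(s.cat.categories)  # NaN is not one of the categories
codes = s.cat.codes.tolist()         # the missing value is coded as -1
n_unique = len(s.unique())           # unique() keeps the missing value

print(categories)  # ['a']
print(codes)       # [0, -1]
print(n_unique)    # 2
```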
Comment From: jorisvandenbossche
See also https://pandas.pydata.org/pandas-docs/stable/categorical.html#missing-data
Comment From: kodonnell
OK, fair enough, though I'd tend to think including it would be the more 'obvious' behaviour. (Does the user lose anything by doing this?)
Using `unique` is fine, but a lot slower - I'll submit a separate issue on this (if there isn't one already).
Comment From: jorisvandenbossche
> I'd tend to think including it would be the more 'obvious' behaviour. (Does the user lose anything by doing this?)
But it would make the implementation much harder + the fact that your categories can change when missing values are introduced can also be confusing.
Comment From: kodonnell
> the fact that your categories can change when missing values are introduced can also be confusing
That's new to me - I thought categories couldn't change.
> But it would make the implementation much harder
But 'does the user lose anything'? Anyway, since I'm not proposing a PR to do all the hard work, we can probably leave this unless others revive it.
Comment From: TomAugspurger
It's often convenient, even in user-land, to know that the codes range from `[0, n_categories)`, and that missing values are always coded as `-1`. Including `NaN` in the categories breaks that assumption.
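As a rough illustration of why that assumption is convenient (a sketch only; the `decoded` array and the example data are made up for this purpose): every non-negative code can be used directly as a position in `categories`, and a single `-1` check covers the missing values.

```python
import numpy as np
import pandas as pd

s = pd.Series(['b', None, 'a', 'b'], dtype='category')
codes = s.cat.codes.to_numpy()
cats = s.cat.categories.to_numpy()

valid = codes >= 0                      # -1 marks a missing value
assert codes[valid].max() < len(cats)   # valid codes lie in [0, n_categories)

# Valid codes index straight into the categories array.
decoded = np.full(len(codes), None, dtype=object)
decoded[valid] = cats[codes[valid]]
print(decoded.tolist())  # ['b', None, 'a', 'b']
```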
Comment From: kodonnell
@TomAugspurger - I'm not sure I follow. Knowing codes range from `[0, n_categories)` (excluding `-1: NaN`), is the same as knowing codes range from `[-1, n_categories)`, including `NaN`. For my use cases, I'd prefer to have `NaN`, since then I don't have to continually check for `NaN`.
Comment From: TomAugspurger
Including NaN in the categories would mess up the relationship between position in categories and codes then.
To be clear, it’s not unreasonable to want to include NaN in the categories. But we’ve chosen not to.
From: kodonnell notifications@github.com Sent: Sunday, December 3, 2017 2:31:48 AM To: pandas-dev/pandas Cc: Tom Augspurger; Mention Subject: Re: [pandas-dev/pandas] NaN not included in col.cat.categories (#18574)
Comment From: kodonnell
> Including NaN in the categories would mess up the relationship between position in categories and codes then.
For me it's messed up already, as `len(codes) != len(categories)`. One could always add an `include_nan` option so we could get both, if required.
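For what it's worth, something like this can be approximated in user-land today. The sketch below is only an assumption of what such an option might look like: `code_to_category` and its `include_nan` argument are hypothetical names, not part of pandas.

```python
import numpy as np
import pandas as pd

def code_to_category(series, include_nan=False):
    """Map each code to its category, optionally adding -1 -> NaN.

    Hypothetical user-land helper; pandas has no such option.
    """
    mapping = dict(enumerate(series.cat.categories))
    if include_nan:
        # Missing values are coded as -1, so put that entry first.
        mapping = {-1: np.nan, **mapping}
    return mapping

df = pd.DataFrame({'col': ['a', None]})
df['col'] = df['col'].astype('category')

print(code_to_category(df['col']))                    # {0: 'a'}
print(code_to_category(df['col'], include_nan=True))  # {-1: nan, 0: 'a'}
```

Here the `-1` entry is added whenever `include_nan=True`, whether or not the column actually contains missing values; that is a design choice of this sketch, made so the mapping always covers every possible code.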
> To be clear, it’s not unreasonable to want to include NaN in the categories. But we’ve chosen not to.
Fair enough. Will leave this one be, unless revived.