Pandas NaN not included in col.cat.categories

I've searched for issues, but haven't found a similar one - apologies if there is one.

Current behaviour:

import pandas as pd
df = pd.DataFrame({'col': ['a', None]})
df['col'] = df.col.astype('category')
dict(enumerate(df.col.cat.categories))
# {0: 'a'}

Expected behaviour would be to include the NaN, e.g.

{-1: NaN, 0: 'a'}

This would match the behaviour of df.col.cat.codes and df.col.unique(), both of which include the NaN.
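
The comparison above can be reproduced with a small sketch. It assumes a reasonably recent pandas and uses a standalone Series (the variable names are illustrative, not from the original report):

```python
import pandas as pd

# Same data as the report, built directly as a categorical Series.
s = pd.Series(['a', None], dtype='category')

categories = list(s.cat.categories)  # only the real categories
codes = s.cat.codes.tolist()         # integer codes, one per row
unique_values = s.unique()           # distinct values, returned as a Categorical

print(categories)
print(codes)
print(unique_values)
```

Here the missing value shows up as a code of -1 and as an entry in the result of `unique()`, but not in `categories`.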

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.21.0
pytest: None
pip: 9.0.1
setuptools: 36.5.0.post20170921
Cython: None
numpy: 1.13.3
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Comment From: jreback

`Categoricals` can take on only a limited, and usually fixed, number
of possible values (`categories`). In contrast to statistical categorical
variables, a `Categorical` might have an order, but numerical operations
(additions, divisions, ...) are not possible.

All values of the `Categorical` are either in `categories` or `np.nan`.
Assigning values outside of `categories` will raise a `ValueError`. Order
is defined by the order of the `categories`, not lexical order of the
values.

This is by definition; NaN is not included in the categories. `.codes` is an implementation detail, and `.unique()` returns a Categorical itself.
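
As a rough illustration of the quoted rule, the sketch below sets one element to NaN and then tries to assign a value outside the categories. The exception is caught broadly as an assumption: the quoted docs say `ValueError`, but the exact type may differ between pandas versions.

```python
import numpy as np
import pandas as pd

cat = pd.Categorical(['a', 'b', 'a'], categories=['a', 'b'])

# Missing values are always allowed and do not change the categories.
cat[0] = np.nan
print(list(cat.categories))

# A value outside the categories is rejected.
# Assumption: the exception type depends on the pandas version,
# so both ValueError and TypeError are accepted here.
try:
    cat[1] = 'z'
    rejected = False
except (ValueError, TypeError):
    rejected = True

print(rejected)
print(cat.codes.tolist())
```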

Comment From: jorisvandenbossche

See also https://pandas.pydata.org/pandas-docs/stable/categorical.html#missing-data

Comment From: kodonnell

OK, fair enough, though I'd tend to think including it would be the more 'obvious' behaviour. (Does the user lose anything by doing this?)

Using unique is fine, but a lot slower - I'll submit a separate issue on this (if there isn't one already).

Comment From: jorisvandenbossche

> I'd tend to think including it would be the more 'obvious' behaviour. (Does the user lose anything by doing this?)

But it would make the implementation much harder, and the fact that your categories can change when missing values are introduced can also be confusing.

Comment From: kodonnell

> the fact that your categories can change when missing values are introduced can also be confusing

That's new to me - I thought categories couldn't change.

> But it would make the implementation much harder

But 'does the user lose anything'? Anyway, since I'm not proposing a PR to do all the hard work, we can probably leave this unless others revive it.

Comment From: TomAugspurger

It's often convenient, even in user-land, to know that the codes range from [0, n_categories), and that missing values are always coded as -1. Including NaN in the categories breaks that assumption.
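
A minimal sketch of the assumption described above, using made-up data: every non-missing code falls in [0, n_categories), and -1 marks exactly the missing rows.

```python
import pandas as pd

s = pd.Series(['b', None, 'a', 'b'], dtype='category')
codes = s.cat.codes
n_categories = len(s.cat.categories)

# -1 is reserved for missing values.
missing = codes == -1

# All remaining codes are valid positions into the categories.
valid = codes[~missing]
in_range = bool(valid.min() >= 0 and valid.max() < n_categories)

print(codes.tolist())
print(missing.tolist())
print(in_range)
```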

Comment From: kodonnell

@TomAugspurger - I'm not sure I follow. Knowing codes range from [0, n_categories) (excluding -1: NaN), is the same as knowing codes range from [-1, n_categories), including NaN. For my use cases, I'd prefer to have NaN, since then I don't have to continually check for NaN.

Comment From: TomAugspurger

Including nan in the categories would mess up the relationship between position in categories and codes then.

To be clear, it’s not unreasonable to want to include NaN in the categories. But we’ve chosen not to.
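
The relationship between codes and positions mentioned above can be sketched as follows. This is only an illustration with made-up data: each non-missing code is used directly as a position into `categories`, which would no longer line up if NaN occupied a slot there.

```python
import numpy as np
import pandas as pd

s = pd.Series(['b', None, 'a'], dtype='category')
codes = s.cat.codes.to_numpy()
categories = s.cat.categories.to_numpy()

# For non-missing entries, the code is the position in `categories`.
decoded = np.empty(len(codes), dtype=object)
mask = codes >= 0
decoded[mask] = categories[codes[mask]]

# Entries coded as -1 are the missing values.
decoded[~mask] = np.nan

print(decoded)
```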

Comment From: kodonnell

> Including nan in the categories would mess up the relationship between position in categories and codes then.

For me it's messed up already, as len(codes) != len(categories). One could always add an include_nan option so we could get both, if required.

> To be clear, it’s not unreasonable to want to include NaN in the categories. But we’ve chosen not to.

Fair enough. Will leave this one be, unless revived.
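
For anyone who wants the mapping from the original report, the `include_nan` idea can be approximated in user code. The helper below is hypothetical: neither `code_to_category` nor its `include_nan` parameter is part of pandas.

```python
import numpy as np
import pandas as pd


def code_to_category(series, include_nan=False):
    """Hypothetical helper (not part of pandas): map codes to categories."""
    mapping = dict(enumerate(series.cat.categories))
    if include_nan:
        # Missing values are coded as -1, so add that entry explicitly.
        mapping = {-1: np.nan, **mapping}
    return mapping


df = pd.DataFrame({'col': ['a', None]})
df['col'] = df['col'].astype('category')

mapping = code_to_category(df['col'], include_nan=True)
print(mapping)
```

This leaves `categories` untouched, so code that relies on codes being positions into `categories` keeps working.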
