Code Sample, a copy-pastable example if possible
import numpy as np
import pandas as pd

dftest = pd.DataFrame({"a": [np.nan]*5, "b": [1, 1, 2, 3, 3]})
dftest.groupby("b").agg(lambda x: set(x))
# [Out]
#             a
# b
# 1  {nan, nan}
# 2       {nan}
# 3  {nan, nan}
dftest.apply(set)
# [Out]
# a {nan, nan, nan, nan, nan}
# b {1.0, 2.0, 3.0}
# dtype: object
Problem description
I am trying to count the number of distinct values in a group. When I apply lambda x: len(set(x)) to a column that contains NaN, I get an unexpected result: each NaN instance is counted as an independent value. The expected output should follow the common definition of a set, i.e. applying the abovementioned function to a plain list of NaN collapses them into one element:
set([np.nan]*5)
# [Out:] {np.nan}
It might be rooted in NumPy:
set(np.asarray([np.nan]*5))
# [Out] {nan, nan, nan, nan, nan}
However:
set(np.asarray([np.nan]*5 + ["a"]))
# [Out] {nan, "a"}
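The discrepancy can be reproduced without pandas or NumPy at all. CPython's set membership test checks identity (`is`) before equality (`==`), so five references to the *same* NaN object collapse to one element, while five *distinct* NaN objects do not, since NaN != NaN. A minimal sketch of that behavior:

```python
import math

nan = float("nan")
same = [nan] * 5                              # five references to ONE float object
distinct = [float("nan") for _ in range(5)]   # five separate float objects

print(len(set(same)))      # identity check succeeds -> 1 element
print(len(set(distinct)))  # identity fails, NaN != NaN -> 5 elements
print(math.isnan(nan), nan == nan)
```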
Expected Output
Output of pd.show_versions()
Comment From: chris-b1
The method you want is nunique, which will handle this for you; use the dropna parameter to tell it how to count nulls.
In [24]: dftest.groupby('b')['a'].nunique()
Out[24]:
b
1 0
2 0
3 0
Name: a, dtype: int64
In [25]: dftest.groupby('b')['a'].nunique(dropna=False)
Out[25]:
b
1 1
2 1
3 1
Name: a, dtype: int64
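As a drop-in replacement for the lambda in the original snippet, the whole reproduction then reads (a sketch; the variable names are mine):

```python
import numpy as np
import pandas as pd

dftest = pd.DataFrame({"a": [np.nan] * 5, "b": [1, 1, 2, 3, 3]})

# Instead of .agg(lambda x: len(set(x))), which miscounts NaN:
counts = dftest.groupby("b")["a"].nunique()                      # NaN excluded -> all 0
counts_with_na = dftest.groupby("b")["a"].nunique(dropna=False)  # NaN counted once -> all 1

print(counts.tolist())
print(counts_with_na.tolist())
```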
Generally, using NaN in any context requiring equality (such as a set) can result in strange behavior, because NaN != NaN. In particular, the issue here seems to come down to the difference between a Python float NaN and the NumPy boxed float64 scalar version.
In [45]: a = [np.nan] * 5
In [46]: b = list(np.asarray([np.nan] * 5))
In [47]: a
Out[47]: [nan, nan, nan, nan, nan]
In [48]: b
Out[48]: [nan, nan, nan, nan, nan]
In [49]: set(a)
Out[49]: {nan}
In [50]: set(b)
Out[50]: {nan, nan, nan, nan, nan}
In [54]: type(a[0])
Out[54]: float
In [55]: type(b[0])
Out[55]: numpy.float64
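The boxing step is visible directly: converting the array back to a list boxes each element into a separate numpy.float64 object, so the identity shortcut in set membership never fires. A sketch of the same mechanism:

```python
import numpy as np

arr = np.asarray([np.nan] * 5)
boxed = list(arr)            # each element boxed into its own np.float64 object

print(boxed[0] is boxed[1])  # distinct objects
print(len(set(boxed)))       # identity fails and NaN != NaN -> 5 elements

# A single np.float64 NaN repeated still collapses, because identity matches:
one = np.float64("nan")
print(len(set([one] * 5)))   # 1 element
```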
Comment From: chris-b1
SO answer with a more detailed explanation https://stackoverflow.com/a/6441990/3657742