Code Sample, a copy-pastable example if possible
import numpy as np
import pandas as pd

dftest = pd.DataFrame({"a": [np.nan]*5, "b": [1, 1, 2, 3, 3]})
dftest.groupby("b").agg(lambda x: set(x))
# [Out]
#             a
# b
# 1  {nan, nan}
# 2       {nan}
# 3  {nan, nan}
dftest.apply(set)
# [Out]
# a {nan, nan, nan, nan, nan}
# b {1.0, 2.0, 3.0}
# dtype: object
Problem description
I am trying to count the number of distinct values in a group. When I apply lambda x: len(set(x)) to a column that contains NaN, I get an unexpected result: each NaN instance is counted as an independent value. The expected output should follow the common definition of a set, i.e. applying the abovementioned function to a plain list of NaN collapses them into one element:
set([np.nan]*5)
# [Out:] {np.nan}
It might be rooted in NumPy:
set(np.asarray([np.nan]*5))
# [Out] {nan, nan, nan, nan, nan}
However:
set(np.asarray([np.nan]*5 + ["a"]))
# [Out] {nan, "a"}
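The discrepancy can be reproduced without pandas or NumPy at all. CPython's set membership test checks identity (`is`) before equality (`==`), so five references to the *same* NaN object collapse to one element, while five *distinct* NaN objects do not, since NaN != NaN. A minimal sketch of that behavior:

```python
import math

nan = float("nan")
same = [nan] * 5                              # five references to ONE float object
distinct = [float("nan") for _ in range(5)]   # five separate float objects

print(len(set(same)))      # identity check succeeds -> 1 element
print(len(set(distinct)))  # identity fails, NaN != NaN -> 5 elements
print(math.isnan(nan), nan == nan)
```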
Expected Output
Output of pd.show_versions()
Comment From: chris-b1
The method you want is nunique, which will handle this for you; use the dropna parameter to tell it how to count nulls.
In [24]: dftest.groupby('b')['a'].nunique()
Out[24]:
b
1 0
2 0
3 0
Name: a, dtype: int64
In [25]: dftest.groupby('b')['a'].nunique(dropna=False)
Out[25]:
b
1 1
2 1
3 1
Name: a, dtype: int64
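As a drop-in replacement for the lambda in the original snippet, the whole reproduction then reads (a sketch; the variable names are mine):

```python
import numpy as np
import pandas as pd

dftest = pd.DataFrame({"a": [np.nan] * 5, "b": [1, 1, 2, 3, 3]})

# Instead of .agg(lambda x: len(set(x))), which miscounts NaN:
counts = dftest.groupby("b")["a"].nunique()                      # NaN excluded -> all 0
counts_with_na = dftest.groupby("b")["a"].nunique(dropna=False)  # NaN counted once -> all 1

print(counts.tolist())
print(counts_with_na.tolist())
```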
Generally, using NaN in any context requiring equality (such as a set) can result in strange behavior, because NaN != NaN. In particular, the issue here seems to come down to the difference between a Python float NaN and the NumPy boxed float64 scalar version.
In [45]: a = [np.nan] * 5
In [46]: b = list(np.asarray([np.nan] * 5))
In [47]: a
Out[47]: [nan, nan, nan, nan, nan]
In [48]: b
Out[48]: [nan, nan, nan, nan, nan]
In [49]: set(a)
Out[49]: {nan}
In [50]: set(b)
Out[50]: {nan, nan, nan, nan, nan}
In [54]: type(a[0])
Out[54]: float
In [55]: type(b[0])
Out[55]: numpy.float64
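The boxing step is visible directly: converting the array back to a list boxes each element into a separate numpy.float64 object, so the identity shortcut in set membership never fires. A sketch of the same mechanism:

```python
import numpy as np

arr = np.asarray([np.nan] * 5)
boxed = list(arr)            # each element boxed into its own np.float64 object

print(boxed[0] is boxed[1])  # distinct objects
print(len(set(boxed)))       # identity fails and NaN != NaN -> 5 elements

# A single np.float64 NaN repeated still collapses, because identity matches:
one = np.float64("nan")
print(len(set([one] * 5)))   # 1 element
```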
Comment From: chris-b1
SO answer with a more detailed explanation https://stackoverflow.com/a/6441990/3657742