Pandas QST: Categorical NaN moving to strings. Unexpected behavior?

Research

[X] I have searched the [pandas] tag on StackOverflow for similar questions.
[X] I have asked my usage related question on StackOverflow.

Link to question on StackOverflow

https://stackoverflow.com/questions/75406055/weird-behavior-on-string-vs-categorical-dtypes/75406177#75406177

Question about pandas

Hello guys, I've been struggling with a piece of code where I had to convert a categorical column back to a string column. After I discussed it on StackOverflow, the people there helped me understand the behavior of that conversion:

With the category dtype, the missing value is a float (a typical NaN) and you can use it in comparisons, so data.partner != 'A' will be True for all the rows with NaN.

When the column is converted to the string dtype, the NaN value is converted to pd.NA (type pandas._libs.missing.NAType), which does not give a True/False answer in comparisons, so now data.partner != 'A' returns <NA> for those rows, which is not True, and the result differs.
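A minimal sketch of the difference described above. The `partner` data here is made up for illustration (the original DataFrame is not shown in this issue), and `pd.StringDtype()` is assumed to be the string dtype that was used in the conversion:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the data.partner column
partner = pd.Series(pd.Categorical(["A", "B", np.nan]))

# Categorical: the missing entry compares as "not equal", so the mask is True there
cat_mask = partner != "A"

# StringDtype: the missing entry becomes pd.NA and the comparison propagates it
str_mask = partner.astype(pd.StringDtype()) != "A"

print(cat_mask.tolist())
print(str_mask.tolist())

# Counting True values shows why a filter built on each mask selects different rows
print(cat_mask.sum(), str_mask.sum())
```

The first mask is an ordinary bool Series, while the second is a nullable boolean Series whose missing entry stays `<NA>`. As far as I know, boolean indexing treats `<NA>` as False, which would explain the differing row counts.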

My question is: wouldn't it be more "common sense", or at least more expected, for a categorical NaN converted to a string to just become a blank/empty string instead of a "pandas._libs.missing.NAType"?

Thanks, and sorry for bothering you. I just want to understand the reasons behind the current behavior so I won't miss it again!

Comment From: jbrockmendel

@jorisvandenbossche @Dr-Irv getting pd.NA is causing this user confusion. Can you help out?

Comment From: Dr-Irv

> My question is: wouldn't it be more "common sense", or at least more expected, for a categorical NaN converted to a string to just become a blank/empty string instead of a "pandas._libs.missing.NAType"?

There are two representations of "missing values" in pandas. For "traditional" dtypes (float, int, object, Categorical), the value np.nan is used to represent a missing value. For the "nullable" dtypes (StringDtype, Int64Dtype, etc.), the value pd.NA is used to represent a missing value. Since np.nan is used with Categorical, when you convert that to a string using StringDtype, then np.nan is converted to pd.NA. If you wanted to preserve the np.nan value, then you should use object dtype. See below:

In [1]: import pandas as pd

In [2]: s = pd.Series(pd.Categorical(["a", "b", "c", "a", "a", "b"]))

In [3]: s
Out[3]:
0    a
1    b
2    c
3    a
4    a
5    b
dtype: category
Categories (3, object): ['a', 'b', 'c']

In [4]: s2 = s.shift(1)

In [5]: s2
Out[5]:
0    NaN
1      a
2      b
3      c
4      a
5      a
dtype: category
Categories (3, object): ['a', 'b', 'c']

In [6]: s2.iloc[0]
Out[6]: nan

In [7]: s3 = s2.astype(pd.StringDtype())

In [8]: s3.iloc[0]
Out[8]: <NA>

In [9]: s4 = s2.astype(object)

In [10]: s4.iloc[0]
Out[10]: nan

See https://github.com/pandas-dev/pandas/issues/50711 for how this may change in the future.
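If an empty string really is the representation you want, one option is to fill the missing values explicitly after the conversion. This is only a sketch that reuses `s2` from the session above. The names `as_string` and `blank_filled` are made up here, and whether `""` is an appropriate stand-in for "missing" depends on your data:

```python
import pandas as pd

# Same categorical Series with a leading missing value as s2 above
s2 = pd.Series(pd.Categorical(["a", "b", "c", "a", "a", "b"])).shift(1)

# Convert to the nullable string dtype; the missing value becomes pd.NA
as_string = s2.astype(pd.StringDtype())

# Replace pd.NA with an empty string explicitly
blank_filled = as_string.fillna("")

print(blank_filled.tolist())

# Comparisons now give a definite True/False for every row
print((blank_filled != "a").tolist())
```

The trade-off is that an empty string is then indistinguishable from a genuinely empty value, which is one reason a conversion would not do this silently.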

Comment From: frbelotto

Thanks. Your explanation is great!

One last point to discuss: even though my categorical np.nan became pd.NA, comparing it to a plain string value doesn't work because "it is not comparable". But if you compare against something that is not comparable, maybe the default result should be "not equal", right?

Comment From: frbelotto

Anyway, that's fine. Thanks.

Comment From: Dr-Irv

> One last point to discuss: even though my categorical np.nan became pd.NA, comparing it to a plain string value doesn't work because "it is not comparable". But if you compare against something that is not comparable, maybe the default result should be "not equal", right?

See discussion here: https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#propagation-in-arithmetic-and-comparison-operations
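To illustrate the propagation described on that page, here is a small sketch. The Series `s3` and the names of the two filled masks are made up for this example. Since `pd.NA` means "unknown", a comparison with it is also unknown, and you can choose how to resolve it with `fillna`:

```python
import pandas as pd

# Comparing pd.NA with anything yields pd.NA, for == and != alike
print(pd.NA == "a")
print(pd.NA != "a")

# Hypothetical string Series with one missing value
s3 = pd.Series(["a", None, "b"], dtype=pd.StringDtype())
mask = s3 != "a"  # nullable boolean mask; the missing row stays <NA>

# Decide explicitly how missing values should count in the comparison
missing_counts_as_different = mask.fillna(True)
missing_counts_as_same = mask.fillna(False)

print(missing_counts_as_different.tolist())
print(missing_counts_as_same.tolist())
```

Filling with True reproduces the "not equal by default" behavior asked about above, but as an explicit choice in your own code rather than a library default.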