Research
- [X] I have searched the [pandas] tag on StackOverflow for similar questions.
- [X] I have asked my usage related question on StackOverflow.
Link to question on StackOverflow
https://stackoverflow.com/questions/75406055/weird-behavior-on-string-vs-categorical-dtypes/75406177#75406177
Question about pandas
Hello, I've been struggling with a piece of code where I had to convert a categorical column back to a string column. After discussing it on StackOverflow, I got help understanding the behavior:
With category, the type of the nan value is 'float' (as a typical nan) and you can use it in comparisons, so data.partner != 'A' will be True for all the rows with NaN.
When converting the type to string, the nan value is converted to pandas._libs.missing.NAType, and you can't use this in comparisons, so now data.partner != 'A' returns <NA> for those rows,
which is not True, and the result differs.
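For readers following along, here is a minimal sketch of the difference described above. The three-row `partner_cat` column is a hypothetical stand-in for `data.partner`, not the data from the original question.

```python
import pandas as pd

# Hypothetical stand-in for the `partner` column; None becomes a missing value.
partner_cat = pd.Series(pd.Categorical(["A", "B", None]))
partner_str = partner_cat.astype(pd.StringDtype())

# Categorical: the missing row compares as "not equal", so the mask is plain True/False.
mask_cat = partner_cat != "A"

# StringDtype: the missing row yields <NA> in the mask instead of True.
mask_str = partner_str != "A"

print(mask_cat.tolist())
print(mask_str.isna().tolist())

# One way to get the categorical-style answer back: decide explicitly
# that a missing value should count as "not equal".
print(mask_str.fillna(True).tolist())
```

Filling the mask with `True` is only one possible choice; whether a missing partner should count as "not A" depends on the analysis.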
My question is: shouldn't it be more "common sense", or expected, that a categorical NaN converted to a string just becomes a blank/empty string instead of a "pandas._libs.missing.NAType"?
Thanks, and sorry for bothering you. I just want to understand the reasons behind the current behavior so I won't miss it again!
Comment From: jbrockmendel
@jorisvandenbossche @Dr-Irv getting pd.NA is causing this user confusion. Can you help out?
Comment From: Dr-Irv
My question is: shouldn't it be more "common sense", or expected, that a categorical NaN converted to a string just becomes a blank/empty string instead of a "pandas._libs.missing.NAType"?
There are two representations of "missing values" in pandas. For "traditional" dtypes (`float`, `int`, `object`, `Categorical`), the value `np.nan` is used to represent a missing value. For the "nullable" dtypes (`StringDtype`, `Int64Dtype`, etc.), the value `pd.NA` is used to represent a missing value. Since `np.nan` is used with `Categorical`, when you convert that to a string using `StringDtype`, then `np.nan` is converted to `pd.NA`. If you wanted to preserve the `np.nan` value, then you should use `object` dtype. See below:
In [1]: import pandas as pd
In [2]: s = pd.Series(pd.Categorical(["a", "b", "c", "a", "a", "b"]))
In [3]: s
Out[3]:
0 a
1 b
2 c
3 a
4 a
5 b
dtype: category
Categories (3, object): ['a', 'b', 'c']
In [4]: s2 = s.shift(1)
In [5]: s2
Out[5]:
0 NaN
1 a
2 b
3 c
4 a
5 a
dtype: category
Categories (3, object): ['a', 'b', 'c']
In [6]: s2.iloc[0]
Out[6]: nan
In [7]: s3 = s2.astype(pd.StringDtype())
In [8]: s3.iloc[0]
Out[8]: <NA>
In [9]: s4 = s2.astype(object)
In [10]: s4.iloc[0]
Out[10]: nan
See https://github.com/pandas-dev/pandas/issues/50711 for how this may change in the future.
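If an empty string really is the desired result, it can be requested explicitly after the conversion. The following is only a sketch reusing the `s2` series from the session above; the name `blank` is made up for illustration. The fill happens after `astype`, because filling the categorical itself would first require `""` to be one of its categories.

```python
import pandas as pd

# Same construction as s2 in the session above: a categorical with a leading NaN.
s2 = pd.Series(pd.Categorical(["a", "b", "c", "a", "a", "b"])).shift(1)

# Convert to StringDtype first, then replace <NA> with an empty string.
blank = s2.astype(pd.StringDtype()).fillna("")

print(blank.iloc[0] == "")

# With no missing values left, comparisons give ordinary True/False everywhere.
print((blank != "a").tolist())
```

Note that this discards the distinction between "missing" and "genuinely empty", which is presumably why pandas does not do it by default.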
Comment From: frbelotto
Thanks. Your explanation is great!
One last point to discuss: even though my np.nan categorical becomes pd.NA, comparing it to a simple string value won't work because "it is not comparable". But if you compare against something that is not comparable, maybe the default result should be "not equal", right?
Comment From: frbelotto
Anyway, that's fine. Thanks.
Comment From: Dr-Irv
One last point to discuss: even though my np.nan categorical becomes pd.NA, comparing it to a simple string value won't work because "it is not comparable". But if you compare against something that is not comparable, maybe the default result should be "not equal", right?
See discussion here: https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#propagation-in-arithmetic-and-comparison-operations
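As a rough illustration of the propagation rule discussed on that page, here is a scalar-level sketch. The variable names are made up for this example.

```python
import numpy as np
import pandas as pd

# np.nan follows ordinary float rules: it is simply "not equal" to a string.
nan_result = np.nan != "A"

# pd.NA propagates instead: the answer is "unknown", neither True nor False.
na_result = pd.NA != "A"

print(nan_result)
print(na_result)
print(pd.isna(na_result))

# Because the result is unknown, forcing it into a plain bool raises TypeError.
try:
    bool(na_result)
    coerced = True
except TypeError:
    coerced = False
print(coerced)
```

So the comparison does not default to "not equal": with `pd.NA`, the caller is expected to decide how missing values should be treated, for example with `fillna` on the resulting mask.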