Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
from pandas.api.types import CategoricalDtype

def ordered(categories):
    # Shortcut for an ordered categorical dtype
    return CategoricalDtype(categories=categories, ordered=True)

def unordered(categories):
    # Shortcut for an unordered categorical dtype
    return CategoricalDtype(categories=categories, ordered=False)

u1 = pd.Series(['c', 'b', 'b', 'a'], dtype=ordered(['a', 'b', 'c']))
x1 = pd.Series(['c', 'b', 'b', 'a'], dtype=unordered(['a', 'b', 'c']))

x1.dtype == "category", u1.dtype == "category"
## True, True
x1.dtype == CategoricalDtype(None, False), u1.dtype == CategoricalDtype(None, False)
## False, False
Issue Description
The documentation states that all categorical dtypes compare equal to the string "category" and to CategoricalDtype(None, False), but the second comparison returns False, as the result above shows.
Expected Behavior
x1.dtype == CategoricalDtype(None, False), u1.dtype == CategoricalDtype(None, False)
## True, True
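Until the documentation and the behavior agree, a check that does not depend on the categories or on ordered could look like the following. This is only a minimal sketch: `is_categorical` is a hypothetical helper name, not a pandas function, and it relies on nothing more than `isinstance` against `pd.CategoricalDtype`.

```python
import pandas as pd


def is_categorical(dtype) -> bool:
    """Return True for any CategoricalDtype, whatever its categories or order."""
    # Hypothetical helper: a plain isinstance check, so it never goes
    # through CategoricalDtype.__eq__ at all.
    return isinstance(dtype, pd.CategoricalDtype)


ordered_dtype = pd.CategoricalDtype(categories=["a", "b", "c"], ordered=True)
unordered_dtype = pd.CategoricalDtype(categories=["a", "b", "c"], ordered=False)
u1 = pd.Series(["c", "b", "b", "a"], dtype=ordered_dtype)
x1 = pd.Series(["c", "b", "b", "a"], dtype=unordered_dtype)

# Both checks succeed for the ordered and the unordered series.
print(is_categorical(u1.dtype), is_categorical(x1.dtype))
print(u1.dtype == "category", x1.dtype == "category")
```

Comparing against the string "category" works as well, as the example above already shows; the `isinstance` form simply avoids relying on the equality rules.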
Installed Versions
Comment From: kwhkim
I read the source and I suspect the following code is the origin:
def __eq__(self, other: Any) -> bool:
    """
    Rules for CDT equality:
    1) Any CDT is equal to the string 'category'
    2) Any CDT is equal to itself
    3) Any CDT is equal to a CDT with categories=None regardless of ordered
    4) A CDT with ordered=True is only equal to another CDT with
       ordered=True and identical categories in the same order
    5) A CDT with ordered={False, None} is only equal to another CDT with
       ordered={False, None} and identical categories, but same order is
       not required. There is no distinction between False/None.
    6) Any other comparison returns False
    """
    if isinstance(other, str):
        return other == self.name
    elif other is self:
        return True
    elif not (hasattr(other, "ordered") and hasattr(other, "categories")):
        return False
    elif self.categories is None or other.categories is None:
        # For non-fully-initialized dtypes, these are only equal to
        #  - the string "category" (handled above)
        #  - other CategoricalDtype with categories=None
        return self.categories is other.categories
The code is from lines 362-385 of pandas/pandas/core/dtypes/dtypes.py.
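The rules listed in the docstring can be checked one by one against an installed pandas. The snippet below is a small sketch of such a check; the variable names are made up for illustration, and the comments describe what the rules and the report in this issue lead one to expect.

```python
import pandas as pd

abc_ordered = pd.CategoricalDtype(["a", "b", "c"], ordered=True)
cba_ordered = pd.CategoricalDtype(["c", "b", "a"], ordered=True)
abc_unordered = pd.CategoricalDtype(["a", "b", "c"], ordered=False)
cba_unordered = pd.CategoricalDtype(["c", "b", "a"], ordered=False)
generic = pd.CategoricalDtype(None, False)

# Rule 1: any CDT equals the string "category".
rule1 = bool(abc_ordered == "category")
# Rule 3 as written in the docstring; the quoted branch returns False here,
# which is the mismatch this issue is about.
rule3 = bool(abc_ordered == generic)
# Rule 4: ordered dtypes need the same categories in the same order.
rule4 = bool(abc_ordered == cba_ordered)
# Rule 5: unordered dtypes only need the same set of categories.
rule5 = bool(abc_unordered == cba_unordered)

print(rule1, rule3, rule4, rule5)
```

Rules 1, 4 and 5 behave as documented; only rule 3 disagrees with the code path quoted above.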
According to rule 3 of the rules for CDT equality (any CDT is equal to a CDT with categories=None regardless of ordered), it should be changed to
    elif self.categories is None or other.categories is None:
        # For non-fully-initialized dtypes, these are only equal to
        #  - the string "category" (handled above)
        #  - other CategoricalDtype with categories=None
        return True
and the documentation needs to be modified to incorporate the fact that the comparison should be True when the categories of either of the two is None, regardless of ordered.
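To make the difference between the current branch and the proposed one concrete, here is a toy model in plain Python. It is not pandas code: `ToyCDT`, `eq_current` and `eq_proposed` are hypothetical names, categories are simplified to tuples of strings, and only the branches discussed above are modelled.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ToyCDT:
    """Hypothetical stand-in for CategoricalDtype (illustration only)."""
    categories: Optional[Tuple[str, ...]] = None
    ordered: Optional[bool] = None


def _eq_initialized(a: ToyCDT, b: ToyCDT) -> bool:
    # Rules 4 and 5: both dtypes have categories at this point.
    if a.ordered or b.ordered:
        # At least one is ordered: both must be ordered, same order required.
        return bool(a.ordered) == bool(b.ordered) and a.categories == b.categories
    # Neither is ordered: only the set of categories matters.
    return set(a.categories) == set(b.categories)


def eq_current(a: ToyCDT, b: ToyCDT) -> bool:
    # Mirrors the quoted branch: categories=None only matches categories=None.
    if a.categories is None or b.categories is None:
        return a.categories is b.categories
    return _eq_initialized(a, b)


def eq_proposed(a: ToyCDT, b: ToyCDT) -> bool:
    # Proposed change: categories=None on either side matches anything.
    if a.categories is None or b.categories is None:
        return True
    return _eq_initialized(a, b)


generic = ToyCDT(None, False)
abc_ordered = ToyCDT(("a", "b", "c"), True)
abc_unordered = ToyCDT(("a", "b", "c"), False)

print(eq_current(abc_ordered, generic), eq_proposed(abc_ordered, generic))
print(eq_proposed(abc_unordered, generic), eq_proposed(abc_ordered, abc_unordered))
```

In this toy model the proposed rule makes equality non-transitive: both `abc_ordered` and `abc_unordered` compare equal to `generic`, yet they do not compare equal to each other. That trade-off may be worth weighing when deciding whether to change the code or the documentation.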
I am going to take
Comment From: simonjayhawkins
Thanks @kwhkim for the report. Can you update the code sample so that it is reproducible? ordered and unordered are not defined.
Comment From: kwhkim
@simonjayhawkins oh my bad. it's just a short-cut I defined. I modified the code to include functions needed.
Comment From: topper-123
I don't think there is any good reason why any CategoricalDtype should compare equal to CategoricalDtype(None, False).
So I'd say that docs snippet is just wrong here IMO.