Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
from pandas.api.types import CategoricalDtype

def ordered(categories):
    # Shortcut for an ordered categorical dtype.
    return CategoricalDtype(categories=categories, ordered=True)

def unordered(categories):
    # Shortcut for an unordered categorical dtype.
    return CategoricalDtype(categories=categories, ordered=False)

u1 = pd.Series(['c', 'b', 'b', 'a'], dtype=ordered(['a', 'b', 'c']))
x1 = pd.Series(['c', 'b', 'b', 'a'], dtype=unordered(['a', 'b', 'c']))
x1.dtype == "category", u1.dtype == "category"
## True, True
x1.dtype == CategoricalDtype(None, False), u1.dtype == CategoricalDtype(None, False)
## False, False
Issue Description
The documentation says that all categorical dtypes compare equal to the string "category" and to CategoricalDtype(None, False). The first comparison holds, but the second returns False, as the result above shows.
Expected Behavior
x1.dtype == CategoricalDtype(None, False), u1.dtype == CategoricalDtype(None, False)
## True, True
Installed Versions
Comment From: kwhkim
I read the source and I suspect the following code is the origin,
def __eq__(self, other: Any) -> bool:
    """
    Rules for CDT equality:
    1) Any CDT is equal to the string 'category'
    2) Any CDT is equal to itself
    3) Any CDT is equal to a CDT with categories=None regardless of ordered
    4) A CDT with ordered=True is only equal to another CDT with
       ordered=True and identical categories in the same order
    5) A CDT with ordered={False, None} is only equal to another CDT with
       ordered={False, None} and identical categories, but same order is
       not required. There is no distinction between False/None.
    6) Any other comparison returns False
    """
    if isinstance(other, str):
        return other == self.name
    elif other is self:
        return True
    elif not (hasattr(other, "ordered") and hasattr(other, "categories")):
        return False
    elif self.categories is None or other.categories is None:
        # For non-fully-initialized dtypes, these are only equal to
        #  - the string "category" (handled above)
        #  - other CategoricalDtype with categories=None
        return self.categories is other.categories
The code is from lines 362-385 of pandas/pandas/core/dtypes/dtypes.py.
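The quoted branches are enough to reproduce the reported result without pandas. The following is a minimal pure-Python sketch: `ToyDtype` is a hypothetical stand-in, not the pandas class. It copies the branches quoted above and assumes that the part of the method not shown in the excerpt follows rules 4 and 5 of the docstring.

```python
from typing import Any, Optional, Sequence


class ToyDtype:
    """Hypothetical stand-in for CategoricalDtype (not the pandas class)."""

    name = "category"

    def __init__(self, categories: Optional[Sequence[str]] = None,
                 ordered: Optional[bool] = False) -> None:
        self.categories = None if categories is None else tuple(categories)
        self.ordered = ordered

    def __eq__(self, other: Any) -> bool:
        if isinstance(other, str):
            return other == self.name
        elif other is self:
            return True
        elif not (hasattr(other, "ordered") and hasattr(other, "categories")):
            return False
        elif self.categories is None or other.categories is None:
            # Same test as the quoted source: True only if BOTH are None.
            return self.categories is other.categories
        # Assumption: the remainder follows rules 4 and 5 of the docstring.
        elif self.ordered or other.ordered:
            return (bool(self.ordered) == bool(other.ordered)
                    and self.categories == other.categories)
        return set(self.categories) == set(other.categories)


x1 = ToyDtype(["a", "b", "c"], ordered=False)
u1 = ToyDtype(["a", "b", "c"], ordered=True)
generic = ToyDtype(None, False)

results = (x1 == "category", u1 == "category", x1 == generic, u1 == generic)
print(results)
```

In this sketch the comparison with `generic` falls into the `categories is None` branch, where a tuple of categories is never the same object as `None`, so the branch returns False even though rule 3 of the docstring says otherwise.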
According to rule 3 of the Rules for CDT equality (Any CDT is equal to a CDT with categories=None regardless of ordered), it should be changed to
    elif self.categories is None or other.categories is None:
        # For non-fully-initialized dtypes, these are only equal to
        #  - the string "category" (handled above)
        #  - other CategoricalDtype with categories=None
        return True
and the documentation needs to be modified to incorporate the fact that the comparison should be True when the categories of either of the two is None, regardless of ordered.
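The effect of the suggested one-line change can be sketched in isolation. The names below (`Dt`, `eq_current`, `eq_proposed`, `_eq_initialized`) are hypothetical and model a dtype as a plain (categories, ordered) pair. The handling of fully initialized dtypes is an assumption based on rules 4 and 5 of the docstring, not a copy of the pandas source.

```python
from typing import NamedTuple, Optional, Tuple


class Dt(NamedTuple):
    """Hypothetical (categories, ordered) pair standing in for a dtype."""
    categories: Optional[Tuple[str, ...]]
    ordered: bool = False


def _eq_initialized(a: Dt, b: Dt) -> bool:
    # Assumed from rules 4 and 5: ordered dtypes need identical order,
    # unordered dtypes only need the same set of categories.
    if a.ordered or b.ordered:
        return a.ordered == b.ordered and a.categories == b.categories
    return set(a.categories) == set(b.categories)


def eq_current(a: Dt, b: Dt) -> bool:
    # Branch as quoted from the source: equal only if both are None.
    if a.categories is None or b.categories is None:
        return a.categories is b.categories
    return _eq_initialized(a, b)


def eq_proposed(a: Dt, b: Dt) -> bool:
    # Branch with the suggested change: equal if either is None.
    if a.categories is None or b.categories is None:
        return True
    return _eq_initialized(a, b)


x1 = Dt(("a", "b", "c"), ordered=False)
u1 = Dt(("a", "b", "c"), ordered=True)
generic = Dt(None, False)

print(eq_current(x1, generic), eq_current(u1, generic))
print(eq_proposed(x1, generic), eq_proposed(u1, generic))
```

Under these assumptions the current branch gives False for both comparisons and the proposed branch gives True for both, which matches the expected behavior stated in the report.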
I am going to take
Comment From: simonjayhawkins
Thanks @kwhkim for the report. Can you update the code sample so that it is reproducible? ordered and unordered are not defined.
Comment From: kwhkim
@simonjayhawkins Oh, my bad. They are just shortcuts I defined. I have modified the code to include the functions needed.
Comment From: topper-123
I don't think there is any good reason why any CategoricalDtype should compare equal to CategoricalDtype(None, False).
So I'd say that docs snippet is just wrong here IMO.
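One consideration that may be relevant here, although it is not raised in the thread: if every dtype compared equal to a dtype with categories=None, equality between dtype objects would not be transitive. The sketch below is a hypothetical model, not pandas code. It represents an unordered dtype as a frozenset of categories, uses None for the uninitialized dtype, and ignores ordered entirely.

```python
from itertools import permutations
from typing import FrozenSet, Optional

ToyCats = Optional[FrozenSet[str]]


def eq_proposed(a: ToyCats, b: ToyCats) -> bool:
    """Hypothetical equality where categories=None matches everything."""
    if a is None or b is None:
        return True
    return a == b


dtypes = [frozenset("ab"), None, frozenset("xy")]

# Collect triples where a == b and b == c hold but a == c does not.
violations = [
    (a, b, c)
    for a, b, c in permutations(dtypes, 3)
    if eq_proposed(a, b) and eq_proposed(b, c) and not eq_proposed(a, c)
]
print(len(violations))
```

In this model the uninitialized dtype equals both initialized dtypes, while the two initialized dtypes are not equal to each other, so the list of violations is not empty.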