Code Sample, a copy-pastable example if possible
import pandas as pd
from pandas.api.types import CategoricalDtype
cat1_type = CategoricalDtype(categories=[101, 202, 303], ordered=True)
cat2_type = CategoricalDtype(categories=["a", "b", "c", "d"], ordered=True)
df = pd.DataFrame({'cat1': [101, 101, 101, 101, 101, 101],
'vals':[1, 2, 3, 4, 5, 6],
'cat2': ["a", "b", "d", "d", "a", "c"]
})
# Original DataFrame
print(df)
# Everything works as expected when "cat1" and "cat2" are not categories
df['cat_max'] = df.groupby(["cat1", "cat2"]).transform(max)
print(df)
# Convert to cat1 and cat2 to categories
df = df.astype({'cat1': cat1_type, 'cat2': cat2_type, 'vals': 'float64'})
# Everything works when there are no missing intermediate categories
df.cat_max = df.groupby(["cat1", "cat2"]).transform(max)
print(df)
# With missing "c" category everything works if observed=True is specified
df_miss_c = df[df.cat2 != "c"].copy()
df_miss_c.loc[:, 'cat_max'] = df_miss_c.groupby(["cat1", "cat2"], observed=True).transform(max)
print(df_miss_c)
# With missing "c" category the next group picks up its NaNs
df_miss_c = df[df.cat2 != "c"].copy()
df_miss_c.loc[:, 'cat_max'] = df_miss_c.groupby(["cat1", "cat2"]).transform(max)
print(df_miss_c)
# With with missing "202" category you get two different NaNs
# If you group by cat1 then cat2 you get missing values like above
df_miss_202 = df.copy()
df_miss_202.loc[5, 'cat1'] = 303
df_miss_202.cat_max = df_miss_202.groupby(["cat2", "cat1"]).transform(max)
print(df_miss_202)
# With with missing "101" everything works as expected as long as only one value for cat1 is used
df_miss_101 = df.copy()
df_miss_101.cat1 = 202
df_miss_101.cat_max = df_miss_101.groupby(["cat2", "cat1"]).transform(max)
print(df_miss_101)
print(f'pandas version {pd.show_versions()}')
Problem description
From the documentation: Transformation: perform some group-specific computations and return a like-indexed object.
Given that transformation's purpose is to return a like-indexed object is seems that it should do this be default, even without specifying observed=True when there are missing categories. Also the default behavior with missing categories changes depending on which category is missing.
Expected Output
The expected output and output provided if the grouper values are not categories or if observed = True is specified follow.
>>> # Original DataFrame
>>> print(df)
cat1 vals cat2
0 101 1 a
1 101 2 b
2 101 3 d
3 101 4 d
4 101 5 a
5 101 6 c
>>>
>>> # Everything works when "cat1" and "cat2" are not categories
>>> df['cat_max'] = df.groupby(["cat1", "cat2"]).transform(max)
>>> print(df)
cat1 vals cat2 cat_max
0 101 1 a 5
1 101 2 b 2
2 101 3 d 4
3 101 4 d 4
4 101 5 a 5
5 101 6 c 6
>>> # Convert to cat1 and cat2 to categories
>>> df = df.astype({'cat1': cat1_type, 'cat2': cat2_type, 'vals': 'float64'})
>>> # With missing "c" category everything works if observed=True is specified
>>> df_miss_c = df[df.cat2 != "c"].copy()
>>> df_miss_c.loc[:, 'cat_max'] = df_miss_c.groupby(["cat1", "cat2"], observed=True).transform(max)
>>> print(df_miss_c)
cat1 vals cat2 cat_max
0 101 1.0 a 5.0
1 101 2.0 b 2.0
2 101 3.0 d 4.0
3 101 4.0 d 4.0
4 101 5.0 a 5.0
>>> # With with missing "101" everything works as expected as long as only one value for cat1 is used
>>> df_miss_101 = df.copy()
>>> df_miss_101.cat1 = 202
>>> df_miss_101.cat_max = df_miss_101.groupby(["cat2", "cat1"]).transform(max)
>>> print(df_miss_101)
cat1 vals cat2 cat_max
0 202 1.0 a 5.0
1 202 2.0 b 2.0
2 202 3.0 d 4.0
3 202 4.0 d 4.0
4 202 5.0 a 5.0
5 202 6.0 c 6.0
When an intermediate categorical is missing, "202", or "c" the indexing is thrown off.
>>> # With missing "c" category the next group picks up its NaNs
>>> df_miss_c = df[df.cat2 != "c"].copy()
>>> df_miss_c.loc[:, 'cat_max'] = df_miss_c.groupby(["cat1", "cat2"]).transform(max)
>>> print(df_miss_c)
cat1 vals cat2 cat_max
0 101 1.0 a 5.0
1 101 2.0 b 2.0
2 101 3.0 d NaN
3 101 4.0 d NaN
4 101 5.0 a 5.0
>>>
>>> # With with missing "202" category you get two different NaNs
>>> # If you group by cat1 then cat2 you get missing values like above
>>> df_miss_202 = df.copy()
>>> df_miss_202.loc[5, 'cat1'] = 303
>>> df_miss_202.cat_max = df_miss_202.groupby(["cat2", "cat1"]).transform(max)
>>> print(df_miss_202)
cat1 vals cat2 cat_max
0 101 1.0 a 5.0
1 101 2.0 b NaN
2 101 3.0 d 2.0
3 101 4.0 d 2.0
4 101 5.0 a 5.0
5 303 6.0 c NaN
Output of pd.show_versions()
[paste the output of pd.show_versions()
here below this line]
INSTALLED VERSIONS
commit : None python : 3.7.4.final.0 python-bits : 64 OS : Windows OS-release : 10 machine : AMD64 processor : Intel64 Family 6 Model 158 Stepping 9, GenuineIntel byteorder : little LC_ALL : None LANG : None LOCALE : None.None
pandas : 0.25.1 numpy : 1.17.2 pytz : 2019.2 dateutil : 2.8.0 pip : 19.2.3 setuptools : 41.2.0 Cython : None pytest : None hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : None pymysql : None psycopg2 : None jinja2 : 2.10 IPython : None pandas_datareader: None bs4 : None bottleneck : None fastparquet : 0.2.1 gcsfs : None lxml.etree : None matplotlib : None numexpr : None odfpy : None openpyxl : None pandas_gbq : None pyarrow : 0.12.0 pytables : None s3fs : None scipy : 1.2.0 sqlalchemy : None tables : None xarray : None xlrd : None xlwt : None xlsxwriter : None
Comment From: rhshadrach
This is now fixed on main and is tested in test_category_order_transformer