Pandas Groupby.transform returns data with incorrect indexing when grouping on categoricals with missing values

Code Sample, a copy-pastable example if possible

import pandas as pd
from pandas.api.types import CategoricalDtype

cat1_type = CategoricalDtype(categories=[101, 202, 303], ordered=True)
cat2_type = CategoricalDtype(categories=["a", "b", "c", "d"], ordered=True)

df = pd.DataFrame({'cat1': [101, 101, 101, 101, 101, 101], 
                  'vals':[1, 2, 3, 4, 5, 6],
                  'cat2': ["a", "b", "d", "d", "a", "c"]
                  })

# Original DataFrame
print(df)

# Everything works as expected when "cat1" and "cat2" are not categories
df['cat_max'] = df.groupby(["cat1", "cat2"]).transform(max)
print(df)

# Convert to cat1 and cat2 to categories
df = df.astype({'cat1': cat1_type, 'cat2': cat2_type, 'vals': 'float64'})

# Everything works when there are no missing intermediate categories
df.cat_max = df.groupby(["cat1", "cat2"]).transform(max)
print(df)

# With missing "c" category everything works if observed=True is specified
df_miss_c = df[df.cat2 != "c"].copy()
df_miss_c.loc[:, 'cat_max'] = df_miss_c.groupby(["cat1", "cat2"], observed=True).transform(max)
print(df_miss_c)

# With missing "c" category the next group picks up its NaNs
df_miss_c = df[df.cat2 != "c"].copy()
df_miss_c.loc[:, 'cat_max'] = df_miss_c.groupby(["cat1", "cat2"]).transform(max)
print(df_miss_c)

# With with missing "202" category you get two different NaNs
# If you group by cat1 then cat2 you get missing values like above
df_miss_202 = df.copy()
df_miss_202.loc[5, 'cat1'] = 303
df_miss_202.cat_max = df_miss_202.groupby(["cat2", "cat1"]).transform(max)
print(df_miss_202)

# With with missing "101" everything works as expected as long as only one value for cat1 is used
df_miss_101 = df.copy()
df_miss_101.cat1 = 202
df_miss_101.cat_max = df_miss_101.groupby(["cat2", "cat1"]).transform(max)
print(df_miss_101)

print(f'pandas version {pd.show_versions()}')

Problem description

From the documentation: Transformation: perform some group-specific computations and return a like-indexed object.

Given that transformation's purpose is to return a like-indexed object is seems that it should do this be default, even without specifying observed=True when there are missing categories. Also the default behavior with missing categories changes depending on which category is missing.

Expected Output

The expected output and output provided if the grouper values are not categories or if observed = True is specified follow.

>>> # Original DataFrame
>>> print(df)
    cat1  vals cat2
0   101     1    a
1   101     2    b
2   101     3    d
3   101     4    d
4   101     5    a
5   101     6    c
>>>
>>> # Everything works when "cat1" and "cat2" are not categories
>>> df['cat_max'] = df.groupby(["cat1", "cat2"]).transform(max)
>>> print(df)
    cat1  vals cat2  cat_max
0   101     1    a        5
1   101     2    b        2
2   101     3    d        4
3   101     4    d        4
4   101     5    a        5
5   101     6    c        6

>>> # Convert to cat1 and cat2 to categories
>>> df = df.astype({'cat1': cat1_type, 'cat2': cat2_type, 'vals': 'float64'})

>>> # With missing "c" category everything works if observed=True is specified
>>> df_miss_c = df[df.cat2 != "c"].copy()
>>> df_miss_c.loc[:, 'cat_max'] = df_miss_c.groupby(["cat1", "cat2"], observed=True).transform(max)
>>> print(df_miss_c)
   cat1  vals cat2  cat_max
0  101   1.0    a      5.0
1  101   2.0    b      2.0
2  101   3.0    d      4.0
3  101   4.0    d      4.0
4  101   5.0    a      5.0

>>> # With with missing "101" everything works as expected as long as only one value for cat1 is used
>>> df_miss_101 = df.copy()
>>> df_miss_101.cat1 = 202
>>> df_miss_101.cat_max = df_miss_101.groupby(["cat2", "cat1"]).transform(max)
>>> print(df_miss_101)
    cat1  vals cat2  cat_max
0   202   1.0    a      5.0
1   202   2.0    b      2.0
2   202   3.0    d      4.0
3   202   4.0    d      4.0
4   202   5.0    a      5.0
5   202   6.0    c      6.0

When an intermediate categorical is missing, "202", or "c" the indexing is thrown off.

>>> # With missing "c" category the next group picks up its NaNs
>>> df_miss_c = df[df.cat2 != "c"].copy()
>>> df_miss_c.loc[:, 'cat_max'] = df_miss_c.groupby(["cat1", "cat2"]).transform(max)
>>> print(df_miss_c)
  cat1  vals cat2  cat_max
0  101   1.0    a      5.0
1  101   2.0    b      2.0
2  101   3.0    d      NaN
3  101   4.0    d      NaN
4  101   5.0    a      5.0
>>>
>>> # With with missing "202" category you get two different NaNs
>>> # If you group by cat1 then cat2 you get missing values like above
>>> df_miss_202 = df.copy()
>>> df_miss_202.loc[5, 'cat1'] = 303
>>> df_miss_202.cat_max = df_miss_202.groupby(["cat2", "cat1"]).transform(max)
>>> print(df_miss_202)
  cat1  vals cat2  cat_max
0  101   1.0    a      5.0
1  101   2.0    b      NaN
2  101   3.0    d      2.0
3  101   4.0    d      2.0
4  101   5.0    a      5.0
5  303   6.0    c      NaN

Output of `pd.show_versions()`

[paste the output of pd.show_versions() here below this line]

INSTALLED VERSIONS

commit : None python : 3.7.4.final.0 python-bits : 64 OS : Windows OS-release : 10 machine : AMD64 processor : Intel64 Family 6 Model 158 Stepping 9, GenuineIntel byteorder : little LC_ALL : None LANG : None LOCALE : None.None

pandas : 0.25.1 numpy : 1.17.2 pytz : 2019.2 dateutil : 2.8.0 pip : 19.2.3 setuptools : 41.2.0 Cython : None pytest : None hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : None pymysql : None psycopg2 : None jinja2 : 2.10 IPython : None pandas_datareader: None bs4 : None bottleneck : None fastparquet : 0.2.1 gcsfs : None lxml.etree : None matplotlib : None numexpr : None odfpy : None openpyxl : None pandas_gbq : None pyarrow : 0.12.0 pytables : None s3fs : None scipy : 1.2.0 sqlalchemy : None tables : None xarray : None xlrd : None xlwt : None xlsxwriter : None

Comment From: rhshadrach

This is now fixed on main and is tested in test_category_order_transformer