Research

  • [X] I have searched the [pandas] tag on StackOverflow for similar questions.

  • [X] I have asked my usage related question on StackOverflow.

Link to question on StackOverflow

https://stackoverflow.com/questions/74417232/pandas-groupy-aggregate-does-not-see-column

Question about pandas

Hello guys, I am not understanding the difference between .agg and .aggregate behavior in this situation. I posted it on StackOverflow a while ago, but no one has helped me yet.

I am working on a huge database where I used a pandas apply to categorize the type of client based on the type of product they consumed:

Sample DF:

import pandas as pd
columns = pd.MultiIndex(
    levels=[['id', 'product', 'type'], ['', 'afiliates', 'giftcards']],
    codes=[[0, 1, 1, 2], [0, 1, 2, 0]],
    names=[None, 'product'],
)
base = pd.DataFrame([[2, 1, 0, 'afiliates'], [4, 0, 1, 'gift']], columns=columns)
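
For clarity, the columns this creates are a two-level MultiIndex; printing them should show:

print(base.columns.tolist())
# [('id', ''), ('product', 'afiliates'), ('product', 'giftcards'), ('type', '')]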

So far, so good. If I run a groupby.agg, I get these results:

results = base[['type','id']].groupby(['type'], dropna=False).agg('count')

[screenshot of the resulting DataFrame]

But if instead of agg I try aggregate, it does not work:

results = base[['type','id']].groupby(['type']).aggregate({'id': 'count'})

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[10], line 2
      1 #results = base[['type','id']].groupby(['type'], dropna=False).agg('count')
----> 2 results = base[['type','id']].groupby(['type']).aggregate({'id': 'count'})

File c:\Users\fabio\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\groupby\generic.py:894, in DataFrameGroupBy.aggregate(self, func, engine, engine_kwargs, *args, **kwargs)
    891 func = maybe_mangle_lambdas(func)
    893 op = GroupByApply(self, func, args, kwargs)
--> 894 result = op.agg()
    895 if not is_dict_like(func) and result is not None:
    896     return result

File c:\Users\fabio\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\apply.py:169, in Apply.agg(self)
    166     return self.apply_str()
    168 if is_dict_like(arg):
--> 169     return self.agg_dict_like()
    170 elif is_list_like(arg):
    171     # we require a list, but not a 'str'
    172     return self.agg_list_like()

File c:\Users\fabio\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\apply.py:478, in Apply.agg_dict_like(self)
    475     selected_obj = obj._selected_obj
    476     selection = obj._selection
--> 478 arg = self.normalize_dictlike_arg("agg", selected_obj, arg)
...
    606 # eg. {'A' : ['mean']}, normalize all to
    607 # be list-likes
    608 # Cannot use func.values() because arg may be a Series

KeyError: "Column(s) ['id'] do not exist"

What am I missing?

Comment From: frbelotto

I believe it is something related to the "tuple" columns due to unstack, but I couldn't make it work!

Comment From: phofl

Can you trim down your example? Please post only necessary steps to reproduce and reduce the size of your DataFrames. Also, which pandas version are you on?

Comment From: frbelotto

Hello, I've posted all the code needed to create a data sample and the code for execution.

I am using 1.5.2

Comment From: phofl

Yes, but your example is huge; please reduce your code to the smallest possible snippet that still reproduces the problem.

Comment From: frbelotto

I've removed all the code lines I could without breaking the reproducible steps.

Comment From: MarcoGorelli

Your dataframe has 1000 rows - is that necessary for the example to fail? Does it also fail with, say, 5 rows?

Comment From: frbelotto

As it is creating a sample DataFrame to exemplify the question, I just set a number of rows to make it look like a real dataframe. Adjusting the code to just 5 rows would only make it smaller, so I thought it wasn't something to worry about.

Comment From: MarcoGorelli

I think the issue is just that base's columns are a MultiIndex - a simple reproducible example would be a small dataframe with MultiIndex columns which throws the above error.

Comment From: frbelotto

I agree that it must be something related to the MultiIndex; that is why I kept track of how I "constructed" the base DataFrame.

I've tried to change the last "groupby aggregate" to use column names composed of tuples (the way I've learned to access columns from a MultiIndex), but I couldn't get it to work.

Comment From: frbelotto

The column names appear in this form with base.info():

[screenshot of base.info() output showing the column names]

So I've tried using this code:

results = base[['(type, )','(id, )']].groupby(['(type, )']).aggregate({'(id, )': 'count'})

The error message changes, but it still does not work:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[8], line 2
      1 #results = base[['type','id']].groupby(['type'], dropna=False).agg('count')
----> 2 results = base[['(type, )','(id, )']].groupby(['(type, )']).aggregate({'(id, )': 'count'})
      3 #results = base[['type','id']].groupby(['type']).aggregate({'id': 'count'})

File c:\Users\fabio\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\frame.py:3811, in DataFrame.__getitem__(self, key)
   3809     if is_iterator(key):
   3810         key = list(key)
-> 3811     indexer = self.columns._get_indexer_strict(key, "columns")[1]
   3813 # take() does not accept boolean indexers
   3814 if getattr(indexer, "dtype", None) == bool:

File c:\Users\fabio\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\indexes\multi.py:2623, in MultiIndex._get_indexer_strict(self, key, axis_name)
   2620 if len(keyarr) and not isinstance(keyarr[0], tuple):
   2621     indexer = self._get_indexer_level_0(keyarr)
-> 2623     self._raise_if_missing(key, indexer, axis_name)
   2624     return self[indexer], indexer
   2626 return super()._get_indexer_strict(key, axis_name)

File c:\Users\fabio\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\indexes\multi.py:2641, in MultiIndex._raise_if_missing(self, key, indexer, axis_name)
   2639 cmask = check == -1
   2640 if cmask.any():
-> 2641     raise KeyError(f"{keyarr[cmask]} not in index")
...
   2642 # We get here when levels still contain values which are not
   2643 # actually in Index anymore
   2644 raise KeyError(f"{keyarr} not in index")

KeyError: "['(type, )' '(id, )'] not in index"


Comment From: MarcoGorelli

This doesn't look right:

https://github.com/pandas-dev/pandas/blob/073e48b7cfe85b7a1173a275296471b51276826d/pandas/tests/tools/test_to_datetime.py#L142-L148

If the input is invalid, errors='ignore' should return the input

Comment From: rhshadrach

@frbelotto - regarding a minimal reproducible example, from the discussion above it seems to me the issue is the line:

results = base[['type','id']].groupby(['type']).aggregate({'id': 'count'})

So for a minimal example, all the steps that go into constructing base are very likely unnecessary, and only presenting base, constructed directly, is sufficient. Also, being able to see all the data (e.g. 2 rows) goes a long way in understanding the issue, assuming it can be reproduced like that.

columns = pd.MultiIndex(
    levels=[['id', 'product', 'type'], ['', 'afiliates', 'giftcards']],
    codes=[[0, 1, 1, 2], [0, 1, 2, 0]],
    names=[None, 'product'],
)
base = pd.DataFrame([[2, 1, 0, 'afiliates'], [4, 0, 1, 'gift']], columns=columns)
print(base)
#         id   product                 type
# product    afiliates giftcards           
# 0        2         1         0  afiliates
# 1        4         0         1       gift

results = base[['type','id']].groupby(['type']).aggregate({'id': 'count'})
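# should raise KeyError: "Column(s) ['id'] do not exist", as in the original report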

Comment From: frbelotto

This doesn't look right:

https://github.com/pandas-dev/pandas/blob/073e48b7cfe85b7a1173a275296471b51276826d/pandas/tests/tools/test_to_datetime.py#L142-L148

If the input is invalid, errors='ignore' should return the input

Was this message for this discussion?

Comment From: frbelotto

columns = pd.MultiIndex(
    levels=[['id', 'product', 'type'], ['', 'afiliates', 'giftcards']],
    codes=[[0, 1, 1, 2], [0, 1, 2, 0]],
    names=[None, 'product'],
)
base = pd.DataFrame([[2, 1, 0, 'afiliates'], [4, 0, 1, 'gift']], columns=columns)
print(base)
#         id   product                 type
# product    afiliates giftcards
# 0        2         1         0  afiliates
# 1        4         0         1       gift

results = base[['type','id']].groupby(['type']).aggregate({'id': 'count'})

Thanks. I am not really good at handling MultiIndexes, and I didn't know how to recreate it without all the steps I had made. I will update the code above with your code, given that it seems to fully reproduce the error.

Comment From: MarcoGorelli

Was this message for this discussion?

so sorry, had too many tabs open 😳

Comment From: rhshadrach

@frbelotto - you can see how to access your columns using:

print(base.columns.tolist())
[('id', ''), ('product', 'afiliates'), ('product', 'giftcards'), ('type', '')]

When you have a MultiIndex for columns, you need to specify each level as a tuple. So you can do:

base[['type','id']].groupby(['type']).aggregate({('id', ''): 'count'})

Regarding the title of this issue - agg and aggregate are aliases, they do not behave differently.
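
A quick way to convince yourself of that (a small sketch, assuming the base frame from above on pandas 1.5.x):

from pandas.core.groupby.generic import DataFrameGroupBy

# .agg is defined as an alias of .aggregate, so they are the same method object
print(DataFrameGroupBy.agg is DataFrameGroupBy.aggregate)  # should print True

# and calling either with the same arguments should give identical results
gb = base[['type', 'id']].groupby(['type'])
print(gb.agg({('id', ''): 'count'}).equals(gb.aggregate({('id', ''): 'count'})))  # should print True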

Comment From: frbelotto

Thanks! That worked! It was really confusing, given that base[['type','id']].groupby(['type']) uses "simple" column names while aggregate({('id', ''): 'count'}) uses the tuple name.

Comment From: rhshadrach

I suppose there is a bit of an oddity here - why can you do base[['id']] but not specify {'id': ...} in agg? The reason is that column selection can return multiple columns (e.g. in the example here, base[['product']] returns a DataFrame with two columns), whereas agg must have one column and one column only. Thus, it is necessary to specify all levels in agg.
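
To illustrate with a sketch (using the base frame from earlier in the thread):

# selecting by a level-0 label can match several columns...
print(base[['product']].columns.tolist())
# [('product', 'afiliates'), ('product', 'giftcards')]

# ...whereas a dict key in agg has to identify exactly one column, hence the full tuple
base[['type', 'id']].groupby(['type']).agg({('id', ''): 'count'})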

We could make it so that if specifying a subset of levels uniquely identifies the column it would work in agg, but I'm not certain this is a good idea.

Comment From: frbelotto

Got it, thanks. Now that I understand the reason for each behavior, I don't see any change as needed.