Pandas cryptic DataFrame.agg error when using dictionaries

Not sure if this is a bug. This works:

import numpy as np
import pandas as pd

(
    pd.DataFrame({"u": [2,1,4,2,5], "a": ["a", "a", "b", "a", "b"]})
    .groupby("a")
    .agg(lambda x: np.mean(x)/np.std(x))
)

while this returns an error:

(
    pd.DataFrame({"u": [2,1,4,2,5], "a": ["a", "a", "b", "a", "b"]})
    .groupby("a")
    .agg({"blah": lambda x: np.mean(x)/np.std(x)})
)

error: KeyError: 'blah'
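
For comparison, here is a minimal sketch of the same call with the dict keyed by the existing column `u`. The names `df`, `ratio`, `ok` and `raised` are only for illustration, and `x.std()` stands in for `np.std` to keep the snippet numpy-free, so the numbers may differ slightly from the calls above. The `KeyError` is what 0.19 reports here; I'd expect the same exception type on recent releases, but that is an assumption.

```python
import pandas as pd

df = pd.DataFrame({"u": [2, 1, 4, 2, 5], "a": ["a", "a", "b", "a", "b"]})
ratio = lambda x: x.mean() / x.std()

# Dict key is an existing column: the function is applied to "u" only.
ok = df.groupby("a").agg({"u": ratio})
print(ok)

# Dict key is not a column: pandas tries to look up "blah" and fails.
try:
    df.groupby("a").agg({"blah": ratio})
    raised = False
except KeyError:
    raised = True
print("KeyError raised:", raised)
```

So in the dict form the key selects an input column; it does not name the output.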

## LONG ERROR MESSAGE

KeyError                                  Traceback (most recent call last)
/opt/local/lib/python3.6/site-packages/pandas/indexes/base.py in get_loc(self, key, method, tolerance)
   2103             try:
-> 2104                 return self._engine.get_loc(key)
   2105             except KeyError:

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4160)()

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4024)()

pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13161)()

pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13115)()

KeyError: 'blah'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-100-4f93bf630ec4> in <module>()
      2     pd.DataFrame({"u": [2,1,4,2,5], "a": ["a", "a", "b", "a", "b"]})
      3     .groupby("a")
----> 4     .agg({"blah": lambda x: np.mean(x)/np.std(x)})
      5 )

/opt/local/lib/python3.6/site-packages/pandas/core/groupby.py in aggregate(self, arg, *args, **kwargs)
   3697     @Appender(SelectionMixin._agg_doc)
   3698     def aggregate(self, arg, *args, **kwargs):
-> 3699         return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
   3700 
   3701     agg = aggregate

/opt/local/lib/python3.6/site-packages/pandas/core/groupby.py in aggregate(self, arg, *args, **kwargs)
   3195 
   3196         _level = kwargs.pop('_level', None)
-> 3197         result, how = self._aggregate(arg, _level=_level, *args, **kwargs)
   3198         if how is None:
   3199             return result

/opt/local/lib/python3.6/site-packages/pandas/core/base.py in _aggregate(self, arg, *args, **kwargs)
    547 
    548                 try:
--> 549                     result = _agg(arg, _agg_1dim)
    550                 except SpecificationError:
    551 

/opt/local/lib/python3.6/site-packages/pandas/core/base.py in _agg(arg, func)
    498                 result = compat.OrderedDict()
    499                 for fname, agg_how in compat.iteritems(arg):
--> 500                     result[fname] = func(fname, agg_how)
    501                 return result
    502 

/opt/local/lib/python3.6/site-packages/pandas/core/base.py in _agg_1dim(name, how, subset)
    477                 aggregate a 1-dim with how
    478                 """
--> 479                 colg = self._gotitem(name, ndim=1, subset=subset)
    480                 if colg.ndim != 1:
    481                     raise SpecificationError("nested dictionary is ambiguous "

/opt/local/lib/python3.6/site-packages/pandas/core/groupby.py in _gotitem(self, key, ndim, subset)
   3724         elif ndim == 1:
   3725             if subset is None:
-> 3726                 subset = self.obj[key]
   3727             return SeriesGroupBy(subset, selection=key,
   3728                                  grouper=self.grouper)

/opt/local/lib/python3.6/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2055             return self._getitem_multilevel(key)
   2056         else:
-> 2057             return self._getitem_column(key)
   2058 
   2059     def _getitem_column(self, key):

/opt/local/lib/python3.6/site-packages/pandas/core/frame.py in _getitem_column(self, key)
   2062         # get column
   2063         if self.columns.is_unique:
-> 2064             return self._get_item_cache(key)
   2065 
   2066         # duplicate columns & possible reduce dimensionality

/opt/local/lib/python3.6/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
   1384         res = cache.get(item)
   1385         if res is None:
-> 1386             values = self._data.get(item)
   1387             res = self._box_item_values(item, values)
   1388             cache[item] = res

/opt/local/lib/python3.6/site-packages/pandas/core/internals.py in get(self, item, fastpath)
   3518 
   3519             if not isnull(item):
-> 3520                 loc = self.items.get_loc(item)
   3521             else:
   3522                 indexer = np.arange(len(self.items))[isnull(self.items)]

/opt/local/lib/python3.6/site-packages/pandas/indexes/base.py in get_loc(self, key, method, tolerance)
   2104                 return self._engine.get_loc(key)
   2105             except KeyError:
-> 2106                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2107 
   2108         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4160)()

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4024)()

pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13161)()

pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13115)()

KeyError: 'blah'


INSTALLED VERSIONS
------------------
commit: None
python: 3.6.0.alpha.3
python-bits: 64
OS: Linux
OS-release: 3.14.32-xxxx-grs-ipv6-64
machine: x86_64
processor: 
byteorder: little
LC_ALL: None
LANG: en_IE.UTF-8
LOCALE: en_IE.UTF-8

pandas: 0.19.0
nose: None
pip: 8.1.2
setuptools: 28.3.0
Cython: 0.24.1
numpy: 1.11.2
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.7
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 2.0.0b3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: None
pandas_datareader: None

**Comment From: jreback** pls show the error. note we have no support for 3.6 atm (it may work, but it's not tested at all)

**Comment From: myyc** sorry, updated with the error message. i believe python 3.6.0 is a detail, i encountered this error on 3.5 too.

**Comment From: jreback** yes, 3.6 doesn't matter. how is that cryptic? you get a KeyError: `blah` doesn't exist

**Comment From: jreback** http://pandas.pydata.org/pandas-docs/stable/groupby.html#applying-different-functions-to-dataframe-columns when aggregating a dataframe with a dict of name to function, each name has to be a valid column name

**Comment From: myyc** So the name of the column in the dict has to be the name of a column in the `DataFrame`. However, if I do things like `.agg([np.sum, lambda x: func1(x), func2])` I get three columns, the second of which will have `<lambda>` as its name. Wouldn't it be wise to have (as an option) something that allows the opposite behaviour? As in, passing a dictionary like `{"col": lambda x: func1(x)}` to obtain a column named `col` with `func1` as its result? Apologies if there is already an option to do so.

**Comment From: jreback** a list works because it applies to all columns, while a dict is selective, mapping column to function. pretty straightforward / you are suggesting something even more confusing

**Comment From: myyc** Sure. Until you try doing this

func1 = lambda x: x.sum()/x.std()
func2 = lambda x: x.mean()/x.std()

(
    pd.DataFrame({"u": [2,1,4,2,5], "v": [3,1,5,2,3], "a": ["a", "a", "b", "a", "b"]})
    .groupby("a")
    .agg([np.sum, np.mean, func1, func2])
)

and you see it throwing this error: `SpecificationError: Function names must be unique, found multiple named <lambda>`. I don't think that aggregating by some random function and renaming the aggregated column is that weird ...

**Comment From: myyc** What I mean is something like this

(
    pd.DataFrame({"u": [2,1,4,2,5], "v": [3,1,5,2,3], "a": ["a", "a", "b", "a", "b"]})
    .groupby("a")
    .agg2({"s": np.sum, "m": np.mean, "f1": func1, "f2": func2})
)

Returning an object with (in this case) a multi-index on the column level, with `s`, `m`, `f1` and `f2` below each column, which would be exactly like `agg`'s behaviour with `[np.sum, np.mean]` as an argument.

**Comment From: jreback** pls read the docs. you can do exactly that if u use a series groupby

**Comment From: myyc** I know it works _exactly this way_ with `Series`. I am not sure why there can't be a similar option for `DataFrame`s?

**Comment From: jreback** you can pass a nested dictionary

**Comment From: myyc** Ok so this does exactly what I wanted:

func1 = lambda x: x.sum()/x.std()
func2 = lambda x: x.mean()/x.std()

d = {"s": np.sum, "m": np.mean, "f1": func1, "f2": func2}

(
    pd.DataFrame({"u": [2,1,4,2,5], "v": [3,1,5,2,3], "a": ["a", "a", "b", "a", "b"]})
    .groupby("a")
    .agg({"u": d, "v": d})
)

Would you still think it's completely pointless to have something like `.agg2(d)` with the same behaviour?

**Comment From: myyc** Pointless / confusing

**Comment From: jreback** too confusing to have another function which does exactly the same thing. however, the documentation could use an update on the nested dict case with a dataframe, if you would like to do a pr

**Comment From: myyc** I can update the documentation. I didn't mean to necessarily create another function, but to try and non-destructively add this option to `agg`, as the above example falls to pieces when a `DataFrame` has more columns than just `u` and `v`. Thanks anyway.

**Comment From: jorisvandenbossche** @myyc I certainly understand the use case, but the problem is basically that it is not possible to support this with the same API, as you wouldn't be able to make the distinction anymore between "do you want to apply this function to the specific column?" and "do you want to give this name to the result of this function?" We already had some discussion about this before, see e.g. https://github.com/pandas-dev/pandas/issues/9052 (specifically this post https://github.com/pandas-dev/pandas/issues/9052 where I give an overview of the different possibilities).

I was just thinking: a possible way to give a list of functions to be applied to all columns and also name the results would be to accept dicts or tuples in this list of functions. Currently you can pass a function or a string (for certain known functions); we could maybe extend this to dicts or tuples like:

df.groupby(..).agg(['mean', np.mean, {'average': np.mean}, ('average', np.mean)])
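
For readers on newer pandas: the list syntax above is a proposal, not something to rely on here. Later releases (0.25 and up, as far as I know) added "named aggregation", where each keyword becomes an output column name and each value is a `(column, function)` pair or a `pd.NamedAgg`. A minimal sketch, assuming such a version is installed; `u_total`, `u_ratio` and `v_ratio` are made-up names:

```python
import pandas as pd

df = pd.DataFrame(
    {"u": [2, 1, 4, 2, 5], "v": [3, 1, 5, 2, 3], "a": ["a", "a", "b", "a", "b"]}
)

# Keyword = name of the output column; value = (input column, function).
named = df.groupby("a").agg(
    u_total=("u", "sum"),
    u_ratio=("u", lambda x: x.mean() / x.std()),
    # pd.NamedAgg is the explicit spelling of the same (column, aggfunc) pair.
    v_ratio=pd.NamedAgg(column="v", aggfunc=lambda x: x.mean() / x.std()),
)
print(named)
```

This gives flat, user-chosen column names, but every input column still has to be listed explicitly, so it does not by itself apply one renamed function to all columns.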

**Comment From: myyc** @jorisvandenbossche Yeah I think that would do the trick, although the syntax is a bit weird (e.g. what happens if you pass stuff like `[np.mean, {'average': np.mean, 'avg': np.mean}]`?). I understand it's impossible with the current system, but don't you think that my proposal (i.e. hypothetically changing `DataFrameGroupBy.agg` altogether) would make it consistent with `SeriesGroupBy.agg`? I don't have usage data to back this change, obviously.

**Comment From: jorisvandenbossche**

> the syntax is a bit weird (e.g. what happens if you pass stuff like `[np.mean, {'average': np.mean, 'avg': np.mean}]`?).

Well yes, that would be a bit strange, as I think we should only accept dicts of length 1 (therefore tuples are maybe more natural, but the dicts are more like how it works for series).

IMO, both usages of the dict are useful. Myself, I more often use the dict to specify different functions for different columns; if you have different data types in different columns that should be aggregated differently, this is very handy. For correct naming I just use a `def ..` function instead of lambda functions. And unfortunately, both usages are not compatible with each other. So switching the one usage for the other does not seem like a clear win to me (breaking code + how do you then specify different functions for different columns?). It would indeed make it more consistent with the series version, but IMO that's not enough to be worth it.
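
The `def` approach mentioned in the last comment can be sketched like this. It relies on `agg` labelling each result with the function's `__name__`, so distinct named functions avoid the duplicate `<lambda>` clash. The names `sum_over_std` and `mean_over_std` are mine and mirror `func1`/`func2` from earlier in the thread; the strings `"sum"` and `"mean"` are used in place of `np.sum`/`np.mean`.

```python
import pandas as pd


def sum_over_std(x):
    # Same computation as func1 in the thread.
    return x.sum() / x.std()


def mean_over_std(x):
    # Same computation as func2 in the thread.
    return x.mean() / x.std()


df = pd.DataFrame(
    {"u": [2, 1, 4, 2, 5], "v": [3, 1, 5, 2, 3], "a": ["a", "a", "b", "a", "b"]}
)

# A list applies every function to every non-grouping column; the second
# column level takes its labels from the function names.
result = df.groupby("a").agg(["sum", "mean", sum_over_std, mean_over_std])
print(result.columns.tolist())
```

Unlike the nested-dict version, this does not need the columns spelled out, so it keeps working when the `DataFrame` has more columns than `u` and `v`.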