Pandas ValueError when summing over column level with skipna=False (Python 3 only)

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones([2, 8]))
df.iloc[0, 0] = np.nan
df.columns = pd.MultiIndex.from_product([list('ab'), list('cd'), list('ef')])
df.sum(axis=1, level=[0, 1], skipna=False)

Problem description

In Python 3.6 a ValueError is raised: ValueError: No axis named 1 for object type <class 'pandas.core.series.Series'>. In Python 2.7 the above works as expected.

Expected Output

The same as in Python 2.7, namely the dataframe:

     a         b     
     c    d    c    d
0  NaN  2.0  2.0  2.0
1  2.0  2.0  2.0  2.0

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.4.final.0
python-bits: 32
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: 0.10.0
IPython: 6.2.1
sphinx: None
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2018.3
blosc: None
bottleneck: 1.2.1
tables: None
numexpr: None
feather: None
matplotlib: 2.1.2
openpyxl: 2.5.0
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: None
lxml: 4.1.1
bs4: 4.6.0
html5lib: None
sqlalchemy: 1.2.2
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None

Comment From: shangyian

I was able to reproduce this discrepancy between Python 2.7 and 3.6 as well.

I tried digging into the code. The main difference I noticed is that _python_agg_general calls _iterate_slices, which yields only a single slice in Python 3.6 but two in 2.7: https://github.com/pandas-dev/pandas/blob/15bbb28865fa1a259e4d61b85604dea1c5f2f249/pandas/core/groupby.py#L1043-L1049 Somehow self._selection isn't being populated correctly.

Comment From: TomAugspurger

Thanks for looking @shangyian. Are you interested in digging further? Do you need any pointers on where to look next?

Comment From: shangyian

Sure, I can try to dig further.

My main issue right now is that I don't quite understand where _selection in _iterate_slices() is being populated. Any help with pinning that down would be much appreciated.

Comment From: TomAugspurger

I'm not sure if this is the root cause, but look at the call at https://github.com/pandas-dev/pandas/blob/6ef4be3f8f269f147b5abedecf7da6f19af305d3/pandas/core/groupby.py#L1051

For Py2 that throws a TypeError, which is caught:

(Pdb) self.grouper.agg_series(obj, f)
*** TypeError: unbound method sum() must be called with DataFrame instance as first argument (got Series instance instead)

For Py3, that throws a ValueError, which is not caught:

ipdb> self.grouper.agg_series(obj, f)
*** ValueError: No axis named 1 for object type <class 'pandas.core.series.Series'>

I'm not sure what the correct behavior is here, but maybe that helps :)

Comment From: shangyian

Ah, there it is - thanks! 👍

If we just have that function catch both TypeError and ValueError, it'll produce the same result as Py2. The crux of the issue seems to be that it first tries series-level aggregation, which doesn't work here, and it's supposed to move on to the _python_apply_general method after catching the error.

I'll add a PR for this.
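
The fallback described in this comment can be sketched in plain Python. This is not the pandas source: `aggregate`, `series_level_sum` and `general_apply` are hypothetical stand-ins for the fast series-level path and the general apply path, and the `catch` parameter exists only to contrast the narrow and wide exception handling.

```python
def aggregate(obj, fast_path, fallback, catch=(TypeError, ValueError)):
    """Try fast_path first; if it raises one of `catch`, use fallback."""
    try:
        return fast_path(obj)
    except catch:
        return fallback(obj)


def series_level_sum(obj):
    # Stand-in for the series-level aggregation that fails on Python 3.
    raise ValueError("No axis named 1 for object type Series")


def general_apply(obj):
    # Stand-in for the slower, general code path.
    return sum(obj)


# Catching both exception types reaches the fallback.
result = aggregate([1.0, 1.0], series_level_sum, general_apply)

# Catching only TypeError lets the ValueError propagate, as reported above.
try:
    aggregate([1.0, 1.0], series_level_sum, general_apply, catch=(TypeError,))
    narrow_failed = False
except ValueError:
    narrow_failed = True
```

Widening the `except` clause is the smallest change that restores the Py2 behaviour, though it may also hide unrelated ValueErrors raised inside the fast path.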

Comment From: jbrockmendel

The level keyword is no longer present in Series/DataFrame reductions. Closing.
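
For readers on a pandas version without the level keyword, one possible way to get the expected output from the original report is to group the transposed frame by the first two column levels. This is a sketch, not an officially recommended replacement; it assumes a lambda passed to `agg` is acceptable performance-wise.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones([2, 8]))
df.iloc[0, 0] = np.nan
df.columns = pd.MultiIndex.from_product([list('ab'), list('cd'), list('ef')])

# Transpose so the column levels become index levels, group on the first
# two of them, sum each group without skipping NaN, then transpose back.
result = (
    df.T
    .groupby(level=[0, 1])
    .agg(lambda s: s.sum(skipna=False))
    .T
)
print(result)
```

The lambda calls `Series.sum(skipna=False)` once per group and per original row, so any group containing a NaN yields NaN instead of a partial sum.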
