(1)

```python
def groupby_func(x):
    # compute the speedup relative to the first row of each group
    return x.iloc[0, :] / x.iloc[0:, :]  # (with '0')

sdf_size = odf.loc[:, cols].groupby(by="size").apply(groupby_func)
```
(2)

```python
def groupby_func(x):
    # compute the speedup relative to the first row of each group
    return x.iloc[0, :] / x.iloc[:, :]  # (without '0')

sdf_size = odf.loc[:, cols].groupby(by="size").apply(groupby_func)
```
**Problem description**
For versions (1) and (2) I get two different outputs when I print `sdf_size` on the console:
(1)
(2)
Somehow, with `0:` the printed result is what I wanted (see screenshots; the grouping by size appears in the index). But after deleting the `0`, which I expected to be unnecessary, I got a different result, which surprised me (I think `0:` and `:` should be the same -- correct me if I am wrong on this). Explicitly setting `group_keys=True` didn't change anything.
Just ask if something is unclear.
Thank you.
**Output of pd.show_versions()**
**Comment From: TomAugspurger**
Reproducible example:
```python
In [3]: df = pd.DataFrame(np.random.randn(10, 2), columns=['a', 'b'])

In [4]: df['c'] = np.random.choice([0, 1], size=len(df))

In [5]: df
Out[5]:
          a         b  c
0  1.993989 -0.443380  0
1  0.451656 -1.374338  1
2 -0.341937  0.095889  1
3  0.831831 -0.119458  0
4  0.506889 -0.405047  0
5 -1.802596 -0.409155  0
6 -0.085620 -2.005494  1
7 -0.230276 -0.709994  1
8 -0.337890  1.010063  1
9 -0.900651  0.611446  0

In [6]: df.groupby('c').apply(lambda x: x.iloc[0, :] / x.iloc[:, :])
Out[6]:
          a          b    c
0  1.000000   1.000000  NaN
1  1.000000   1.000000  1.0
2 -1.320873 -14.332663  1.0
3  2.397109   3.711587  NaN
4  3.933775   1.094639  NaN
5 -1.106177   1.083648  NaN
6 -5.275115   0.685287  1.0
7 -1.961371   1.935705  1.0
8 -1.336693  -1.360647  1.0
9 -2.213942  -0.725134  NaN

In [7]: df.groupby('c').apply(lambda x: x.iloc[0, :] / x.iloc[0:, :])
Out[7]:
            a          b    c
c
0 0  1.000000   1.000000  NaN
  3  2.397109   3.711587  NaN
  4  3.933775   1.094639  NaN
  5 -1.106177   1.083648  NaN
  9 -2.213942  -0.725134  NaN
1 1  1.000000   1.000000  1.0
  2 -1.320873 -14.332663  1.0
  6 -5.275115   0.685287  1.0
  7 -1.961371   1.935705  1.0
  8 -1.336693  -1.360647  1.0
```
@elDan101 does changing your code to `sdf_size = odf.loc[:, cols].groupby(by="size").transform(groupby_func)` solve your issue (using `transform` instead of `apply`)? `.apply` does quite a bit of inference to make things come out "right", but this is a clear case of a transform. I don't know for sure, but I'm guessing there's some kind of identity check in `.apply`, and `df.iloc[0:, :]` is not `df`.
**Comment From: elDan101**
I copy-pasted your suggestion, but an exception was thrown that I wasn't able to quickly solve.
**Comment From: TomAugspurger**
`.transform` works column by column, so change your indexers from `x.iloc[0, :]` to just `x.iloc[0]`.
**Comment From: jreback**
This is the idiomatic way to transform by groups:

```python
In [18]: df.groupby('c').transform('first') / df[['a', 'b']]
Out[18]:
           a          b
0   1.000000   1.000000
1  -0.680665  -0.774614
2   1.000000   1.000000
3  -0.398770   0.821029
4   1.708528  14.396555
5   0.857692  -2.335223
6  20.181797   1.131629
7  -0.794080  -0.447658
8   3.881273  -1.540272
9   7.046538 -18.276593
```
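A small runnable version of the same idiom, on made-up data (numbers chosen so the ratios come out exact):

```python
import pandas as pd

df = pd.DataFrame({"a": [2.0, 4.0, 3.0, 6.0], "c": [0, 0, 1, 1]})

# transform('first') broadcasts each group's first row over the whole
# group, so the division yields the ratio relative to that first row
ratio = df.groupby("c").transform("first") / df[["a"]]
print(ratio["a"].tolist())  # [1.0, 0.5, 1.0, 0.5]
```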
**Comment From: elDan101**
@TomAugspurger
When I use `transform`, the output is the same in both cases and corresponds to output (2) (screenshot in the opening post). Actually, I am expecting an output like screenshot (1), because that is what is returned when I use built-in groupby functions (like `'mean'`).
I was not aware of the `transform` function; there is also no documentation for it:
[http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.transform.html?highlight=transform#pandas.core.groupby.GroupBy.transform](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.transform.html?highlight=transform#pandas.core.groupby.GroupBy.transform)
**Comment From: TomAugspurger**
`transform` is documented in the prose section: http://pandas.pydata.org/pandas-docs/stable/groupby.html#transformation, though we could add it to the API docs if desired.
Aggregations like `'mean'` set the index to the group keys. If your actual function is an aggregation, use `.agg`; if you're doing a 1:1 transformation, use `.transform`.
Closing for now, but ask if you have questions. Please include an actual copy-pasteable example though (no screenshots); it's hard to help with your actual problem when I can't run the code.
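The agg/transform distinction can be seen on a small made-up example: `.agg` indexes the result by the group keys, while `.transform` keeps the original row index:

```python
import pandas as pd

df = pd.DataFrame({"size": [1, 1, 2, 2], "t": [10.0, 20.0, 30.0, 90.0]})

agg_out = df.groupby("size")["t"].agg("mean")        # one row per group
tra_out = df.groupby("size")["t"].transform("mean")  # one row per input row
print(list(agg_out.index))  # [1, 2]
print(list(tra_out.index))  # [0, 1, 2, 3]
```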
**Comment From: elDan101**
Thank you. Actually, one question: I am still wondering why `0:` and `:` make a difference.
Is there an intuition that helps me see why they behave differently?
> transform is documented in the prose section: http://pandas.pydata.org/pandas-docs/stable/groupby.html#transformation, though we could add it to the API docs if desired.
Maybe a "see also" link to the page you provided is enough. Thanks for this; I was new to grouping with pandas, and it will surely come in handy again in the future.
**Comment From: TomAugspurger**
> I am still wondering why "0:" and ":" makes a difference.
> Is there an intuition that helps me to see why this is different?
I don't know off the top of my head. My guess is that it has to do with the inference `df.apply` does to try to guess the output shape, and the fact that `df.iloc[0:]` returns a different object than `df`:

```python
In [5]: df = pd.DataFrame(np.random.randn(10, 2), columns=['a', 'b'])

In [6]: df.iloc[:] is df
Out[6]: True

In [7]: df.iloc[0:] is df
Out[7]: False
```

Feel free to dig through the source code in `pandas.core.groupby` if you're interested :)
**Comment From: jorisvandenbossche**
It is true that the identity is not the same, but both are equal:

```python
In [53]: df.iloc[0:].equals(df.iloc[:])
Out[53]: True
```

so I think this can be considered a bug that the two do not behave the same in `groupby.apply`.
**Comment From: jreback**
Equality is not enough; it has to be identical, because this is a view. The reason is that we are testing for mutation, and there is no good way to do that otherwise.
**Comment From: jorisvandenbossche**
> equality is not enough, it has to be identical

But the original issue was not about identity, I think, as the identical/equal dataframe appeared only in the denominator: `x.iloc[0, :] / x.iloc[:, :]` vs. `x.iloc[0, :] / x.iloc[0:, :]`.