While writing a piece of code similar to the example below, I stumbled on a problematic interaction between groupby, diff and merge.
My debugging efforts showed that this problem is likely related to the "fast_apply" optimisation Pandas uses when using apply().
As shown below, when using string values as grouping keys, the first value encountered overrides all the others. Removing either the diff call, the merge, or replacing "a" and "b" by numerical values all solve the issue, and give a correct (albeit obviously incomplete for my purpose) result.
Code
import pandas as pd
import numpy as np
bar = np.random.randint(0, 10, 4)
foo = ['a'] * 2 + ['b'] * 2
values = zip(foo, bar)
df = pd.DataFrame(values, columns=['foo', 'bar'])
print df
def try_diff_merge(df):
df['inf_bar'] = df.bar < 6
df['inf_bar_diff'] = df['inf_bar'].diff()
df_merge = df.groupby(['inf_bar']).count()
df_merge.drop(['inf_bar_diff'], axis=1, inplace=True)
df_merge = pd.merge(df, df_merge, how='left', right_index=True,
left_on=['inf_bar'])
return df_merge
df_group = df.groupby(['foo']).apply(try_diff_merge)
print df_group
Output
foo bar
0 a 2
1 a 0
2 b 6
3 b 8
foo_x bar_x inf_bar inf_bar_diff foo_y bar_y
0 a 2 True NaN 2 2
1 a 0 True False 2 2
2 a 6 False NaN 2 2
3 a 8 False False 2 2
Expected Output
foo bar
0 a 2
1 a 0
2 b 6
3 b 8
foo_x bar_x inf_bar inf_bar_diff foo_y bar_y
0 a 2 True NaN 2 2
1 a 0 True False 2 2
2 b 6 False NaN 2 2
3 b 8 False False 2 2
output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 61 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
pandas: 0.18.0
nose: 1.3.7
pip: 8.1.1
setuptools: 20.3
Cython: 0.23.4
numpy: 1.10.4
scipy: 0.17.0
statsmodels: 0.6.1
xarray: None
IPython: 4.1.2
sphinx: 1.3.5
patsy: 0.4.0
dateutil: 2.5.1
pytz: 2016.2
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.5
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: 2.39.0
None
Comment From: jreback
you are doing WAY WAY too much in a groupby. This is completely inefficient.
I get output like this which seems ok to me. (your expected and actual are the same)
In [43]: df_group
Out[43]:
foo_x bar_x inf_bar inf_bar_diff foo_y bar_y
0 a 9 False NaN 1 1
1 a 3 True True 1 1
2 a 9 False NaN 2 2
3 a 9 False False 2 2
Something like this is way more readable and way more efficient.
In [44]: df = pd.DataFrame(values, columns=['foo', 'bar'])
In [45]: df['inf_bar'] = df['bar']<6
In [46]: df['inf_bar_diff'] = df.groupby('foo').inf_bar.diff()
In [50]: df['count'] = df.groupby('foo').inf_bar_diff.transform('count')
In [51]: df
Out[51]:
foo bar inf_bar inf_bar_diff count
0 a 9 False NaN 1
1 a 3 True True 1
2 b 9 False NaN 1
3 b 9 False False 1
Comment From: rmoutie
Thx for the answer but it does not really solve my issue. My expected and actual are different, columns foo_x are not the same ! My example is indeed very inefficient but it is a much more simplified version of my initial code. To finish with, your more readable and more efficient way is different from my expected output. Can you reopen the issue and have a deeper look ? Thx !
Comment From: jreback
@rmoutie This is not a bug. You are better off asking on SO.
Comment From: rmoutie
@jreback I don't agree. This simple example (even if not very efficient) shows a wrong behaviour of the apply function. The column foo_x should look like the expected one, and this is not the case ! The results of the method 'apply_frame_axis0' from the pandas.lib are not recombined accordingly to the argument: names, as already said, the first value of the names overrides all the other... If you don't take time to look deeper, who is going to do it on SO ?
Comment From: jreback
@rmoutie your example is way too complicated and doesn't even reproduce, so not sure what to tell you. You need a clear convincing example to show its a bug.