Pandas Improve poor performance of groupby resampling

When trying to resample transactions data where there are infrequent transactions for a large number of people, I get horrible performance. Compare the function annualize with the clunkier but faster annualize2 below. Ideally resample should be able to handle multiindex data and resample on 1 of the dimensions without the need to resort to groupby.

Code Sample, a copy-pastable example if possible

import pandas
import numpy as np

def annualize(df):
    def resample(df):
        df = df.set_index('DATE')
        df = df.resample('A').ffill()
        df = df.reset_index()
        return df
    df = df.groupby('ID', as_index = False).apply(resample)
    df['YEAR'] = df['DATE'].dt.year
    df = df.reset_index(drop = True)
    return df

def annualize2(df):
    #DF must be sorted by ID then DATE
    df = df.sort_values(['ID','DATE']).reset_index(drop = True)
    #Extract the year from the date and keep last observation from each year
    df['YEAR'] = df['DATE'].dt.year
    df = df.drop_duplicates(subset = ['ID','YEAR'], keep = 'last')
    #Get the YEAR and ID of the next record down
    ndf = df[['ID', 'YEAR']].shift(-1)
    #Want to duplicate records where ID's between record pairs match and spacing is more than year
    match = (ndf['ID'] == df['ID'])
    reps = np.array(np.maximum(match*(ndf['YEAR'] - df['YEAR']),1), dtype = int)
    reps[-1] = 1 # the last value will by -inf
    #Do the duplication
    df = df.loc[np.repeat(df.index, reps)]
    #Now we need to fix the year field
    cs = np.ones(len(df))
    cs[np.cumsum(reps[:-1])] -= reps[:-1]
    df['YEAR'] = df['YEAR'] + np.cumsum(cs)-1
    df = df.reset_index(drop = True)
    return df

if __name__ == '__main__':
    import StringIO
    import timeit
    #Generate some transaction record data
    data='''ID,DATE,STATUS
    1,3/1/2005,A
    1,4/30/2005,B
    1,6/30/2010,C
    1,8/20/2010,D
    '''
    df = pandas.read_csv(StringIO.StringIO(data))
    df['DATE'] = pandas.to_datetime(df['DATE'])
    print(annualize(df))
    print(annualize2(df))

    #then repeat the rows of the table to get a more realistic sample size for measuring performance
    df = df.loc[np.tile(df.index.values, 10000)]
    df['ID'] = np.repeat(np.arange(10000), 4)

    print('annualize')
    print(timeit.timeit('annualize(df)', setup = 'from __main__ import annualize, df', number = 1))
    print('annualize2')
    print(timeit.timeit('annualize2(df)', setup = 'from __main__ import annualize2, df', number = 1))

Expected Output

        DATE  ID STATUS  YEAR
0 2005-12-31   1      B  2005
1 2006-12-31   1      B  2006
2 2007-12-31   1      B  2007
3 2008-12-31   1      B  2008
4 2009-12-31   1      B  2009
5 2010-12-31   1      D  2010
   ID       DATE STATUS    YEAR
0   1 2005-04-30      B  2005.0
1   1 2005-04-30      B  2006.0
2   1 2005-04-30      B  2007.0
3   1 2005-04-30      B  2008.0
4   1 2005-04-30      B  2009.0
5   1 2010-08-20      D  2010.0
annualize
26.4788940619
annualize2
0.0899452309903

output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None python: 2.7.11.final.0 python-bits: 64 OS: Windows OS-release: 7 machine: AMD64 processor: Intel64 Family 6 Model 62 Stepping 4, GenuineIntel byteorder: little LC_ALL: None LANG: None

pandas: 0.18.1 nose: 1.3.7 pip: 8.1.2 setuptools: 20.3 Cython: 0.23.4 numpy: 1.11.0 scipy: 0.17.0 statsmodels: 0.6.1 xarray: None IPython: 4.1.2 sphinx: 1.3.5 patsy: 0.4.0 dateutil: 2.5.3 pytz: 2016.4 blosc: None bottleneck: 1.0.0 tables: 3.2.2 numexpr: 2.5 matplotlib: 1.5.1 openpyxl: 2.3.2 xlrd: 0.9.4 xlwt: 1.0.0 xlsxwriter: 0.8.4 lxml: 3.6.0 bs4: 4.4.1 html5lib: None httplib2: None apiclient: None sqlalchemy: 1.0.12 pymysql: None psycopg2: None jinja2: 2.8 boto: 2.39.0 pandas_datareader: None

Comment From: jorisvandenbossche

I didn't look into detail to your example, but note that the two functions do return a different result, so do not seem to be equivalent.

Comment From: jreback

duplicate of https://github.com/pydata/pandas/issues/11296

pull-requests are welcome!

Comment From: jreback

The actual grouping / resample are not bounding this here at all, rather it is the fact that .fillna (and cousins) are not in cython, meaning this is basically a python loop.

Comment From: spillz

Yeah, to be clear I wasn't proposing the other code as a replacement, just to show the performance issue (it gets worse on even bigger data sets). I didn't think it was worth the extra fiddling to get the column ordering and dtypes to be the same. The general application is converting transaction records into an unbalanced panel data set, which I can then run standard regression methods on.

I had assumed that it was mostly the groupby that was causing the slowness. Is that wrong?

Based on the assumption I thought that one solution for my use case would be if resample could support pad operations for non-unique indexes (or more complete support for resampling multiindexes, which is essentially what my DATE and ID pairs represent). Currently if I do:

import pandas
import numpy as np

if __name__ == '__main__':
    import StringIO
    import timeit
    #Generate some transaction record data
    data='''ID,DATE,STATUS
    1,3/1/2005,A
    1,4/30/2005,B
    1,6/30/2010,C
    1,8/20/2010,D
    '''
    df = pandas.read_csv(StringIO.StringIO(data))
    df['DATE'] = pandas.to_datetime(df['DATE'])

    #then repeat the rows of the table to get a more realistic sample size for measuring performance
    df = df.loc[np.tile(df.index.values, 10000)]
    df['ID'] = np.repeat(np.arange(10000), 4)

    df = df.set_index('DATE')
    df = df.resample('A').ffill()
    print(df)

I get:

Traceback (most recent call last):
  File "C:\Users\#####\Source\PANDAS~2.PY", line 56, in <module>
    df = df.resample('A').ffill()
  File "c:\anaconda2\lib\site-packages\pandas\tseries\resample.py", line 424, in pad
    return self._upsample('pad', limit=limit)
  File "c:\anaconda2\lib\site-packages\pandas\tseries\resample.py", line 704, in _upsample
    limit=limit)
  File "c:\anaconda2\lib\site-packages\pandas\core\frame.py", line 2744, in reindex
    **kwargs)
  File "c:\anaconda2\lib\site-packages\pandas\core\generic.py", line 2229, in reindex
    fill_value, copy).__finalize__(self)
  File "c:\anaconda2\lib\site-packages\pandas\core\frame.py", line 2690, in _reindex_axes
    fill_value, limit, tolerance)
  File "c:\anaconda2\lib\site-packages\pandas\core\frame.py", line 2698, in _reindex_index
    tolerance=tolerance)
  File "c:\anaconda2\lib\site-packages\pandas\indexes\base.py", line 2342, in reindex
    raise ValueError("cannot reindex a non-unique index "
ValueError: cannot reindex a non-unique index with a method or limit

Whereas it would be nice if it was able to handle non-unique indexes by treating the non-ascending dates as boundaries and return something like:

         ID STATUS  YEAR
2005   1      B  2005
2006   1      B  2006
2007   1      B  2007
2008   1      B  2008
2009   1      B  2009
2010   1      D  2010
2005   2      B  2005
2006   2      B  2006
2007   2      B  2007
2008   2      B  2008
2009   2      B  2009
2010   2      D  2010
2005   3      B  2005
2006   3      B  2006
2007   3      B  2007
2008   3      B  2008
2009   3      B  2009
2010   3      D  2010
...

But I guess that's moot if the problem is that reindex filling relies on looping.

As for pull requests, I found the code that does the reindexing / resampling to be pretty opaque so would need a few pointers to get started.

Comment From: eromoe

Number of row : 960,000 cost half hour , still hanging . What's wrong with groupby resampling?

Pandas Improve poor performance of groupby resampling

Code Sample, a copy-pastable example if possible

Expected Output

output of pd.show_versions()

INSTALLED VERSIONS

output of `pd.show_versions()`