When trying to resample transactions data where there are infrequent transactions for a large number of people, I get horrible performance. Compare the function annualize with the clunkier but faster annualize2 below. Ideally resample should be able to handle multiindex data and resample on 1 of the dimensions without the need to resort to groupby.
Code Sample, a copy-pastable example if possible
import pandas
import numpy as np
def annualize(df):
def resample(df):
df = df.set_index('DATE')
df = df.resample('A').ffill()
df = df.reset_index()
return df
df = df.groupby('ID', as_index = False).apply(resample)
df['YEAR'] = df['DATE'].dt.year
df = df.reset_index(drop = True)
return df
def annualize2(df):
#DF must be sorted by ID then DATE
df = df.sort_values(['ID','DATE']).reset_index(drop = True)
#Extract the year from the date and keep last observation from each year
df['YEAR'] = df['DATE'].dt.year
df = df.drop_duplicates(subset = ['ID','YEAR'], keep = 'last')
#Get the YEAR and ID of the next record down
ndf = df[['ID', 'YEAR']].shift(-1)
#Want to duplicate records where ID's between record pairs match and spacing is more than year
match = (ndf['ID'] == df['ID'])
reps = np.array(np.maximum(match*(ndf['YEAR'] - df['YEAR']),1), dtype = int)
reps[-1] = 1 # the last value will by -inf
#Do the duplication
df = df.loc[np.repeat(df.index, reps)]
#Now we need to fix the year field
cs = np.ones(len(df))
cs[np.cumsum(reps[:-1])] -= reps[:-1]
df['YEAR'] = df['YEAR'] + np.cumsum(cs)-1
df = df.reset_index(drop = True)
return df
if __name__ == '__main__':
import StringIO
import timeit
#Generate some transaction record data
data='''ID,DATE,STATUS
1,3/1/2005,A
1,4/30/2005,B
1,6/30/2010,C
1,8/20/2010,D
'''
df = pandas.read_csv(StringIO.StringIO(data))
df['DATE'] = pandas.to_datetime(df['DATE'])
print(annualize(df))
print(annualize2(df))
#then repeat the rows of the table to get a more realistic sample size for measuring performance
df = df.loc[np.tile(df.index.values, 10000)]
df['ID'] = np.repeat(np.arange(10000), 4)
print('annualize')
print(timeit.timeit('annualize(df)', setup = 'from __main__ import annualize, df', number = 1))
print('annualize2')
print(timeit.timeit('annualize2(df)', setup = 'from __main__ import annualize2, df', number = 1))
Expected Output
DATE ID STATUS YEAR
0 2005-12-31 1 B 2005
1 2006-12-31 1 B 2006
2 2007-12-31 1 B 2007
3 2008-12-31 1 B 2008
4 2009-12-31 1 B 2009
5 2010-12-31 1 D 2010
ID DATE STATUS YEAR
0 1 2005-04-30 B 2005.0
1 1 2005-04-30 B 2006.0
2 1 2005-04-30 B 2007.0
3 1 2005-04-30 B 2008.0
4 1 2005-04-30 B 2009.0
5 1 2010-08-20 D 2010.0
annualize
26.4788940619
annualize2
0.0899452309903
output of pd.show_versions()
INSTALLED VERSIONS
commit: None python: 2.7.11.final.0 python-bits: 64 OS: Windows OS-release: 7 machine: AMD64 processor: Intel64 Family 6 Model 62 Stepping 4, GenuineIntel byteorder: little LC_ALL: None LANG: None
pandas: 0.18.1 nose: 1.3.7 pip: 8.1.2 setuptools: 20.3 Cython: 0.23.4 numpy: 1.11.0 scipy: 0.17.0 statsmodels: 0.6.1 xarray: None IPython: 4.1.2 sphinx: 1.3.5 patsy: 0.4.0 dateutil: 2.5.3 pytz: 2016.4 blosc: None bottleneck: 1.0.0 tables: 3.2.2 numexpr: 2.5 matplotlib: 1.5.1 openpyxl: 2.3.2 xlrd: 0.9.4 xlwt: 1.0.0 xlsxwriter: 0.8.4 lxml: 3.6.0 bs4: 4.4.1 html5lib: None httplib2: None apiclient: None sqlalchemy: 1.0.12 pymysql: None psycopg2: None jinja2: 2.8 boto: 2.39.0 pandas_datareader: None
Comment From: jorisvandenbossche
I didn't look into detail to your example, but note that the two functions do return a different result, so do not seem to be equivalent.
Comment From: jreback
duplicate of https://github.com/pydata/pandas/issues/11296
pull-requests are welcome!
Comment From: jreback
The actual grouping / resample are not bounding this here at all, rather it is the fact that .fillna
(and cousins) are not in cython, meaning this is basically a python loop.
Comment From: spillz
Yeah, to be clear I wasn't proposing the other code as a replacement, just to show the performance issue (it gets worse on even bigger data sets). I didn't think it was worth the extra fiddling to get the column ordering and dtypes to be the same. The general application is converting transaction records into an unbalanced panel data set, which I can then run standard regression methods on.
I had assumed that it was mostly the groupby that was causing the slowness. Is that wrong?
Based on the assumption I thought that one solution for my use case would be if resample could support pad operations for non-unique indexes (or more complete support for resampling multiindexes, which is essentially what my DATE and ID pairs represent). Currently if I do:
import pandas
import numpy as np
if __name__ == '__main__':
import StringIO
import timeit
#Generate some transaction record data
data='''ID,DATE,STATUS
1,3/1/2005,A
1,4/30/2005,B
1,6/30/2010,C
1,8/20/2010,D
'''
df = pandas.read_csv(StringIO.StringIO(data))
df['DATE'] = pandas.to_datetime(df['DATE'])
#then repeat the rows of the table to get a more realistic sample size for measuring performance
df = df.loc[np.tile(df.index.values, 10000)]
df['ID'] = np.repeat(np.arange(10000), 4)
df = df.set_index('DATE')
df = df.resample('A').ffill()
print(df)
I get:
Traceback (most recent call last):
File "C:\Users\#####\Source\PANDAS~2.PY", line 56, in <module>
df = df.resample('A').ffill()
File "c:\anaconda2\lib\site-packages\pandas\tseries\resample.py", line 424, in pad
return self._upsample('pad', limit=limit)
File "c:\anaconda2\lib\site-packages\pandas\tseries\resample.py", line 704, in _upsample
limit=limit)
File "c:\anaconda2\lib\site-packages\pandas\core\frame.py", line 2744, in reindex
**kwargs)
File "c:\anaconda2\lib\site-packages\pandas\core\generic.py", line 2229, in reindex
fill_value, copy).__finalize__(self)
File "c:\anaconda2\lib\site-packages\pandas\core\frame.py", line 2690, in _reindex_axes
fill_value, limit, tolerance)
File "c:\anaconda2\lib\site-packages\pandas\core\frame.py", line 2698, in _reindex_index
tolerance=tolerance)
File "c:\anaconda2\lib\site-packages\pandas\indexes\base.py", line 2342, in reindex
raise ValueError("cannot reindex a non-unique index "
ValueError: cannot reindex a non-unique index with a method or limit
Whereas it would be nice if it was able to handle non-unique indexes by treating the non-ascending dates as boundaries and return something like:
ID STATUS YEAR
2005 1 B 2005
2006 1 B 2006
2007 1 B 2007
2008 1 B 2008
2009 1 B 2009
2010 1 D 2010
2005 2 B 2005
2006 2 B 2006
2007 2 B 2007
2008 2 B 2008
2009 2 B 2009
2010 2 D 2010
2005 3 B 2005
2006 3 B 2006
2007 3 B 2007
2008 3 B 2008
2009 3 B 2009
2010 3 D 2010
...
But I guess that's moot if the problem is that reindex filling relies on looping.
As for pull requests, I found the code that does the reindexing / resampling to be pretty opaque so would need a few pointers to get started.
Comment From: eromoe
Number of row : 960,000 cost half hour , still hanging . What's wrong with groupby resampling?