Hello there,

Have I said that Pandas is awesome? yes, many times ;-)

I have a question. I am working with a very large dataframe of trades, timestamped at millisecond precision. I am on the latest pandas, 0.19.2.

I need to resample the dataframe every 200 ms, but given that my data spans several years and I am only interested in resampling data between 10:00 am and 12:00 am every day (handled by between_time()), using a plain resample will crash and burn my machine.

Instead, I tried the sparse resampling shown in the docs at http://pandas.pydata.org/pandas-docs/stable/timeseries.html#sparse-resampling, but it fails when I pass it a dictionary mapping columns to functions.

Is that expected? Is it a bug?

import pandas as pd
import numpy as np

rng = pd.date_range('2014-1-1', periods=100, freq='D') + pd.Timedelta('1s')
ts = pd.DataFrame({'value' : range(100)}, index=rng)


from functools import partial
from pandas.tseries.frequencies import to_offset

def round(t, freq):
    freq = to_offset(freq)
    return pd.Timestamp((t.value // freq.delta.value) * freq.delta.value)

# works
ts.groupby(partial(round, freq='3T')).value.sum()

# does not work
ts.groupby(partial(round, freq='3T')).apply({'value' : 'sum'})

ts.groupby(partial(round, freq='3T')).apply({'value' : 'sum'})
Traceback (most recent call last):

  File "<ipython-input-104-6004b307a469>", line 1, in <module>
    ts.groupby(partial(round, freq='3T')).apply({'value' : 'sum'})

  File "C:\Users\m1hxb02\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\groupby.py", line 674, in apply
    func = self._is_builtin_func(func)

  File "C:\Users\m1hxb02\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\base.py", line 644, in _is_builtin_func
    return self._builtin_table.get(arg, arg)

TypeError: unhashable type: 'dict'

The problem is that I need to resample several columns of my dataframe at once, possibly with a different function per column (sum, mean, max). Is anything wrong here?

Thanks~

Comment From: chris-b1

You want to use .agg here, e.g.:

ts.groupby(partial(round, freq='3T')).agg({'value' : ['sum', 'mean']})
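For the multi-column case in the question, a minimal sketch might look like the following. The column names `price` and `volume` and the sample ticks are made up for illustration, and the grouping key here uses `DatetimeIndex.floor` instead of the `round` helper above, so treat it as one possible arrangement rather than the documented recipe.

```python
import numpy as np
import pandas as pd

# Hypothetical trades: two days, 10 ticks per day starting at 10:00, 50 ms apart.
idx = pd.DatetimeIndex(
    [
        pd.Timestamp(day) + pd.Timedelta(hours=10) + pd.Timedelta(milliseconds=50 * i)
        for day in ("2014-01-01", "2014-01-02")
        for i in range(10)
    ]
)
trades = pd.DataFrame(
    {"price": np.arange(20, dtype=float), "volume": np.ones(20, dtype=int)},
    index=idx,
)

# Restrict to the daily window first, then group on the floored timestamps.
# Only 200 ms bins that actually contain rows appear in the result.
window = trades.between_time("10:00", "12:00")
out = window.groupby(window.index.floor("200ms")).agg(
    {"price": ["mean", "max"], "volume": "sum"}
)
print(out)
```

Because the groups come from the timestamps that exist, the nights and the gaps between days never get materialized as empty bins, which is the point of the sparse approach.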

To re-purpose this issue - not sure when, but DatetimeIndex now has a vectorized round method which will be significantly faster - doc example should be updated.

In [149]: %timeit ts.groupby(partial(round, freq='3T')).agg({'value' : 'sum'})
100 loops, best of 3: 6.56 ms per loop

In [150]: %timeit ts.groupby(ts.index.round('3T')).agg({'value' : 'sum'})
1000 loops, best of 3: 1.83 ms per loop
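One caveat when swapping in the vectorized method: the `round` helper in the question truncates with `//`, so its closest vectorized counterpart is `DatetimeIndex.floor`, while `DatetimeIndex.round` snaps to the nearest multiple. For the sample data (one second past midnight) the two agree, but they can differ for timestamps in the second half of a bin. A small check, assuming `'3min'` is an acceptable spelling of `'3T'` and using a hypothetical `floor_ts` helper that mirrors the arithmetic of the original function:

```python
import pandas as pd

step = pd.Timedelta("3min").value  # bin width in nanoseconds

def floor_ts(t):
    # Same arithmetic as the helper in the question: integer-divide, then scale back.
    return pd.Timestamp((t.value // step) * step)

idx = pd.DatetimeIndex(["2014-01-01 00:00:01", "2014-01-01 00:02:00"])

manual = pd.DatetimeIndex([floor_ts(t) for t in idx])
floored = idx.floor("3min")   # truncates, like the helper
rounded = idx.round("3min")   # nearest multiple, so 00:02:00 moves forward

print(list(manual) == list(floored))
print(rounded)
```

If the bins are meant to start at each multiple of the frequency, `floor` looks like the safer replacement for the key.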

Comment From: randomgambit

@chris-b1 thanks! but the syntax for the regular resample is with apply right?

ts.resample('5Min').apply({'value' : 'sum'})

seems to work correctly

Comment From: chris-b1

To be honest, I had no idea that worked. I think .agg would also be the idiomatic way with resample. @jreback ?

Comment From: randomgambit

@chris-b1 summoning the great master @jreback. In my experience, pandas is smart enough (most of the time) to guess what apply is doing, that is, whether it is an agg or a transform. But Jeff knows better here.

Comment From: jreback

this will be handled in #14668

.apply does not accept a dictionary, see #14464

Comment From: randomgambit

@chris-b1 @jreback nice. it DOES appear to work, though, in the case of resample

ts.resample('5Min').apply({'value' : 'sum'}) gives the same output as ts.resample('5Min').agg({'value' : 'sum'})